Human Language Technology
Challenges For Computer Science and Linguistics
Springer
1 Introduction
How does one find the best Automatic Speech-to-Phoneme Alignment (ASPA) system?
It is easy to generate many hidden Markov models (HMMs) for a given corpus
of training data. However, even if the sequence of phonemes is considered to be
known, the graph of states within each phoneme is largely arbitrary and the acoustic
model for each state can be specified in many different ways. So there are many
possible ASPA systems for a given corpus, and thus many possible alignments.
To date, people have typically approached finding the best ASPA system by manually
creating a “gold standard” set of labels, then running several systems on the
labelled corpus and computing an error measure of the difference between the
system-generated and the human-generated labels (e.g. [3, 9]).
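As an illustration of what such an error measure can look like, the following minimal sketch compares system boundary times against hand-labelled ones. It is not the measure used in [3, 9]. The function names, the assumption that both label sets contain the same boundaries in the same order, and the 20 ms tolerance are all hypothetical choices made for this example.

```python
from typing import List, Sequence


def boundary_errors(reference: Sequence[float],
                    hypothesis: Sequence[float]) -> List[float]:
    """Absolute time difference (in seconds) between paired boundaries.

    Assumes (for this sketch only) that both label sets mark the same
    phoneme boundaries in the same order.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sets must contain the same number of boundaries")
    if not reference:
        raise ValueError("at least one boundary is required")
    return [abs(r - h) for r, h in zip(reference, hypothesis)]


def mean_absolute_error(reference: Sequence[float],
                        hypothesis: Sequence[float]) -> float:
    """Average boundary deviation in seconds."""
    errors = boundary_errors(reference, hypothesis)
    return sum(errors) / len(errors)


def fraction_within(reference: Sequence[float],
                    hypothesis: Sequence[float],
                    tolerance: float = 0.020) -> float:
    """Share of boundaries whose deviation is at most `tolerance` seconds.

    The 20 ms default is an illustrative assumption, not a value from the text.
    """
    errors = boundary_errors(reference, hypothesis)
    return sum(e <= tolerance for e in errors) / len(errors)


# Hypothetical boundary times in seconds: human labels vs. one ASPA system.
human = [0.10, 0.25, 0.40, 0.62]
system = [0.11, 0.25, 0.45, 0.61]
print(mean_absolute_error(human, system))
print(fraction_within(human, system))
```

The two functions can rank the same pair of systems differently: the mean is sensitive to a few large deviations, whereas the tolerance-based score ignores how far outside the tolerance a boundary falls.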
The definitions of both the error measure and the “gold standard” labels
themselves are critical to this process. Furthermore, neither of these can be defined
unambiguously in any mathematically optimal sense.
To define the “gold standard” labels, humans typically segment speech into
phonemes via a composite visual-auditory task, working from a spectrogram on a computer