|
The definition in the Macquarie Dictionary follows the Oxford Dictionary in giving the original (1873) usage of this term from Philology. The example, of course, presents no challenge to Text-To-Speech systems. In a TTS context, homograph usually refers to words with the same spelling but different pronunciation. So the original usage is restricted by excluding homographs with the same pronunciation, and extended by including words (with the same spelling but different pronunciation) which are related in meaning. These additional homographs typically have the stress on a different syllable depending on their part of speech in a sentence.
Homographs in the original sense are essentially accidents of English orthography, many dating back to Dr Johnson's time, when English spelling was finally standardised. The orthography is unsystematic, sometimes reflecting a word's origin and sometimes its pronunciation (which may no longer be current). Homographs whose pronunciation differs only in their stress pattern obviously arise from the lack of a notation for stress in the orthography.
A TTS system does not really need an explanation of how the current set of homographs came to be present in the English language. For TTS purposes, a more pragmatic classification of homographs is introduced along two axes, weak/strong and easy/hard:-
class | criterion |
---|---|
weak | pronunciations differ only in stress or vowel reduction; an unstressed or intermediate pronunciation would be acceptable |
strong | pronunciations differ in one or more phonemes; only the correct pronunciation would be acceptable |
easy | different pronunciations apply to different parts of speech |
hard | different pronunciations apply to the same parts of speech |
The weak/strong classification assumes that in some cases disambiguation of homographs may not be necessary to achieve an acceptable pronunciation. The easy/hard classification assumes that disambiguation of easy homographs could readily be achieved by pre-processing each sentence with a POS tagger or a syntax parser to obtain the part of speech of each word. Both of these assumptions are examined below.
Here is the list of 642 words labelled as homographs in the current TTS dictionary (not the one attached to Mu-Talk, but the one under development). Each entry in the list defines a headword, with sub-entries #1, #2, #3, ... indicating the pronunciation (\P), part of speech (\G), and root word (\R) for its homographs:-
The list can be displayed in more legible form as a table, with the sub-entries for each headword placed underneath one another:-
This makes it easier to compare their pronunciations and parts of speech. The pronunciations utilise the ANDOSL machine-readable phoneme symbols. Here is a key to the pronunciation of the symbols. The corresponding symbols used in the Festival speech synthesiser are shown as well:-
vowels | consonants |
Classification is straightforward when each sub-entry has a single pronunciation:-
Multiple pronunciations for a sub-entry indicate alternative acceptable pronunciations. The TTS dictionary appears to be based on the Macquarie Dictionary and to follow the definitions and pronunciations found in that work. In the Macquarie Dictionary, for headwords where more than one pronunciation is given, the first of these is the one more widely used. For speech synthesis purposes, the TTS dictionary does not need to store alternative pronunciations for a sub-entry. It seems obvious that the more widely used pronunciation should be selected by the synthesiser. The existence of alternatives, however, allows the classification to be finessed by the following strategy. In the examples, the selected pronunciation for each sub-entry is in bold type:-
These cases will be weak or strong, depending on the way the first pronunciations differ.
When the original list of 642 words is classified by applying these selection rules, it breaks down as follows:-
class | headwords |
---|---|
null | 56 |
weak/easy | 422 |
strong/easy | 43 |
weak/hard | 31 |
strong/hard | 90 |
Thus of the 586 remaining headwords with homographs, 79% are easy to disambiguate if the part of speech can be determined by the synthesiser, and 21% are not.
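As a quick sanity check, the counts and percentages above can be verified with a few lines of Python (the numbers are taken directly from the table):

```python
# Breakdown of the 642 homograph headwords, as tabulated above.
counts = {"null": 56, "weak/easy": 422, "strong/easy": 43,
          "weak/hard": 31, "strong/hard": 90}
assert sum(counts.values()) == 642

remaining = 642 - counts["null"]                      # 586 genuine homographs
easy = counts["weak/easy"] + counts["strong/easy"]    # 465
hard = counts["weak/hard"] + counts["strong/hard"]    # 121
print(round(100 * easy / remaining), round(100 * hard / remaining))  # 79 21
```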
These cases have been classed as easy on the assumption that the part of speech can be determined efficiently and accurately by the synthesiser. The SHLRC Corpus contains a version of the ACE corpus which has been tagged using Eric Brill's POS tagger. The tags appear to be the Penn Treebank tags. A spot check of the tagged ACE corpus for some examples from the strong/easy list showed that several occurrences had been misclassified:-
The on-line µ-TBL Brill tagger, however, correctly tagged abuse as a verb in B04 869 and E38 7630. All of the above except A16 3272 were tagged correctly by the on-line CLAWS POS tagger.
According to Jurafsky and Martin (p308), it should be possible for a POS tagger to achieve 96-97% accuracy.
For the weak cases, where pronunciations differ only in stress or vowel reduction, it was claimed that an unstressed or intermediate pronunciation would be acceptable. This was tested using the Festival speech synthesiser. The following script defines three pronunciations each of content, discount, and second with lex.add.entry. It then uses all three pronunciations to record utterances where one of the stressed pronunciations would be correct. For each entry added to the lexicon, the pronunciation of each syllable is specified using the symbols defined above. After each syllable, the value 1 or 0 indicates the presence or absence of stress on that syllable.
(lex.add.entry '("content00" n (((k o n) 0) ((t e n t) 0))))
(lex.add.entry '("content10" n (((k o n) 1) ((t e n t) 0))))
(lex.add.entry '("content01" n (((k @ n) 0) ((t e n t) 1))))
(lex.add.entry '("discount00" v (((d i s) 0) ((k au n t) 0))))
(lex.add.entry '("discount10" v (((d i s) 1) ((k au n t) 0))))
(lex.add.entry '("discount01" v (((d i s) 0) ((k au n t) 1))))
(lex.add.entry '("second00" v (((s e) 0) ((k o n d) 0))))
(lex.add.entry '("second10" v (((s e) 1) ((k @ n d) 0))))
(lex.add.entry '("second01" v (((s @) 0) ((k o n d) 1))))
(set! utt (Utterance Text "The document is lacking in content00"))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "content100.wav" "wav")
(set! utt (Utterance Text "The document is lacking in content10"))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "content110.wav" "wav")
(set! utt (Utterance Text "The document is lacking in content01"))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "content101.wav" "wav")
(set! utt (Utterance Text "I am in a state of content00."))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "content200.wav" "wav")
(set! utt (Utterance Text "I am in a state of content01."))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "content201.wav" "wav")
(set! utt (Utterance Text "I am in a state of content10."))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "content210.wav" "wav")
(set! utt (Utterance Text "Please discount00 my bill by 10%."))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "discount100.wav" "wav")
(set! utt (Utterance Text "Please discount10 my bill by 10%."))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "discount110.wav" "wav")
(set! utt (Utterance Text "Please discount01 my bill by 10%."))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "discount101.wav" "wav")
(set! utt (Utterance Text "We will discount00 your unconvincing story."))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "discount200.wav" "wav")
(set! utt (Utterance Text "We will discount01 your unconvincing story."))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "discount201.wav" "wav")
(set! utt (Utterance Text "We will discount10 your unconvincing story."))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "discount210.wav" "wav")
(set! utt (Utterance Text "I will second00 your nomination."))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "second100.wav" "wav")
(set! utt (Utterance Text "I will second10 your nomination."))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "second110.wav" "wav")
(set! utt (Utterance Text "I will second01 your nomination."))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "second101.wav" "wav")
(set! utt (Utterance Text "Please second00 that officer to the special unit."))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "second200.wav" "wav")
(set! utt (Utterance Text "Please second10 that officer to the special unit."))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "second210.wav" "wav")
(set! utt (Utterance Text "Please second01 that officer to the special unit."))
(utt.synth utt) (utt.play utt) (utt.save.wave utt "second201.wav" "wav")
(quit)
The utterances synthesised by Festival can be played from the following table. The correct pronunciation is shown after each utterance. Informal listening tests seem to confirm that the unstressed pronunciations are generally acceptable, whereas the pronunciations with incorrect stress are not.
Utterance | Correct pronunciation | [kOntEnt] | [k@n'tEnt] | ['kOntEnt] |
---|---|---|---|---|
The document is lacking in content. | ['kOntEnt] | | | |
I am in a state of content. | [k@n'tEnt] | | | |

Utterance | Correct pronunciation | [dIskaunt] | [dIs'kaunt] | ['dIskaunt] |
---|---|---|---|---|
Please discount my bill by 10%. | ['dIskaunt] | | | |
We will discount your unconvincing story. | [dIs'kaunt] | | | |

Utterance | Correct pronunciation | [sEkOnd] | ['sEk@nd] | [s@'kOnd] |
---|---|---|---|---|
I will second your nomination. | ['sEk@nd] | | | |
Please second that officer to the special unit. | [s@'kOnd] | | | |
Note that for some reason the stressed pronunciations synthesised by Festival for discount are virtually indistinguishable.
Many of these homographs are obscure words whose meaning and pronunciation would be unknown to the average speaker:-
A more representative subset can be found by checking which of these definitions appears in the Macquarie Concise Dictionary. This is a distillation of the Macquarie Dictionary to meet the needs of the general reader, with about 46,000 of the original 110,000 headwords. The pronunciations included in the Concise dictionary have been highlighted in yellow in the above list. Restricting the TTS dictionary to the highlighted pronunciations would reduce the number of strong/hard homographs from 90 to 46. This restriction may appear somewhat arbitrary, but it allows attention to be focussed on the words which are more likely to be encountered in a practical application (or an evaluation) of the TTS system. The small number of homographs which remain after culling also suggests that a hand-coded approach to their disambiguation would be feasible.
There is a more compelling reason for exclusion of uncommon homographs. A TTS system must discriminate between pronunciations according to context, and any systematic analysis of usage of words in context must be performed on available corpora of texts. If the corpora contain no occurrence of a given homograph, then they can provide no basis for selecting the corresponding pronunciation. Spot checking of the ACE and ICE corpora at the SHLRC Corpus Web Site confirms the absence of the obscure terms given above. These terms presumably exist in the OZCORP corpus on which the Macquarie Dictionary is based.
The topic of Word Sense Disambiguation (WSD) is discussed in some detail in chapter 17 of Jurafsky and Martin, which is paraphrased here. It is an important aspect of lexical semantic processing, of which homograph disambiguation is just one of many applications. WSD can be performed either as a side-effect of a fully-fledged semantic analysis or by a stand-alone approach. A full semantic analysis is unlikely to be feasible in a practical speech synthesis application, in the absence of extensive semantic and commonsense knowledge and a complete and accurate parse of the input text. Robust stand-alone WSD systems are more promising. The input to these systems consists of a target word in its immediate context, which can be pre-processed by POS tagging and stemming. It is then encoded in a feature vector suitable for processing by a learning algorithm.
Collocational features encode information about lexical items in specific positions to the left or right of the target, for example:-
An electric guitar and bass player stand off to one side

gives a feature vector consisting of the two words to the left and right of the target, with their parts of speech:-
[guitar/NN and/CC player/NN stand/VB]
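The extraction of such collocational features can be sketched in Python (the function name is an invention for illustration; the tags follow the Penn Treebank style mentioned above):

```python
def collocational_features(tagged, i, window=2):
    """Collect the words (with POS tags) up to `window` positions to the
    left and right of the tagged token at index i, skipping the target."""
    feats = []
    for j in range(i - window, i + window + 1):
        if j == i or not (0 <= j < len(tagged)):
            continue
        word, tag = tagged[j]
        feats.append(f"{word}/{tag}")
    return feats

# The example sentence fragment, already POS-tagged; "bass" is the target.
sent = [("guitar", "NN"), ("and", "CC"), ("bass", "NN"),
        ("player", "NN"), ("stand", "VB")]
print(collocational_features(sent, 2))
# ['guitar/NN', 'and/CC', 'player/NN', 'stand/VB']
```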
Co-occurrence vectors encode the occurrence of specific keywords within a window centred on the target word.
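A sketch of how such a co-occurrence vector might be computed, assuming a hand-chosen keyword list (the keywords here are invented for illustration):

```python
def cooccurrence_vector(words, i, keywords, window=10):
    """Binary vector: does each keyword occur within +/- window of position i?"""
    context = set(words[max(0, i - window):i] + words[i + 1:i + window + 1])
    return [1 if k in context else 0 for k in keywords]

keywords = ["fishing", "guitar", "player", "violin"]  # illustrative keywords
words = "an electric guitar and bass player stand off to one side".split()
print(cooccurrence_vector(words, words.index("bass"), keywords))
# [0, 1, 1, 0]
```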
In a supervised learning approach, a WSD system is trained with a set of feature vectors for which the correct labels have been provided. The result is a classifier which will assign labels to newly presented feature vectors. Systems using a naive Bayes classifier or a neural network have reportedly achieved 73% accuracy.
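As an illustration only (not any of the systems reported on above), a minimal naive Bayes classifier over such feature vectors, with add-one smoothing and toy training data for the two senses of bass:

```python
from collections import Counter, defaultdict
import math

def train_nb(examples):
    """examples: list of (features, sense). Returns naive Bayes counts."""
    sense_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, sense in examples:
        sense_counts[sense] += 1
        for f in feats:
            feat_counts[sense][f] += 1
            vocab.add(f)
    return sense_counts, feat_counts, vocab

def classify_nb(feats, sense_counts, feat_counts, vocab):
    """Pick the sense maximising log P(sense) + sum log P(feature|sense)."""
    total = sum(sense_counts.values())
    best, best_lp = None, -math.inf
    for sense, n in sense_counts.items():
        lp = math.log(n / total)
        denom = sum(feat_counts[sense].values()) + len(vocab)
        for f in feats:
            lp += math.log((feat_counts[sense][f] + 1) / denom)  # add-one smoothing
        if lp > best_lp:
            best, best_lp = sense, lp
    return best

# Toy labelled data (illustrative, not drawn from any corpus):
train = [(["guitar/NN", "player/NN"], "music"),
         (["fishing/NN", "river/NN"], "fish"),
         (["band/NN", "player/NN"], "music")]
model = train_nb(train)
print(classify_nb(["guitar/NN", "band/NN"], *model))  # music
```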
In a decision list classifier, a series of tests are applied to each feature vector, until a test succeeds, when the corresponding label is selected. In the WSD system described by Yarowsky, various tests are evaluated over the training set by calculating the log-likelihood ratio of the conditional probabilities of the alternative senses, given a matching feature vector. About 96% accuracy is claimed.
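The ranking step can be sketched as follows; the smoothing constant and the toy data are assumptions, but the ordering criterion is the log-likelihood ratio described above:

```python
import math

def rank_tests(examples, tests, s1, s2):
    """Rank candidate tests for a two-sense target by the (smoothed)
    absolute log-likelihood ratio of the senses among matching examples."""
    ranked = []
    for name, test in tests:
        n1 = sum(1 for feats, s in examples if test(feats) and s == s1)
        n2 = sum(1 for feats, s in examples if test(feats) and s == s2)
        llr = abs(math.log((n1 + 1) / (n2 + 1)))   # add-one smoothing
        ranked.append((llr, name, s1 if n1 >= n2 else s2))
    ranked.sort(reverse=True)
    return ranked

# Toy labelled examples for the two senses of "bass" (illustrative):
examples = [(["guitar", "player"], "music"),
            (["band", "player"], "music"),
            (["fishing", "boat"], "fish"),
            (["river", "fishing"], "fish"),
            (["fishing", "lake"], "fish")]
tests = [("has 'player'", lambda f: "player" in f),
         ("has 'fishing'", lambda f: "fishing" in f),
         ("has 'boat'", lambda f: "boat" in f)]
for llr, name, sense in rank_tests(examples, tests, "music", "fish"):
    print(f"{name} -> {sense} ({llr:.2f})")
```

The resulting list applies the most discriminating test first; at classification time the first test that succeeds selects the sense.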
Effective training of a WSD system requires a large corpus of labelled examples, which can be laborious to produce. The workload can be reduced by a bootstrapping approach, in which an initial classifier is trained on a relatively small number of carefully chosen, hand-labelled examples. This initial classifier is then used to label examples from a larger training set, which in turn is used for another round of training.
In a quite different unsupervised approach, the feature vectors can be clustered by similarity and the clusters then labelled by hand. Accuracy in the 90% range is claimed for this method.
Finally, there is another different approach based on the use of machine-readable dictionaries. This approach leverages the considerable investment of lexicographic effort which dictionary compilation has entailed. The various senses of the target word are retrieved from the dictionary, and each is compared to the dictionary definitions of all of the words in the context of the target word. The sense with the highest overlap is selected. This approach sounds attractive for MU-Talk, where the machine-readable dictionary could be the same as the source of the TTS dictionary (presumably the Macquarie Dictionary). Unfortunately, accuracy seems to be only 50-70%.
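The overlap computation can be sketched as follows; the glosses here are invented stand-ins, not Macquarie definitions:

```python
def dictionary_overlap(senses, context, glosses):
    """Pick the sense whose dictionary definition shares the most words
    with the definitions of the context words (a simplified sketch)."""
    context_words = set()
    for w in context:
        context_words.update(glosses.get(w, "").lower().split())
    best, best_overlap = None, -1
    for sense in senses:
        overlap = len(set(glosses[sense].lower().split()) & context_words)
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

# Toy glosses (illustrative only):
glosses = {
    "bass/music": "the lowest part in music sung or played",
    "bass/fish": "an edible freshwater fish",
    "guitar": "a stringed instrument played in popular music",
    "angler": "a person who catches fish with a hook",
}
print(dictionary_overlap(["bass/music", "bass/fish"], ["guitar"], glosses))
# bass/music
```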
Of the WSD approaches described by Jurafsky and Martin, that of Yarowsky is the most attractive, not only because it seems to be the most accurate, but because the use of decision lists makes it very straightforward to program the classification in a TTS system, once the feature vector tests have been ranked by training on a labelled corpus. It is not clear, however, whether his algorithm is readily available. It would also be useful to obtain the full decision lists which he obtained for problematic TTS homographs such as lead and bass. Only highly abbreviated lists are published in his paper.
As with any software project, the implementation costs of a fully-fledged WSD system need to be justified by considering whether the improvements in functionality contribute to the design goals of the system. It may be that a useful increment in homograph disambiguation capability can be achieved by implementing a few fairly simple rules inspired by the decision list approach.
Some useful lexical features can be discriminated using regular expressions. Since regular expressions are used to search the SHLRC corpora, their use confers the advantage that the decision rules can readily be tested against the corpora to confirm that only words with the specified pronunciation are retrieved.
The following decision lists have been derived by inspection of the ACE corpus:-
sake n. | |
---|---|
the sake of | ⇒ [seik] |
own sake | ⇒ [seik] |
('s|') sake | ⇒ [seik] |
(my|thy|his|our|your|their) sake | ⇒ [seik] |
(some|of|the) sake | ⇒ ['sa:ki:] |
sake | ⇒ [seik] |
bass n. | |
---|---|
Bass (highway|Hill|Strait) | ⇒ [bAs] |
bass | ⇒ [beis] |
bow n. v. | |
---|---|
bow tie | ⇒ [b@u] |
take a bow | ⇒ [bau] |
bow and arrow | ⇒ [b@u] |
bow (\S+ ){0,9}?(boat|ship|vessel)('?s)? | ⇒ [bau] |
(boat|ship|vessel)('?s)? (\S+ ){0,9}?bow | ⇒ [bau] |
Bow/NNP | ⇒ [b@u] |
bow/VB | ⇒ [bau] |
bow/NN | ⇒ [b@u] |
invalid adj. | |
---|---|
invalid pension | ⇒ ['Inv@lId] |
invalid | ⇒ [In'vAl@d] |
minute adj. | |
---|---|
-minute | ⇒ ['mIn@t] |
\d+ minute | ⇒ ['mIn@t] |
minute | ⇒ [mai'nju:t] |
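Applied in order, with the first matching rule winning, such a decision list can be sketched directly as regular-expression rules; here the sake list, in Python rather than in the synthesiser itself:

```python
import re

# Ordered (pattern, pronunciation) rules for "sake", following the
# decision list above; the first matching pattern wins.
SAKE_RULES = [
    (r"\bthe sake of\b", "[seik]"),
    (r"\bown sake\b", "[seik]"),
    (r"('s|') sake\b", "[seik]"),
    (r"\b(my|thy|his|our|your|their) sake\b", "[seik]"),
    (r"\b(some|of|the) sake\b", "['sa:ki:]"),
    (r"\bsake\b", "[seik]"),
]

def pronounce(text, rules):
    """Return the pronunciation selected by the first matching rule."""
    for pattern, pron in rules:
        if re.search(pattern, text, re.IGNORECASE):
            return pron
    return None

print(pronounce("for the sake of argument", SAKE_RULES))  # [seik]
print(pronounce("a cup of sake, please", SAKE_RULES))     # ['sa:ki:]
```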
The aim was to choose the most general rules possible which would correctly classify the target words in the corpus. When these rules are inferred by human instead of machine intelligence, they can be made more general than the corpus demands by bringing linguistic knowledge to bear. In the ACE corpus, for example, sake is only preceded by the possessive pronouns your and their, but this usage clearly applies to the remaining possessive pronouns as well.
The more complicated expressions for bow are used to specify co-occurrence of bow with semantically related words within a window of plus or minus ten words.
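The windowed idiom can be checked in isolation (a Python sketch; the sample sentences are invented):

```python
import re

# One of the "bow" rules above: bow followed within ten words by a
# boat-related term.
pattern = r"bow (\S+ ){0,9}?(boat|ship|vessel)('?s)?"
print(bool(re.search(pattern, "the bow of the old fishing boat rose")))  # True
print(bool(re.search(pattern, "he took a bow after the performance")))   # False
```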
The decision list formalism can be used to combine the ideas discussed above and express them in a consistent manner.
For weak/easy cases, an unstressed pronunciation can be used as the default in case the POS tagger applies a tag which is not valid for the word in question:-
absent adj. v. | |
---|---|
absent/JJ | ⇒ ['Abs@nt] |
absent/VB | ⇒ [@b'sEnt] |
absent | ⇒ [AbsEnt] |
For strong/easy cases, the most frequent pronunciation can be used as the default for an invalid tag:-
abuse v. n. | |
---|---|
abuse/VB | ⇒ [@'bju:z] |
abuse/NN | ⇒ [@'bju:s] |
abuse | ⇒ [@'bju:s] |
For weak/hard cases, an unstressed or neutral pronunciation can be used as the default for contexts which lack evidence in the corpora or have an invalid tag:-
axes n. | |
---|---|
axes/NNS to grind | ⇒ ['Aks@z] |
(wood|pick) axes/NNS | ⇒ ['Aks@z] |
(major|minor|x|y|z) axes/NNS | ⇒ ['Aksi:z] |
axes | ⇒ ['AksIz] |
The decision list can also deal with cases where the pronunciation is unambiguous for some parts of speech but not for others:-
discount v. n. | |
---|---|
discount/NN | ⇒ ['dIskaunt] |
discount/VB by | ⇒ ['dIskaunt] |
discount/VB | ⇒ [dIs'kaunt] |
discount | ⇒ [dIskaunt] |
For strong/hard cases, the most frequent pronunciation can be used as the default for contexts which lack evidence in the corpora or have an invalid tag:-
gibber v. n. | |
---|---|
gibber/VB | ⇒ ['dZIb@] |
gibber/NN plain | ⇒ ['gIb@] |
gibber | ⇒ ['dZIb@] |
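Putting the pieces together, the discount list above can be sketched over POS-tagged text in word/TAG form (a Python sketch; the tagged sentences are invented, and the co-occurrence rule reuses the windowed idiom from the bow rules):

```python
import re

# The "discount" decision list above, applied to POS-tagged text in
# word/TAG form; the first matching rule wins.
DISCOUNT_RULES = [
    (r"\bdiscount/NN\b", "['dIskaunt]"),
    (r"\bdiscount/VB (\S+ ){0,9}?by/", "['dIskaunt]"),  # discount ... by
    (r"\bdiscount/VB\b", "[dIs'kaunt]"),
    (r"\bdiscount\b", "[dIskaunt]"),                    # neutral default
]

def pronounce(tagged_text, rules):
    """Return the pronunciation selected by the first matching rule."""
    for pattern, pron in rules:
        if re.search(pattern, tagged_text):
            return pron
    return None

print(pronounce("Please/RB discount/VB my/PP bill/NN by/IN 10%/CD ./.",
                DISCOUNT_RULES))  # ['dIskaunt]
print(pronounce("We/PP will/MD discount/VB your/PP story/NN ./.",
                DISCOUNT_RULES))  # [dIs'kaunt]
print(pronounce("He/PP got/VB a/DT discount/NN ./.",
                DISCOUNT_RULES))  # ['dIskaunt]
```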