Speech Synthesis

demery's version from 2016-05-12 21:26

Section 1

Question Answer
23) What kind of contextual information is useful for determining the correct phoneme for a given grapheme in a word? The previous graphemes and phonemes, as well as the next few graphemes.
24) What is prosody? Pitch variation and the length/rhythm of words in a sentence.
25) In terms of prosody, what is the difference between the statement "It's nice out." and the question "It's nice out?" The question raises pitch at the end.
26) What is an intonation phrase? How does it differ from a syntactic phrase? An intonation phrase boundary occurs where the pitch contour of a spoken phrase resets. It is not based on syntax like a syntactic phrase, so the boundaries may differ.
27) Give an example of a feature that might be useful for an intonation phrase boundary classifier? Phrase length is often used: the longer the phrase, the more likely we are about to hit a boundary; the shorter, the less likely. Part of speech is also useful.
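The features named in the answer above can be sketched as a small feature extractor. The `boundary_features` helper, the feature names, and the toy POS tags are illustrative assumptions, not from any particular system:

```python
# Sketch: features for an intonation-phrase-boundary classifier.
# Feature names and the toy POS tags are illustrative, not from a real system.

def boundary_features(words, pos_tags, i, last_boundary):
    """Features for deciding whether a boundary follows word i."""
    return {
        "phrase_len": i - last_boundary + 1,  # longer phrase -> boundary more likely
        "pos": pos_tags[i],                   # part of speech of the current word
        "next_pos": pos_tags[i + 1] if i + 1 < len(pos_tags) else "END",
        "sentence_final": i == len(words) - 1,
    }

words = ["the", "cat", "sat", "on", "the", "mat"]
pos = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
feats = boundary_features(words, pos, 2, 0)  # candidate boundary after "sat"
```

A real classifier would be trained on many such feature vectors with boundary/no-boundary labels.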
28) Why is it important to know which word in an intonation phrase is pitch accented? Pitch accents can completely change the meaning of a sentence.
29) Why do most TTS systems limit themselves to the following prosodic contours: question rise, declarative final fall, and continuation rise? Modeling all contours would be very hard; these three are the most important and the easiest to identify.
30) Compared to unit selection, what additional steps are required in diphone synthesis to capture the prosody of the target utterance? In unit selection, you pick units that already match the prosody you want (if you have them). In diphone synthesis, the diphones are passed through signal-processing filters to change their prosody.
31) Give an example of how prosodic context affects segment duration. Unstressed syllables tend to be reduced, while accented/stressed syllables are elongated.
32) What are pitch accent labels like H*, L*, etc. used for in TTS prosodic analysis? These act as markers for where to anchor the pitch accent contours.
33) In the context of speech synthesis, what is a diphone? The end half of one phone joined to the start half of the next.

Section 2

Question Answer
34) Why is the number of needed diphones for a diphone synthesis database typically less than n^2, where n is the number of sounds in the language? Not all possible transitions occur in the language.
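The point above can be illustrated by counting attested diphones in a toy phone-transcribed corpus. A real inventory is built the same way from a large corpus; the phone strings below are made up for illustration:

```python
# Sketch: counting attested diphones in a toy phone-transcribed corpus.
# The phone strings are invented examples, not real transcriptions.

corpus = [
    ["HH", "AH", "L", "OW"],  # "hello"
    ["W", "ER", "L", "D"],    # "world"
    ["HH", "AW"],             # "how"
]

phones = {p for utt in corpus for p in utt}
diphones = {(a, b) for utt in corpus for a, b in zip(utt, utt[1:])}

n = len(phones)
# Only transitions that actually occur need a recording, so |diphones| <= n^2.
```

Here 8 phones give 64 possible diphones, but only 7 actually occur.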
35) What does it mean to run an ASR system in forced alignment mode? You tell the system what is being said and force the alignment to match; if it knows the first sound is a B, it aligns the first sound as a B.
36) How can we shorten the duration of a diphone unit? Cut the diphone into frames, look for stable (repeating) parts, and cut some of those frames out.
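The frame-cutting idea above can be sketched as follows; frames are toy scalar features here, whereas real systems delete whole pitch periods:

```python
# Sketch: shortening a unit by deleting frames from its stable middle,
# where consecutive frames are nearly identical. Toy scalar frames only.

def shorten(frames, drop, tol=0.01):
    """Remove up to `drop` frames whose left neighbor is nearly identical."""
    out = list(frames)
    i = 1
    while drop > 0 and i < len(out):
        if abs(out[i] - out[i - 1]) < tol:  # stable region: safe to cut
            del out[i]
            drop -= 1
        else:
            i += 1
    return out

frames = [0.0, 0.5, 0.5, 0.5, 0.9]  # the repeated 0.5s are the stable part
short = shorten(frames, drop=2)
```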
37) Name one disadvantage of diphone synthesis. It doesn't sound very good: the signal processing still sounds robotic, and it still has joins.
38) In unit selection synthesis, what are join costs? Joining any two segments runs the risk of sounding bad. The join cost is a metric for how bad, based on how similar the end of the first segment and the beginning of the second are.
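One simple way to realize the similarity metric above is the distance between the boundary frames of the two units. Real join costs also compare F0 and energy; the feature vectors here are toy values:

```python
# Sketch: a join cost as the Euclidean distance between the last frame of
# one unit and the first frame of the next. Toy spectral vectors only.
import math

def join_cost(unit_a, unit_b):
    end, start = unit_a[-1], unit_b[0]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(end, start)))

a = [(0.1, 0.2), (0.4, 0.4)]
b = [(0.4, 0.4), (0.9, 0.1)]  # starts exactly where `a` ends
```

A cost of zero, as with `join_cost(a, b)`, corresponds to question 40 below: a perfect join.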
39) Why might it be important to know whether a unit was recorded as part of a function word (as opposed to a content word)? Function words are almost always reduced, whereas content words are not.
40) What does it mean for the join cost of a candidate unit to be zero? Is this good or bad? It means the join is perfect, which is good; the two units were probably contiguous in the original recording.
41) Explain briefly how the Viterbi algorithm is used to find the best sequence of units to synthesize an utterance in unit selection? Build a lattice in which each target phone position has its candidate database units as states. Each unit carries a target cost (how well it matches the target specification), and each transition carries the join cost of concatenating the two units. Running Viterbi over this lattice finds the cheapest path, i.e. the unit sequence with the lowest combined target and join cost.
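The lattice search described above can be sketched in a few lines. The unit names, costs, and the `adjacent` set are invented for illustration:

```python
# Sketch: Viterbi search over candidate units. `candidates[p]` lists
# (unit, target_cost) pairs for phone p; `join(u, v)` is a toy join cost.

def viterbi_units(targets, candidates, join):
    # best[u] = (cost of the cheapest path ending in unit u, that path)
    best = {u: (tc, [u]) for u, tc in candidates[targets[0]]}
    for phone in targets[1:]:
        new = {}
        for u, tc in candidates[phone]:
            prev = min(best, key=lambda v: best[v][0] + join(v, u))
            cost, path = best[prev]
            new[u] = (cost + join(prev, u) + tc, path + [u])
        best = new
    return min(best.values())[1]  # path with the lowest total cost

candidates = {"AH": [("AH_1", 0.2), ("AH_2", 0.5)], "B": [("B_1", 0.1)]}
adjacent = {("AH_1", "B_1")}  # contiguous in the same recording: free join
join = lambda u, v: 0.0 if (u, v) in adjacent else 1.0
path = viterbi_units(["AH", "B"], candidates, join)
```

The search prefers AH_1 over the cheaper-looking alternatives once the free join to B_1 is taken into account.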
42) In parametric/HMM synthesis, how is an HMM used to generate speech? Given a sequence of phone labels to synthesize, the HMM generates the most likely sequence of speech parameters, which are then turned into a waveform.
43) In parametric/HMM synthesis, why do we need to model excitation parameters in addition to spectral parameters? In ASR we were only concerned with spectral features because we didn't care about the source. In synthesis the source is critical, so we need excitation parameters for the source in addition to spectral information for the filter.

Section 3

Question Answer
44) What two measures are most commonly used to evaluate TTS systems? Intelligibility and naturalness.
45) What are rhyme tests (e.g., discriminating pairs like 'tense' and 'dense') used to evaluate? Intelligibility.
46) Questioning, promising, expressing, and declaring are all examples of? Speech acts.
47) Back-channel acknowledgments like 'yeah' are important for: Letting the speaker know you are still in the conversation.
48) If you say there are 5 open seats in CS105 when there is only 1, you have violated which maxim? Quality.
49) Which type of dialogue system fills in blanks and allows the user to provide more than one answer at a time? Frame-based.
50) The exchange 'I want a flight to Dublin', 'When do you want to fly to Dublin?' is an example of? Implicit confirmation.


Question Answer
1) When is speaker adaptation needed? When the system encounters a speaker outside of what the recognizer was trained on, like someone with an accent.
2) How is speaker adaptation accomplished? You give the HMM a small sample of KNOWN speech from that speaker, and then the model is shifted to better match that sample.
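The "shifting" idea above can be reduced to a toy illustration: interpolating model means toward the sample mean. Real systems use techniques such as MLLR or MAP adaptation; this sketch is only meant to show the direction of the shift:

```python
# Sketch: speaker adaptation reduced to shifting model means toward a
# small sample of known (enrollment) speech. Toy interpolation only;
# real systems use MLLR or MAP adaptation.

def adapt_means(model_means, sample_frames, weight=0.3):
    """Shift each model mean toward the sample average by `weight`."""
    sample_mean = sum(sample_frames) / len(sample_frames)
    return [m + weight * (sample_mean - m) for m in model_means]

means = [1.0, 2.0]
adapted = adapt_means(means, sample_frames=[3.0, 3.0])  # enrollment speech
```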
3) What is an OOV word? An out-of-vocabulary word; a word not in the lexicon.
4) Which of the following is finite in an ASR system: acoustic model, lexicon, language model? The lexicon, which is the dictionary of words.
5) What kinds of words (i.e., parts of speech) are more likely to be OOV words? Content words, especially proper nouns.


Question Answer
6) What step(s) of TTS are part of the front end? What step(s) are part of the back end? The front end is where text processing and analysis happen. The back end is where phonemes are turned into waveforms.
7) TF Both diphone and formant synthesis involve concatenating actual speech recordings. False; formant synthesis does not.
8) TF Unit selection databases are smaller than diphone synthesis databases. False; unit selection databases are huge, which is pretty much the only problem with unit selection.
9) TF A weakness of concatenative synthesis is potentially audible 'glitches' where recordings are joined. True.
10) TF All parametric synthesis systems are implemented with HMMs. False.
11) TF Very early approaches to speech synthesis involved mechanical replicas of the human articulatory system. True.
12) We discussed the use of classifiers in many aspects of TTS. Briefly describe what a classifier is. A classifier is a model that assigns one of a fixed set of labels to an input based on its features; classifiers are trained on labeled data. For example, a part-of-speech classifier labels each word with its part of speech.
13) Provide at least one feature that might be useful for a classifier that tokenizes a sentence. Part of speech; intonation boundaries.
14) What is an NSW (non-standard word)? A word that cannot be pronounced by regular rules, like some abbreviations (TGIF, Fri.) and other things such as Ke$ha.
15) What is the difference between an abbreviation (like Fri.) and a letter sequence? Fri. expands to Friday (one word), but for TGIF each letter is just pronounced individually.
16) What is the difference between a letter sequence (like TGIF) and an acronym (like AWOL)? AWOL can be pronounced as a word, and it is; TGIF cannot, so its letters are read individually.
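The three NSW types above can be dispatched with a small lookup-based sketch. The tiny word lists are stand-ins for the lexicons a real text normalizer would use:

```python
# Sketch: dispatching non-standard words by type. The word lists are toy
# stand-ins for real normalizer lexicons.

ABBREVIATIONS = {"Fri.": "Friday", "Dr.": "Doctor"}  # expand to full words
ACRONYMS = {"AWOL", "NASA"}                          # pronounced as words

def expand_nsw(token):
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    if token.isupper() and token in ACRONYMS:
        return token.lower()          # say it as one word
    if token.isupper():
        return " ".join(token)        # letter sequence: spell it out
    return token
```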
17) What is homograph disambiguation? Sometimes two words are spelled the same but said differently depending on context, for example bass (instrument) and bass (fish); homograph disambiguation is picking the right pronunciation.
18) What is G2P? What role does it play in a TTS system? Grapheme-to-phoneme conversion. It turns the strings of letters it is given into strings of phones, getting the text ready to be converted into sound.
19) Why do some systems train separate G2P converters for names and non-names? Names are pronounced differently because they often come from other languages.
20) Write a letter-to-sound rule for the c in 'chemistry' (letter-to-sound rules are insufficient for G2P). C -> K when followed by h... but that rule fails for 'chef', which is exactly why such rules are insufficient.
21) Give one possible alignment between the word 'baseball' and its pronunciation? B:B a:EY s:S e:<eps> b:B a:AA l:L l:<eps>
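The 'baseball' alignment above can be written as data, with '<eps>' marking letters that map to no phone (silent letters):

```python
# Sketch: the 'baseball' letter-to-phone alignment as grapheme/phone pairs;
# '<eps>' marks letters with no corresponding phone.

alignment = [
    ("b", "B"), ("a", "EY"), ("s", "S"), ("e", "<eps>"),
    ("b", "B"), ("a", "AA"), ("l", "L"), ("l", "<eps>"),
]

letters = "".join(g for g, _ in alignment)
phones = [p for _, p in alignment if p != "<eps>"]
```

Dropping the epsilons recovers the phone string; joining the graphemes recovers the spelling.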
