34) Why is the number of needed diphones for a diphone synthesis database typically less than n^2, where n is the number of sounds in the language?
Not all of the n^2 possible sound-to-sound transitions actually occur in the language (phonotactics rules many out), so those diphones never need to be recorded.
35) What does it mean to run an ASR system in forced alignment mode?
You tell the system what is being said, and it is forced to align that transcription to the audio instead of recognizing freely. So if it knows a B is the first sound, it matches the start of the signal to B, which yields time boundaries for each sound.
36) How can we shorten the duration of a diphone unit?
Cut the diphone into frames, look for stable (nearly identical, repeating) parts in the middle, and cut out some of those frames.
37) Name one disadvantage to diphone synthesis
It doesn't sound very natural: the signal processing used to modify pitch and duration still sounds robotic, and there are still audible joins between units.
38) In unit selection synthesis, what are join costs?
Joining any two segments runs the risk of sounding bad. Join cost is some metric for how bad, based on how similar the end of the first segment and the beginning of the second are.
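One common way to measure that similarity is a distance between the acoustic feature vectors at the boundary frames. A minimal sketch, assuming the boundary frames are represented as hypothetical MFCC-style feature lists (a real system would also compare F0 and energy):

```python
import math

def join_cost(unit_a_end, unit_b_start):
    """Join cost as the Euclidean distance between the feature
    vectors of the last frame of unit A and the first frame of
    unit B. Identical boundary frames give a cost of zero."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(unit_a_end, unit_b_start)))

# Two units that were contiguous in the original recording share
# the same boundary frame, so their join cost is zero:
frame = [1.0, 2.0, 3.0]
print(join_cost(frame, frame))  # 0.0
```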
39) Why might it be important to know whether a unit was recorded as part of a function word (as opposed to a content word)?
Function words are almost always phonetically reduced, whereas content words are not, so a unit recorded in a function word may sound wrong in a content-word context.
40) What does it mean for the join cost of a candidate unit to be zero? Is this good or bad?
This means the join is seamless, which is good. It usually means the two units were adjacent in the original recording, so no artificial join is needed at all.
41) Explain briefly how the Viterbi algorithm is used to find the best sequence of units to synthesize an utterance in unit selection.
We build a lattice where each position in the target sequence has a set of candidate units from the database. Each candidate node carries a target cost (how well the unit matches the target specification), and each edge between candidates at consecutive positions carries a join cost (how well the two units concatenate). We then run the Viterbi algorithm over this lattice; the best path is the cheapest one, i.e. the unit sequence with the lowest combined target and join cost.
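The search above can be sketched in a few lines. This is a toy dynamic-programming version, assuming units are plain strings and the two cost functions are supplied by the caller (hypothetical names, not from any particular toolkit):

```python
def select_units(candidates, target_cost, join_cost):
    """Viterbi search over a unit-selection lattice.
    candidates[t]     -- list of database units for target position t
    target_cost(t, u) -- node cost: how well unit u matches target t
    join_cost(u, v)   -- edge cost: how well u concatenates with v
    Returns (total cost, best unit sequence)."""
    # best[u] = (cost of cheapest path ending in u, that path)
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for t in range(1, len(candidates)):
        new_best = {}
        for v in candidates[t]:
            # extend the cheapest predecessor path to v
            cost, path = min((best[u][0] + join_cost(u, v), best[u][1])
                             for u in candidates[t - 1])
            new_best[v] = (cost + target_cost(t, v), path + [v])
        best = new_best
    return min(best.values())

# Toy example: two target positions, two candidates each; all target
# costs are zero, so only the join costs decide the winner.
jc = {("a1", "b1"): 0.0, ("a1", "b2"): 5.0,
      ("a2", "b1"): 2.0, ("a2", "b2"): 1.0}
cost, path = select_units([["a1", "a2"], ["b1", "b2"]],
                          lambda t, u: 0.0,
                          lambda u, v: jc[(u, v)])
print(cost, path)  # 0.0 ['a1', 'b1']
```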
42) In parametric/HMM synthesis, how is an HMM used to generate speech?
Given a sequence of graphemes/phonemes to synthesize, the HMM is used to generate the most likely sequence of speech parameters (spectral and excitation), which a vocoder then converts into a waveform.
43) In parametric/HMM synthesis, why do we need to model excitation parameters in addition to spectral parameters?
In ASR we were only concerned with the spectral features because we didn't care about the source. In synthesis the source is a critical part, so we need excitation parameters (e.g. F0 and voicing) to generate the source signal, and we still need spectral parameters for the filter.
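The source-filter split can be illustrated with a toy vocoder frame: the excitation parameters (voicing flag and F0) build the source signal, and the spectral parameters (here, hypothetical LPC coefficients) shape it through an all-pole filter. This is an illustrative sketch, not a real vocoder:

```python
import random

def synthesize_frame(voiced, f0, lpc_coeffs, n_samples=80, sr=8000):
    """Toy source-filter synthesis for one frame.
    Excitation parameters (voiced, f0) -> source: an impulse train
    for voiced frames, white noise for unvoiced frames.
    Spectral parameters (lpc_coeffs)   -> filter: all-pole recursion
    y[n] = e[n] + sum_k a_k * y[n-k]."""
    period = int(sr / f0) if voiced else 0
    out = []
    for n in range(n_samples):
        # build the source sample from the excitation parameters
        if voiced:
            e = 1.0 if n % period == 0 else 0.0
        else:
            e = random.uniform(-0.1, 0.1)
        # shape it with the spectral (filter) parameters
        y = e + sum(a * out[n - k - 1]
                    for k, a in enumerate(lpc_coeffs) if n - k - 1 >= 0)
        out.append(y)
    return out

# A voiced frame at 100 Hz: an impulse at n=0 decays through the filter.
frame = synthesize_frame(True, 100.0, [0.5], n_samples=4)
print(frame)  # [1.0, 0.5, 0.25, 0.125]
```

Changing only the excitation (voiced vs. unvoiced, F0) while keeping the same spectral coefficients changes the sound source without changing the vocal-tract shape, which is exactly why both parameter streams must be modeled.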