Expressive speech synthesis: Research and system design with hidden Markov models
While text-to-speech research has long focused on producing an intelligible message of good quality, interest has recently shifted toward generating more natural and expressive speech. This shift responds to the widespread criticism that current speech synthesizers lack fundamental human components. This thesis tackles that issue by considering three key stages of HMM-based speech synthesis: the phonetic annotation of the training corpus, its prosodic annotation, and the automatic alignment of these annotations with the speech signal. We first propose a systematic, step-by-step study of HMM-based phonetic alignment in which the models are trained directly on the corpus to be aligned. Based on a detailed analysis of the errors made by this technique, we develop three fully automatic improvement methods, which are shown to significantly improve the alignment of highly variable and expressive corpora.

We then present a two-level prosodic annotation of expressive corpora, describing accentual patterns and changes in speaking style. Integrating this manual annotation into the synthesis of sports commentaries improves the naturalness of the expressivity. We also present an automatic annotator of accentual patterns in French and show that integrating it into synthesis contributes to the naturalness of the voice.

Finally, our study shows that the choice of phonetic variants in French is influenced by the speaking style, and that taking these variants into account in the synthesis of sports commentaries improves the naturalness of the message. This indicates that phonetic changes should be considered at both the training and synthesis stages.

