Text-to-speech (TTS): Converting written text into spoken words.
Phonetics: The study of speech sounds and their production, classification, and transcription.
Prosody: The patterns of stress, rhythm, intonation, and timing in speech.
Speech synthesis: The artificial production of human speech through computer algorithms.
Text normalization: The process of converting written text into a form suitable for speech synthesis.
Pronunciation modeling: The methods for determining how to pronounce words in a given language.
Acoustic modeling: The creation of statistical models that map linguistic units such as phonemes to their acoustic realizations.
Natural language processing: The use of computational techniques to analyze, understand, and generate human language.
Corpus linguistics: The study of large collections of text data used for language research.
Machine learning: The use of algorithms and statistical models to enable a computer to learn from data.
Linguistic data processing: The process of preparing linguistic data for use in natural language applications.
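Text normalization, as defined above, can be illustrated with a minimal sketch. The abbreviation table, the number-spelling rules, and the function names below are illustrative assumptions, not part of any standard library; a production front end would handle far more cases (dates, currency, ordinals, symbols).

```python
import re

# Assumed abbreviation table for this sketch only.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def number_to_words(n: int) -> str:
    """Spell out integers 0-99 (enough for this sketch)."""
    teens = ["ten", "eleven", "twelve", "thirteen", "fourteen",
             "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
    tens = ["", "ten", "twenty", "thirty", "forty",
            "fifty", "sixty", "seventy", "eighty", "ninety"]
    if n < 10:
        return ONES[n]
    if n < 20:
        return teens[n - 10]
    word = tens[n // 10]
    return word if n % 10 == 0 else f"{word}-{ONES[n % 10]}"

def normalize(text: str) -> str:
    """Expand abbreviations, then spell out one- and two-digit numbers."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return re.sub(r"\b\d{1,2}\b",
                  lambda m: number_to_words(int(m.group())), text)
```

For example, normalize("Dr. Lee lives at 42 Elm St.") yields "Doctor Lee lives at forty-two Elm Street", a form a synthesizer can pronounce directly.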
Formant synthesis: This type of TTS generates speech from an acoustic model of the vocal tract's resonant frequencies (formants). It is highly customizable and has a small footprint, but the voice it generates is not very natural-sounding.
Concatenative synthesis: This method creates speech by concatenating segments of recorded speech. It can produce highly natural-sounding voices but requires a large database of recorded speech.
Parametric synthesis: This method generates speech from a statistical model by manipulating parameters such as pitch, intonation, and duration. It is highly customizable and requires far less storage than concatenative synthesis.
Articulatory synthesis: In this method, the movements of the articulators (tongue, lips, etc.) are modeled to produce highly realistic-sounding speech. It is still in the experimental stage.
Hybrid synthesis: This method combines two or more of the above methods to produce natural-sounding voices while reducing the size of the database required.
Deep learning-based synthesis: It is an advanced form of parametric synthesis that operates using deep neural networks. The network is trained on large datasets to generate natural-sounding speech.
Rule-based synthesis: This type relies on hand-crafted linguistic and acoustic rules that govern how the synthesized voice is produced, including how its spectral characteristics are generated.
Singing synthesis: This method produces singing voices that resemble a human singer's voice.
Emotional TTS: It uses deep learning techniques to produce varying emotions in the synthesized voice.
Accent and dialect synthesis: It can generate customized accents and dialects for different regions.
Whisper synthesis: This method produces a whispering voice that mimics the breathy, low-energy quality of human whispering.
Shouting synthesis: This method produces a loud, emphatic voice with the strong intonation and raised pitch characteristic of shouting.
Voice cloning: It can clone human voices, using a neural network to capture the target speaker's voice, intonation, and speaking style.
Audio morphing synthesis: It can blend a synthesized voice with a recorded one to transform the sound of the speaker's voice.
Audiobook narration synthesis: It produces long-form narration of books, adjusting pace, pitch, and intonation to sustain listener engagement.
Text-to-speech synthesis for gaming applications: It generates voices for game characters, with inflection and speech patterns tailored to in-game dialogue and narration.
Multilingual TTS synthesis: It can synthesize speech in multiple languages, improving the accessibility of applications for global audiences.
Speech-to-speech translation (STST): This method translates spoken sentences into a different language to promote intercultural communication.
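The parametric and concatenative entries above can be sketched in a few lines: render a voiced segment as a waveform whose pitch and duration are explicit parameters, then join segments with a short crossfade to hide discontinuities at the boundaries. This is a toy illustration under assumed names and a 16 kHz sample rate, not a real synthesizer; actual systems model full spectral envelopes, not bare sine tones.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)

def synthesize_tone(pitch_hz: float, duration_s: float,
                    amplitude: float = 0.5) -> list:
    """Parametric sketch: one voiced segment as a sine wave whose
    pitch and duration are controllable parameters."""
    n_samples = int(SAMPLE_RATE * duration_s)
    return [amplitude * math.sin(2 * math.pi * pitch_hz * i / SAMPLE_RATE)
            for i in range(n_samples)]

def concatenate(segments: list, crossfade_samples: int = 160) -> list:
    """Concatenative sketch: join segments, linearly crossfading
    at each boundary to smooth the join."""
    out = list(segments[0])
    for seg in segments[1:]:
        fade = min(crossfade_samples, len(out), len(seg))
        for i in range(fade):
            w = (i + 1) / fade  # ramp from old segment into new one
            out[-fade + i] = out[-fade + i] * (1 - w) + seg[i] * w
        out.extend(seg[fade:])
    return out
```

Raising the pitch parameter across successive segments approximates rising intonation, which is the kind of prosodic control parametric systems expose.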