Voice recognition and text to speech
Latest revision as of 14:44, 16 April 2024
Voice recognition
VOSK
- offline
- Apache license
- https://alphacephei.com/vosk/
- 20 languages
DeepSpeech
- offline
- Mozilla license
- https://github.com/mozilla/DeepSpeech
- the project itself focuses on English (verify), but it is trainable and various other models are available
Audapolis
- offline
- reads video, audio
- https://github.com/bugbakery/audapolis
PaddleSpeech
Whisper (by OpenAI)
- https://openai.com/blog/whisper/
- https://github.com/openai/whisper
- speech-to-text, language identification, and speech translation (it does not do text-to-speech)
Kaldi
Julius
- Open Source (BSD3)
- https://github.com/julius-speech/julius
Flashlight ASR
- Open source (MIT)
- https://github.com/flashlight/flashlight/tree/main/flashlight/app/asr
MOVI
- eSpeak plus Sphinx?
- speech recognizer, voice synthesizer
Voice synthesis and Text-to-speech
Analog hardware
The earliest devices to produce human-like speech were things like the Voder [1], which you can see as half of a vocoder.
The earliest talking products, often toys, would basically be the most minimal viable record players, playing fixed phrases from fixed plastic records, wind-up and non-electronic.
Speech ICs - mostly early ones
Around the seventies, it became viable to create ICs that produce speech, though the results were fairly basic.
Synthesizing arbitrary words is hard to do, because you then need
- some way to map all words to phonemes
- and languages like English have a lot of weird exceptions, even just in common words
- some way to handle unseen words
- phoneme blending, because just playing phonemes one after another is not enough
- decent intonation, which is often contextual
Earlier ICs had a fixed set of known words, each mapped to the phonemes and transitions to use.
Some variants accepted only phoneme input in the first place (allophone style, usually).
There were also slightly more capable variants used in speech research.
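The word-to-phoneme problem described above can be sketched as a two-stage lookup: an exception dictionary for known words, then naive per-letter rules for anything unseen. This is only an illustration of the idea. The phoneme symbols, both tables, and the function name `to_phonemes` are assumptions made for this sketch, not the contents of any real chip's ROM, and it does no blending or intonation at all.

```python
# Toy word-to-phoneme lookup: exception dictionary first, naive letter rules second.
# All symbols and table entries are illustrative assumptions, not real ROM data.

# Words whose pronunciation cannot be guessed from spelling.
EXCEPTIONS = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
    "of": ["AH", "V"],
}

# Crude one-letter fallback rules (space-separated phonemes per letter).
LETTER_RULES = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L",
    "m": "M", "n": "N", "o": "AA", "p": "P", "q": "K", "r": "R",
    "s": "S", "t": "T", "u": "AH", "v": "V", "w": "W", "x": "K S",
    "y": "Y", "z": "Z",
}


def to_phonemes(word):
    """Return a list of phoneme symbols for one word."""
    word = word.lower()
    if word in EXCEPTIONS:
        # Known word: use the stored pronunciation.
        return list(EXCEPTIONS[word])
    phonemes = []
    for letter in word:
        # Unknown characters (digits, punctuation) are silently skipped.
        phonemes.extend(LETTER_RULES.get(letter, "").split())
    return phonemes


if __name__ == "__main__":
    for w in ["one", "cat", "box"]:
        print(w, to_phonemes(w))
```

The fallback gets "cat" about right and would get "one" badly wrong without the exception entry, which is the reason fixed-vocabulary chips were the easier product to build.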
TMC0280 / TMS5100 and related (1978)
- Speak & Spell, Apple II Echo 2, arcade games
- these do linear predictive coding[2] style synthesis, so their input is essentially vocal tract parameters, not phonemes
- so these were typically driven from a fixed vocabulary of around 200 words, and letters
- though apparently a few things made it do more arbitrary things?
- Emulation: Yes, e.g. by MAME
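To make "vocal tract parameters, not phonemes" more concrete, here is a minimal sketch of LPC-style synthesis: an excitation signal (a pulse train for voiced sounds, noise for unvoiced ones) is fed through an all-pole filter whose coefficients describe the vocal tract for the current frame. The TI chips are generally described as using a lattice filter driven by reflection coefficients; the direct-form filter below is a swapped-in simplification, and the function name, parameters, and example values are assumptions for illustration only.

```python
import random


def lpc_synthesize(coeffs, gain, pitch_period, n_samples, seed=0):
    """Direct-form all-pole synthesis for one frame.

    y[n] = gain * e[n] - sum(coeffs[k] * y[n-1-k])

    pitch_period > 0 selects a voiced pulse train with that period (in samples);
    pitch_period == 0 selects unvoiced white-noise excitation.
    """
    rng = random.Random(seed)
    history = [0.0] * len(coeffs)  # previous outputs, most recent first
    out = []
    for n in range(n_samples):
        if pitch_period > 0:
            excitation = 1.0 if n % pitch_period == 0 else 0.0
        else:
            excitation = rng.uniform(-1.0, 1.0)
        y = gain * excitation - sum(a * h for a, h in zip(coeffs, history))
        out.append(y)
        if history:
            # Shift the newest output into the filter memory.
            history = [y] + history[:-1]
    return out


if __name__ == "__main__":
    # One made-up voiced frame: two coefficients, pulse every 80 samples.
    frame = lpc_synthesize([-1.2, 0.8], gain=0.5, pitch_period=80, n_samples=200)
    print(len(frame), max(frame), min(frame))
```

A real product stores a stream of such frames (pitch, energy, filter coefficients) per word, which is why the vocabulary is fixed: producing those frames for arbitrary text is a separate and harder problem.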
TMS5220 (1980), TMS5220C (1983), TSP50C50 (1985), TSP50C40 (1986)
- improvements on the same idea, used in later products
- see also https://en.wikipedia.org/wiki/Texas_Instruments_LPC_Speech_Chips
- Emulation: Talkie (Arduino) [3] seems to imitate TMS5220(verify)
Toshiba T6721A
- Commodore Magic Voice
- known vocabulary (also approx 200 words), rejects anything else
- Emulation: Yes, e.g. by YAPE(verify)
Votrax SC-01, SC-01A (1980?)
- phoneme chips
- often with something else looking up words or following syntax rules
- http://www.redcedar.com/sc01.htm
- Emulation: yes, e.g. by MAME [4], and e.g. [5][6] puts that MAME code in an STM32 to be a replacement for a broken IC
SSI 263A (a.k.a. SC-02?) (1985?)
SP0256-AL2 [7] (1980s?)
- follows some basic English phonetic rules
- The -AL2 means there are English allophones in there - there are other variants
- interfacing
- 8 bits of data, plus some latching
- you would typically load an address that contains allophone data (the ROM actually controls pitch, amplitude, formants)
- (could you send this manually? Or would you have to emulate an external ROM?)
- Instruction set
- examples
- Emulation: Yes, MAME
CTS256A-AL2
MEA8000 [10]
Compared to e.g. an SP0256, this has neither the microprocessor nor the ROM - it is intended to be controlled by a separate microprocessor (which has its own ROM)
...so you send it the saw/noise + formant model parameters yourself.
- Emulator: Yes, in MAME [11]
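The "saw/noise + formant" model mentioned for the MEA8000 can be sketched in software as a source signal (sawtooth for voiced sounds, noise for unvoiced) passed through a cascade of second-order resonators, one per formant. This is a generic formant-synthesis sketch using the commonly cited Klatt-style resonator formulas, not a model of the MEA8000's actual filter structure or register format; the sample rate, formant values, and function names are assumptions.

```python
import math
import random


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (unity gain at DC)."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def formant_synthesize(formants, pitch_hz, n_samples, sample_rate=8000,
                       voiced=True, seed=0):
    """Generate samples from a saw or noise source shaped by formant resonators.

    formants is a list of (centre frequency in Hz, bandwidth in Hz) pairs.
    """
    rng = random.Random(seed)
    signal = []
    phase = 0.0
    for _ in range(n_samples):
        if voiced:
            signal.append(2.0 * phase - 1.0)  # sawtooth in [-1, 1)
            phase = (phase + pitch_hz / sample_rate) % 1.0
        else:
            signal.append(rng.uniform(-1.0, 1.0))
    # Run the source through each resonator in turn (cascade arrangement).
    for freq, bandwidth in formants:
        a, b, c = resonator_coeffs(freq, bandwidth, sample_rate)
        y1 = y2 = 0.0
        filtered = []
        for x in signal:
            y = a * x + b * y1 + c * y2
            filtered.append(y)
            y2, y1 = y1, y
        signal = filtered
    return signal


if __name__ == "__main__":
    # Illustrative formant values, roughly in the range of an open vowel.
    vowel = formant_synthesize([(700, 100), (1200, 120), (2600, 160)],
                               pitch_hz=110, n_samples=800)
    print(len(vowel))
```

Changing the formant list every few milliseconds is what turns this from a steady buzz into speech, and supplying those changing values is the job left to the controlling microprocessor.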
PCF8200 - like the MEA8000 (seemingly based on it), but a little more capable.
DECtalk
- A fairly large board (verify), though the later DECtalk Express made it more portable.
- Emulation: Yes, e.g. DTC-01 (MAME), https://github.com/connornishijima/80speak
Stephen Hawking's voice is an oddball, because it is actually the voice of Dennis Klatt, the engineer who initially made his system. At the time, it was the best automated speech you could get.
Klatt's work went into other products, like the DECtalk.
While the early variant Hawking used sounded robotic, he refused upgrades, in part because he came to identify with the voice over time. He also seems to have appreciated the work of Klatt, who continued that work even as he lost his own voice.
Hardware-wise, the Speech Plus CallText 5010 (a model specific to Hawking) is basically a custom computer (quite old, based on an 80188), and the most interesting part of it is the DSP that translates formant descriptors to sound.
https://speechkit.io/blog/stephen-hawkings-voice/
unsorted
RoboVoice SP0-512
- English text to speech, relatively basic
- SP0-512-Datasheet.pdf
Franklin Language Master LM4000
DT1050 Digitalker
- http://vtda.org/docs/components/NatSemi/Digitalker/IM-FL30M120_DT1050_Digitalker_Datasheet_Dec80.pdf
V30120
- Emic 2
- Fonix DECtalk ?
more recent
6188 (2003?), SYN6288 (2010?)
- Chinese, other?
WTS701
XFS5152CE (much more recent?)
- arbitrary text, Chinese and English
Software (also starting early)