Voice recognition and text to speech



Voice recognition

VOSK

offline
Apache license
https://alphacephei.com/vosk/
20 languages


DeepSpeech

offline
Mozilla Public License
https://github.com/mozilla/DeepSpeech
itself focused on English (verify), but trainable, and there are various models out there


Audapolis

offline
reads video and audio
https://github.com/bugbakery/audapolis


PaddleSpeech

https://github.com/PaddlePaddle/PaddleSpeech


Whisper (by OpenAI)

https://openai.com/blog/whisper/
speech-to-text and speech translation (it does not do text-to-speech)


Kaldi

FOSS
https://github.com/kaldi-asr/kaldi


Julius

https://github.com/julius-speech/julius


Flashlight ASR

https://github.com/flashlight/flashlight/tree/main/flashlight/app/asr


MOVI

eSpeak plus Sphinx?
speech recognizer, voice synthesizer



Voice synthesis and Text-to-speech

Analog hardware

The earliest devices that pronounced things in a human-like way were machines like the Voder [1], which you can see as half of a vocoder.


The earliest talking products, often toys, would basically be the most minimal viable record players, playing fixed phrases from fixed plastic records, wind-up and non-electronic.


Speech ICs - mostly early ones

Around the seventies, it became viable to create ICs that produce speech, though it was fairly basic.


Synthesizing arbitrary words is hard to do, because you then need

  • some way to map all words to phonemes
languages like English have a lot of weird exceptions, even just in common words
unseen words are a further problem
  • plus phoneme blending
just playing phonemes one after another does not sound natural
  • decent intonation
which is often contextual
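The first point (word to phoneme mapping, with exceptions and unseen words) can be sketched as a dictionary lookup with a rule-based fallback. This is a minimal illustration only: the exception list, the letter rules, and the ARPAbet-like phoneme names are all made up for this example and do not describe how any of the ICs below work.

```python
# Hypothetical exception dictionary: common words the naive rules get wrong.
EXCEPTIONS = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
    "the": ["DH", "AH"],
}

# Hypothetical letter-to-sound rules; two-letter patterns are tried first.
DIGRAPHS = {"sh": ["SH"], "ch": ["CH"], "th": ["TH"], "ee": ["IY"], "oo": ["UW"]}
LETTERS = {
    "a": ["AE"], "b": ["B"], "c": ["K"], "d": ["D"], "e": ["EH"], "f": ["F"],
    "g": ["G"], "h": ["HH"], "i": ["IH"], "j": ["JH"], "k": ["K"], "l": ["L"],
    "m": ["M"], "n": ["N"], "o": ["AA"], "p": ["P"], "q": ["K"], "r": ["R"],
    "s": ["S"], "t": ["T"], "u": ["AH"], "v": ["V"], "w": ["W"], "x": ["K", "S"],
    "y": ["Y"], "z": ["Z"],
}


def to_phonemes(word):
    """Map a word to a phoneme list: exceptions first, then naive rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phonemes = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phonemes.extend(DIGRAPHS[pair])
            i += 2
        elif word[i] in LETTERS:
            phonemes.extend(LETTERS[word[i]])
            i += 1
        else:
            i += 1  # no rule for this character, skip it
    return phonemes


print(to_phonemes("one"))    # handled by the exception dictionary
print(to_phonemes("sheep"))  # handled by the fallback rules
print(to_phonemes("tone"))   # fallback rules pronounce the silent e
```

The last call shows why the exception list keeps growing: the fallback rules know nothing about silent letters or context, which is the "weird exceptions" problem from the list above.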


Earlier ICs had a fixed set of known words, each mapped to the phonemes and transitions to use.

Some variants only accepted phoneme input in the first place (usually allophone style).

There were also slightly more capable variants used in speech research.



TMC0280 / TMS5100 and related (1978)

Speak & Spell, Apple II Echo 2, arcade games
these do linear predictive coding[2] style synthesis, so their input is essentially vocal tract parameters, not phonemes
so these were typically driven from a fixed vocabulary of around 200 words, and letters
though apparently a few things made it do more arbitrary things?
Emulation: Yes, e.g. by MAME
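
To illustrate what "vocal tract parameters" means for LPC style synthesis: each short frame carries a pitch, a gain, and a set of filter coefficients, and the output is an excitation (pulse train for voiced sounds, noise for unvoiced) run through an all-pole filter. The sketch below is a generic textbook direct-form version with made-up frame values; it does not model the actual chips' internal filter structure, parameter encoding, or sample rate.

```python
import random


def lpc_synthesize(frames, frame_len=200, seed=0):
    """Generic LPC synthesis sketch.

    frames: list of dicts with
      'pitch'  - samples per pitch period, 0 means unvoiced (noise)
      'gain'   - excitation gain
      'coeffs' - predictor coefficients a[1..p] (same p for every frame)
    Computes y[n] = gain * e[n] + sum(a[k] * y[n-k]).
    """
    rng = random.Random(seed)
    out = []
    history = []      # previous outputs, most recent first
    since_pulse = 0   # position within the current pitch period
    for frame in frames:
        coeffs = frame["coeffs"]
        for _ in range(frame_len):
            if frame["pitch"] > 0:
                # voiced: one unit impulse per pitch period
                excitation = 1.0 if since_pulse == 0 else 0.0
                since_pulse = (since_pulse + 1) % frame["pitch"]
            else:
                # unvoiced: white noise
                excitation = rng.uniform(-1.0, 1.0)
                since_pulse = 0
            sample = frame["gain"] * excitation
            for k, a in enumerate(coeffs):
                if k < len(history):
                    sample += a * history[k]
            history.insert(0, sample)
            del history[len(coeffs):]  # keep only as many as the filter needs
            out.append(sample)
    return out


# Made-up example: a single resonance near 500 Hz at an assumed 8 kHz rate,
# first voiced (pitch period 80 samples), then unvoiced.
resonance = [1.663, -0.81]
samples = lpc_synthesize([
    {"pitch": 80, "gain": 0.5, "coeffs": resonance},
    {"pitch": 0, "gain": 0.1, "coeffs": resonance},
])
print(len(samples))
```

A fixed-vocabulary product then only needs to store a sequence of such frames per word, which is much smaller than storing the audio itself.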

TMS5220 (1980), TMS5220C (1983), TSP50C50 (1985), TSP50C40 (1986)

improvements on the same idea, used in later products
see also https://en.wikipedia.org/wiki/Texas_Instruments_LPC_Speech_Chips
Emulation: Talkie (arduino) [3] seems to imitate TMS5220(verify)


Toshiba T6721A

Commodore Magic Voice
known vocabulary (also approx 200 words), refuses anything else
Emulation: Yes, e.g. by YAPE(verify)

Votrax SC-01, SC-01A (1980?)

phoneme chips
often with something else looking up words or following syntax rules
http://www.redcedar.com/sc01.htm
Emulation: Yes, e.g. by MAME [4]; also, [5][6] put that MAME code on an STM32 to act as a replacement for a broken IC


SSI 263A (a.k.a. SC-02?) (1985?)


SP0256-AL2 [7] (1980s?)

follows some basic English phonetic rules
The -AL2 means there are English allophones in there - there are other variants
interfacing
8 bits of data, plus some latching
you would typically load an address that contains allophone data (the ROM actually controls pitch, amplitude, formants)
(could you send this manually? Or would you have to emulate an external ROM?)
Instruction set
examples
Currah Speech 64, a.k.a. Voice Messenger [8], Tandy Speech/Sound Cartridge (alongside an AY-3-8913, see PSGs) [9], Amstrad SSA-1
Emulation: Yes, MAME
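
The "load an address, then latch it" interfacing described above can be sketched roughly as follows. Everything here is hypothetical: `FakeSpeechChip` is a stand-in that only records what would be latched, the method names are invented for this example, and the real chip's pin names, timing, and valid address range should be taken from its datasheet.

```python
class FakeSpeechChip:
    """Hypothetical stand-in for GPIO access to an allophone chip."""

    def __init__(self):
        self.address_lines = 0
        self.latched = []  # every address the chip has been told to speak

    def ready(self):
        # A real chip signals when it can accept the next address;
        # this fake is always ready.
        return True

    def set_address(self, value):
        # Put the allophone address on the data lines (8 bits here).
        self.address_lines = value & 0xFF

    def pulse_strobe(self):
        # Latch whatever is currently on the data lines.
        self.latched.append(self.address_lines)


def send_allophones(chip, addresses):
    """Feed a sequence of allophone addresses, one handshake per address."""
    for address in addresses:
        if not 0 <= address <= 0xFF:
            raise ValueError("address does not fit in 8 bits: %r" % address)
        while not chip.ready():
            pass  # busy-wait until the chip can take another address
        chip.set_address(address)
        chip.pulse_strobe()


chip = FakeSpeechChip()
send_allophones(chip, [1, 2, 3])  # placeholder values, not real allophone codes
print(chip.latched)
```

With real hardware the three methods would toggle GPIO pins instead, and the wait loop matters, because each allophone takes a while to play.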


CTS256A-AL2


MEA8000 [10]

Compared to e.g. an SP0256, this has neither the microprocessor nor the ROM: it is intended to be controlled by a separate microprocessor (which has its own ROM), so you send it the saw/noise + formant model parameters.
Emulation: Yes, in MAME [11]
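
As a rough picture of what a saw/noise + formant model does: a sawtooth (voiced) or noise (unvoiced) source is fed through a few resonant filters, one per formant. The sketch below uses generic second-order resonators in cascade; the cascade arrangement, the sample rate, and the formant values are assumptions made for this example, not the MEA8000's actual filter design or parameter format.

```python
import math
import random


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Second-order resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2].

    a is chosen so the gain at 0 Hz is 1 (a + b + c == 1).
    """
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def synthesize(formants, pitch_hz, n_samples, sample_rate=8000, noise=False, seed=0):
    """formants: list of (center frequency in Hz, bandwidth in Hz)."""
    rng = random.Random(seed)
    filters = [resonator_coeffs(f, bw, sample_rate) for f, bw in formants]
    state = [[0.0, 0.0] for _ in filters]  # [y[n-1], y[n-2]] per filter
    out = []
    phase = 0.0
    for _ in range(n_samples):
        if noise:
            x = rng.uniform(-1.0, 1.0)      # unvoiced source
        else:
            x = 2.0 * phase - 1.0           # sawtooth source in [-1, 1)
            phase = (phase + pitch_hz / sample_rate) % 1.0
        for (a, b, c), s in zip(filters, state):
            y = a * x + b * s[0] + c * s[1]
            s[1], s[0] = s[0], y
            x = y                            # feed into the next formant filter
        out.append(x)
    return out


# Made-up values loosely resembling an "ah"-like vowel.
vowel = synthesize([(700, 80), (1200, 90), (2600, 120)], pitch_hz=110, n_samples=800)
print(len(vowel))
```

The controlling microprocessor's job is then to update the pitch, source type, and formant values every few milliseconds.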


PCF8200

like the MEA8000 (seemingly based on it), but a little more capable


DECtalk

A fairly large board (verify), though the later DECtalk Express made it more portable.
Emulation: Yes, e.g. DTC-01 (MAME), https://github.com/connornishijima/80speak



Stephen Hawking's voice is an oddball, because it is actually modeled on the voice of Dennis Klatt, the engineer who initially made his system. At the time, it was the best automated speech you could get.

Klatt's work went into other products, like the DECtalk.

While the earlier variant Hawking used sounded robotic, he refused upgrades, in part because he came to identify with the voice over time. He also seems to have appreciated the work of Klatt, who kept working on speech synthesis even as he lost his own voice.

Hardware-wise, the Speech Plus CallText 5010 (a model specific to Hawking) is basically a custom computer (quite old, based on an 80188); the most interesting part of it is the DSP that translates formant descriptors to sound.


https://speechkit.io/blog/stephen-hawkings-voice/



unsorted


RoboVoice SP0-512

English text to speech, relatively basic
SP0-512-Datasheet.pdf


Franklin Language Master LM4000


DT1050 Digitalker

http://vtda.org/docs/components/NatSemi/Digitalker/IM-FL30M120_DT1050_Digitalker_Datasheet_Dec80.pdf



V30120

Emic 2
Fonix DECtalk ?


more recent


6188 (2003?), SYN6288 (2010?)

Chinese, other?


WTS701


XFS5152CE (much more recent?)

arbitrary text, Chinese and English

Software (also starting early)

This article/section is a stub: probably a pile of half-sorted notes, probably a first version, and not well-checked, so it may have incorrect bits. (Feel free to ignore, or tell me)