Right now I’m experimenting with using KDE Connect in combination with Google speech recognition on my Android smartphone.
KDE Connect lets you use your Android device as an input device for your Linux computer (it has some other features too). You need to install the KDE Connect app from the Google Play Store on your smartphone/tablet, and install both kdeconnect and indicator-kdeconnect on your Linux computer. On Ubuntu the install goes as follows:
sudo add-apt-repository ppa:vikoadi/ppa
sudo apt update
sudo apt install kdeconnect indicator-kdeconnect
The downside of this installation is that it installs a bunch of KDE packages that you don’t need if you don’t use the KDE desktop environment.
Once you pair your Android device with your computer (they have to be on the same network), you can bring up the Android keyboard and tap the mic icon to use Google speech recognition. As you talk, text starts to appear wherever your cursor is active on your Linux computer.
As for the results, they are a bit mixed for me: I’m currently writing a technical astrophysics document, and Google speech recognition struggles with jargon that you don’t typically read in everyday text. Also, forget about it figuring out punctuation or proper capitalization.
Another option is vosk-api, an offline, open-source speech recognition engine. It supports 7+ languages.
First you convert the audio file to the required format, and then you run recognition on it:
ffmpeg -i file.mp3 -ar 16000 -ac 1 file.wav
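The recognizer expects 16 kHz mono 16-bit PCM, which is what the `-ar 16000 -ac 1` flags produce. If a recognition run gives garbage, a quick sanity check of the converted file with Python’s built-in wave module can rule out a conversion problem. Here is a small sketch; the demo writes a second of silence just so the snippet is self-contained, but the same `check_format` call applies to the file.wav produced above:

```python
import wave

def check_format(path):
    """Return (channels, sample_width_bytes, sample_rate) of a WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getsampwidth(), wf.getframerate()

# Demo file: one second of 16 kHz mono 16-bit silence, matching the
# ffmpeg flags above (-ac 1, 16-bit PCM, -ar 16000).
with wave.open("demo.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)

print(check_format("demo.wav"))  # expect (1, 2, 16000)
```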
Then install vosk-api with pip:
pip3 install vosk
Then use these steps:
git clone https://github.com/alphacep/vosk-api
cd vosk-api/python/example
wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.3.zip
unzip vosk-model-small-en-us-0.3.zip
mv vosk-model-small-en-us-0.3 model
python3 ./test_simple.py test.wav > result.json
The result is stored in JSON format.
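The JSON chunks typically contain a "text" field with the recognized sentence, and may contain a "result" list with per-word timings and confidences. The sample below is illustrative (the exact fields can vary by model and version, and these particular values are made up); it shows how you might pull the words back out with the standard json module:

```python
import json

# Illustrative sample of one recognition result chunk; the values here
# are invented for demonstration, not actual vosk output.
sample = '''
{
  "result": [
    {"conf": 1.0,  "start": 0.87, "end": 1.20, "word": "one"},
    {"conf": 0.97, "start": 1.25, "end": 1.60, "word": "zero"}
  ],
  "text": "one zero"
}
'''

chunk = json.loads(sample)
words = [w["word"] for w in chunk["result"]]
assert " ".join(words) == chunk["text"]
print(chunk["text"])       # the recognized sentence
print(chunk["result"][0])  # first word with timing and confidence
```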
The same directory also contains an SRT subtitle output example, which is more human-readable and directly useful if subtitles are your use case:
python3 -m pip install srt
python3 ./test_srt.py test.wav
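The srt library installed above takes care of the formatting; as an illustration of what an SRT entry amounts to (an index, a start --> end timestamp pair, and a text line), here is a minimal stdlib-only formatter, with made-up timings:

```python
def srt_timestamp(seconds):
    """Format a time in seconds as the HH:MM:SS,mmm form SRT uses."""
    total = int(seconds)
    ms = int(round((seconds - total) * 1000))
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_entry(index, start, end, text):
    """One subtitle block: index, timestamp range, then the text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

print(srt_entry(1, 0.87, 2.61, "one zero zero zero one"))
```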
The sections below show some testing I did with it.
test.wav case study
The test.wav example given in the repository contains, in a clear American English accent and with perfect sound quality, three sentences, which I transcribe as:
one zero zero zero one nine oh two one oh zero one eight zero three
The “nine oh two one oh” is spoken very fast, but still clearly. The “z” of the second-to-last “zero” sounds a bit like an “s”.
The SRT generated above reads:
1
00:00:00,870 --> 00:00:02,610
what zero zero zero one

2
00:00:03,930 --> 00:00:04,950
no no to uno

3
00:00:06,240 --> 00:00:08,010
cyril one eight zero three
so we can see that several mistakes were made, presumably in part because we, unlike the model, have the prior knowledge that all the words are numbers to help us.
Next I also tried vosk-model-en-us-aspire-0.2, a 1.4 GB download compared to the 36 MB of vosk-model-small-en-us-0.3, listed at https://alphacephei.com/vosk/models:
mv model model.vosk-model-small-en-us-0.3
wget https://alphacephei.com/vosk/models/vosk-model-en-us-aspire-0.2.zip
unzip vosk-model-en-us-aspire-0.2.zip
mv vosk-model-en-us-aspire-0.2 model
and the result was:
1
00:00:00,840 --> 00:00:02,610
one zero zero zero one

2
00:00:04,026 --> 00:00:04,980
i know what you window

3
00:00:06,270 --> 00:00:07,980
serial one eight zero three
which got one more word correct.
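The difference between the two models can be quantified with the word error rate (WER): the word-level edit distance between hypothesis and reference, divided by the reference length. A small sketch, fed with my reference transcription and the two outputs above:

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # Standard dynamic-programming edit distance, rolling a single row.
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1,           # delete a reference word
                      d[j - 1] + 1,       # insert a spurious word
                      prev + (rw != hw))  # substitution, or free match
            prev, d[j] = d[j], cur
    return d[-1] / len(r)

ref    = "one zero zero zero one nine oh two one oh zero one eight zero three"
small  = "what zero zero zero one no no to uno cyril one eight zero three"
aspire = "one zero zero zero one i know what you window serial one eight zero three"
print(f"small:  {wer(ref, small):.2f}")   # 7 errors out of 15 words
print(f"aspire: {wer(ref, aspire):.2f}")  # 6 errors out of 15 words
```

Both models get the middle, fast-spoken stretch almost entirely wrong, and the larger model wins by exactly that one word.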
IBM “Think” Speech case study
Now let’s have some fun, shall we? From https://en.wikipedia.org/wiki/Think_(IBM) (public domain in the USA):
wget https://upload.wikimedia.org/wikipedia/commons/4/49/Think_Thomas_J_Watson_Sr.ogg
ffmpeg -i Think_Thomas_J_Watson_Sr.ogg -ar 16000 -ac 1 think.wav
time python3 ./test_srt.py think.wav > think.srt
The sound quality is not great, with a lot of microphone hiss due to the technology of the time. The speech, however, is very clear and deliberately paced. The recording is 28 seconds long, and the wav file is 900 KB.
Conversion took 32 seconds. Sample output for the first three sentences:
1
00:00:00,299 --> 00:00:01,650
and we must study

2
00:00:02,761 --> 00:00:05,549
reading listening name scott

3
00:00:06,300 --> 00:00:08,820
observing and thank you
and the Wikipedia transcription for the same segment reads:
1
00:00:00,518 --> 00:00:02,513
And we must study

2
00:00:02,613 --> 00:00:08,492
through reading, listening, discussing, observing, and thinking.
“We choose to go to the Moon” case study
https://en.wikipedia.org/wiki/We_choose_to_go_to_the_Moon (public domain)
OK, one more fun one. This audio has good sound quality, with occasional cheers of approval from the crowd and a slight echo from the venue:
wget -O moon.ogv https://upload.wikimedia.org/wikipedia/commons/1/16/President_Kennedy%27s_Speech_at_Rice_University.ogv
ffmpeg -i moon.ogv -ss 09:12 -to 09:29 -q:a 0 -map a -ar 16000 -ac 1 moon.wav
time python3 ./test_srt.py moon.wav > moon.srt
Audio duration: 17s, wav file size 532K, conversion time 22s, output:
00:00:01,410 --> 00:00:16,800
we choose to go to the moon in this decade and do the other things not because they are easy but because they are hard because that goal will serve to organize and measure the best of our energies and skills
and the corresponding Wikipedia captions:
89
00:09:06,310 --> 00:09:18,900
We choose to go to the moon in this decade and do the other things,

90
00:09:18,900 --> 00:09:22,550
not because they are easy, but because they are hard,

91
00:09:22,550 --> 00:09:30,000
because that goal will serve to organize and measure the best of our energies and skills,
Perfect except for a missing “the” and punctuation!
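Neither of the timed runs above was faster than real time on this machine. The real-time factors (processing time divided by audio duration), from the figures reported above:

```python
# Real-time factor = processing time / audio duration.
# Figures taken from the two case studies above.
cases = {
    "think.wav": (32, 28),  # 32 s to process 28 s of audio
    "moon.wav":  (22, 17),  # 22 s to process 17 s of audio
}
for name, (proc, dur) in cases.items():
    print(f"{name}: RTF = {proc / dur:.2f}")  # > 1.0 means slower than real time
```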
Tested on vosk-api 7af3e9a334fbb9557f2a41b97ba77b9745e120b3, Ubuntu 20.04, Lenovo ThinkPad P51.
This answer is based on https://askubuntu.com/a/423849/52975 by Nikolay Shmyrev with additions by me.
https://github.com/ideasman42/nerd-dictation (wrapper for VOSK-API)
Try nerd-dictation; it’s a simple way to access VOSK-API, a high-quality offline, open-source speech-to-text engine.
See demo video.
Full disclosure: I couldn’t find any solutions that suited my use case, so I wrote this small utility to scratch my own itch.