Well-developed quality or development potential? – a task for the hackathon at Retresco
After a preselection of speech-recognition APIs based on certain parameters, an exciting race between the five finalists unfolded.
During our internal hackathons, which usually last two days, we try out new ideas or evaluate the current state of technology. This time we looked at the state of development of speech-recognition software. Using selected examples, this rtr-hackathon set out to determine which speech APIs available on the market work well, which only partly, and which not at all.
The key question was to what extent the quality of speech recognition is actually sufficient for real usage scenarios. We distinguished between two possible scenarios: “chat” and “text”. The difference between them is the length of the audio: our “chat” scenario stands for short speech sequences (ca. 15 seconds), as they occur, for example, in voice commands when interacting with assistance functions. The “text” scenario, in contrast, represents the transcription of longer voice recordings (ca. 2 minutes).
Our focus during the hackathon was on the text scenario, since this is the more relevant use case for us at Retresco.
It was noticeable that the tested speech-recognition APIs differed not only in quality and possible areas of application, but also in their prices and cost models.
Selection of speech-recognition APIs
The prerequisites were an existing web API or an option that could run locally on Linux, as well as reasonable costs and availability of German. This led to a selection of five speech-recognition APIs:
- Google Cloud Speech API
The speech API by Google can convert spoken words into text in more than 80 languages and language variants. As a cloud-based application, it can be used across platforms. Technically, the Google Speech API is based on deep learning and therefore enables constant improvement of recognition quality.
- Bing Speech API
Bing Speech API is part of Microsoft Cognitive Services. Besides speech-to-text, it also offers text-to-speech as well as a so-called “Speech Intent Recognition”, which delivers structured information instead of transcribed audio data. Like Google, Microsoft uses a cloud-based system.
- Speechmatics
Speechmatics is a British company founded in 2009 that specializes in artificial intelligence and deep learning. Like the two speech-to-text solutions mentioned above, Speechmatics also uses a cloud-based approach.
- Kaldi
The story of Kaldi begins in 2009 with a workshop at Johns Hopkins University. Kaldi has been under development ever since, driven mainly by Daniel Povey. In contrast to the speech-to-text solutions mentioned above, Kaldi is designed for local installation.
- CMU Sphinx
Carnegie Mellon University (CMU) has been developing this speech-recognition software since the 1980s and provides it as open-source software. Like Kaldi, Sphinx does not follow a cloud-based approach but runs locally on different platforms.
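Because the five candidates have such different integration paths (REST-style cloud APIs vs. locally installed engines), a thin common wrapper makes a side-by-side comparison easier. The following sketch is purely illustrative: the class and function names (`SpeechRecognizer`, `run_benchmark`, the dummy engine) are our own assumptions and do not come from any vendor's SDK.

```python
from abc import ABC, abstractmethod


class SpeechRecognizer(ABC):
    """Common interface so cloud and local engines can be benchmarked alike."""

    @abstractmethod
    def transcribe(self, audio_path: str) -> str:
        """Return the transcript for the given audio file."""


class DummyRecognizer(SpeechRecognizer):
    """Stand-in engine returning canned transcripts, used here only for illustration."""

    def __init__(self, transcripts: dict):
        self.transcripts = transcripts

    def transcribe(self, audio_path: str) -> str:
        return self.transcripts.get(audio_path, "")


def run_benchmark(engines: dict, audio_files: list) -> dict:
    """Collect one transcript per engine and file; real code would then
    score each transcript against a reference, e.g. with the WER."""
    return {name: [engine.transcribe(f) for f in audio_files]
            for name, engine in engines.items()}
```

In a real setup, each engine-specific subclass would hide the details of authentication, upload, and response parsing, so the evaluation loop stays identical for all five candidates.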
The Retresco team also determined which examples and test methods should lead to a final result.
The chosen audio examples were meant to represent a best-case scenario in terms of audio quality and, in terms of content, to come from an area that is relevant for most of Retresco's customers. For that reason, segments without background noise from the news domain (Deutschlandradio and Tagesschau.de) were chosen, spoken in clearly distinguishable voices. Care was also taken that every sample had a similar length and that male and female speakers were well balanced.
As a quality criterion, the Word Error Rate (WER) was determined based on the Needleman-Wunsch algorithm, which first calculates the degree of similarity of two sequences and then derives an alignment from it.
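The idea behind such a WER computation can be sketched in a few lines: a Needleman-Wunsch-style dynamic-programming matrix aligns the reference transcript with the recognizer's output, and the minimum number of substitutions, insertions, and deletions is divided by the reference length. This is a minimal illustration, not the exact scoring code used at the hackathon.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER via dynamic-programming alignment of two word sequences.

    d[i][j] holds the minimum number of substitutions, insertions and
    deletions needed to align the first i reference words with the
    first j hypothesis words (Needleman-Wunsch / edit-distance scheme).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # align against empty hypothesis: delete all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # align against empty reference: insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # match or substitution
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # match / substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `word_error_rate("the cat sat", "the cat sat down")` counts one insertion against three reference words, giving a WER of about 0.33.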
A summary of our tested speech-recognition APIs
- Google Cloud Speech API admittedly provides a large vocabulary, and its word recognition, including correct capitalization, was very good; however, punctuation marks were missing. Altogether, the API was rather sluggish with longer texts.
- Bing Speech API handled capitalization well. In terms of recognition quality, however, this API was less effective than Google and Speechmatics.
- Speechmatics, on the other hand, convinced with low implementation effort combined with a low word error rate. The fact that this API produces punctuation marks and handles capitalization is another plus. Interestingly, compound nouns were often written in lower case, which suggests that they are not recognized as such; here the Speechmatics algorithm still seems to have room for improvement. Despite higher prices compared to the other tested APIs, Speechmatics convinced with extraordinary quality.
- Kaldi is another no-cost option, with a server connected via GStreamer, implemented in Python and configurable via YAML files. The open-source ASR, which supposedly achieves a WER half that of Sphinx, also appears to be state-of-the-art. On the other hand, the API was quite slow on the system used here, due to its complexity. The confusing interfaces and options were also partly hard to use, quite apart from the fact that punctuation marks and capitalization were missing.
- CMU Sphinx is a no-cost option that admittedly supports punctuation, but it only provided capitalization without real word recognition. In addition, it reacted slowly on the test system.
There are clear differences in quality among the tested speech-recognition systems. Due to its excellent recognition quality, Speechmatics is our favorite for the “text” scenario. The worst competitor in this category was CMU Sphinx, owing to its poor recognition quality, although the reason for this probably lies in the training data used. The same goes for Kaldi, whose recognition quality was nevertheless significantly better than that of CMU Sphinx.
It was noticeable that Google's WER scales with the length of the audio file, which suggests that the Google API is optimized for short voice commands. The Google API is therefore the clear winner of our “chat” scenario. Speechmatics, by contrast, is geared towards the transcription of longer spoken text, which would explain its good result in that scenario.