Automatic speech recognition (ASR), also known as speech recognition, is the interdisciplinary subfield of computational linguistics that develops methodologies and technologies enabling computers to recognize spoken language and translate it into text. The technology is also known as computer speech recognition or speech to text (STT). Automatic speech recognition can further be defined as the independent, computer-driven transcription of spoken language into readable text in real time. In other words, ASR is a technology that allows a computer to identify the words that a person speaks into a microphone or telephone and convert them to written text.
The ultimate goal of ASR research is to allow a computer to recognize all words that are intelligibly spoken by any person, with 100% accuracy, in real time, independent of vocabulary size, noise, speaker characteristics or accent. Commercially available ASR systems usually require only a short period of speaker training and can capture continuous speech with a large vocabulary at a normal pace with very high accuracy.
The process that takes place in an automatic speech recognition system is described below:
i) The process begins when a speaker speaks a sentence.
ii) The software then produces a speech waveform, which embodies the words of the sentence along with all the extraneous sounds and pauses in the spoken input.
iii) Next, the software converts the speech signal into a sequence of vectors, which are measured throughout the duration of the speech signal.
v) Lastly, using a syntactic decoder, it generates a valid sequence of representations.
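Step iii above, turning a speech signal into a sequence of vectors, can be illustrated with a minimal sketch. It is not the method of any particular ASR product: it assumes a 16 kHz sample rate, 25 ms frames taken every 10 ms, and two simple per-frame measurements (log energy and zero-crossing rate), where production systems typically use richer features. The function names and the synthetic tone standing in for recorded speech are hypothetical, chosen only for illustration.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames of frame_len, stepping by hop."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def frame_features(frame):
    """Return a 2-element vector [log energy, zero-crossing rate] for one frame."""
    energy = sum(x * x for x in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0) on silence
    # Count adjacent sample pairs whose signs differ
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    return [log_energy, zcr]

def signal_to_vectors(samples, frame_len=400, hop=160):
    """Convert a waveform into one feature vector per frame (assumed 25 ms / 10 ms at 16 kHz)."""
    return [frame_features(f) for f in frame_signal(samples, frame_len, hop)]

# Synthetic stand-in for recorded speech: 0.5 s of a 440 Hz tone sampled at 16 kHz
sample_rate = 16000
samples = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate // 2)]

vectors = signal_to_vectors(samples)
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

Each vector describes a short slice of the signal, so the sequence as a whole follows the signal over its full duration, which is the form of input the later decoding steps work on.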
Automatic speech recognition offers various benefits, including:
i) Accessibility for the deaf and hard of hearing
ii) Cost reduction through automation
iii) Searchable text capability