Accurate, real-time Automatic Speech Recognition (ASR) requires large memory storage and significant computational power. The main bottleneck in state-of-the-art ASR systems is the Viterbi search on a Weighted Finite State Transducer (WFST). The WFST is a graph-based model created offline by composing an Acoustic Model (AM) and a Language Model (LM). Offline composition simplifies the implementation of a speech recognizer, as only one WFST has to be searched. However, the composed WFST is huge, typically larger than a gigabyte, resulting in a large memory footprint and high memory bandwidth requirements.
In this paper, we take a completely different approach and propose a hardware accelerator for speech recognition that composes the AM and LM graphs on the fly. In our ASR system, the fully-composed WFST is never generated in main memory. Instead, only the subset required for decoding each input speech fragment is dynamically generated from the AM and LM models. Beyond the direct benefits of on-the-fly composition, the resulting approach is also more amenable to further reductions in storage requirements through compression techniques.
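To make the idea of on-the-fly composition concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): composed states are pairs (AM state, LM state), and their outgoing arcs are generated lazily on demand rather than stored in a precomposed graph. The data layout (`dict` of state to arc lists) and epsilon handling are illustrative assumptions.

```python
# Sketch of on-the-fly WFST composition (illustrative data layout, not the
# UNFOLD hardware design). A composed state is the pair (am_state, lm_state);
# its arcs are produced lazily, so the full composed graph never exists.

def compose_arcs(am, lm, state):
    """Lazily yield arcs leaving the composed state (am_state, lm_state).

    `am` and `lm` map a state id to a list of (input, output, weight, next)
    arcs. An AM arc's output symbol must match an LM arc's input symbol;
    None denotes epsilon (full epsilon handling omitted for brevity).
    """
    am_state, lm_state = state
    for a_in, a_out, a_w, a_next in am.get(am_state, []):
        if a_out is None:  # epsilon output: the LM side stays put
            yield a_in, None, a_w, (a_next, lm_state)
            continue
        for l_in, l_out, l_w, l_next in lm.get(lm_state, []):
            if l_in == a_out:  # symbols match: combine weights (log domain)
                yield a_in, l_out, a_w + l_w, (a_next, l_next)

# Toy models: the AM maps phone "f" to word symbol "w"; the LM consumes "w".
am = {0: [("f", "w", 0.5, 1)]}
lm = {0: [("w", "w", 1.0, 1)]}
print(list(compose_arcs(am, lm, (0, 0))))  # [('f', 'w', 1.5, (1, 1))]
```

During decoding, only the composed states actually reached by the search are expanded this way, which is why the memory footprint is bounded by the two input models rather than by their product.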
The resulting accelerator, called UNFOLD, performs the decoding in real time using the compressed AM and LM models, reducing the size of the datasets from more than one gigabyte to less than 40 megabytes, which is especially important for small-form-factor mobile and wearable devices.
In addition, UNFOLD improves energy efficiency by orders of magnitude with respect to CPUs and GPUs. Compared to a state-of-the-art Viterbi search accelerator, the proposed ASR system provides a 31x reduction in memory footprint and 28% energy savings on average.
Automatic speech recognition (ASR) has become a core technology for mobile devices. Delivering real-time, accurate ASR has a huge computational cost, which is challenging to meet on tightly energy-constrained platforms such as mobile devices. A state-of-the-art ASR pipeline consists of a deep neural network (DNN) that converts the audio signal into phoneme probabilities, followed by a Viterbi search that uses these probabilities to generate a sequence of words. In this article, the authors propose an ASR system for low-power devices that combines a mobile GPU for the DNN with a dedicated hardware accelerator for the Viterbi search. DNN evaluation is easy to parallelize and hence achieves high energy efficiency on a mobile GPU. The Viterbi search, on the other hand, is difficult to parallelize and represents the main bottleneck for ASR, so the authors propose a hardware accelerator to dramatically reduce its energy requirements while increasing performance. Their proposal outperforms traditional solutions running on the CPU by orders of magnitude. Compared to a GPU-only system, their hybrid scheme combining the GPU and the accelerator improves performance by 5.25 times, while reducing energy by 2.05 times.
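The Viterbi search stage described above can be sketched in a few lines. This is a hedged, software-level illustration, not the hardware accelerator: it consumes per-frame phoneme log-probabilities (the DNN's output) and walks a small hypothetical arc graph with beam pruning. The graph layout, function name, and beam width are all assumptions made for the example.

```python
# Illustrative Viterbi beam search over per-frame phoneme log-probabilities.
# `frames` is the DNN output: one {phoneme: log_prob} dict per audio frame.
# `arcs` is a toy search graph: state -> [(phoneme, word_or_None, weight, next)].
import math

def viterbi_search(frames, arcs, start=0, beam=10.0):
    """Return (best log score, emitted word sequence)."""
    hyps = {start: (0.0, [])}  # state -> (log score, words emitted so far)
    for probs in frames:
        nxt = {}
        for state, (score, words) in hyps.items():
            for phon, word, w, dst in arcs.get(state, []):
                if phon not in probs:
                    continue
                s = score + probs[phon] - w  # acoustic score minus arc cost
                if dst not in nxt or s > nxt[dst][0]:
                    nxt[dst] = (s, words + [word] if word else words)
        if not nxt:
            return -math.inf, []  # all hypotheses died
        best = max(s for s, _ in nxt.values())
        # Beam pruning: keep only hypotheses close to the frame's best score.
        hyps = {st: v for st, v in nxt.items() if v[0] >= best - beam}
    return max(hyps.values(), key=lambda v: v[0])

frames = [{"ah": -0.1}, {"t": -0.2}]
arcs = {0: [("ah", None, 0.0, 1)], 1: [("t", "at", 0.0, 2)]}
print(viterbi_search(frames, arcs))  # (-0.3, ['at'])
```

The irregular, data-dependent memory accesses visible even in this toy version (hypotheses fan out across arbitrary graph states each frame) are what make the real Viterbi search hard to parallelize on a GPU and a natural target for a dedicated accelerator.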
Yazdani, R.; Segura, A.; Arnau, J.; Gonzalez, A. Annual IEEE/ACM International Symposium on Microarchitecture, pp. 580-591. DOI: 10.1109/MICRO.2016.7783750. Presentation date: 2016-10-17. Conference paper presentation.
Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost that is not affordable within the tiny power budget of mobile devices. Hardware acceleration can reduce the power consumption of ASR systems while delivering high performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, which represents the main bottleneck in an ASR system. The proposed design includes innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in these accelerators. We propose a prefetching scheme tailored to the needs of an ASR system that hides main memory latency for a large fraction of memory accesses with a negligible impact on area. In addition, we introduce a novel bandwidth-saving technique that removes 20% of the off-chip memory accesses issued during the Viterbi search. The proposed design outperforms software implementations running on the CPU by orders of magnitude and achieves a 1.7x speedup over a highly optimized CUDA implementation running on a high-end GeForce GTX 980 GPU, while reducing by two orders of magnitude (287x) the energy required to convert speech into text.