Yang, N.; Ba, H.; Cai, W.; Demirkol, I.; Heinzelman, W. IEEE transactions on audio speech and language processing Vol. 22, no. 12, p. 1833-1848 DOI: 10.1109/TASLP.2014.2352453 Publication date: 2014-08-27 Journal article
Fundamental frequency (F0) is one of the essential features in many acoustic-related applications. Although numerous F0 detection algorithms have been developed, the detection accuracy in noisy environments still needs improvement. We present a hybrid noise-resilient F0 detection algorithm named BaNa that combines the approaches of harmonic ratios and Cepstrum analysis. A Viterbi algorithm with a cost function is used to identify the F0 value among several F0 candidates. Speech and music databases with eight different types of additive noise are used to evaluate the performance of the BaNa algorithm and several classic and state-of-the-art F0 detection algorithms. Results show that for almost all types of noise and signal-to-noise ratio (SNR) values investigated, BaNa achieves the lowest Gross Pitch Error (GPE) rate among all the algorithms. Moreover, for the 0 dB SNR scenarios, the BaNa algorithm is shown to achieve a GPE rate of 20% to 35% for speech and 12% to 39% for music. We also describe implementation issues that must be addressed to run the BaNa algorithm as a real-time application on a smartphone platform.
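The abstract above reports results as a Gross Pitch Error (GPE) rate without defining it. As a rough illustration, GPE is commonly computed as the fraction of voiced frames whose estimated F0 deviates from the reference by more than a relative tolerance. The sketch below assumes a 10% tolerance, a convention in which a reference value of 0 marks an unvoiced frame, and a hypothetical function name; none of these details are stated in the abstract.

```python
def gross_pitch_error_rate(f0_true, f0_est, tolerance=0.1):
    """Fraction of voiced frames whose estimate is off by more than `tolerance`.

    Assumptions (not from the abstract): a reference F0 of 0 marks an
    unvoiced frame, and the tolerance is relative to the reference value.
    """
    if len(f0_true) != len(f0_est):
        raise ValueError("reference and estimate must have the same length")

    # Only frames that are voiced in the reference are scored.
    voiced = [(t, e) for t, e in zip(f0_true, f0_est) if t > 0]
    if not voiced:
        raise ValueError("no voiced frames in the reference")

    # A frame counts as a gross error when the deviation exceeds the tolerance.
    errors = sum(1 for t, e in voiced if abs(e - t) > tolerance * t)
    return errors / len(voiced)


if __name__ == "__main__":
    reference = [100.0, 200.0, 0.0, 150.0]   # Hz, 0 = unvoiced
    estimate = [105.0, 260.0, 120.0, 150.0]  # Hz
    print(gross_pitch_error_rate(reference, estimate))
```

In this toy input, only the second frame (260 Hz against a 200 Hz reference) exceeds the tolerance, so one of the three voiced frames counts as a gross error.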
Zelenak, M.; Segura, C.; Luque, J.; Hernando, J. IEEE transactions on audio speech and language processing Vol. 20, no. 2, p. 436-446 DOI: 10.1109/TASL.2011.2160167 Publication date: 2012-02 Journal article
Simultaneous speech poses a challenging problem for conventional speaker diarization systems. In meeting data, a substantial amount of missed speech error is due to speaker overlaps, since usually only one speaker label per segment is assigned. Furthermore, simultaneous speech included in training data can lead to corrupt speaker models and thus worse segmentation performance. In this paper, we propose the use of three spatial cross-correlation-based features together with spectral information for speaker overlap detection on distant microphones. Different microphone-pair data are fused by means of principal component analysis. We have obtained an improvement of the speaker diarization system over the baseline by discarding overlap segments from model training and assigning two speaker labels to them according to likelihoods in Viterbi decoding. In experiments conducted on the AMI Meeting corpus, we achieve a relative DER reduction of 11.2% and 17.0% for single- and multi-site data, respectively. The improvement of clustering with techniques such as beamforming and TDOA-feature stream also leads to a higher effectiveness of the overlap labeling algorithm. Preliminary experiments with NIST RT data show DER improvement on the RT'09 meeting recordings as well.
Matusov, E.; Leusch, G.; Federico, M.; Mariño, J.B.; Ney, H.; Bertoldi, N. IEEE transactions on audio speech and language processing Vol. 16, no. 7, p. 1222-1237 DOI: 10.1109/TASL.2008.914970 Publication date: 2008-09-01 Journal article
This paper describes a recently developed method for computing a consensus translation from the outputs of multiple machine translation (MT) systems. A possibly new translation hypothesis can be produced as a result of this system combination algorithm. The consensus translation is computed by creating a confusion network and performing weighted majority voting, similarly to the well-established ROVER approach of (Fiscus 1997) for combining speech recognition hypotheses. To create the confusion network, pairwise word alignments of the original machine translation hypotheses are learned by using an enhanced statistical alignment algorithm that explicitly models word reordering. This is the first known application of this algorithm in the context of system combination. The context of a whole document of translations rather than a single sentence is taken into account to improve the alignment quality.
The proposed alignment and voting approach was evaluated on several machine translation tasks, including a large vocabulary task. The method was also tested in the framework of multi-source and speech translation. Significant improvements in translation quality were achieved on all tasks. Here, we report experimental results for combining MT systems participating in the TC-STAR (speech translation) Project.
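The voting step described in this abstract can be illustrated with a minimal sketch. It assumes the hypotheses have already been aligned into slots of equal length, with an empty string standing for "no word" in a slot; the statistical alignment with reordering that the paper relies on is not reproduced here. The function name, the system weights, and the example sentences are hypothetical.

```python
from collections import defaultdict

EPSILON = ""  # assumed marker for "no word" in a confusion-network slot


def consensus_translation(aligned_hypotheses, system_weights):
    """Pick the highest-weighted word in each slot of a confusion network.

    `aligned_hypotheses` holds one word list per MT system, all of the same
    length (alignment is assumed to be done already). `system_weights`
    holds one weight per system.
    """
    if len(aligned_hypotheses) != len(system_weights):
        raise ValueError("need exactly one weight per hypothesis")
    length = len(aligned_hypotheses[0])
    if any(len(h) != length for h in aligned_hypotheses):
        raise ValueError("aligned hypotheses must have equal length")

    output = []
    # zip(*...) walks the network slot by slot across all systems.
    for slot in zip(*aligned_hypotheses):
        votes = defaultdict(float)
        for word, weight in zip(slot, system_weights):
            votes[word] += weight
        best = max(votes, key=votes.get)
        if best != EPSILON:  # an epsilon winner means "emit nothing"
            output.append(best)
    return output


if __name__ == "__main__":
    aligned = [
        ["the", "cat", "sat", ""],
        ["the", "cat", "sits", "down"],
        ["a", "cat", "sat", "down"],
    ]
    weights = [0.4, 0.35, 0.25]
    print(" ".join(consensus_translation(aligned, weights)))
```

With these toy inputs the result is "the cat sat down", which matches none of the three input hypotheses exactly. This is how such a combination can produce a new translation hypothesis.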