Scientific and technological production

1 to 50 of 292 results
  • Channel selection measures for multi-microphone speech recognition

     Nadeu Camprubí, Climent; Wolf, Martin
    Speech communication
    Date of publication: 2014-02-01
    Journal article

    Automatic speech recognition in a room with distant microphones is strongly affected by noise and reverberation. In scenarios where the speech signal is captured by several arbitrarily located microphones the degree of distortion differs from one channel to another. In this work we deal with measures extracted from a given distorted signal that either estimate its quality or measure how well it fits the acoustic models of the recognition system. We then apply them to solve the problem of selecting the signal (i.e. the channel) that presumably leads to the lowest recognition error rate. New channel selection techniques are presented, and compared experimentally in reverberant environments with other approaches reported in the literature. Significant improvements in recognition rate are observed for most of the measures. A new measure based on the variance of the speech intensity envelope shows a good trade-off between recognition accuracy, latency and computational cost. Also, the combination of measures allows a further improvement in recognition rate.
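
The intensity-envelope-variance measure mentioned in the abstract lends itself to a compact sketch. The following Python fragment is an illustrative approximation, not the paper's exact formulation: it scores each channel by the variance of its log intensity envelope (reverberant tails smear the envelope and lower that variance) and keeps the highest-scoring channel. Function names, frame sizes and the log floor are our own choices.

```python
import numpy as np

def envelope_variance(signal, sr, frame_ms=25, hop_ms=10):
    """Variance of the log short-time intensity envelope of one channel.

    A rough proxy for distortion: reverberation fills the low-energy
    gaps of speech, flattening the envelope and reducing its variance.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    env = np.array([
        np.mean(signal[i:i + frame] ** 2)           # short-time intensity
        for i in range(0, len(signal) - frame, hop)
    ])
    return np.var(np.log(env + 1e-10))              # floor avoids log(0)

def select_channel(channels, sr):
    """Return the index of the channel with the highest envelope variance."""
    scores = [envelope_variance(ch, sr) for ch in channels]
    return int(np.argmax(scores))
```

On synthetic data, convolving a gated noise signal with an exponentially decaying impulse response (a crude reverberation model) lowers the measure, so the dry channel is selected.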

  • Joint recognition and direction-of-arrival estimation of simultaneous meeting-room acoustic events

     Chakraborty, Rupayan; Nadeu Camprubí, Climent
    Annual Conference of the International Speech Communication Association
    Presentation's date: 2013-08-28
    Presentation of work at congresses

    Acoustic scene analysis usually requires several sub-systems working in parallel to carry out the various required functionalities. Moving toward a more integrated approach, in this paper we present an attempt to jointly recognize and localize several simultaneous acoustic events that take place in a meeting-room environment, by developing a computationally efficient technique that employs multiple arbitrarily-located small microphone arrays. Assuming a set of simultaneous sounds, a matrix is computed for each array whose elements are likelihoods over the set of classes and a set of discretized directions of arrival. MAP estimation is used to decide both the recognized events and the estimated directions. Experimental results with two sources, one of which is speech, and two three-microphone linear arrays are reported. The recognition results compare favorably with those obtained by assuming that the positions are known.
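
As a toy illustration of the decision stage described above (our own simplification, not the authors' code): per-array likelihood matrices over (class, direction-of-arrival) cells are combined by a product rule, and an exhaustive search picks the MAP pair of hypotheses with distinct classes and distinct directions, matching the two-source setting of the experiments.

```python
import numpy as np

def map_decide(likelihoods, priors=None):
    """Jointly pick two (event class, DOA) hypotheses by MAP.

    `likelihoods` is a list of per-array matrices of shape
    (n_classes, n_doas); they are combined here with a simple
    product rule (an independence assumption of this sketch).
    """
    joint = np.ones_like(likelihoods[0])
    for L in likelihoods:
        joint = joint * L
    if priors is not None:
        joint = joint * priors
    n_c, n_d = joint.shape
    best, best_score = None, -np.inf
    # Exhaustive search over pairs with distinct classes and directions
    for c1 in range(n_c):
        for d1 in range(n_d):
            for c2 in range(n_c):
                for d2 in range(n_d):
                    if c1 == c2 or d1 == d2:
                        continue
                    s = joint[c1, d1] * joint[c2, d2]
                    if s > best_score:
                        best_score, best = s, ((c1, d1), (c2, d2))
    return best
```

The exhaustive search is quadratic in the number of (class, DOA) cells, which is affordable for the small hypothesis spaces of a meeting-room scenario.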

  • Channel selection using N-best hypothesis for multi-microphone ASR

     Wolf, Martin; Nadeu Camprubí, Climent
    Annual Conference of the International Speech Communication Association
    Presentation's date: 2013-08-28
    Presentation of work at congresses

    If speech is captured by several arbitrarily-located microphones in a room, the degree of distortion by noise and reverberation may vary strongly from one channel to another. Channel selection for automatic speech recognition aims to rank the signals according to their quality and, in particular, to select the best one for further processing in the recognition system. To create this ranking, we propose here to use posterior probabilities estimated from the N-best hypotheses of each channel. When evaluated experimentally, this new channel selection technique outperforms the methods published so far. We also propose combining different channel selection techniques to further increase the recognition accuracy and to reduce the computational load without significant performance loss.
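
A minimal sketch of the idea, under the assumption (ours, not stated in the abstract) that each channel provides its N-best log scores sorted best-first: the softmax-normalized score of the top hypothesis serves as a confidence value, and the channel whose top hypothesis is most confident is selected.

```python
from math import exp

def channel_posterior(nbest_logscores):
    """Softmax posterior of the top hypothesis over a channel's N-best
    list of log scores (assumed sorted best-first). A peaked list
    (clear winner) yields a value near 1; a flat list, a lower value."""
    if not nbest_logscores:
        return 0.0
    m = max(nbest_logscores)                      # for numerical stability
    exps = [exp(s - m) for s in nbest_logscores]
    return exps[0] / sum(exps)

def pick_channel(per_channel_nbest):
    """Rank channels by top-hypothesis confidence and return the index
    of the presumably least distorted one."""
    posts = [channel_posterior(nb) for nb in per_channel_nbest]
    return posts.index(max(posts))
```

A channel with widely separated N-best scores (the recognizer is sure of its transcription) beats a channel whose hypotheses score almost equally.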

  • Real-time multi-microphone recognition of simultaneous sounds in a room environment

     Chakraborty, Rupayan; Nadeu Camprubí, Climent
    IEEE International Conference on Acoustics, Speech, and Signal Processing
    Presentation's date: 2013-05-29
    Presentation of work at congresses

    Time overlapping of acoustic signals, which so often occurs in real life, is a challenge for current state-of-the-art sound recognition systems. In this work, we propose an approach for detecting, identifying and positioning a set of simultaneous acoustic events in a room environment, using multiple arbitrarily-located microphone arrays and working in real time. Assuming a set of estimated acoustic source positions, the use of a frequency-invariant null-steering beamformer for each position and each array yields a set of signals which show different balances among the various acoustic sources. For each signal, a model-based likelihood computation is carried out to obtain a matrix of likelihood scores. Then a MAP criterion is used to jointly detect the event classes and assign each of them to a given source position. Experimental results with two sources, one of which is speech, and two three-microphone linear arrays are reported, and a comparison with alternative approaches is carried out.

  • Acoustic Event Detection and Localization using Distributed Microphone Arrays

     Chakraborty, Rupayan
    Defense's date: 2013-12-18
    Department of Signal Theory and Communications, Universitat Politècnica de Catalunya
    Theses

    The automatic analysis of acoustic scenes is a complex task that requires several functionalities: detection (time), localization (space), separation, recognition, etc. This thesis focuses on both acoustic event detection (AED) and acoustic source localization (ASL), in the case where several acoustic sources may coexist simultaneously in a room. Specifically, the experimental work is carried out in a meeting-room scenario. The core of the thesis is a computationally efficient approach based on three processing stages. In the first, a set of beamformers is used to carry out several partial signal separations, using multiple arbitrarily-placed linear microphone arrays, each composed of a small number of microphones. In the second stage, each beamformer output goes through a classifier, which has models of all the considered classes. Then, in the third stage, the classifier scores, either intra-array or inter-array, are combined with a probabilistic criterion (such as MAP) or with a machine-learning fusion technique (the fuzzy integral (FI), in the experiments). This processing scheme is applied in the thesis to a set of problems of increasing complexity, defined by the assumptions made about the identities (plus the start and end times) and/or the positions of the sounds.
    Indeed, the thesis report begins with the problem of unambiguously assigning identities to positions, continues with AED (assuming the positions) and ASL (assuming the identities), and ends with the integration of AED and ASL into a single system that needs no assumption about either identities or positions. The experiments take place in a meeting-room scenario with two temporally overlapped sources; one is always speech and the other is an acoustic event from a predefined set. Two different databases are used: one was produced by mixing signals actually recorded in the UPC smart room, and the other consists of overlapped sounds recorded directly in the room in a rather spontaneous way. The experimental results with a single array show that the proposed detection system performs better than either a model-based system or a system based on blind source separation. Moreover, both the product-rule combination and the FI-based fusion of the scores obtained from the multiple arrays further improve the accuracy. In addition, the subsequent assignment of positions is carried out with a very low error rate. Regarding ASL, and assuming an AED system output, the localization performance of the proposed system for a single source is slightly better than that of the SRP-PHAT system working in event mode, and is significantly better than the latter in the more complex two-source scenario. Finally, although the joint system shows a slight degradation in classification accuracy with respect to the case where the source positions are known, it has the advantage of carrying out both tasks, recognition and localization, with a single system, and it allows the inclusion of information about the prior probabilities of the source positions.
    It should also be noted that, although the acoustic scenario used in the experiments is rather limited, the approach and its formalism have been developed for a general case, with no restrictions on the number and identities of the sources.

  • Channel Selection and Reverberation-Robust Automatic Speech Recognition

     Wolf, Martin
    Defense's date: 2013-11-11
    Department of Signal Theory and Communications, Universitat Politècnica de Catalunya
    Theses

  • Les tecnologies de la parla: lloc de trobada, difícil però necessària, entre lingüística i tecnologia  Open access

     Nadeu Camprubí, Climent
    Date of publication: 2012
    Book chapter

    Research and development of applications in speech technologies have largely set linguistic knowledge aside. The reasons are diverse, as we will see, but the difficulty of joint work between linguists and technologists does not mean that such work is unnecessary for advancing the goals of the technology itself.

  • Pairwise likelihood normalization-based channel selection for multi-microphone ASR

     Wolf, Martin; Nadeu Camprubí, Climent
    IberSPEECH
    Presentation's date: 2012-11-23
    Presentation of work at congresses

  • Binary position assignment of two known simultaneous acoustic sources

     Chakraborty, Rupayan; Nadeu Camprubí, Climent; Butko, Taras
    IberSPEECH
    Presentation's date: 2012-11-21
    Presentation of work at congresses

  • Detection and positioning of overlapped sounds in a room environment

     Chakraborty, Rupayan; Nadeu Camprubí, Climent; Butko, Taras
    Annual Conference of the International Speech Communication Association
    Presentation's date: 2012-09-13
    Presentation of work at congresses

  • Voice Source Characterization for Prosodic and Spectral Manipulation  Open access

     Pérez Mayos, Javier
    Defense's date: 2012-07-03
    Department of Signal Theory and Communications, Universitat Politècnica de Catalunya
    Theses

    The objective of this dissertation is to study and develop techniques to decompose the speech signal into its two main components: voice source and vocal tract. Our main efforts are on the glottal pulse analysis and characterization. We want to explore the utility of this model in different areas of speech processing: speech synthesis, voice conversion or emotion detection, among others. Thus, we will study different techniques for prosodic and spectral manipulation. One of our requirements is that the methods should be robust enough to work with the large databases typical of speech synthesis. We use a speech production model in which the glottal flow produced by the vibrating vocal folds goes through the vocal (and nasal) tract cavities and is radiated by the lips. Removing the effect of the vocal tract from the speech signal to obtain the glottal pulse is known as inverse filtering. We use a parametric model of the glottal pulse directly in the source-filter decomposition phase. In order to validate the accuracy of the parametrization algorithm, we designed a synthetic corpus using LF glottal parameters reported in the literature, complemented with our own results from the vowel database. The results show that our method gives satisfactory results in a wide range of glottal configurations and at different levels of SNR. Our method using the whitened residual compared favorably to this reference, achieving high quality ratings (Good-Excellent). Our fully parametrized system scored lower than the other two, ranking in third place, but still higher than the acceptance threshold (Fair-Good). Next we proposed two methods for prosody modification, one for each of the residual representations explained above. The first method used our full parametrization system and frame interpolation to perform the desired changes in pitch and duration.
    The second method used resampling on the residual waveform and a frame selection technique to generate a new sequence of frames to be synthesized. The results showed that both methods are rated similarly (Fair-Good) and that more work is needed in order to achieve quality levels similar to those of the reference methods. As part of this dissertation, we have studied the application of our models in three different areas: voice conversion, voice quality analysis and emotion recognition. We have included our speech production model in a reference voice conversion system to evaluate the impact of our parametrization on this task. The results showed that the evaluators preferred our method over the original one, rating it with a higher score on the MOS scale. To study voice quality, we recorded a small database consisting of isolated, sustained Spanish vowels in four different phonations (modal, rough, creaky and falsetto). Comparing the results with those reported in the literature, we found them to generally agree with previous findings. Some differences existed, but they could be attributed to the difficulties in comparing voice qualities produced by different speakers. At the same time, we conducted experiments in the field of voice quality identification, with very good results. We have also evaluated the performance of an automatic emotion classifier based on GMMs using glottal measures. For each emotion, we trained a specific model using different features, comparing our parametrization to a baseline system using spectral and prosodic characteristics. The results of the test were very satisfactory, showing a relative error reduction of more than 20% with respect to the baseline system. The accuracy of detection of the different emotions was also high, improving on the results of previously reported works using the same database.
    Overall, we can conclude that the glottal source parameters extracted using our algorithm have a positive impact in the field of automatic emotion classification.

  • On building and evaluating a broadcast-news audio segmentation system

     Butko, Taras; Nadeu Camprubí, Climent
    Annual Conference of the International Speech Communication Association
    Presentation of work at congresses

  • Audio segmentation of broadcast news: a hierarchical system with feature selection for the Albayzin-2010 evaluation

     Butko, Taras; Nadeu Camprubí, Climent
    IEEE International Conference on Acoustics, Speech and Signal Processing
    Presentation's date: 2011-05
    Presentation of work at congresses

    In this paper, we present an audio segmentation system for broadcast news, and its results in the Albayzin-2010 evaluation. First of all, the Albayzin-2010 evaluation setup, developed by the authors, is presented; in particular, the database and the metric are described. The reported hierarchical HMM-GMM-based system is composed of one binary detector for each of the five considered classes (music, speech, speech over music, speech over noise and other). A fast one-pass-training feature selection technique is adapted to the audio segmentation task to improve the results and to reduce the dimensionality of the input feature vector.

  • Two-source acoustic event detection and localization: online implementation in a smart-room  Open access

     Butko, Taras; Gonzalez Pla, Fran; Segura Perales, Carlos; Nadeu Camprubí, Climent; Hernando Pericas, Francisco Javier
    European Signal Processing Conference
    Presentation's date: 2011-08-30
    Presentation of work at congresses

    Real-time processing is a requirement for many practical signal processing applications. In this work we implemented online two-source acoustic event detection and localization algorithms in a smart room, a closed space equipped with multiple microphones. Acoustic event detection is based on HMMs, which make it possible to process the input audio signal with very low latency; acoustic source localization is based on the SRP-PHAT localization method, which is known to perform robustly in most scenarios. The experimental results from online tests show high recognition accuracy for most acoustic events, both isolated and overlapped with speech.
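
The SRP-PHAT method mentioned above builds on the PHAT-weighted generalized cross-correlation (GCC-PHAT) between microphone pairs. A compact sketch of that building block (our own illustration, not the project code) estimates the relative delay between two channels:

```python
import numpy as np

def gcc_phat_delay(x, y):
    """Estimate the delay, in samples, of signal y relative to signal x
    using PHAT-weighted generalized cross-correlation (GCC-PHAT)."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12       # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    max_shift = len(x)
    # Reorder circular correlation so the lag axis is centered at zero
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return max_shift - int(np.argmax(np.abs(cc)))
```

SRP-PHAT itself sums such correlations over all microphone pairs for each candidate source position and picks the position with the highest steered response power; the fragment above is only the pairwise core.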

  • A multilingual corpus for rich audio-visual scene description in a meeting-room environment

     Butko, Taras; Nadeu Camprubí, Climent; Moreno Bilbao, M. Asuncion
    ICMI Workshop on Multimodal Corpora For Machine Learning
    Presentation's date: 2011-11-18
    Presentation of work at congresses

    In this paper, we present a multilingual database specifically designed to develop technologies for rich audio-visual scene description in meeting-room environments. Part of that database includes the already existing CHIL audio-visual recordings, whose annotations have been extended. A relevant objective in the newly recorded sessions was to include situations in which the semantic content cannot be extracted from a single modality. The presented database, which includes five hours of rather spontaneously generated scientific presentations, was manually annotated using standard or previously reported annotation schemes, and will be publicly available for research purposes.

  • Extension of the remos concept to frequency-filtering-based features for reverberation-robust speech recognition

     Maas, Roland; Wolf, Martin; Sehr, Armin; Nadeu Camprubí, Climent; Kellermann, Walter
    Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA)
    Presentation's date: 2011-06
    Presentation of work at congresses

    The introduction of partly decorrelated features into the REMOS (REverberation MOdeling for Speech recognition) concept for distant-talking speech recognition [1] is discussed. REMOS combines a hidden Markov model (HMM), trained on clean speech, with a reverberation model capturing certain room characteristics. The most likely contributions of both models to a reverberant observation are determined by an inner optimization problem. In HMM frameworks, decorrelated features are assumed when diagonal covariance matrices are used in the output densities. However, in REMOS, only highly correlated logmelspec (logarithmic mel-spectral) features have been used so far, which has limited the recognition performance. In this work, we extend the REMOS concept and introduce a new set of partly decorrelated features derived from frequency filtering [2]. Recognition experiments with connected digits show a consistent relative reduction in word error rate of up to 29% compared to the former logmelspec implementation.

  • Access to the full text
    Acoustic event detection based on feature-level fusion of audio and video modalities  Open access

     Butko, Taras; Canton Ferrer, Cristian; Segura, Carlos; Giro Nieto, Xavier; Nadeu Camprubí, Climent; Hernando Pericas, Francisco Javier; Casas Pla, Josep Ramon
    Eurasip journal on advances in signal processing
    Date of publication: 2011-03-15
    Journal article

    Acoustic event detection (AED) aims at determining the identity of sounds and their temporal position in audio signals. When applied to spontaneously generated acoustic events, AED based only on audio information shows a large number of errors, which are mostly due to temporal overlaps. Indeed, temporal overlaps accounted for more than 70% of the errors in the real-world interactive seminar recordings used in the CLEAR 2007 evaluations. In this paper, we improve the recognition rate of acoustic events using information from both audio and video modalities. First, the acoustic data are processed to obtain both a set of spectro-temporal features and the 3D localization coordinates of the sound source. Second, a number of features are extracted from the video recordings by means of object detection, motion analysis, and multi-camera person tracking to represent the visual counterpart of several acoustic events. A feature-level fusion strategy is used, and a parallel structure of binary HMM-based detectors is employed in our work. The experimental results show that information from both the microphone array and the video cameras is useful to improve the detection rate of isolated as well as spontaneously generated acoustic events.
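
Feature-level fusion reduces, in its simplest form, to per-frame concatenation of the modality streams. The sketch below is illustrative only: stream names, dimensions and the per-stream z-normalization are our assumptions, not the paper's exact pipeline.

```python
import numpy as np

def znorm(feats):
    """Per-dimension z-normalization of one feature stream (frames x dims),
    so that no stream dominates the fused vector by scale alone."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-10)

def fuse(audio, localization, video):
    """Feature-level fusion: normalize each stream and concatenate the
    per-frame vectors into a single observation for an HMM-based detector."""
    assert len(audio) == len(localization) == len(video)
    return np.hstack([znorm(audio), znorm(localization), znorm(video)])
```

The fused frame sequence can then be fed to one binary HMM detector per acoustic event class, as in the parallel detector structure the abstract describes.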

  • EEG signal description with spectral-envelope-based speech recognition features for detection of neonatal seizures

     Temko, Andrey A.; Nadeu Camprubí, Climent; Marnane, W.; Boylan, G.B.; Lightbody, G.
    IEEE transactions on information technology in biomedicine
    Date of publication: 2011-11-21
    Journal article

    In this paper, features which are usually employed in automatic speech recognition (ASR) are used for the detection of seizures in newborn EEG. In particular, spectral-envelope-based features, composed of spectral powers and their spectral derivatives, are compared to the established feature set which had been previously developed for EEG analysis. The results indicate that the ASR features which model the spectral derivatives, either full-band or localized in frequency, yielded a performance improvement in comparison to spectral-power-based features. Indeed, it is shown here that they perform reasonably well in comparison with the conventional EEG feature set. The contribution of the ASR features was analyzed using the support vector machine (SVM) recursive feature elimination technique. It is shown that the spectral derivative features consistently appear among the top-ranked features. The study shows that the ASR features should be given a high priority when dealing with the description of the EEG signal.

  • Speaker localization and orientation in multimodal smart environments

     Segura Perales, Carlos
    Defense's date: 2011-05-13
    Department of Signal Theory and Communications, Universitat Politècnica de Catalunya
    Theses

  • Feature selection for multimodal acoustic event detection  Open access

     Butko, Taras
    Defense's date: 2011-07-08
    Department of Signal Theory and Communications, Universitat Politècnica de Catalunya
    Theses

    The detection of the acoustic events (AEs) naturally produced in a meeting room may help to describe the human and social activity. The automatic description of interactions between humans and the environment can be useful for providing implicit assistance to the people inside the room, context-aware and content-aware information requiring a minimum of human attention or interruptions, support for high-level analysis of the underlying acoustic scene, etc. On the other hand, the recent fast growth of available audio and audiovisual content strongly demands tools for analyzing, indexing, searching and retrieving the available documents. Given an audio document, the first processing step usually is audio segmentation (AS), i.e. the partitioning of the input audio stream into acoustically homogeneous regions which are labelled according to a predefined broad set of classes like speech, music, noise, etc. Acoustic event detection (AED) is the objective of this thesis work. A variety of features coming not only from audio but also from the video modality is proposed to deal with that detection problem in meeting-room and broadcast-news domains. Two basic detection approaches are investigated in this work: joint segmentation and classification using hidden Markov models (HMMs) with Gaussian mixture densities (GMMs), and detection-by-classification using discriminative support vector machines (SVMs). For the first case, a fast one-pass-training feature selection algorithm is developed in this thesis to select, for each AE class, the subset of multimodal features that shows the best detection rate. AED in meeting-room environments aims at processing the signals collected by distant microphones and video cameras in order to obtain the temporal sequence of (possibly overlapped) AEs that have been produced in the room.
    When applied to interactive seminars with a certain degree of spontaneity, the detection of acoustic events from the audio modality alone shows a large number of errors, mostly due to the temporal overlaps of sounds. This thesis includes several novelties regarding the task of multimodal AED. Firstly, the use of video features: since in the video modality the acoustic sources do not overlap (except for occlusions), the proposed features improve AED in such rather spontaneous scenario recordings. Secondly, the inclusion of acoustic localization features, which, in combination with the usual spectro-temporal audio features, yield a further improvement in recognition rate. Thirdly, the comparison of feature-level and decision-level fusion strategies for the combination of the audio and video modalities. In the latter case, the system output scores are combined using two statistical approaches: the weighted arithmetic mean and the fuzzy integral. On the other hand, due to the scarcity of annotated multimodal data and, in particular, of data with temporal sound overlaps, a new multimodal database with a rich variety of meeting-room AEs has been recorded and manually annotated, and it has been made publicly available for research purposes.

  • Normalización estadística para fusión biométrica multimodal  Open access

     Ejarque, Pascual
    Defense's date: 2011-03-17
    Department of Signal Theory and Communications, Universitat Politècnica de Catalunya
    Theses

    Biometric recognition systems use certain human characteristics, such as the voice, facial features, fingerprint, iris or hand geometry, to identify an individual or verify their identity. Such systems have been developed individually for each of these biometric modalities, reaching remarkable performance levels. Multimodal biometric systems combine several modalities in a single recognition system. Multimodal fusion improves the results obtained with a single biometric characteristic, and makes the system more robust to noise and interference and more resistant to possible attacks. Fusion can be performed at the level of the signals acquired by the different sensors, of the parameters obtained for each modality, of the scores provided by unimodal experts, or of the decision taken by those experts. In fusion at the parameter or score level, it is necessary to homogenize the characteristics coming from the different biometric modalities prior to the fusion process. This homogenization process is called normalization and has been shown to be decisive for obtaining good recognition results in multimodal systems. In this thesis, several normalization methods that modify the statistics of parameters or scores are presented. First, normalization of the mean and variance of the unimodal scores is proposed by means of affine transformations that take into account the separate statistics of the client and impostor scores. In this context, joint mean normalization is presented, which equalizes the means of the client and impostor scores across all biometric modalities. Techniques that minimize the sum of the variances of the multimodal client and impostor scores have also been proposed.
    These techniques obtained good results in a bimodal system fusing voice-spectrum and facial-image scores, and it has been shown that a reduction of the multimodal variances can lead to better recognition results. In addition, histogram equalization, a method widely used in image processing, has been employed as a normalization technique. To that end, the histograms of the unimodal characteristics were equalized over several reference functions. First, the histogram of the scores of one of the biometric modalities was used as the reference in the equalization process. This technique proved especially effective when combined with fusion methods based on weighting the unimodal scores. In a second approach, the biometric characteristics were equalized to previously established functions, specifically a Gaussian and a double Gaussian. Gaussian equalization obtained good results as a normalization in parameter fusion systems. Double-Gaussian equalization was designed specifically for score normalization: the two Gaussians represent the lobes of client and impostor scores that can be observed in the unimodal histograms, and different variants were tested to determine the variances of those Gaussians. The statistical normalization techniques presented in this thesis were tested using different fusion strategies and techniques, both on chimeric databases and on a multimodal database. Moreover, fusion was performed at different levels: at the score level for several multimodal scenarios including voice-spectrum, prosody and face characteristics, and at the parameter, score and decision levels within the framework of the Agatha project.

  • Speech and audio recognition for ambient intelligence

     Butko, Taras; Rodríguez Fonollosa, José Adrián; Nadeu Camprubí, Climent; Vallverdu Bayes, Francisco; Salavedra Moli, Josep; Nogueiras Rodriguez, Albino; Casar Lopez, Marta; Wolf, Martin; Zelenak, Martin; Hernando Pericas, Francisco Javier
    Participation in a competitive project


  • Enhancing the European Linguistic Infrastructure

     Bonafonte Cavez, Antonio Jesus; Nadeu Camprubí, Climent; Vallverdu Bayes, Francisco; Butko, Taras; Rodríguez Fonollosa, José Adrián; Wolf, Martin; Moreno Bilbao, M. Asuncion
    Participation in a competitive project


  • Audio segmentation of broadcast news in the Albayzin-2010 evaluation: overview, results, and discussion (Open access)

     Butko, Taras; Nadeu Camprubí, Climent
    EURASIP Journal on Audio, Speech, and Music Processing
    Date of publication: 2011-06-17
    Journal article


    Recently, audio segmentation has attracted research interest because of its usefulness in several applications like audio indexing and retrieval, subtitling, monitoring of acoustic scenes, etc. Moreover, a previous audio segmentation stage may be useful to improve the robustness of speech technologies like automatic speech recognition and speaker diarization. In this article, we present the evaluation of broadcast news audio segmentation systems carried out in the context of the Albayzín-2010 evaluation campaign. That evaluation consisted of segmenting audio from the 3/24 Catalan TV channel into five acoustic classes: music, speech, speech over music, speech over noise, and other. The evaluation results showed the difficulty of this segmentation task. In this article, after presenting the database and metric, as well as the feature extraction methods and segmentation techniques used by the submitted systems, the experimental results are analyzed and compared, with the aim of gaining insight into the proposed solutions and identifying promising directions.

  • A hierarchical architecture with feature selection for audio segmentation in a broadcast news domain (Open access)

     Butko, Taras; Nadeu Camprubí, Climent
    Jornadas en Tecnología del Habla and Iberian SLTech Workshop
    Presentation's date: 2010-11-09
    Presentation of work at congresses


    This work presents a hierarchical HMM-based audio segmentation system with feature selection designed for the Albayzín 2010 Evaluations. We propose an architecture that combines the outputs of individual binary detectors, each trained with a class-dependent feature set adapted to the characteristics of its class. A fast one-pass-training wrapper-based technique was used to perform feature selection, and an improvement in average accuracy with respect to using the whole feature set is reported.
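The wrapper-based selection loop can be sketched as a greedy forward search; the one-pass-training speed-up from the paper is not reproduced here, and `score_fn` is a hypothetical stand-in for training and validating a detector on a feature subset:

```python
import numpy as np

def forward_selection(X, y, score_fn, max_feats=None):
    """Greedy wrapper-style forward feature selection: repeatedly add
    the feature that most improves the detector's validation score."""
    n_feats = X.shape[1]
    max_feats = max_feats or n_feats
    selected, best = [], -np.inf
    remaining = list(range(n_feats))
    while remaining and len(selected) < max_feats:
        score, feat = max((score_fn(X[:, selected + [f]], y), f)
                          for f in remaining)
        if score <= best:
            break  # no candidate improves the current subset
        best = score
        selected.append(feat)
        remaining.remove(feat)
    return selected, best

def score_fn(Xs, y):
    """Hypothetical detector score: negative least-squares residual."""
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return -float(np.mean((y - Xs @ coef) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X[:, 2]  # only feature 2 is informative in this toy setup
selected, _ = forward_selection(X, y, score_fn)
```

With the toy data above, the informative feature is picked first, mirroring how a wrapper search builds a compact class-dependent feature set.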

  • Detection of overlapped acoustic events using fusion of audio and video modalities (Open access)

     Butko, Taras; Nadeu Camprubí, Climent
    Jornadas en Tecnología del Habla and Iberian SLTech Workshop
    Presentation's date: 2010-11-09
    Presentation of work at congresses


    Acoustic event detection (AED) may help to describe acoustic scenes, and also contribute to improving the robustness of speech technologies. Even if the number of considered events is not large, detection becomes a difficult task in scenarios where the AEs are produced rather spontaneously and often overlap in time with speech. In this work, fusion of audio and video information at either the feature or the decision level is performed, and the results are compared for different levels of signal overlap. The best improvement with respect to an audio-only baseline system was obtained using the feature-level fusion technique. Furthermore, a significant recognition rate improvement is observed when the AEs are overlapped with loud speech, mainly because the video modality remains unaffected by the interfering sound.
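The two fusion levels compared in the paper can be illustrated schematically; the feature dimensions, posterior values and weight below are made up:

```python
import numpy as np

def feature_level_fusion(audio_feat, video_feat):
    """Early fusion: concatenate per-frame audio and video feature
    vectors and classify the joint vector with a single model."""
    return np.concatenate([audio_feat, video_feat])

def decision_level_fusion(p_audio, p_video, w_audio=0.6):
    """Late fusion: combine per-class posteriors from the two unimodal
    detectors with a weighted sum, then renormalize."""
    p = w_audio * p_audio + (1.0 - w_audio) * p_video
    return p / p.sum()

audio_feat = np.array([0.2, 0.7, 0.1])   # e.g. spectral features
video_feat = np.array([0.9, 0.4])        # e.g. motion features
joint = feature_level_fusion(audio_feat, video_feat)

p_audio = np.array([0.2, 0.8])           # posteriors for 2 classes
p_video = np.array([0.6, 0.4])
p_fused = decision_level_fusion(p_audio, p_video)
```

Early fusion lets a single classifier model cross-modal correlations, while late fusion keeps the unimodal detectors independent, which is what makes the video stream useful when loud speech masks the audio evidence.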

  • Albayzin-2010 audio segmentation evaluation: evaluation setup and results (Open access)

     Butko, Taras; Nadeu Camprubí, Climent; Schulz, Henrik
    Jornadas en Tecnología del Habla and Iberian SLTech Workshop
    Presentation's date: 2010-11-09
    Presentation of work at congresses


    In this paper, we present the audio segmentation task from the Albayzín-2010 evaluation and the results obtained by the eight participants from Spanish and Portuguese universities. The evaluation task consisted of the segmentation of audio files from the Catalan 3/24 TV channel into five acoustic classes: music, speech, speech over music, speech over noise and other. The final results from all participants show that the problem of segmenting broadcast news is still challenging. We also present an analysis of the segmentation errors of the submitted systems, and describe the evaluation setup, including the database and the segmentation metric.

  • On the potential of channel selection for recognition of reverberated speech with multiple microphones (Open access)

     Wolf, Martin; Nadeu Camprubí, Climent
    Annual Conference of the International Speech Communication Association
    Presentation's date: 2010-09
    Presentation of work at congresses


    The performance of ASR systems in a room environment with distant microphones is strongly affected by reverberation. As the degree of signal distortion varies among acoustic channels (i.e. microphones), the recognition accuracy can benefit from a proper channel selection. In this paper, we experimentally show that there exists a large margin for WER reduction by channel selection, and discuss several possible methods which do not require any a priori classification. Moreover, using an LVCSR task, a significant WER reduction is shown with a simple technique which uses a measure computed from the sub-band time envelope of the various microphone signals.
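An envelope-based channel-selection measure of the kind discussed can be sketched as the variance of the log frame-energy envelope of each channel; the intuition (reverberation smooths the temporal envelope, lowering its variance) follows the paper's discussion, but the exact sub-band measure is not reproduced, and the toy signals are synthetic:

```python
import numpy as np

def envelope_variance(x, frame_len=400, hop=160):
    """Variance of the log frame-energy envelope of one channel."""
    energies = [np.sum(x[i:i + frame_len] ** 2)
                for i in range(0, len(x) - frame_len, hop)]
    return float(np.var(np.log(np.array(energies) + 1e-12)))

def select_channel(channels):
    """Pick the channel whose envelope varies most (hypothetically the
    least reverberant microphone)."""
    return int(np.argmax([envelope_variance(c) for c in channels]))

# Toy example: amplitude-modulated noise vs. the same signal smeared
# by a long synthetic reverberation tail
rng = np.random.default_rng(1)
mod = np.ones(16000)
mod[4000:8000] = mod[12000:] = 0.01      # near-silent gaps
clean = rng.normal(size=16000) * mod
tail = np.exp(-np.arange(8000) / 2000.0)  # decaying impulse response
reverberant = np.convolve(clean, tail)[:16000]
```

Because the reverberation tail fills the near-silent gaps, the reverberant channel's log-energy envelope swings less and scores a lower variance than the clean one.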

  • A fast one-pass-training feature selection technique for GMM-based acoustic event detection with audio-visual data (Open access)

     Butko, Taras; Nadeu Camprubí, Climent
    Annual Conference of the International Speech Communication Association
    Presentation's date: 2010-09-26
    Presentation of work at congresses


    Acoustic event detection becomes a difficult task, even for a small number of events, in scenarios where events are produced rather spontaneously and often overlap in time. In this work, we aim to improve the detection rate by means of feature selection. Using a one-against-all detection approach, a new fast one-pass-training algorithm, and an associated highly-precise metric are developed. Choosing a different subset of multimodal features for each acoustic event class, the results obtained from audiovisual data collected in the UPC multimodal room show an improvement in average detection rate with respect to using the whole set of features.
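The one-against-all detection approach with a class-specific feature subset per event can be sketched as follows; a nearest-centroid scorer stands in for the paper's GMM detectors, and the data, classes and feature subsets are hypothetical:

```python
import numpy as np

class OneAgainstAllDetector:
    """One-against-all detection where each class uses its own feature
    subset (nearest-centroid stand-in for the paper's GMM detectors)."""

    def __init__(self, feature_subsets):
        self.feature_subsets = feature_subsets  # {class: feature indices}
        self.centroids = {}

    def fit(self, X, y):
        for cls, feats in self.feature_subsets.items():
            pos = X[y == cls][:, feats].mean(axis=0)  # "this class"
            neg = X[y != cls][:, feats].mean(axis=0)  # "all the rest"
            self.centroids[cls] = (pos, neg)

    def predict(self, x):
        # each binary detector scores its class vs. the rest on its own
        # features; the highest-scoring class wins
        scores = {}
        for cls, feats in self.feature_subsets.items():
            pos, neg = self.centroids[cls]
            xf = np.asarray(x)[feats]
            scores[cls] = (np.linalg.norm(xf - neg)
                           - np.linalg.norm(xf - pos))
        return max(scores, key=scores.get)

# Hypothetical data: class 0 shows up in feature 0, class 1 in feature 1
rng = np.random.default_rng(2)
X = rng.normal(scale=0.1, size=(40, 3))
y = np.array([0] * 20 + [1] * 20)
X[:20, 0] += 5.0
X[20:, 1] += 5.0

det = OneAgainstAllDetector({0: [0], 1: [1]})
det.fit(X, y)
```

Each detector only sees the features selected for its class, which is the structural point of the paper's per-class multimodal feature subsets.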

  • Spanish-German integrated action - DE2009-0036

     Nadeu Camprubí, Climent; Hernando Pericas, Francisco Javier; Hohmann, Volker; Butko, Taras; Wolf, Martin; Zelenak, Martin; Schulz, Henrik; Bach, Jorg-Hendrick; Moritz, Niko
    Participation in a competitive project


  • Improving Detection of Acoustic Events Using Audiovisual Data and Feature Level Fusion

     Butko, Taras; Canton Ferrer, Cristian; Segura, C.; Giro Nieto, Xavier; Nadeu Camprubí, Climent; Hernando Pericas, Francisco Javier; Casas Pla, Josep Ramon
    Annual Conference of the International Speech Communication Association
    Presentation's date: 2009-09
    Presentation of work at congresses


  • Acoustic features for speech recognition based on Gammatone filterbank and instantaneous frequency

     Yin, Hui; Nadeu Camprubí, Climent; Hohmann, Volker
    EURASIP Journal on Audio, Speech, and Music Processing
    Date of publication: 2009-11-21
    Journal article


  • Acoustic event detection and classification

     Temko, Andrey A.; Nadeu Camprubí, Climent; Macho Ciena, Dusan; Malkin, R; Zieger, C; Omologo, M
    Date of publication: 2009-05-31
    Book chapter


    One of the most basic building blocks for the understanding of human actions and interactions is the accurate detection and tracking of persons in a scene. In constrained scenarios involving at most one subject, or in situations where persons can be confined to a controlled monitoring space or required to wear markers, sensors, or microphones, these tasks can be solved with relative ease. However, when accurate localization and tracking have to be performed in an unobtrusive or discreet fashion, using only distantly placed microphones and cameras, in a variety of natural and uncontrolled scenarios, the challenges posed are much greater. The problems faced by video analysis are those of poor or uneven illumination, low resolution, clutter or occlusion, unclean backgrounds, and multiple moving and uncooperative users that are not always easily distinguishable.

  • Automatic speech recognition

     Hernando Pericas, Francisco Javier; Macho Ciena, Dusan; Nadeu Camprubí, Climent
    Date of publication: 2009-05-31
    Book chapter



  • VEU: Speech Processing Group (Grup de Tractament de la Parla)

     Bonafonte Cavez, Antonio Jesus; Casar Lopez, Marta; Ruiz Costa-jussa, Marta; Nogueiras Rodriguez, Albino; Esquerra Llucià, Ignasi; Salavedra Moli, Josep; Farrús Cabecerán, Mireia; Hernando Pericas, Francisco Javier; Rodríguez Fonollosa, José Adrián; Monte Moreno, Enrique; Mariño Acebal, Jose Bernardo; Nadeu Camprubí, Climent; Moreno Bilbao, M. Asuncion; Vallverdu Bayes, Francisco
    Participation in a competitive project


  • Audiovisual Event Detection Towards Scene Understanding

     Canton Ferrer, Cristian; Butko, Taras; Segura, C.; Giro Nieto, Xavier; Nadeu Camprubí, Climent; Hernando Pericas, Francisco Javier; Casas Pla, Josep Ramon
    IEEE Conference on Computer Vision and Pattern Recognition
    Presentation's date: 2009
    Presentation of work at congresses


  • Acoustic event detection in a meeting-room environment

     Temko, Andrey A.; Nadeu Camprubí, Climent
    Pattern recognition letters
    Date of publication: 2009-10-15
    Journal article


  • Pitch- and formant-based order adaptation of the fractional Fourier transform and its application to speech recognition

     Yin, Hui; Nadeu Camprubí, Climent; Hohmann, Volker
    EURASIP Journal on Audio, Speech, and Music Processing
    Date of publication: 2009
    Journal article


  • Multispeaker localization and tracking in intelligent environments

     Segura, C; Abad, A; Hernando Pericas, Francisco Javier; Nadeu Camprubí, Climent
    Lecture notes in computer science
    Date of publication: 2008-06
    Journal article


  • Detection of Acoustic Events in Interactive Seminar Data with Temporal Overlaps

     Temko, Andrey A.; Nadeu Camprubí, Climent
    Annual Conference of the International Speech Communication Association
    Presentation's date: 2008-09
    Presentation of work at congresses


  • Fusion of Audio and Video Modalities for Detection of Acoustic Events

     Butko, Taras; Temko, Andrey A.; Nadeu Camprubí, Climent; Canton Ferrer, Cristian
    Annual Conference of the International Speech Communication Association
    Presentation's date: 2008-09
    Presentation of work at congresses


  • Speaker orientation estimation based on hybridation of GCC-PHAT and HLBR

     Segura, C.; Abad, Alberto; Hernando Pericas, Francisco Javier; Nadeu Camprubí, Climent
    Annual Conference of the International Speech Communication Association
    Presentation's date: 2008-09
    Presentation of work at congresses


  • Inclusion of video information for detection of acoustic events using the fuzzy integral

     Butko, Taras; Temko, Andrey A.; Nadeu Camprubí, Climent; Canton Ferrer, Cristian
    Machine Learning for Multimodal Interaction: 5th International Workshop
    Presentation of work at congresses


  • Fusion of audio and video modalities for detection of acoustic events

     Butko, Taras; Temko, Andrey A.; Nadeu Camprubí, Climent; Canton Ferrer, Cristian
    International Conference on Spoken Language Processing
    Presentation of work at congresses


  • Using pitch and formants for order adaptation of the fractional Fourier transform in speech signal processing

     Yin, H; Nadeu Camprubí, Climent; Hohmann, V
    Jornadas en Tecnología del Habla
    Presentation of work at congresses


  • Inclusion of video information for detection of acoustic events using the fuzzy integral (Open access)

     Canton Ferrer, Cristian; Butko, Taras; Temko, Andrey A.; Nadeu Camprubí, Climent
    Lecture notes in computer science
    Date of publication: 2008-01-01
    Journal article


    When applied to interactive seminars, the detection of acoustic events from only audio information shows a large amount of errors, which are mostly due to the temporal overlaps of sounds. Video signals may be a useful additional source of information to cope with that problem for particular events. In this work, we aim at improving the detection of steps by using two audio-based Acoustic Event Detection (AED) systems, with SVM and HMM, and a video-based AED system, which employs the output of a 3D video tracking algorithm. The fuzzy integral is used to fuse the outputs of the three detection systems. Experimental results using the CLEAR 2007 evaluation data show that video information can be successfully used to improve the results of audio-based AED.
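The fuzzy integral used to fuse the three detectors' outputs can be illustrated with a discrete Choquet integral; the source names and fuzzy-measure values below are hypothetical, not the ones used in the paper:

```python
def choquet_integral(scores, g):
    """Discrete Choquet integral of detector confidences with respect
    to a fuzzy measure g, which maps frozensets of sources to [0, 1]
    with g(all sources) = 1."""
    items = sorted(scores.items(), key=lambda kv: kv[1])  # ascending
    remaining = set(scores)
    total = prev = 0.0
    for src, val in items:
        total += (val - prev) * g[frozenset(remaining)]
        prev = val
        remaining.remove(src)
    return total

# Hypothetical confidences from three detectors for one event class
scores = {"svm": 0.1, "hmm": 0.5, "video": 0.9}
# An additive measure (which reduces the integral to a weighted mean);
# non-additive values would model interaction between detectors
g = {
    frozenset({"svm"}): 0.2, frozenset({"hmm"}): 0.3,
    frozenset({"video"}): 0.5,
    frozenset({"svm", "hmm"}): 0.5, frozenset({"svm", "video"}): 0.7,
    frozenset({"hmm", "video"}): 0.8,
    frozenset({"svm", "hmm", "video"}): 1.0,
}
fused = choquet_integral(scores, g)
```

The attraction of the fuzzy integral for this kind of fusion is that the measure g can express how much coalitions of detectors are worth together, not just individually.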

  • Acoustic event detection and classification (Open access)

     Temko, Andrey A.
    Defense's date: 2008-01-23
    Department of Signal Theory and Communications, Universitat Politècnica de Catalunya
    Theses



    The human activity that takes place in meeting rooms or classrooms is reflected in a rich variety of acoustic events, either produced by the human body or by objects handled by humans, so the determination of both the identity of sounds and their position in time may help to detect and describe that human activity. Additionally, detection of sounds other than speech may be useful to enhance the robustness of speech technologies like automatic speech recognition. Automatic detection and classification of acoustic events is the objective of this thesis work. It aims at processing the acoustic signals collected by distant microphones in meeting-room or classroom environments to convert them into symbolic descriptions corresponding to a listener's perception of the different sound events that are present in the signals and their sources. First of all, the task of acoustic event classification is faced using Support Vector Machine (SVM) classifiers, a choice motivated by the scarcity of training data. A confusion-matrix-based variable-feature-set clustering scheme is developed for the multiclass recognition problem and tested on the gathered database. With it, a higher classification rate than with the GMM-based technique is obtained, yielding a large relative average error reduction with respect to the best result from the conventional binary tree scheme. Moreover, several ways to extend SVMs to sequence processing are compared, in an attempt to avoid the drawback of SVMs when dealing with audio data, i.e. their restriction to work with fixed-length vectors, observing that the dynamic time warping kernels work well for sounds that show a temporal structure. Furthermore, concepts and tools from fuzzy theory are used to investigate, first, the importance of and degree of interaction among features, and second, ways to fuse the outputs of several classification systems. 
    The developed AEC systems are also tested by participating in several international evaluations from 2004 to 2006, and the results are reported. The second main contribution of this thesis work is the development of systems for detection of acoustic events. The detection problem is more complex since it includes both classification and determination of the time intervals where the sound takes place. Two system versions are developed and tested on the datasets of the two CLEAR international evaluation campaigns in 2006 and 2007. Two kinds of databases are used: two databases of isolated acoustic events, and a database of interactive seminars containing a significant number of acoustic events of interest. Our systems, which consist of SVM-based classification within a sliding window plus post-processing, were the only submissions not using HMMs, and each of them obtained competitive results in the corresponding evaluation. Speech activity detection was also pursued in this thesis since it is a particular, and especially important, case of acoustic event detection. An enhanced SVM training approach for the speech activity detection task is developed, mainly to cope with the problem of dataset reduction. The resulting SVM-based system is tested with several NIST Rich Transcription (RT) evaluation datasets, and it shows better scores than our GMM-based system, which ranked among the best systems in the RT06 evaluation. 
    Finally, it is worth mentioning a few side outcomes of this thesis work. As it has been carried out in the framework of the CHIL EU project, the author has been responsible for the organization of the above-mentioned international evaluations in acoustic event classification and detection, taking a leading role in the specification of acoustic event classes, databases and evaluation protocols, and, especially, in the proposal and implementation of the various metrics that have been used. Moreover, the detection systems have been implemented in the UPC's smart-room, where they work in real time for testing and demonstration purposes.
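The sliding-window-plus-post-processing detection scheme described above can be sketched as follows; `classify` is a hypothetical stand-in for the SVM classifier, and the frame values are made up:

```python
import numpy as np

def detect_events(frames, classify, win=10, hop=5):
    """Classify a sliding window of frames, then post-process by
    merging consecutive windows with the same label into
    (label, start_frame, end_frame) segments."""
    hyps = [(start, classify(frames[start:start + win]))
            for start in range(0, len(frames) - win + 1, hop)]
    segments = []
    for start, label in hyps:
        if segments and segments[-1][0] == label:
            lab, seg_start, _ = segments[-1]
            segments[-1] = (lab, seg_start, start + win)  # extend run
        else:
            segments.append((label, start, start + win))
    return segments

# Hypothetical frame energies: silence followed by an acoustic event
frames = np.concatenate([np.zeros(15), np.ones(15)])

def classify(window):
    """Stand-in for an SVM decision on one window of frames."""
    return "event" if window.mean() > 0.5 else "silence"

segments = detect_events(frames, classify)
```

Because consecutive windows overlap, adjacent segments can overlap slightly at their boundaries; a real post-processing stage would also smooth out isolated spurious labels.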

  • 4.1.1 Description of the Developed Techniques

     Bonafonte Cavez, Antonio Jesus; Hernando Pericas, Francisco Javier; Mariño Acebal, Jose Bernardo; Moreno Bilbao, M. Asuncion; Nadeu Camprubí, Climent
    Date: 2008-09
    Report


  • Evaluation of different feature extraction methods for speech recognition in car environment

     Wolf, Martin; Nadeu Camprubí, Climent
    15th International Conference on Systems, Signals and Image Processing
    Presentation of work at congresses


  • Learning engineering ethics by debate

     Nadeu Camprubí, Climent; Mariño Acebal, Jose Bernardo; Farrús Cabecerán, Mireia
    International Conference on Ethics and Human Values in Engineering
    Presentation of work at congresses
