Nogueiras, A.; Colom, J. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 448-452. DOI: 10.1109/ICASSP.2013.6637687. Presentation date: 2013-05-29. Conference presentation.
In this paper, a novel procedure for estimating the energy decay curve of reverberation in rectangular non-diffusive rooms is presented. It is based on calculating the expected sound intensity using a room characteristic factor, the specific attenuation factor, which is also introduced in the paper. Complete knowledge of the probability density function of this factor leads to an exact estimation of the energy decay curve of reverberation, even in the case of heavily irregular rooms and/or non-homogeneous walls.
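As a rough numerical illustration of the idea, and not the paper's exact formulation, the sketch below averages an exponential decay over an assumed probability density of the attenuation factor to obtain an expected decay curve in dB; the grid, the Gaussian-shaped density and the decay form are all illustrative assumptions.

```python
import numpy as np

def expected_energy_decay(alpha_pdf, alpha_grid, times):
    """Expected energy decay curve obtained by averaging exp(-alpha * t)
    over a discretised pdf of the attenuation factor alpha (assumed form)."""
    # Normalise the discretised pdf so it integrates to one on the grid.
    weights = alpha_pdf / np.trapz(alpha_pdf, alpha_grid)
    # E[exp(-alpha * t)] by numerical integration, for every time instant.
    decay = np.array([np.trapz(weights * np.exp(-alpha_grid * t), alpha_grid)
                      for t in times])
    return 10.0 * np.log10(decay)  # express the curve in dB

# Hypothetical attenuation-factor density and time axis (illustrative only).
alpha_grid = np.linspace(1.0, 10.0, 200)
alpha_pdf = np.exp(-0.5 * ((alpha_grid - 5.0) / 1.5) ** 2)
times = np.linspace(0.0, 1.0, 50)  # seconds
edc_db = expected_energy_decay(alpha_pdf, alpha_grid, times)
```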
The current main trend in paralinguistic information recognition is so-called static classification. In this kind of classification the low level descriptors are pooled together by means of statistical functionals, and all, or almost all, information about the temporal structure and evolution of speech is lost. Although this approach represents the state of the art, we believe that dynamic classification, where temporal information is kept, still deserves some attention due to its capability to handle aspects that the static approach cannot. In this paper the INTERSPEECH 2011 Speaker State Challenge is addressed using the automatic speech recognition system developed at UPC, which has already been used in a similar task: emotion recognition. Although the results fall below the baseline, we believe that they are close enough to be taken into account.
The usual approach to automatic continuous speech recognition is what can be called the acoustic-phonetic modelling approach. In this approach, voice is considered to hold two different kinds of information: acoustic and phonetic. Acoustic information is represented by some kind of feature extraction from the voice signal, and phonetic information is extracted from the vocabulary of the task by means of a lexicon or some other procedure. The main assumption in this approach is that models can be constructed that capture the correlation existing between both kinds of information.
The main limitation of acoustic-phonetic modelling in speech recognition is its poor treatment of the variability present both at the phonetic level and at the acoustic one. In this paper, we propose the use of a slightly modified framework where the usual acoustic-phonetic modelling is divided into two different layers: one closer to the voice signal, and the other closer to the phonetics of the sentence. By doing so we expect an improvement in modelling accuracy, as well as better management of acoustic and phonetic variability. Experiments carried out so far, using a very simplified version of the proposed framework, show a significant improvement in the recognition of a large vocabulary continuous speech task, and represent a promising starting point for further work.
In this paper, three different techniques for building semi-continuous HMM based speech recognisers are compared: the classical one, using Euclidean generated codebooks and independently trained acoustic models; jointly re-estimating the codebooks and models obtained with the classical method; and jointly creating codebooks and models by growing their size from one centroid to the desired number. The way this growth may be done is carefully addressed, focusing on the selection of the splitting direction and the way splitting is implemented. Results on a large vocabulary task show the efficiency of the approach, with noticeable improvements both in accuracy and CPU consumption. Moreover, this scheme enables the use of the concatenation of features, avoiding the independence assumption usually needed in semi-continuous HMM modelling, and leading to further improvements in accuracy and CPU consumption.
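The growing scheme can be pictured with an LBG-style sketch like the one below: the codebook starts from a single centroid and is enlarged by splitting one centroid at a time along a chosen direction, here the principal axis of its cluster. This is only an illustrative vector-quantisation analogue of the codebook growth described above, not the paper's joint codebook and model re-estimation.

```python
import numpy as np

def grow_codebook(features, target_size, n_kmeans_iters=10, eps=1e-3):
    """Grow a VQ codebook from one centroid up to target_size by repeatedly
    splitting a centroid and refining with k-means (illustrative sketch)."""
    codebook = [features.mean(axis=0)]
    while len(codebook) < target_size:
        cb = np.array(codebook)
        # Assign every feature vector to its nearest centroid.
        dists = ((features[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Split the most populated centroid along its principal axis,
        # one possible choice of splitting direction.
        counts = np.bincount(assign, minlength=len(codebook))
        k = counts.argmax()
        cluster = features[assign == k]
        direction = np.linalg.svd(cluster - cluster.mean(axis=0),
                                  full_matrices=False)[2][0]
        codebook[k] = cluster.mean(axis=0) + eps * direction
        codebook.append(cluster.mean(axis=0) - eps * direction)
        # Refine all centroids with a few k-means iterations.
        cb = np.array(codebook)
        for _ in range(n_kmeans_iters):
            dists = ((features[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
            assign = dists.argmin(axis=1)
            for j in range(len(cb)):
                members = features[assign == j]
                if len(members):
                    cb[j] = members.mean(axis=0)
        codebook = list(cb)
    return np.array(codebook)
```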
In this paper, multidialectal acoustic modeling based on sharing data across dialects is addressed. A comparative study of different methods of combining data based on decision tree clustering algorithms is presented. The approaches developed differ in the way the similarity of sounds between dialects is evaluated and in the decision tree structure applied. The proposed systems are tested with Spanish dialects from Spain and Latin America. All the proposed multidialectal systems improve monodialectal performance by using data from another dialect, but it is shown that the way data are shared is critical. The best combination of similarity measure and tree structure achieves an improvement of 7% over the results obtained with monodialectal systems.
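One common way to quantify the similarity of the same sound across two dialects in decision tree clustering is the log-likelihood loss incurred when the dialect-specific data are pooled into a single Gaussian. The sketch below illustrates that generic criterion; it is an assumed example, not necessarily one of the measures compared in the paper.

```python
import numpy as np

def gaussian_loglik(data):
    """Log-likelihood of data under a diagonal-covariance Gaussian estimated
    on that same data (ML estimate, up to terms shared by all models)."""
    var = data.var(axis=0) + 1e-6
    n, d = data.shape
    return -0.5 * n * (np.log(2.0 * np.pi * var).sum() + d)

def pooling_loss(dialect_a, dialect_b):
    """Log-likelihood loss from pooling the same sound of two dialects into
    one Gaussian; a small loss suggests the sound is a candidate for sharing."""
    pooled = np.vstack([dialect_a, dialect_b])
    return (gaussian_loglik(dialect_a) + gaussian_loglik(dialect_b)
            - gaussian_loglik(pooled))

# Example with synthetic cepstral-like features for one sound in two dialects.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(500, 13))
b = rng.normal(0.3, 1.0, size=(500, 13))
print(pooling_loss(a, b))
```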
TC-STAR is envisioned as a long-term effort focused on advanced research in all core technologies for speech-to-speech translation (SST): speech recognition, speech translation and speech synthesis. The objectives of the project are extremely ambitious: making a breakthrough in SST research to significantly reduce the gap between human and machine performance. The focus will be on the development of new, possibly revolutionary, algorithms and methods, integrating relevant human knowledge which is available at translation time into a data-driven framework. TC-STAR is planned for a duration of six years.
The first three years will target a selection of unconstrained conversational speech domains, i.e. broadcast news and speeches, and a few languages: native and non-native European English, European Spanish and Chinese. The second three years will target more complex unconstrained conversational speech domains and European languages. Key actions will be: (i) the implementation of an evaluation infrastructure, (ii) the creation of a technological infrastructure, and (iii) the support of knowledge dissemination.
Iskra, D.; Siemund, R.; Borno, J.; Moreno, A.; Emam, O.; Choukri, K.; Gedge, O.; Tropf, H.; Nogueiras, A.; Zitouni, I.; Tsopanoglou, A.; Fakotakis, N. International Conference on Language Resources and Evaluation, pp. 591-594. Presentation date: 2004. Conference presentation.
Nogueiras, A.; Caballero, M.; Moreno, A. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-841-I-844. DOI: 10.1109/ICASSP.2002.5743870. Presentation date: 2002-05. Conference presentation.
Spanish is a global language, spoken in a large number of countries with considerable dialectal variability. This paper deals with the suitability of using a single multi-dialectal acoustic modeling for all the Spanish variants spoken in Europe and Latin America. The objective is twofold. First, it makes it possible to use all the available databases to jointly train and improve the same system. Second, it allows a single system to be used for all Spanish speakers. The paper describes the rule-based phonetic transcription used for each dialectal variant, the selection of the shared and the specific phonemes to be modeled in a multi-dialectal recognition system, and the results of a multi-dialectal system dealing with dialects in and out of the training set.
The aim of IRDSS is to provide an integrated set of telematics applications that allows isolated, under-privileged communities to develop their full potential through network links to more central regional economies. Inter-regional economic cooperation will be based on downward decentralisation of work and services to these marginalised localities. The core technology will act as a base for thematic modules accessed via public user-friendly multimedia kiosks, or from home on the Internet. The project is expected to boost effective economic planning, interregional business activity and improved regional government for urban and local communities.
This paper introduces a first approach to emotion recognition using RAMSES, the UPC's speech recognition system. The approach is based on standard speech recognition technology using semi-continuous hidden Markov models. Both the selection of low level features and the design of the recognition system are addressed. Results are given on speaker dependent emotion recognition using the Spanish corpus of the INTERFACE Emotional Speech Synthesis Database. The accuracy in recognising seven different emotions (the six defined in MPEG-4 plus neutral style) exceeds 80% using the best combination of low level features and HMM structure. This result is very similar to that obtained with the same database in subjective evaluation by human judges.
In this paper, we introduce the demiphone as a context-dependent phonetic unit for continuous speech recognition. A phoneme is divided into two parts: a left demiphone that accounts for the left coarticulation and a right demiphone that copes with the right-hand side context. This unit discards the dependence between the effects of both side contexts, but it models the transition between phonemes as the triphone does. By concatenating a left demiphone and a right demiphone a triphone can be built, although the left-context and right-context coarticulations are modeled independently. The main appeal of this unit stems from its reduced number (with respect to the number of triphones) and its capability to model left and right contexts unseen together in the training material. Thus, the demiphone combines in a simple way the advantages of smoothed parameter estimation with the ability to generalize. In the present work, the demiphone is motivated and experimentally supported. Furthermore, demiphones are compared with triphones smoothed and generalized by decision-tree state-tying, accepted as the most powerful tool for coarticulation modeling at the present state of the art. The main conclusion of our work is that the demiphone simplifies the recognition system and yields better performance than the triphone, at least for small or moderate size databases. This result may be explained by the ability of the demiphone to provide an excellent trade-off between detailed coarticulation modeling and proper parameter estimation.
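A simple way to see how demiphones are obtained from a phone string is the following sketch, which emits one left and one right demiphone per phone; the unit naming convention is made up for illustration and is not the paper's notation.

```python
def to_demiphones(phones, boundary="sil"):
    """Map a phone sequence to left/right demiphone unit names: each phone p
    with left context l and right context r yields l-p(L) and p+r(R), so the
    two coarticulation sides are modelled independently."""
    padded = [boundary] + list(phones) + [boundary]
    units = []
    for i in range(1, len(padded) - 1):
        left, phone, right = padded[i - 1], padded[i], padded[i + 1]
        units.append(f"{left}-{phone}(L)")   # left demiphone
        units.append(f"{phone}+{right}(R)")  # right demiphone
    return units

# Example: the word "casa" /k a s a/ produces eight demiphone units,
# two per phone, instead of four triphones.
print(to_demiphones(["k", "a", "s", "a"]))
```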
The project will develop an integrated framework (SEAMLESS-IF) which integrates approaches from economic, environmental and social sciences to enable assessment of the impact of policy and behavioural changes and innovations in agriculture and agroforestry. Contributions of agriculture to sustainable development and multifunctionality will be assessed at different spatial scales from farm to global, allowing consideration of both top-down and bottom-up approaches to land management change. Innovative software architecture will be used to facilitate the use of quantitative biophysical and economic models and databases in combination, and to ensure re-usability of these tools well beyond the lifetime of the project. Indicators will be operationalised to communicate key information to users and between scales or disciplines, and methods for establishing threshold values for these indicators will be clarified. Qualitative tools will also be integrated into SEAMLESS-IF, to take into account institutional and social contexts. SEAMLESS-IF will be developed reflexively, using selected test cases to evaluate and improve the tools and assess utility.
The development of SEAMLESS-IF will use participatory approaches to user involvement and dissemination, and to ensure applicability for prime users (EC DGs Research, Agriculture and Environment) and other users. SEAMLESS-IF will allow ex-ante analyses of the impacts of policy and behavioural changes, through clarification of the benefits, costs, and externalities associated with farming system management. Interactions between the EU, associated candidate countries and the rest of the world will be assessed by incorporating appropriate models. SEAMLESS-IF will rapidly become essential for integrated assessment of agricultural systems in the context of agro-ecological innovations, rural development, sustainability, agricultural policy reform, EU enlargement, and world trade liberalisation.
This thesis addresses the discriminative training of sublexical units using general-purpose databases. Sublexical units are the working basis of large vocabulary continuous speech recognition systems, which constitute one of today's foremost challenges and the gateway to even more ambitious applications such as automatic dictation or dialogue systems.
Discriminative training, for its part, has proven to be an extremely powerful tool in the acoustic modelling of speech recognition systems. It works by increasing the probability that the system recognises the correct sentence, applying more or less the same decision rule used in real recognition conditions. A common limitation of the discriminative training schemes proposed to date is the need for databases made up of material specific to the task to be recognised.
The first part of the thesis presents its proposal for applying discriminative training to sublexical units in continuous speech recognition tasks: minimum confusability training on acoustic segments of limited length. Two variants are proposed. In the first, knowledge of the language of the task to be recognised is exploited to minimise the number of errors that could be committed in that task, using acoustic segments extracted from a general-purpose database. This same idea is then extended to the case in which the task is unknown, yielding general-purpose acoustic models. Experimental results are reported on the recognition of the English TIDIGITS digit strings using phone and demiphone models trained on TIMIT.
During the last few years two different approaches have been widely used to improve acoustic modeling in continuous speech recognition systems: discriminative training algorithms and context dependent subword units. However, while the use of each of these techniques leads to much better results than standard maximum likelihood trained phone models, their combination, i.e. discriminative training of context dependent units, has proven to be a much more difficult task. In this paper we deal with minimum confusability training of demiphones using the TIMIT database. By applying this approach, recently introduced by the authors, the string error rate in the recognition of TIDIGITS using demiphones is reduced by some 24% with respect to maximum likelihood training. This improvement is added to the 8% reduction already provided by demiphones with respect to minimum confusability trained phones.
This proposal seeks to advance the fields of evolutionary genetics and archaeology by considering a new source of ancient genetic material, i.e. food residues absorbed in archaeological ceramic vessels. Ancient DNA analysis is an emerging common European research strength and it provides a fourth dimension to the study of evolutionary genetics. The development of methods to recover genetic information from archaeological artefacts would increase the scope of this approach.
The proposed research will evaluate the potential for DNA survival in ceramic vessels by applying the latest methods available at a world-class facility to previously prepared control materials. Through the scheme of research, the applicant will be trained in methods of DNA extraction, amplification and sequence analysis. The work plan will also include training in new methods of quantifying DNA and determining DNA damage. This will result in a comprehensive assessment of the survival potential of DNA in ceramic vessels and describe the best methods for its analysis. In the final phase of the project the applicant will use the derived methods to analyse a range of important archaeological samples. The application of the methods to forensic science will also be assessed.
The period of mobility will provide the opportunity for the transfer of knowledge of residue analysis to laboratories in Italy and Germany that are dedicated to the study of ancient DNA.
Although it has proven to be a very powerful tool in acoustic modelling, discriminative training presents a major drawback: the lack of a formulation that guarantees convergence regardless of the initial conditions, such as the Baum-Welch algorithm provides in maximum likelihood training. For this reason, a gradient descent search is usually used in this kind of problem. Unfortunately, standard gradient descent algorithms rely heavily on the choice of the learning rates. This dependence is especially cumbersome because it means that, at each run of the discriminative training procedure, a search should be carried out over the parameters ruling the algorithm. In this paper we describe an adaptive procedure for determining the optimal value of the step size at each iteration. While the computational and memory overhead of the algorithm is negligible, results show less dependence on the initial learning rate than standard gradient descent and, using the same idea to apply self-scaling, the proposed method clearly outperforms it.
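As an illustration of per-iteration step-size adaptation with negligible overhead, the sketch below implements the classical "bold driver" rule: grow the step after a successful iteration, shrink it and undo the move otherwise. It is a generic example of the idea and not necessarily the adaptive procedure proposed in the paper.

```python
import numpy as np

def bold_driver_descent(loss, grad, x0, lr0=0.1, grow=1.1, shrink=0.5,
                        n_iters=200):
    """Gradient descent with a 'bold driver' adaptive step size: accept and
    enlarge the step when the loss decreases, otherwise reject and reduce it."""
    x, lr = np.asarray(x0, dtype=float), lr0
    prev_loss = loss(x)
    for _ in range(n_iters):
        candidate = x - lr * grad(x)
        cand_loss = loss(candidate)
        if cand_loss <= prev_loss:        # accept the move and be bolder
            x, prev_loss, lr = candidate, cand_loss, lr * grow
        else:                             # reject the move and be more cautious
            lr *= shrink
    return x

# Example: minimise a simple quadratic; convergence is insensitive to lr0.
f = lambda x: float((x ** 2).sum())
g = lambda x: 2.0 * x
print(bold_driver_descent(f, g, x0=[3.0, -4.0], lr0=10.0))
```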
Discriminative training is a powerful tool in acoustic modeling for automatic speech recognition. Its strength is based on the direct minimisation of the number of errors committed by the system at recognition time. This is usually accomplished by defining an auxiliary function that characterises the behaviour of the system, and adjusting the parameters of the system so that this function is minimised. The main drawback of this approach is that a task specific training database is needed. In this paper an alternative procedure is proposed: task adaptation using task independent databases. It consists in the combination of acoustic information, estimated using a general purpose training database, and linguistic information, taken from the definition of the task. In the experiments carried out, this technique has led to a great improvement in the recognition of two different tasks: clean speech digit strings in English and dates in Spanish over the telephone wire.
Nogueiras, A.; Mariño, J.B. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 477-480. DOI: 10.1109/ICASSP.1998.674471. Presentation date: 1998-05. Conference presentation.
In this paper, a task independent discriminative training framework for subword unit based continuous speech recognition is presented. Instead of aiming at the optimisation of some task independent figure, say the phone classification or recognition rates, we focus our attention on the reduction of the number of errors committed by the system once a task is defined. This consideration leads to the use of a segmental approach based on the minimisation of the confusability over short chains of subword units. Using this framework, a reduction of 32% in the string error rate can be achieved in the recognition of unknown length digit strings using task independent phone-like units.
This paper describes NaniBD, a set of tools designed for transcribing and validating speech databases, developed at the Signal Processing Group (GPS) of the Department of Signal Theory and Communications of the Polytechnic University of Catalonia (TSC/UPC). The main motivation for its development was the need for a revision system to validate and annotate the Spanish corpus of SpeechDat (II) in the speech processing environment available at GPS. Despite this, NaniBD is designed as a general-purpose system that should fit any other database, language or speech processing system. So far, the system has been used to revise some 200,000 speech files from three different corpora. In this paper we focus our attention on the actual implementation used in the transcription of a Catalan corpus compatible with the SpeechDat (II) specifications. This corpus is composed of 1000 speakers, each of them uttering 44 files. In this application, we use speech-noise detection, automatic recognition of spontaneous prompts, digit and letter to text translation, and access to an external database in order to minimise the amount of time spent by human operators in the revision of the corpus.
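As a trivial example of the kind of digit-to-text translation such a tool can use to save operator time, the following sketch expands a digit-string prompt into the word sequence expected from the speaker; the Catalan digit names are an assumption for illustration and are not taken from NaniBD itself.

```python
# Illustrative Catalan digit names (not NaniBD's actual lexicon).
DIGITS_CA = {"0": "zero", "1": "u", "2": "dos", "3": "tres", "4": "quatre",
             "5": "cinc", "6": "sis", "7": "set", "8": "vuit", "9": "nou"}

def digit_prompt_to_words(prompt):
    """Expand a digit-string prompt such as '4027' into the word-level
    transcription expected from the speaker, so it can be compared
    automatically against the recognised or transcribed utterance."""
    return " ".join(DIGITS_CA[c] for c in prompt if c in DIGITS_CA)

print(digit_prompt_to_words("4027"))  # -> "quatre zero dos set"
```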