Speech, audio and language processing largely benefit from deep learning architectures to achieve high levels of performance. Deep learning is a family of algorithms that learn different levels of abstraction from data. These algorithms have achieved great success in supervised settings, i.e., when labelled data is available for training, and they typically need large quantities of labelled data to perform a task. Different architectures are combined and concatenated depending on the goal of the task. These include recurrent neural networks, which excel at modeling variable-length sequences, and convolutional neural networks, which have typically been used to extract patterns from images. More complex architectures such as the Transformer, which combines attention mechanisms and feed-forward networks, are so versatile that they succeed in multiple tasks.

The purpose of this project is to address the remaining challenges of advanced deep learning architectures in the context of speech, audio and language processing, continuing the intense research of our group. The project proposes to tackle major challenges in multilingual and multimodal machine translation, speaker recognition, natural language processing and speech regeneration. In machine translation, the project is dedicated to unsupervised and multilingual machine translation. On the one hand, while machine translation has classically been trained on parallel data at the sentence level, it is possible to train using only monolingual data. On the other hand, multilingual machine translation, if handled pairwise, can be computationally very expensive. This project proposes to use language-independent encoders and decoders that can be trained on monolingual data.
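To make the pairwise cost concrete: translating among N languages with one dedicated system per direction requires N·(N−1) models, whereas language-independent encoders and decoders that meet in a shared representation need only N encoders plus N decoders. A minimal sketch of this count (the function names are illustrative, not taken from the project):

```python
def pairwise_systems(n_languages: int) -> int:
    """One dedicated model per translation direction (source != target)."""
    return n_languages * (n_languages - 1)

def shared_interlingua_modules(n_languages: int) -> int:
    """One encoder and one decoder per language, joined in a shared space."""
    return 2 * n_languages

for n in (2, 5, 10):
    print(n, pairwise_systems(n), shared_interlingua_modules(n))
# With 10 languages: 90 pairwise systems vs. 20 encoder/decoder modules.
```

The gap grows quadratically with the number of languages, which is the motivation for a shared language-independent representation.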
For this purpose, we need to work towards an automatically extracted language-independent representation, which has typically been identified as an interlingua. Beyond machine translation, we propose to address speaker recognition and speech and audio translation in an end-to-end fashion.

Regarding natural language processing, the project is oriented towards solving two general limitations of deep learning: fairness and generalization. The high performance of deep learning is overshadowed by unfair behaviours, which typically arise from demographic biases (e.g., "she is a doctor", translated into Turkish and then back-translated into English, becomes "he is a doctor"). These biases are often learned and amplified from the training data. Similarly, a lack of generalization is observed when the proposed systems only learn compositions that have been observed in the training data. One way to tackle these problems is to make proper use of linguistic information. This project proposes to use dictionaries, dependency trees and semantic resources (e.g., WordNet, BabelNet) to retrofit deep learning algorithms.

Finally, the work in speech regeneration includes unsupervised, problem-agnostic speech representations as well as the introduction of generative adversarial networks to improve the quality of the synthetic voice in voice conversion and speech enhancement applications. Given the high interest from both academia and industry, this project will share results and benefit from multiple collaborations with universities and companies, as supported by the corresponding letters of interest.
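The back-translation bias described above arises because Turkish uses the gender-neutral pronoun "o", so a system translating back into English must guess a gender and tends to pick whichever one is most frequent in its training data. A toy round trip with hard-coded dictionaries (purely illustrative; no real MT system or real model probabilities are involved):

```python
# Toy illustration of gender loss in round-trip translation.
# Turkish "o" covers "he", "she" and "it", so the English->Turkish step
# collapses both gendered sentences onto the same target string.
EN_TO_TR = {
    "she is a doctor": "o bir doktor",
    "he is a doctor": "o bir doktor",
}

# Hypothetical biased inverse: resolves the ambiguity with the reading
# most frequent in (imagined) training data, here the masculine one.
TR_TO_EN_BIASED = {"o bir doktor": "he is a doctor"}

def round_trip(sentence: str) -> str:
    """English -> Turkish -> English, through the biased toy dictionaries."""
    return TR_TO_EN_BIASED[EN_TO_TR[sentence]]

print(round_trip("she is a doctor"))  # -> "he is a doctor": gender is lost
```

The information is lost at the first step; the second step cannot recover it and instead substitutes a demographic default, which is exactly the behaviour the project aims to mitigate with linguistic resources.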
PLAN ESTATAL DE INVESTIGACIÓN CIENTÍFICA Y TÉCNICA Y DE INNOVACIÓN 2017-2020
PROGRAMA ESTATAL DE I+D+I ORIENTADA A LOS RETOS DE LA SOCIEDAD