Costa-jussà, M. R.; Formiga, L.; Torrillas, O.; Petit, J.; Fonollosa, José A. R. International review of research in open and distance learning Vol. 16, num. 6, p. 174-205 DOI: 10.19173/irrodl.v16i6.2145 Date of publication: 2015-12-10 Journal article
This paper describes the design, development, and analysis of a MOOC entitled “Approaches to Machine Translation: Rule-based, statistical and hybrid”, and provides lessons learned and conclusions to be taken into account in the future. The course was developed within the Canvas platform, used by recognized European universities. It contains video-lectures, quizzes, and laboratory assignments. Evaluation was through on-line quizzes, programming assignments assessed by means of a specific code evaluation, and peer-to-peer strategies. This MOOC allowed people from various fields to be introduced to the theory and practice of Machine Translation. It also enabled us to internationally publicize various tools developed at the Universitat Politècnica de Catalunya.
Machine translation has received a great deal of attention in natural language processing because it is a topic of social interest. At the same time, it is an interesting academic problem because it encompasses several text-processing tasks, such as word sense disambiguation, parsing, and named entity recognition.
This article presents the latest advances in this area for the two languages with the largest numbers of native speakers: Chinese and Spanish. Research in machine translation for both languages includes rule-based as well as statistical approaches. The fact that both approaches are active paves the way for a hybrid approach.
Thus, using the particular case of Chinese and Spanish, this article: (1) describes the economic, social, and academic motivations for this topic in both languages; (2) reviews, describes, and reports experiments with the two most popular machine translation approaches (rule-based and statistical); and (3) outlines future directions that are becoming very popular in the field, such as the hybrid approach.
The use of morphology is particularly interesting in the context of statistical machine translation because it reduces data sparseness and compensates for limited training data. In this work, we propose several approaches to introduce morphological knowledge into a standard phrase-based machine translation system. We provide word segmentation using two different tools (COGROO and MORFESSOR), which reduce the vocabulary size and data sparseness. Then, we add to these segmentations the morphological information of a POS language model. We combine all these approaches using a Minimum Bayes Risk strategy. Experiments show significant improvements of the enhanced system over the baseline system on the Brazilian Portuguese/English language pair. Finally, we report a case study on the impact of enhancing the statistical machine translation system with morphology in a cross-language application such as ONAIR, which allows users to look for information in video fragments through natural language queries.
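The Minimum Bayes Risk combination mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not specify the gain function or how system scores are obtained, so the unigram F1 gain, the hypothesis strings, and the probabilities below are all assumptions made for illustration; they are not the authors' configuration.

```python
from collections import Counter


def overlap_gain(hyp: str, ref: str) -> float:
    """Unigram F1 between two whitespace-tokenized strings.

    Assumption: this stands in for a BLEU-like gain; the abstract
    does not say which gain or loss function was used.
    """
    h, r = Counter(hyp.split()), Counter(ref.split())
    matches = sum((h & r).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(h.values())
    recall = matches / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(hypotheses: list[tuple[str, float]]) -> str:
    """Pick the hypothesis with the highest expected gain under the
    normalized hypothesis probabilities (i.e. minimum expected loss)."""
    total = sum(p for _, p in hypotheses)

    def expected_gain(candidate: str) -> float:
        return sum(p / total * overlap_gain(candidate, other)
                   for other, p in hypotheses)

    return max((h for h, _ in hypotheses), key=expected_gain)


# Hypothetical outputs pooled from three systems, with made-up scores.
pooled = [
    ("the house is small", 0.40),
    ("the house is little", 0.35),
    ("a home is small", 0.25),
]
best = mbr_select(pooled)
print(best)
```

The selected hypothesis is the one that agrees most, on average, with the other candidates weighted by their probability, which is why MBR is a natural way to combine the outputs of several system variants.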
Costa-jussà, M. R.; Formiga, L.; Petit, J.; Fonollosa, José A. R. Lecture notes in computer science Vol. 8856, p. 92-98 DOI: 10.1007/978-3-319-13647-9_10 Date of publication: 2014-11-12 Journal article
This paper describes the design, development and execution of a MOOC entitled “Approaches to Machine Translation: rule-based, statistical and hybrid”. The course is launched from the Canvas platform used by recognized European universities. The course contains video lectures, quizzes and laboratory assignments. Evaluation is done using a virtual learning environment for computer programming and peer-to-peer strategies. This MOOC introduces people from various areas to the theory and practice of Machine Translation. It also helps to internationalize different tools developed at the Universitat Politècnica de Catalunya.
Machine translation can be considered a highly interdisciplinary and multidisciplinary field because it is approached from the point of view of human translators, engineers, computer scientists, mathematicians, and linguists. One of the most popular approaches is Statistical Machine Translation (SMT), which tries to cover translation in a holistic manner by learning from a parallel corpus aligned at the sentence level. However, with this basic approach, some issues remain unsolved at each written linguistic level (i.e., orthographic, morphological, lexical, syntactic and semantic). Research in SMT has continuously focused on solving the challenges at these linguistic levels. This article presents a survey of how SMT has been enhanced to perform translation correctly at all linguistic levels.
We present a large Spanish-Catalan parallel corpus extracted from ten years of the paper edition of a bilingual Catalan newspaper. The produced corpus of 7.5 M parallel sentences (around 180 M words per language) is useful for many natural language applications. We report excellent results when building a statistical machine translation system trained on this parallel corpus. The Spanish-Catalan corpus is partially available via ELDA (Evaluations and Language Resources Distribution Agency) in catalog number ELRA-W0053.
Formiga, L.; Costa-jussà, M. R.; Mariño, J.B.; Fonollosa, José A. R.; Barron-Cedeño, A.; Marquez, L. Workshop on Statistical Machine Translation p. 134-140 Presentation's date: 2013-08-08 Presentation of work at congresses
This paper describes the TALP participation in the WMT13 evaluation campaign. Our participation is based on the combination of several statistical machine translation systems built on standard phrase-based Moses systems. Variations include techniques such as morphology generation, training sentence filtering, and domain adaptation through unit derivation. The results show a consistent improvement on TER, METEOR, NIST, and BLEU scores when compared to our baseline system.
Although Chinese and Spanish are two of the most spoken languages in the world, not much research has been done in machine translation for this language pair. This paper investigates the state of the art in Chinese-to-Spanish statistical machine translation (SMT), which nowadays is one of the most popular approaches to machine translation. We conduct experimental work with the largest of these three corpora to explore alternative SMT strategies that use a pivot language. Three alternatives are considered for pivoting: cascading, pseudo-corpus and triangulation. As pivot language, we use either English, Arabic or French. Results show that, for a phrase-based SMT system, English is the best pivot language between Chinese and Spanish. We propose a system output combination using the pivot strategies which is capable of outperforming the direct translation strategy. The main objective of this work is to motivate and involve the research community in working on this important language pair, given its demographic impact.
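Of the three pivoting strategies named in this abstract, triangulation is the easiest to show compactly: a source-to-target translation table is built by marginalizing over pivot entries, p(target | source) = Σ p(target | pivot) · p(pivot | source). The following is a minimal sketch of that idea on word-level toy tables; the function name `triangulate`, the entries, and all probabilities are hypothetical, and a real phrase-based system would work on full phrase tables with several feature scores.

```python
from collections import defaultdict


def triangulate(src_to_pivot: dict, pivot_to_tgt: dict) -> dict:
    """Combine two translation tables through a shared pivot language.

    Each table maps an entry to {translation: probability}. The result
    sums p(pivot | src) * p(tgt | pivot) over all shared pivot entries.
    """
    table = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot in pivots.items():
            for tgt, p_tgt in pivot_to_tgt.get(pivot, {}).items():
                table[src][tgt] += p_pivot * p_tgt
    return {src: dict(tgts) for src, tgts in table.items()}


# Toy Chinese-English and English-Spanish tables (made-up values).
zh_en = {"房子": {"house": 0.7, "home": 0.3}}
en_es = {
    "house": {"casa": 0.9, "vivienda": 0.1},
    "home": {"casa": 0.6, "hogar": 0.4},
}
zh_es = triangulate(zh_en, en_es)
print(zh_es)
```

Cascading, by contrast, would run two complete translation systems in sequence, and the pseudo-corpus strategy would translate one side of a pivot-language corpus to create synthetic direct training data.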
Pérez, J.; Bonafonte, A.; Costa-jussà, M. R.; Cardenal, A.; Fonollosa, José A. R.; Moreno, A.; Navas, E.; R.Banga, E. Jornadas en Tecnología del Habla and III Iberian SLTech Workshop p. 422-430 Presentation's date: 2012-11 Presentation of work at congresses
'Machine Translation (MT) is a highly interdisciplinary and multidisciplinary field since it is approached from the point of view of engineering, computer science, informatics, statistics and linguistics. Unfortunately, the cooperation and interaction among these fields in relation to MT technologies is still very low. The goal of this research project is to bring together the different profiles in the MT community by providing a new integrated MT paradigm which mainly includes linguistic technologies and statistical algorithms.
Basically, our research will be focused on the problem of dynamically integrating the two most popular MT paradigms: the rule-based and the statistical-based. We will use linguistic technologies developed either for the rule-based MT systems or other natural language processing tasks into statistical MT systems. Linguistic technologies include: bilingual dictionaries, transfer rules, statistical parsing, word sense disambiguation, morphological and syntactic analysis. The new paradigm will provide solutions to current MT challenges such as unknown words, reordering and semantic ambiguities.
The project will focus on the three most spoken languages in the world: Chinese, Spanish and English; and all translation combinations among them. These language pairs not only involve many economic and cultural interests, but also include some of the most relevant MT challenges such as morphological, syntactic and semantic variations.'
Adell, J.; Bonafonte, A.; Costa-jussà, M. R.; Cardenal, A.; Fonollosa, José A. R.; Moreno, A.; Navas, E.; R.Banga, E. International Conference on Language Resources and Evaluation p. 1705-1709 Presentation's date: 2012-05-24 Presentation of work at congresses
This paper presents a web-based multimedia search engine built within the Buceador (www.buceador.org) research project. A proof-of-concept tool has been implemented which is able to retrieve information from a digital library made of multimedia documents in the 4 official languages in Spain (Spanish, Basque, Catalan and Galician). The retrieved documents are presented in the user language after translation and dubbing (the four previous languages + English). The paper presents the tool functionality, the architecture and the digital library, and provides some information about the technology involved in the fields of automatic speech recognition, statistical machine translation, text-to-speech synthesis and information retrieval. Each technology has been adapted to the purposes of the presented tool as well as to interact with the rest of the technologies involved.
Mariño, J.B.; Poch, M.; Costa-jussà, M. R.; Hernández, A.; Henríquez, C.; Fonollosa, José A. R.; Farrús, M. Language resources and evaluation Vol. 45, num. 2, p. 181-208 DOI: 10.1007/s10579-011-9137-0 Date of publication: 2011-02-20 Journal article
This work aims to improve an N-gram-based statistical machine translation system between the Catalan and Spanish languages, trained with an aligned Spanish–Catalan parallel corpus consisting of 1.7 million sentences taken from El Periódico newspaper. Starting from a linguistic error analysis of this baseline system, orthographic, morphological, lexical, semantic and syntactic problems are approached using a set of techniques. The proposed solutions include the development and application of additional statistical techniques, text pre- and post-processing tasks, and rules based on the use of grammatical categories, as well as lexical categorization. The performance of the improved system is clearly increased, as is shown in both human and automatic evaluations of the system, with a gain of about 1.1 BLEU points observed in the Spanish-to-Catalan direction of translation, and a gain of about 0.5 points in the reverse direction. The final system is freely available online as a linguistic resource.
Farrús, M.; Costa-jussà, M. R.; Mariño, J.B.; Fonollosa, José A. R. Annual Conference of the European Association for Machine Translation p. 167-173 Presentation's date: 2010-05-28 Presentation of work at congresses
Machine translation evaluation methods are highly necessary in order to analyze the performance of translation systems. Up to now, the most traditional methods have been automatic measures such as BLEU and perceptual quality judgements made by native human evaluators. In order to complement these traditional procedures, the current paper presents a new human evaluation based on expert knowledge about the errors encountered at several linguistic levels: orthographic, morphological, lexical, semantic and syntactic. The results obtained in these experiments show that some linguistic errors could have more influence than others when performing a perceptual evaluation.
This paper proposes to introduce a novel reordering model in the open-source Moses toolkit. The main idea is to provide weighted reordering hypotheses to the SMT decoder. These hypotheses are built using a first-step Ngram-based SMT translation from a source language into a third representation that is called reordered source language. Each hypothesis has its own weight provided by the Ngram-based decoder. This proposed reordering technique offers a better and more efficient translation when compared to both the distance-based and the lexicalized reordering. In addition to this reordering approach, this paper describes a domain adaptation technique which is based on a linear combination of a specific in-domain translation model and an extra out-of-domain translation model. Results for both approaches are reported in the Arabic-to-English 2008 IWSLT task. When implementing the weighted reordering hypotheses and the domain adaptation technique in the final translation system, translation results reach improvements up to 2.5 BLEU compared to a standard state-of-the-art Moses baseline system.
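The domain adaptation technique in this abstract is described only as a linear combination of an in-domain and an out-of-domain translation model. A minimal sketch of such a combination follows, assuming a single interpolation weight and translation tables stored as dictionaries keyed by (source, target) pairs; the function name `interpolate`, the weight 0.7, and the toy entries are hypothetical and not taken from the paper.

```python
def interpolate(in_domain: dict, out_domain: dict, weight: float) -> dict:
    """Linearly combine two translation tables.

    p(pair) = weight * p_in(pair) + (1 - weight) * p_out(pair),
    where a pair missing from one table contributes probability 0.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    pairs = set(in_domain) | set(out_domain)
    return {
        pair: weight * in_domain.get(pair, 0.0)
        + (1.0 - weight) * out_domain.get(pair, 0.0)
        for pair in pairs
    }


# Toy tables keyed by (source, target); all values are made up.
p_in = {("kitab", "book"): 0.8, ("kitab", "letter"): 0.2}
p_out = {("kitab", "book"): 0.5, ("kitab", "writing"): 0.5}

# Assumed weight favouring the in-domain model; in practice it would be tuned.
combined = interpolate(p_in, p_out, weight=0.7)
print(combined)
```

Because both inputs are proper distributions over the same source entry and the weights sum to one, the combined scores for that entry still sum to one, while translations seen only in the larger out-of-domain data remain available to the decoder.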
Farrús, M.; Costa-jussà, M. R.; Mariño, J.B.; Fonollosa, José A. R. International Conference on Language Resources and Evaluation p. 1707-1711 Presentation's date: 2010-05-20 Presentation of work at congresses
Machine translation systems can be classified into rule-based and corpus-based approaches, in terms of their core technology. Since both paradigms have been widely used in recent years, one of the aims of the research community is to know how these systems differ in terms of translation quality. To this end, this paper reports a study and comparison of a rule-based and a corpus-based (specifically, statistical) Catalan-Spanish machine translation system, both of them freely available on the web. The translation quality analysis is performed in two different domains: journalistic and medical. The systems are evaluated using standard automatic measures, as well as by native human evaluators. Automatic results show that the statistical system performs better than the rule-based system. Human judgements show that in the Spanish-to-Catalan direction the statistical system also performs better than the rule-based system, while in the Catalan-to-Spanish direction it is the other way round. Although the statistical system obtains the best automatic scores, its errors tend to be penalized more by human judgements than the errors of the rule-based system. This can be explained by the fact that statistical errors are usually unexpected and do not follow any pattern.
Costa-jussà, M. R.; Fonollosa, José A. R. IEICE transactions on information and systems Vol. E92-D, num. 11, p. 2179-2185 DOI: 10.1587/transinf.E92.D.2179 Date of publication: 2009-11-01 Journal article
This paper surveys several state-of-the-art reordering techniques employed in Statistical Machine Translation systems. Reordering is understood as the word-order redistribution of the translated words. In the original SMT systems, this different order is only modeled within the limits of translation units. Relying only on the reordering provided by translation units may not be good enough for most language pairs, which might require longer reorderings. Therefore, additional techniques may be deployed to face the reordering challenge. The Statistical Machine Translation community has been very active recently in developing reordering techniques. This paper gives a brief survey and classification of several well-known reordering approaches.
Poch, M.; Farrús, M.; Costa-jussà, M. R.; Mariño, J.B.; Hernández, A.; Henríquez, C.; Fonollosa, José A. R. Joint SIG-IL/Microsoft Workshop on Speech and Language Technologies for Iberian Languages p. 105 Presentation's date: 2009-09-04 Presentation of work at congresses
This paper describes in detail a novel approach to the reordering challenge in statistical machine translation (SMT).
This Ngram-based reordering (NbR) approach uses the powerful techniques of SMT systems to generate a weighted reordering graph. Thus, statistical criteria reordering constraints are supplied to an SMT system, and this allows an extension to the SMT decoding search.
The NbR approach is capable of generalizing reorderings that have been learned during training, through the use of word classes instead of words themselves.
Improvement in translation performance is demonstrated with the EPPS task (Spanish and German to English) and the BTEC task (Arabic to English).
Best 2009 article published in an international journal with a young researcher from a Spanish university as first author; awarded by the Red Temática de Tecnologías del Habla
Fonollosa, José A. R.; Khalilov, M.; Costa-jussà, M. R.; Henríquez, C.; Hernández, A.; Banchs, R. E. Association for Computational Linguistics. European Chapter. Conference p. 85-89 Presentation's date: 2009-04-01 Presentation of work at congresses
This study presents the TALP-UPC submission to the EACL Fourth Workshop on Statistical Machine Translation 2009 evaluation campaign. It outlines the architecture and configuration of the 2009 phrase-based statistical machine translation (SMT) system, putting emphasis on the major novelty of this year: combination of SMT systems implementing different word reordering algorithms. Traditionally, we have concentrated on the Spanish-to-English and English-to-Spanish News Commentary translation tasks.
Lambert, P.; Gimenez, J.; Amigó, E.; Banchs, R.; Marquez, L.; Fonollosa, José A. R.; Costa-jussà, M. R. 2006 IEEE/ACL Workshop on Spoken Language Technology p. 246-249 Presentation of work at congresses