Costa-jussà, M. R.; Formiga, L.; Petit, J.; Fonollosa, José A. R. Lecture notes in computer science Vol. 8856, p. 92-98 DOI: 10.1007/978-3-319-13647-9_10 Date of publication: 2014-11-12 Journal article
This paper describes the design, development and execution of a MOOC entitled “Approaches to Machine Translation: rule-based, statistical and hybrid”. The course is delivered on the Canvas platform, which is used by recognized European universities. It contains video lectures, quizzes and laboratory assignments. Evaluation is done using a virtual learning environment for computer programming and peer-to-peer strategies. This MOOC introduces people from various areas to Machine Translation theory and practice. It also helps to internationalize different tools developed at the Universitat Politècnica de Catalunya.
We present a large Spanish-Catalan parallel corpus extracted from ten years of the paper edition of a bilingual Catalan newspaper. The resulting corpus of 7.5 M parallel sentences (around 180 M words per language) is useful for many natural language applications. We report excellent results when building a statistical machine translation system trained on this parallel corpus. The Spanish-Catalan corpus is partially available via ELDA (Evaluations and Language Resources Distribution Agency) under catalog number ELRA-W0053.
Formiga, L.; Costa-jussà, M. R.; Mariño, J.B.; Fonollosa, José A. R.; Barron-Cedeño, A.; Marquez, L. Workshop on Statistical Machine Translation p. 134-140 Presentation's date: 2013-08-08 Presentation of work at congresses
This paper describes the TALP participation in the WMT13 evaluation campaign. Our participation is based on the combination of several statistical machine translation systems built on standard phrase-based Moses systems. Variations include techniques such as morphology generation, training sentence filtering, and domain adaptation through unit derivation.
The results show a coherent improvement on TER, METEOR, NIST, and BLEU scores when compared to our baseline system.
Although Chinese and Spanish are two of the most spoken languages in the world, not much research has been done in machine translation for this language pair. This paper investigates the state of the art in Chinese-to-Spanish statistical machine translation (SMT), which nowadays is one of the most popular approaches to machine translation. We conduct experimental work with the largest of these three corpora to explore alternative SMT strategies that use a pivot language. Three alternatives are considered for pivoting: cascading, pseudo-corpus and triangulation. As pivot language, we use English, Arabic or French. Results show that, for a phrase-based SMT system, English is the best pivot language between Chinese and Spanish. We propose a system output combination using the pivot strategies which is capable of outperforming the direct translation strategy. The main objective of this work is to motivate and involve the research community in work on this important language pair, given its demographic impact.
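As an illustration of the triangulation strategy named in this abstract, the following minimal sketch combines a source-to-pivot and a pivot-to-target phrase table by marginalizing over pivot phrases, p(tgt | src) = Σ p(pivot | src) · p(tgt | pivot). This is the textbook form of triangulation and is not taken from the paper; the function name `triangulate`, the dictionary layout, and all probabilities are hypothetical toy values.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Build a source-to-target phrase table from two pivot tables.

    Each table maps a phrase to {translation: probability}.
    p(tgt | src) = sum over pivot of p(pivot | src) * p(tgt | pivot)
    """
    src_to_tgt = {}
    for src, pivots in src_to_pivot.items():
        scores = defaultdict(float)
        for pivot, p_pivot in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt in pivot_to_tgt.get(pivot, {}).items():
                scores[tgt] += p_pivot * p_tgt
        src_to_tgt[src] = dict(scores)
    return src_to_tgt


# Toy, made-up phrase tables (Chinese -> English, English -> Spanish).
zh_en = {"你好": {"hello": 0.7, "hi": 0.3}}
en_es = {"hello": {"hola": 0.9, "buenas": 0.1}, "hi": {"hola": 1.0}}

zh_es = triangulate(zh_en, en_es)
best = max(zh_es["你好"], key=zh_es["你好"].get)
```

By contrast, cascading would translate each sentence twice (source to pivot, then pivot to target), and the pseudo-corpus approach would translate the pivot side of a parallel corpus to obtain synthetic source-target training data.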
Pérez, J.; Bonafonte, A.; Costa-jussà, M. R.; Cardenal, A.; Fonollosa, José A. R.; Moreno, A.; Navas, E.; R.Banga, E. Jornadas en Tecnología del Habla and III Iberian SLTech Workshop p. 422-430 Presentation's date: 2012-11 Presentation of work at congresses
'Machine Translation (MT) is a highly interdisciplinary and multidisciplinary field since it is approached from the point of view of engineering, computer science, informatics, statistics and linguistics. Unfortunately, the cooperation and interaction among these fields in relation to MT technologies is still very low. The goal of this research project is to bring together the different profiles in the MT community by providing a new integrated MT paradigm which mainly includes linguistic technologies and statistical algorithms.
Basically, our research will be focused on the problem of dynamically integrating the two most popular MT paradigms: the rule-based and the statistical. We will integrate linguistic technologies, developed either for rule-based MT systems or for other natural language processing tasks, into statistical MT systems. Linguistic technologies include: bilingual dictionaries, transfer rules, statistical parsing, word sense disambiguation, morphological and syntactic analysis. The new paradigm will provide solutions to current MT challenges such as unknown words, reordering and semantic ambiguities.
The project will focus on the three most spoken languages in the world: Chinese, Spanish and English; and all translation combinations among them. These language pairs not only involve many economic and cultural interests, but also include some of the most relevant MT challenges such as morphological, syntactic and semantic variations.'
Adell, J.; Bonafonte, A.; Costa-jussà, M. R.; Cardenal, A.; Fonollosa, José A. R.; Moreno, A.; Navas, E.; R.Banga, E. International Conference on Language Resources and Evaluation p. 1705-1709 Presentation's date: 2012-05-24 Presentation of work at congresses
Mariño, J.B.; Poch, M.; Costa-jussà, M. R.; Hernández, A.; Henríquez, C.; Fonollosa, José A. R.; Farrús, M. Language resources and evaluation Vol. 45, num. 2, p. 181-208 DOI: 10.1007/s10579-011-9137-0 Date of publication: 2011-02-20 Journal article
Farrús, M.; Costa-jussà, M. R.; Mariño, J.B.; Fonollosa, José A. R. Annual Conference of the European Association for Machine Translation p. 167-173 Presentation's date: 2010-05-28 Presentation of work at congresses
Machine translation evaluation methods are highly necessary in order to analyze the performance of translation systems. Up to now, the most traditional methods have been automatic measures such as BLEU and quality perception judgments by native human evaluators. To complement these traditional procedures, the current paper presents a new human evaluation based on expert knowledge about the errors encountered at several linguistic levels: orthographic, morphological, lexical, semantic and syntactic. The results obtained in these experiments show that some linguistic errors could have more influence than others when performing a perceptual evaluation.
Farrús, M.; Costa-jussà, M. R.; Mariño, J.B.; Fonollosa, José A. R. International Conference on Language Resources and Evaluation p. 1707-1711 Presentation's date: 2010-05-20 Presentation of work at congresses
Costa-jussà, M. R.; Fonollosa, José A. R. IEICE transactions on information and systems Vol. E92-D, num. 11, p. 2179-2185 DOI: 10.1587/transinf.E92.D.2179 Date of publication: 2009-11-01 Journal article
This paper surveys several state-of-the-art reordering techniques employed in Statistical Machine Translation systems. Reordering is understood as the word-order redistribution of the translated words. In the original SMT systems, this difference in order is only modeled within the limits of translation units. Relying only on the reordering provided by translation units may not be good enough for most language pairs, which might require longer reorderings. Therefore, additional techniques may be deployed to face the reordering challenge. The Statistical Machine Translation community has been very active recently in developing reordering techniques. This paper gives a brief survey and classification of several well-known reordering approaches.
Poch, M.; Farrús, M.; Costa-jussà, M. R.; Mariño, J.B.; Hernández, A.; Henríquez, C.; Fonollosa, José A. R. Joint SIG-IL/Microsoft Workshop on Speech and Language Technologies for Iberian Languages p. 105 Presentation's date: 2009-09-04 Presentation of work at congresses
This paper describes in detail a novel approach to the reordering challenge in statistical machine translation (SMT).
This Ngram-based reordering (NbR) approach uses the powerful techniques of SMT systems to generate a weighted reordering graph. Thus, reordering constraints based on statistical criteria are supplied to an SMT system, and this allows an extension to the SMT decoding search.
The NbR approach is capable of generalizing reorderings that have been learned during training, through the use of word classes instead of words themselves.
Improvement in translation performance is demonstrated with the EPPS task (Spanish and German to English) and the BTEC task (Arabic to English).
Fonollosa, José A. R.; Khalilov, M.; Costa-jussà, M. R.; Henríquez, C.; Hernández, A.; Banchs, R. E. Association for Computational Linguistics. European Chapter. Conference p. 85-89 Presentation's date: 2009-04-01 Presentation of work at congresses
This study presents the TALP-UPC submission to the EACL Fourth Workshop on Statistical Machine Translation 2009 evaluation campaign. It outlines the architecture and configuration of the 2009 phrase-based statistical machine translation (SMT) system, putting emphasis on the major novelty of this year: combination of SMT systems implementing different word reordering algorithms. Traditionally, we have concentrated on the Spanish-to-English and English-to-Spanish News Commentary translation tasks.
Lambert, P.; Gimenez, J.; Amigó, E.; Banchs, R.; Marquez, L.; Fonollosa, José A. R.; Costa-jussà, M. R. 2006 IEEE/ACL Workshop on Spoken Language Technology p. 246-249 Presentation of work at congresses