Two of the most popular Machine Translation (MT) paradigms are rule-based (RBMT) and corpus-based, the latter including statistical systems (SMT). When parallel corpora are scarce, RBMT becomes particularly attractive. This is the case for the Chinese-Spanish language pair.
This article presents the first RBMT system for Chinese to Spanish. We describe a hybrid method for constructing this system taking advantage of available resources such as parallel corpora that are used to extract dictionaries and lexical and structural transfer rules.
The final system is freely available online and open source. Although performance lags behind standard SMT systems for an in-domain test set, the results show that the RBMT’s coverage is competitive and it outperforms the SMT system in an out-of-domain test set. This RBMT system is available to the general public, it can be further enhanced, and it opens up the possibility of creating future hybrid MT systems.
Statistical machine translation (SMT) is gaining interest given that it can easily be adapted to any pair of languages. One of the main challenges in SMT is domain adaptation, because translation performance drops when testing conditions deviate from training conditions. A growing body of research addresses this challenge, focusing on exploiting all kinds of material where available. This paper provides an overview of research that tackles the domain adaptation challenge in SMT.
Costa-jussà, M. R.; Formiga, L.; Torrillas, O.; Petit, J.; Fonollosa, José A. R. International review of research in open and distance learning Vol. 16, num. 6, p. 174-205 DOI: 10.19173/irrodl.v16i6.2145 Date of publication: 2015-11-01 Journal article
This paper describes the design, development, and analysis of a MOOC entitled "Approaches to Machine Translation: Rule-based, statistical and hybrid", and provides lessons learned and conclusions to be taken into account in the future. The course was developed within the Canvas platform, used by recognized European universities. It contains video lectures, quizzes, and laboratory assignments. Evaluation was through online quizzes, programming assignments assessed by means of a specific code evaluation, and peer-to-peer strategies. This MOOC allowed people from various fields to be introduced to the theory and practice of Machine Translation. It also enabled us to internationally publicize various tools developed at the Universitat Politècnica de Catalunya.
Rule-based and corpus-based machine translation (MT) have coexisted for more than 20 years. Recently, boundaries between the two paradigms have narrowed and hybrid approaches are gaining interest from both academia and businesses. However, since hybrid approaches involve the multidisciplinary interaction of linguists, computer scientists, engineers, and information specialists, understandably a number of issues exist. While statistical methods currently dominate research work in MT, most commercial MT systems are technically hybrid systems. The research community should investigate the benefits and questions surrounding the hybridization of MT systems more actively. This paper discusses various issues related to hybrid MT including its origins, architectures, achievements, and frustrations experienced in the community. It can be said that both rule-based and corpus-based MT systems have benefited from hybridization when effectively integrated. In fact, many of the current rule/corpus-based MT approaches are already hybridized since they do include statistics/rules at some point.
Machine translation has received a great deal of interest in the field of natural language processing because it is a topic of social interest. At the same time, it is an interesting problem at the academic level because it encompasses different text processing tasks such as word sense disambiguation, parsing, and entity recognition.
This article presents the latest advances in this area for the two languages that lead the ranking by number of native speakers: Chinese and Spanish. Research in machine translation for this pair includes both rule-based and statistical approaches. The fact that both approaches are active leaves the way open for a hybrid approach.
Thus, using the particular case of Chinese and Spanish, this article: (1) describes the economic, social, and academic motivations for this topic for both languages; (2) reviews, describes, and shows experiments with the two most popular machine translation approaches (rule-based and statistical); and (3) outlines future lines of work that are becoming very popular in the field, such as the hybrid approach.
The use of morphology is particularly interesting in the context of statistical machine translation in order to reduce data sparseness and compensate for any lack of training corpus. In this work, we propose several approaches to introduce morphological knowledge into a standard phrase-based machine translation system. We provide word segmentation using two different tools (COGROO and MORFESSOR), which reduce the vocabulary size and data sparseness. Then, we add to these segmentations the morphological information of a POS language model. We combine all these approaches using a Minimum Bayes Risk strategy. Experiments show significant improvements from the enhanced system over the baseline system on the Brazilian Portuguese/English language pair. Finally, we report a case study about the impact of enhancing the statistical machine translation system with morphology in a cross-language application system such as ONAIR, which allows users to look for information in video fragments through queries in natural language.
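As a rough illustration of the Minimum Bayes Risk idea mentioned above, the following Python sketch picks, from several candidate translations, the one with the highest expected similarity to the others. The unigram-overlap F1 gain, the function names, and the toy sentences are assumptions made for illustration; the abstract does not specify the gain function or the implementation used.

```python
from collections import Counter


def overlap_f1(hypothesis, reference):
    """Unigram-overlap F1 between two sentences (a stand-in gain function)."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    matches = sum((hyp_counts & ref_counts).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(hyp_counts.values())
    recall = matches / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(hypotheses, probs=None):
    """Return the hypothesis with the highest expected gain against all candidates.

    probs holds optional posterior weights for the candidates (uniform if omitted).
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    if probs is None:
        probs = [1.0 / len(hypotheses)] * len(hypotheses)

    def expected_gain(index):
        candidate = hypotheses[index]
        return sum(p * overlap_f1(candidate, other)
                   for other, p in zip(hypotheses, probs))

    best_index = max(range(len(hypotheses)), key=expected_gain)
    return hypotheses[best_index]


# Hypothetical outputs of three differently segmented systems.
candidates = ["the house is big", "the house is large", "a home is big"]
print(mbr_select(candidates))
```

In a system combination setting, each candidate would come from a different system, and a sentence-level translation metric would normally replace the toy gain used here.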
One of the major bottlenecks in the development of data-driven AI Systems is the cost of reliable human annotations. The recent advent of several crowdsourcing platforms such as Amazon’s Mechanical Turk, allowing requesters the access to affordable and rapid results of a global workforce, greatly facilitates the creation of massive training data. Most of the available studies on the effectiveness of crowdsourcing report on English data. We use Mechanical Turk annotations to train an Opinion Mining System to classify Spanish consumer comments. We design three different Human Intelligence Task (HIT) strategies and report high inter-annotator agreement between non-experts and expert annotators. We evaluate the advantages/drawbacks of each HIT design and show that, in our case, the use of non-expert annotations is a viable and cost-effective alternative to expert annotations.
A comparison of interlingua and query translation is proposed in a
particular cross-language information retrieval (CLIR) application which consists
of retrieving a book from the collection by using one of its chapters in a different
language as a query. The experiments are performed in three languages (English,
Chinese and Spanish) and all the possible combinations. It is shown that interlingua
is able to outperform the query translation approach in some cross-language
tasks. Results are further analysed and it is found that, for this particular task, the
quality of translation (in terms of BLEU and PER) is not directly correlated with
the query translation performance.
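For readers unfamiliar with PER (position-independent error rate), the sketch below computes it under one common definition, in which word order is ignored and only bag-of-words matches count. The function name and example sentences are assumptions for illustration; the abstract does not state which exact variant of PER was used.

```python
from collections import Counter


def position_independent_error_rate(hypothesis, reference):
    """PER = (max(len(hyp), len(ref)) - bag-of-words matches) / len(ref)."""
    hyp_words = hypothesis.split()
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # Multiset intersection counts each matched word once, regardless of position.
    matches = sum((Counter(hyp_words) & Counter(ref_words)).values())
    return (max(len(hyp_words), len(ref_words)) - matches) / len(ref_words)


reference = "the cat sat on the mat"
print(position_independent_error_rate("on the mat the cat sat", reference))
print(position_independent_error_rate("the cat sat", reference))
```

Because PER ignores word order, a reordered but lexically complete translation scores as error-free, which is one reason a translation metric and retrieval performance need not move together.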
Costa-jussà, M. R.; Formiga, L.; Petit, J.; Fonollosa, José A. R. Lecture notes in computer science Vol. 8856, p. 92-98 DOI: 10.1007/978-3-319-13647-9_10 Date of publication: 2014-11-12 Journal article
This paper describes the design, development and execution of a MOOC entitled “Approaches to Machine Translation: rule-based, statistical and hybrid”. The course is launched from the Canvas platform used by recognized European universities. The course contains video lectures, quizzes and laboratory assignments. Evaluation is done using a virtual learning environment for computer programming and peer-to-peer strategies. This MOOC introduces people from various areas to the theory and practice of Machine Translation. It also helps to internationalize different tools developed at the Universitat Politècnica de Catalunya.
Centelles, J.; Costa-jussà, M. R.; Banchs, R.; Gelbukh, A. Computación y sistemas: revista iberoamericana de computación Vol. 18, num. 3, p. 603-610 DOI: 10.13053/CyS-18-3-2047 Date of publication: 2014-07-01 Journal article
We describe a Chinese-Portuguese
translation service, which is integrated in an Android
application. The application is also enhanced with
technologies such as Automatic Speech Recognition,
Optical Character Recognition, Image Retrieval, and
Language Detection. This mobile translation
application, which is deployed on a portable device,
relies by default on a server-based machine translation
service, which is not accessible when no Internet
connection is available. For providing translation
support under this condition, we have developed a
contextualized off-line search engine that allows the
users to continue using the application. The system
includes a search engine that is used to support our
Chinese-Portuguese machine translation services when
no Internet connection is available.
Centelles, J.; Costa-jussà, M. R.; Banchs, R.; Gelbukh, A. Lecture notes in artificial intelligence Vol. 8404, p. 324-330 DOI: 10.1007/978-3-642-54903-8_27 Date of publication: 2014-04-01 Journal article
This paper describes an Information Retrieval engine that is used to
support our Chinese-Portuguese machine translation services when no internet
connection is available. Our mobile translation app, which is deployed on a
portable device, relies by default on a server-based machine translation service,
which is not accessible when no internet connection is available. For providing
translation support under this condition, we have developed a contextualized
off-line search engine that allows the users to continue using the app.
We present a large Spanish-Catalan parallel corpus extracted from ten years of the paper edition of a bilingual Catalan newspaper. The produced corpus of 7.5 M parallel sentences (around 180 M words per language) is useful for many natural language applications. We report excellent results when building a statistical machine translation system trained on this parallel corpus. The Spanish-Catalan corpus is partially available via ELDA (Evaluations and Language Resources Distribution Agency) in catalog number ELRA-W0053.
Machine translation can be considered a highly interdisciplinary and multidisciplinary field because it is approached from the point of view of human translators, engineers, computer scientists, mathematicians, and
linguists. One of the most popular approaches is the Statistical Machine Translation (SMT) approach, which tries to cover translation in a holistic manner by learning from parallel corpus aligned at the sentence level.
However, with this basic approach, there are some issues at each written linguistic level (i.e., orthographic, morphological, lexical, syntactic and semantic) that remain unsolved. Research in SMT has continuously been
focused on solving the challenges at the different linguistic levels. This article presents a survey of how SMT has been enhanced to perform translation correctly at all linguistic levels.
A nonlinear semantic mapping procedure is proposed for cross-language document retrieval. The method relies on a nonlinear space reduction technique for constructing semantic embeddings of multilingual document collections. In the proposed method, an independent embedding is constructed for each language in the multilingual collection and the similarities among the resulting semantic representations are used for cross-language document retrieval. Two variants of the proposed method are implemented and compared with a standard cross-language information retrieval technique. It is shown that the proposed method outperforms the conventional one.
When evaluating machine translation outputs, linguistics is usually taken into account implicitly. Annotators have to decide whether a sentence is better than another or not, using, for example, adequacy and fluency criteria or, as recently proposed, editing the translation output so that it has the same meaning as a reference translation and is understandable. Therefore, the important linguistic fields of meaning (semantics) and grammar (syntax) are indirectly considered. In this study, we propose to go one step further towards a linguistic human evaluation. The idea is to introduce linguistics implicitly by formulating precise guidelines. These guidelines strictly mark the difference between sub-fields of linguistics such as morphology, syntax, semantics, and orthography. We show that our guidelines have a high inter-annotator agreement and wide error coverage. Additionally, we examine how the linguistic human evaluation data correlate across different types of machine translation systems (rule-based and statistical) and with adequacy and fluency.
Formiga, L.; Costa-jussà, M. R.; Mariño, J.B.; Fonollosa, José A. R.; Barron-Cedeño, A.; Marquez, L. Workshop on Statistical Machine Translation p. 134-140 Presentation's date: 2013-08-08 Presentation of work at congresses
This paper describes the TALP participation in the WMT13 evaluation campaign. Our participation is based on the combination of several statistical machine translation systems built on standard phrase-based
Moses systems. Variations include techniques such as morphology generation, training sentence filtering, and domain adaptation through unit derivation.
The results show a coherent improvement on TER, METEOR, NIST, and BLEU scores when compared to our baseline system.
Although Chinese and Spanish are two of the most spoken languages in the world, not much research has been done in machine translation for this language pair. This paper focuses on investigating the state of the art of Chinese-to-Spanish statistical machine translation (SMT), which nowadays is one of the most popular approaches to machine translation. We conduct experimental work with the largest of these three corpora to explore alternative SMT strategies by means of using a pivot language. Three alternatives are considered for pivoting: cascading, pseudo-corpus and triangulation. As pivot language, we use either English, Arabic or French. Results show that, for a phrase-based SMT system, English is the best pivot language between Chinese and Spanish. We propose a system output combination using the pivot strategies which is capable of outperforming the direct translation strategy. The main objective of this work is to motivate and involve the research community to work on this important pair of languages given their demographic impact.
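Of the three pivoting strategies named above, triangulation is the easiest to show in a few lines. The Python sketch below uses one common formulation, in which the source-to-target phrase probability is obtained by summing over shared pivot phrases. The function name, the dictionary layout, and the toy probabilities are assumptions for illustration, not the paper's implementation.

```python
from collections import defaultdict


def triangulate(source_to_pivot, pivot_to_target):
    """Build a source-to-target table: p(t|s) = sum over pivots of p(t|piv) * p(piv|s).

    source_to_pivot maps (source, pivot) to p(pivot | source).
    pivot_to_target maps (pivot, target) to p(target | pivot).
    """
    # Index the second table by pivot phrase to avoid a full cross product.
    by_pivot = defaultdict(list)
    for (pivot, target), prob in pivot_to_target.items():
        by_pivot[pivot].append((target, prob))

    table = defaultdict(float)
    for (source, pivot), prob_sp in source_to_pivot.items():
        for target, prob_pt in by_pivot.get(pivot, []):
            table[(source, target)] += prob_sp * prob_pt
    return dict(table)


# Toy Chinese-English and English-Spanish tables (made-up probabilities).
zh_en = {("房子", "house"): 0.6, ("房子", "home"): 0.4}
en_es = {
    ("house", "casa"): 0.9,
    ("house", "vivienda"): 0.1,
    ("home", "casa"): 0.5,
    ("home", "hogar"): 0.5,
}
for pair, prob in sorted(triangulate(zh_en, en_es).items()):
    print(pair, round(prob, 2))
```

Cascading, by contrast, simply feeds the output of a Chinese-to-pivot system into a pivot-to-Spanish system, and the pseudo-corpus strategy translates the pivot side of a parallel corpus to create synthetic Chinese-Spanish training data.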
Pérez, J.; Bonafonte, A.; Costa-jussà, M. R.; Cardenal, A.; Fonollosa, José A. R.; Moreno, A.; Navas, E.; R.Banga, E. Jornadas en Tecnología del Habla and III Iberian SLTech Workshop p. 422-430 Presentation's date: 2012-11 Presentation of work at congresses
'Machine Translation (MT) is a highly interdisciplinary and multidisciplinary field since it is approached from the point of view of engineering, computer science, informatics, statistics and linguistics. Unfortunately, the cooperation and interaction among these fields in relation to MT technologies is still very low. The goal of this research project is to bring together the different profiles in the MT community by providing a new integrated MT paradigm which mainly includes linguistic technologies and statistical algorithms.
Basically, our research will be focused on the problem of dynamically integrating the two most popular MT paradigms: the rule-based and the statistical-based. We will use linguistic technologies developed either for the rule-based MT systems or other natural language processing tasks into statistical MT systems. Linguistic technologies include: bilingual dictionaries, transfer rules, statistical parsing, word sense disambiguation, morphological and syntactic analysis. The new paradigm will provide solutions to current MT challenges such as unknown words, reordering and semantic ambiguities.
The project will focus on the three most spoken languages in the world: Chinese, Spanish and English, and all translation combinations among them. These language pairs not only involve many economic and cultural interests, but also include some of the most relevant MT challenges such as morphological, syntactic and semantic variations.'
Adell, J.; Bonafonte, A.; Costa-jussà, M. R.; Cardenal, A.; Fonollosa, José A. R.; Moreno, A.; Navas, E.; R.Banga, E. International Conference on Language Resources and Evaluation p. 1705-1709 Presentation's date: 2012-05-24 Presentation of work at congresses
This paper presents a web-based multimedia search engine built within the Buceador (www.buceador.org) research project. A proof-of-concept tool has been implemented which is able to retrieve information from a digital library made of multimedia documents in the 4 official languages in Spain (Spanish, Basque, Catalan and Galician). The retrieved documents are presented in the user language after translation and dubbing (the four previous languages + English). The paper presents the tool functionality, the architecture, and the digital library, and provides some information about the technology involved in the fields of automatic speech recognition, statistical machine translation, text-to-speech synthesis and information retrieval. Each technology has been adapted to the purposes of the presented tool as well as to interact with the rest of the technologies involved.
Mariño, J.B.; Poch, M.; Costa-jussà, M. R.; Hernández, A.; Henríquez, C.; Fonollosa, José A. R.; Farrús, M. Language resources and evaluation Vol. 45, num. 2, p. 181-208 DOI: 10.1007/s10579-011-9137-0 Date of publication: 2011-02-20 Journal article
This work aims to improve an N-gram-based statistical machine translation
system between the Catalan and Spanish languages, trained with an aligned Spanish–
Catalan parallel corpus consisting of 1.7 million sentences taken from El Periódico newspaper. Starting from a linguistic error analysis of this baseline system,
orthographic, morphological, lexical, semantic and syntactic problems are approached
using a set of techniques. The proposed solutions include the development and application
of additional statistical techniques, text pre- and post-processing tasks, and rules
based on the use of grammatical categories, as well as lexical categorization. The
performance of the improved system is clearly increased, as is shown in both human and
automatic evaluations of the system, with a gain of about 1.1 points BLEU observed in
the Spanish-to-Catalan direction of translation, and a gain of about 0.5 points in the
reverse direction. The final system is freely available online as a linguistic resource.
Farrús, M.; Costa-jussà, M. R.; Mariño, J.B.; Fonollosa, José A. R. Annual Conference of the European Association for Machine Translation p. 167-173 Presentation's date: 2010-05-28 Presentation of work at congresses
Machine translation evaluation methods
are highly necessary in order to analyze the
performance of translation systems. Up to
now, the most traditional methods have been
automatic measures such as BLEU
and perceptual quality judgements made by native
human evaluators. In order to complement
these traditional procedures, the
current paper presents a new human evaluation
based on the expert knowledge about
the errors encountered at several linguistic
levels: orthographic, morphological, lexical,
semantic and syntactic. The results obtained
in these experiments show that some
linguistic errors could have more influence
than others when performing a perceptual evaluation.
This paper proposes to introduce a novel reordering model in the open-source Moses toolkit. The main idea is to provide
weighted reordering hypotheses to the SMT decoder. These hypotheses are built using a first-step Ngram-based SMT
translation from a source language into a third representation that is called reordered source language. Each hypothesis
has its own weight provided by the Ngram-based decoder. This proposed reordering technique offers a better and more
efficient translation when compared to both the distance-based and the lexicalized reordering. In addition to this reordering
approach, this paper describes a domain adaptation technique which is based on a linear combination of a specific in-domain
translation model and an extra out-of-domain translation model. Results for both approaches are reported in the Arabic-to-English
2008 IWSLT task. When implementing the weighted reordering hypotheses and the domain adaptation technique in the
final translation system, translation results reach improvements up to 2.5 BLEU compared to a standard state-of-the-art
Moses baseline system.
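The linear combination of in-domain and out-of-domain translation models described above can be sketched as a simple interpolation of phrase translation probabilities. The function name, the weight `lam`, and the toy phrase tables below are assumptions for illustration; the abstract does not give the weights or the exact features that were combined.

```python
def interpolate_models(in_domain, out_domain, lam=0.5):
    """Linearly combine two phrase tables: p = lam * p_in + (1 - lam) * p_out.

    Each table maps (source_phrase, target_phrase) to a probability.
    Pairs missing from one table contribute probability 0.0 for that table.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    combined = {}
    for pair in set(in_domain) | set(out_domain):
        p_in = in_domain.get(pair, 0.0)
        p_out = out_domain.get(pair, 0.0)
        combined[pair] = lam * p_in + (1.0 - lam) * p_out
    return combined


# Toy tables with made-up probabilities.
in_domain = {("casa", "house"): 0.8, ("casa", "home"): 0.2}
out_domain = {("casa", "house"): 0.5, ("casa", "building"): 0.5}
for pair, prob in sorted(interpolate_models(in_domain, out_domain, lam=0.7).items()):
    print(pair, round(prob, 2))
```

A weight close to 1 trusts the small in-domain model, while a lower weight lets the larger out-of-domain model fill in coverage. In practice the weight would be tuned on held-out in-domain data.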
Farrús, M.; Costa-jussà, M. R.; Mariño, J.B.; Fonollosa, José A. R. International Conference on Language Resources and Evaluation p. 1707-1711 Presentation's date: 2010-05-20 Presentation of work at congresses
Machine translation systems can be classified into rule-based and corpus-based approaches, in terms of their core technology.
Since both paradigms have largely been used during the last years, one of the aims in the research community is to
know how these systems differ in terms of translation quality. To this end, this paper reports a study and comparison of a
rule-based and a corpus-based (specifically, statistical) Catalan-Spanish machine translation system, both of them freely
available on the web. The translation quality analysis is performed under two different domains: journalistic and medical.
The systems are evaluated by using standard automatic measures, as well as by native human evaluators. Automatic results
show that the statistical system performs better than the rule-based system. Human judgements show that in the Spanish-to-Catalan
direction the statistical system also performs better than the rule-based system, while in the Catalan-to-Spanish
direction it is the other way round. Although the statistical system obtains the best automatic scores, its errors tend to be more
penalized by human judgements than the errors of the rule-based system. This can be explained because statistical errors
are usually unexpected and they do not follow any pattern.
Costa-jussà, M. R.; Fonollosa, José A. R. IEICE transactions on information and systems Vol. E92-D, num. 11, p. 2179-2185 DOI: 10.1587/transinf.E92.D.2179 Date of publication: 2009-11-01 Journal article
This paper surveys several state-of-the-art reordering techniques employed in Statistical Machine Translation systems. Reordering is understood as the word-order redistribution of the translated words. In original SMT systems, this different order is only modeled within the limits of translation units. Relying only on the reordering provided by translation units may not be good enough in most language pairs, which might require longer reorderings. Therefore, additional techniques may be deployed to face the reordering challenge. The Statistical Machine Translation community has been very active recently in developing reordering techniques. This paper gives a brief survey and classification of several well-known reordering approaches.
Poch, M.; Farrús, M.; Costa-jussà, M. R.; Mariño, J.B.; Hernández, A.; Henríquez, C.; Fonollosa, José A. R. Joint SIG-IL/Microsoft Workshop on Speech and Language Technologies for Iberian Languages p. 105 Presentation's date: 2009-09-04 Presentation of work at congresses
This paper describes in detail a novel approach to the reordering challenge in statistical machine translation (SMT).
This Ngram-based reordering (NbR) approach uses the powerful techniques of SMT systems to generate a weighted reordering graph. Thus, reordering constraints based on statistical criteria are supplied to an SMT system, and this allows an extension to the SMT decoding search.
The NbR approach is capable of generalizing reorderings that have been learned during training, through the use of word classes instead of words themselves.
Improvement in translation performance is demonstrated with the EPPS task (Spanish and German to English) and the BTEC task (Arabic to English).
Best 2009 article published in an international journal with a young researcher from a Spanish university as first author; awarded by the Red Temática de Tecnologías del Habla (Spanish Thematic Network on Speech Technologies)
Fonollosa, José A. R.; Khalilov, M.; Costa-jussà, M. R.; Henríquez, C.; Hernández, A.; Banchs, R. E. Association for Computational Linguistics. European Chapter. Conference p. 85-89 Presentation's date: 2009-04-01 Presentation of work at congresses
This study presents the TALP-UPC submission
to the EACL Fourth Workshop on Statistical Machine Translation 2009 evaluation campaign. It outlines the architecture and configuration of the 2009 phrase-based statistical machine translation (SMT) system, putting emphasis on the major novelty of this year: combination of SMT systems implementing different word reordering algorithms. Traditionally, we have concentrated on the Spanish-to-English and English-to-Spanish News Commentary translation tasks.
Lambert, P.; Gimenez, J.; Amigó, E.; Banchs, R.; Marquez, L.; Fonollosa, José A. R.; Costa-jussà, M. R. 2006 IEEE/ACL Workshop on Spoken Language Technology p. 246-249 Presentation of work at congresses
TC-STAR is envisioned as a long-term effort focused on advanced research in all core technologies for speech-to-speech translation (SST): speech recognition, speech translation and speech synthesis. The objectives of the project are extremely ambitious: making a breakthrough in SST research to significantly reduce the gap between human and machine performance. The focus will be on the development of new, possibly revolutionary, algorithms and methods, integrating relevant human knowledge which is available at translation time into a data-driven framework. TC-STAR is planned for a duration of six years.
The first three years will target a selection of unconstrained conversational speech domains (i.e., broadcast news and speeches) and a few languages: native and non-native European English, European Spanish and Chinese. The second three years will target more complex unconstrained conversational speech domains and European languages. Key actions will be: (i) the implementation of an evaluation infrastructure, (ii) the creation of a technological infrastructure, and (iii) the support of knowledge dissemination.