Scientific and technological production

1 to 50 of 136 results
  • Introducción a la tarea compartida Tweet-Norm 2013: Normalización léxica de tuits en español  Open access

     Alegria, Iñaki; Aranberri, Nora; Fresno, Víctor; Gamallo, Pablo; Padró Cirera, Lluís; San Vicente Roncal, Iñaki; Turmo Borras, Jorge; Zubiaga, Arkaitz
    Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural
    Presentation's date: 2013-09-20
    Presentation of work at congresses


    This article presents an introduction to the Tweet-Norm 2013 shared task: description, corpora, annotation, preprocessing, participating systems, and the results obtained.

  • The TALP-UPC approach to Tweet-Norm 2013

     Ageno Pulido, Alicia; Comas Umbert, Pere Ramon; Padró Cirera, Lluís; Turmo Borras, Jorge
    Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural
    Presentation's date: 2013-09-20
    Presentation of work at congresses


    This paper describes the methodology used by the TALP-UPC team for the SEPLN 2013 shared task of tweet normalization (Tweet-Norm). The system uses a set of modules that propose different corrections for each out-of-vocabulary word. The final correction is chosen by weighted voting according to each module's accuracy.
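
    A minimal sketch of the accuracy-weighted voting scheme described in the entry above: each normalization module proposes candidate corrections for an out-of-vocabulary token, and candidates are scored by the accuracy of the modules that proposed them. Module names and accuracy values are invented for illustration; this is not the TALP-UPC implementation.

        from collections import defaultdict

        # Estimated accuracy of each normalization module (values are invented).
        module_accuracy = {"edit_distance": 0.71, "phonetic": 0.64, "slang_lexicon": 0.82}

        def vote(proposals):
            """proposals maps module name -> candidate corrections for one OOV token."""
            scores = defaultdict(float)
            for module, candidates in proposals.items():
                for candidate in candidates:
                    scores[candidate] += module_accuracy.get(module, 0.0)
            return max(scores, key=scores.get) if scores else None

        # Two modules agree on "que", so their combined weight beats the single heavier vote.
        print(vote({"edit_distance": ["que"], "phonetic": ["que"], "slang_lexicon": ["porque"]}))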

  • A Constraint-Based Hypergraph Partitioning Approach to Coreference Resolution

     Sapena Masip, Emilio; Padró Cirera, Lluís; Turmo Borras, Jorge
    Computational linguistics
    Date of publication: 2013-12
    Journal article


    This work is focused on research in machine learning for coreference resolution. Coreference resolution is a natural language processing task that consists of determining the expressions in a discourse that refer to the same entity. The main contributions of this article are (i) a new approach to coreference resolution based on constraint satisfaction, using a hypergraph to represent the problem and solving it by relaxation labeling; and (ii) research towards improving coreference resolution performance using world knowledge extracted from Wikipedia. The developed approach is able to use an entity-mention classification model with more expressiveness than pair-based models, and overcome the weaknesses of previous approaches in the state of the art such as linking contradictions, classifications without context, and lack of information when evaluating pairs. Furthermore, the approach allows the incorporation of new information by adding constraints, and research has been done in order to use world knowledge to improve performance. RelaxCor, the implementation of the approach, achieved results at the state-of-the-art level, and participated in international competitions: SemEval-2010 and CoNLL-2011. RelaxCor achieved second place in CoNLL-2011.
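
    As an illustration of the resolution machinery described above, the sketch below runs a generic, textbook-style relaxation labeling update over a toy pairwise constraint graph: each mention keeps a probability distribution over candidate entity labels, and weighted constraints between mentions iteratively push those distributions towards a consistent partition. The constraint weights and the tiny example are invented, and the real RelaxCor system works on hypergraphs with far richer constraints; this is only a sketch of the general technique.

        def relaxation_labeling(num_mentions, num_labels, constraints, iters=100):
            """constraints: list of (mention_i, mention_j, weight); a positive weight
            favours the two mentions sharing a label, a negative one penalises it."""
            prob = [[1.0 / num_labels] * num_labels for _ in range(num_mentions)]
            for i in range(num_mentions):          # small bias to break the symmetric start
                prob[i][i % num_labels] += 0.1
                z = sum(prob[i])
                prob[i] = [p / z for p in prob[i]]
            for _ in range(iters):
                new = []
                for i in range(num_mentions):
                    support = [0.0] * num_labels   # evidence for each label of mention i
                    for a, b, w in constraints:
                        if i not in (a, b):
                            continue
                        other = b if i == a else a
                        for l in range(num_labels):
                            support[l] += w * prob[other][l]
                    updated = [max(prob[i][l] * (1.0 + support[l]), 0.0) for l in range(num_labels)]
                    z = sum(updated) or 1.0
                    new.append([u / z for u in updated])
                prob = new
            return prob

        # Three mentions, two candidate entities: mentions 0 and 1 attract, 0 and 2 repel.
        print(relaxation_labeling(3, 2, [(0, 1, 0.8), (0, 2, -0.8)]))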

  • Learning Finite-State Machines: Statistical and Algorithmic Aspects  Open access

     De Balle Pigem, Borja
    Defense's date: 2013-07-12
    Department of Software, Universitat Politècnica de Catalunya
    Theses


    The present thesis addresses several machine learning problems on generative and predictive models for sequential data. All the models considered have in common that they can be defined in terms of finite-state machines. In one line of work we study algorithms for learning the probabilistic analog of Deterministic Finite Automata (DFA). This provides a fairly expressive generative model for sequences with very interesting algorithmic properties. State-merging algorithms for learning these models can be interpreted as a divisive clustering scheme where the "dependency graph" between clusters is not necessarily a tree. We characterize these algorithms in terms of statistical queries and use this characterization to prove a lower bound with an explicit dependency on the distinguishability of the target machine. In a more realistic setting, we give an adaptive state-merging algorithm satisfying the stringent algorithmic constraints of the data-streams computing paradigm. Our algorithms come with strict PAC learning guarantees. At the heart of state-merging algorithms lies a statistical test for distribution similarity. In the streaming version this is replaced with a bootstrap-based test which yields faster convergence in many situations. We also studied a wider class of models for which the state-merging paradigm also yields PAC learning algorithms. Applications of this method are given to continuous-time Markovian models and stochastic transducers on pairs of aligned sequences. The main tools used for obtaining these results include a variety of concentration inequalities and sketching algorithms. In another line of work we contribute to the rapidly growing body of spectral learning algorithms. The main virtues of this type of algorithm include the possibility of proving finite-sample error bounds in the realizable case and enormous savings in computing time over iterative methods like Expectation-Maximization. In this thesis we give the first application of this method to learning conditional distributions over pairs of aligned sequences defined by probabilistic finite-state transducers. We also prove that the method can learn the whole class of probabilistic automata, thus extending the class of models previously known to be learnable with this approach. In the last two chapters we present work combining spectral learning with methods from convex optimization and matrix completion. Respectively, these yield an alternative interpretation of spectral learning and an extension to cases with missing data. In the latter case we use a novel joint stability analysis of matrix completion and spectral learning to prove the first generalization bound for this type of algorithm that holds in the non-realizable case. Work in this area has been motivated by connections between spectral learning, classic automata theory, and statistical learning; tools from all three areas have been used.
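
    At the heart of the state-merging algorithms mentioned above lies a statistical test for distribution similarity. The sketch below shows one classical variant of such a test (a Hoeffding-style bound on differences of empirical next-symbol frequencies, in the style of ALERGIA); the thesis itself develops different, distinguishability-aware and bootstrap-based tests, so this is only an illustration of the general idea.

        import math

        def similar(count_a, count_b, delta=0.05):
            """Hoeffding-style test: are the empirical next-symbol distributions of two
            candidate states indistinguishable at confidence 1 - delta?"""
            n_a, n_b = sum(count_a.values()), sum(count_b.values())
            if n_a == 0 or n_b == 0:
                return True  # too little evidence to reject a merge
            bound = (math.sqrt(1.0 / n_a) + math.sqrt(1.0 / n_b)) * math.sqrt(0.5 * math.log(2.0 / delta))
            symbols = set(count_a) | set(count_b)
            return all(abs(count_a.get(s, 0) / n_a - count_b.get(s, 0) / n_b) <= bound
                       for s in symbols)

        print(similar({"a": 40, "b": 10}, {"a": 40, "b": 12}))  # True: frequencies are close
        print(similar({"a": 40, "b": 10}, {"a": 5, "b": 45}))   # False: states should not merge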

  • Highlighting relevant concepts from Topic Signatures  Open access

     Cuadros Oller, Montserrat; Padró Cirera, Lluís; Rigau Claramunt, German
    International Conference on Language Resources and Evaluation
    Presentation's date: 2012-05-24
    Presentation of work at congresses


    This paper presents deepKnowNet, a new fully automatic method for building highly dense and accurate knowledge bases from existing semantic resources. Basically, the method applies a knowledge-based Word Sense Disambiguation algorithm to assign the most appropriate WordNet sense to large sets of topically related words acquired from the web, named TSWEB. This Word Sense Disambiguation algorithm is the personalized PageRank algorithm implemented in UKB. This new method improves by automatic means the current content of WordNet by creating large volumes of new and accurate semantic relations between synsets. KnowNet was our first attempt towards the acquisition of large volumes of semantic relations, but it had some limitations that have been overcome with deepKnowNet. deepKnowNet disambiguates the first hundred words of all Topic Signatures from the web (TSWEB). In this case, the method highlights the most relevant word senses of each Topic Signature and filters out the ones that are not closely related to the topic. In fact, the knowledge it contains outperforms any other resource when it is empirically evaluated in a common framework based on a similarity task annotated with human judgements.
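
    The disambiguation step described above relies on the personalized PageRank algorithm as implemented in UKB. The sketch below is a generic power-iteration version of personalized PageRank on a toy sense graph; the graph, the seed weights, and the parameter values are illustrative assumptions and are unrelated to UKB's actual data structures.

        def personalized_pagerank(graph, seeds, damping=0.85, iters=50):
            """graph: node -> list of neighbour nodes; seeds: node -> teleport mass."""
            nodes = list(graph)
            total = sum(seeds.values())
            teleport = {n: seeds.get(n, 0.0) / total for n in nodes}
            rank = dict(teleport)
            for _ in range(iters):
                new = {n: (1.0 - damping) * teleport[n] for n in nodes}
                for n, neighbours in graph.items():
                    if not neighbours:
                        continue
                    share = damping * rank[n] / len(neighbours)
                    for m in neighbours:
                        new[m] += share
                rank = new
            return rank

        # Toy sense graph; mass injected at the context words' senses spreads to related senses.
        toy = {"sense1": ["sense2"], "sense2": ["sense1", "sense3"], "sense3": ["sense2"]}
        print(personalized_pagerank(toy, seeds={"sense1": 1.0}))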

  • FreeLing 3.0: Towards Wider Multilinguality  Open access

     Padró Cirera, Lluís; Stanilovsky, Evgeny
    International Conference on Language Resources and Evaluation
    Presentation's date: 2012-05-24
    Presentation of work at congresses


    FreeLing is an open-source multilingual language processing library providing a wide range of analyzers for several languages. It offers text processing and language annotation facilities to NLP application developers, lowering the cost of building those applications. FreeLing is customizable, extensible, and has a strong orientation to real-world applications in terms of speed and robustness. Developers can use the default linguistic resources (dictionaries, lexicons, grammars, etc.), extend/adapt them to specific domains, or –since the library is open source– develop new ones for specific languages or special application needs. This paper describes the general architecture of the library, presents the major changes and improvements included in FreeLing version 3.0, and summarizes some relevant industrial projects in which it has been used.

  • Factoid Question Answering for Spoken Documents  Open access

     Comas Umbert, Pere Ramon
    Defense's date: 2012-06-12
    Department of Software, Universitat Politècnica de Catalunya
    Theses


    In this dissertation, we present a factoid question answering system specifically tailored for Question Answering (QA) on spoken documents. This work explores, for the first time, which techniques can be robustly adapted from the usual QA on written documents to the more difficult spoken-document scenario. More specifically, we study new information retrieval (IR) techniques designed for speech, and utilize several levels of linguistic information for the speech-based QA task. These include named-entity detection with phonetic information, syntactic parsing applied to speech transcripts, and the use of coreference resolution. Our approach is largely based on supervised machine learning techniques, with special focus on the answer extraction step, and makes little use of handcrafted knowledge. Consequently, it should be easily adaptable to other domains and languages. As part of the work behind this thesis, we promoted and coordinated the creation of an evaluation framework for the task of QA on spoken documents. The framework, named QAst, provides multilingual corpora, evaluation questions, and answer keys. These corpora were used in the QAst evaluations held at the CLEF workshop in 2007, 2008 and 2009, thus helping the development of state-of-the-art techniques for this particular topic. The presented QA system and all its modules are extensively evaluated on the European Parliament Plenary Sessions English corpus, composed of manual transcripts and automatic transcripts obtained by three different Automatic Speech Recognition (ASR) systems that exhibit significantly different word error rates. This data belongs to the CLEF 2009 track for QA on speech transcripts. The main results confirm that syntactic information is very useful for learning to rank answer candidates, improving results on both manual and automatic transcripts unless the ASR quality is very low. Overall, the performance of our system is comparable to or better than the state of the art on this corpus, confirming the validity of our approach.


  • A constraint-based hypergraph partitioning approach to coreference resolution  Open access

     Sapena Masip, Emili
    Defense's date: 2012-05-16
    Department of Software, Universitat Politècnica de Catalunya
    Theses


    The objectives of this thesis are focused on research in machine learning for coreference resolution. Coreference resolution is a natural language processing task that consists of determining the expressions in a discourse that mention or refer to the same entity. The main contributions of this thesis are (i) a new approach to coreference resolution based on constraint satisfaction, using a hypergraph to represent the problem and solving it by relaxation labeling; and (ii) research towards improving coreference resolution performance using world knowledge extracted from Wikipedia. The developed approach is able to use an entity-mention classification model with more expressiveness than pair-based models, and overcomes the weaknesses of previous approaches in the state of the art such as linking contradictions, classifications without context, and lack of information when evaluating pairs. Furthermore, the approach allows the incorporation of new information by adding constraints, and research has been done in order to use world knowledge to improve performance. RelaxCor, the implementation of the approach, achieved state-of-the-art results and participated in international competitions: SemEval-2010 and CoNLL-2011. RelaxCor achieved second place in CoNLL-2011.

    Coreference resolution is a natural language processing task that consists of determining the expressions in a discourse that refer to the same real-world entity. The task has a direct impact on text mining, as well as on many natural language tasks that require discourse interpretation, such as summarization, question answering, or machine translation. Resolving coreference is essential to "understanding" a text or a discourse. The objectives of this thesis centre on research in coreference resolution with machine learning, specifically in the following areas. Classification models: the most common classification models in the state of the art are based on the independent classification of pairs of mentions, and models that classify groups of mentions have appeared more recently; one objective of the thesis is to incorporate the entity-mention model into the developed approach. Problem representation: there is still no definitive representation of the problem; this thesis presents a hypergraph representation. Resolution algorithms: depending on the problem representation and the classification model, resolution algorithms can vary widely; one objective of this thesis is to find a resolution algorithm able to use the classification models on the hypergraph representation. Knowledge representation: in order to manage knowledge from several sources, a symbolic and expressive representation of that knowledge is needed; this thesis proposes the use of constraints. Incorporation of world knowledge: some coreferences cannot be resolved with linguistic information alone, since common sense and world knowledge are often required; this thesis proposes a method to extract world knowledge from Wikipedia and incorporate it into the resolution system. The main contributions of this thesis are (i) a new approach to coreference resolution based on constraint satisfaction, using a hypergraph to represent the problem and solving it with relaxation labeling; and (ii) research on improving the results by adding world knowledge extracted from Wikipedia. The presented approach can combine the mention-pair and entity-mention models, thus avoiding problems found in many other state-of-the-art approaches, such as contradictions between independent classifications, lack of context, and lack of information. Moreover, the approach allows new information to be incorporated by adding constraints, and research has been carried out on adding world knowledge that improves the results. RelaxCor, the system implemented during the thesis to experiment with the proposed approach, achieved results comparable to the best in the state of the art and participated in the international competitions SemEval-2010 and CoNLL-2011, obtaining second place in CoNLL-2011.

  • A hybrid approach to treebank construction  Open access

     Marimon, Montserrat; Padró Cirera, Lluís
    Procesamiento del lenguaje natural
    Date of publication: 2012-09
    Journal article


    This article describes research on the effects of morphosyntactic disambiguation used as a preprocessing step for a deep HPSG-based parser, in the context of the development of an open-source Spanish treebank within the DELPH-IN framework. Treebank annotation is performed manually by choosing among the options proposed by the system, which are ranked by a statistical module. The experiments presented show that using a tagger reduces sentence ambiguity, helps limit the number of sentences whose analysis exceeds the time limit, and helps the statistical module rank the correct tree among the best candidates. On the one hand, our results validate the benefits, already reported in the literature, of such preprocessing for deep parsing in terms of speed, coverage, and accuracy. On the other hand, we propose a strategy based on existing open-source tools and resources to develop highly consistent deep-syntax treebanks for languages with limited availability of linguistic resources.

  • RelaxCor participation in CoNLL shared task on coreference resolution

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    Conference on Computational Natural Language Learning
    Presentation of work at congresses


    This paper describes the participation of RelaxCor in the CoNLL-2011 shared task: "Modeling Unrestricted Coreference in OntoNotes". RelaxCor is a constraint-based graph partitioning approach to coreference resolution solved by relaxation labeling. The approach combines the strengths of groupwise classifiers and chain formation methods in one global method.

  • Analizadores Multilingües en FreeLing  Open access

     Padró Cirera, Lluís
    Linguamática
    Date of publication: 2011-12
    Journal article


    FreeLing is an open-source library for automatic multilingual processing that provides a wide range of linguistic analysis services for several languages. FreeLing offers Natural Language Processing application developers text analysis and linguistic annotation functions, with the consequent reduction in the cost of building such applications. FreeLing is customizable and extensible, and is strongly oriented towards real-world applications in terms of speed and robustness. Developers can use the default linguistic resources (dictionaries, lexicons, grammars, etc.), extend them, adapt them to particular domains, or, since the library is open source, develop new ones for specific languages or special application needs. This article presents the main changes and improvements included in FreeLing version 3.0, and summarizes some relevant industrial projects in which it has been used.

  • Multilingual Acquisition of Large Scale Knowledge Resources

     Cuadros Oller, Montserrat
    Defense's date: 2011-11-22
    Department of Software, Universitat Politècnica de Catalunya
    Theses


  • A global relaxation labeling approach to coreference resolution

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    International Conference on Computational Linguistics
    Presentation's date: 2010-08
    Presentation of work at congresses


    This paper presents a constraint-based graph partitioning approach to coreference resolution solved by relaxation labeling. The approach combines the strengths of groupwise classifiers and chain formation methods in one global method. Experiments show that our approach significantly outperforms systems based on separate classification and chain formation steps, and that it achieves the best results in the state of the art for the same dataset and metrics.

  • Semantic Services in FreeLing 2.1: WordNet and UKB  Open access

     Padró Cirera, Lluís; Reese, Samuel; Agirre, Eneko; Soroa, Aitor
    International WordNet Conference
    Presentation's date: 2010-02
    Presentation of work at congresses


    FreeLing is an open-source multilingual language processing library providing a wide range of language analyzers for several languages. It offers text processing and language annotation facilities to natural language processing application developers, simplifying the task of building those applications. FreeLing is customizable and extensible. Developers can use the default linguistic resources (dictionaries, lexicons, grammars, etc.) directly, or extend them, adapt them to specific domains, or even develop new ones for specific languages. This paper presents the semantic services included in FreeLing, which are based on WordNet and EuroWordNet databases. The recent release of the UKB program under a GPL license made it possible to integrate a long-awaited word sense disambiguation module into FreeLing. UKB provides state-of-the-art all-words sense disambiguation for any language with an available WordNet.

  • Word-sense disambiguated multilingual Wikipedia corpus  Open access

     Reese, Samuel; Boleda Torrent, Gemma; Cuadros Oller, Montserrat; Padró Cirera, Lluís; Rigau Claramunt, German
    International Conference on Language Resources and Evaluation
    Presentation's date: 2010-05
    Presentation of work at congresses


    This article presents a new freely available trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia and has been automatically enriched with linguistic information. To our knowledge, this is the largest such corpus that is freely available to the community: In its present version, it contains over 750 million words. The corpora have been annotated with lemma and part of speech information using the open source library FreeLing. Also, they have been sense annotated with the state of the art Word Sense Disambiguation algorithm UKB. As UKB assigns WordNet senses, and WordNet has been aligned across languages via the InterLingual Index, this sort of annotation opens the way to massive explorations in lexical semantics that were not possible before. We present a first attempt at creating a trilingual lexical resource from the sense-tagged Wikipedia corpora, namely, WikiNet. Moreover, we present two by-products of the project that are of use for the NLP community: An open source Java-based parser for Wikipedia pages developed for the construction of the corpus, and the integration of the WSD algorithm UKB in FreeLing.

  • Language technology challenges of a small language (Catalan)

     Melero, Maite; Boleda Torrent, Gemma; Cuadros Oller, Montserrat; España Bonet, Cristina; Padró Cirera, Lluís; Quixal, Martí; Rodríguez, Carlos; Saurí, Roser
    International Conference on Language Resources and Evaluation
    Presentation's date: 2010-05
    Presentation of work at congresses


    In this paper, we present a brief snapshot of the state of affairs in computational processing of Catalan and the initiatives that are starting to take place in an effort to bring the field a step forward, by making a better and more efficient use of the already existing resources and tools, by bridging the gap between research and market, and by establishing periodical meeting points for the community. In particular, we present the results of the First Workshop on the Computational Processing of Catalan, which succeeded in putting together a fair representation of the research in the area, and received attention from both the industry and the administration. Aside from facilitating communication among researchers and between developers and users, the Workshop provided the organizers with valuable information about existing resources, tools, developers and providers. This information has allowed us to go a step further by setting up a “harvesting” procedure which will hopefully build the seed of a portal-catalogue-observatory of language resources and technologies in Catalan

  • FreeLing 2.1: Five Years of Open-Source Language Processing Tools  Open access

     Padró Cirera, Lluís; Collado, Miquel; Reese, Samuel; Lloberes Salvatella, Marina; Castellón Masalles, Irene
    International Conference on Language Resources and Evaluation
    Presentation's date: 2010-05
    Presentation of work at congresses


    FreeLing is an open-source multilingual language processing library providing a wide range of language analyzers for several languages. It offers text processing and language annotation facilities to natural language processing application developers, simplifying the task of building those applications. FreeLing is customizable and extensible. Developers can use the default linguistic resources (dictionaries, lexicons, grammars, etc.) directly, or extend them, adapt them to specific domains, or even develop new ones for specific languages. This paper overviews the recent history of this tool, summarizes the improvements and extensions incorporated in the latest version, and depicts the architecture of the library. Special focus is brought to the fact and consequences of the library being open-source: After five years and over 35,000 downloads, a growing user community has extended the initial three languages (English, Spanish and Catalan) to eight (adding Galician, Italian, Welsh, Portuguese, and Asturian), proving that the collaborative open model is a productive approach for the development of NLP tools and resources.

  • RelaxCor : a global relaxation labeling approach to coreference resolution

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    International Workshop on Semantic Evaluations
    Presentation's date: 2010
    Presentation of work at congresses


    This paper describes the participation of RelaxCor in the SemEval-2010 task number 1: "Coreference Resolution in Multiple Languages". RelaxCor is a constraint-based graph partitioning approach to coreference resolution solved by relaxation labeling. The approach combines the strengths of groupwise classifiers and chain formation methods in one global method.

  • Multilingual On-Line Translation

     Rodriguez Hontoria, Horacio; Gonzalez Bermudez, Meritxell; España Bonet, Cristina; Farwell, David Loring; Carreras Perez, Xavier; Xambó Descamps, Sebastian; Màrquez Villodre, Lluís; Padró Cirera, Lluís; Saludes Closa, Jordi
    Participation in a competitive project


  • FreeLing: From a multilingual open-source analyzer suite to an EBMT platform.  Open access

     Padró Cirera, Lluís; Farwell, David Loring
    Workshop on Example-Based Machine Translation
    Presentation's date: 2009-11
    Presentation of work at congresses


    FreeLing is an open-source library providing a wide range of language analysis utilities for several different languages. It is intended to provide NLP application developers with any text processing and language annotation tools they may need in order to simplify their development task. Moreover, FreeLing is customizable and extensible. Developers can use the default linguistic resources (dictionaries, lexicons, grammars, etc.), or extend them, adapt to particular domains, or even develop new resources for specific languages. Being open-source has enabled FreeLing to grow far beyond its original capabilities, especially with regard to linguistic data: contributions from its community of users, for instance, include morphological dictionaries and PoS tagger training data for Galician, Italian, Portuguese, Asturian, and Welsh. In this paper we present the basic architecture and the main services in FreeLing, and we outline how developers might use it to build competitive NLP systems and indicate how it might be extended to support the development of Example-Based Machine Translation systems.

  • Sobre la I Jornada del Processament Computacional del Català

     Boleda Torrent, Gemma; Cuadros Oller, Montserrat; España Bonet, Cristina; Melero, Maite; Padró Cirera, Lluís; Quixal, Martí; Rodríguez, Carlos
    Llengua i ús
    Date of publication: 2009
    Journal article


    Computational language processing encompasses any activity related to the creation, management, and use of language technology and linguistic resources. At the scientific level, this activity is central to disciplines such as corpus linguistics, language engineering, and written or spoken natural language processing. In everyday life, such processing is included in an ever wider range of increasingly common applications: automatic telephone answering systems, machine translation, etc. The vast majority of these applications require language-specific tools and linguistic resources. For languages with a large market, such as English or Spanish, the supply of products and services based on language technology is varied and commonplace. For languages such as Catalan, it is harder to find products and services that come with this technology "out of the box". In order to reflect the current state of language technologies applied to Catalan, to bring the members of this community into contact, and to promote initiatives that strengthen them, the first Jornada del Processament Computacional del Català (JPCC) was held in March 2009 at the Palau Robert in Barcelona. The Jornada aimed to become both a meeting point and a showcase for the research groups in the area, and to open the debate on how to organize the community so as to promote the use and development of Catalan both in language technology and in the products and services that depend on it. This article presents a summary of the content and conclusions of the Jornada.

  • Primera Jornada del Processament Computacional del Català

     Boleda Torrent, Gemma; Cuadros Oller, Montserrat; España Bonet, Cristina; Melero, Maite; Padró Cirera, Lluís; Quixal, Martí; Rodríguez, Carlos
    Procesamiento del lenguaje natural
    Date of publication: 2009-09
    Journal article


    We present the conclusions of the first Jornada del Processament Computacional del Català, held in Barcelona in March 2009.

  • A Graph Partitioning Approach to Coreference Resolution

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    Date: 2009-01
    Report


  • KNOW: Developing large-scale multilingual technologies for language understanding  Open access

     Agirre, Eneko; Rigau Claramunt, German; Castellón Masalles, Irene; Alonso, Laura; Padró Cirera, Lluís; Cuadros Oller, Montserrat; Climent, Salvador; Coll-Florit, Marta
    Procesamiento del lenguaje natural
    Date of publication: 2009-09
    Journal article


    The KNOW project aims to add meaning, knowledge, and reasoning to current Natural Language Processing technologies.

  • El català i les tecnologies de la llengua

     Boleda Torrent, Gemma; Cuadros Oller, Montserrat; España Bonet, Cristina; Melero, Maite; Padró Cirera, Lluís; Quixal, Martí; Rodríguez, Carlos
    Llengua, Societat i Comunicació
    Date of publication: 2009
    Journal article


    Computational language processing encompasses any activity related to the creation, management, and use of language technology and linguistic resources. At the scientific level, this activity is central to disciplines such as corpus linguistics, language engineering, and written or spoken natural language processing. In everyday life, it is included in an ever wider range of increasingly common applications: automatic telephone answering systems, machine translation, etc. The vast majority of these applications require language-specific tools and linguistic resources. For languages with a large market, such as English or Spanish, the supply of products and services based on language technology is varied and commonplace. For languages such as Catalan, it is harder to find products and services that come with this technology "out of the box". This article presents an overview of the current state of language technologies for Catalan, as well as several issues currently debated within the scientific community devoted to spoken and written natural language processing.

  • Sistema de recomendación para un uso inclusivo del lenguaje

     Fuentes Fort, Maria; Padró Cirera, Lluís; Padró Cirera, Muntsa; Turmo Borras, Jorge; Carrera, Jordi
    Procesamiento del lenguaje natural
    Date of publication: 2009-03
    Journal article


    This paper presents a system that processes Spanish text and detects non-inclusive uses of language. For each suspicious noun phrase, the system proposes a series of alternatives. The system also supports the automatic acquisition of positive examples from documents that make inclusive use of the language; these examples, together with their context, are used when presenting suggestions.
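
    A toy illustration of the kind of suggestion the recommender described above produces: flag noun phrases that use a masculine generic and offer inclusive alternatives. The phrase list and alternatives below are invented placeholders, far simpler than the real system's linguistic analysis.

        import re

        # Invented lexicon of masculine-generic noun phrases and inclusive alternatives.
        alternatives = {
            "los alumnos": ["el alumnado", "los alumnos y las alumnas"],
            "los profesores": ["el profesorado", "las personas docentes"],
        }

        def suggest(text):
            """Return (suspicious phrase, suggested alternatives) pairs found in the text."""
            found = []
            for phrase, options in alternatives.items():
                if re.search(r"\b" + re.escape(phrase) + r"\b", text, flags=re.IGNORECASE):
                    found.append((phrase, options))
            return found

        print(suggest("Los alumnos deben entregar el trabajo a los profesores."))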

  • A graph partitioning approach to entity disambiguation using uncertain information

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    Lecture notes in artificial intelligence
    Date of publication: 2008-10
    Journal article


  • Dependency Grammars in FreeLing

     Carrera, Jordi; Castellón Masalles, Irene; Lloberes Salvatella, Marina; Padró Cirera, Lluís; Tinkova Tincheva, Nevena
    Procesamiento del lenguaje natural
    Date of publication: 2008-09
    Journal article


  • Applying Causal-State Splitting Reconstruction Algorithm to Natural Language Processing Tasks

     Padro Cirera, Montserrat
    Defense's date: 2008-07-18
    Department of Software, Universitat Politècnica de Catalunya
    Theses


  • EuroOpenTrad: Traducción Automática de Código Abierto para la integración europea de las Lenguas del Estado Español

     Pichel ., Jose Ramon; Padró Cirera, Lluís
    Participation in a competitive project


  • A Question-driven Information System adaptable via Information Extraction techniques

     Sapena Masip, Emili; González, Manuel; Padró Cirera, Lluís; Turmo Borras, Jorge
    Date: 2008-06
    Report


  • Coreference Resolution Survey

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    Date: 2008-12
    Report


  • Recomendador t-incluye para un uso inclusivo del lenguaje

     Fuentes Gomez, Melchor; Padró Cirera, Lluís; Turmo Borras, Jorge
    Date: 2008-11
    Report


  • A graph partitioning approach to entity disambiguation using uncertain information

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    6th International Conference Advances in Natural Language Processing
    Presentation's date: 2008
    Presentation of work at congresses


    This paper presents a method for Entity Disambiguation in Information Extraction from different sources on the web. Once entities and relations between them are extracted, it is necessary to determine which ones refer to the same real-world entity. We model the problem as a graph partitioning problem in order to combine the available information more accurately than a pairwise classifier. Moreover, our method handles uncertain information, which turns out to be quite helpful. Two algorithms are trained and compared, one probabilistic and the other deterministic. Both are tuned using genetic algorithms to find the best weights for the set of constraints. Experiments show that graph-based modeling yields better results using uncertain information.
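
    The entry above tunes constraint weights with genetic algorithms. The sketch below is a compact, generic genetic search over real-valued weight vectors; the population size, operators, and the stand-in fitness function (which would be replaced by disambiguation accuracy on development data) are all assumptions for illustration, not the paper's setup.

        import random

        def genetic_search(num_weights, fitness, pop_size=20, generations=50,
                           mutation_sigma=0.1, seed=0):
            """Evolve real-valued weight vectors, keeping those with the best fitness."""
            rng = random.Random(seed)
            population = [[rng.uniform(0.0, 1.0) for _ in range(num_weights)]
                          for _ in range(pop_size)]
            for _ in range(generations):
                population.sort(key=fitness, reverse=True)
                parents = population[: pop_size // 2]
                children = []
                while len(parents) + len(children) < pop_size:
                    a, b = rng.sample(parents, 2)
                    cut = rng.randrange(1, num_weights)              # one-point crossover
                    child = [w + rng.gauss(0.0, mutation_sigma)      # Gaussian mutation
                             for w in a[:cut] + b[cut:]]
                    children.append(child)
                population = parents + children
            return max(population, key=fitness)

        # Stand-in fitness: prefer weights close to a hidden target vector.
        target = [0.2, 0.5, 0.9]
        best = genetic_search(3, lambda w: -sum((wi - ti) ** 2 for wi, ti in zip(w, target)))
        print([round(w, 2) for w in best])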

  • Studying CSSR Algorithm Applicability on NLP Tasks

     Padro Cirera, Montserrat; Padró Cirera, Lluís
    Procesamiento del lenguaje natural
    Date of publication: 2007-09
    Journal article


  • Alias Assignment in Information Extraction

     Sapena, E; Padró Cirera, Lluís; Turmo Borras, Jorge
    Procesamiento del lenguaje natural
    Date of publication: 2007-09
    Journal article


  • Alias assignment in information extraction

     Padró Cirera, Lluís; Turmo Borras, Jorge; Sapena Masip, Emili
    XXIII Congreso Anual de la Sociedad Española para el Procesamiento del Lenguaje Natural
    Presentation of work at congresses


    This paper presents a general method for the alias assignment task in information extraction. We compare two approaches to the problem, each used to learn a classifier. The first quantifies a global similarity between the alias and all the possible entities by weighting a set of features of each alias-entity pair. The second is a classical classifier where each instance is an alias-entity pair and its attributes are the pair's features. Both approaches use the same feature functions over the alias-entity pair, where every level of abstraction, from raw characters up to the semantic level, is treated in a homogeneous way. In addition, we propose extended feature functions that break down the information and let the machine learning algorithm determine the final contribution of each value. The use of extended features improves on the results of the simple ones.
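
    A minimal sketch of the first approach described above: score each candidate entity for an alias with a weighted combination of alias-entity features and assign the alias to the best-scoring entity. The two features and their weights below are invented placeholders, much simpler than the paper's feature functions.

        import difflib

        def features(alias, entity):
            """Character-level similarity and token overlap between an alias and an entity name."""
            a, e = alias.lower(), entity.lower()
            char_sim = difflib.SequenceMatcher(None, a, e).ratio()
            ta, te = set(a.split()), set(e.split())
            token_overlap = len(ta & te) / len(ta | te) if ta | te else 0.0
            return {"char_sim": char_sim, "token_overlap": token_overlap}

        weights = {"char_sim": 0.6, "token_overlap": 0.4}  # illustrative weights

        def assign(alias, entities):
            """Assign the alias to the entity with the highest weighted feature score."""
            return max(entities,
                       key=lambda e: sum(weights[f] * v for f, v in features(alias, e).items()))

        print(assign("Barack Obama", ["Barack H. Obama", "Michelle Obama"]))  # -> Barack H. Obama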

  • EuroOpenTrad: Traducción Automática de Código Abierto para la integración europea de las Lenguas del Estado Español

     Padró Cirera, Lluís; Pichel ., Jose Ramon
    Participation in a competitive project


  • An Automata Based Approach to Biomedical Named Entity Identification

     Dowdall, J; Keller, B; Padro Cirera, Montserrat; Padró Cirera, Lluís
    Annual Meeting of the ISMB BioLINK Special Interest Goup on Text Data Mining
    Presentation of work at congresses


  • ME-CSSR: an Extension of CSSR using Maximum Entropy Models

     Padro Cirera, Montserrat; Padró Cirera, Lluís
    2007 Conference on Finite-State Methods for NLP
    Presentation of work at congresses


  • Studying CSSR Algorithm Applicability on NLP Tasks

     Padro Cirera, Montserrat; Padró Cirera, Lluís
    XXIII Congreso Anual de la Sociedad Española para el Procesamiento del Lenguaje Natural
    Presentation of work at congresses


  • UPC: Experiments with Joint Learning within SemEval Task 9

     Màrquez Villodre, Lluís; Padró Cirera, Lluís; Surdeanu, Mihai; Villarejo Muñoz, Luis
    4th International Workshop on Semantic Evaluations
    Presentation of work at congresses


  • Towards Robustness in Natural Language Understanding

     Atserias Batalla, Jordi
    Defense's date: 2006-10-09
    UPV/EHU
    Theses


  • FreeLing 1.3: Syntactic and Semantic Services in an Open-Source NLP Library

     Atserias Batalla, Jordi; Casas Fernandez, Bernardino; Comelles Pujadas, Elisabet; Gonzalez Bermudez, Meritxell; Padró Cirera, Lluís; Padro Cirera, Montserrat
    International Conference on Language Resources and Evaluation
    Presentation of work at congresses


  • KNOW: Desarrollo de Tecnologías Multilíngües a Gran Escala para la Comprensión del Lenguaje. Análisis Semántico

     Padró Cirera, Lluís; Martin Escofet, Carme
    Participation in a competitive project


  • Developing large-scale multilingual technologies for language understanding

     Padró Cirera, Lluís; Boleda Torrent, Gemma
    Participation in a competitive project


  • An Empirical Study for Automatic Acquisition of Topic Signatures

     Cuadros Oller, Montserrat; Padró Cirera, Lluís; Rigau Claramunt, German
    Third International WordNet Conference
    Presentation of work at congresses


  • Representing Discourse for Automatic Text Summarization

     Alonso, Laura
    Defense's date: 2005-03-01
    UB
    Theses


  • A Named Entity Recognition System Based on a Finite Automata Acquisition Algorithm

     Padro Cirera, Montserrat; Padró Cirera, Lluís
    21st Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural
    Presentation of work at congresses
