Scientific and technological production

  • TweetNorm_es: an annotated corpus for Spanish microtext normalization  Open access

     Alegria, Iñaki; Aranberri, Nora; Comas Umbert, Pere Ramon; Fresno, Víctor; Gamallo, Pablo; Padró Cirera, Lluís; San Vicente Roncal, Iñaki; Turmo Borras, Jorge; Zubiaga, Arkaitz
    International Conference on Language Resources and Evaluation
    Presentation's date: 2014-05-29
    Presentation of work at congresses

    In this paper we introduce TweetNorm_es, an annotated corpus of tweets in Spanish, which we make publicly available under the terms of the CC-BY license. This corpus is intended for the development and testing of microtext normalization systems. It was created for Tweet-Norm, a tweet normalization workshop and shared task, and is the result of a joint annotation effort by different research groups. In this paper we describe the methodology defined to build the corpus as well as the guidelines followed in the annotation process. We also present a brief overview of the Tweet-Norm shared task, the first evaluation environment in which the corpus was used.

  • A Constraint-Based Hypergraph Partitioning Approach to Coreference Resolution  Open access

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    COMPUTATIONAL LINGUISTICS
    Date of publication: 2013-12
    Journal article

    This work is focused on research in machine learning for coreference resolution. Coreference resolution is a natural language processing task that consists of determining the expressions in a discourse that refer to the same entity. The main contributions of this article are (i) a new approach to coreference resolution based on constraint satisfaction, using a hypergraph to represent the problem and solving it by relaxation labeling; and (ii) research towards improving coreference resolution performance using world knowledge extracted from Wikipedia. The developed approach is able to use an entity-mention classification model with more expressiveness than the pair-based ones, and to overcome the weaknesses of previous state-of-the-art approaches, such as linking contradictions, classifications without context, and lack of information when evaluating pairs. Furthermore, the approach allows the incorporation of new information by adding constraints, and research has been done on using world knowledge to improve performance. RelaxCor, the implementation of the approach, achieved results at the state-of-the-art level and participated in two international competitions: SemEval-2010 and CoNLL-2011. RelaxCor achieved second place in CoNLL-2011.

  • The TALP-UPC approach to Tweet-Norm 2013  Open access

     Ageno Pulido, Alicia; Comas Umbert, Pere Ramon; Padró Cirera, Lluís; Turmo Borras, Jorge
    Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural
    Presentation's date: 2013-09-20
    Presentation of work at congresses

    This paper describes the methodology used by the TALP-UPC team for the SEPLN 2013 shared task of tweet normalization (Tweet-Norm). The system uses a set of modules that propose different corrections for each out-of-vocabulary word. The final correction is chosen by weighted voting according to each module's accuracy.

    Postprint (author’s final draft)

  • UPC-CORE: What can machine translation evaluation metrics and Wikipedia do for estimating semantic textual similarity?  Open access

     Barron Cedeño, Luis Alberto; Màrquez Villodre, Lluís; Fuentes Fort, Maria; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge
    Joint Conference on Lexical and Computational Semantics
    Presentation's date: 2013-06-13
    Presentation of work at congresses

    In this paper we discuss our participation in the 2013 SemEval Semantic Textual Similarity task. Our core features include (i) a set of metrics borrowed from automatic machine translation, originally intended to evaluate automatic against reference translations, and (ii) an instance of explicit semantic analysis, built upon the opening paragraphs of Wikipedia 2010 articles. Our similarity estimator relies on a support vector regressor with an RBF kernel. Our best approach required 13 machine translation metrics plus explicit semantic analysis and ranked 65th in the competition. Our post-competition analysis shows that the features have a good expression level, but overfitting and, mainly, normalization issues caused our correlation values to decrease.

  • The TALP participation at TAC-KBP 2013  Open access

     Ageno Pulido, Alicia; Comas Umbert, Pere Ramon; Mehdizadeh Naderi, Ali; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge
    Text Analysis Conference
    Presentation of work at congresses

    This document describes the work performed by the Universitat Politècnica de Catalunya (UPC) in its second participation at TAC-KBP 2013 in both the Entity Linking and the Slot Filling tasks.

  • Introducción a la tarea compartida Tweet-Norm 2013: Normalización léxica de tuits en español  Open access

     Alegria, Iñaki; Aranberri, Nora; Fresno, Víctor; Gamallo, Pablo; Padró Cirera, Lluís; San Vicente Roncal, Iñaki; Turmo Borras, Jorge; Zubiaga, Arkaitz
    Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural
    Presentation's date: 2013-09-20
    Presentation of work at congresses

    This article presents an introduction to the Tweet-Norm 2013 task: description, corpora, annotation, preprocessing, systems presented, and results obtained.

  • TIN2012-38584-C06-01 - Adquisición de escenarios de conocimiento a través de la lectura de textos: inferencia de relaciones entre eventos (SKATeR)

     Rodriguez Hontoria, Horacio; Abad Soriano, Maria Teresa; Ageno Pulido, Alicia; Catala Roig, Neus; Comas Umbert, Pere Ramon; Farreres De La Morena, Javier; Fuentes Fort, Maria; Gatius Vila, Marta; Mehdizadeh Naderi, Ali; Padró Cirera, Lluís; Turmo Borras, Jorge
    Participation in a competitive project

  • Sibyl, a factoid question answering system for spoken documents

     Comas Umbert, Pere Ramon; Turmo Borras, Jorge; Màrquez Villodre, Lluís
    ACM transactions on information systems
    Date of publication: 2012
    Journal article

    In this article, we present a factoid question-answering system, Sibyl, specifically tailored for question answering (QA) on spoken-word documents. This work explores, for the first time, which techniques can be robustly adapted from the usual QA on written documents to the more difficult spoken document scenario. More specifically, we study new information retrieval (IR) techniques designed for speech, and utilize several levels of linguistic information for the speech-based QA task. These include named-entity detection with phonetic information, syntactic parsing applied to speech transcripts, and the use of coreference resolution. Sibyl is largely based on supervised machine-learning techniques, with special focus on the answer extraction step, and makes little use of handcrafted knowledge. Consequently, it should be easily adaptable to other domains and languages. Sibyl and all its modules are extensively evaluated on the European Parliament Plenary Sessions English corpus, comparing manual with automatic transcripts obtained by three different automatic speech recognition (ASR) systems that exhibit significantly different word error rates. This data belongs to the CLEF 2009 track for QA on speech transcripts. The main results confirm that syntactic information is very useful for learning to rank question candidates, improving results on both manual and automatic transcripts, unless the ASR quality is very low. At the same time, our experiments on coreference resolution reveal that the state-of-the-art technology is not mature enough to be effectively exploited for QA with spoken documents. Overall, the performance of Sibyl is comparable to or better than the state of the art on this corpus, confirming the validity of our approach.

  • The TALP participation at TAC-KBP 2012  Open access

     González Pellicer, Edgar; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge; Comas Umbert, Pere Ramon; Mehdizadeh Naderi, Ali; Ageno Pulido, Alicia; Sapena Masip, Emili; Vila Rigat, Marta; Martí, Maria Antònia
    Text Analysis Conference
    Presentation of work at congresses

    This document describes the work performed by the Universitat Politècnica de Catalunya (UPC) in its first participation at TAC-KBP 2012 in both the Entity Linking and the Slot Filling tasks.

    Postprint (author’s final draft)

  • Summarizing a multimodal set of documents in a smart room  Open access

     Fuentes Fort, Maria; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge
    International Conference on Language Resources and Evaluation
    Presentation's date: 2012-05-23
    Presentation of work at congresses

    This article reports an intrinsic automatic summarization evaluation in the scientific lecture domain. The lecture takes place in a Smart Room that has access to different types of documents produced from different media. An evaluation framework is presented to analyze the performance of systems producing summaries answering a user need. Several ROUGE metrics are used and a manual content responsiveness evaluation was carried out in order to analyze the performance of the evaluated approaches. Various multilingual summarization approaches are analyzed showing that the use of different types of documents outperforms the use of transcripts. In fact, not using any part of the spontaneous speech transcription in the summary improves the performance of automatic summaries. Moreover, the use of semantic information represented in the different textual documents coming from different media helps to improve summary quality.

    Postprint (author’s final draft)

  • Unsupervised ensemble minority clustering  Open access

     González Pellicer, Edgar; Turmo Borras, Jorge
    Date: 2012-03
    Report

    Cluster analysis lies at the core of most unsupervised learning tasks. However, the majority of clustering algorithms depend on the all-in assumption, in which all objects belong to some cluster, and perform poorly on minority clustering tasks, in which a small fraction of signal data stands against a majority of noise. The approaches proposed so far for minority clustering are supervised: they require the number and distribution of the foreground and background clusters. In supervised learning and all-in clustering, combination methods have been successfully applied to obtain distribution-free learners, even from the output of weak individual algorithms. In this report, we present a novel ensemble minority clustering algorithm, Ewocs, suitable for weak clustering combination, and provide a theoretical proof of its properties under a loose set of constraints. The validity of the assumptions used in the proof is empirically assessed using a collection of synthetic datasets.

  • Factoid Question Answering for Spoken Documents  Open access

     Comas Umbert, Pere Ramon
    Defense's date: 2012-06-12
    Department of Computer Science, Universitat Politècnica de Catalunya
    Theses

    In this dissertation, we present a factoid question answering system specifically tailored for Question Answering (QA) on spoken documents. This work explores, for the first time, which techniques can be robustly adapted from the usual QA on written documents to the more difficult spoken-document scenario. More specifically, we study new information retrieval (IR) techniques designed for speech, and utilize several levels of linguistic information for the speech-based QA task. These include named-entity detection with phonetic information, syntactic parsing applied to speech transcripts, and the use of coreference resolution. Our approach is largely based on supervised machine learning techniques, with special focus on the answer extraction step, and makes little use of handcrafted knowledge. Consequently, it should be easily adaptable to other domains and languages. As part of the work for this thesis, we promoted and coordinated the creation of an evaluation framework for the task of QA on spoken documents. The framework, named QAst, provides multilingual corpora, evaluation questions, and answer keys. These corpora were used in the QAst evaluations held at the CLEF workshop in 2007, 2008, and 2009, thus helping the development of state-of-the-art techniques for this particular topic. The presented QA system and all its modules are extensively evaluated on the European Parliament Plenary Sessions English corpus, composed of manual transcripts and automatic transcripts obtained by three different Automatic Speech Recognition (ASR) systems that exhibit significantly different word error rates. This data belongs to the CLEF 2009 track for QA on speech transcripts. The main results confirm that syntactic information is very useful for learning to rank question candidates, improving results on both manual and automatic transcripts unless the ASR quality is very low. Overall, the performance of our system is comparable to or better than the state of the art on this corpus, confirming the validity of our approach.

    In this thesis, we present a factoid Question Answering (QA) system specially tuned to work with spoken documents. In its development we explore, for the first time, which of the techniques commonly used in QA for written documents are robust enough to work in the more difficult scenario of spoken documents. More specifically, we study new Information Retrieval (IR) methods designed to deal with speech, and we use several levels of linguistic information. These include named-entity detection using phonetic information, syntactic parsing applied to speech transcripts, and the use of a subsystem for coreference detection and resolution. Our approach to the problem relies largely on supervised machine learning techniques, focused especially on the answer extraction part, and uses as little human-created knowledge as possible. Consequently, the whole QA process can be adapted to other domains or languages with relative ease. One of the additional results of the work behind this thesis is that we have promoted and coordinated the creation of an evaluation framework for the task of QA on spoken documents. This framework, named QAst (Question Answering on Speech Transcripts), provides a multilingual corpus of spoken documents, sets of evaluation questions, and their correct answers. These data were used in the QAst evaluations held within the CLEF conferences in 2007, 2008, and 2009, thereby promoting and supporting the creation of state-of-the-art techniques for this particular problem. The QA system we present and all of its individual submodules have been extensively evaluated using the English EPPS corpus (transcripts of the European Parliament Plenary Sessions), which contains manual transcripts of all the speeches as well as automatic transcripts obtained with three different automatic speech recognition (ASR) systems. The recognizers have different characteristics and results, which allows a quantitative and qualitative evaluation of the task. These data belong to the QAst 2009 evaluation. The main results of our work confirm that syntactic information is very useful for automatically learning to assess the plausibility of candidate answers, improving on previous results on both manual and automatic transcripts, unless the ASR quality is very low. Overall, the performance of our system is comparable to or better than that of other state-of-the-art systems, thus confirming the validity of our approach.

  • A constraint-based hypergraph partitioning approach to coreference resolution  Open access

     Sapena Masip, Emili
    Defense's date: 2012-05-16
    Department of Computer Science, Universitat Politècnica de Catalunya
    Theses

    The objectives of this thesis are focused on research in machine learning for coreference resolution. Coreference resolution is a natural language processing task that consists of determining the expressions in a discourse that mention or refer to the same entity. The main contributions of this thesis are (i) a new approach to coreference resolution based on constraint satisfaction, using a hypergraph to represent the problem and solving it by relaxation labeling; and (ii) research towards improving coreference resolution performance using world knowledge extracted from Wikipedia. The developed approach is able to use an entity-mention classification model with more expressiveness than the pair-based ones, and to overcome the weaknesses of previous state-of-the-art approaches, such as linking contradictions, classifications without context, and lack of information when evaluating pairs. Furthermore, the approach allows the incorporation of new information by adding constraints, and research has been done on using world knowledge to improve performance. RelaxCor, the implementation of the approach, achieved state-of-the-art results and participated in two international competitions: SemEval-2010 and CoNLL-2011. RelaxCor achieved second position in CoNLL-2011.

    Coreference resolution is a natural language processing task that consists of determining the expressions in a discourse that refer to the same real-world entity. The task has a direct effect on text mining as well as on many natural language tasks that require discourse interpretation, such as summarization, question answering, or machine translation. Resolving coreferences is essential in order to "understand" a text or discourse. The objectives of this thesis are focused on research in coreference resolution with machine learning. Specifically, the research objectives focus on the following areas:
    + Classification models: The most common classification models in the state of the art are based on the independent classification of mention pairs. More recently, models that classify groups of mentions have appeared. One of the objectives of the thesis is to incorporate the entity-mention model into the developed approach.
    + Problem representation: There is still no definitive representation of the problem. This thesis presents a hypergraph representation.
    + Resolution algorithms: Depending on the problem representation and the classification model, resolution algorithms can be very diverse. One of the objectives of this thesis is to find a resolution algorithm capable of using the classification models in the hypergraph representation.
    + Knowledge representation: To manage knowledge from several sources, a symbolic and expressive representation of that knowledge is needed. This thesis proposes the use of constraints.
    + Incorporation of world knowledge: Some coreferences cannot be resolved with linguistic information alone. Common sense and world knowledge are often needed to resolve coreferences. This thesis proposes a method to extract world knowledge from Wikipedia and incorporate it into the resolution system.
    The main contributions of this thesis are (i) a new approach to the coreference resolution problem based on constraint satisfaction, using a hypergraph to represent the problem and solving it with the relaxation labeling algorithm; and (ii) research to improve results by adding world information extracted from Wikipedia. The presented approach can use the mention-pair and entity-mention models in combination, thus avoiding the problems found in many other state-of-the-art approaches, such as contradictions between independent classifications, lack of context, and lack of information. In addition, the presented approach allows information to be incorporated by adding constraints, and research has been done to add world information that improves the results. RelaxCor, the system implemented during the thesis to experiment with the proposed approach, has achieved results comparable to the best in the state of the art. It participated in the international competitions SemEval-2010 and CoNLL-2011. RelaxCor obtained second place at CoNLL-2011.

  • Unsupervised Learning of Relation Detection Patterns  Open access

     González Pellicer, Edgar
    Defense's date: 2012-06-01
    Department of Computer Science, Universitat Politècnica de Catalunya
    Theses

    Information extraction is the area of natural language processing whose goal is to obtain structured data from the relevant information contained in textual fragments. Information extraction requires a considerable amount of linguistic knowledge. The specificity of this knowledge is a drawback for the portability of systems, since a change of language, domain, or style has a cost in terms of human effort. For decades, machine learning techniques have been applied to overcome this portability bottleneck, progressively reducing the human supervision involved. However, as the availability of large document collections increases, completely unsupervised approaches become necessary to exploit the knowledge they contain. The proposal of this thesis is to incorporate clustering techniques into pattern acquisition for information extraction, in order to further reduce the elements of supervision involved in the process. In particular, the work focuses on the problem of relation detection. Achieving this final goal required, first, considering the different strategies by which this combination could be carried out; second, developing or adapting clustering algorithms suited to our needs; and third, designing pattern acquisition procedures that incorporate the clustering information. By the end of this thesis, we had been able to develop and implement an approach to learning relation detection patterns that, using clustering techniques and a minimum of human supervision, is competitive with and even outperforms other comparable approaches in the state of the art.

    Information extraction is the natural language processing area whose goal is to obtain structured data from the relevant information contained in textual fragments. Information extraction requires a significant amount of linguistic knowledge. The specificity of such knowledge is a drawback for the portability of the systems, as a change of language, domain, or style demands a costly human effort. Machine learning techniques have been applied for decades to overcome this portability bottleneck, progressively reducing the amount of human supervision involved. However, as the availability of large document collections increases, completely unsupervised approaches become necessary in order to mine the knowledge contained in them. The proposal of this thesis is to incorporate clustering techniques into pattern learning for information extraction, in order to further reduce the elements of supervision involved in the process. In particular, the work focuses on the problem of relation detection. The achievement of this ultimate goal has required, first, considering the different strategies in which this combination could be carried out; second, developing or adapting clustering algorithms suitable to our needs; and third, devising pattern learning procedures that incorporate clustering information. By the end of this thesis, we had been able to develop and implement an approach for learning relation detection patterns which, using clustering techniques and minimal human supervision, is competitive with and even outperforms other comparable approaches in the state of the art.

  • RelaxCor participation in CoNLL shared task on coreference resolution

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    Conference on Computational Natural Language Learning
    Presentation of work at congresses

    This paper describes the participation of RelaxCor in the CoNLL-2011 shared task: "Modeling Unrestricted Coreference in OntoNotes". RelaxCor is a constraint-based graph partitioning approach to coreference resolution solved by relaxation labeling. The approach combines the strengths of groupwise classifiers and chain formation methods in one global method.

  • Overview of QAST 2009

     Turmo Borras, Jorge; Comas Umbert, Pere Ramon; Rosset, Sophie; Galibert, Olivier; Moreau, Nicolas; Mostefa, Djamel; Rosso, Paolo; Buscaldi, Davide
    Lecture notes in computer science
    Date of publication: 2010-09-01
    Journal article

  • Robust question answering for speech transcripts: UPC experience in QAst 2009

     Comas Umbert, Pere Ramon; Turmo Borras, Jorge
    Lecture notes in computer science
    Date of publication: 2010-09-01
    Journal article

  • Using dependency parsing and machine learning for factoid question answering on spoken documents  Open access

     Comas Umbert, Pere Ramon; Turmo Borras, Jorge; Màrquez Villodre, Lluís
    Annual Conference of the International Speech Communication Association
    Presentation's date: 2010-09-29
    Presentation of work at congresses

    This paper presents our experiments in question answering for speech corpora. These experiments focus on improving the answer extraction step of the QA process. We present two approaches to answer extraction that apply machine learning to improve the coverage and precision of the extraction. The first is a reranker that uses only lexical information; the second uses dependency parsing to score robust similarity between syntactic structures. Our experimental results show that the proposed learning models improve on our previous results, which used only hand-made ranking rules with little syntactic information. Moreover, these results also show that a dependency parser can be useful for speech transcripts even if it was trained on written text from a news collection. We evaluate the system on manual transcripts of speech from the EPPS English corpus and a set of questions transcribed from spontaneous oral questions. This data belongs to the CLEF 2009 track on QA on speech transcripts (QAst).

    Postprint (author’s final draft)

  • A global relaxation labeling approach to coreference resolution

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    International Conference on Computational Linguistics
    Presentation's date: 2010-08
    Presentation of work at congresses

    This paper presents a constraint-based graph partitioning approach to coreference resolution solved by relaxation labeling. The approach combines the strengths of groupwise classifiers and chain formation methods in one global method. Experiments show that our approach significantly outperforms systems based on separate classification and chain formation steps, and that it achieves the best results in the state of the art for the same dataset and metrics.

  • RelaxCor : a global relaxation labeling approach to coreference resolution

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    International Workshop on Semantic Evaluations
    Presentation's date: 2010
    Presentation of work at congresses

    This paper describes the participation of RelaxCor in the Semeval-2010 task number 1: "Coreference Resolution in Multiple Languages". RelaxCor is a constraint-based graph partitioning approach to coreference resolution solved by relaxation labeling. The approach combines the strengths of groupwise classifiers and chain formation methods in one global method.

  • Evaluation protocol and tools for question-answering on speech transcripts

     Moreau, Nicolas; Hamon, Olivier; Mostefa, Djamel; Rosset, Sophie; Galibert, Olivier; Lamel, Lori; Turmo Borras, Jorge; Comas Umbert, Pere Ramon; Rosso, Paolo; Buscaldi, Davide; Choukri, Khalid
    International Conference on Language Resources and Evaluation
    Presentation's date: 2010-05-19
    Presentation of work at congresses

  • Overview of QAST 2008

     Turmo Borras, Jorge; Comas Umbert, Pere Ramon; Rosset, Sophie; Lamel, Lori; Moreau, Nicolas; Mostefa, Djamel
    Lecture notes in computer science
    Date of publication: 2009-09-01
    Journal article

  • Sistema de recomendación para un uso inclusivo del lenguaje

     Fuentes Fort, Maria; Padró Cirera, Lluís; Padró Cirera, Muntsa; Turmo Borras, Jorge; Carrera, Jordi
    Procesamiento del lenguaje natural
    Date of publication: 2009-03
    Journal article

    A system that processes text written in Spanish and detects non-inclusive uses of language. For each suspicious noun phrase, the system proposes a set of alternatives. The system also supports the automatic acquisition of positive examples from documents that use language inclusively. These examples, together with their context, are used when presenting suggestions.

  • Robust Question Answering for Speech Transcripts: UPC Experience in QAst 2008

     Comas Umbert, Pere Ramon; Turmo Borras, Jorge
    Lecture notes in computer science
    Date of publication: 2009
    Journal article

  • Language technologies: question answering in speech transcripts

     Turmo Borras, Jorge; Galibert, Olivier; Rosset, Sophie; Surdeanu, Mihai
    Date of publication: 2009-05-31
    Book chapter

    The Question Answering (QA) task consists of providing short, relevant answers to natural language questions. Most QA research has focused on extracting information from text sources, providing the shortest relevant text in response to a question. For example, the correct answer to the question "How many groups participate in the CHIL project?" is "16", whereas the response to "Who are the partners in CHIL?" is a list of them. This simple example illustrates the two main advantages of QA over current search engines: first, the input is a natural language question rather than a keyword query; and second, the answer provides the desired information content and not simply a potentially large set of documents or URLs that the user must plow through. One of the aims of the CHIL project was to provide information about what has been said during interactive seminars. Since the information must be located in speech data, the QA systems have to be able to deal with transcripts (manual or automatic) of spontaneous speech. This is a departure from much of the QA research carried out by natural language groups, who have typically developed techniques for written texts that are assumed to have a correct syntactic and semantic structure. The structure of spoken language is different from that of written language, and some of the anchor points used in processing, such as punctuation, must be inferred and are therefore error prone. Other spoken language phenomena include disfluencies, repetitions, restarts and corrections. When automatic processing is used to create the speech transcripts, an additional challenge is dealing with the recognition errors. The response can be a short string, as in text-based QA, or an audio segment containing the response. This chapter summarizes the CHIL efforts devoted to QA for spoken language carried out at UPC and at CNRS-LIMSI.
    Research at UPC adapted a QA system developed for written texts to manually and automatically created speech transcripts, whereas at LIMSI an interactive oral QA system developed for the French language was adapted to the English language. CHIL organized the pilot track on Question Answering in Speech Transcripts (QAst), as part of CLEF 2007, in order to compare and evaluate QA technology on both manually and automatically produced transcripts of spontaneous speech.

  • Unsupervised relation extraction by massive clustering (Open access)

     González Pellicer, Edgar; Turmo Borras, Jorge
    IEEE International Conference On Data Mining
    Presentation's date: 2009-12
    Presentation of work at congresses

    The goal of Information Extraction is to automatically generate structured pieces of information from the relevant information contained in text documents. Machine Learning techniques have been applied to reduce the cost of Information Extraction system adaptation. However, elements of human supervision strongly bias the learning process. Unsupervised learning approaches can avoid these biases. In this paper, we propose an unsupervised approach to learning for Relation Detection, based on the use of massive clustering ensembles. The results obtained on the ACE Relation Mention Detection task outperform the state of the art among unsupervised techniques for this evaluation framework by 5 points of F1 score, and the approach is also simpler and more flexible.

  • A Graph Partitioning Approach to Coreference Resolution

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    Date: 2009-01
    Report

  • An analysis of bootstrapping for the recognition of temporal expressions

     Poveda Poveda, Jordi; Turmo Borras, Jorge
    NAACL 2009 Workshop on Semi-supervised Learning for NLP
    Presentation's date: 2009
    Presentation of work at congresses

  • The CHIL Audiovisual Corpus for Lecture and Meeting Analysis inside Smart Rooms

     Mostefa, D; Moreau, N; Choukri, K; Potamianos, G; Chu, S M; Tyagi, A; Casas Pla, Josep Ramon; Turmo Borras, Jorge; Cristoforetti, L; Tobia, F; Pnevmatikakis, A; Mylonakis, V; Talantzis, F; Burger, S; Stiefelhagen, R; Bernardin, K; Rochet, C
    Language resources and evaluation
    Date of publication: 2008-01
    Journal article

  • A graph partitioning approach to entity disambiguation using uncertain information

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    Lecture notes in artificial intelligence
    Date of publication: 2008-10
    Journal article

  • TEXT-MESS: Minería de Textos Inteligente, Interactiva y Multilingüe basada en Tecnología del Lenguaje Humano

     Ageno Pulido, Alicia; Turmo Borras, Jorge
    Procesamiento del lenguaje natural
    Date of publication: 2008-09
    Journal article

  • Non-parametric document clustering by ensemble methods (Open access)

     González Pellicer, Edgar; Turmo Borras, Jorge
    Procesamiento del lenguaje natural
    Date of publication: 2008-03
    Journal article

    The biases of individual algorithms for non-parametric document clustering can lead to non-optimal solutions. Ensemble clustering methods may overcome this limitation, but have not been applied to document collections. This paper presents a comparison of strategies for non-parametric document ensemble clustering.

  • Robust question answering for speech transcripts using minimal syntactic analysis

     Comas Umbert, Pere Ramon; Turmo Borras, Jorge; Surdeanu, Mihai
    Lecture notes in computer science
    Date of publication: 2008
    Journal article

  • Overview of QAST 2007

     Turmo Borras, Jorge; Comas Umbert, Pere Ramon; Ayache, Christelle; Mostefa, Djamel; Rosset, Sophie; Lamel, Lori
    Lecture notes in computer science
    Date of publication: 2008
    Journal article

  • Comparing non-parametric ensemble methods for document clustering

     González Pellicer, Edgar; Turmo Borras, Jorge
    International Conference on Applications of Natural Language to Information Systems
    Presentation's date: 2008-06
    Presentation of work at congresses

    The biases of individual algorithms for non-parametric document clustering can lead to non-optimal solutions. Ensemble clustering methods may overcome this limitation, but have not been applied to document collections. This paper presents a comparison of strategies for non-parametric document ensemble clustering.

  • TALP at TAC 2008: a semantic approach to recognizing textual entailment

     Ageno Pulido, Alicia; Cruz, Fermín; Farwell, David Loring; Ferrés, Daniel; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge
    Text Analysis Conference
    Presentation of work at congresses

  • Spoken document retrieval based on approximated sequence alignment (Open access)

     Comas Umbert, Pere Ramon; Turmo Borras, Jorge
    International Conference on Text, Speech and Dialogue
    Presentation's date: 2008
    Presentation of work at congresses

    This paper presents a new approach to spoken document information retrieval for spontaneous speech corpora. The classical approach to this problem is the use of an automatic speech recognizer (ASR) combined with standard information retrieval techniques. However, ASRs tend to produce transcripts of spontaneous speech with a significant word error rate, which is a drawback for standard retrieval techniques. To overcome this limitation, our method is based on an approximated sequence alignment algorithm that searches for "sounds like" sequences. Our approach does not depend on extra information from the ASR and, in our experiments, outperforms the precision of state-of-the-art techniques by up to 7 points.

    Postprint (author’s final draft)

  • Question answering on speech transcripts: the QAST evaluation in CLEF

     Lamel, Lori; Rosset, Sophie; Ayache, Christelle; Mostefa, Djamel; Turmo Borras, Jorge; Comas Umbert, Pere Ramon
    International Conference on Language Resources and Evaluation
    Presentation's date: 2008
    Presentation of work at congresses

  • Recomendador t-incluye para un uso inclusivo del lenguaje

     Fuentes Gomez, Melchor; Padró Cirera, Lluís; Turmo Borras, Jorge
    Date: 2008-11
    Report

  • A Question-driven Information System adaptable via Information Extraction techniques

     Sapena Masip, Emili; González, Manuel; Padró Cirera, Lluís; Turmo Borras, Jorge
    Date: 2008-06
    Report

  • Coreference Resolution Survey

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    Date: 2008-12
    Report

  • A graph partitioning approach to entity disambiguation using uncertain information

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    6th International Conference Advances in Natural Language Processing
    Presentation's date: 2008
    Presentation of work at congresses

    This paper presents a method for Entity Disambiguation in Information Extraction from different sources on the web. Once entities and the relations between them are extracted, it is necessary to determine which ones refer to the same real-world entity. We model the problem as a graph partitioning problem in order to combine the available information more accurately than a pairwise classifier. Moreover, our method handles uncertain information, which turns out to be quite helpful. Two algorithms are trained and compared, one probabilistic and the other deterministic. Both are tuned using genetic algorithms to find the best weights for the set of constraints. Experiments show that graph-based modeling yields better results using uncertain information.

  • Alias Assignment in Information Extraction

     Sapena, E; Padró Cirera, Lluís; Turmo Borras, Jorge
    Procesamiento del lenguaje natural
    Date of publication: 2007-09
    Journal article

  • Alias assignment in information extraction

     Padró Cirera, Lluís; Turmo Borras, Jorge; Sapena Masip, Emili
    XXIII Congreso Anual de la Sociedad Española para el Procesamiento del Lenguaje Natural
    Presentation of work at congresses

    This paper presents a general method for the alias assignment task in information extraction. We compare two approaches to the problem of learning a classifier. The first quantifies a global similarity between the alias and all the possible entities by weighting features of each alias-entity pair. The second is a classical classifier in which each instance is an alias-entity pair and its attributes are the features of that pair. Both approaches use the same feature functions over the alias-entity pair, in which every level of abstraction, from raw characters up to the semantic level, is treated in a homogeneous way. In addition, we propose extended feature functions that break down the information and let the machine learning algorithm determine the final contribution of each value. The use of extended features improves on the results of the simple ones.

  • A Comparison of Statistical and Rule-Induction Learners for Automatic Tagging of Time Expressions in English

     Poveda Poveda, Jordi; Surdeanu, Mihai; Turmo Borras, Jorge
    International Symposium on Temporal Representation and Reasoning
    Presentation of work at congresses

  • An Evaluation Framework Based on Gold Standard Models for Definition Question Answering

     Kanaan Izquierdo, Samir; Turmo Borras, Jorge
    5th International Conference on Natural Language Processing
    Presentation of work at congresses
