Scientific and technological production

  • TweetNorm_es: an annotated corpus for Spanish microtext normalization  Open access

     Alegria, Iñaki; Aranberri, Nora; Comas Umbert, Pere Ramon; Fresno, Víctor; Gamallo, Pablo; Padró Cirera, Lluís; San Vicente Roncal, Iñaki; Turmo Borras, Jorge; Zubiaga, Arkaitz
    International Conference on Language Resources and Evaluation
    p. 2274-2278
    Presentation's date: 2014-05-29
    Presentation of work at congresses

    In this paper we introduce TweetNorm_es, an annotated corpus of tweets in Spanish, which we make publicly available under the terms of the CC-BY license. This corpus is intended for the development and testing of microtext normalization systems. It was created for Tweet-Norm, a tweet normalization workshop and shared task, and is the result of a joint annotation effort by different research groups. In this paper we describe the methodology defined to build the corpus as well as the guidelines followed in the annotation process. We also present a brief overview of the Tweet-Norm shared task, the first evaluation environment in which the corpus was used.

  • Introducción a la tarea compartida Tweet-Norm 2013: Normalización léxica de tuits en español  Open access

     Alegria, Iñaki; Aranberri, Nora; Fresno, Víctor; Gamallo, Pablo; Padró Cirera, Lluís; San Vicente Roncal, Iñaki; Turmo Borras, Jorge; Zubiaga, Arkaitz
    Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural
    Presentation's date: 2013-09-20
    Presentation of work at congresses

    This paper presents an introduction to the Tweet-Norm 2013 shared task: description, corpora, annotation, preprocessing, submitted systems, and results obtained.

  • The TALP-UPC approach to Tweet-Norm 2013  Open access

     Ageno Pulido, Alicia; Comas Umbert, Pere Ramon; Padró Cirera, Lluís; Turmo Borras, Jorge
    Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural
    p. 91-95
    Presentation's date: 2013-09-20
    Presentation of work at congresses

    This paper describes the methodology used by the TALP-UPC team for the SEPLN 2013 shared task of tweet normalization (Tweet-Norm). The system uses a set of modules that propose different corrections for each out-of-vocabulary word. The final correction is chosen by weighted voting according to each module's accuracy.

    Postprint (author’s final draft)

  • UPC-CORE : What can machine translation evaluation metrics and Wikipedia do for estimating semantic textual similarity?  Open access

     Barron Cedeño, Luis Alberto; Màrquez Villodre, Lluís; Fuentes Fort, Maria; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge
    Joint Conference on Lexical and Computational Semantics
    p. 1-5
    Presentation's date: 2013-06-13
    Presentation of work at congresses

    In this paper we discuss our participation in the 2013 SemEval Semantic Textual Similarity task. Our core features include (i) a set of metrics borrowed from automatic machine translation, originally intended to evaluate automatic against reference translations, and (ii) an instance of explicit semantic analysis, built upon opening paragraphs of Wikipedia 2010 articles. Our similarity estimator relies on a support vector regressor with RBF kernel. Our best approach required 13 machine translation metrics + explicit semantic analysis and ranked 65th in the competition. Our post-competition analysis shows that the features have a good expression level, but overfitting and, mainly, normalization issues caused our correlation values to decrease.

  • TIN2012-38584-C06-01 - Adquisición de escenarios de conocimiento a través de la lectura de textos: inferencia de relaciones entre eventos (SKATeR)

     Rodriguez Hontoria, Horacio; Abad Soriano, Maria Teresa; Ageno Pulido, Alicia; Catala Roig, Neus; Comas Umbert, Pere Ramon; Farreres De La Morena, Javier; Fuentes Fort, Maria; Gatius Vila, Marta; Mehdizadeh Naderi, Ali; Padró Cirera, Lluís; Turmo Borras, Jorge
    Competitive project

  • The TALP participation at TAC-KBP 2013  Open access

     Ageno Pulido, Alicia; Comas Umbert, Pere Ramon; Mehdizadeh Naderi, Ali; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge
    Text Analysis Conference
    Presentation of work at congresses

    This document describes the work performed by the Universitat Politècnica de Catalunya (UPC) in its second participation at TAC-KBP 2013 in both the Entity Linking and the Slot Filling tasks.

  • Unsupervised Learning of Relation Detection Patterns  Open access

     González Pellicer, Edgar
    Department of Computer Science, Universitat Politècnica de Catalunya
    Theses

    Information extraction is the natural language processing area whose goal is to obtain structured data from the relevant information contained in textual fragments. Information extraction requires a significant amount of linguistic knowledge. The specificity of such knowledge is a drawback for the portability of the systems, as a change of language, domain or style demands a costly human effort. Machine learning techniques have been applied for decades to overcome this portability bottleneck, progressively reducing the amount of human supervision involved. However, as the availability of large document collections increases, completely unsupervised approaches become necessary in order to mine the knowledge contained in them. The proposal of this thesis is to incorporate clustering techniques into pattern learning for information extraction, in order to further reduce the elements of supervision involved in the process. In particular, the work focuses on the problem of relation detection. Achieving this ultimate goal required, first, considering the different strategies by which this combination could be carried out; second, developing or adapting clustering algorithms suitable to our needs; and third, devising pattern learning procedures that incorporate clustering information. By the end of this thesis, we had been able to develop and implement an approach for learning relation detection patterns which, using clustering techniques and minimal human supervision, is competitive with and even outperforms other comparable approaches in the state of the art.

  • Factoid Question Answering for Spoken Documents  Open access

     Comas Umbert, Pere Ramon
    Department of Computer Science, Universitat Politècnica de Catalunya
    Theses

    In this dissertation, we present a factoid question answering system specifically tailored for Question Answering (QA) on spoken documents. This work explores, for the first time, which techniques can be robustly adapted from the usual QA on written documents to the more difficult spoken-document scenario. More specifically, we study new information retrieval (IR) techniques designed for speech, and use several levels of linguistic information for the speech-based QA task. These include named-entity detection with phonetic information, syntactic parsing applied to speech transcripts, and the use of coreference resolution. Our approach is largely based on supervised machine learning techniques, with special focus on the answer extraction step, and makes little use of handcrafted knowledge. Consequently, it should be easily adaptable to other domains and languages. In the work resulting from this thesis, we have promoted and coordinated the creation of an evaluation framework for the task of QA on spoken documents. The framework, named QAst (Question Answering on Speech Transcripts), provides multilingual corpora, evaluation questions, and answer keys. These corpora were used in the QAst evaluations held at the CLEF workshop in 2007, 2008 and 2009, thus helping the development of state-of-the-art techniques for this particular topic. The presented QA system and all its modules are extensively evaluated on the European Parliament Plenary Sessions (EPPS) English corpus, composed of manual transcripts and automatic transcripts obtained by three different Automatic Speech Recognition (ASR) systems that exhibit significantly different word error rates. This data belongs to the CLEF 2009 track for QA on speech transcripts. The main results confirm that syntactic information is very useful for learning to rank answer candidates, improving results on both manual and automatic transcripts unless the ASR quality is very low. Overall, the performance of our system is comparable to or better than the state of the art on this corpus, confirming the validity of our approach.

  • A constraint-based hypergraph partitioning approach to coreference resolution  Open access

     Sapena Masip, Emili
    Department of Computer Science, Universitat Politècnica de Catalunya
    Theses

    Coreference resolution is a natural language processing task that consists of determining the expressions in a discourse that mention or refer to the same real-world entity. The task has a direct effect on text mining and on many natural language tasks that require discourse interpretation, such as summarization, question answering, or machine translation. Resolving coreferences is essential in order to "understand" a text or a discourse. The objectives of this thesis are focused on research in machine learning for coreference resolution, specifically in the following areas. (1) Classification models: the most common models in the state of the art classify pairs of mentions independently, and models that classify groups of mentions have appeared more recently; one objective of the thesis is to incorporate the entity-mention model into the developed approach. (2) Problem representation: there is not yet a definitive representation of the problem, and this thesis presents a hypergraph representation. (3) Resolution algorithms: depending on the problem representation and the classification model, resolution algorithms can be very diverse; one objective is to find a resolution algorithm able to use the classification models in the hypergraph representation. (4) Knowledge representation: managing knowledge from diverse sources requires a symbolic and expressive representation, and this thesis proposes the use of constraints. (5) Incorporation of world knowledge: some coreferences cannot be resolved with linguistic information alone and often require common sense and world knowledge; this thesis proposes a method to extract world knowledge from Wikipedia and incorporate it into the resolution system.

    The main contributions of this thesis are (i) a new approach to coreference resolution based on constraint satisfaction, using a hypergraph to represent the problem and solving it by relaxation labeling; and (ii) research towards improving coreference resolution performance using world knowledge extracted from Wikipedia. The developed approach can use an entity-mention classification model, which is more expressive than pair-based ones, and overcomes weaknesses of previous approaches in the state of the art such as linking contradictions, classifications without context, and lack of information when evaluating pairs. Furthermore, the approach allows new information to be incorporated by adding constraints, and research has been done on using world knowledge to improve performance. RelaxCor, the implementation of the approach, achieved state-of-the-art results and participated in international competitions: SemEval-2010 and CoNLL-2011. RelaxCor achieved second position in CoNLL-2011.

  • Summarizing a multimodal set of documents in a smart room  Open access

     Fuentes Fort, Maria; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge
    International Conference on Language Resources and Evaluation
    p. 1-6
    Presentation's date: 2012-05-23
    Presentation of work at congresses

    This article reports an intrinsic automatic summarization evaluation in the scientific lecture domain. The lecture takes place in a Smart Room that has access to different types of documents produced from different media. An evaluation framework is presented to analyze the performance of systems producing summaries answering a user need. Several ROUGE metrics are used and a manual content responsiveness evaluation was carried out in order to analyze the performance of the evaluated approaches. Various multilingual summarization approaches are analyzed showing that the use of different types of documents outperforms the use of transcripts. In fact, not using any part of the spontaneous speech transcription in the summary improves the performance of automatic summaries. Moreover, the use of semantic information represented in the different textual documents coming from different media helps to improve summary quality.

    Postprint (author’s final draft)

  • Unsupervised ensemble minority clustering  Open access

     González Pellicer, Edgar; Turmo Borras, Jorge
    Date: 2012-03
    Report

    Cluster analysis lies at the core of most unsupervised learning tasks. However, the majority of clustering algorithms depend on the all-in assumption, in which all objects belong to some cluster, and perform poorly on minority clustering tasks, in which a small fraction of signal data stands against a majority of noise. The approaches proposed so far for minority clustering are supervised: they require the number and distribution of the foreground and background clusters. In supervised learning and all-in clustering, combination methods have been successfully applied to obtain distribution-free learners, even from the output of weak individual algorithms. In this report, we present a novel ensemble minority clustering algorithm, Ewocs, suitable for weak clustering combination, and provide a theoretical proof of its properties under a loose set of constraints. The validity of the assumptions used in the proof is empirically assessed using a collection of synthetic datasets.

  • Cross-lingual Knowledge Extraction

     Padró Cirera, Lluís; Primadhanty, Audi; Quattoni, Ariadna Julieta; Lluis Martorell, Xavier; Turmo Borras, Jorge; Carreras Perez, Xavier
    Competitive project

  • The TALP participation at TAC-KBP 2012  Open access

     González Pellicer, Edgar; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge; Comas Umbert, Pere Ramon; Mehdizadeh Naderi, Ali; Ageno Pulido, Alicia; Sapena Masip, Emili; Vila Rigat, Marta; Martí Antonin, Maria Antònia
    Text Analysis Conference
    Presentation of work at congresses

    This document describes the work performed by the Universitat Politècnica de Catalunya (UPC) in its first participation at TAC-KBP 2012 in both the Entity Linking and the Slot Filling tasks.

    Postprint (author’s final draft)

  • RelaxCor participation in CoNLL shared task on coreference resolution

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    Conference on Computational Natural Language Learning
    p. 35-39
    Presentation of work at congresses

    This paper describes the participation of RelaxCor in the CoNLL-2011 shared task: "Modeling Unrestricted Coreference in OntoNotes". RelaxCor is a constraint-based graph partitioning approach to coreference resolution solved by relaxation labeling. The approach combines the strengths of groupwise classifiers and chain formation methods in one global method.

  • Using dependency parsing and machine learning for factoid question answering on spoken documents  Open access

     Comas Umbert, Pere Ramon; Turmo Borras, Jorge; Màrquez Villodre, Lluís
    Annual Conference of the International Speech Communication Association
    p. 1-4
    Presentation's date: 2010-09-29
    Presentation of work at congresses

    This paper presents our experiments in question answering for speech corpora. These experiments focus on improving the answer extraction step of the QA process. We present two approaches to answer extraction in question answering for speech corpora that apply machine learning to improve the coverage and precision of the extraction. The first is a reranker that uses only lexical information; the second uses dependency parsing to score robust similarity between syntactic structures. Our experimental results show that the proposed learning models improve on our previous results, which used only hand-made ranking rules with little syntactic information. Moreover, these results also show that a dependency parser can be useful for speech transcripts even if it was trained with written text data from a news collection. We evaluate the system on manual transcripts of speech from the EPPS English corpus and a set of questions transcribed from spontaneous oral questions. This data belongs to the CLEF 2009 track on QA on speech transcripts (QAst).

    Postprint (author’s final draft)

  • A global relaxation labeling approach to coreference resolution

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    International Conference on Computational Linguistics
    p. 1086-1094
    Presentation's date: 2010-08
    Presentation of work at congresses

    This paper presents a constraint-based graph partitioning approach to coreference resolution solved by relaxation labeling. The approach combines the strengths of groupwise classifiers and chain formation methods in one global method. Experiments show that our approach significantly outperforms systems based on separate classification and chain formation steps, and that it achieves the best results in the state of the art for the same dataset and metrics.

  • Evaluation protocol and tools for question-answering on speech transcripts

     Moreau, Nicolas; Hamon, Olivier; Mostefa, Djamel; Rosset, Sophie; Galibert, Olivier; Lamel, Lori; Turmo Borras, Jorge; Comas Umbert, Pere Ramon; Rosso, Paolo; Buscaldi, Davide; Choukri, Khalid
    International Conference on Language Resources and Evaluation
    p. 2769-2773
    Presentation's date: 2010-05-19
    Presentation of work at congresses

  • RelaxCor : a global relaxation labeling approach to coreference resolution

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    International Workshop on Semantic Evaluations
    p. 1-4
    Presentation's date: 2010
    Presentation of work at congresses

    This paper describes the participation of RelaxCor in SemEval-2010 task number 1: "Coreference Resolution in Multiple Languages". RelaxCor is a constraint-based graph partitioning approach to coreference resolution solved by relaxation labeling. The approach combines the strengths of groupwise classifiers and chain formation methods in one global method.

  • Unsupervised relation extraction by massive clustering  Open access

     González Pellicer, Edgar; Turmo Borras, Jorge
    IEEE International Conference On Data Mining
    p. 782-787
    DOI: 10.1109/ICDM.2009.81
    Presentation's date: 2009-12
    Presentation of work at congresses

    The goal of Information Extraction is to automatically generate structured pieces of information from the relevant information contained in text documents. Machine Learning techniques have been applied to reduce the cost of Information Extraction system adaptation. However, elements of human supervision strongly bias the learning process. Unsupervised learning approaches can avoid these biases. In this paper, we propose an unsupervised approach to learning for Relation Detection, based on the use of massive clustering ensembles. The results obtained on the ACE Relation Mention Detection task outperform the state of the art of unsupervised techniques for this evaluation framework by 5 points of F1 score, and the approach is also simpler and more flexible.

  • Language technologies: question answering in speech transcripts

     Turmo Borras, Jorge; Galibert, Olivier; Rosset, Sophie; Surdeanu, Mihai
    DOI: 10.1007/978-1-84882-054-8_3
    Date of publication: 2009-05-31
    Book chapter

    The Question Answering (QA) task consists of providing short, relevant answers to natural language questions. Most QA research has focused on extracting information from text sources, providing the shortest relevant text in response to a question. For example, the correct answer to the question “How many groups participate in the CHIL project?” is “16”, whereas the response to “Who are the partners in CHIL?” is a list of them. This simple example illustrates the two main advantages of QA over current search engines: first, the input is a natural language question rather than a keyword query; and second, the answer provides the desired information content and not simply a potentially large set of documents or URLs that the user must plow through. One of the aims of the CHIL project was to provide information about what has been said during interactive seminars. Since the information must be located in speech data, the QA systems have to be able to deal with transcripts (manual or automatic) of spontaneous speech. This is a departure from much of the QA research carried out by natural language groups, who have typically developed techniques for written texts that are assumed to have a correct syntactic and semantic structure. The structure of spoken language is different from that of written language, and some of the anchor points used in processing, such as punctuation, must be inferred and are therefore error prone. Other spoken language phenomena include disfluencies, repetitions, restarts and corrections. When automatic processing is used to create the speech transcripts, an additional challenge is dealing with the recognition errors. The response can be a short string, as in text-based QA, or an audio segment containing the response. This chapter summarizes the CHIL efforts devoted to QA for spoken language carried out at UPC and at CNRS-LIMSI. Research at UPC adapted a QA system developed for written texts to manually and automatically created speech transcripts, whereas at LIMSI an interactive oral QA system developed for the French language was adapted to the English language. CHIL organized the pilot track on Question Answering in Speech Transcripts (QAst), as part of CLEF 2007, in order to compare and evaluate QA technology on both manually and automatically produced transcripts of spontaneous speech.

  • A Graph Partitioning Approach to Coreference Resolution

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    Date: 2009-01
    Report


  • An analysis of bootstrapping for the recognition of temporal expressions

     Poveda Poveda, Jordi; Turmo Borras, Jorge
    NAACL 2009 Workshop on Semi-supervised Learning for NLP
    p. 50-59
    Presentation's date: 2009
    Presentation of work at congresses


  • Coreference Resolution Survey

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    Date: 2008-12
    Report


  • Recomendador t-incluye para un uso inclusivo del lenguaje

     Fuentes Gomez, Melchor; Padró Cirera, Lluís; Turmo Borras, Jorge
    Date: 2008-11
    Report


  • A Question-driven Information System adaptable via Information Extraction techniques

     Sapena Masip, Emili; González, Manuel; Padró Cirera, Lluís; Turmo Borras, Jorge
    Date: 2008-06
    Report


  • Comparing non-parametric ensemble methods for document clustering

     González Pellicer, Edgar; Turmo Borras, Jorge
    International Conference on Applications of Natural Language to Information Systems
    p. 245-256
    DOI: 10.1007/978-3-540-69858-6_25
    Presentation's date: 2008-06
    Presentation of work at congresses


    The biases of individual algorithms for non-parametric document clustering can lead to non-optimal solutions. Ensemble clustering methods may overcome this limitation, but have not been applied to document collections. This paper presents a comparison of strategies for non-parametric document ensemble clustering.

  • A graph partitioning approach to entity disambiguation using uncertain information

     Sapena Masip, Emili; Padró Cirera, Lluís; Turmo Borras, Jorge
    6th International Conference Advances in Natural Language Processing
    p. 428-439
    Presentation's date: 2008
    Presentation of work at congresses


    This paper presents a method for Entity Disambiguation in Information Extraction from different sources on the web. Once entities and the relations between them are extracted, it is necessary to determine which ones refer to the same real-world entity. We model the problem as a graph partitioning problem in order to combine the available information more accurately than a pairwise classifier. Moreover, our method handles uncertain information, which turns out to be quite helpful. Two algorithms are trained and compared, one probabilistic and the other deterministic. Both are tuned using genetic algorithms to find the best weights for the set of constraints. Experiments show that graph-based modeling yields better results using uncertain information.

  • Question answering on speech transcripts: the QAST evaluation in CLEF

     Lamel, Lori; Rosset, Sophie; Ayache, Christelle; Mostefa, Djamel; Turmo Borras, Jorge; Comas Umbert, Pere Ramon
    International Conference on Language Resources and Evaluation
    p. 1995-1999
    Presentation's date: 2008
    Presentation of work at congresses


  • Spoken document retrieval based on approximated sequence alignment  Open access

     Comas Umbert, Pere Ramon; Turmo Borras, Jorge
    International Conference on Text, Speech and Dialogue
    p. 1-8
    Presentation's date: 2008
    Presentation of work at congresses


    This paper presents a new approach to spoken document information retrieval for spontaneous speech corpora. The classical approach to this problem is the use of an automatic speech recognizer (ASR) combined with standard information retrieval techniques. However, ASRs tend to produce transcripts of spontaneous speech with a significant word error rate, which is a drawback for standard retrieval techniques. To overcome this limitation, our method is based on an approximated sequence alignment algorithm that searches for “sounds like” sequences. Our approach does not depend on extra information from the ASR and, in our experiments, outperforms state-of-the-art techniques by up to 7 points of precision.

    Postprint (author’s final draft)

  • TALP at TAC 2008: a semantic approach to recognizing textual entailment

     Ageno Pulido, Alicia; Cruz, Fermín; Farwell, David Loring; Ferrés, Daniel; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge
    Text Analysis Conference
    Presentation of work at congresses


  • A Comparison of Statistical and Rule-Induction Learners for Automatic Tagging of Time Expressions in English

     Poveda Poveda, Jordi; Surdeanu, Mihai; Turmo Borras, Jorge
    International Symposium on Temporal Representation and Reasoning
    p. 141-149
    Presentation of work at congresses


  • Alias assignment in information extraction

     Padró Cirera, Lluís; Turmo Borras, Jorge; Sapena Masip, Emili
    XXIII Congreso Anual de la Sociedad Española para el Procesamiento del Lenguaje Natural
    p. 1-2
    Presentation of work at congresses


    This paper presents a general method for the alias assignment task in information extraction. We compare two approaches to the problem, each of which learns a classifier. The first quantifies a global similarity between the alias and all the possible entities by weighting features of each alias-entity pair. The second is a classical classifier where each instance is an alias-entity pair and its attributes are the features of that pair. Both approaches use the same feature functions over the alias-entity pair, in which every level of abstraction, from raw characters up to the semantic level, is treated in a homogeneous way. In addition, we propose extended feature functions that break down the information and let the machine learning algorithm determine the final contribution of each value. Using extended features improves on the results obtained with the simple ones.

  • An Evaluation Framework Based on Gold Standard Models for Definition Question Answering

     Kanaan Izquierdo, Samir; Turmo Borras, Jorge
    5th International Conference on Natural Language Processing
    p. 93-101
    Presentation of work at congresses


  • TextMess - SAMiT (Intelligent, Interactive and Multilingual Text Mining based on Human Language Technologies)

     Ageno Pulido, Alicia; Turmo Borras, Jorge; Rodriguez Hontoria, Horacio; Catala Roig, Neus; Comas Umbert, Pere Ramon; González Pellicer, Edgar; Fuentes Fort, Maria; Kanaan Izquierdo, Samir; Ferrés Domènech, Daniel; Sapena Masip, Emili
    Competitive project


  • FEMsum at DUC 2006: Semantic-based approach integrated in a flexible eclectic multitask summarizer architecture  Open access

     Fuentes Fort, Maria; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge; Ferrés Domènech, Daniel
    Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics
    p. 1-8
    Presentation's date: 2006-06-08
    Presentation of work at congresses


    To meet different requirements, at the TALP Research Center we have built a highly parameterized environment that allows specific summarizers to be instantiated for different summarization tasks in different languages. This paper describes and analyzes how our system deals with the DUC 2006 task of providing summary-length answers to complex questions. The given query is used to detect relevant passages. Semantic similarities between the sentences of these passages are then detected and used as input to an iterative graph-based algorithm that avoids redundancy and produces a cohesive text. NIST human evaluations are used to analyze several aspects of our system, and a specific analysis is reported for each of the three kinds of submitted summaries.

    Postprint (author’s final draft)

  • Unsupervised Document Clustering by Weighted Combination

     González Pellicer, Edgar; Turmo Borras, Jorge
    Date: 2006-05
    Report


  • Best IST 2006 Exhibit

     Casas Pla, Josep Ramon; Turmo Borras, Jorge; Surdeanu, Mihai; Rödder, M; Waibel, A; Stiefelhagen, R; Pianesi, F; Curin, J; Comas, P; Consortium, CHIL
    Award or recognition


  • A hybrid approach for the acquisition of information extraction patterns

     Turmo Borras, Jorge; Ageno Pulido, Alicia; Surdeanu, Mihai
    Workshop on Adaptive Text Extraction and Mining
    p. 48-55
    Presentation's date: 2006
    Presentation of work at congresses


  • TALP-UPC at TREC 2005: Experiments using voting scheme among three heterogeneous QA systems  Open access

     Ferrés, Daniel; Kanaan Izquierdo, Samir; Gonzàlez Pellicer, Edgar; Ageno Pulido, Alicia; Fuentes Fort, Maria; Rodriguez Hontoria, Horacio; Surdeanu, Mihai; Turmo Borras, Jorge
    Text Retrieval Conference
    Presentation's date: 2006
    Presentation of work at congresses


    This paper describes the experiments of the TALP-UPC group for factoid and 'other' (definitional) questions at the TREC 2005 Main Question Answering (QA) task. Our current approach for factoid questions is based on a voting scheme among three QA systems: TALP-QA (our previous QA system), Sibyl (a new QA system developed at DAMA-UPC and TALP-UPC), and Aranea (a web-based data-driven approach). For definitional questions, we used two different systems: the TALP-QA Definitional system and LCSUM (a summarization-based system). Our results for factoid questions indicate that the voting strategy improves the accuracy from 7.5% to 17.1%. While these numbers are low (due to technical problems in the Answer Extraction phase of the TALP-QA system), they indicate that voting is a successful approach for boosting the performance of QA systems. The answer to definitional questions is produced by selecting phrases using a set of patterns associated with definitions. The best configuration of the TALP-QA Definitional system achieves an F-score of 17.2%.

  • ALIADO (Speech and Language Technologies for a Personal Assistant)

     Mariño Acebal, Jose Bernardo; Kanaan Izquierdo, Samir; Turmo Borras, Jorge
    Competitive project


  • TALP-UPC at TREC 2005: Experiments Using Voting Scheme Among Three Heterogeneous QA Systems

     Ferrés Domènech, Daniel; Kanaan Izquierdo, Samir; Domínguez-Sal, D; Dominguez Sala, David; Gonzàlez, E; Ageno Pulido, Alicia; Fuentes Fort, Maria; Rodriguez Hontoria, Horacio; Surdeanu, Mihai; Turmo Borras, Jorge
    XIV TREC Conference
    p. 1-2
    Presentation of work at congresses


  • A Hybrid Unsupervised Approach for Document Clustering

     Surdeanu, Mihai; Turmo Borras, Jorge; Ageno Pulido, Alicia
    The Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
    p. 685-690
    Presentation of work at congresses


  • Summarizing Spontaneous Speech Using General Text Properties

     Fuentes Fort, Maria; González Pellicer, Edgar; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge; Alonso, L
    Crossing Barriers in Text Summarization Research Workshop
    p. 10-17
    Presentation of work at congresses


  • A Robust Combination Strategy for Semantic Role Labeling

     Turmo Borras, Jorge; Comas, Pere; Surdeanu, Mihai; Màrquez Villodre, Lluís
    Empirical Methods in Natural Language Processing
    Presentation of work at congresses


  • Unsupervised Clustering of spontaneous Speech Documents

     González Pellicer, Edgar; Turmo Borras, Jorge
    9th European Conference on Speech Communication and Technology
    p. 609-612
    Presentation of work at congresses
