Scientific and technological production

1 to 50 of 218 results
  • UPC-CORE: What can machine translation evaluation metrics and Wikipedia do for estimating semantic textual similarity?  Open access

     Barron Cedeño, Luis Alberto; Màrquez Villodre, Lluís; Fuentes Fort, Maria; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge
    Joint Conference on Lexical and Computational Semantics
    Presentation's date: 2013-06-13
    Presentation of work at congresses

    In this paper we discuss our participation in the 2013 SemEval Semantic Textual Similarity task. Our core features include (i) a set of metrics borrowed from automatic machine translation, originally intended to evaluate automatic translations against reference translations, and (ii) an instance of explicit semantic analysis, built upon the opening paragraphs of Wikipedia 2010 articles. Our similarity estimator relies on a support vector regressor with an RBF kernel. Our best approach required 13 machine translation metrics plus explicit semantic analysis and ranked 65th in the competition. Our post-competition analysis shows that the features have a good expression level, but overfitting and, mainly, normalization issues caused our correlation values to decrease.
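
    As a purely illustrative sketch (not the authors' code), the snippet below shows the kind of estimator described in this abstract: a support vector regressor with an RBF kernel trained on per-pair features such as machine translation metrics and an ESA similarity score. It assumes scikit-learn; the feature values and gold scores are toy data.

        import numpy as np
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVR

        # Each row holds features for one sentence pair (e.g. MT metrics, ESA cosine).
        X_train = np.array([[0.42, 0.31, 0.77],
                            [0.10, 0.05, 0.20],
                            [0.85, 0.64, 0.91]])
        y_train = np.array([3.8, 1.2, 4.6])        # gold similarity scores (0-5 in STS)

        # RBF-kernel support vector regression over standardized features.
        sts_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, gamma="scale"))
        sts_model.fit(X_train, y_train)

        X_test = np.array([[0.40, 0.28, 0.70]])
        print(sts_model.predict(X_test))           # predicted similarity for an unseen pair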

  • LIARc: Labeling Implicit ARguments in Spanish deverbal nominalizations

     Peris, Aina; Taulé, Mariona; Rodriguez Hontoria, Horacio; Bertran Ibarz, Manuel
    International Conference on Intelligent Text Processing and Computational Linguistics
    Presentation of work at congresses

    This paper deals with the automatic identification and annotation of the implicit arguments of deverbal nominalizations in Spanish. We present the first version of the LIAR system, focusing on its classifier component. We have built a supervised, feature-based machine learning model that uses a subset of AnCora-Es as a training corpus. We have built four different models; the overall F-measure is 89.9%, an improvement of approximately 35 points over the baseline (55%). However, a detailed analysis of feature performance is still needed. Future work will focus on using LIAR to automatically annotate the implicit arguments in the whole AnCora-Es.

  • Evaluación del trabajo Final de Grado

     Sanchez Carracedo, Fermin; Climent Vilaro, Juan; Corbalan Gonzalez, Julita; Fonseca Casas, Pau; Garcia Almiñana, Jordi; Herrero Zaragoza, José Ramón; Llinas Audet, Francisco Javier; Rodriguez Hontoria, Horacio; Sancho Samsó, Maria Ribera
    Jornadas de Enseñanza Universitaria de la Informática
    Presentation's date: 2013-07
    Presentation of work at congresses

    Final-year projects (Proyectos de Fin de Carrera, PFC) have traditionally been assessed on the basis of a written report and a public presentation. This assessment is generally carried out by a panel of several lecturers, which judges the project as a whole from the submitted documentation and its public presentation. To decide the final mark, schools generally lack clear and precise criteria, so each panel relies on its own previous experience to grade each project. In the new engineering curricula, the Bachelor's thesis (Trabajo de Fin de Grado, TFG) replaces the old PFC. The assessment of the TFG must explicitly consider both specific and generic competences, and clear criteria are needed on how to evaluate them. To move in this direction, the Ministry of Science and Innovation and the Agency for the Quality of the Catalan University System funded, in 2008 and 2009, the project "Guide for the assessment of competences in Bachelor's and Master's theses in engineering". This guide is, in fact, a guide to help each school or degree programme define its own TFG assessment procedure. This work presents an implementation of the proposals contained in the guide and defines a methodology for assessing TFGs based on the competences addressed in the Bachelor's degree in Informatics Engineering at the Facultat d'Informàtica de Barcelona. The methodology can easily be replicated or adapted for other schools and degrees, which may make it easier for them to produce their own TFG assessment guides.

  • Adquisición de escenarios de conocimiento a través de la lectura de textos: inferencia de relaciones entre eventos (SKATeR)

     Catala Roig, Neus; Rodriguez Hontoria, Horacio
    Participation in a competitive project

  • The TALP participation at TAC-KBP 2012

     González Pellicer, Edgar; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge; Comas Umbert, Pere Ramon; Mehdizadeh Naderi, Ali; Ageno Pulido, Alicia; Sapena Masip, Emili; Vila Rigat, Marta; Martí, Maria Antònia
    Text Analysis Conference
    Presentation of work at congresses

    This document describes the work performed by the Universitat Politècnica de Catalunya (UPC) in its first participation at TAC-KBP 2012 in both the Entity Linking and the Slot Filling tasks.

  • Summarizing a multimodal set of documents in a smart room  Open access

     Fuentes Fort, Maria; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge
    International Conference on Language Resources and Evaluation
    Presentation's date: 2012-05-23
    Presentation of work at congresses

    This article reports an intrinsic automatic summarization evaluation in the scientific lecture domain. The lecture takes place in a Smart Room that has access to different types of documents produced from different media. An evaluation framework is presented to analyze the performance of systems producing summaries answering a user need. Several ROUGE metrics are used and a manual content responsiveness evaluation was carried out in order to analyze the performance of the evaluated approaches. Various multilingual summarization approaches are analyzed showing that the use of different types of documents outperforms the use of transcripts. In fact, not using any part of the spontaneous speech transcription in the summary improves the performance of automatic summaries. Moreover, the use of semantic information represented in the different textual documents coming from different media helps to improve summary quality.

    Postprint (author’s final draft)

  • IARG-AnCora: annotating AnCora corpus with implicit arguments

     Taulé, Mariona; Martí, Maria Antònia; Peris, Aina; Rodriguez Hontoria, Horacio; Moreno Boronat, Lidia; Moreda Pozo, Paloma
    Procesamiento del lenguaje natural
    Date of publication: 2012-09
    Journal article

    IARG-AnCora tiene como objetivo la anotación con papeles temáticos de los argumentos implícitos de las nominalizaciones deverbales en el corpus AnCora. Estos corpus servirán de base para los sistemas de etiquetado automático de roles semánticos basados en técnicas de aprendizaje automático. Los analizadores semánticos son componentes básicos en las aplicaciones actuales de las tecnologías del lenguaje, en las que se quiere potenciar una comprensión más profunda del texto para realizar inferencias de más alto nivel y obtener así mejoras cualitativas en los resultados.

    IARG-AnCora aims to annotate the implicit arguments of deverbal nominalizations in the AnCora corpus with thematic roles. This corpus will be the basis for automatic semantic role labeling systems based on machine learning techniques. Semantic analyzers are essential components in current language technology applications, in which a deeper understanding of the text is sought in order to make higher-level inferences and thus obtain qualitative improvements in the results.

  • Empirical methods for the study of denotation in nominalizations in Spanish

     Peris, Aina; Taulé, Mariona; Rodriguez Hontoria, Horacio
    Computational linguistics
    Date of publication: 2012
    Journal article

  • Nominalizaciones deverbales: Denotación y estructura argumental

     Aina Peris Morant
    Defense's date: 2012-05-11
    Universitat de Barcelona (UB)
    Theses

  • Testing and Test-Driven Development of Conceptual Schemas  Open access

     Tort Pugibet, Albert
    Defense's date: 2012-04-11
    Department of Software, Universitat Politècnica de Catalunya
    Theses

    The traditional focus for Information Systems (IS) quality assurance relies on the evaluation of its implementation. However, the quality of an IS can be largely determined in the first stages of its development. Several studies reveal that more than half the errors that occur during systems development are requirements errors. A requirements error is defined as a mismatch between requirements specification and stakeholders' needs and expectations. Conceptual modeling is an essential activity in requirements engineering aimed at developing the conceptual schema of an IS. The conceptual schema is the general knowledge that an IS needs to know in order to perform its functions. A conceptual schema specification has semantic quality when it is valid and complete. Validity means that the schema is correct (the knowledge it defines is true for the domain) and relevant (the knowledge it defines is necessary for the system). Completeness means that the conceptual schema includes all relevant knowledge. The validation of a conceptual schema pursues the detection of requirements errors in order to improve its semantic quality. Conceptual schema validation is still a critical challenge in requirements engineering. In this work we contribute to this challenge, taking into account that, since conceptual schemas of IS can be specified in executable artifacts, they can be tested. In this context, the main contributions of this Thesis are (1) an approach to test conceptual schemas of information systems, and (2) a novel method for the incremental development of conceptual schemas supported by continuous test-driven validation. As far as we know, this is the first work that proposes and implements an environment for automated testing of UML/OCL conceptual schemas, and the first work that explores the use of test-driven approaches in conceptual modeling. The testing of conceptual schemas may be an important and practical means for their validation. It allows checking correctness and completeness according to stakeholders' needs and expectations. Moreover, in conjunction with the automatic check of basic test adequacy criteria, we can also analyze the relevance of the elements defined in the schema. The testing environment we propose requires a specialized language for writing tests of conceptual schemas. We defined the Conceptual Schema Testing Language (CSTL), which may be used to specify automated tests of executable schemas specified in UML/OCL. We also describe a prototype implementation of a test processor that makes the approach feasible in practice. The conceptual schema testing approach supports test-last validation of conceptual schemas, but it also makes sense to test incomplete conceptual schemas while they are developed. This fact lays the groundwork of Test-Driven Conceptual Modeling (TDCM), which is our second main contribution. TDCM is a novel conceptual modeling method based on the main principles of Test-Driven Development (TDD), an extreme programming method in which a software system is developed in short iterations driven by tests. We have applied the method in several case studies, in the context of Design Research, which is the general research framework we adopted. Finally, we also describe an integration approach of TDCM into a broad set of software development methodologies, including the Unified Process development methodology, MDD-based approaches, storytest-driven agile methods and goal and scenario-oriented requirements engineering methods.

    Els enfocaments per assegurar la qualitat dels sistemes d'informació s'han basat tradicionalment en l'avaluació de la seva implementació. No obstant això, la qualitat d'un sistema d'informació pot ser àmpliament determinada en les primeres fases del seu desenvolupament. Diversos estudis indiquen que més de la meitat dels errors de software són errors de requisits. Un error de requisit es defineix com una desalineació entre l'especificació dels requisits i les necessitats i expectatives de les parts implicades (stakeholders). La modelització conceptual és una activitat essencial en l'enginyeria de requisits, l'objectiu de la qual és desenvolupar l'esquema conceptual d'un sistema d'informació. L'esquema conceptual és el coneixement general que un sistema d'informació requereix per tal de desenvolupar les seves funcions. Un esquema conceptual té qualitat semàntica quan és vàlid i complet. La validesa implica que l'esquema sigui correcte (el coneixement definit és cert per al domini) i rellevant (el coneixement definit és necessari per al sistema). La completesa significa que l'esquema conceptual inclou tot el coneixement rellevant. La validació de l'esquema conceptual té com a objectiu la detecció d'errors de requisits per tal de millorar la qualitat semàntica. La validació d'esquemes conceptuals és un repte crític en l'enginyeria de requisits. Aquesta tesi contribueix a aquest repte i es basa en el fet que els esquemes conceptuals de sistemes d'informació poden ser especificats en artefactes executables i, per tant, poden ser provats. Les principals contribucions de la tesi són (1) un enfocament per a les proves d'esquemes conceptuals de sistemes d'informació, i (2) una metodologia innovadora per al desenvolupament incremental d'esquemes conceptuals assistit per una validació continuada basada en proves. Les proves d'esquemes conceptuals poden ser una important i pràctica tècnica per a la seva validació, ja que permeten provar la correctesa i la completesa d'acord amb les necessitats i expectatives de les parts interessades. En conjunció amb la comprovació d'un conjunt bàsic de criteris d'adequació de les proves, també podem analitzar la rellevància dels elements definits a l'esquema. L'entorn de test proposat inclou un llenguatge especialitzat per escriure proves automatitzades d'esquemes conceptuals, anomenat Conceptual Schema Testing Language (CSTL). També hem descrit i implementat un prototip de processador de tests que fa possible l'aplicació de l'enfocament proposat a la pràctica. D'acord amb l'estat de l'art en validació d'esquemes conceptuals, aquest és el primer treball que proposa i implementa un entorn per al testing automatitzat d'esquemes conceptuals definits en UML/OCL. L'enfocament de proves d'esquemes conceptuals permet dur a terme la validació d'esquemes existents, però també té sentit provar esquemes conceptuals incomplets mentre estan sent desenvolupats. Aquest fet és la base de la metodologia Test-Driven Conceptual Modeling (TDCM), que és la segona contribució principal. El TDCM és una metodologia de modelització conceptual basada en principis bàsics del Test-Driven Development (TDD), un mètode de programació en el qual un sistema software és desenvolupat en petites iteracions guiades per proves. També hem aplicat el mètode en diversos casos d'estudi en el context de la metodologia de recerca Design Science Research. Finalment, hem proposat enfocaments d'integració del TDCM en diverses metodologies de desenvolupament de software.

  • Factoid Question Answering for Spoken Documents  Open access

     Comas Umbert, Pere Ramon
    Defense's date: 2012-06-12
    Department of Software, Universitat Politècnica de Catalunya
    Theses

    In this dissertation, we present a factoid question answering system, specifically tailored for Question Answering (QA) on spoken documents. This work explores, for the first time, which techniques can be robustly adapted from the usual QA on written documents to the more difficult spoken documents scenario. More specifically, we study new information retrieval (IR) techniques designed for speech, and utilize several levels of linguistic information for the speech-based QA task. These include named-entity detection with phonetic information, syntactic parsing applied to speech transcripts, and the use of coreference resolution. Our approach is largely based on supervised machine learning techniques, with special focus on the answer extraction step, and makes little use of handcrafted knowledge. Consequently, it should be easily adaptable to other domains and languages. As part of the work of this thesis, we have promoted and coordinated the creation of an evaluation framework for the task of QA on spoken documents. The framework, named QAst, provides multi-lingual corpora, evaluation questions, and answer keys. These corpora have been used in the QAst evaluation that was held in the CLEF workshop for the years 2007, 2008 and 2009, thus helping the development of state-of-the-art techniques for this particular topic. The presented QA system and all its modules are extensively evaluated on the European Parliament Plenary Sessions English corpus, composed of manual transcripts and automatic transcripts obtained by three different Automatic Speech Recognition (ASR) systems that exhibit significantly different word error rates. This data belongs to the CLEF 2009 track for QA on speech transcripts. The main results confirm that syntactic information is very useful for learning to rank question candidates, improving results on both manual and automatic transcripts unless the ASR quality is very low. Overall, the performance of our system is comparable to or better than the state of the art on this corpus, confirming the validity of our approach.

    En aquesta Tesi, presentem un sistema de Question Answering (QA) factual, especialment ajustat per treballar amb documents orals. En el desenvolupament explorem, per primera vegada, quines tècniques de les habitualment emprades en QA per documents escrit són suficientment robustes per funcionar en l'escenari més difícil de documents orals. Amb més especificitat, estudiem nous mètodes de Information Retrieval (IR) dissenyats per tractar amb la veu, i utilitzem diversos nivells d'informació linqüística. Entre aquests s'inclouen, a saber: detecció de Named Entities utilitzant informació fonètica, "parsing" sintàctic aplicat a transcripcions de veu, i també l'ús d'un sub-sistema de detecció i resolució de la correferència. La nostra aproximació al problema es recolza en gran part en tècniques supervisades de Machine Learning, estant aquestes enfocades especialment cap a la part d'extracció de la resposta, i fa servir la menor quantitat possible de coneixement creat per humans. En conseqüència, tot el procés de QA pot ser adaptat a altres dominis o altres llengües amb relativa facilitat. Un dels resultats addicionals de la feina darrere d'aquesta Tesis ha estat que hem impulsat i coordinat la creació d'un marc d'avaluació de la taska de QA en documents orals. Aquest marc de treball, anomenat QAst (Question Answering on Speech Transcripts), proporciona un corpus de documents orals multi-lingüe, uns conjunts de preguntes d'avaluació, i les respostes correctes d'aquestes. Aquestes dades han estat utilitzades en les evaluacionis QAst que han tingut lloc en el si de les conferències CLEF en els anys 2007, 2008 i 2009; d'aquesta manera s'ha promogut i ajudat a la creació d'un estat-de-l'art de tècniques adreçades a aquest problema en particular. El sistema de QA que presentem i tots els seus particulars sumbòduls, han estat avaluats extensivament utilitzant el corpus EPPS (transcripcions de les Sessions Plenaries del Parlament Europeu) en anglès, que cónté transcripcions manuals de tots els discursos i també transcripcions automàtiques obtingudes mitjançant tres reconeixedors automàtics de la parla (ASR) diferents. Els reconeixedors tenen característiques i resultats diferents que permetes una avaluació quantitativa i qualitativa de la tasca. Aquestes dades pertanyen a l'avaluació QAst del 2009. Els resultats principals de la nostra feina confirmen que la informació sintàctica és mol útil per aprendre automàticament a valorar la plausibilitat de les respostes candidates, millorant els resultats previs tan en transcripcions manuals com transcripcions automàtiques, descomptat que la qualitat de l'ASR sigui molt baixa. En general, el rendiment del nostre sistema és comparable o millor que els altres sistemes pertanyents a l'estat-del'art, confirmant així la validesa de la nostra aproximació.

  • A constraint-based hypergraph partitioning approach to coreference resolution  Open access

     Sapena Masip, Emili
    Defense's date: 2012-05-16
    Department of Software, Universitat Politècnica de Catalunya
    Theses

    The objectives of this thesis are focused on research in machine learning for coreference resolution. Coreference resolution is a natural language processing task that consists of determining the expressions in a discourse that mention or refer to the same entity. The main contributions of this thesis are (i) a new approach to coreference resolution based on constraint satisfaction, using a hypergraph to represent the problem and solving it by relaxation labeling; and (ii) research towards improving coreference resolution performance using world knowledge extracted from Wikipedia. The developed approach is able to use an entity-mention classification model with more expressiveness than pair-based ones, and overcomes the weaknesses of previous approaches in the state of the art, such as linking contradictions, classifications without context, and lack of information when evaluating pairs. Furthermore, the approach allows the incorporation of new information by adding constraints, and research has been done on using world knowledge to improve performance. RelaxCor, the implementation of the approach, achieved state-of-the-art results and participated in international competitions: SemEval-2010 and CoNLL-2011. RelaxCor achieved second position in CoNLL-2011.

    La resolució de correferències és una tasca de processament del llenguatge natural que consisteix en determinar les expressions d'un discurs que es refereixen a la mateixa entitat del mon real. La tasca té un efecte directe en la minería de textos així com en moltes tasques de llenguatge natural que requereixin interpretació del discurs com resumidors, responedors de preguntes o traducció automàtica. Resoldre les correferències és essencial si es vol poder “entendre” un text o un discurs. Els objectius d'aquesta tesi es centren en la recerca en resolució de correferències amb aprenentatge automàtic. Concretament, els objectius de la recerca es centren en els següents camps: + Models de classificació: Els models de classificació més comuns a l'estat de l'art estan basats en la classificació independent de parelles de mencions. Més recentment han aparegut models que classifiquen grups de mencions. Un dels objectius de la tesi és incorporar el model entity-mention a l'aproximació desenvolupada. + Representació del problema: Encara no hi ha una representació definitiva del problema. En aquesta tesi es presenta una representació en hypergraf. + Algorismes de resolució. Depenent de la representació del problema i del model de classificació, els algorismes de ressolució poden ser molt diversos. Un dels objectius d'aquesta tesi és trobar un algorisme de resolució capaç d'utilitzar els models de classificació en la representació d'hypergraf. + Representació del coneixement: Per poder administrar coneixement de diverses fonts, cal una representació simbòlica i expressiva d'aquest coneixement. En aquesta tesi es proposa l'ús de restriccions. + Incorporació de coneixement del mon: Algunes correferències no es poden resoldre només amb informació lingüística. Sovint cal sentit comú i coneixement del mon per poder resoldre coreferències. En aquesta tesi es proposa un mètode per extreure coneixement del mon de Wikipedia i incorporar-lo al sistem de resolució. Les contribucions principals d'aquesta tesi son (i) una nova aproximació al problema de resolució de correferències basada en satisfacció de restriccions, fent servir un hypergraf per representar el problema, i resolent-ho amb l'algorisme relaxation labeling; i (ii) una recerca per millorar els resultats afegint informació del mon extreta de la Wikipedia. L'aproximació presentada pot fer servir els models mention-pair i entity-mention de forma combinada evitant així els problemes que es troben moltes altres aproximacions de l'estat de l'art com per exemple: contradiccions de classificacions independents, falta de context i falta d'informació. A més a més, l'aproximació presentada permet incorporar informació afegint restriccions i s'ha fet recerca per aconseguir afegir informació del mon que millori els resultats. RelaxCor, el sistema que ha estat implementat durant la tesi per experimentar amb l'aproximació proposada, ha aconseguit uns resultats comparables als millors que hi ha a l'estat de l'art. S'ha participat a les competicions internacionals SemEval-2010 i CoNLL-2011. RelaxCor va obtenir la segona posició al CoNLL-2010.

  • Unsupervised Learning of Relation Detection Patterns  Open access

     González Pellicer, Edgar
    Defense's date: 2012-06-01
    Department of Software, Universitat Politècnica de Catalunya
    Theses

    L'extracció d'informació és l'àrea del processament de llenguatge natural l'objectiu de la qual és l'obtenir dades estructurades a partir de la informació rellevant continguda en fragments textuals. L'extracció d'informació requereix una quantitat considerable de coneixement lingüístic. La especificitat d'aquest coneixement suposa un inconvenient de cara a la portabilitat dels sistemes, ja que un canvi d'idioma, domini o estil té un cost en termes d'esforç humà. Durant dècades, s'han aplicat tècniques d'aprenentatge automàtic per tal de superar aquest coll d'ampolla de portabilitat, reduint progressivament la supervisió humana involucrada. Tanmateix, a mida que augmenta la disponibilitat de grans col·leccions de documents, esdevenen necessàries aproximacions completament nosupervisades per tal d'explotar el coneixement que hi ha en elles. La proposta d'aquesta tesi és la d'incorporar tècniques de clustering a l'adquisició de patrons per a extracció d'informació, per tal de reduir encara més els elements de supervisió involucrats en el procés En particular, el treball se centra en el problema de la detecció de relacions. L'assoliment d'aquest objectiu final ha requerit, en primer lloc, el considerar les diferents estratègies en què aquesta combinació es podia dur a terme; en segon lloc, el desenvolupar o adaptar algorismes de clustering adequats a les nostres necessitats; i en tercer lloc, el disseny de procediments d'adquisició de patrons que incorporessin la informació de clustering. Al final d'aquesta tesi, havíem estat capaços de desenvolupar i implementar una aproximació per a l'aprenentatge de patrons per a detecció de relacions que, utilitzant tècniques de clustering i un mínim de supervisió humana, és competitiu i fins i tot supera altres aproximacions comparables en l'estat de l'art.

    Information extraction is the natural language processing area whose goal is to obtain structured data from the relevant information contained in textual fragments. Information extraction requires a significant amount of linguistic knowledge. The specificity of such knowledge is a drawback for the portability of the systems, as a change of language, domain or style demands a costly human effort. Machine learning techniques have been applied for decades so as to overcome this portability bottleneck, progressively reducing the amount of involved human supervision. However, as the availability of large document collections increases, completely unsupervised approaches become necessary in order to mine the knowledge contained in them. The proposal of this thesis is to incorporate clustering techniques into pattern learning for information extraction, in order to further reduce the elements of supervision involved in the process. In particular, the work focuses on the problem of relation detection. The achievement of this ultimate goal has required, first, considering the different strategies in which this combination could be carried out; second, developing or adapting clustering algorithms suitable to our needs; and third, devising pattern learning procedures which incorporated clustering information. By the end of this thesis, we had been able to develop and implement an approach for learning relation detection patterns which, using clustering techniques and minimal human supervision, is competitive and even outperforms other comparable approaches in the state of the art.

  • Georeferencing textual annotations and tagsets with geographical knowledge and language models  Open access

     Ferrés Domènech, Daniel; Rodriguez Hontoria, Horacio
    Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural
    Presentation's date: 2011
    Presentation of work at congresses

    Presentamos en este artículo cuatro aproximaciones al georeferenciado genérico de anotaciones textuales multilingües y etiquetas semánticas. Las cuatro aproximaciones se basan en el uso de 1) Conocimiento geográfico, 2) Modelos del lenguaje (LM), 3) Modelos del lenguaje con re-ranking de predicciones y 4) Fusión de las predicciones basadas en conocimiento geográfico con otras aproximaciones. Los recursos empleados incluyen el gazetteer geográfico Geonames, los modelos de recuperación de información TFIDF y BM25, el Hiemstra Language Modelling (HLM), listas de stop words para varias lenguas y un diccionario electrónico de la lengua inglesa. Los mejores resultados en precisión del georeferenciado se han obtenido con la aproximación de re-ranking que usa el HLM y con su fusión con conocimiento geográfico. Estas estrategias mejoran los resultados de los mejores sistemas participantes en la tarea oficial de georeferenciado en MediaEval 2010. Nuestro mejor resultado obtiene una precisión de 68.53% en la tarea de georeferenciado hasta 100 km.

    This paper describes generic approaches for georeferencing multilingual textual annotations and sets of tags from metadata associated with textual or multimedia content with high precision. We present four approaches based on: 1) Geographical Knowledge, 2) Language Modelling (LM), 3) Language Modelling with re-ranking of predictions, and 4) Fusion of Geographical Knowledge predictions with the other approaches. The resources employed were the Geonames geographical gazetteer, the TFIDF and BM25 information retrieval algorithms, the Hiemstra Language Modelling (HLM) algorithm, stopword lists for several languages, and an electronic English dictionary. The best results in georeferencing accuracy are achieved with the HLM re-ranking approach and its fusion with Geographical Knowledge. These strategies outperform the best results in accuracy reported by the state-of-the-art systems that participated in the MediaEval 2010 official Placing task. Our best result achieves an accuracy of 68.53% when georeferencing up to a distance of 100 km.

    Postprint (author’s final draft)

  • Cultural configuration of Wikipedia: measuring autoreferentiality in different languages  Open access

     Miquel Ribé, Marc; Rodriguez Hontoria, Horacio
    Recent Advances in Natural Language Processing
    Presentation's date: 2011
    Presentation of work at congresses

    The motivations for writing in Wikipedia reported in the current literature largely coincide, but none of the studies considers the hypothesis of contributing in order to increase the visibility of one's own national or language-related content. Similar to topical coverage studies, we outline a method that allows collecting the articles of this content and later analysing them along several dimensions. To prove its universality, the tests are repeated for up to twenty language editions of Wikipedia. Finally, using the best indicators from each dimension we obtain an index that represents the degree of autoreferentiality of the encyclopedia. Last, we point out the impact of this fact and the risk of not considering its existence in the design of applications based on user-generated content.

  • TALP at MediaEval 2011 Placing Task: georeferencing Flickr videos with geographical knowledge and information retrieval  Open access

     Ferrés Domènech, Daniel; Rodriguez Hontoria, Horacio
    MediaEval Workshop
    Presentation's date: 2011
    Presentation of work at congresses

    This paper describes our georeferencing approaches, experiments, and results at the MediaEval 2011 Placing Task evaluation. The task consists of predicting the most probable geographical coordinates of Flickr videos. Our approaches used only the Flickr users' textual annotations and tagsets for prediction. We used three approaches for this task: 1) a Geographical Knowledge approach, 2) an Information Retrieval based approach with re-ranking, and 3) a combination of both (GeoFusion). The GeoFusion approach achieved the best results within the error margins from 10 km to 10000 km.

  • Paraphrase concept and typology: a linguistically based and computationally oriented approach

     Vila, Marta; Martí, Maria Antònia; Rodriguez Hontoria, Horacio
    Procesamiento del lenguaje natural
    Date of publication: 2011
    Journal article

  • Multilingual Acquisition of Large Scale Knowledge Resources

     Cuadros Oller, Montserrat
    Defense's date: 2011-11-22
    Department of Software, Universitat Politècnica de Catalunya
    Theses

  • Araknion: inducción de modelos lingüísticos a partir de corpora

     Martí, Maria Antònia; Taulé, Mariona; Rodriguez Hontoria, Horacio; Martínez Barco, Patricio Manuel; Carreras Perez, Xavier
    Procesamiento del lenguaje natural
    Date of publication: 2011
    Journal article

  • Extracting terminology from Wikipedia

     Vivaldi, Jorge; Rodriguez Hontoria, Horacio
    Procesamiento del lenguaje natural
    Date of publication: 2011
    Journal article

  • ADN-classifier: automatically assigning denotation types to nominalizations  Open access

     Peris, Aina; Taulé, Mariona; Boleda Torrent, Gemma; Rodriguez Hontoria, Horacio
    International Conference on Language Resources and Evaluation
    Presentation's date: 2010-05
    Presentation of work at congresses

    This paper presents the ADN-Classifier, an automatic classification system for Spanish deverbal nominalizations aimed at identifying their semantic denotation (i.e. event, result, underspecified, or lexicalized). The classifier can be used for NLP tasks such as coreference resolution or paraphrase detection. To our knowledge, the ADN-Classifier is the first effort in acquisition of denotations for nominalizations using Machine Learning. We compare the results of the classifier when using a decreasing number of Knowledge Sources, namely (1) the complete nominal lexicon (AnCora-Nom) that includes sense distinctions, (2) the nominal lexicon (AnCora-Nom) removing the sense-specific information, (3) nominalizations' context information obtained from a treebank corpus (AnCora-Es) and (4) the combination of the previous linguistic resources. In a realistic scenario, that is, without sense distinction, the best results achieved are those taking into account the information declared in the lexicon (89.40% accuracy). This shows that the lexicon contains crucial information (such as argument structure) that corpus-derived features cannot substitute for.

  • TALP at WePS-3 2010  Open access

     Ferrés Domènech, Daniel; Rodriguez Hontoria, Horacio
    Conference on Multilingual and Multimodal Information Access Evaluation
    Presentation's date: 2010
    Presentation of work at congresses

    In this paper we present our system and experiments at the Third Web People Search Workshop (WePS-3) task for clustering web people search documents in English. In our experiments we used a simple approach with three algorithms: Lingo, Hierarchical Agglomerative Clustering (HAC), and a 2-step HAC algorithm. We also present the results and initial conclusions in the context of the WePS-3 Task 1 for clustering. We obtained the best results with the HAC and 2-step HAC algorithms.
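
    The following is a minimal sketch of hierarchical agglomerative clustering of documents, in the spirit of the HAC runs mentioned above but not the actual WePS-3 system: TF-IDF vectors, cosine distances, average linkage and an illustrative distance threshold. It assumes scikit-learn and SciPy; the documents and the 0.8 cut are toy values.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_distances
        from scipy.cluster.hierarchy import linkage, fcluster
        from scipy.spatial.distance import squareform

        docs = ["John Smith the painter from Leeds",
                "John Smith exhibition of paintings",
                "Dr. John Smith, cardiologist in Boston"]

        # TF-IDF vectors and pairwise cosine distances between web pages.
        tfidf = TfidfVectorizer().fit_transform(docs)
        dist = cosine_distances(tfidf)

        # Average-linkage agglomerative clustering, cut at an illustrative threshold.
        tree = linkage(squareform(dist, checks=False), method="average")
        labels = fcluster(tree, t=0.8, criterion="distance")
        print(labels)   # cluster id per document; same id = same person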

    Postprint (author’s final draft)

  • Semantic annotation of deverbal nominalizations in the Spanish corpus AnCora  Open access

     Peris, Aina; Taulé, Mariona; Rodriguez Hontoria, Horacio
    International Workshop on Treebanks and Linguistic Theories
    Presentation's date: 2010
    Presentation of work at congresses

    This paper presents the methodology and the linguistic criteria followed to enrich the AnCora-Es corpus with the semantic annotation of deverbal nominalizations. The first step was to run two independent automated processes: one for the annotation of denotation types and another one for the annotation of argument structure. Secondly, we manually checked both types of information and measured inter-annotator agreement. The result is the Spanish AnCora-Es corpus enriched with the semantic annotation of deverbal nominalizations. As far as we know, this is the first Spanish corpus annotated with this type of information.

    Postprint (author’s final draft)

  • Finding domain terms using Wikipedia

     Vivaldi, Jorge; Rodriguez Hontoria, Horacio
    International Conference on Language Resources and Evaluation
    Presentation's date: 2010
    Presentation of work at congresses

  • TALP at MediaEval 2010 placing task: geographical focus detection of Flickr textual annotations  Open access

     Ferrés, Dani; Rodriguez Hontoria, Horacio
    MediaEval Workshop
    Presentation's date: 2010
    Presentation of work at congresses

    This paper describes our geographical text analysis and geotagging experiments in the context of the Multimedia Placing Task at the MediaEval 2010 evaluation. The task consists of predicting the most probable coordinates of Flickr videos. We used a Natural Language Processing approach that tries to match geographical place names in the Flickr users' textual annotations. The resources employed to deal with this task were the Geonames geographical gazetteer, stopword lists from several languages, and an electronic English dictionary. We used two geographical focus disambiguation strategies, one based on population heuristics and another that combines geographical knowledge and population heuristics. The second strategy achieves the best results. Using the stopword lists and the English dictionary as a filter for ambiguous place names also improves the results.
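
    The population heuristic can be illustrated with the toy sketch below (hypothetical code and gazetteer entries, not the TALP system): every place name matched in the annotations is looked up in a gazetteer and the candidate interpretation with the largest population is taken as the geographical focus.

        # Toy gazetteer: place name -> list of (latitude, longitude, population) candidates.
        GAZETTEER = {
            "paris": [(48.8566, 2.3522, 2_140_000),     # Paris, France
                      (33.6609, -95.5555, 25_000)],     # Paris, Texas
            "london": [(51.5074, -0.1278, 8_900_000),
                       (42.9849, -81.2453, 404_000)],   # London, Ontario
        }

        def geographical_focus(tags):
            """Return the coordinates of the most populated candidate among matched tags."""
            candidates = [c for tag in tags for c in GAZETTEER.get(tag.lower(), [])]
            if not candidates:
                return None
            lat, lon, _pop = max(candidates, key=lambda c: c[2])
            return lat, lon

        print(geographical_focus(["holiday", "Paris", "eiffel"]))   # -> (48.8566, 2.3522)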

  • Inference of lexical ontologies. The LeOnI methodology

     Farreres De La Morena, Javier; Gibert Oliveras, Karina; Rodriguez Hontoria, Horacio; Pluempitiwiriyawej, Charnyote
    Artificial intelligence
    Date of publication: 2010-01
    Journal article

    In this article we present a method for semi-automatically deriving lexico-conceptual ontologies in other languages, given a lexico-conceptual ontology for one language and bilingual mapping resources. Our method uses a logistic regression model to combine mappings proposed by a set of classifiers (up to 17 in our implementation). The method is formally described and evaluated by means of two implementations for semi-automatically building Spanish and Thai WordNets using Princeton's WordNet for English and conventional English-Spanish and English-Thai bilingual dictionaries.
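
    As an illustration of combining classifier proposals with logistic regression (a simplified sketch under assumed data, not the LeOnI implementation), each candidate mapping can be represented by the binary votes of the individual classifiers, and a logistic model can learn which vote patterns indicate a correct mapping. scikit-learn is assumed; the data are toy values.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        # Rows: candidate word-to-synset mappings; columns: votes (1/0) of individual classifiers.
        votes = np.array([[1, 1, 0, 1],
                          [0, 0, 1, 0],
                          [1, 0, 1, 1],
                          [0, 1, 0, 0]])
        correct = np.array([1, 0, 1, 0])    # manually validated labels

        combiner = LogisticRegression().fit(votes, correct)
        new_candidate = np.array([[1, 1, 1, 0]])
        print(combiner.predict_proba(new_candidate)[0, 1])   # probability the mapping is right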

  • Automatically extending named entities coverage of Arabic WordNet using Wikipedia

     Alkhalifa, Musa; Rodriguez Hontoria, Horacio
    International journal on information & communication technologies
    Date of publication: 2010
    Journal article

  • Using Wikipedia for term extraction in the biomedical domain: first experiences

     Vivaldi, Jorge; Rodriguez Hontoria, Horacio
    Procesamiento del lenguaje natural
    Date of publication: 2010
    Journal article

  • DIGUI: A Flexible Dialogue System for Guiding the User Interaction to Access Web Services

     Gonzalez Bermudez, Meritxell
    Defense's date: 2010-10-22
    Facultat d'Informàtica de Barcelona (UPC)
    Theses

  • Multilingual On-Line Translation

     Rodriguez Hontoria, Horacio; Gonzalez Bermudez, Meritxell; España Bonet, Cristina; Farwell, David Loring; Carreras Perez, Xavier; Xambó Descamps, Sebastian; Màrquez Villodre, Lluís; Padró Cirera, Lluís; Saludes Closa, Jordi
    Participation in a competitive project

  • WRPA: a system for relational paraphrase acquisition from Wikipedia

     Vila, Marta; Rodriguez Hontoria, Horacio; Martí, Maria Antònia
    Procesamiento del lenguaje natural
    Date of publication: 2010
    Journal article

  • Anotación morfo-sintáctica y semántica de corpus: adquisición y uso

     Rodriguez Hontoria, Horacio
    Date of publication: 2009
    Book chapter

  • Automatically extending NE coverage of Arabic WordNet using Wikipedia

     Alkhalifa, Musa; Rodriguez Hontoria, Horacio
    International Conference on Arabic Language Processing
    Presentation's date: 2009
    Presentation of work at congresses

  • TALP at GikiCLEF 2009  Open access

     Ferrés Domènech, Daniel; Rodriguez Hontoria, Horacio
    Conference on Multilingual and Multimodal Information Access Evaluation
    Presentation of work at congresses

    This paper describes our experiments in Geographical Information Retrieval with the Wikipedia collection in the context of our participation in the GikiCLEF 2009 Multilingual task in English and Spanish. Our system, called gikiTALP, follows a very simple approach that uses standard Information Retrieval with the Sphinx full-text search engine and some Natural Language Processing techniques, without Geographical Knowledge.

    Postprint (author’s final draft)

  • CoCo, a web interface for corpora compilation  Open access

     España Bonet, Cristina; Vila Rigat, Marta; Rodriguez Hontoria, Horacio; Martí, Maria Antònia
    Procesamiento del lenguaje natural
    Date of publication: 2009
    Journal article

    CoCo es una interfaz web colaborativa para la compilación de recursos lingüísticos. En esta demo se presenta una de sus posibles aplicaciones: la obtención de paráfrasis. / CoCo is a collaborative web interface for the compilation of linguistic resources. In this demo we are presenting one of its possible applications: paraphrase acquisition.

  • Hacia un sistema de clasificación automática de sustantivos deverbales

     Peris, Aina; Taulé Delor, Mariona; Rodriguez Hontoria, Horacio
    Procesamiento del lenguaje natural
    Date of publication: 2009
    Journal article

  • GPLN: GRUP DE PROCESSAMENT DEL LLENGUATGE NATURAL

     Rodriguez Hontoria, Horacio
    Participation in a competitive project

  • GeoTextMESS: result fusion with fuzzy Borda ranking in geographical information retrieval  Open access

     Buscaldi, Davide; Perea Ortega, Jose Manuel; Rosso, Paolo; Ureña López, L. Alfonso; Ferrés Domènech, Daniel; Rodriguez Hontoria, Horacio
    Lecture notes in computer science
    Date of publication: 2009
    Journal article

    In this paper we discuss the integration of different GIR systems by means of a fuzzy Borda method for result fusion. Two of the systems, the one from the Universidad Politécnica de Valencia and the one from the Universidad de Jaén, participated in the GeoCLEF task under the name TextMess. The proposed result-fusion method takes as input the document lists returned by the different systems and returns a document list in which the documents are ranked according to the fuzzy Borda voting scheme. The obtained results show that the fusion method improves the results of the component systems, although the fusion is not optimal, because it is effective only if the components return a similar set of relevant documents.
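
    A compact sketch of a fuzzy Borda style fusion follows (one common formulation, assumed here; the paper's exact variant may differ): each system's retrieval scores define pairwise preference degrees between documents, and a document accumulates, from every system, the preference degrees above 0.5 that it wins. The function name and scores are illustrative.

        def fuzzy_borda(runs):
            """runs: list of {doc_id: score} dicts, one per retrieval system."""
            docs = sorted({d for run in runs for d in run})
            totals = {d: 0.0 for d in docs}
            for run in runs:
                for d in docs:
                    for other in docs:
                        if d == other:
                            continue
                        sd, so = run.get(d, 0.0), run.get(other, 0.0)
                        if sd + so == 0:
                            continue
                        pref = sd / (sd + so)          # preference degree of d over other
                        if pref > 0.5:                 # only winning preferences count
                            totals[d] += pref
            return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

        run_a = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.1}
        run_b = {"doc1": 0.2, "doc2": 0.7, "doc3": 0.6}
        print(fuzzy_borda([run_a, run_b]))             # fused ranking of documents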

    Postprint (author’s final draft)

  • A Flexible Multitask Summarizer for Documents from Different Media, Domain and Language  Open access

     Fuentes Fort, Maria
    Defense's date: 2008-03-31
    Department of Software, Universitat Politècnica de Catalunya
    Theses

    Automatic summarization is probably crucial with the increase of document generation, particularly now that retrieving, managing and processing information have become decisive tasks. However, one should not expect perfect systems able to substitute human summaries. The automatic summarization process strongly depends not only on the characteristics of the documents, but also on the different needs of users. Thus, several aspects have to be taken into account when designing an information system for summarizing because, depending on the characteristics of the input documents and the desired results, several techniques can be applied. In order to support this process, the final goal of the thesis is to provide a flexible multitask summarizer architecture. This goal is decomposed into three main research purposes. First, to study the process of porting systems to different summarization tasks, processing documents in different languages, domains or media, with the aim of designing a generic architecture that permits the easy addition of new tasks by reusing existing tools. Second, to develop prototypes for some tasks, involving aspects related to the language, the media and the domain of the document or documents to be summarized, as well as aspects related to the summary content: generic summaries, novelty summaries, or summaries that answer a specific user need. Third, to create an evaluation framework to analyze the performance of several approaches in the written news and scientific oral presentation domains, focusing mainly on intrinsic evaluation.

    El resumen automático probablemente sea crucial en un momento en que la gran cantidad de documentos generados diariamente hace que recuperar, tratar y asimilar la información que contienen se haya convertido en una ardua y a su vez decisiva tarea. A pesar de ello, no podemos esperar que los resúmenes producidos de forma automática vayan a ser capaces de sustituir a los humanos. El proceso de resumen automático no sólo depende de las características propias de los documentos a ser resumidos, sino que es fuertemente dependiente de las necesidades específicas de los usuarios. Por ello, el diseño de un sistema de información para resumen conlleva tener en cuenta varios aspectos. En función de las características de los documentos de entrada y de los resultados deseados es posible aplicar distintas técnicas. Por esta razón surge la necesidad de diseñar una arquitectura flexible que permita la implementación de múltiples tareas de resumen. Este es el objetivo final de la tesis que presento dividido en tres subtemas de investigación. En primer lugar, estudiar el proceso de adaptabilidad de sistemas a diferentes tareas de resumen, como son procesar documentos producidos en diferentes lenguas, dominios y medios (sonido y texto), con la voluntad de diseñar una arquitectura genérica que permita la fácil incorporación de nuevas tareas a través de reutilizar herramientas existentes. En segundo lugar, desarrollar prototipos para distintas tareas, teniendo en cuenta aspectos relacionados con la lengua, el dominio y el medio del documento o conjunto de documentos que requieren ser resumidos, así como aspectos relacionados con el contenido final del resumen: genérico, novedad o resumen que de respuesta a una necesidad especifica. En tercer lugar, crear un marco de evaluación que permita analizar la competencia intrínseca de distintos prototipos al resumir noticias escritas y presentaciones científicas orales.

  • TALP at TAC 2008: a semantic approach to recognizing textual entailment

     Ageno Pulido, Alicia; Cruz, Fermín; Farwell, David Loring; Ferrés, Daniel; Rodriguez Hontoria, Horacio; Turmo Borras, Jorge
    Text Analysis Conference
    Presentation of work at congresses

  • Arabic WordNet: current state and future extensions

     Rodriguez Hontoria, Horacio; Farwell, David Loring; Farreres De La Morena, Javier; Bertran, Manuel; Alkhalifa, Musa; Martí, Maria Antònia; Elkateb, Sabri; Black, William; Kirk, James; Pease, Adam; Vossen, Piek; Felbaum, Christianne
    International WordNet Conference
    Presentation's date: 2008
    Presentation of work at congresses

  • Arabic WordNet: semi-automatic extensions using Bayesian inference

     Rodriguez Hontoria, Horacio; Farwell, David Loring; Farreres De La Morena, Javier; Bertran, Manuel; Alkhalifa, Musa; Martí, Maria Antònia
    International Conference on Language Resources and Evaluation
    Presentation's date: 2008
    Presentation of work at congresses

  • TALP at GeoQuery 2007: linguistic and geographical analysis for query parsing

     Ferrés, Dani; Rodriguez Hontoria, Horacio
    Lecture notes in computer science
    Date of publication: 2008
    Journal article

  • TALP at GeoCLEF 2007: results of a geographical knowledge filtering approach with Terrier

     Ferrés, Dani; Rodriguez Hontoria, Horacio
    Lecture notes in computer science
    Date of publication: 2008
    Journal article

  • Kornai, András: Mathematical Linguistics (book review)

     Rodriguez Hontoria, Horacio
    Machine translation
    Date of publication: 2008
    Journal article

  • FEMsum at DUC 2007  Open access

     Fuentes Fort, Maria; Rodriguez Hontoria, Horacio; Ferrés Domènech, Daniel
    Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics
    Presentation's date: 2007-06-26
    Presentation of work at congresses

    This paper describes and analyzes how the FEMsum system deals with the DUC 2007 tasks of providing summary-length answers to complex questions, for both background and just-the-news summaries. We participated in producing background summaries for the main task with the FEMsum approach that obtained the best results in our participation last year. The FEMsum semantic-based approach was adapted to deal with the update pilot task with the aim of producing just-the-news summaries.

    Postprint (author’s final draft)

  • Support vector machines for query-focused summarization trained and evaluated on pyramid data  Open access

     Fuentes Fort, Maria; Alfonseca, Enrique; Rodriguez Hontoria, Horacio
    Annual Meeting of the Association for Computational Linguistics
    Presentation's date: 2007-06-25
    Presentation of work at congresses

    This paper presents the use of Support Vector Machines (SVM) to detect relevant information to be included in a query-focused summary. Several SVMs are trained using information from pyramids of summary content units. Their performance is compared with the best-performing systems in DUC-2005, using both ROUGE and autoPan, an automatic scoring method for pyramid evaluation.

  • Machine learning with semantic-based distances between sentences for textual entailment

     Ferrés, Dani; Rodriguez Hontoria, Horacio
    ACL-PASCAL Workshop on Textual Entailment and Paraphrasing
    Presentation's date: 2007
    Presentation of work at congresses

  • The UPC System for Arabic-to-English Entity Translation

     Farwell, David Loring; Gimenez Linares, Jesús Ángel; González Pellicer, Edgar; Halkoum, Reda; Rodriguez Hontoria, Horacio; Surdeanu, Mihai
    Automatic Content Extraction (ACE) Entity Translation (ET) 2007 Pilot Evaluation
    Presentation of work at congresses

  • TALP at GeoCLEF 2006: experiments using JIRS and Lucene with the ADL feature type thesaurus

     Ferrés, Dani; Rodriguez Hontoria, Horacio
    Lecture notes in computer science
    Date of publication: 2007
    Journal article
