AUTHECO: AUtomatic THEsaurus COnstruction

Start: October 2008 Duration: 4 x 12 months Funding: Wallonia-Brussels International Excellence Grant

Semantic networks and thesauri are powerful tools for representing knowledge about a particular problem domain such as agriculture, finance, or medicine. A well-constructed thesaurus is recognized as a valuable source of semantic information for various applications, especially Information Retrieval.

An information retrieval thesaurus describes a knowledge domain by listing its main concepts and the semantic relations between them. In its simplest form, a thesaurus consists of a list of important terms and the semantic relations between them. Thesauri have been used in documentation management projects for years; libraries and documentation centers used them long before the computer era. This long tradition and the more recent success of thesaurus-based information systems have led to the adoption of thesaurus-based techniques by industry and to the development of international standards such as ANSI Z39.19-2005.

The main purposes of a thesaurus are (1) to provide a standard vocabulary for indexing and searching, (2) to help users locate terms for proper query formulation, and (3) to provide classified hierarchies that allow the current request to be broadened or narrowed according to the needs of the user.
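The three purposes above can be illustrated with a small in-memory structure. This is only a sketch: the class names (Concept, Thesaurus), the method names, and the agricultural terms are assumptions made for illustration and are not taken from EuroVOC, AgroVOC, or the project's software.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One thesaurus entry with the usual relation types (hypothetical layout)."""
    preferred: str
    synonyms: set = field(default_factory=set)   # non-preferred entry terms
    broader: set = field(default_factory=set)    # more general concepts
    narrower: set = field(default_factory=set)   # more specific concepts


class Thesaurus:
    def __init__(self):
        self.concepts = {}     # preferred term -> Concept
        self.entry_terms = {}  # any known term (lowercased) -> preferred term

    def add(self, preferred, synonyms=(), broader=()):
        concept = self.concepts.setdefault(preferred, Concept(preferred))
        concept.synonyms.update(synonyms)
        for term in (preferred, *synonyms):
            self.entry_terms[term.lower()] = preferred
        for parent in broader:
            parent_concept = self.concepts.setdefault(parent, Concept(parent))
            self.entry_terms.setdefault(parent.lower(), parent)
            concept.broader.add(parent)
            parent_concept.narrower.add(preferred)

    def normalize(self, term):
        """Purposes 1 and 2: map a user term to the standard vocabulary."""
        return self.entry_terms.get(term.lower())

    def broaden(self, term):
        """Purpose 3: more general concepts for the given term."""
        preferred = self.normalize(term)
        return sorted(self.concepts[preferred].broader) if preferred else []

    def narrow(self, term):
        """Purpose 3: more specific concepts for the given term."""
        preferred = self.normalize(term)
        return sorted(self.concepts[preferred].narrower) if preferred else []


# Toy data, invented for this example.
thesaurus = Thesaurus()
thesaurus.add("cereals", broader=["crops"])
thesaurus.add("maize", synonyms=["corn"], broader=["cereals"])
thesaurus.add("wheat", broader=["cereals"])

print(thesaurus.normalize("Corn"))
print(thesaurus.narrow("cereals"))
print(thesaurus.broaden("corn"))
```

A query for "corn" is first mapped to the preferred term "maize"; the request can then be broadened to "cereals" or, starting from "cereals", narrowed to "maize" and "wheat".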

EuroVOC is one example of a large contemporary information retrieval thesaurus: it is used for indexing documents of the European Parliament, the Office for Official Publications of the European Communities, and many other European institutions. Another well-known thesaurus is AgroVOC, a multilingual, structured, and controlled vocabulary designed to cover the terminology of all subject fields in agriculture, forestry, fisheries, food, and related domains. This resource was created by the Food and Agriculture Organization of the United Nations (FAO) and has many applications all over the world.

The main hindrances to using thesaurus-oriented approaches are the high complexity and cost of manual thesaurus creation. The traditional way of constructing a thesaurus involves a great amount of manual labor and has proved to be very time-consuming and costly. Furthermore, it offers no easy way to keep semantic resources up to date. All these factors limit the application of thesaurus-oriented approaches. One solution to this problem is to develop a technology that automates thesaurus construction. This is the main objective of this research project.

Abstract

This research project aims to develop an information extraction technology for the automatic construction of semantic networks and thesauri from a corpus of domain-specific texts.

The project investigates two problems of (semi-)automatic thesaurus generation from text corpora. The first problem is selecting salient domain-specific terms from a corpus. This subtask is also known as term extraction. For instance, a thesaurus of the medical domain such as MeSH should include terms that are relevant to medicine, such as "poliomyelitis", "hepatitis A", or "swine vesicular disease", but not terms such as "gear box" or "building". The second problem is extracting meaningful semantic relations between the terms of a domain, such as synonymy, hyponymy, and association. This subtask is known as relationship extraction.
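One common way to approach the first problem is to compare how often a word occurs in the domain corpus with how often it occurs in a general-language corpus. The sketch below scores words by that ratio. It is an illustration under stated assumptions, not the method developed in this project: the function names, the add-one smoothing, the min_count threshold, and the two toy corpora are all invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def domain_specificity(domain_texts, general_texts, min_count=2):
    """Rank words by relative frequency in the domain corpus divided by
    (smoothed) relative frequency in the general corpus."""
    domain = Counter(t for text in domain_texts for t in tokenize(text))
    general = Counter(t for text in general_texts for t in tokenize(text))
    domain_total = sum(domain.values())
    general_total = sum(general.values())
    vocabulary_size = len(set(domain) | set(general))

    scores = {}
    for word, count in domain.items():
        if count < min_count:
            continue  # too rare in the domain corpus to judge
        p_domain = count / domain_total
        # Add-one smoothing so words unseen in the general corpus do not
        # cause a division by zero.
        p_general = (general[word] + 1) / (general_total + vocabulary_size)
        scores[word] = p_domain / p_general
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy corpora, invented for this example.
medical_texts = [
    "Hepatitis A is a viral disease of the liver.",
    "Poliomyelitis is a viral disease.",
    "The hepatitis vaccine protects the liver.",
]
general_texts = [
    "The building is near the station.",
    "A gear box is a part of the car.",
]

ranking = domain_specificity(medical_texts, general_texts)
for word, score in ranking:
    print(f"{word}\t{score:.2f}")
```

On this toy input, domain words such as "hepatitis" and "liver" rank above function words such as "the" and "is", while "building" never enters the ranking because it does not occur in the medical texts. The sketch handles single words only; multiword terms such as "hepatitis A" would require counting n-grams as well.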

The project involves using and developing techniques from Natural Language Processing, Computer Science, and Data Mining. The proposed technology will be implemented in a prototype system for automating thesaurus construction. To make the project's outcome more valuable to the Wallonia region, the experiments will be conducted on datasets (text corpora) corresponding to the priority domains of the Marshall Plan: agro-industry, transport and logistics, life sciences, mechanical engineering, and aeronautics-aerospace.

Demo

You can try a small demo of the technology developed within this project. The system is a kind of "lexico-semantic search engine": given a text query, it returns a list of related words, whereas a traditional search engine returns a list of related documents. The current version is based on two semantic similarity measures: Serelex and PatternSim. The first relies on definitions of words, while the second relies on a text corpus.
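To give an idea of how a definition-based measure can return related words for a query, the sketch below compares words by the cosine similarity of their definitions. It is a simplified illustration and not the actual Serelex algorithm: the stopword list, the function names, and the four toy definitions are assumptions made for this example.

```python
import math
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "and", "or", "that", "with", "for", "in", "to", "is"}


def bag_of_words(text):
    """Count the content words of a definition."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def related_words(query, definitions, top_k=3):
    """Return up to top_k words whose definitions overlap with the query's."""
    if query not in definitions:
        return []
    query_vector = bag_of_words(definitions[query])
    scored = [
        (word, cosine(query_vector, bag_of_words(text)))
        for word, text in definitions.items()
        if word != query
    ]
    scored = [(word, score) for word, score in scored if score > 0]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Toy definitions, invented for this example.
definitions = {
    "car": "a road vehicle with an engine and four wheels",
    "truck": "a large road vehicle with an engine used for carrying goods",
    "bicycle": "a vehicle with two wheels moved by pedals",
    "apple": "a round fruit that grows on a tree",
}

for word, score in related_words("car", definitions):
    print(f"{word}\t{score:.3f}")
```

For the query "car", this toy data yields "truck" first and "bicycle" second, and "apple" is dropped because its definition shares no content words with the definition of "car".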

Publications

Panchenko A., Morozova O., Naets H. “A Semantic Similarity Measure Based on Lexico-Syntactic Patterns”. In Conference on Natural Language Processing (KONVENS 2012), Vienna (Austria), pp.174-178, 2012
Panchenko A., Beaufort R., Fairon C. “Detection of Child Sexual Abuse Media on P2P Networks: Normalization and Classification of Associated Filenames”. In Proceedings of Workshop on Language Resources for Public Security Applications of the 8th International Conference on Language Resources and Evaluation (LREC), 2012
Panchenko A., Adeykin S., Romanov P., Romanov A., “Extraction of Semantic Relations between Concepts with KNN Algorithms on Wikipedia”. In Proceedings of Concept Discovery in Unstructured Data Workshop (CDUD) of International Conference On Formal Concept Analysis, pp.78-88, Belgium, 2012
Panchenko A., Morozova O. “A Study of Hybrid Similarity Measures for Semantic Relation Extraction”. In Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data of the European Chapter of the Association for Computational Linguistics (EACL), pp.10-18, France, 2012
Panchenko A. “A Study of Heterogeneous Similarity Measures for Semantic Relation Extraction”. In Proceedings of 14e Rencontres des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (JEP-TALN-RECITAL), 2012
Panchenko A., Adeykin S., Romanov P., Romanov A., “Extraction of Semantic Relations from Wikipedia Articles with KNN Algorithms”. In AIST Conference (In Russian), 2012
Panchenko A. “Towards an Efficient Combination of Similarity Measures for Semantic Relation Extraction”. Abstract in Proceedings of The 22nd Meeting of Computational Linguistics in the Netherlands (CLIN 22), 2012
Panchenko A. “Comparison of the Knowledge-, Corpus-, and Web-based Similarity Measures for Semantic Relations Extraction”. In Proceedings of the Workshop GEometrical Models of Natural Language Semantics (GEMS) of EMNLP, pp.10-18, 2011
Panchenko A. “Can We Automatically Reproduce Semantic Relations of an Information Retrieval Thesaurus?”. In Proceedings of the 4th Russian Summer School in Information Retrieval YSC RuSSIR, pp.36–51, 2010. Reprint: Panchenko A. “Method for Automatic Construction of Semantic Relations of an Information Retrieval Thesaurus”. In Herald of the Voronezh State University, vol.2, pp.131–139, 2011
Panchenko A. “Computing Semantic Relations from Heterogeneous Evidence”. Abstract in CLIN 21, University College of Ghent, pp.39, 2011

Developments

Lexico-Semantic Search Engine. A web application that lets you find semantically related words. Written in JavaScript (node.js + mongodb). Source code. A collaborative development with Pavel Romanov.
Serelex. A tool for semantic relation extraction based on definitions from Wikipedia, Wiktionary, and the like. Written in C++. A collaborative development.
PatternSim. A tool for calculating semantic similarity between words from a text corpus based on lexico-syntactic patterns. A collaborative development.
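As an illustration of what a lexico-syntactic pattern looks like, the sketch below extracts candidate hyponym-hypernym pairs with a single "X such as A, B, or C" pattern. It is a minimal example and does not reproduce PatternSim: the regular expression, the function name, and the sample text are assumptions, and only single-word terms are handled.

```python
import re

# One pattern of the form "<hypernym> such as <w1>, <w2>, ... and|or <wn>".
# The negative lookahead keeps "and"/"or" from being read as list items.
SUCH_AS = re.compile(
    r"(\w+) such as (\w+(?:, (?!and |or )\w+)*(?:,? (?:and|or) \w+)?)"
)


def extract_relations(text):
    """Return (hyponym, hypernym) candidate pairs found by the pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text.lower()):
        hypernym = match.group(1)
        hyponyms = re.split(r",? (?:and|or) |, ", match.group(2))
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs


# Sample text, invented for this example.
sample = (
    "Thesauri cover domains such as agriculture, finance, or medicine. "
    "Viral diseases such as poliomyelitis and hepatitis are monitored."
)

for hyponym, hypernym in extract_relations(sample):
    print(f"{hyponym} -> {hypernym}")
```

Words extracted by the same pattern occurrence (for example "poliomyelitis" and "hepatitis") can also be treated as evidence that they are semantically related to each other, which is the general intuition behind pattern-based similarity.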

Researcher

Alexander Panchenko

Advisor

Prof. Cédrick Fairon
