Marie-Claude L’Homme is a full professor in the Department of Linguistics and Translation at the University of Montreal, where she teaches terminology and computer-assisted translation. She is also the director of the research group “Observatoire de linguistique Sens-Texte” (OLST). Her main research interests are lexical semantics and corpus linguistics applied to terminology. She is co-editor, with Kyo Kageura (University of Tokyo), of the journal Terminology.
Designing specialized dictionaries with natural language processing techniques: A state of the art
Over the last few decades, terminology work has changed drastically, due mostly to the introduction of computer applications and the availability of corpora in electronic form. Although the main steps of the methodology have remained basically the same (compiling corpora, finding relevant terms in these corpora, locating data that can help describe terms, inserting the information collected during the previous steps into a record, updating records, etc.), the way in which the data is handled is completely different.
In this talk, I will present a methodology for compiling an online specialized dictionary that incorporates natural language processing applications. The dictionary considered is representative of a new generation of specialized dictionaries that aim to give users access to rich linguistic information based mostly on data collected from specialized corpora. These reference works differ from most specialized dictionaries, which aim at providing users with explanations of concepts similar to those given in encyclopaedias. The dictionary I will present includes terms related to computing and the Internet and provides, for each of them: fine-grained semantic distinctions, argument structure, combinatorial possibilities of terms with other terms of the domain, lists of lexical relationships (e.g., synonyms, antonyms, hyperonyms, collocates), etc. The dictionary also provides syntactic and semantic annotations of contexts in which terms appear.
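The kind of rich term record described above can be pictured as a structured object. The following sketch is purely illustrative: the field names and example data are assumptions for this abstract, not the actual schema of the dictionary presented in the talk.

```python
# Hypothetical sketch of a rich term record: fine-grained sense, argument
# structure, lexical relations, and annotated contexts (illustrative only).
term_record = {
    "term": "configure",
    "pos": "verb",
    "sense": 1,                                # fine-grained semantic distinction
    "argument_structure": "X configures Y",    # actants of the predicate
    "lexical_relations": {
        "synonyms": ["set up"],
        "collocates": ["automatically", "manually"],
    },
    "contexts": [
        {
            "sentence": "The installer configures the server automatically.",
            # semantic annotation linking actants to corpus realizations
            "annotations": {"X": "installer", "Y": "server"},
        }
    ],
}

def related(record, relation):
    """Return the terms linked to a record by a given lexical relation."""
    return record["lexical_relations"].get(relation, [])

print(related(term_record, "synonyms"))
```

Encoding relations explicitly, rather than burying them in a definition, is what lets such records be cross-linked (step 6 of the methodology below) and queried by users.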
First, the six basic steps of the methodology will be described: 1. compilation of corpora; 2. identification of relevant terminological units; 3. collection of data from corpora; 4. analysis of the data collected; 5. compilation of term records; 6. establishment of relationships between term records. I will then show how some resources and tools can assist terminologists during these steps and present some of the challenges that their introduction into terminology work has raised. I will focus on: a. management of corpora in electronic form for terminology purposes; b. annotation of corpora (part-of-speech tagging and lemmatization); c. term extraction; d. automatic or semi-automatic identification of information on terms in corpora, especially for finding semantic relationships (e.g. hyperonymic relationships, collocations, or predicate-argument relationships); e. formalisms for encoding terminological data. The point of view taken when presenting computer applications will be that of users rather than that of developers.
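One family of term-extraction techniques compares word frequencies in a specialized corpus against a general reference corpus, on the assumption that domain terms are markedly over-represented in the former (the general idea behind corpus-comparison approaches such as Drouin 2003). The sketch below is a deliberately minimal single-word version with a toy scoring function; real extractors handle multi-word units, lemmatization, and statistically grounded specificity measures.

```python
from collections import Counter
import math

def candidate_terms(specialized, reference, min_count=2):
    """Rank single-word term candidates by how much more frequent they
    are in the specialized corpus than in a reference corpus (log ratio).
    Illustrative only: no lemmatization, no multi-word terms."""
    spec = Counter(w.lower() for text in specialized for w in text.split())
    ref = Counter(w.lower() for text in reference for w in text.split())
    n_spec, n_ref = sum(spec.values()), sum(ref.values())
    scores = {}
    for word, count in spec.items():
        if count < min_count:            # ignore hapax noise
            continue
        p_spec = count / n_spec
        p_ref = (ref[word] + 1) / (n_ref + 1)   # add-one smoothing
        scores[word] = math.log(p_spec / p_ref)
    return sorted(scores, key=scores.get, reverse=True)

# Toy corpora (assumed data for illustration)
spec_corpus = ["the router forwards packets",
               "configure the router interface",
               "the router drops packets"]
ref_corpus = ["the cat sat on the mat", "the dog chased the cat"]
print(candidate_terms(spec_corpus, ref_corpus))  # 'router' ranks first
```

Function words like "the" score low because they are equally frequent in both corpora, which is precisely the intuition the comparison exploits; the output is a candidate list that a terminologist still validates by hand.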
Then, I will illustrate how other resources and computer applications can assist terminologists carrying out bilingual terminology work. These applications include (in addition to those reviewed for monolingual specialized dictionary compilation): a. bilingual corpora; b. bilingual term extraction; c. comparison of term extraction results between languages. Specific challenges posed by these techniques will be discussed.
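Bilingual term extraction from a sentence-aligned corpus is often approached through co-occurrence statistics: source and target words that keep appearing in the same aligned sentence pairs are likely translation candidates. A minimal sketch using the Dice coefficient over aligned pairs (the scoring choice and toy data are assumptions for illustration, not the specific techniques discussed in the talk):

```python
from collections import Counter
from itertools import product

def dice_alignments(aligned_pairs, threshold=0.5):
    """Score (source word, target word) pairs by the Dice coefficient of
    their co-occurrence across aligned sentence pairs. Illustrative only:
    function words would need filtering in practice."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src, tgt in aligned_pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * c / (src_freq[s] + tgt_freq[t])
        for (s, t), c in pair_freq.items()
        if 2 * c / (src_freq[s] + tgt_freq[t]) >= threshold
    }

# Toy English-French aligned corpus (assumed data)
corpus = [
    ("the printer is offline", "l'imprimante est hors ligne"),
    ("restart the printer", "redémarrer l'imprimante"),
]
links = dice_alignments(corpus)
print(links[("printer", "l'imprimante")])  # 1.0
```

With only two sentence pairs the frequent word "the" also aligns strongly, which hints at one of the specific challenges mentioned above: co-occurrence alone cannot separate terms from function words without larger corpora or additional filtering.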
Interestingly, computer applications and the use of electronic corpora have changed the way terminologists consider specialized data and have led to the definition of new practices. This, in turn, has raised theoretical issues in terminology theory. I will look at some examples and examine their implications for terminology as a set of practices, but also as a discipline providing a theoretical framework for these practices. These changes have also created a need for terminologists with increased knowledge of natural language processing techniques, corpus linguistics, and lexical semantics. I will also examine some implications these changes have for training in terminology.
References (selected list)
Auger, A. and C. Barrière (eds.). 2008. Pattern-based Approaches to Semantic Relation Extraction: Special issue of Terminology 14(1).
Barrière, C. and A. Agbago. 2006. TerminoWeb: A Software Environment for Term Study in Rich Contexts. In International Conference on Terminology, Standardisation and Technology Transfer (TSTT 2006). Beijing, China.
Drouin, P. 2003. Term extraction using non-technical corpora as a point of leverage. Terminology 9(1), 99-117.
Kraif, O. 2008. Extraction automatique de lexique bilingue : application pour la recherche d'exemples en lexicographie. In F. Maniez, P. Dury, N. Arlin and C. Rougemont (eds.), Corpus et dictionnaires de langues de spécialité, 67-87. Grenoble: Presses universitaires de Grenoble.
L'Homme, M.C. 2004. La terminologie : principes et techniques. Montréal: Les Presses de l'Université de Montréal.
L'Homme, M.C. 2008. Le DiCoInfo. Méthodologie pour une nouvelle génération de dictionnaires spécialisés. Traduire 217, 78-103.