Ulrich Heid (PhD in 1995, außerplanmäßiger Professor at Universität Stuttgart, Germany) has worked on projects in computational lexicography since 1991, with a focus on tools for corpus lexicography, models of lexical representation for electronic dictionaries, and phraseology for both general and specialized language. He is one of the co-editors of a fourth volume of the well-known HSK Handbook on Dictionaries (Hausmann, Reichmann, Wiegand, Zgusta 1989). This volume (to appear 2011) will be partly devoted to computational lexicography.



Aspects of lexical description for electronic dictionaries


Printed monolingual definition dictionaries are primarily structured as lists of lists (cf. Tarp 2006): each entry is a list of items (“Angaben” in Wiegand's sense), i.e. of lexicographical data allowing the user to infer information about different linguistic properties of the lexical object treated in the entry, for example its morphological, syntactic, semantic and pragmatic properties (microstructure). These entries are themselves listed, e.g. alphabetically (macrostructure), and some of them are related by links. The links are usually of only a few types: synonymy, antonymy, an unspecific “see also” relation, etc. Electronic dictionaries which have been transformed from printed ones tend to conserve this linear structure, possibly with the inclusion of general links from each word form appearing in one of the items to that word's entry.

The claim underlying this paper is that electronic dictionaries could do better: they would serve users in different situations much better if they included more relations, and if the dictionaries themselves were typed, i.e. based on a typology of lexical objects, properties and relations. In addition to browsing use, this would support more focused search in the dictionary, as well as non-standard access to lexicographic data (i.e. access not based on a single lemma).

We will illustrate our claim by outlining partial descriptive models of the intended kind for certain types of multiword expressions and for marked vocabulary: we look at noun+verb collocations (pay attention; a question arises), at German predicative PPs in copula constructions (er ist aus dem Häuschen (“he is excited”), das ist an der Zeit (“it is time for this”)), and at geographically, domain-wise or otherwise marked vocabulary.

To treat the latter, relations between marked and unmarked lexical objects are needed, among others, and probably also directed links (e.g. from dispreferred to preferred items). For multiword expressions, two sets of properties and relations are needed: (i) those that apply to elements of the multiword, and (ii) those that concern the multiword as a whole. Deciding to use type (ii) implies that multiword expressions get the same status (of treatment units, cf. Heid/Gouws 2006) as single-word lexical objects.
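The distinction between element-level and whole-expression descriptions, and the directed link from a marked to an unmarked item, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the model proposed in the paper: the class names, the relation labels has_component and unmarked_equivalent, the property passivizable, and the Austrian example Jänner/Januar are all chosen here for the sake of the example.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalObject:
    form: str
    kind: str                        # object type, e.g. "noun", "collocation"
    properties: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Relation:
    kind: str                        # relation type (hypothetical labels)
    source: str                      # relations are directed: source -> target
    target: str


class Lexicon:
    def __init__(self):
        self.objects = {}
        self.relations = []

    def add(self, form, kind, **properties):
        self.objects[form] = LexicalObject(form, kind, properties)

    def relate(self, kind, source, target):
        # Both ends must already be treatment units in the lexicon.
        for form in (source, target):
            if form not in self.objects:
                raise KeyError(f"unknown lexical object: {form}")
        self.relations.append(Relation(kind, source, target))

    def targets(self, source, kind):
        return [r.target for r in self.relations
                if r.source == source and r.kind == kind]


lexicon = Lexicon()

# (i) the elements of the multiword are lexical objects of their own
lexicon.add("attention", "noun")
lexicon.add("pay", "verb")
# (ii) the multiword as a whole is a treatment unit with its own properties
lexicon.add("pay attention", "collocation", passivizable=True)
lexicon.relate("has_component", "pay attention", "attention")
lexicon.relate("has_component", "pay attention", "pay")

# Marked vocabulary: a directed link from the marked to the unmarked item
lexicon.add("Januar", "noun")
lexicon.add("Jänner", "noun", region="Austria")
lexicon.relate("unmarked_equivalent", "Jänner", "Januar")
```

Because the link is directed, looking up the unmarked equivalent of Jänner yields Januar, while the same lookup on Januar yields nothing.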

A typed dictionary of the above kind will enhance access possibilities, and it allows for a simple integration of function-specific views (in the sense of Bergenholtz and Tarp's function theory): we will provide small sample fragments of German noun+verb collocations and of predicative PPs, along with function-specific views, for text understanding, for text production, and for specific production-related search. These views should be created by applying constraints on the selection of lexicographic data (filters: which types of properties and relations are relevant for a user in a given situation?) and on its presentation (sequencing, layout, etc. on screen).
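A function-specific view of this kind can be thought of as a filter plus a sequencing rule over typed items. The following Python sketch illustrates the idea under assumptions of our own: the item kinds (syntax, example, meaning, register), the two view definitions and the register value are hypothetical, and only the example sentence and its gloss come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    kind: str      # type of property or relation the item expresses
    value: str


@dataclass(frozen=True)
class View:
    name: str
    order: tuple   # item kinds to show, in presentation order

    def render(self, items):
        rank = {kind: i for i, kind in enumerate(self.order)}
        # Filter: keep only the item kinds relevant for this function.
        selected = [it for it in items if it.kind in rank]
        # Sequencing: present them in the order the view prescribes.
        selected.sort(key=lambda it: rank[it.kind])
        return [f"{it.kind}: {it.value}" for it in selected]


# Hypothetical data for the predicative PP "aus dem Häuschen (sein)"
ENTRY = [
    Item("syntax", "PP + copula (sein)"),
    Item("example", "er ist aus dem Häuschen"),
    Item("meaning", "be excited"),
    Item("register", "informal"),   # assumed value, for illustration only
]

UNDERSTANDING = View("text understanding", ("meaning", "example"))
PRODUCTION = View("text production", ("syntax", "register", "example"))

for view in (UNDERSTANDING, PRODUCTION):
    print(view.name, "->", view.render(ENTRY))
```

The same stored entry thus yields different presentations, depending only on which view is applied; the data itself is not duplicated per function.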

With the proposed model in mind, we finally briefly analyse existing online dictionaries (an online portal, and the learner's dictionaries ELDIT (Abel/Weber 2005 etc.), BLF (Verlinde et al. 2006) and DICE (www.dicesp.com)), as well as current representational proposals (e.g. ISO-1951, Lexical Systems (Polguère 2006), and LMF (Francopoulo et al. 2006)). The portals analysed (e.g. StarDict (http://stardict.sourceforge.net)) mainly reproduce the printed dictionaries, enable parallel access to all included dictionaries, and add cross-references from word forms in the article text to the respective entries. ISO-1951 is also mainly focussed on the reproduction of printed material, providing a meta-representation general enough to host quite different print dictionary formats; Lexical Systems is another format for the cohabitation of data from different sources; and LMF provides a general meta-model for dictionary data for NLP, covering a broad range of dictionary types. The electronic learner's dictionaries ELDIT and BLF are designed to provide a substantial number of different relations between lexical objects. Nevertheless, they seem to be made mainly for browsing rather than for focused search, as their query support is still mainly lemma-based.

As far as implementations of our typed dictionary model are concerned, both relational (or object) databases and typed formalisms like OWL-DL (cf. Bechhofer et al. 2004) seem to be appropriate. Spohr (2008), for example, is working with OWL-DL. At the eLexicography conference, he presents a prototype of a multifunctional dictionary based on OWL-DL which accounts for different users' needs (Spohr 2009). We see an interesting research potential in combining available representation systems (like OWL-DL) with a lexicographic data description detailed enough to account for a wide variety of usage scenarios.
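To make the relational option concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not taken from any of the cited systems: the table layout, the relation labels has_base and has_collocate, and the helper collocations_with_base are assumptions made for illustration. The query shows access that does not start from the lemma of the entry itself, namely retrieving all collocations built on a given noun.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lexical_object (
    id   INTEGER PRIMARY KEY,
    form TEXT NOT NULL,
    type TEXT NOT NULL            -- typed objects: noun, verb, collocation
);
CREATE TABLE relation (
    source_id INTEGER NOT NULL REFERENCES lexical_object(id),
    target_id INTEGER NOT NULL REFERENCES lexical_object(id),
    type      TEXT NOT NULL       -- typed, directed relations
);
""")

conn.executemany("INSERT INTO lexical_object VALUES (?, ?, ?)", [
    (1, "attention", "noun"),
    (2, "pay", "verb"),
    (3, "pay attention", "collocation"),
    (4, "question", "noun"),
    (5, "arise", "verb"),
    (6, "a question arises", "collocation"),
])
conn.executemany("INSERT INTO relation VALUES (?, ?, ?)", [
    (3, 1, "has_base"),
    (3, 2, "has_collocate"),
    (6, 4, "has_base"),
    (6, 5, "has_collocate"),
])


def collocations_with_base(noun):
    """Return all collocations whose base is the given noun."""
    rows = conn.execute("""
        SELECT c.form
        FROM lexical_object AS c
        JOIN relation AS r ON r.source_id = c.id AND r.type = 'has_base'
        JOIN lexical_object AS b ON b.id = r.target_id
        WHERE c.type = 'collocation' AND b.form = ?
        ORDER BY c.form
    """, (noun,)).fetchall()
    return [form for (form,) in rows]


print(collocations_with_base("question"))
```

Keeping the relation type in a column rather than in the table name means new relation types can be added without changing the schema, at the cost of leaving type checking to the application.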




