Presentation

The GREgORI Project (“GREgORI : Softwares, linguistic data and tagged corpus for ancient GReek and ORIental languages”), formerly the Research Project in Greek Lexicology (Projet de Recherche en Lexicologie Grecque; PRLG), has been conducted since 1990 at the Institut orientaliste (Oriental Institute) of the Université catholique de Louvain (UCL, Louvain-la-Neuve, Belgium).

It provides scholars with lemmatized concordances of Greek texts by Church Fathers and Byzantine historians. All these concordances are based on a stable standard of lexical examination.

The name GREgORI refers both to the Greek and Oriental languages and to Gregory Nazianzen. Historically, the PRLG grew out of the programme to edit the complete works of the Theologian, which ran for many years at the Oriental Institute.

Technical developments are carried out in collaboration with the CENTAL, a computing centre specialized in the study of natural language processing (NLP).

To date, the analyses cover a corpus of more than 4,000,000 word occurrences. All lexical data are collected in a dedicated electronic lexicon, the Dictionnaire Automatique Grec (Automated Greek Lexicon; DAG), designed to cover the whole vocabulary of every processed text. Each word form is classified under its corresponding lemma or lemmas, which are marked by part-of-speech tags and, for applications such as word-form disambiguation, by inflectional tags (download the PDF file with the GREgORI tagset for morphological and inflectional information). Gradually enriched, this lexical resource serves as the basis for lemmatization. Each new text analysis puts the contents of the DAG to the test and contributes to its development.
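The idea of a lexicon that files each word form under one or more lemmas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the entries, the tags "N" and "V", and the function names are hypothetical and do not reproduce the DAG format or the GREgORI tagset.

```python
import unicodedata


def normalize(form):
    """Bring a word form to Unicode NFC so that lookups are consistent."""
    return unicodedata.normalize("NFC", form)


# Hypothetical toy entries: word form -> list of (lemma, part-of-speech tag).
# "N" and "V" are illustrative placeholders, not the GREgORI tagset.
RAW_ENTRIES = {
    "λόγου": [("λόγος", "N")],
    "ἄγει": [("ἄγω", "V"), ("ἄγος", "N")],  # one form, two candidate lemmas
}
TOY_LEXICON = {normalize(form): entries for form, entries in RAW_ENTRIES.items()}


def analyses(form, lexicon=TOY_LEXICON):
    """Return every (lemma, tag) pair recorded for a word form."""
    return list(lexicon.get(normalize(form), []))


def is_ambiguous(form, lexicon=TOY_LEXICON):
    """A form is ambiguous when it is filed under more than one lemma."""
    return len({lemma for lemma, _ in analyses(form, lexicon)}) > 1


def unknown_forms(tokens, lexicon=TOY_LEXICON):
    """Forms missing from the lexicon, in order of first appearance."""
    missing = []
    for token in tokens:
        key = normalize(token)
        if key not in lexicon and key not in missing:
            missing.append(key)
    return missing
```

In this sketch, `is_ambiguous` flags the forms that would need disambiguation, and `unknown_forms` lists the forms that a new text would add to the lexicon, which mirrors how each analysis tests and enriches the resource.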

Greek texts are processed with UNITEX, an open-source package created by Sébastien Paumier at the Institut Gaspard Monge. UNITEX is used as a corpus processor (working with lexical and inflectional information from the dictionary), but also as a concordancer and a textual search engine. Once a text has been processed, the lemmatized and disambiguated lexical data are gathered in an SQL database, which makes it possible to publish concordances and lexical tools such as reverse indices, frequency indices, and lists of common or specific vocabulary.
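As a rough illustration of how lemmatized data held in an SQL database can yield such tools, the following Python sketch uses the standard `sqlite3` module. The table layout, the column names, the function names and the sample tokens are all assumptions made for the example; they do not describe the project's actual database.

```python
import sqlite3

# Hand-made sample of (form, lemma) pairs; illustrative data, not project output.
TOKENS = [
    ("ἐν", "ἐν"), ("ἀρχῇ", "ἀρχή"), ("ἦν", "εἰμί"), ("ὁ", "ὁ"),
    ("λόγος", "λόγος"), ("καὶ", "καί"), ("ὁ", "ὁ"), ("λόγος", "λόγος"),
    ("ἦν", "εἰμί"), ("πρὸς", "πρός"), ("τὸν", "ὁ"), ("θεόν", "θεός"),
]


def build_db(tokens):
    """Load the tokens into an in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE token (pos INTEGER PRIMARY KEY, form TEXT, lemma TEXT)"
    )
    conn.executemany(
        "INSERT INTO token (pos, form, lemma) VALUES (?, ?, ?)",
        [(i, form, lemma) for i, (form, lemma) in enumerate(tokens)],
    )
    return conn


def frequency_index(conn):
    """Lemmas with their occurrence counts, most frequent first."""
    return conn.execute(
        "SELECT lemma, COUNT(*) AS n FROM token "
        "GROUP BY lemma ORDER BY n DESC, lemma"
    ).fetchall()


def reverse_index(conn):
    """Distinct lemmas sorted by reversed spelling, which groups endings.

    Sorting is by code point, not by a proper Greek collation.
    """
    lemmas = [row[0] for row in conn.execute("SELECT DISTINCT lemma FROM token")]
    return sorted(lemmas, key=lambda lemma: lemma[::-1])


def concordance(conn, lemma, width=2):
    """Each occurrence of a lemma with `width` forms of context per side."""
    forms = [row[0] for row in conn.execute("SELECT form FROM token ORDER BY pos")]
    hits = conn.execute(
        "SELECT pos FROM token WHERE lemma = ? ORDER BY pos", (lemma,)
    ).fetchall()
    return [" ".join(forms[max(0, p - width): p + width + 1]) for (p,) in hits]
```

Calling `build_db(TOKENS)` and then `concordance(conn, "λόγος", width=1)` returns one context line per occurrence of the lemma, whatever inflected form appears in the text. Querying by lemma rather than by form is what lemmatization makes possible.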

Between 1990 and 2012, lemmatized concordances were published in the Thesaurus Patrum Graecorum series, a subset of the Corpus Christianorum.

Thanks to the stability of the standard of lexical examination, the computerized tools and language resources set up for the Greek texts of the Fathers and the historians of the Byzantine Empire are also applicable to Classical Greek. As a result, the processed texts have become fully lemmatized and disambiguated corpora, which constitute useful textual data for NLP.

Since 2014, the project has been extended beyond Greek to the analysis of other languages of the Christian East, mainly Syriac, Arabic, Armenian and Georgian. This opening to other languages has naturally led the project’s managers to integrate these languages into all the PRLG tools, paving the way for bilingual concordances and dictionaries. The purpose is to provide a corpus-based approach to translation methods in the Christian East.