Digital Matenagirk Hayots Project

Presentation and aim

The project, directed by Professor Bernard Coulie at UCLouvain and led by the GREgORI Project (UCLouvain) and Calfa (Paris), involves the processing of 2,319 pages and 542,410 wordforms from 5th-century AD texts written in Classical Armenian by 13 authors (Matenagirk Hayots, volumes 1 and 2):

  1. Grigor Lusavorich (53,886 wordforms)
  2. Sahak of Armenia (8,277 wordforms)
  3. Koriun (11,215 wordforms)
  4. P’awstos Buzand (59,740 wordforms)
  5. Eznik of Kolb (34,703 wordforms)
  6. Yeghishe (96,631 wordforms)
  7. Giwt Arahtzatsi (1,832 wordforms)
  8. Mambre Vertsanoł (14,465 wordforms)
  9. Anania T’argmanich (2,176 wordforms)
  10. Hovhannes Mandakuni (50,132 wordforms)
  11. Agathangelos (62,748 wordforms)
  12. Movses Khorenatsi (79,314 wordforms)
  13. Ghazar Parpetsi (67,291 wordforms)

Text transcription (OCR; accuracy 99.81%) and linguistic analysis (lemmatization and POS-tagging; accuracy under evaluation) are performed with specialized AI models developed within the scope of the project, with minimal manual proofreading of the results.

Scholars interested in obtaining Armenian texts that have not yet been processed are invited to email us (info-gregori@uclouvain.be or contact@calfa.fr) for terms and conditions.

Funding

This project has received funding from (in alphabetical order):

[Logo: Calouste] [Logo: GREgORI]

Members

Professor Emeritus Bernard Coulie (UCLouvain/CIOL)

Professor Emmanuel Van Elverdinghe (UCLouvain/CIOL)

Doctor Bastien Kindt (UCLouvain/CIOL)

Chahan Vidal-Gorène (École nationale des chartes – PSL and Calfa)

 

Related bibliography

Coulie B., Kindt B., Kepeklian G., Van Elverdinghe E., Étiquettes morphosyntaxiques et flexionnelles pour le traitement automatique de l’arménien ancien, in Le Muséon, 135 (1-2), p. 209-241.

Kindt B., Vidal-Gorène C., From manuscript to tagged corpora. An automated process for Ancient Armenian or other under-resourced languages of the Christian East, in Armeniaca. International Journal of Armenian Studies, 1 (2022), p. 73-96 (Web version).

Kindt B., Vidal-Gorène C., Delle Donne S., Analyse automatique du grec ancien par réseau de neurones. Évaluation sur le corpus De Thessalonica Capta, in BABELAO, 10-11 (2022), p. 525-550 (Web version).

Vidal-Gorène C., La reconnaissance automatique d'écriture à l'épreuve des langues peu dotées, in Programming Historian en français, 5 (2023) (Web version).

Vidal-Gorène C., Reconhecimento automático de manuscritos para o teste de idiomas não latinos, in O Programming Historian em português, 5 (2024) (Web version; translated from the French original published in 2023, Web version).

Available data

1. Editions (PDF files) available from the website of the Matenadaran (Yerevan, Armenia)

2. Processed data available on the DMHP’s Zenodo repository

All files are encoded in UTF-8 plain text format (ensuring data interoperability). For each processed text, the DMHP’s Zenodo repository offers the following files:

■ [file_name]_ocr.txt : UTF-8 plain text, raw OCR result with automated markups (page and line numbers of the PDF files)

■ [file_name]_text.txt : Derived from the [file_name]_ed.txt file, with automated markups (page and line numbers of the PDF files), hyphenation removed, and empty lines deleted.

■ [file_name]_text.vert : Vertical text generated from the [file_name]_text_markup.txt file, enriched with the intuitive form, lemma, intuitive lemma, and POS for each wordform (analysis performed by AI with minimal manual proofreading of the results). These files can be loaded into the Sketch Engine platform for text analysis and text mining.
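
For readers who prefer to work with the .vert files outside Sketch Engine, the following is a minimal Python sketch of how such a file might be read. It rests on assumptions not confirmed by the project: that each token line holds five tab-separated columns in the order wordform, intuitive form, lemma, intuitive lemma, POS, and that lines starting with "<" are structural markup. The names Token, parse_vert and lemma_frequencies are hypothetical, and the sample lines are placeholders rather than real DMHP data. Check the column order against an actual file from the Zenodo repository before relying on it.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    # Column order is an assumption; verify it against a real .vert file.
    form: str
    intuitive_form: str
    lemma: str
    intuitive_lemma: str
    pos: str


def parse_vert(lines: Iterable[str]) -> Iterator[Token]:
    """Yield one Token per token line, skipping markup and blank lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Structural lines (e.g. <doc>, <s>) start with "<" in vertical text.
        if not line.strip() or line.startswith("<"):
            continue
        cols = line.split("\t")
        if len(cols) != 5:
            # Skip lines that do not match the assumed five-column layout.
            continue
        yield Token(*cols)


def lemma_frequencies(lines: Iterable[str]) -> Counter:
    """Count how often each lemma occurs in the given vertical-text lines."""
    return Counter(tok.lemma for tok in parse_vert(lines))


# Placeholder sample standing in for the contents of a .vert file.
sample = [
    '<doc id="example">\n',
    "w1\tw1\tl1\tl1\tN\n",
    "w2\tw2\tl1\tl1\tN\n",
    "w3\tw3\tl2\tl2\tV\n",
    "</doc>\n",
]
freq = lemma_frequencies(sample)
```

To process a downloaded file, the same functions can be applied to an open file handle, for example with open(path, encoding="utf-8") as fh: lemma_frequencies(fh), where path points to a [file_name]_text.vert file.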