
Calfa-GREgORI Patrologia Graeca


Presentation and aim

The project, led by the GREgORI project (UCLouvain) and Calfa (Paris) under the academic supervision of Professor Jean-Marie Auwers (UCLouvain), aims to provide scholars with a digital version of texts from the Patrologia Graeca (PG) that have not yet been digitised or are not yet available online in open access.

Text transcription (OCR; word accuracy 94.60%) and linguistic analysis (lemmatization and POS tagging; lemma-POS accuracy 94.74%) are performed with specialized AI models developed within the scope of the project, with minimal manual proofreading of the results.

This OCR software, specially developed for this purpose, preserves the complex layout of the pages of the PG volumes. The text it produces is mostly, but not fully, reliable because of the well-known, occasionally unclear typography of J.-P. Migne’s publications. Although some words remain imperfectly recognized, the results provide a searchable version of the texts. Users should check, and where necessary complete, the text they need, and are invited to send in their corrections.

In addition, linguistic analysis, based on linguistic resources, computer tools, and AI models jointly developed by GREgORI and Calfa, assigns a lemma and a part of speech to each word attested in the processed texts.

An evaluation of the results, giving scholars an accurate assessment of the effectiveness of the AI models, will be presented in a forthcoming paper.

Scholars interested in acquiring Greek texts from the PG (with or without linguistic analysis) are invited to email us (info-gregori@uclouvain.be or contact@calfa.fr) for terms and conditions.

For details on input and output files (results), see below.

Funding

This project has received funding from (in alphabetical order):

■ ASBL Byzantion

Logo Byzantion | Logo Calfa | Logo CIOL

■ UCLouvain - FSS - Fondation Sedes Sapientiae

■ UCLouvain GREgORI Project

■ UCLouvain - INCAL - Institut des Civilisations Arts et Lettres

■ UCLouvain - RSCS - Institut de recherche pluridisciplinaire Religions Spiritualités Cultures Sociétés

■ Other private funding.

Members

Professor Emeritus Jean-Marie Auwers (UCLouvain, RSCS)

Professor Sébastien Moureau (UCLouvain/CIOL)

Doctor Véronique Somers (UCLouvain/CIOL) 

Doctor Bastien Kindt (UCLouvain/CIOL)

Chahan Vidal-Gorène (Université Paris Sciences & Lettres, École nationale des chartes, and Calfa)

Related bibliography

Kindt B., Auwers J.-M., La Fondation Sedes Sapientiae soutient le projet de valorisation numérique de la Patrologie grecque, in Bulletin de la Fondation Sedes Sapientiae, 45 (January 2024), p. 19-21 (WEB version).

Kindt B., Vidal-Gorène C., From manuscript to tagged corpora. An automated process for Ancient Armenian or other under resourced languages of the Christian East, in Armeniaca. International Journal of Armenian Studies, 1 (2022), p. 73-96 (WEB version).

Kindt B., Vidal-Gorène C., Delle Donne S., Analyse automatique du grec ancien par réseau de neurones. Évaluation sur le corpus De Thessalonica Capta, in BABELAO, 10-11 (2022), p. 525-550 (WEB version).

Vidal-Gorène C., La reconnaissance automatique d'écriture à l'épreuve des langues peu dotées, in Programming Historian en français, 5 (2023) (WEB version).

Vidal-Gorène C., Reconhecimento automático de manuscritos para o teste de idiomas não latinos, in O Programming Historian em português, 5 (2024) (WEB version) (translated from the original in French published in 2023 - WEB version).

Input files

Input files processed by the OCR are PDF files available from the Patristica.net portal or from the Roger Pearse weblog. These files were mainly digitized by Google and are therefore also available from the Google Books portal.

Output files and results 

■ Output files and results are available in the project’s GitHub repository, maintained by Calfa.

■ OCR ground truth is available on the Zenodo platform of the project.

■ A sample of the CGPG corpus can be used on the Sketch Engine platform (see below).

File formats description

All files are encoded as UTF-8 plain text, which ensures data interoperability. For each processed text, Calfa’s GitHub repository offers the following files:

■ [file_name]_text_raw.txt: UTF-8 plain text, raw OCR result.

■ [file_name]_text_markup.txt: derived from [file_name]_text_raw.txt, with text structure markup (volume number, page number of the processed PDF file), hyphenation removed, and empty lines deleted.

■ [file_name]_text_markup_ske.vert: vertical text generated from [file_name]_text_markup.txt, enriched with the intuitive form and, in a next step, with the lemma, intuitive lemma, and POS for each word form (analysis performed by AI with minimal manual proofreading of the results). These files can be used on the Sketch Engine platform for text analysis and text mining.
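
As a rough illustration of how these files relate, the Python sketch below shows the two transformations described above: removing empty lines and line-break hyphenation from raw OCR text, and reading a vertical file token by token. It is not the project's own tooling. The function names (clean_raw_text, read_vertical) and the Token type are hypothetical, and two assumptions are made: a hyphen at the end of a line always marks a word split across lines, and the .vert files follow the usual Sketch Engine vertical layout (one token per line, tab-separated attributes, structure tags on lines starting with "<"). The actual column order and markup syntax should be checked against the files in the repository.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Tuple


class Token(NamedTuple):
    word: str                    # first column: the word form
    attributes: Tuple[str, ...]  # remaining tab-separated columns, in file order


def clean_raw_text(raw: str) -> str:
    """Drop empty lines and rejoin words split across line breaks.

    Assumption: a line ending in "-" continues on the next line.
    A genuine trailing hyphen would be joined as well.
    """
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    return re.sub(r"-\n", "", "\n".join(lines))


def read_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield one Token per token line of a vertical file.

    Assumption: lines starting with "<" are structure tags and are skipped,
    so a token that itself begins with "<" would be missed by this sketch.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("<"):
            continue
        word, *rest = line.split("\t")
        yield Token(word, tuple(rest))


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    print(clean_raw_text("first li-\nne\n\nsecond line\n"))
    sample = ["<doc>", "<s>", "word\tform", "</s>", "</doc>"]
    for token in read_vertical(sample):
        print(token.word, token.attributes)
```

To read a real file, open it with encoding="utf-8" and pass the file object to read_vertical. The attributes tuple is kept generic because the number of columns differs between files that already carry lemma and POS and those that do not.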

List of currently processed texts

Total number of words currently processed in the CGPG corpus: 3.149.208

Total number of words currently in the sample of the CGPG corpus usable on the Sketch Engine platform: 1.556.561

PG 071

author : Cyril of Alexandria

author's date : 4th-5th AD

edition’s PDF file

work : Commentarii in Prophetas Minores

word count : 208.423

available files : *_raw.txt | *_markup.txt | *_markup_ske.vert

linguistic analysis : forthcoming

 

PG 073

author : Cyril of Alexandria

author's date : 4th-5th AD

edition’s PDF file

work : In Joannis Evangelium

word count : 230.336

available files : *_raw.txt | *_markup.txt | *_markup_ske.vert

linguistic analysis : forthcoming

 

PG 087.1

author : Procopius the Christian Sophist

author's date : 5th-6th AD?

edition’s PDF file

work : Commentarii in OT

word count : 211.763

available files : *_raw.txt | *_markup.txt

linguistic analysis : forthcoming

 

PG 101

author : Photios I of Constantinople

author's date : 9th AD

edition’s PDF file

work : Amphilochiana, Commentarii in NT

word count : 229.437

available files : *_raw.txt | *_markup.txt

linguistic analysis : forthcoming

 

PG 109

author : Scriptores Post Theophanem

author's date : varia

edition’s PDF file

work : varia

word count : 211.898

available files : *_raw.txt | *_markup.txt | *_markup_ske.vert

linguistic analysis : forthcoming

 

PG 112

author : Constantine Porphyrogenitus

author's date : 10th AD

edition’s PDF file

work : De Ceremoniis

word count : 153.718

available files : *_raw.txt | *_markup.txt

linguistic analysis : forthcoming

 

PG 123

author : Theophylact of Ohrid

author's date : 11th-12th AD

edition’s PDF file

work : Commentarii in NT

word count : 247.369

available files : *_raw.txt | *_markup.txt

linguistic analysis : forthcoming

 

PG 124

author : Theophylact of Ohrid

author's date : 11th-12th AD

edition’s PDF file

work : Commentarii in NT

word count : 263.430

available files : *_raw.txt | *_markup.txt

linguistic analysis : forthcoming

 

PG 125

author : Theophylact of Ohrid

author's date : 11th-12th AD

edition’s PDF file

work : Commentarii in NT

word count : 249.703

available files : *_raw.txt | *_markup.txt

linguistic analysis : forthcoming

 

PG 126

author : Theophylact of Ohrid

author's date : 11th-12th AD

edition’s PDF file

work : Commentarii in NT et alia opera

word count : 229.628

available files : *_raw.txt | *_markup.txt

linguistic analysis : forthcoming

 

PG 134

author : Joannes Zonaras

author's date : 11th-12th AD

edition’s PDF file

work : Annales

word count : 271.191

available files : *_raw.txt | *_markup.txt | *_markup_ske.vert

linguistic analysis : forthcoming

 

PG 146

author : Nikephoros Kallistos Xanthopoulos

author's date : 13th-14th AD

edition’s PDF file

work : Ecclesiastica Historia

word count : 242.816

available files : *_raw.txt | *_markup.txt | *_markup_ske.vert

linguistic analysis : forthcoming

 

PG 155

author : Simeon of Thessalonica

author's date : 14th-15th AD

edition’s PDF file

work : Dialogus in Christo (et alia opera)

word count : 204.532

available files : *_raw.txt | *_markup.txt | *_markup_ske.vert

linguistic analysis : forthcoming

 

PG 158

author : Michael Glykas (et al.)

author's date : 12th AD

edition’s PDF file

work : Annales (et alia)

word count : 195.632

available files : *_raw.txt | *_markup.txt | *_markup_ske.vert

linguistic analysis : forthcoming