Calfa-GREgORI Patrologia Graeca
Presentation and aim
This project, led jointly by GREgORI (UCLouvain) and Calfa (Paris) under the academic supervision of Professor Jean-Marie Auwers (UCLouvain), aims to provide scholars with a digital version of texts from the Patrologia Graeca (PG) that have not yet been digitised or are not yet available online in open access.
Text transcription (OCR; word accuracy 94.60%) and linguistic analysis (lemmatization and POS tagging; lemma-POS accuracy 94.74%) are performed with specialized AI models developed within the project, with minimal manual proofreading of the results.
The OCR software, developed specifically for this purpose, preserves the complex layout of the PG pages and produces a mostly reliable text, despite the well-known, occasionally unclear typography of J.-P. Migne’s publications. Although some words remain imperfectly recognized, the result is a searchable version of the texts. Users should check, and where necessary complete, the passages they need, and are invited to send in their corrections.
In addition, the linguistic analysis, which relies on linguistic resources, software tools, and AI models jointly developed by GREgORI and Calfa, assigns a lemma and a part of speech to each word attested in the processed texts.
An evaluation of the results, giving scholars an accurate assessment of the effectiveness of the AI models, will be presented in a forthcoming paper.
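The accuracy figures quoted on this page are not defined here. As a hedged illustration only, one common convention reports word accuracy as 1 minus the word error rate, where errors are counted as the word-level edit distance between the OCR output and a ground-truth transcription. The sketch below follows that convention; the function names are assumptions made for illustration, and this is not the project's evaluation protocol.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance computed over word tokens instead of characters."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] = distance between the reference prefix seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_accuracy(reference, hypothesis):
    """Assumed definition: 1 - (word edit distance / reference word count)."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return 1.0 - word_edit_distance(reference, hypothesis) / n_ref
```

Under this definition, a five-word reference in which the OCR misreads a single word yields a word accuracy of 0.8. Other conventions exist (for example, ignoring punctuation or case), so figures computed this way are only comparable when the same rules are applied.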
Scholars interested in acquiring Greek texts from the PG (with or without linguistic analysis) are invited to email us (info-gregori@uclouvain.be or contact@calfa.fr) for terms and conditions.
For details on input and output files (results), see below.
Funding
This project has received funding from (in alphabetical order):
■ ASBL Byzantion
■ Calfa (Paris)
■ UCLouvain - FSS - Fondation Sedes Sapientiae
■ UCLouvain GREgORI Project
■ UCLouvain - INCAL - Institut des Civilisations Arts et Lettres
■ UCLouvain - RSCS - Institut de recherche pluridisciplinaire Religions Spiritualités Cultures Sociétés
And other private funding.
Members
Professor Emeritus Jean-Marie Auwers (UCLouvain, RSCS)
Professor Sébastien Moureau (UCLouvain/CIOL)
Doctor Véronique Somers (UCLouvain/CIOL)
Doctor Bastien Kindt (UCLouvain/CIOL)
Chahan Vidal-Gorène (Université Paris Sciences & Lettres, École nationale des Chartes, and Calfa)
Related bibliography
Kindt B., Auwers J.-M., La Fondation Sedes Sapientiae soutient le projet de valorisation numérique de la Patrologie grecque, in Bulletin de la Fondation Sedes Sapientiae, 45 (January 2024), p. 19-21 (WEB version).
Kindt B., Vidal-Gorène C., From manuscript to tagged corpora. An automated process for Ancient Armenian or other under resourced languages of the Christian East, in Armeniaca. International Journal of Armenian Studies, 1 (2022), p. 73-96 (WEB version).
Kindt B., Vidal-Gorène C., Delle Donne S., Analyse automatique du grec ancien par réseau de neurones. Évaluation sur le corpus De Thessalonica Capta, in BABELAO, 10-11 (2022), p. 525-550 (WEB version).
Vidal-Gorène C., La reconnaissance automatique d'écriture à l'épreuve des langues peu dotées, in Programming Historian en français, 5 (2023) (WEB version).
Vidal-Gorène C., Reconhecimento automático de manuscritos para o teste de idiomas não latinos, in O Programming Historian em português, 5 (2024) (WEB version) (translated from the French original published in 2023 - WEB version).
Input files
The input files processed by the OCR are PDF files available from the Patristica.net portal or from Roger Pearse's weblog. Most of these files were digitized by Google and are therefore also available from the Google Books portal.
Output files and results
■ Output files and results are available in Calfa’s GitHub repository for the project.
■ OCR ground truth is available on the Zenodo platform of the project.
■ A sample of the CGPG corpus can be used on the Sketch Engine platform (see below).
File formats description
All files are UTF-8 plain text, a format that ensures data interoperability. For each processed text, Calfa’s GitHub repository offers the following files:
■ [file_name]_text_raw.txt: UTF-8 plain text, raw OCR result.
■ [file_name]_text_markup.txt: derived from [file_name]_text_raw.txt, with text-structure markup added (volume number, page number of the processed PDF file), hyphenation removed, and empty lines deleted.
■ [file_name]_text_markup_ske.vert: vertical text generated from [file_name]_text_markup.txt, enriched with an intuitive form and, in a next step, with lemma, intuitive lemma, and POS for each word form (analysis performed by AI with minimal manual proofreading of the results). These files can be used on the Sketch Engine platform for text analysis and text mining.
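The chain from the raw file to the vertical file can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the rule used to rejoin hyphenated words, the tokenizer, and the way structure lines are detected are not taken from the project's code, and the actual markup and column layout of the CGPG files may differ. The only property relied on is that a vertical file puts one token per line, with structure tags on lines of their own.

```python
import re
import unicodedata

# Hypothetical tokenizer: runs of word characters, or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def clean_raw_lines(raw_text):
    """Drop empty lines and rejoin words hyphenated across a line break."""
    # NFC keeps most accented Greek letters as single code points,
    # so that \w treats each of them as part of a word.
    text = unicodedata.normalize("NFC", raw_text)
    merged = []
    for line in (raw_line.strip() for raw_line in text.splitlines()):
        if not line:
            continue  # empty lines are deleted
        is_tag = line.startswith("<")
        if merged and merged[-1].endswith("-") and not is_tag:
            # Glue the continuation onto the previous line, dropping the hyphen.
            merged[-1] = merged[-1][:-1] + line
        else:
            merged.append(line)
    return merged


def to_vertical(lines):
    """Emit one token per line; structure tags are copied through unchanged."""
    out = []
    for line in lines:
        if line.startswith("<") and line.endswith(">"):
            out.append(line)
        else:
            out.extend(TOKEN.findall(line))
    return "\n".join(out)
```

For example, given the made-up input `'<page n="1">\nlo-\ngos en\n\narche.\n'` (the `<page>` tag is hypothetical), `clean_raw_lines` returns the tag line followed by `logos en` and `arche.`, and `to_vertical` then places `logos`, `en`, `arche`, and `.` on separate lines. Further annotation columns such as lemma and POS would be appended to each token line, typically separated by tabs.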
List of currently processed texts
Total number of words currently processed in the CGPG corpus: 3.149.208
Total number of words currently in the sample of the CGPG corpus usable on the Sketch Engine platform: 1.556.561
PG 071
author : Cyril of Alexandria
author's date : 4th-5th AD
edition’s PDF file
work : Commentarii in Prophetas Minores
word count : 208.423
available files : *_raw.txt | *_markup.txt | *_markup_ske.vert
linguistic analysis : forthcoming
PG 073
author : Cyril of Alexandria
author's date : 4th-5th AD
edition’s PDF file
work : In Joannis Evangelium
word count : 230.336
available files : *_raw.txt | *_markup.txt | *_markup_ske.vert
linguistic analysis : forthcoming
PG 087.1
author : Procopius the Christian Sophist
author's date : 5th-6th AD?
edition’s PDF file
work : Commentarii in OT
word count : 211.763
available files : *_raw.txt | *_markup.txt
linguistic analysis : forthcoming
PG 101
author : Photios I of Constantinople
author's date : 9th AD
edition’s PDF file
work : Amphilochiana, Commentarii in NT
word count : 229.437
available files : *_raw.txt | *_markup.txt
linguistic analysis : forthcoming
PG 109
author : Scriptores Post Theophanem
author's date : varia
edition’s PDF file
work : varia
word count : 211.898
available files : *_raw.txt | *_markup.txt | *_markup_ske.vert
linguistic analysis : forthcoming
PG 112
author : Constantine Porphyrogenitus
author's date : 10th AD
edition’s PDF file
work : De Ceremoniis
word count : 153.718
available files : *_raw.txt | *_markup.txt
linguistic analysis : forthcoming
PG 123
author : Theophylact of Ohrid
author's date : 11th-12th AD
edition’s PDF file
work : Commentarii in NT
word count : 247.369
available files : *_raw.txt | *_markup.txt
linguistic analysis : forthcoming
PG 124
author : Theophylact of Ohrid
author's date : 11th-12th AD
edition’s PDF file
work : Commentarii in NT
word count : 263.430
available files : *_raw.txt | *_markup.txt
linguistic analysis : forthcoming
PG 125
author : Theophylact of Ohrid
author's date : 11th-12th AD
edition’s PDF file
work : Commentarii in NT
word count : 249.703
available files : *_raw.txt | *_markup.txt
linguistic analysis : forthcoming
PG 126
author : Theophylact of Ohrid
author's date : 11th-12th AD
edition’s PDF file
work : Commentarii in NT et alia opera
word count : 229.628
available files : *_raw.txt | *_markup.txt
linguistic analysis : forthcoming
PG 134
author : Joannes Zonaras
author's date : 11th-12th AD
edition’s PDF file
work : Annales
word count : 271.191
available files : *_raw.txt | *_markup.txt | *_markup_ske.vert
linguistic analysis : forthcoming
PG 146
author : Nikephoros Kallistos Xanthopoulos
author's date : 13th-14th AD
edition’s PDF file
work : Ecclesiastica Historia
word count : 242.816
available files : *_raw.txt | *_markup.txt | *_markup_ske.vert
linguistic analysis : forthcoming
PG 155
author : Simeon of Thessalonica
author's date : 14th-15th AD
edition’s PDF file
work : Dialogus in Christo (et alia opera)
word count : 204.532
available files : *_raw.txt | *_markup.txt | *_markup_ske.vert
linguistic analysis : forthcoming
PG 158
author : Michael Glykas (et al.)
author's date : 12th AD
edition’s PDF file
work : Annales (et alia)
word count : 195.632
available files : *_raw.txt | *_markup.txt | *_markup_ske.vert
linguistic analysis : forthcoming