Public Thesis defense - ICTEAM

SST

29 January 2020

16:00

Louvain-la-Neuve

Place du Levant, 3 - Salle Shannon - Maxwell building

Document Image Analysis and Text Recognition on Khmer Historical Manuscripts by Dona VALY

For the degree of Doctor of Engineering Sciences and Technology

Palm leaves have been used as one of the major materials for writing and painting in many Southeast Asian countries. In Cambodia today, palm leaf documents, called “Sleuk Rith” in Khmer, still survive thanks to their cultural value and the precious contents written on them. However, as a consequence of natural aging and damage caused by various natural factors, palm leaf manuscripts are facing destruction and are in need of preservation. Many programs and projects are underway to recover and preserve palm leaf documents not only in their physical form but also in digital form through scanning and photography. Centralizing the digitized images gives the public easy access to them. Nonetheless, searching and filtering the content of those documents by keyword is still not feasible. An automatic recognition system therefore needs to be developed.

This dissertation contributes to document image analysis (DIA) research focused on Khmer palm leaf manuscripts. We aim to add value by designing tools to analyze, index, and quickly and efficiently access the text content of palm leaf documents. To achieve this objective, different DIA tasks are studied, and novel approaches to solving them are proposed. First, a new corpus of digitized Khmer palm leaf manuscripts has been collected. From this corpus, the first Khmer palm leaf manuscript dataset, called “SleukRith Set” and consisting of different types of annotated data, has been constructed. Experimental evaluations and comparisons of approaches to various DIA tasks, such as binarization, text line segmentation, and isolated character recognition, have been conducted on Khmer palm leaf manuscript datasets as well as on datasets of palm leaf manuscripts from Indonesia. Moreover, we propose an efficient line segmentation scheme for grayscale images of ancient Khmer documents that adapts to the curvature of the actual text lines and produces separating seams using a path-finding technique. We also introduce a novel concept: using the annotated information of glyph components in a word image to build a glyph-class map, followed by a complete text recognition scheme based on an encoder-decoder mechanism. A new type of annotated data called “sub-syllable”, which can be used as an efficient data augmentation technique for the text recognition task, has been added to the SleukRith Set.

Jury members:

  • Prof. Michel Verleysen (UCLouvain), supervisor
  • Prof. Jean-Pierre Raskin (UCLouvain), chairperson
  • Prof. John Lee (UCLouvain), secretary
  • Prof. Bernard Gosselin (UMons, Belgium)
  • Dr. Sophea Chhun (ITC, Cambodia)
