The CENTAL seminars aim to bring together teachers, students and researchers (from academia or industry) interested in natural language processing. The seminars are free and open to all, and generally take place on Fridays from 2 pm to 3 pm. If you would like to be informed by e-mail about the seminars we organise and about CENTAL news, you can subscribe to the CENTAL mailing list by entering your e-mail address in the form.
14 October 2022 — Carlos Hidalgo - Room C228 (Bibliothèque du Collège Erasme)
gApp: a text preprocessing system to improve the neural machine translation of discontinuous multiword expressions
Carlos Hidalgo, Researcher in the LEXYTRAD research group (Lexicography and Translation), University of Malaga
In this seminar, we present gApp, a text-preprocessing system designed to automatically detect discontinuous multiword expressions (MWEs) and convert them into their continuous forms, so as to improve the performance of current neural machine translation (NMT) systems (see Hidalgo-Ternero, 2021; Hidalgo-Ternero & Corpas Pastor, 2020, 2022a, 2022b, 2022c, among others). To test its effectiveness, several experiments have been carried out with different NMT systems (DeepL, Google Translate and ModernMT, among others) and in different language directionalities (ES/FR/IT>EN/DE/ES/FR/IT/PT/ZH), in order to verify to what extent gApp can enhance the performance of NMT systems under the challenge of phraseological discontinuity.
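The core idea of continuizing a discontinuous MWE before translation can be illustrated with a deliberately simplified sketch. This is not gApp itself: the Spanish expression *tomar el pelo* ('to pull someone's leg'), the regular-expression pattern and the reordering rule below are all illustrative assumptions.

```python
import re

# Toy illustration (NOT gApp): detect the discontinuous Spanish MWE
# "tomar el pelo" with up to three intervening words, and rewrite the
# sentence into its continuous form before sending it to an NMT system.
MWE_PATTERN = re.compile(
    r"\b(tom\w+)\s+((?:\w+\s+){1,3}?)el pelo\b", re.IGNORECASE
)

def make_continuous(sentence: str) -> str:
    """Move the intervening words after the MWE so it becomes contiguous."""
    def repl(m: re.Match) -> str:
        verb, gap = m.group(1), m.group(2).strip()
        return f"{verb} el pelo {gap}"
    return MWE_PATTERN.sub(repl, sentence)

print(make_continuous("Juan me toma siempre el pelo."))
# "Juan me toma el pelo siempre."
```

A real system must of course handle inflection, clitics, passives and genuine non-idiomatic uses; the point of the sketch is only the detect-and-reorder pipeline that the talk evaluates.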
Current language technologies are mostly built from a static point of view. This view struggles to adapt to the dynamic, ever-changing aspects of language as it is used in society by speakers and users with diverse social and linguistic backgrounds (e.g., multilinguals), needs and preferences. During the talk, comparisons will be made across communication contexts and languages, data sets and methods of analysis to illustrate the challenges and possible solutions, covering both the linguistic and the computational-linguistic domains.
The field of NLP has historically had a strong bias towards work that primarily uses English as a language of investigation. The situation is changing and multilingual NLP is booming.
This talk starts with a description of the state of multilingual NLP, highlighting both its successes and its limitations. In particular, large multilingual pretrained language models (PLMs) such as mBERT or XLM-R have shown surprising cross-lingual capabilities, but they cover only a small fraction of the world's languages, with large inequalities in performance. These inequalities stem from at least two sources: 1) NLP datasets are highly imbalanced with regard to typological diversity, and 2) NLP models tend to be developed for English first and then adapted to other languages, which leads to biases in the model assumptions. I describe attempts at overcoming both of these limitations. To overcome data imbalance, I describe a method from algorithmic fairness which samples data from different sources in a way that is more robust to underrepresented languages than alternative sampling methods. To overcome model assumption biases, I describe a PLM which uses pixel-based representations of language instead of the commonly used subword representations. I conclude with some directions for working towards typologically fairer NLP.
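The fairness-based sampler mentioned in the abstract is not spelled out here, but the standard baseline it is contrasted with can be sketched: temperature-based sampling over per-language corpus sizes, as used when pretraining models like XLM-R. The corpus sizes below are made-up numbers for illustration.

```python
# Hedged sketch: temperature-based sampling of multilingual training data.
# Raising the temperature tau flattens the distribution, upweighting
# low-resource languages at the expense of high-resource ones.
corpus_sizes = {"en": 1_000_000, "fr": 100_000, "sw": 1_000}  # invented sizes

def sampling_probs(sizes: dict[str, int], tau: float) -> dict[str, float]:
    """Exponentiate relative corpus sizes by 1/tau and renormalise.
    tau = 1 reproduces the raw proportions; larger tau moves the
    distribution towards uniform over languages."""
    total = sum(sizes.values())
    weighted = {lang: (n / total) ** (1 / tau) for lang, n in sizes.items()}
    z = sum(weighted.values())
    return {lang: w / z for lang, w in weighted.items()}

for tau in (1.0, 5.0):
    probs = sampling_probs(corpus_sizes, tau)
    print(tau, {lang: round(p, 4) for lang, p in probs.items()})
```

With tau = 1 the low-resource language is sampled almost never; with tau = 5 its share rises substantially, which is the imbalance problem the talk's fairness-based method aims to address more robustly.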