The CENTAL seminars bring together teachers, students and researchers (from academia or industry) interested in natural language processing. The seminars are free and open to all, and usually take place on Fridays from 2 pm to 3 pm. If you would like to be informed by email about the seminars we organise and about CENTAL news, you can subscribe to the CENTAL mailing list by entering your email address in the form.
Organisation 2022-2023
Calendar 2022-2023
14 October 2022 - Carlos Hidalgo - Room C228 (Collège Erasme Library)
gApp: a text preprocessing system to improve the neural machine translation of discontinuous multiword expressions
Carlos Hidalgo, Researcher in the LEXYTRAD (Lexicography and Translation) research group (University of Malaga)
Abstract:
In this seminar, we present gApp, a text-preprocessing system designed to automatically detect discontinuous multiword expressions (MWEs) and convert them into their continuous forms, so as to improve the performance of current neural machine translation (NMT) systems (see Hidalgo-Ternero, 2021; Hidalgo-Ternero & Corpas Pastor, 2020, 2022a, 2022b, 2022c, among others). To test its effectiveness, several experiments have been carried out with different NMT systems (DeepL, Google Translate and ModernMT, among others) and in different language directions (ES/FR/IT>EN/DE/ES/FR/IT/PT/ZH), in order to verify to what extent gApp can enhance the performance of NMT systems when faced with phraseological discontinuity.
25 November 2022 - A. Seza Dogruöz - More 51
Toward Dynamic and Inclusive Language Technologies
Abstract:
Current language technologies are mostly built from a static point of view. This view has difficulty adapting to the dynamic and ever-changing aspects of language as it is used in society by speakers and users with diverse social and linguistic backgrounds (e.g., multilinguals), needs and preferences. During the talk, comparisons will be made across communication contexts and languages, data sets and methods of analysis to illustrate the challenges and possible solutions, covering both the linguistics and computational linguistics domains.
02 December 2022 - Miryam de Lhoneux - Doyen 31
Typologically fair NLP
Abstract:
The field of NLP has historically had a strong bias towards work that primarily uses English as a language of investigation. The situation is changing and multilingual NLP is booming.
This talk starts with a description of the state of multilingual NLP, highlighting both its successes and its limitations. In particular, large multilingual pretrained language models (PLMs) such as mBERT or XLM-R have shown surprising cross-lingual capabilities, but they cover a small fraction of the world's languages, with large inequalities in performance. These inequalities stem from at least two sources: 1) NLP datasets are highly imbalanced with regard to typological diversity and 2) NLP models tend to be developed for English first and then adapted to other languages, which leads to biases in the model assumptions. I describe attempts at overcoming both of these limitations. To overcome data imbalance, I describe a method from algorithmic fairness which samples data from different sources in a way that is more robust to underrepresented languages than alternative sampling methods. To overcome model assumption biases, I describe a PLM which uses pixel-based representations of language instead of the commonly used subword representations. I conclude with some directions for working towards typologically fairer NLP.
08 December 2022 - Aurélie Calèbre and Eole Lapeyre - C228 (Collège Erasme)
Automated text simplification as a reading aid for low-vision individuals
Abstract:
In developed countries, the majority of people with visual impairment are legally blind, but not totally blind. Instead, they have what is referred to as low vision, commonly caused by Central visual Field Loss (CFL). This degenerative condition is caused by non-curable retinal diseases, such as Age-related Macular Degeneration (AMD; DMLA in French). Patients suffering from such pathologies will develop a blind region called a scotoma, located at the center of their visual field and spanning about 20° or more. To better visualize the impact of such a large hole in your visual field, try stretching your index and little finger as far as possible from each other at arm’s length; the span is about 15°. Central vision cannot be restored, and difficulty with reading becomes the primary complaint of patients seeking rehabilitation. To help CFL individuals improve their reading performance, it is necessary to investigate the underlying causes of their deficit and then to overcome them with specific adjustments.
In this presentation, I propose to address the issue of reading with CFL from a linguistic perspective, which takes into account the whole complexity of texts. I will present a series of experiments that investigate what makes a text especially complex when reading with CFL. I will conclude with the relevance of this work to the design of text simplification tools that are customised to the specific needs of readers with CFL and can be used as efficient reading aids for this population.
24 February 2023 - Gael Guibon - Doyen 22
From emotion identification to the detection of problematic conversations in customer service conversations
Gael Guibon, Associate Professor at the Institute of Digital Sciences, Management and Cognition (University of Lorraine)
Abstract:
Many industrial settings require high-quality customer service delivered through text chats. The main goal of these services is to help customers solve the problem they have encountered and, in doing so, to improve their satisfaction. However, the data extracted from these services are mostly confidential, which is a major obstacle to their use and sharing within the natural language processing research community. Moreover, these data are rarely annotated. In this seminar, we will summarise the state of the art in emotion recognition in conversation and in the detection of problematic conversations in customer service. We will then present our work on emotion detection in conversations using meta-learning or frugal learning, before concluding with the identification of the resolution status of the customer's problem and the detection of problematic conversations. All the work presented in this seminar was carried out during a postdoctoral fellowship shared between the Laboratoire Traitement et Communication de l'Information (LTCI) at Télécom Paris and the Direction Technologie Innovation Recherche Groupe (DTIPG) at SNCF.
References:
Auguste, J., Charlet, D., Damnati, G., Béchet, F., & Favre, B. (2019, May). Can we predict self-reported customer satisfaction from interactions? In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7385-7389). IEEE.
Bao, Y., Ma, Q., Wei, L., Zhou, W., & Hu, S. (2022). Speaker-Guided Encoder-Decoder Framework for Emotion Recognition in Conversation. arXiv preprint arXiv:2206.03173.
Chowdhury, S. A., Stepanov, E. A., & Riccardi, G. (2016). Predicting User Satisfaction from Turn-Taking in Spoken Conversations. In Interspeech (pp. 2910-2914).
Guibon, G., Labeau, M., Flamein, H., Lefeuvre, L., & Clavel, C. (2021, November). Few-Shot Emotion Recognition in Conversation with Sequential Prototypical Networks. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 6858-6870).
Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N. A. (2020, July). Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8342-8360).
08 March 2023 - Simon Hengchen - D.243 (Descamps, FIAL)
Quantitative approaches to historical texts: some (non-)issues and how to tackle them
Simon Hengchen, Founder of Iguanodon.ai
Abstract:
Quantitative methods for historical text analysis offer exciting opportunities for researchers interested in gaining new insights into long-studied texts. However, the methodological underpinnings of these methods remain under-explored. In the first part of the talk I will show and discuss, through the use of a case study, the (non-)effect the OCR process has on a range of quantitative text analyses. In the second part of the talk, I will present a novel and totally unsupervised OCR post-correction method on the same dataset, as well as its most recent evolution on a highly inflected language, Finnish.
10 March 2023 - Vincent Vandeghinste - Doyen 22 (postponed to 28 April 2023)
Challenges in Machine Translation for Sign Languages
Vincent Vandeghinste, KU Leuven, Belgium
Abstract:
This talk is about the SignON project, in which we aim to build MT engines from Sign Languages to Spoken Languages and vice versa. While this is MT between two natural languages, there are several major differences from regular MT between written languages. This talk will be about these differences and how we try to tackle them in the SignON project.
15 March 2023 - Melvin Wevers - D.243 (Descamps, FIAL)
NLP as an Intermediary for Historical Research
Melvin Wevers, University of Amsterdam
Abstract:
In this talk, I will focus on how I use methods from the NLP toolkit (text classification, parsing, topic modeling, embeddings) in my work as a historian. Rather than focusing my attention on improving NLP, I show how these methods function as an intermediary in my research workflow. NLP is great for extracting information from digitized historical sources, and as such it can inform search and exploration of digitized archives. However, if we want to model historical processes or phenomena, we need to think about how we can use these extracted features as input for methods outside of NLP.
Using examples from my own work, I will highlight the importance of NLP, but I will also argue that we need to broaden our toolkit if we truly want to engage with history in a computational manner.
16 March 2023 - Leonardo Campillos-Llanos - Doyen -31
Advances in processing and simplification of clinical trials texts
Leonardo Campillos-Llanos, tenure track scientist at the Spanish National Research Council (CSIC)
Abstract:
Clinical trial announcements report information about patients' eligibility criteria, the medical condition under investigation and the interventional procedures to be tested. This information is a valuable source of data for named entity recognition tasks, complementary to other resources such as patients' records. Our current project (CLARA-Med) focuses on automatic text simplification of trial contents to improve their understanding by patients. Preliminary work on approaches to this task in Spanish will be presented: first, the creation of a comparable and parallel corpus for automatic medical text simplification; second, the creation of a lexicon of technical and simplified medical terms; lastly, initial experiments applying deep-learning-based models to simplify technical sentences. The work in progress will be presented and the perspectives of our work will be discussed.
31 March 2023 - Guillaume Bernard - Doyen -32
Event detection and tracking in historical press documents
Guillaume Bernard, Université de La Rochelle, France
Abstract:
Current campaigns to digitise historical documents from collections all over the world are opening new avenues for historians and social scientists. Our understanding of past events is being renewed through the analysis of these large volumes of historical data: unravelling the thread of events and tracing false information are among the possibilities offered by the digital sciences. This work focuses on historical press articles and proposes, through two diametrically opposed strategies, two analysis processes that address the problem of tracking events in the press. A simple use case is that of a digital humanities research team interested in a particular past event. Its members seek to find all the press documents related to it. Manual analysis of the articles is not feasible within a limited time. By publishing algorithms, datasets and analyses together, this thesis is a first milestone towards the publication of more sophisticated tools. We enable anyone to search historical press collections for events and, perhaps, to renew some of our historical knowledge.
21 April 2023 - Dominique Brunato - Doyen -32
Measuring linguistic complexity from a computational linguistics perspective
Dominique Brunato, ItaliaNLP Lab, Institute of Computational Linguistics (CNR-ILC), Pisa, Italy
Abstract:
Linguistic complexity is a highly debated and multifaceted notion, for which several definitions have been proposed according to theories and empirical evidence acquired from different frameworks (such as language acquisition, language typology and computational stylometry), as well as according to the specific research purpose. Adopting a computational linguistics perspective, in my talk I will present an approach to modelling linguistic complexity based on linguistic profiling as a methodological framework (Biber, 1993; van Halteren, 2004, among others), and I will illustrate Profiling-UD (Brunato et al., 2020), a recently introduced tool for carrying out the linguistic profiling of a text in multiple languages that share the same annotation formalism, based on the Universal Dependencies representation (Nivre et al., 2015). A few case studies will be discussed in order to show how this approach has been successfully applied to track language learning development and to model the human perception of complexity.
28 April 2023 - Vincent Vandeghinste - Doyen 22
Challenges in Machine Translation for Sign Languages
Vincent Vandeghinste, KU Leuven, Belgium
Abstract:
This talk is about the SignON project, in which we aim to build MT engines from Sign Languages to Spoken Languages and vice versa. While this is MT between two natural languages, there are several major differences from regular MT between written languages. This talk will be about these differences and how we try to tackle them in the SignON project.