CENTAL Seminars (2017-2018 archive)

Organization 2017-18

Serge Bibauw
Anaïs Tack

Schedule 2017-18

At a glance

20 October 2017: Natalia Grabar (Université de Lille), Acquiring resources for the simplification of medical texts
17 November 2017: Naomi Baron (American University), Learning, Knowing, and Remembering in a Digital World
24 November 2017: Dirk De Hertog (imec - KU Leuven), Embeddings and their use as features in supervised learning tasks
1 December 2017: Pierre Deville (Bisnode), Network Science in the era of Text Mining and Big Data
8 December 2017: Aline Villavicencio (University of Essex • Federal University of Rio Grande do Sul), Identifying Idiomatic Language with Distributional Semantic Models

9 March 2018: Leonie Grön (KU Leuven), Term variation in clinical records: First insights from a corpus study
30 March 2018: Eric Kergosien (Université de Lille), Spatial analysis of digital media using text mining approaches
20 April 2018: Henning Wachsmuth (University of Paderborn), Computational Assessment of Argumentation Quality
4 May 2018: Pierre Lison (Norwegian Computing Center), Dialogue modelling: spoken dialogue systems and multilingual corpora
18 May 2018: PLIN Linguistic Day 2018, Technological innovation in language learning and teaching

Full program

FIRST SEMESTER

Friday 20 October 2017, 14:00-15:00, Collège Erasme c.142

Natalia Grabar (STL, Université de Lille, CNRS, FR)

Acquiring resources for the simplification of medical texts

A hallmark of medical texts is their use of highly specialized technical terms, which speakers often cannot understand. Simplifying such texts therefore requires suitable resources. We introduce two methods for acquiring them. One relies on term-internal cues (morphological analysis of compound terms), while the other exploits term-external cues (reformulations found in the texts). Neither method requires parallel corpora. We describe and discuss the results.

keywords: text simplification · resource acquisition · rule definition · categorization · medical domain

references:

Antoine, E., & Grabar, N. (2016). Exploitation de reformulations pour l’acquisition d’un vocabulaire expert/non expert. In Actes de la conférence conjointe JEP-TALN-RECITAL 2016, volume 2 : TALN (pp. 153–166).
Grabar, N., & Hamon, T. (2016). A large rated lexicon with French medical words. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016) (pp. 2643–2648).
Grabar, N., & Hamon, T. (2016). Exploitation de la morphologie pour l’extraction automatique de paraphrases grand public des termes médicaux. TAL, 57(1), 85–109.

slides: here

Friday 17 November 2017, 14:00-15:00, Auditoire MONTESQUIEU 3
Seminar co-organized with the Institut Langage et Communication (IL&C)

Naomi Baron (CTRL, American University, Washington, D.C., US)

Learning, Knowing, and Remembering in a Digital World

keywords: cognition · digital impact on memory · educational curricula · GPS

Digital tools such as the internet, search engines, and online navigation have put a wealth of information at our fingertips. Are these same tools impacting the way we use human cognitive skills to learn, know, and remember? Research suggests that availability of “google knowing” is redefining our assumptions about what kinds of data – and knowledge – are appropriately held in our own heads. These redefinitions are, in turn, reshaping academic curricula, for good or for ill.

references: here

live stream and video: www.facebook.com/didaxoUlearn/videos/1174447416020956

slides: here

Friday 24 November 2017, 14:00-15:00, Collège Erasme c.142

Dirk De Hertog (ITEC, imec - KU Leuven, BE)

Embeddings and their use as features in supervised learning tasks

keywords: embeddings · supervised learning

This talk provides an introduction to the use and value of distributional word representations within machine learning approaches to NLP. Machine learning aims to learn how to perform specific tasks (e.g., POS-tagging, Named Entity Recognition…) by deriving statistical associations between annotated examples and so-called features, i.e., meaningful pieces of information that are relevant for the problem at hand. If learning succeeds, the resulting model can be applied to similar, yet new examples. A recent development within NLP is to replace traditional ‘flat’ features with distributional ‘semantic’ representations, such as Semantic Vector Spaces (SVS) and word2vec. The latter methods rely on contextual information derived from large-scale corpora to build vector representations of words, effectively transforming a word into a complex data structure.
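As an illustration only, the contrast between ‘flat’ features and distributional representations can be sketched with toy data. The three-dimensional vectors below are invented for the example (real embeddings are learned from large corpora and have many more dimensions), and the helper names `one_hot`, `embed`, and `cosine` are hypothetical, not taken from the talk.

```python
import math

# Toy 3-dimensional "embeddings"; the values are invented for illustration.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.9, 0.4],
}
VOCAB = sorted(EMBEDDINGS)  # ["car", "cat", "dog"]


def one_hot(word):
    """'Flat' feature: one dimension per vocabulary word, 1.0 for the word itself."""
    return [1.0 if entry == word else 0.0 for entry in VOCAB]


def embed(word):
    """Dense feature: look up the word's (toy) distributional vector."""
    return EMBEDDINGS[word]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # With one-hot features, every pair of distinct words looks equally unrelated.
    print(cosine(one_hot("cat"), one_hot("dog")))
    # With dense vectors, similarity between words becomes visible to a learner.
    print(round(cosine(embed("cat"), embed("dog")), 3))
    print(round(cosine(embed("cat"), embed("car")), 3))
```

With one-hot features a classifier cannot generalize from "cat" to "dog"; with dense vectors the two words share most of their feature values, which is the intuition behind using embeddings as input features.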

reference:

Turney, P. D., & Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37, 141–188. https://doi.org/10.1613/jair.2934

slides: here

Friday 1 December 2017, 14:00-15:00, Collège Erasme c.142

Pierre Deville (Head of Data Science, Bisnode Group Analytics, BE)

Network Science in the era of Text Mining and Big Data

keywords: networks · big data · visualization

Networks are everywhere. From micro to macro, they pervade our world. Understanding their structure and behavior has been a major concern in recent years. However, while tremendous progress has been made on natural networks, we still struggle when it comes to human or business behavior. But this is about to change, as we are now entering the data-driven age. A former PhD student at UCL, young entrepreneur, and professor at Solvay, Pierre will discuss the opportunities offered by the big data revolution and text mining tools, and how they relate to his research on network analysis.

references:

Deville, P., Wang, D., Sinatra, R., Song, C., Blondel, V. D., & Barabási, A.-L. (2014). Career on the Move: Geography, Stratification, and Scientific Impact. Scientific Reports, 4, srep04770. https://doi.org/10.1038/srep04770
Sinatra, R., Deville, P., Szell, M., Wang, D., & Barabási, A.-L. (2015). A century of physics. Nature Physics, 11(10), 791–796. https://doi.org/10.1038/nphys3494
Sinatra, R., Wang, D., Deville, P., Song, C., & Barabási, A.-L. (2016). Quantifying the evolution of individual scientific impact. Science, 354(6312). https://doi.org/10.1126/science.aaf5239

Friday 8 December 2017, 14:00-15:00, Collège Erasme c.142

Aline Villavicencio (University of Essex, UK • INF, Federal University of Rio Grande do Sul, BR)

Identifying Idiomatic Language with Distributional Semantic Models

Precise natural language understanding requires adequate treatment of both single words and larger units. However, expressions like compound nouns may display idiomaticity: while a police car is a car used by the police, a loan shark is not a fish that can be borrowed. It is therefore important to identify which expressions are idiomatic and which are not, as the latter can be interpreted from a combination of the meanings of their component words while the former cannot. In this talk I discuss the ability of distributional semantic models (DSMs) to capture idiomaticity in compounds, by means of a large-scale multilingual evaluation of DSMs in French and English. A total of 816 DSMs were constructed and tested in 2,856 evaluations. The results show a high correlation with human judgments about compound idiomaticity (Spearman’s ρ=.82 in one dataset), indicating that these models are able to detect idiomaticity successfully.
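A minimal sketch of how such an evaluation can be set up, under stated assumptions: a compound is scored as compositional when its own vector is close to the sum of its component vectors (additive composition is one common choice, not necessarily the one used in the talk), and the scores are compared with human ratings using Spearman's ρ. All vectors and ratings below are invented, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compositionality(compound_vec, modifier_vec, head_vec):
    """Similarity between the compound's own vector and the sum of its parts."""
    combined = [m + h for m, h in zip(modifier_vec, head_vec)]
    return cosine(compound_vec, combined)


def ranks(values):
    """Rank positions (1 = smallest value); assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman's rho for tie-free data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Invented 2-dimensional vectors: (compound, modifier, head).
VECTORS = {
    "police car": ([0.7, 0.6], [0.9, 0.1], [0.2, 0.9]),
    "loan shark": ([0.9, 0.1], [0.1, 0.8], [0.3, 0.7]),
}

if __name__ == "__main__":
    for compound, (whole, modifier, head) in VECTORS.items():
        print(compound, round(compositionality(whole, modifier, head), 3))
    # Invented model scores and human ratings for three compounds.
    print(spearman([0.9, 0.4, 0.1], [4.5, 2.0, 3.0]))
```

In this toy setup the idiomatic compound receives the lower score because its vector points away from the sum of its parts; a real evaluation would use corpus-derived vectors and a rank correlation routine that handles ties.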

keywords: idiomaticity · distributional semantic models · compound nouns · multiword expressions

references:

Wilkens, R., Zilio, L., Cordeiro, S. R., Ramisch, C., Idiart, M., & Villavicencio, A. (2017). LexSubNC: A Dataset of Lexical Substitution for Nominal Compounds. In Proceedings of the 12th International Conference on Computational Semantics (IWCS). Montpellier.
Cordeiro, S., Ramisch, C., Idiart, M., & Villavicencio, A. (2016). Predicting the Compositionality of Nominal Compounds: Giving Word Embeddings a Hard Time. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers, pp. 1986–1997). Berlin, Germany: Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1187
Ramisch, C., Cordeiro, S., Zilio, L., Idiart, M., & Villavicencio, A. (2016). How Naked is the Naked Truth? A Multilingual Lexicon of Nominal Compound Compositionality. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016) (Vol. 2: Short Papers, pp. 156–161). Berlin, Germany: ACL. http://aclweb.org/anthology/P/P16/P16-2026.pdf
Cordeiro, S., Ramisch, C., & Villavicencio, A. (2016). mwetoolkit+sem: Integrating Word Embeddings in the mwetoolkit for Semantic MWE Processing. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016) (pp. 1221–1225). Portorož, Slovenia: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2016/summaries/271.html

SECOND SEMESTER

Friday 9 March 2018, 14:00-15:00, Collège Erasme c.142

Leonie Grön (QLVL, KU Leuven, BE)

Term variation in clinical records: First insights from a corpus study

In clinical documentation, we encounter a range of term variants that goes far beyond the standard forms found in medical ontologies. For instance, in Dutch, the concept of high blood pressure can be expressed by a vernacular expression (bloedhoogdruk ‘high blood pressure’) or a specialized term (hypertensie ‘hypertension’); in addition, we find regular morphological alternations (verhoogde bloeddruk ‘elevated blood pressure’, hypertens ‘hypertensive’), abbreviations (bhd, ht) and numerous idiosyncratic variants.

Medical term variation poses a challenge to the automatic processing of clinical documents, such as the automatic assignment of ontological codes to electronic health records (EHRs): as medical ontologies do not cover non-standard variants, knowledge-based approaches to named entity recognition (NER) are prone to miss a considerable portion of relevant entities in text. On the other hand, machine learning and distributional approaches struggle with the high proportion of idiosyncrasies, as well as the stylistic features of EHRs, which are typically composed in a non-grammatical, telegraphic style. To improve NER applications for the clinical domain, it is thus crucial, firstly, to identify patterns of variation between term types and, secondly, to detect the context factors that motivate such alternations. For instance, the individual sections within an EHR (e.g. physical examination vs. clinical conclusion) may show distinctive proportions of terms from certain semantic categories (e.g. procedures of examination vs. diagnoses) and terms from a particular register (e.g. vernacular vs. specialized). Such correlations can be utilized to compose a domain-specific feature set for clinical NER.

To assess term variation in clinical Dutch, we conducted a corpus study based on a sample of EHRs from endocrinology. Altogether, the medical histories of 180 patients with diabetes were labelled with codes from the clinical terminology SNOMED-CT. After reporting on methodological challenges encountered during the annotation process, I will present preliminary results from the analysis of the annotated data. Starting from a typology of variation types, I will assess the overlap with the terms covered by standard medical ontologies. Then, I will investigate the term distribution in our corpus based on the relative frequency of term types across the individual case histories, and the different EHR sections. To conclude, I will summarize my initial findings on the influence of context factors, as well as semantic properties of the underlying concept, on term variation.

keywords: terminological variation · clinical sublanguage · electronic health records

references:

Grön, L., & Bertels, A. (2018). Clinical sublanguages: Vocabulary structure and its impact on term weighting. Terminology, 24(1).

slides: here

Friday 30 March 2018, 14:00-15:00, Collège Erasme c.142

Eric Kergosien (GERIICO / SID, Université de Lille, FR)

Spatial analysis of digital media using text mining approaches

In his talk, Éric Kergosien addresses the questions raised by analysing the content of technology-mediated communications in order to extract knowledge about territories using natural language processing methods: how can locations be identified unambiguously in text documents (linguistic rules for retrieving information about spatial units)? How can the topics discussed in a corpus be extracted (terms that most often occur together), going beyond the often more limited analysis of hashtags?

Can opinions for or against particular spatial planning projects be detected, starting from lexicons of positive and negative words? Drawing on several families of standard texts (press articles, scientific publications) and non-standard texts (tweets and SMS messages), the examples presented highlight the formidable linguistic challenges raised by such content analyses, which are renewing classical text analysis techniques. Improving the link between spatial and thematic content (where are particular events talked about most?) requires, more than ever, close collaboration between computer scientists, linguists, and geographers.

keywords: text mining · spatial information extraction · social networks

references:

Zenasni, S., Kergosien, E., Roche, M., & Teisseire, M. (2018). Spatial Information Extraction from Short Messages. Expert Systems with Applications, 95, 351–367. https://doi.org/10.1016/j.eswa.2017.11.025

Friday 20 April 2018, 14:00-15:00, Collège Erasme c.142

Henning Wachsmuth (Computational Social Science Group, University of Paderborn, DE)

Computational Assessment of Argumentation Quality

The automatic mining of arguments from natural language text has recently received increased attention, due to its expected impact on future search engines and intelligent personal assistants. Assessing the quality of arguments and argumentation is critical for any application built upon argument mining. Based on foundations from argumentation theory, this talk will present a selection of recent computational approaches to quality assessment. We will discuss the benefit of these approaches in light of the first search engine for arguments on the web, args.me.

keywords: computational argumentation · quality assessment · argument search

references:

Wachsmuth, H., Al-Khatib, K., & Stein, B. (2016). Using Argument Mining to Assess the Argumentation Quality of Essays. In Proceedings of the 26th International Conference on Computational Linguistics (COLING) (pp. 1680–1692).
Wachsmuth, H., Naderi, N., Hou, Y., Bilu, Y., Prabhakaran, V., Alberdingk Thijm, T., Hirst, G., & Stein, B. (2017). Computational Argumentation Quality Assessment in Natural Language. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL) (pp. 176–187).
Wachsmuth, H., Potthast, M., Al-Khatib, K., Ajjour, Y., Puschmann, J., Qu, J., Dorsch, J., Morari, V., Bevendorff, J., & Stein, B. (2017). Building an Argument Search Engine for the Web. In Proceedings of the Fourth Workshop on Argument Mining (ArgMining) (pp. 49–59).

slides: here

Friday 4 May 2018, 14:00-15:00, Collège Erasme c.142

Pierre Lison (Norwegian Computing Center, NO)

Dialogue modelling: spoken dialogue systems and multilingual corpora

Dialogue modelling is an integral part of many NLP applications, largely thanks to the growing success of "conversational" interfaces such as chatbots and personal assistants (e.g. Siri or Alexa). This talk aims to give a brief overview of dialogue modelling through two questions.
1) Can a machine be taught to converse within a specific task? In particular, we will examine how to estimate statistical dialogue models efficiently when the available data are very limited (or costly to collect), using a hybrid approach that combines linguistic or domain knowledge with statistical models.
2) Can multilingual corpora be built from film subtitles? Film subtitles are an important resource for NLP because they are available in many languages and cover multiple registers. The OpenSubtitles corpus, whose latest version we recently released, is the largest collection of parallel corpora in the public domain, covering 3.4 billion sentences across no fewer than 60 languages, from Afrikaans to Vietnamese by way of Breton and Sinhala.

keywords: dialogue modelling · spoken dialogue systems · parallel corpora · machine translation

references:

Lison, P., & Meena, R. (2014). Spoken Dialogue Systems: The New Frontier in Human-computer Interaction. XRDS: Crossroads, 21(1), 46–51. ACM.
Lison, P., & Tiedemann, J. (2016). OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia.

Friday 18 May 2018

PLIN Linguistic Day 2018:

Technological innovation in language learning and teaching

> Full program (PDF)

Organizing committee

Cédrick Fairon, Université catholique de Louvain
Mathieu Lecouvet, Université catholique de Louvain
Fanny Meunier, Université catholique de Louvain
Ferran Suñer, Université catholique de Louvain
