Organisation 2017-18
Calendar 2017-18
In brief
- 20 October 2017: Natalia Grabar (Université de Lille), Resource acquisition for the simplification of medical texts
- 17 November 2017: Naomi Baron (American University), Learning, Knowing, and Remembering in a Digital World
- 24 November 2017: Dirk De Hertog (imec - KU Leuven), Embeddings and their use as features in supervised learning tasks
- 1 December 2017: Pierre Deville (Bisnode), Network Science in the era of Text Mining and Big Data
- 8 December 2017: Aline Villavicencio (University of Essex • Federal University of Rio Grande do Sul), Identifying Idiomatic Language with Distributional Semantic Models
- 9 March 2018: Leonie Grön (KU Leuven), Term variation in clinical records: First insights from a corpus study
- 30 March 2018: Eric Kergosien (Université de Lille), Spatial analysis of digital media through text mining approaches
- 20 April 2018: Henning Wachsmuth (University of Paderborn), Computational Assessment of Argumentation Quality
- 4 May 2018: Pierre Lison (Norwegian Computing Center), Dialogue modelling: spoken dialogue systems and multilingual corpora
- 18 May 2018: PLIN Linguistic Day 2018, Technological innovation in language learning and teaching
Full programme
Natalia Grabar (STL, Université de Lille, CNRS, FR)
Resource acquisition for the simplification of medical texts
One distinctive feature of medical texts is their use of highly specialised technical terms, which often remain incomprehensible to ordinary speakers. When simplifying such texts, it is therefore important to have the necessary lexical resources available. We introduce two methods for acquiring such resources: one relies on term-internal cues (the morphological analysis of compound terms), while the other exploits term-external cues (reformulations found in the texts). Neither method requires parallel corpora. We describe and discuss the results.
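As a minimal illustration of the term-internal route, the sketch below segments a compound into known components and glosses each one. The component inventory, glosses, and example terms are invented for this illustration, not taken from the talk:

```python
# Segment a neoclassical medical compound into known components and
# gloss each one. Components and example terms are illustrative only;
# the actual method relies on a much richer rule set for French.
COMPONENTS = {
    "gastr": "stomach",
    "cardi": "heart",
    "hepat": "liver",
    "algie": "pain",
    "ite": "inflammation",        # French -ite ~ English -itis
    "ectomie": "surgical removal",
}
ORDERED = sorted(COMPONENTS, key=len, reverse=True)  # longest match first

def gloss(term):
    """Greedy left-to-right matching of known components."""
    parts, i = [], 0
    while i < len(term):
        for comp in ORDERED:
            if term.startswith(comp, i):
                parts.append(COMPONENTS[comp])
                i += len(comp)
                break
        else:
            i += 1  # skip linking vowels and unknown characters
    return " ".join(parts) if parts else None

for term in ["gastrite", "cardialgie", "hepatectomie"]:
    print(term, "->", gloss(term))  # e.g. gastrite -> stomach inflammation
```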
keywords: text simplification · resource acquisition · rule definition · categorisation · medical domain
references:
- Antoine, E., & Grabar, N. (2016). Exploitation de reformulations pour l’acquisition d’un vocabulaire expert/non expert. In Actes de la conférence conjointe JEP-TALN-RECITAL 2016, volume 2 : TALN (pp. 153–166).
- Grabar, N., & Hamon, T. (2016). A large rated lexicon with French medical words. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016) (pp. 2643–2648).
- Grabar, N., & Hamon, T. (2016). Exploitation de la morphologie pour l’extraction automatique de paraphrases grand public des termes médicaux. TAL, 57(1), 85–109.
slides: here
Friday 17 November 2017, 2-3 pm, Auditoire MONTESQUIEU 3
Seminar co-organised with the Institut Langage et Communication (IL&C)
Naomi Baron (CTRL, American University, Washington, D.C., US)
Learning, Knowing, and Remembering in a Digital World
Digital tools such as the internet, search engines, and online navigation have put a wealth of information at our fingertips. Are these same tools changing the way we use human cognitive skills to learn, know, and remember? Research suggests that the availability of “google knowing” is redefining our assumptions about what kinds of data, and of knowledge, are appropriately held in our own heads. These redefinitions are, in turn, reshaping academic curricula, for good or for ill.
keywords: cognition · digital impact on memory · educational curricula · GPS
references: here
live stream and video: www.facebook.com/didaxoUlearn/videos/1174447416020956
slides: here
Friday 24 November 2017, 2-3 pm, Collège Erasme c.142
Dirk De Hertog (ITEC, imec - KU Leuven, BE)
Embeddings and their use as features in supervised learning tasks
This talk provides an introduction to the use and value of distributional word representations within machine learning approaches to NLP. Machine learning aims to learn how to perform specific tasks (e.g., POS tagging or named entity recognition) by deriving statistical associations between annotated examples and so-called features, i.e., meaningful pieces of information that are relevant for the problem at hand. If the learning is successful, the resulting model can be applied to similar, yet unseen examples. A recent development within NLP is to replace traditional ‘flat’ features with distributional ‘semantic’ representations, such as semantic vector spaces (SVS) and word2vec. These methods rely on contextual information derived from large-scale corpora to build vector representations of words, effectively transforming each word into a complex data structure.
keywords: embeddings · supervised learning
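A minimal sketch of this pipeline, using gensim and scikit-learn; the corpus, the toy labels, and the hyperparameters are placeholders rather than material from the talk:

```python
# Train word2vec embeddings on a toy corpus, then use the resulting
# vectors as features for a supervised classifier. All data here are
# invented placeholders for illustration.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

sentences = [
    ["the", "patient", "was", "given", "aspirin"],
    ["the", "doctor", "prescribed", "insulin"],
    ["paris", "and", "london", "are", "cities"],
]

# vector_size / min_count / epochs follow the gensim >= 4.0 API
model = Word2Vec(sentences, vector_size=25, min_count=1, epochs=50, seed=0)

# Each word is now a dense vector rather than a 'flat' symbolic feature.
words = ["aspirin", "insulin", "paris", "london"]
labels = [1, 1, 0, 0]  # toy task: 1 = drug, 0 = location
X = np.array([model.wv[w] for w in words])

clf = LogisticRegression().fit(X, labels)
print(clf.predict(model.wv[["aspirin", "paris"]]))
```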
reference:
- Turney, P. D., & Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37, 141–188. https://doi.org/10.1613/jair.2934
slides: here
Friday 1 December 2017, 2-3 pm, Collège Erasme c.142
Pierre Deville (Head of Data Science, Bisnode Group Analytics, BE)
Network Science in the era of Text Mining and Big Data
Networks are everywhere. From micro to macro, they pervade our world, and understanding their structure and behaviour has become a major research concern. Yet while tremendous progress has been made on natural networks, we still struggle when it comes to human or business behaviour. This is about to change, as we are now entering the data-driven age. A former PhD student at UCL, young entrepreneur, and professor at Solvay, Pierre Deville will discuss the opportunities offered by the big data revolution and by text mining tools, and how they relate to his research on network analysis.
keywords: networks · big data · visualization
references:
- Deville, P., Wang, D., Sinatra, R., Song, C., Blondel, V. D., & Barabási, A.-L. (2014). Career on the Move: Geography, Stratification, and Scientific Impact. Scientific Reports, 4, 4770. https://doi.org/10.1038/srep04770
- Sinatra, R., Deville, P., Szell, M., Wang, D., & Barabási, A.-L. (2015). A century of physics. Nature Physics, 11(10), 791–796. https://doi.org/10.1038/nphys3494
- Sinatra, R., Wang, D., Deville, P., Song, C., & Barabási, A.-L. (2016). Quantifying the evolution of individual scientific impact. Science, 354(6312). https://doi.org/10.1126/science.aaf5239
Friday 8 December 2017, 2-3 pm, Collège Erasme c.142
Aline Villavicencio (University of Essex, UK • INF, Federal University of Rio Grande do Sul, BR)
Identifying Idiomatic Language with Distributional Semantic Models
Precise natural language understanding requires adequate treatment of both single words and larger units. However, expressions like compound nouns may display idiomaticity: while a police car is a car used by the police, a loan shark is not a fish that can be borrowed. It is therefore important to identify which expressions are idiomatic and which are not, as the latter can be interpreted from a combination of the meanings of their component words while the former cannot. In this talk I discuss the ability of distributional semantic models (DSMs) to capture idiomaticity in compounds, by means of a large-scale multilingual evaluation of DSMs in French and English. In total, 816 DSMs were constructed and compared in 2,856 evaluations. The results show a high correlation with human judgments about compound idiomaticity (Spearman’s ρ = .82 on one dataset), indicating that these models are able to successfully detect idiomaticity.
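A common operationalisation of this idea can be sketched as follows: compare the vector of the whole compound with a composed vector of its parts, with low similarity suggesting idiomaticity. The additive composition function is one standard choice, not necessarily the exact model of the paper, and the random vectors stand in for embeddings trained on a real corpus:

```python
# Compositionality score for a two-word compound: cosine similarity
# between the compound's own vector and the sum of its parts' vectors.
# 'police car' should score high (compositional), 'loan shark' low
# (idiomatic). Random vectors are stand-ins for trained embeddings.
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in
       ["police", "car", "police_car", "loan", "shark", "loan_shark"]}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def compositionality(modifier, head, compound):
    composed = emb[modifier] + emb[head]  # additive composition
    return cosine(composed, emb[compound])

print("police car:", compositionality("police", "car", "police_car"))
print("loan shark:", compositionality("loan", "shark", "loan_shark"))
```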
keywords: idiomaticity · distributional semantic models · compound nouns · multiword expressions
references:
- Wilkens, R., Zilio, L., Cordeiro, S. R., Ramisch, C., Idiart, M., & Villavicencio, A. (2017). LexSubNC: A Dataset of Lexical Substitution for Nominal Compounds. In Proceedings of the 12th International Conference on Computational Semantics (IWCS). Montpellier.
- Cordeiro, S., Ramisch, C., Idiart, M., & Villavicencio, A. (2016). Predicting the Compositionality of Nominal Compounds: Giving Word Embeddings a Hard Time. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers, pp. 1986–1997). Berlin, Germany: Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1187
- Ramisch, C., Cordeiro, S., Zilio, L., Idiart, M., & Villavicencio, A. (2016). How Naked is the Naked Truth? A Multilingual Lexicon of Nominal Compound Compositionality. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016) (Vol. 2: Short Papers, pp. 156–161). Berlin, Germany: ACL. http://aclweb.org/anthology/P/P16/P16-2026.pdf
- Cordeiro, S., Ramisch, C., & Villavicencio, A. (2016). mwetoolkit+sem: Integrating Word Embeddings in the mwetoolkit for Semantic MWE Processing. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016) (pp. 1221–1225). Portorož, Slovenia: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2016/summaries/271.html
Leonie Grön (QLVL, KU Leuven, BE)
Term variation in clinical records: First insights from a corpus study
In clinical documentation, we encounter a range of term variants that goes far beyond the standard forms found in medical ontologies. For instance, in Dutch, the concept of high blood pressure can be expressed by a vernacular expression (bloedhoogdruk ‘high blood pressure’) or a specialized term (hypertensie ‘hypertension’); in addition, we find regular morphological alternations (verhoogde bloeddruk ‘elevated blood pressure’, hypertens ‘hypertensive’), abbreviations (bhd, ht), and numerous idiosyncratic variants.
Medical term variation poses a challenge to the automatic processing of clinical documents, such as the automatic assignment of ontological codes to electronic health records (EHRs): as medical ontologies do not cover non-standard variants, knowledge-based approaches to named entity recognition (NER) are prone to miss a considerable portion of relevant entities in text. On the other hand, machine learning and distributional approaches struggle with the high proportion of idiosyncrasies, as well as with the stylistic features of EHRs, which are typically composed in a non-grammatical, telegraphic style. To improve NER applications for the clinical domain, it is thus crucial to, firstly, identify patterns of variation between term types and, secondly, detect the context factors that motivate such alternations. For instance, the individual sections within an EHR (e.g. physical examination vs. clinical conclusion) may show distinctive proportions of terms from certain semantic categories (e.g. examination procedures vs. diagnoses) and terms from a particular register (e.g. vernacular vs. specialized). Such correlations can be exploited to compose a domain-specific feature set for clinical NER.
To assess term variation in clinical Dutch, we conducted a corpus study based on a sample of EHRs from endocrinology. Altogether, the medical histories of 180 patients with diabetes were labelled with codes from the clinical terminology SNOMED-CT. After reporting on methodological challenges encountered during the annotation process, I will present preliminary results from the analysis of the annotated data. Starting from a typology of variation types, I will assess the overlap with the terms covered by standard medical ontologies. Then, I will investigate the term distribution in our corpus based on the relative frequency of term types across the individual case histories, and the different EHR sections. To conclude, I will summarize my initial findings on the influence of context factors, as well as semantic properties of the underlying concept, on term variation.
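To make the NER challenge concrete, here is a minimal sketch of a knowledge-based matcher over a hand-made variant table; the variant list and the SNOMED CT code are illustrative assumptions, and a real system would add proper tokenisation and full ontology lookup:

```python
# Map surface variants of one clinical concept to a single concept code.
# The variants and the SNOMED CT code below are illustrative assumptions.
VARIANTS = {
    "hypertensie": "38341003",          # specialised term
    "bloedhoogdruk": "38341003",        # vernacular expression
    "verhoogde bloeddruk": "38341003",  # morphological alternation
    "bhd": "38341003",                  # abbreviation
    "ht": "38341003",                   # abbreviation
}

def annotate(text):
    """Naive substring matching; longest variants first."""
    text = text.lower()
    hits = []
    for variant in sorted(VARIANTS, key=len, reverse=True):
        pos = text.find(variant)
        if pos != -1:
            hits.append((variant, VARIANTS[variant], pos))
    return hits

print(annotate("Patiënt met verhoogde bloeddruk, medicatie tegen HT."))
```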
keywords: terminological variation · clinical sublanguage · electronic health records
references:
- Grön, L., & Bertels, A. (2018). Clinical sublanguages: Vocabulary structure and its impact on term weighting. Terminology, 24(1).
slides: here
Friday 30 March 2018, 2-3 pm, Collège Erasme c.142
Eric Kergosien (GERIICO / SID, Université de Lille, FR)
Spatial analysis of digital media through text mining approaches
In his talk, Éric Kergosien addresses the questions raised by analysing the content of technology-mediated communication in order to extract knowledge about territories with natural language processing methods: how can locations be identified unambiguously in textual documents (linguistic rules for retrieving information about spatial units)? How can the topics discussed in a corpus be extracted (terms that most frequently occur together), beyond the often more limited analysis of hashtags?
Can opinions for or against particular planning operations be detected, starting from lexicons of positive and negative words? Drawing on several families of standard texts (press articles, scientific publications) and non-standard texts (tweets and SMS), the examples presented bring out the formidable linguistic challenges that such content analyses raise, and how they renew classical text analysis techniques. Improving the link between spatial and thematic content (where is a given event talked about most?) requires, more than ever, close collaboration between computer scientists, linguists, and geographers.
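As a minimal sketch of the lexicon-based opinion detection evoked above, the toy example below counts positive and negative lexicon hits in a message; the lexicons and the input sentence are invented for illustration:

```python
# Score a message by counting hits in positive/negative word lexicons.
# Lexicons and example text are invented for illustration.
POSITIVE = {"great", "useful", "welcome", "improved"}
NEGATIVE = {"noisy", "ugly", "useless", "delayed"}

def polarity(text):
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

print(polarity("the new tram line is great but the works are noisy and delayed"))
# -> -1 (one positive hit, two negative hits)
```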
keywords: text mining · spatial information extraction · social networks
reference:
- Zenasni, S., Kergosien, E., Roche, M., & Teisseire, M. (2018). Spatial Information Extraction from Short Messages. Expert Systems with Applications, 95, 351–367. https://doi.org/10.1016/j.eswa.2017.11.025
Friday 20 April 2018, 2-3 pm, Collège Erasme c.142
Henning Wachsmuth (Computational Social Science Group, University of Paderborn, DE)
Computational Assessment of Argumentation Quality
The automatic mining of arguments from natural language text has recently received increased attention, due to its expected impact on future search engines and intelligent personal assistants. Assessing the quality of arguments and argumentation is critical for any application built upon argument mining. Based on foundations from argumentation theory, this talk will present a selection of recent computational approaches to quality assessment. We will discuss the benefit of these approaches in light of the first search engine for arguments on the web, args.me.
keywords: computational argumentation · quality assessment · argument search
references:
- Wachsmuth, H., Al-Khatib, K., & Stein, B. (2016). Using Argument Mining to Assess the Argumentation Quality of Essays. In Proceedings of the 26th International Conference on Computational Linguistics (COLING) (pp. 1680–1692).
- Wachsmuth, H., Naderi, N., Hou, Y., Bilu, Y., Prabhakaran, V., Thijm, T. A., Hirst, G., & Stein, B. (2017). Computational Argumentation Quality Assessment in Natural Language. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL) (pp. 176–187).
- Wachsmuth, H., Potthast, M., Al-Khatib, K., Ajjour, Y., Puschmann, J., Qu, J., Dorsch, J., Morari, V., Bevendorff, J., & Stein, B. (2017). Building an Argument Search Engine for the Web. In Proceedings of the Fourth Workshop on Argument Mining (ArgMining) (pp. 49–59).
slides: here
Friday 4 May 2018, 2-3 pm, Collège Erasme c.142
Pierre Lison (Norwegian Computing Center, NO)
Dialogue modelling: spoken dialogue systems and multilingual corpora
Dialogue modelling is an integral part of many NLP applications, largely thanks to the growing success of "conversational" interfaces such as chatbots and personal assistants (e.g. Siri or Alexa). The aim of this talk is to give a quick overview of dialogue modelling through two questions.
1) Can a machine be taught to hold a dialogue within a specific task? In particular, we will examine how to estimate statistical dialogue models efficiently when the available data are very limited (or costly to collect), using a hybrid approach that combines linguistic or domain knowledge with statistical models.
2) Can multilingual corpora be built from film subtitles? Film subtitles are an important resource for NLP, owing to their availability in many languages and across multiple language registers. The OpenSubtitles corpus, whose latest version we recently released, is the largest collection of parallel corpora in the public domain, covering 3.4 billion sentences across no fewer than 60 languages, from Afrikaans to Vietnamese by way of Breton and Sinhala.
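As an illustration of question 2, here is a minimal sketch for iterating over a line-aligned parallel corpus in the plain-text ("Moses") format distributed by OPUS, where line i of each file forms a translation pair; the file names are placeholders:

```python
# Iterate over an OPUS-style line-aligned parallel corpus: two plain-text
# files, one sentence per line. File names below are placeholders.
from itertools import islice

def read_pairs(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for s, t in zip(src, tgt):
            yield s.strip(), t.strip()

# Print the first three sentence pairs of a French-English corpus.
for fr, en in islice(read_pairs("OpenSubtitles.fr-en.fr",
                                "OpenSubtitles.fr-en.en"), 3):
    print(fr, "|||", en)
```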
keywords: dialogue modelling · spoken dialogue systems · parallel corpora · machine translation
references:
- Lison, P., & Meena, R. (2014). Spoken Dialogue Systems: The New Frontier in Human-computer Interaction. XRDS: Crossroads, 21(1), 46–51. ACM.
- Lison, P., & Tiedemann, J. (2016). OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia.
PLIN Linguistic Day 2018:
Technological innovation in language learning and teaching
Organising committee
Cédrick Fairon, Université catholique de Louvain
Mathieu Lecouvet, Université catholique de Louvain
Fanny Meunier, Université catholique de Louvain
Ferran Suñer, Université catholique de Louvain