Thomas François & Damien De Meyere
The CEFRLex project: multilingual CEFR-graded lexical resources for foreign language learning, teaching and research
Since Thorndike's list of the 20,000 most frequent words in English, vocabulary lists have been extensively used to help define learning goals for the lexical component of foreign language curriculums. Various approaches have been used to build such lists: (1) estimating word frequencies in large native language (L1) corpora (Kucera & Francis, 1967; Baayen et al., 1993; Brysbaert & New, 2009), (2) relying on expert knowledge, as is the case for the Reference Level Descriptions (RLD) of the CEFR (Common European Framework of Reference), or (3) trying to obtain a graded list of words with machine learning algorithms (Brooke et al., 2012). However, all these lists suffer from the same flaw: they model lexical knowledge in a nominal fashion, i.e. they assign words to levels of proficiency, which implies that all learners at a given level should know all the words from this level.
In the CEFRLex project, lexical resources that describe the frequency distributions of thousands of lexemes along the CEFR scale have been built for various European languages (so far French, Swedish, English, Spanish, Dutch, and German). These distributions have been estimated on corpora of pedagogical materials intended for L2 purposes, such as textbooks and simplified readers. The resulting resources have been manually checked and are machine-readable and open-licensed. Furthermore, they are available via an on-line query engine for teachers and/or learners.
The main specificity of the CEFRLex project is that it assumes a continuous view of lexical learning. As words are described in terms of frequency distributions over proficiency levels, this approach can account for the fact that a word normally considered to be known at a given level (e.g. B1) in standard resources already appears in texts intended for lower proficiency levels. Similarly, frequencies per level also make it possible to discriminate between words assigned to the same level of proficiency. We believe that these innovative resources pave the way for a large spectrum of research opportunities for text readability assessment (Tack et al., 2016a; Pilán et al., 2016), for automated essay grading systems (Pilán et al., 2016), or for personalized models of learners' lexical knowledge (Tack et al., 2021). In our presentation, we will summarize the methodological principles underlying the CEFRLex project, introduce the different resources and demonstrate a new and enhanced interface for the CEFRLex project.
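To make the contrast between a nominal level label and a per-level frequency distribution concrete, here is a minimal Python sketch. The words, the frequency values, the nominal labels and the helper names (`LEXICON`, `NOMINAL_LEVEL`, `first_attested_level`, `frequency_at`) are all hypothetical and chosen for illustration only; they are not taken from the actual CEFRLex resources or their file format.

```python
from typing import Dict, Optional

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Hypothetical nominal labels, as a traditional graded list might assign them.
NOMINAL_LEVEL: Dict[str, str] = {"journey": "B1", "voyage": "B1"}

# Hypothetical per-level frequencies (occurrences per million words).
# These numbers are invented for the example, not real CEFRLex data.
LEXICON: Dict[str, Dict[str, float]] = {
    "journey": {"A1": 0.0, "A2": 12.0, "B1": 48.0, "B2": 55.0, "C1": 60.0, "C2": 58.0},
    "voyage": {"A1": 0.0, "A2": 0.0, "B1": 9.0, "B2": 21.0, "C1": 30.0, "C2": 33.0},
}


def first_attested_level(dist: Dict[str, float], threshold: float = 0.0) -> Optional[str]:
    """Return the lowest level whose frequency exceeds the threshold, if any."""
    for level in LEVELS:
        if dist.get(level, 0.0) > threshold:
            return level
    return None


def frequency_at(word: str, level: str) -> float:
    """Look up the frequency of a word at one proficiency level."""
    return LEXICON[word].get(level, 0.0)


if __name__ == "__main__":
    for word, dist in LEXICON.items():
        print(word, "nominal:", NOMINAL_LEVEL[word],
              "first attested:", first_attested_level(dist),
              "frequency at B1:", frequency_at(word, "B1"))
```

In this toy data both words carry the same nominal label, yet the distribution shows that one of them is already attested one level lower and is more frequent at B1, which is the kind of distinction a single label cannot express.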
Elena Volodina
CEFR-graded Morpheme Family for L2 Swedish
Swedish Morpheme Family is a resource that has been developed within the project for Swedish L2 Profiles. We started from a list of lexemes (lemmas+POS+sense) from two L2 corpora, course books and learner essays, both containing CEFR labels for each text. The lexical items have been manually analyzed for morphemes, primarily derivational morphemes such as roots, prefixes and suffixes, as well as for primary word-building mechanisms, such as compounding and derivation. As a result, we have frequency information per CEFR level not only for each lexical item, but also for each root, prefix, suffix, word formation mechanism, etc., which makes our resource unique. In addition, users can explore receptive and productive tendencies and download the dataset for any potential applications.
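As a rough illustration of how per-level frequencies for lexical items could be rolled up to the morpheme level, the Python sketch below sums the counts of every item containing a given morpheme. The entries, their segmentations, the counts and the function name `morpheme_frequencies` are hypothetical; the actual resource relies on manual morphological analysis, and its data format may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical entries: (lexeme, morphemes, frequency per CEFR level).
# Segmentations and counts are invented for illustration only.
Entry = Tuple[str, List[str], Dict[str, int]]

ENTRIES: List[Entry] = [
    ("vänlig", ["vän", "-lig"], {"A2": 5, "B1": 8}),
    ("ovänlig", ["o-", "vän", "-lig"], {"B1": 2, "B2": 4}),
    ("vänskap", ["vän", "-skap"], {"B1": 3}),
]


def morpheme_frequencies(entries: List[Entry]) -> Dict[str, Dict[str, int]]:
    """Sum the per-level counts of all lexemes that contain each morpheme."""
    totals: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for _lexeme, morphemes, freqs in entries:
        for morpheme in morphemes:
            for level, count in freqs.items():
                totals[morpheme][level] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {morpheme: dict(levels) for morpheme, levels in totals.items()}


if __name__ == "__main__":
    for morpheme, levels in morpheme_frequencies(ENTRIES).items():
        print(morpheme, levels)
```

Keeping the counts separate for each corpus (course books versus learner essays) would allow the same aggregation to be run twice, once for receptive and once for productive data.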
In this talk, I will present the work behind the creation of the dataset and the user interface, demo the tool, as well as outline how the tool and the resource can be applied to SLA research, teaching, assessment, ICALL development and some other scenarios.
Elena Volodina is a researcher at the University of Gothenburg, Sweden. She has been active within the development of resources and applications for language learning, her main area of expertise being that of Intelligent Computer-Assisted Language Learning, Learner Corpus Infrastructure, computational linguistic methods and corpus-based text studies. Recently, she has been involved in developing tools for automatic pseudonymization of learner essays and creating lexical and grammatical profiles for Swedish as a second language.
Graded resources: from linguistic engineering to practical applications
Interest in matching reading materials with students' abilities can be traced back to the late 19th century and is strongly rooted in empiricism: observation, experience, data. Language professors and psychologists started to collect and analyse quantitative information from textbooks intended to help teachers and educators deal with vocabulary and reading instruction. A variety of graded reading standards emerged and gradually enriched teaching materials with levels of difficulty, while vocabulary-frequency lists for different languages and readability formulas blossomed and came a long way during the 20th century.
Where do we stand now, in the age of language technologies, artificial intelligence, smart education, and adaptive learning? In this talk I will give an overview of existing graded resources and linguistic engineering methodologies. A brief sketch of some practical applications in different scenarios will open the debate on future work in the field.
Núria Gala has been an Assistant Professor at Aix Marseille Univ. (AMU, France) since 2004. She is interested in analysing linguistic complexity and in building resources to help struggling readers improve reading and vocabulary learning. Her research projects are oriented towards the use of language technologies in computer-assisted language learning applications, and towards populations with special reading-comprehension needs (low-readers, dyslexia, illiteracy, etc.).
The Duolingo CEFR Checker: A Multilingual Tool for Adapting Learning Content
Duolingo is the world's most popular language education platform, with more than 500 million students worldwide. Content creation for the Duolingo app requires adapting text in many languages to target varying levels of proficiency. To make this process more efficient, we have developed automated multilingual methods for aligning content to the CEFR proficiency standard. In this talk, I’ll discuss the Duolingo CEFR Checker, a (semi-)language-agnostic tool that aligns text to the CEFR standard using methods that involve transfer learning, multilingual word embeddings, and word frequencies estimated across large corpora.
Bill McDowell has worked as a Machine Learning Engineer at Duolingo since 2018. At Duolingo, his work has spanned NLP projects mostly related to grammar pedagogy, including grammatical error correction, detection, and rule induction. Prior to Duolingo, he spent several years wandering around many academic labs as a research programmer, where he worked on a smorgasbord of projects ranging across topics including temporal relation extraction, computational pragmatics, knowledge base construction, and interpretable machine learning. His background generally falls within the scope of computer science, cognitive science, and philosophy.