The Duolingo CEFR Checker: A Multilingual Tool for Adapting Learning Content
Duolingo is the world's most popular language education platform, with more than 500 million students worldwide. Content creation for the Duolingo app requires adapting text in many languages to target varying levels of proficiency. To make this process more efficient, we have developed automated multilingual methods for aligning content to the CEFR proficiency standard. In this talk, I’ll discuss the Duolingo CEFR Checker, a (semi-)language-agnostic tool that aligns text to the CEFR standard using methods that involve transfer learning, multilingual word embeddings, and word frequencies estimated across large corpora.
Bill McDowell has worked as a Machine Learning Engineer at Duolingo since 2018. At Duolingo, his work has spanned NLP projects mostly related to grammar pedagogy, including grammatical error correction, detection, and rule induction. Prior to Duolingo, he spent several years wandering among academic labs as a research programmer, where he worked on a smorgasbord of projects on topics including temporal relation extraction, computational pragmatics, knowledge base construction, and interpretable machine learning. His background generally falls within the scope of computer science, cognitive science, and philosophy.
Elena Volodina
CEFR-graded Morpheme Family for L2 Swedish
The Swedish Morpheme Family is a resource developed within the project for Swedish L2 Profiles. We started from a list of lexemes (lemmas+POS+sense) drawn from two L2 corpora, course books and learner essays, both of which carry a CEFR label for each text. The lexical items have been manually analyzed for morphemes, primarily derivational morphemes such as roots, prefixes, and suffixes, as well as for primary word-building mechanisms, such as compounding and derivation. As a result, we have frequency information per CEFR level not only for each lexical item, but also for each root, prefix, suffix, word formation mechanism, etc., which makes our resource unique. In addition, users can explore receptive and productive tendencies and download the dataset for use in other applications.
In this talk, I will present the work behind the creation of the dataset and the user interface, demo the tool, and outline how the tool and the resource can be applied to SLA research, teaching, assessment, ICALL development, and other scenarios.
Elena Volodina is a researcher at the University of Gothenburg, Sweden. She has been active in the development of resources and applications for language learning, with expertise in Intelligent Computer-Assisted Language Learning, Learner Corpus Infrastructure, computational linguistic methods, and corpus-based text studies. Recently, she has been involved in developing tools for automatic pseudonymization of learner essays and creating lexical and grammatical profiles for Swedish as a second language.