GR4L2 - Accepted presentations

An automatic annotation toolchain to recognize, quantify and visualize occurrences of CEFR-graded vocabulary and grammar patterns in English learner texts

Viet Phe Nguyen, Ye Yao, Andrea Horbach, Stefan Keller, Ronja Laarmann-Quante and Torsten Zesch

When scoring essays written by foreign language learners to determine their language proficiency, occurrences of vocabulary and grammatical phenomena associated with a certain CEFR level are important sources of information both for human scorers and for automatic scoring. To inform computers, various kinds of features based on the frequency of word groups and grammatical phenomena are used (e.g. Hancke and Meurers 2013). It has been shown that lexical and grammatical features in learner texts strongly influence human scores, even distorting perception of unrelated textual issues (Vögelin, Jansen, Keller, Machts, & Möller, 2019). It stands to reason, therefore, that humans benefit from inspecting these feature values or being shown instances of phenomena associated with various CEFR levels directly in the text. However, to our knowledge, no empirical studies have investigated the effects of such measures, which constitutes a research gap.

To meet the twin goals of informing computers and assisting human raters in assessment, we propose an automatic annotation framework that recognizes instances of vocabulary items and grammatical structures associated with a certain CEFR level in English learner texts. We rely on the English Vocabulary Profile and the English Grammar Profile as provided through the English Profile Project. The resources include about 16,000 vocabulary entries and 1,200 grammatical structures anchored within the CEFR.

Our implementation of grammatical structures relies on RUTA (Kluegl et al. 2016) patterns integrated into a UIMA (Ferrucci and Lally 2004) NLP annotation pipeline (Zesch and Horbach 2018). We provide feature extractors based on DKPro TC that extract count-based features about the frequencies of individual grammatical patterns as well as grammatical constructions associated with a certain phenomenon (such as adjectives) or a specific CEFR level (e.g., all patterns associated with B1).

On the vocabulary level, we provide features based on the frequency of words from a certain CEFR level on either type or token basis. We use NLP preprocessing information to extract lemma information for database lookup and POS information to distinguish between vocabulary items with the same lemma, but different POS classes that belong to different CEFR levels (such as ‘house’ as a noun (A1) vs ‘house’ as a verb (C2)).
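The lookup described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (which is built on UIMA and DKPro TC); the table `VOCAB_LEVELS`, the function name `cefr_vocab_features` and the POS tag names are assumptions made for illustration, and only the two 'house' entries come from the text.

```python
from collections import Counter

# Hypothetical excerpt of a CEFR-graded vocabulary list keyed by (lemma, POS).
# The two 'house' entries follow the example in the text; the others are invented.
VOCAB_LEVELS = {
    ("house", "NOUN"): "A1",
    ("house", "VERB"): "C2",
    ("the", "DET"): "A1",
    ("family", "NOUN"): "A1",
}


def cefr_vocab_features(tagged_tokens, basis="token"):
    """Count vocabulary items per CEFR level.

    tagged_tokens: iterable of (lemma, pos) pairs produced by an upstream
                   lemmatizer and POS tagger (assumed, not shown here).
    basis: "token" counts every occurrence; "type" counts each distinct
           (lemma, pos) pair only once.
    """
    if basis not in ("token", "type"):
        raise ValueError("basis must be 'token' or 'type'")
    items = [(lemma.lower(), pos) for lemma, pos in tagged_tokens]
    if basis == "type":
        items = set(items)
    counts = Counter()
    for key in items:
        level = VOCAB_LEVELS.get(key)
        if level is not None:  # items missing from the list are ignored
            counts[level] += 1
    return dict(counts)


if __name__ == "__main__":
    text = [("the", "DET"), ("house", "NOUN"), ("house", "VERB"),
            ("the", "DET"), ("family", "NOUN"), ("zzz", "NOUN")]
    print(cefr_vocab_features(text, basis="token"))
    print(cefr_vocab_features(text, basis="type"))
```

Keying the table on the (lemma, POS) pair rather than on the lemma alone is what keeps the noun and verb readings of 'house' at different levels.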

For the grammar patterns, we implement RUTA patterns for 80% of the patterns from the English Grammar Profile, again relying on lemma and POS information for the process. We provide a manual evaluation study on authentic learner texts to determine precision of the patterns.

We integrate the automatic annotation pipeline into a user interface that allows both the automatic extraction of features and the highlighting of the found annotations in the text. In the workshop, we present preliminary data on the precision of this tool and discuss its future application in studies involving training of teachers’ diagnostic competences, as well as automated feedback to learners.


Toward constructing a corpus with CEFR-based sentence level annotations

Satoru Uchida, Yuki Arase and Tomoyuki Kajiwara

The CEFR is now the world’s most widely used indicator of language learning, where each level is defined by “can-do” descriptors. For example, in the self-assessment grid for reading comprehension in B1, one of the descriptors is “I can understand the description of events, feelings and wishes in personal letters.” Such a statement clearly defines what learners at this level can be expected to do and is helpful to assess learners’ abilities across languages. Vocabulary and grammar skills that are needed to perform the action differ from language to language, so attempts have been made to clarify the criterial features of each level. For example, the English Profile project (cf. Salamoura & Saville, 2010) and the CEFR-J project (cf. Ishii & Tono, 2018) relate English vocabulary and grammar to CEFR levels mainly based on learners’ and textbook corpora. There have also been attempts to determine the level of English reading passages, such as Text Inspector (Bax, 2012) and CVLA (Uchida & Negishi, 2018). These approaches, both at micro (vocabulary and grammar) and macro (passage) levels, have proven to be useful, but there have been few attempts to assign CEFR levels at the sentence level despite its importance for learning and teaching. In addition, sentences with CEFR levels are valuable in natural language processing research. Text simplification (e.g. Nishihara et al. 2019) uses machine learning to automatically transform a complex text into a simple one suitable for a learner. Current text simplification operates sentence by sentence. However, researchers in this area have struggled with the scarcity of language resources that assign CEFR levels to sentences in a reliable manner.

The purpose of the present study is to provide an overview of the construction of language resources with CEFR levels assigned on a sentence-by-sentence basis. The level of a sentence is defined by a variety of factors, including vocabulary, grammar, and topic. For example, the sentence “Music has been important to people.” consists entirely of A1-level vocabulary, but it can be grammatically complex because of the present perfect tense. Also, “Don’t shoot the messenger.” is simple in vocabulary and grammar but is difficult for beginners because it is an idiom. On the other hand, the sentence “A curator is a person who takes care of objects or works of art in a museum.” contains the difficult word “curator,” but since it is a defining statement, understanding the sentence itself is not very challenging. In other words, the difficulty of a sentence is determined by a combination of multiple factors. Given a language resource with sentence-level annotations, machine learning can potentially uncover the factors that define sentence difficulty. To this end, we are currently constructing a corpus of about 20,000 sentences, each of which is assigned a CEFR level by experienced language educators. In this presentation, we will give an overview of this corpus, which will be released in the near future, and then discuss the features of each CEFR level using the preliminary data.


Grammar profiling for empirical research and teaching

Therese Lindström Tiedemann, Yousuf Ali Mohammed and Elena Volodina

The Swedish L2 grammatical profile (SweL2GP) is the first digital tool (work-in-progress) allowing users to study Swedish written learner language from a grammatical point of view. Currently, it can be browsed by verb and noun patterns as they are represented in course books and learner essays. The SweL2GP contains 40 verbal patterns and 26 noun patterns, distributed across five proficiency levels (A1–C1) and contrasting receptive and productive knowledge.

The Swedish L2 grammatical profile is a descriptive resource that offers the users a possibility to draw their own conclusions about the centrality of various grammatical patterns at different stages of linguistic competence. It is possible:
(1) to filter for various characteristics, e.g. for verbs: Tense, Mood, Voice, each pattern having a reference description that can be used by researchers, teachers and learners alike;
(2) to explore hits containing a search pattern in the two source corpora: receptive/COCTAILL (Volodina et al., 2014) and productive/SweLL-pilot (Volodina et al. 2016), by clicking on their frequencies;
(3) to study a statistical break-down of each selection contrasting receptive and productive competences for each feature and to compare absolute and relative frequencies per level, identifying important tendencies;
(4) to download (the filtered parts of) the dataset for further study or annotation.

Grammar profile resources exist predominantly for English, e.g. English Grammar Profile (EGP; O’Keeffe and Mark, 2017), Pearson’s GSE Teacher Toolkit, CEFR-J for Japanese English (Tono, 2018). Most other languages have nothing similar, the Estonian profile (Üksik et al., 2021) being one of the first non-English profiles.

The design of grammar profiles can target different user groups. For example, EGP and other previous grammar profiles focus on teachers and learners. Even though they have been based on empirical corpus data, this data is not openly provided in connection with the resource. EGP, for instance, has a didactic set up that gives teachers and learners encyclopedic overviews, with some excerpts from the data sources that it has been based on but no full access. The L2 Swedish grammar profile is clearly different in that it takes a non-prescriptive view of the language and provides access to the empirical evidence, i.e. all corpus hits and statistics of actual usage. It lets users zoom in on actual data and draw their own conclusions, or stay with the overview provided in the resource. The provision of both receptive and productive frequencies gives a more nuanced picture of language learning. Our resource – and tool – is thus more readily appropriate for research on second language acquisition than any predecessor known to us. Our intention in developing the Swedish L2 grammatical profile is to stimulate further research into L2 learner Swedish, as well as to offer teachers and learners a possibility to work with the tool as part of their courses. In addition, the open nature of the resource makes it highly usable for future learning apps and the training of various automatic tools.


Graded Word Family resource for L2 Swedish

Elena Volodina, Yousuf Ali Mohammed and Therese Lindstrom Tiedemann

The choice of the main lexical unit in a vocabulary list has a significant impact on its application scenarios. The appropriateness of different types of lexical units in language learning contexts is debated (e.g. Brown 2018), and each unit has its advantages and disadvantages. The Swedish L2 Profile (SweL2P) offers a possibility to explore vocabulary in several ways: using lemgrams, sense-based lemgrams (cf. lexemes) and word families (grouped around a shared root) as the main lexical units. We hypothesize that the value of different approaches can best be studied if it is possible to compare the same lexical items from different perspectives, and we see e.g. sense-based lemgrams and word families as providing different perspectives.

An advantage of having vocabulary organized by word families is that it helps to track whether knowledge of certain roots or previously learnt items primes the next step in vocabulary acquisition (cf Laufer et al. 2021). For example, once a learner has become acquainted with the word day, it is hypothetically easier to acquire words and expressions containing day (cf Bauer & Nation 1993; Schmitt & Zimmerman 2002), e.g. weekday, day off. With information about the level where the words are first used and/or (relative) frequency of occurrence at different levels we have a way of studying this systematically. These are features of the Swedish L2 Word Family resource.

Swedish L2 Word Family contains 15,217 sense-based lemgrams organized into 4,429 word families. To create this list, we generated a list of sense-based lemgrams from a corpus of textbooks representing receptive knowledge (Coctaill, Volodina et al. 2014) and a corpus of learner essays representing productive knowledge (SweLL-pilot, Volodina et al. 2016), both labeled with CEFR levels. A team of annotators analyzed each item on the list for its constituent morphemes (e.g. prefix, root) and word formation mechanism (e.g. compounding, derivation). The resulting resource is browsable online.

The SweL2P/Word Family-resource can be filtered for e.g. a certain root, word, word class(es) and/or word formation mechanism(s). Entering a root into the search window (e.g. bröd) shows all lemgrams in the family (for each level where they appear). The resource can be explored using several views: Table, Graphical and Statistical. The Table view (Fig. 1) lists all items with associated information about each of them. Columns contain descriptive information, including a clickable lemgram with a link to its definition and the morphological analysis of the word. Clickable receptive and productive (relative and absolute) frequencies open a corpus search tool containing the hits with those lemgrams. Graphical and Statistical views summarize the statistics and the distribution of various features for the current selection in the two sources, receptive and productive.

The SweL2P/Word family-resource is of a descriptive nature and offers an empirically-based resource for research, teaching and ICALL app development.


DAFLex: a CEFR-graded lexical resource for German as a foreign language

Thomas Francois, Patricia Kerres, Damien De Meyere and Ferran Suñer Muñoz

Since Thorndike's list of the 20,000 most frequent words in English, vocabulary lists have been used consistently to help define learning goals for the lexical component of foreign language curriculums. Various approaches have been used to build such lists. The most common approach consists in estimating word frequencies in large native language (L1) corpora. Interestingly, German is the first language for which a list of the most frequent words was developed, by Kaeding in 1898, based on a corpus of 11 million words (Bontrager, 1991). However, one had to wait until 1995 and the German Celex (Baayen et al., 1995) to get a modern frequency list. Other recent lists include the dlexDB project (Heister et al., 2011), SUBTLEX-D (Brysbaert et al., 2011), or the Frequency Dictionary of German (Tschirner et al., 2019). A second approach relies on expert knowledge, as is the case for the Reference Level Descriptions (RLD) of the CEFR (Common European Framework of Reference). For German, the RLD is called Profile deutsch (Glaboniat et al., 2015) and describes levels A1 to C2.

DAFLex is part of the CEFRLex project that aims to describe word usage by L2 learners of several European languages. The project adopts a fine-grained and continuous approach to word usage and considers that words do not occur with the same frequency at the different stages of the learning process. Therefore, DAFLex is a lexical resource for German that describes the frequency distributions of thousands of lexemes over the six levels of the CEFR scale. These distributions have been estimated on a corpus of pedagogical materials intended for German as a foreign language (GFL). The corpus is currently made up of more than 3700 texts selected from 32 GFL textbooks, which have been digitized with OCR, for a total of about 750,000 words. We then conducted a comparative study of several POS taggers for German to select the best one for creating DAFLex. As a next step, we computed normalized frequency distributions across six CEFR levels (A1 to C2), following the methodology of the CEFRLex project (François et al., 2014).
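To make the idea of a per-level frequency distribution concrete, the following Python sketch computes a plain relative frequency (occurrences per million tokens) for each CEFR level. This is a simplification under stated assumptions: the function name and the sample counts are hypothetical, and the actual CEFRLex methodology may differ, for instance by also accounting for how a word is dispersed across textbooks.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def normalized_frequencies(counts_by_level, tokens_by_level, per=1_000_000):
    """Return occurrences per `per` tokens for each CEFR level.

    counts_by_level: raw occurrences of one lexeme in the texts of each level.
    tokens_by_level: total number of tokens in the texts of each level.
    Levels with no texts (missing or zero token count) get a frequency of 0.0.
    """
    result = {}
    for level in LEVELS:
        total = tokens_by_level.get(level, 0)
        count = counts_by_level.get(level, 0)
        result[level] = per * count / total if total else 0.0
    return result


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    counts = {"A1": 5, "B1": 2}
    sizes = {"A1": 100_000, "A2": 100_000, "B1": 200_000}
    print(normalized_frequencies(counts, sizes))
```

Normalizing by the size of each level's sub-corpus is what makes the six values comparable, since textbooks for different levels rarely contribute the same amount of text.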

In our talk, we will detail more precisely the methodology and data used to create DAFLex and report statistics comparing DAFLex with other frequency lists for German, but also with other resources from the CEFRLex project. The resulting resource is machine-readable and open-licensed. Furthermore, it will be available via an on-line query engine for teachers and/or learners. We believe that DAFLex paves the way for a large spectrum of research opportunities for text readability assessment, for automated essay grading systems, or for personalized models of learners' lexical knowledge, among others.


Mapping Multi-word expressions to CEFR for French as a foreign language: PolylexFLE

Amalia Todirascu, Thomas Francois and Marion Cargill

Multi-word expressions (MWEs) are a highly heterogeneous class of linguistic objects: idioms, collocations, fixed phrases. It is acknowledged that foreign language learners have difficulties acquiring and processing MWEs: their MWE proficiency falls far short of their general lexical knowledge (Bahns & Eldaw, 1993) and they tend to translate MWEs word-for-word, ignoring their figurative meaning or their lexical and syntactic constraints. However, recent studies show that good mastery of MWEs improves reading comprehension (Kremmel et al., 2017).

Various initiatives have sought to improve MWE proficiency for L2 learners. Among them, Language Muse offers compound words or particle verbs in English to improve learners' reading (Madnani et al., 2016); Substituto (Araneta et al., 2020) is a serious game where a limited number of particle verbs, manually identified in the English Vocabulary Profile, should be related to their synonyms. Alfter and Graën (2019) have also developed a system dedicated to particle verbs based on the resources of the CEFRLex project. However, current pedagogical efforts remain hindered by the lack of a rich lexical resource for MWEs in which expressions would be connected to the CEFR scale of proficiency. In this respect, the Base Lexicale du Français (Verlinde et al., 2006), the Lexique-Grammaire (Gross, 1994) or the DIRE autrement project (Hamel et al., 2016) offer lists of MWEs for French, but they are not connected to the CEFR scale (Common European Framework of Reference for Languages). On the other hand, the Referential Level Descriptor of Beacco et al. (2004) includes some MWEs with their CEFR levels, but the list is limited and its format is not suited to automatic processing. Finally, the FLELex resource (François et al., 2014) comprises about 2,000 MWEs for French, connected to the CEFR scale, but most of these are of nominal nature.

In this talk, we present an extended version of the PolylexFLE database, containing 4,525 verbal multiword expressions (MWEs). Verbal MWEs seem more complex for L2 learners (Siyanova-Chanturia, 2017) but also represent a harder challenge for NLP as regards their automatic identification (Pasquer et al., 2018). In order to build a resource of MWEs with CEFR levels, we used a mixed approach (manual and automatic) to annotate 1,186 expressions according to the CEFR levels. The paper focuses on the automatic procedure that first identifies the expressions from the PolylexFLE database (and their variants) in a corpus using a regular expression-based system. In a second step, their distribution in this corpus, labelled according to the CEFR scale, is estimated and transformed into a single CEFR level. To assess the quality of the labels thus inferred, the automatic CEFR annotation of the expressions has been evaluated by 52 FFL learners on a sample of 61 MWEs. The resulting resource, PolylexFLE, could be used to propose exercises that automatically adapt to the CEFR level of learners of French.
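The two-step automatic procedure can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the abstract gives neither the actual regular expressions nor the rule that turns a distribution into a single level. The sketch uses a deliberately crude pattern for one expression ("prendre la fuite") and a hypothetical "first attested level" rule.

```python
import re
from collections import Counter

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Hypothetical, over-permissive pattern for inflected forms of
# "prendre la fuite" (prend, prennent, pris, prit, ...).
PRENDRE_LA_FUITE = re.compile(r"\bpr(?:en|i)\w*\s+la\s+fuite\b", re.IGNORECASE)


def level_distribution(texts_by_level, pattern):
    """Step 1: count matches of one expression in the texts of each level."""
    counts = Counter()
    for level, texts in texts_by_level.items():
        counts[level] = sum(len(pattern.findall(text)) for text in texts)
    return counts


def first_attested_level(counts, min_count=1):
    """Step 2 (assumed rule): lowest level with at least `min_count` hits."""
    for level in LEVELS:
        if counts.get(level, 0) >= min_count:
            return level
    return None


if __name__ == "__main__":
    corpus = {
        "A2": ["Il mange une pomme."],
        "B1": ["Le voleur a pris la fuite.", "Ils prennent la fuite."],
        "B2": ["Elle prit la fuite."],
    }
    dist = level_distribution(corpus, PRENDRE_LA_FUITE)
    print(dict(dist), first_attested_level(dist))
```

Other reduction rules (for example a frequency threshold per level) would fit the same two-function structure; only `first_attested_level` would need to change.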


Inducing CEFR levels for student skills and linguistic constructs from learner data

Jue Hou, Anisia Katinskaia, Giacomo Furlan, Ilmari Kylliäinen, Nataliya Stoyanova and Roman Yangarber

We present an AI-based approach to language learning. We aim to develop an environment for learners beyond the beginner level. The approach tries to: continually model the learner’s competency, generate exercises personalized according to current competency, provide feedback to guide the learner toward the correct answers, and use the answers to continually adjust the student models.

Support for teachers includes: sharing learning materials with students, controlling learning settings, and detailed analysis to track the learner’s progress.

Learners can upload arbitrary texts about their topics of interest, and the system creates an unlimited number of exercises based on these texts. Providing relevant feedback requires evaluating the level of proficiency of the learner and the difficulty of the material. For each language, experts in didactics specify a hierarchy of linguistic constructs, which the learner should master. Examples of constructs: a particular paradigm of noun inflection; how certain verbs govern their arguments; etc. The system links each exercise or test question to learning constructs. Each construct is assigned a CEFR level by a teaching expert.

Our main goal is to provide personalized learning to the student: to offer exercises that are optimally adapted to the student’s current level of skills. Therefore, assessment of learner competency is central to the approach. Continual assessment is needed (1) internally, to support personalization, and (2) externally, to communicate the student’s competency to the learner and the teacher. Thus, we try to link internal representations of competency to the CEFR scale. We research methods for inducing assessment from continuous streams of incoming learner data. We summarize several methods, and how each assessment method is linked to the CEFR scale.

One approach to assessment is via Elo ratings [1, 2, 3, 4, 5]. We can treat one learner response to one exercise as a “match” between the user and the learning item. An item may be: an exercise in a snippet of text—cloze, multiple-choice, etc.—a test question, a vocabulary flashcard, etc. As the “matches” progress, the Elo formula (which is used in chess and many on-line games) adjusts the difficulty levels of learning constructs and skill scores of learners. The score changes continually depending on the correctness of the learner’s answers. Some of the students also receive CEFR levels manually assigned by their (human) teachers, who are our collaborators. These human-assigned CEFR levels are used to calibrate the Elo scores of learners, induced automatically from learner data. This provides estimates of CEFR levels for new students, and estimates of the difficulty of the learning items.
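As an illustration of treating one response as a "match", here is a minimal Python sketch of a standard Elo update. The function name, the K-factor of 32 and the scale of 400 are conventional chess values used as assumptions; the abstract does not state which constants or variant the system uses.

```python
def elo_update(learner, item, correct, k=32.0, scale=400.0):
    """Update a learner's skill and an item's difficulty after one response.

    learner, item: current ratings on the same scale.
    correct: True if the learner answered the item correctly.
    k, scale: assumed constants (common chess defaults).
    Returns the new (learner, item) ratings.
    """
    # Probability that the learner "wins", i.e. answers correctly.
    expected = 1.0 / (1.0 + 10 ** ((item - learner) / scale))
    outcome = 1.0 if correct else 0.0
    delta = k * (outcome - expected)
    # The learner gains what the item loses, and vice versa.
    return learner + delta, item - delta


if __name__ == "__main__":
    print(elo_update(1500.0, 1500.0, correct=True))
    print(elo_update(1500.0, 1500.0, correct=False))
```

A correct answer on an item rated far above the learner moves both ratings more than a correct answer on an easy item, which is how difficulty and skill estimates are adjusted continually as responses arrive.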

We explore two other approaches: simple accuracy count, and the 3PL model from Item Response Theory (IRT) [6, 7, 8, 9]. These methods show that CEFR levels tagged by human experts do not correlate perfectly with difficulty levels induced for constructs directly from learner data. This paper investigates the potential causes for these discrepancies, and shows that the estimates induced automatically from large data are more reliable than levels assigned by teachers by following “traditional” assessment guidelines. The difficulty of a concept, in our context, models the difficulty of answering the questions as experienced by a large (and expanding) base of learners.
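For reference, the 3PL model gives the probability of a correct answer as c + (1 - c) / (1 + exp(-a(theta - b))), where theta is the learner's ability, a the item's discrimination, b its difficulty and c its guessing parameter. The Python sketch below only evaluates this standard formula; the function name and the sample values are illustrative, and how the parameters are fitted to learner data is not covered.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL IRT model.

    theta: learner ability
    a: item discrimination
    b: item difficulty
    c: guessing parameter (lower asymptote), between 0 and 1
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    # A four-option multiple-choice item has a guessing floor near 0.25.
    for theta in (-2.0, 0.0, 2.0):
        print(theta, round(p_correct_3pl(theta, a=1.0, b=0.0, c=0.25), 3))
```

When ability equals difficulty (theta = b), the probability lies halfway between the guessing floor c and 1.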


Mapping of American English vocabulary by grade levels

Michael Flor, Steven Holtzman, Paul Deane and Isaac Bejar

We describe a large-scale effort to map English-language vocabulary by US school-grade levels. Our main motivation is to rapidly expand the coverage of available resources oriented toward work with native English speakers in the USA, while taking into consideration school-related influences rather than relying only on corpus-frequency approaches. Our approach is based on which words are expected to be learned at each grade level. We describe the motivation of this effort and report on the initial effort of data collection, which resulted in mapping of about 21,000 words (types). We provide comparisons of this mapping to some other recent vocabulary mapping efforts, such as age-of-acquisition and testing-based approaches. We then describe the efforts to automatically expand this resource by using linguistically motivated variables and automated corpus-based methods. Our current resource maps more than 126,000 English words (types) to grade levels K-16. Next, we also describe the mapping of our US grade-leveled resource to CEFR vocabulary levels. We also discuss the advantages and the shortcomings of our approach.


Fine-tuning Auto-Regressive language models for Conditional Text Generation

Vasileios Kalogiras, Quintus Roos and Sebastiaan Vergunst

Learning a new language is complicated, requiring both time and effort from the learner (Brown et al., 2000). Depending on the level of ambition and ability, mastering all parts of a new language can take on average up to 5-7 years in traditional learning environments (Demie, 2013). In contrast to traditional classroom-based learning, where pupils follow a predefined curriculum, personalised learning aims to adapt content to each learner's needs (Holmes et al., 2018). In order for language learners to remain motivated, it is important to adapt the content to the right level. By leveraging modern technology, such as recommendation systems and text generators, language learners are able to interact with personalised content and study plans to a greater extent than what would otherwise be possible in traditional learning environments (Gilakjani, 2014). At the same time, previous work on language modeling (Wei et al., 2021) shows that fine-tuning language models, such as GPT-3 (Brown et al., 2020), increases the zero-shot performance on unseen tasks substantially compared to untuned models and surpasses zero-shot GPT-3 on most of the evaluation tasks. In addition, fine-tuned models also demonstrate better generalization capabilities.

In this effort we aim to tackle the task of conditional text generation (CTG) in a language learning context by fine-tuning auto-regressive language models. The models will be fine-tuned by instruction-tuning (Wei et al., 2021) their pre-trained snapshots using sequence classification and sequence generation tasks. We rely on the Common European Framework of Reference (CEFR) as a method of measuring English competence. The generated texts will be in the form of sentences, covering various themes given a certain CEFR level. During the evaluation phase, we will assess whether the generated content matches the correct level. In order to train and evaluate the quality of the fine-tuned model, we will create a dataset of sentences labeled by CEFR level based on text from the Books3 corpus (Presser, 2020; Gao et al., 2021). Annotation will be carried out using a lexical resource such as EFLLex (Dürlich and François, 2018). Model fine-tuning will be performed on freely available language model (Wang and Komatsuzaki, 2021) snapshots. A positive outcome would suggest that auto-regressive language models combined with instruction-tuning are good candidates for text generation for specific CEFR levels without sourcing carefully annotated training corpora. The resulting fine-tuned models will be made available through the Hugging Face model environment.


The role of collocations in text quality: towards a CEFR-graded list of collocations

Rocío Cuberos Vicente and Elisa Rosado Villegas

Gauging the quality of a text is a challenging endeavor. What becomes especially problematic is elucidating what criteria should be applied to decide what makes a text good, and this is true for both first (L1) and second language (L2) production. In finding the best way to express themselves, both native and non-native speakers/writers make decisions that affect the syntactic, the discursive and, crucially, the lexical level.

In L1 production, the development of the skills involved in text production is mainly shaped by the schooling level (Tolchinsky, 2004), while in the L2 this necessarily interacts with the level of competence (Rosado et al., 2014). As far as lexical development is concerned, previous research (Olinghouse & Leaird, 2009; Lu, 2012) has aimed at determining which lexical features are associated with the estimated quality of a text. For instance, Cuberos (2019) described the changes in the lexical repertoire of L1 and L2 Spanish speakers across different ages and L2 levels. The use of collocations not only proved to be a reliable correlate of text quality and contributed to explaining variation in the scores, but also made it possible to distinguish beginners from intermediate and advanced learners. These results indicated the relevance of collocations in assessing L2 proficiency.

Collocations, however, have not received the necessary attention in L2 curriculums or assessment materials (Higueras, 2017). In fact, as noticed by Paquot (2018), the CEFR adopts a very traditional understanding of phraseology, by referring mainly to stock pragmatic phrases, and failing to provide a flexible, practical definition of the term.

We start from a deliberately broad concept of collocation, based on both the distributional and the phraseological approaches (Granger & Paquot, 2008; Laso, 2009), thus combining linguistic and frequency criteria. What we present here is the use of an innovative tool developed and tested in the context of an empirical study of spontaneous native and non-native Spanish (Cuberos, 2019). This test for the identification of phraseological units (TIPUS) is a three-step procedure; for an identified unit to be considered a collocation, two out of the three following criteria must be met:

Step 1. Manual extraction of collocations. Researcher-based identification process; inter-rater reliability assessed on 20% of the texts.

Step 2. Presence of collocations in the Diccionario combinatorio práctico del español contemporáneo (Bosque, 2006).

Step 3. Strength of the association and frequency of cooccurrence measured using logDice (Rychlý, 2008).
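The logDice score used in Step 3 is defined as 14 + log2(2 f(x,y) / (f(x) + f(y))), where f(x,y) is the co-occurrence frequency of the two words and f(x), f(y) are their individual frequencies. The Python sketch below computes it and applies the two-out-of-three decision; the function names and the cut-off `min_log_dice=7.0` are assumptions for illustration, since the abstract does not say which threshold TIPUS applies.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice association score (Rychlý, 2008); the theoretical maximum is 14."""
    if f_xy <= 0:
        raise ValueError("logDice is undefined without co-occurrences")
    return 14.0 + math.log2(2.0 * f_xy / (f_x + f_y))


def is_collocation(manually_identified, in_dictionary,
                   f_xy, f_x, f_y, min_log_dice=7.0):
    """Accept a candidate if at least two of the three criteria hold.

    manually_identified: outcome of Step 1 (researcher judgement).
    in_dictionary: outcome of Step 2 (listed in the reference dictionary).
    min_log_dice: hypothetical cut-off for Step 3, not taken from the source.
    """
    strong_association = f_xy > 0 and log_dice(f_xy, f_x, f_y) >= min_log_dice
    criteria = [manually_identified, in_dictionary, strong_association]
    return sum(criteria) >= 2


if __name__ == "__main__":
    print(log_dice(10, 10, 10))
    print(is_collocation(True, False, f_xy=10, f_x=10, f_y=10))
    print(is_collocation(True, False, f_xy=0, f_x=10, f_y=10))
```

Because logDice depends only on the three frequencies and not on corpus size, scores remain comparable across corpora of different sizes.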

TIPUS allows for the extraction of collocations and aims to provide teachers and researchers with a CEFR-graded list of collocations in Spanish.

In this presentation we provide a comprehensive description of TIPUS and illustrate its use with examples from our L2 Spanish corpus. A model of the CEFR-graded list of collocations in Spanish is provided, as well as the report on inter-rater reliability testing. Finally, we discuss the implications of using TIPUS in corpus-based studies for capturing lexical development and assessing text quality.


Automatic Assessment of Spoken Language Proficiency Based on Three-Stage Learning

Kamel Nebhi, Bob Zhou, Farhad Nooralazadeh and Gyorgy Szaszak


This paper describes technology developed to automatically grade students on their spontaneous spoken English proficiency in terms of CEFR levels. The students’ spoken answers are first transcribed by an automatic speech recognition (ASR) system and then scored using a three-stage learning framework processing features extracted from the transcriptions.

This work argues that ASR features, handcrafted NLP features and transformer-based features are complementary for a valid Automatic Spoken Assessment system (Mayfield, 2020; Gretter et al., 2019).

The contributions in this paper are twofold: 1) a three-stage learning framework is proposed to improve automatic assessment of spoken language; 2) a novel model is proposed to learn from acoustic, lexical, semantic, syntactic and coherence features.

During the first stage, answers are transcribed using our ASR model and acoustic features are computed. In the second stage, we compute traditional NLP features. During the third stage, we calculate topic relevancy and coherence scores using a transformer architecture. Finally, these scores are concatenated and fed to a boosting tree model for training.

Traditional NLP features and Transformer-based Discourse Features

Traditional NLP features are used by the vast majority of existing Automated Essay Scoring (AES) systems.

We are using length-based metrics (number of sentences, words, syllables), readability metrics (Flesch-Kincaid, TTR, RTTR, etc.) (Zesch et al., 2015), lexical metrics (MTLD (McCarthy, 2005), lexical frequency, vocabulary complexity) and syntactic metrics (ngram perplexity, syntax depth, sentence acceptability).
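Two of the metrics named above have simple closed forms: the type-token ratio (TTR) is the number of distinct words divided by the number of words, and the root TTR (RTTR) divides the number of distinct words by the square root of the number of words. The Python sketch below assumes whitespace-tokenized, lowercased input; the tokenization and the function names are illustrative choices, not taken from the described system.

```python
import math


def type_token_ratio(tokens):
    """TTR: distinct tokens divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def root_type_token_ratio(tokens):
    """RTTR: distinct tokens divided by the square root of total tokens."""
    return len(set(tokens)) / math.sqrt(len(tokens)) if tokens else 0.0


if __name__ == "__main__":
    answer = "the cat saw the cat".lower().split()
    print(type_token_ratio(answer), root_type_token_ratio(answer))
```

RTTR is less sensitive to response length than plain TTR, which tends to fall as answers get longer.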

Discourse structure and coherence are important aspects of student answers and are often a part of grading rubrics. In this section we describe the Transformer-based Discourse Features that have been used to measure the topic relevancy and the coherence.

The topic relevance score estimates the degree of relevance of an answer to a topic. Yin et al. (2015) proposed a method for using pre-trained Natural Language Inference (NLI) models as ready-made zero-shot sequence classifiers. The method works by posing the sequence to be classified as the NLI premise and constructing a hypothesis from each candidate label. Finally, the model provides a similarity score, which we use as the topic relevance score.
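The premise/hypothesis construction can be sketched as follows. This is only a schematic: the hypothesis template, the function names and in particular `toy_entailment_prob` are assumptions introduced for illustration. The toy scorer merely counts word overlap so that the example runs without a model; in practice it would be replaced by a call to a pre-trained NLI model.

```python
from typing import Callable, Dict, List


def topic_relevance(
    answer: str,
    topics: List[str],
    entailment_prob: Callable[[str, str], float],
    template: str = "This text is about {}.",  # hypothetical template
) -> Dict[str, float]:
    """Score each candidate topic by how strongly the answer supports its hypothesis."""
    scores = {}
    for topic in topics:
        hypothesis = template.format(topic)  # one hypothesis per candidate label
        scores[topic] = entailment_prob(answer, hypothesis)  # the answer is the premise
    return scores


def toy_entailment_prob(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model: share of hypothesis words that occur in the premise."""
    premise_words = set(premise.lower().split())
    hypothesis_words = [w.strip(".").lower() for w in hypothesis.split()]
    return sum(w in premise_words for w in hypothesis_words) / len(hypothesis_words)


scores = topic_relevance(
    "My favourite food is pizza with cheese",
    ["food", "travel"],
    toy_entailment_prob,
)
```

Passing the scorer as a parameter keeps the label-to-hypothesis logic independent of whichever NLI model is used.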

Coherence modeling measures conceptual relations between different units within a response. Our approach measures overall coherence by calculating the semantic relatedness between adjacent sentences. We use the BERT pre-trained language model (Devlin et al., 2018) and fine-tune it using a fully connected perceptron layer. We leverage the Next Sentence Prediction objective of BERT and get a single representation for both sentences s1 and s2. To find the right order of the sentences we use topological sort (Prabhumoye et al., 2020). Finally, we compute a coherence score based on the number of sentence permutations.
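The ordering step can be illustrated with a small sketch. It assumes that a pairwise model has already produced a set of constraints of the form "sentence i should precede sentence j"; these constraints are sorted topologically with Python's standard `graphlib` module. The agreement score at the end (share of sentence pairs whose predicted order matches the original order) is an assumption chosen for illustration, since the exact permutation-based score is not specified here.

```python
from graphlib import TopologicalSorter
from itertools import combinations
from typing import List, Set, Tuple


def predicted_order(n: int, precedes: Set[Tuple[int, int]]) -> List[int]:
    """Order sentence indices 0..n-1 so that i comes before j for every (i, j)."""
    sorter = TopologicalSorter({i: set() for i in range(n)})
    for before, after in precedes:
        sorter.add(after, before)  # `after` depends on `before`
    # static_order() raises graphlib.CycleError if the constraints are cyclic.
    return list(sorter.static_order())


def order_agreement(order: List[int]) -> float:
    """Share of sentence pairs kept in their original relative order (assumed score)."""
    position = {sentence: pos for pos, sentence in enumerate(order)}
    pairs = list(combinations(range(len(order)), 2))
    if not pairs:
        return 1.0
    agreeing = sum(position[i] < position[j] for i, j in pairs)
    return agreeing / len(pairs)


# Hypothetical pairwise predictions for a three-sentence answer.
order = predicted_order(3, {(0, 1), (0, 2), (2, 1)})
score = order_agreement(order)
```

Here the model would reorder the answer as sentences 0, 2, 1, so two of the three sentence pairs keep their original relative order.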


The EF Standard English Test dataset is based on a standardized test of the English language designed for non-native English speakers. EFSET contains around 4000 student tests annotated by teachers.

We train an XGBoost tree model for regression using the softmax objective. Finally, we map the results to the corresponding CEFR level. To evaluate our system, we use Quadratic Weighted Kappa, as this is the main metric used for evaluation and cross-system comparison of essay scoring systems.

Our system obtains a Kappa score of 0.73 on the test set, which indicates substantial agreement.
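For reference, Quadratic Weighted Kappa can be computed as in the following minimal sketch. It assumes that CEFR levels are encoded as integers (for example A1 = 0 to C2 = 5); the function name and the encoding are choices made for this example, not details of the system described above.

```python
from typing import Sequence


def quadratic_weighted_kappa(gold: Sequence[int], pred: Sequence[int], n_levels: int) -> float:
    """Cohen's kappa with quadratic weights for ordinal labels 0..n_levels-1."""
    n = len(gold)
    # Observed confusion matrix: rows are gold levels, columns are predictions.
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    for g, p in zip(gold, pred):
        observed[g][p] += 1
    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            # The penalty grows with the squared distance between the two levels.
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            expected = gold_hist[i] * pred_hist[j] / n  # counts expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    # Undefined (division by zero) if all gold and predicted labels are one level.
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], 6)
reversed_kappa = quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], 3)
```

Because of the quadratic weights, predicting B1 for a C2 speaker is penalized far more than predicting C1.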


In this work we show that our three-stage learning framework using acoustic features, handcrafted NLP features and transformer-based discourse features provides good results on automatic spoken assessment. In future work we will try to improve the detection of adversarial samples and the quality of transcriptions.


Investigating collocations in English essays written by different L1 learners across the proficiency spectrum

Jen-Yu Li, Thomas Gaillat and Elisabeth Richard

Collocations, as a subset of phrasemes, are currently viewed as a necessary component of second language (L2) lexical competence, both for writing (Granger & Larsson, 2021) and for oral expression (Uchihara et al., 2021). Corpus-based phraseological analysis can help understand the development of L2 acquisition (Liu & Lu, 2020; Saito & Liu, 2021). Some researchers have reported that erroneous collocations are related to the learners’ L1 (Chang et al., 2008; Hong et al., 2011; Nesselhauf, 2003). Despite the obvious value of the English Profile project, which creates a set of vocabulary- and grammar-based descriptions of language competencies for English (Leńko-Szymańska, 2015; O’Keeffe, 2017), research about collocations with respect to learners’ language proficiency is still rare (Chen & Baker, 2016; Garner et al., 2020). It seems that erroneous collocation uses have not yet been studied in relation to CEFR levels. Likewise, correct collocation uses could benefit from a CEFR type of classification. Such a classification would provide valuable resources for language learning: mapping certain types of collocations to certain proficiency levels would help in the design of course material and syllabuses.

This research presents a preliminary study about the extraction of Verb-Noun collocations in the EF-Cambridge Open Language Database (EFCAMDAT) (Geertzen et al., 2013; Huang et al., 2018). The research question is: can a corpus of learner English be used to extract collocations per level? Answering this question would help in assessing proficiency levels in learners’ productions.

In this study, we present a method to extract collocations from the writings of learners with two L1s (Chinese and French) across six levels (CEFR proficiency levels A1 to C2), and we compare and discuss the results. A previously implemented collocation extraction algorithm is used (Author, 2020). The scripts are written in Python with an open-source toolkit, the Natural Language Toolkit (NLTK) (Bird & Loper, 2004). Source code is available online. In the first part, collocations are extracted by language level from the essays written by French learners. Performance was evaluated by manual verification of a random sample of 100 collocations. In the second part, collocation use is compared between French and Chinese learners. The two parts are complementary and can reflect different aspects of the second language acquisition process of speakers of different L1s learning English.
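The extraction algorithm itself is not detailed in this abstract. As a generic illustration of Verb-Noun collocation extraction, the following sketch pairs each verb with the first noun in a short window to its right and ranks the pairs by pointwise mutual information (PMI). The Penn Treebank-style tags, the window size, the frequency threshold and the use of PMI are all assumptions made for this example, and the input is taken to be already POS-tagged (for instance with NLTK).

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Tagged = List[Tuple[str, str]]  # one sentence as (word, POS tag) pairs


def verb_noun_pairs(sentence: Tagged, window: int = 3) -> List[Tuple[str, str]]:
    """Pair each verb with the first noun found within `window` tokens to its right."""
    pairs = []
    for i, (word, tag) in enumerate(sentence):
        if not tag.startswith("VB"):
            continue
        for next_word, next_tag in sentence[i + 1 : i + 1 + window]:
            if next_tag.startswith("NN"):
                pairs.append((word.lower(), next_word.lower()))
                break
    return pairs


def pmi_scores(sentences: List[Tagged], min_count: int = 2) -> Dict[Tuple[str, str], float]:
    """PMI of verb-noun pairs seen at least `min_count` times."""
    pair_counts = Counter(p for s in sentences for p in verb_noun_pairs(s))
    verb_counts: Counter = Counter()
    noun_counts: Counter = Counter()
    for (verb, noun), count in pair_counts.items():
        verb_counts[verb] += count
        noun_counts[noun] += count
    total = sum(pair_counts.values())
    # PMI = log2( P(verb, noun) / (P(verb) * P(noun)) ), estimated from pair counts.
    return {
        (verb, noun): math.log2(count * total / (verb_counts[verb] * noun_counts[noun]))
        for (verb, noun), count in pair_counts.items()
        if count >= min_count
    }


# Invented toy sentences, for illustration only.
corpus = [
    [("I", "PRP"), ("make", "VBP"), ("a", "DT"), ("decision", "NN")],
    [("We", "PRP"), ("make", "VBP"), ("a", "DT"), ("decision", "NN")],
    [("They", "PRP"), ("take", "VBP"), ("a", "DT"), ("bus", "NN")],
    [("They", "PRP"), ("make", "VBP"), ("the", "DT"), ("bus", "NN")],
]
scores = pmi_scores(corpus)
```

Running the same procedure separately on the essays of each CEFR level would yield one candidate list per level. The sketch does not lemmatize, so inflected forms are counted separately.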


Compiling CEFR-graded vocabulary lists for Russian L2 learners based on 4 different sources of word frequency data

Antonina Laposhina and Maria Lebedeva

In an effort to make the framework universal and flexible, the developers of the CEFR do not provide lexical lists prescribed to be learned at each level of L2 acquisition. However, it is clear that in order to achieve the communicative competencies described in the CEFR, an L2 learner has to develop certain language-specific lexical and grammatical skills. The vocabulary necessary for acquisition at each level is traditionally described by lexical lists, of which lexical minimums (LM) are a special type. Such lists are relied upon by language testing systems, textbook developers, and automated text level assessment systems.

Selecting or updating the vocabulary to be included in such lists is still a non-trivial task in applied linguistics. In line with the current trend toward a corpus-based approach, the common source of frequency data for vocabulary compilation is large corpora of texts (Brysbaert and New, 2009; Kilgarriff et al., 2014; Sharoff et al., 2013). This approach risks leaving behind vocabulary that is infrequent but valuable for L2 learners, which has been described as the problem of oranges and bananas (Kilgarriff, 2010) or the toothpaste problem (Volodina, 2018). The CEFRLex project offers an elegant solution to the problem of estimating the pedagogical value of a word by calculating its frequency from materials for L2 learners (François et al., 2014). This approach, however, risks falling into a vicious circle: textbooks are themselves built on prescribed LM.

According to modern principles of language curriculum design, the language content, including vocabulary, should be selected on the basis of representative samples of target discourse, or language use, surrounding accomplishment of target communicative tasks (Long 2018, p.3). This suggests that a specially collected corpus of the text types specified in the CEFR descriptors (e.g. ads, menus, personal correspondence etc.) is a valuable source of data.

Given the diversity of relevant frequency data sources, it is reasonable to develop a technique for combining heterogeneous data from various collections of texts.

The current project aims to compile vocabulary lists for Russian L2 learners according to CEFR levels based on 4 different sources: existing LM, a textbook corpus, a general corpus of Russian, and a specific collection of target discourse samples. In this paper, we demonstrate this approach using the topic “Home and interior”. To create a collection of target discourse, we crawled topic-specific texts, e.g. ads from real estate services, texts from topic-relevant media, and transcripts of YouTube room tours. The total volume of the corpus is 434 thousand words. The frequency list of this collection is the starting point for compiling a new list. The frequency of a candidate word in the corpus of RFL textbooks RuFoLa (Laposhina, 2020) and its presence in the existing minima (Andryshina NP et al., 2015) add “points” to its pedagogical value, and its frequency in RuTenTen 2017, a large corpus of texts collected from the Internet (Jakubíček et al., 2013), may indicate the general demand for a given word.
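One way such a combination of sources could look is sketched below. The scoring scheme (one point per source, fixed frequency thresholds) and all numbers are hypothetical and serve only to make the idea of “points” concrete; they are not the weighting used in the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordEvidence:
    lemma: str
    target_freq: float    # frequency per million in the target-discourse collection
    textbook_freq: float  # frequency per million in the textbook corpus
    in_minimum: bool      # listed in an existing lexical minimum
    general_freq: float   # frequency per million in the general web corpus


def pedagogical_points(
    word: WordEvidence,
    textbook_threshold: float = 10.0,  # hypothetical cut-off
    general_threshold: float = 10.0,   # hypothetical cut-off
) -> int:
    """Add one point for each additional source that supports the word."""
    points = 0
    if word.textbook_freq >= textbook_threshold:
        points += 1
    if word.in_minimum:
        points += 1
    if word.general_freq >= general_threshold:
        points += 1
    return points


def rank_candidates(words: List[WordEvidence]) -> List[WordEvidence]:
    """Sort by points first, then by frequency in the target-discourse collection."""
    return sorted(words, key=lambda w: (-pedagogical_points(w), -w.target_freq))


# Invented figures, for illustration only.
candidates = [
    WordEvidence("плинтус", target_freq=120.0, textbook_freq=0.0, in_minimum=False, general_freq=3.0),
    WordEvidence("диван", target_freq=95.0, textbook_freq=40.0, in_minimum=True, general_freq=25.0),
]
ranked = rank_candidates(candidates)
```

In this toy example the word with support from textbooks, the minimum and the general corpus is ranked above a word that is frequent only in the target-discourse collection.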


Learning to Classify Sentences into CEFR Levels for Second/Foreign Language Learning

Kuan-Lin Lee, Jason S. Chang, Shu-Li Lai and Wei-Chung Huang

Many requests are submitted to language learning services on the Web every day, and an increasing number of language learning services on the Web specifically target such requests. For example, Ludwig uses databases of newspaper stories, while online learners’ dictionaries such as the Cambridge English Dictionary rely on expert databases and lexicographers to compile example sentences for most headwords listed.

Example sentences play a crucial role in language learning as they show how a word or a phrase is normally used, for example, its lexical grammar, collocation, and what kind of context it tends to appear in. Corpus-based tools can often retrieve ample authentic example sentences; these sentences, however, are often difficult to understand for language learners. Sketch Engine for Language Learning (SKELL) was one of the earliest tools to address this issue, using a series of rules to ensure that the sentences provided are easy to read and easy to understand for language learners. SKELL was a new milestone in the development of corpus-based language learning tools. However, many dictionary-like example sentences may not fit into a rigid rule, while many generated sentences that fit the rules do not necessarily look like good examples in a dictionary. For instance, if a learner searched for “baton” using SKELL, s/he may find example sentences that contain proper nouns (e.g., Baton Rouge, which is a destination with true Southern hospitality) or sentences that are still too complicated (e.g., “It blends seemingly disparate elements by keeping everything in motion, each ingredient seamlessly passing the baton to its partner.”).

To solve this problem, we propose a new system, Lead by Example (LBE), a prototype example sentence grader that can automatically determine whether a sentence is dictionary-like. We used example sentences from the online Cambridge English Dictionary and Wikipedia Simple English in the initial training of the system. The method involves (1) automatically generating training data with labelled sentences, (2) automatically learning to classify sentences into good or bad examples, and (3) automatically classifying sentences into Common European Framework of Reference for Languages (CEFR) levels. At run time, sentences are converted to vector representation for classification, and good examples are retained and their CEFR levels are further predicted using a trained model. A blind evaluation of a separate dataset of sentences on the Web showed that the proposed method achieved a high accuracy rate in filtering and grading good dictionary-like example sentences.

Our methodology combines the best of both worlds: lexicographic resources and abundant raw online data. The example sentences generated by the system carry some implications for teaching English to speakers of other languages, including generating vocabulary exercises and tests in computer-assisted language learning systems.


Comparing the validity of human-annotated CEFR labels from different sources using machine-learning classifiers: examples from Russian

Robert Reynolds

The last decade has seen a rise in research on readability classification, primarily focused on English, but also including French, German, Italian, Portuguese, and Swedish (Roll et al., 2007; Vor der Brück et al., 2008; Aluisio et al., 2010; François and Watrin, 2011; Dell’Orletta et al., 2011; Hancke et al., 2012; Pilán et al., 2015; Reynolds, 2016). Virtually all studies of machine-learning approaches to automatic readability classification rely on training corpora consisting of texts that have been rated by humans, whether teachers, publishers, or students. However, even for humans, readability classification is a very difficult task. The fact that many popular word processors include automatic readability evaluation tools indicates that even native speakers are frequently unsure about how difficult their own writing is. Even though there is some reason to doubt the validity of human ratings in a gold-standard readability corpus, without superior alternatives, most readability researchers use human ratings without examining their validity.

In this study, we compare nine CEFR-labeled Russian corpora, each labeled by human annotators with varying degrees of training and expertise. Using a variety of corpora has one important benefit. Assuming that the criteria used to determine the readability of texts differ between corpora, the likelihood of overfitting can be reduced by training on a variety of corpora. On the other hand, it opens the possibility that each corpus’ readability ratings are not well aligned; one corpus’ B1 rating might be closer to another corpus’ B2 rating, etc. In this article we demonstrate methods to automatically compare the validity of these labels.

The first method used to investigate the validity of each corpus’ ratings is based on a comparison of labels output by models trained on other subsets of the corpora. We demonstrate that among our subcorpora, a model trained on one subcorpus in particular outputs labels almost an entire level above the gold-standard human labels of other subcorpora, on average. In addition, the models trained on the other subcorpora consistently label this subcorpus’ texts almost an entire level below its gold-standard human labels, on average. The combination of these results suggests that this subcorpus’ labels are consistently too high, but cannot be conclusive.

The second method used to compare the validity of CEFR labels, from humans or machine-learning models, is to compute a Spearman rank correlation coefficient, which assesses how well the relationship between the gold level and predicted level can be described using a monotonic function. A high correlation means that both sets preserve the same order of elements when they are sorted according to reading level. For example, a classifier that consistently predicts one level too high would have a perfect Spearman correlation of 1.0, even though its accuracy is 0.0.
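The example of a classifier that is always one level too high can be checked with a short sketch. Spearman's coefficient is computed here as the Pearson correlation of average ranks; the integer encoding of CEFR levels and the toy gold labels are assumptions made for this illustration.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined if either sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(gold: Sequence[float], pred: Sequence[float]) -> float:
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(gold), average_ranks(pred))


gold = [0, 1, 1, 2, 3, 4]         # hypothetical gold levels (A1 = 0, ...)
pred = [g + 1 for g in gold]      # a classifier that is always one level too high
rho = spearman(gold, pred)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

The shifted predictions preserve the ordering of the texts exactly, so the rank correlation is perfect while no single label matches.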

Our results highlight the need to develop more formal methods for establishing the validity of CEFR labels, at both the annotation stage and the machine-learning stage of research.


Comparing IRT-based Word Difficulty from a Vocabulary Test Data Set and the CEFR-J Vocabulary Profile for Assessing Readability of Scientific Texts

Yo Ehara

The CEFR-J Wordlist (CEFR-J Vocabulary Profile) is a dataset of word difficulty levels that have been manually annotated for English language teaching in Japan [5]. The dataset has been continuously updated; the latest version is dated Mar. 24, 2020. For each English word (with a part of speech), a difficulty level has been assigned using the CEFR scale from A1 to B2.

In contrast, word difficulty can be statistically calculated from a vocabulary test result dataset by using item response theory (IRT) [1]. These datasets record which test-taker answered which word questions correctly or incorrectly. Some of these datasets are publicly available [3]. The word difficulty parameter calculated by item response theory correlates well with the log of word frequency [2]. Using this relationship, it is possible to estimate the value of the difficulty parameter of words outside the vocabulary test. Such a technique has been applied in the past to reading comprehension support systems that translate words unknown to the learner [4].

Many English language learners study English to learn something other than English, such as science and technology, because an overwhelming number of articles and learning materials are available in English. In the field of science and technology, it is often the case that words that seem difficult as English words actually represent very basic concepts; for example, electrons are a type of subatomic particle studied in high school physics, but the word “electron” was judged to be B1 level, which is a difficult category. Is it possible that a unique difficulty level for each word, such as the CEFR-J, would be useful for assessing the readability of scientific and technical papers? With this research question in mind, this study compares the word difficulty parameters calculated by IRT with the word difficulty of the CEFR-J Wordlist.

To answer this research question, we used the dataset [3] to build a vocabulary-based text readability assessor based on IRT. Given a text to assess, the assessor computes, as a text difficulty score, the probability that the test-taker with the most average ability in the dataset [3] knows all the words appearing in the text.
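A minimal sketch of such an assessor is given below. It assumes a one-parameter (Rasch) IRT model and independence between words, neither of which is stated in this abstract; the difficulty values are invented, and the way the probability is turned into the reported score is not specified here.

```python
import math
from typing import Dict, Iterable


def p_known(ability: float, difficulty: float) -> float:
    """Rasch (1PL) probability that a test-taker with `ability` knows a word."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def p_all_known(words: Iterable[str], difficulty: Dict[str, float], ability: float) -> float:
    """Probability of knowing every distinct word, assuming words are independent."""
    log_p = 0.0
    for word in set(words):
        # Summing logs avoids numerical underflow on long texts.
        log_p += math.log(p_known(ability, difficulty[word]))
    return math.exp(log_p)


# Hypothetical difficulty parameters; real values would be estimated from test data.
difficulty = {"the": -3.0, "cell": -1.0, "electron": 1.0, "engraftment": 4.0}
average_ability = 0.0

easy = p_all_known(["the", "cell"], difficulty, average_ability)
hard = p_all_known(["the", "electron", "engraftment"], difficulty, average_ability)
```

Words that do not occur in the vocabulary test have no entry in `difficulty`; for these, the difficulty would first have to be estimated, for example from log word frequency as described above.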

To avoid biasing our analysis toward one particular field, we obtained abstracts from the databases of two different fields: we analyzed 55,410 abstracts taken from PubMed for medical and life sciences, and 27,686 abstracts taken from the ACL Anthology for natural language processing. Our preliminary results show that the average readability score was 18.45 for the ACL Anthology and 31.25 for PubMed, where a larger score indicates that the text is more difficult. Our vocabulary-based text readability assessor also identifies words that are particularly difficult for English-as-a-Second-Language (ESL) learners. For example, the following words were assessed to be particularly difficult for ESL learners in PubMed: hemihydrate, engraftment. In the ACL Anthology, they were: lexicosemantic, colingual.

In the presentation, we will also add a CEFR-J-based text readability assessor for comparison and report how it assesses the readability of scientific texts.