Corpora
Corpor@uclouvain | Some of the corpora compiled by members of our research institute are distributed on the Corpor@uclouvain catalogue. This catalogue contains learner corpora and corpora of various other types. |
Learner corpora around the world | The Centre for English Corpus Linguistics maintains a list of learner corpora with relevant metadata and information about their availability for research purposes |
L2 learner corpora resource family | The CLARIN infrastructure provides access to 74 L2 learner corpora |
The core metadata schema for learner corpora (LC-meta)
The Core Metadata Schema for Learner Corpora contains a list of metadata fields that can be used to describe learner corpus data.
One of the earliest efforts to address the need for metadata standardisation is Granger & Paquot (2017). This initiative was revived in 2022 in the form of a collaborative project between the Centre for English Corpus Linguistics (UCLouvain, Belgium), the Institute for Applied Linguistics at Eurac Research (Bolzano, Italy) and CLARIN ERIC. The following table provides the versioning history of the schema:
Core Metadata Schema for Learner Corpora, version 2 (LC-meta) | Paquot, M., König, A., Stemle, E. W., & Frey, J. (2024). Core Metadata Schema for Learner Corpora (version 2), https://doi.org/10.14428/DVN/AAUEM2, Open Data @ UCLouvain, UNF:6:D46/69S0DuhuxwMnT7rn9A== [fileUNF]. The second version of the schema is described in Paquot, M., König, A., Stemle, E. W., & Frey, J.-C. (forthcoming 2024). The Core Metadata Schema for Learner Corpora (LC-meta): Collaborative efforts to advance data discoverability, metadata quality and study comparability in L2 research. International Journal of Learner Corpus Research 10(2). |
Version 1 | Paquot, M., König, A., Stemle, E. & Frey, J.-C. (2023). Core Metadata Schema for Learner Corpora, https://doi.org/10.14428/DVN/4CDX3P, Open Data @ UCLouvain, V1, UNF:6:WhLZTg+knFe2FjjgxGg3Uw== [fileUNF] |
Draft version | Granger, S. & Paquot, M. (2017). Towards standardization of metadata for L2 corpora. Invited talk at the CLARIN workshop on Interoperability of Second Language Resources and Tools, 6-8 December 2017, University of Gothenburg, Sweden. |
Other resources and tools
Some of the resources and tools developed by members of our research institute can be used to compile, annotate and analyse learner corpora:
A database of English dependencies with measures of frequency, association, range and keyness | This database contains dependencies extracted from the Louvain Corpus of Research Articles (LOCRA). Each of the 982,906 lines of the TSV (tab-separated values) file gives frequency values from LOCRA (the target corpus) and ENCOW16 (the reference corpus), as well as various measures of association, range and keyness described later. |
Academic Keyword List | The Academic Keyword List contains 930 academic words that can be used to explore the lexical sophistication of L2 English learner language. |
CEFRLex | The CEFRLex project offers several lexical resources graded according to the Common European Framework of Reference for Languages (CEFR). |
FABRA | FABRA was first developed as a readability toolkit for French, based on the aggregation of a large number of readability predictor variables. In practice, the tool computes a large number of complexity measures typically used in L2 research. |
fsca | fsca is an open-source R package for the extraction of syntactic units from dependency-parsed French texts. |
Guide pratique de constitution de corpus | A set of guidelines (written in French) to help our students collect and document written and spoken corpora |
ICLE500 | The ICLE500 dataset contains 500 argumentative essays from the International Corpus of Learner English (Granger et al., 2020) together with basic metadata (see ICLE website for more info) and CEFR levels. The procedure to map the texts to the CEFR is carefully described in a technical report released with the dataset (Kanistra & Kollias, 2024). |
ICLE1300 | ICLE1300 provides basic text metadata and proficiency information in the form of comparative judgement (CJ) scores for 1300 argumentative texts from the International Corpus of Learner English (ICLE, Granger et al., 2020). |
Recto-Verso | This software automatically applies the 1990 French spelling reform to a text. |
Resyf | French lexical resource with synonyms graded according to their level of difficulty |
TreeTagger | Web interface that facilitates the use of TreeTagger, a part-of-speech tagger developed at the Institute for Computational Linguistics at the University of Stuttgart. |
UCLouvain Error Editor (UCLEEv2) | Software meant to facilitate the insertion of error tags and corrections into learner texts, as well as their subsequent processing |
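Because the database of English dependencies listed above is distributed as a TSV file, it can be read line by line without loading all 982,906 lines into memory. The following is only a minimal sketch: the column names (`dependency`, `freq_locra`, `freq_encow16`) and the sample values are hypothetical placeholders, not the actual header or contents of the file, and should be replaced with the names found in the real resource.

```python
import csv
import io

# Hypothetical sample: the real file's column names and values may differ.
SAMPLE_TSV = (
    "dependency\tfreq_locra\tfreq_encow16\n"
    "play_role\t120\t4500\n"
    "draw_conclusion\t85\t900\n"
    "rare_pair\t2\t10\n"
)


def read_rows(handle, min_target_freq=0):
    """Yield rows whose target-corpus frequency is at least min_target_freq."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # csv returns strings, so the frequency is converted before comparing.
        if int(row["freq_locra"]) >= min_target_freq:
            yield row


# An in-memory sample stands in for the real file here.
rows = list(read_rows(io.StringIO(SAMPLE_TSV), min_target_freq=50))
for row in rows:
    print(row["dependency"], row["freq_locra"], row["freq_encow16"])
```

With the actual file, the `io.StringIO` object would be replaced by a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the TSV has been saved.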
Publications
CECL papers | The CECL Papers aim to make available to the academic community a series of articles, books and technical papers related to activities (conferences, corpus collection, corpus annotation, etc.) led by the CECL. Several of these publications focus on L2 research (e.g. The Louvain Error Tagging Manual). |
Learner Corpus Bibliography | The Learner Corpus Bibliography (LCB) is a collection of c. 2000 references related to Learner Corpus Research. The LCB was created and maintained by the CECL for many years. In 2013, the CECL agreed to share the LCB with the Learner Corpus Association, which currently maintains it in the form of a Zotero-based collection available to all its members. |
The International Journal of Learner Corpus Research | The International Journal of Learner Corpus Research (IJLCR) is a forum for researchers who collect, annotate, and analyse computer learner corpora and/or use them to investigate topics in Second Language Acquisition and linguistic theory in general, inform foreign language teaching, develop learner-corpus-informed tools (e.g. courseware, proficiency tests, dictionaries and grammars) or conduct natural language processing tasks (e.g. annotation, automatic spell- and grammar-checking, L1 identification). |