Learner corpus research


The CECL is renowned for compiling learner corpora, some of which are still in progress. These include:

  • The International Corpus of Learner English (ICLE) was initiated by the CECL in the late 1980s. It includes argumentative writing by intermediate to advanced EFL learners. The second edition of ICLE (Granger et al. 2009) covers sixteen mother tongue populations, compared with eleven in the first edition (2002).
  • The Louvain International Database of Spoken English Interlanguage (LINDSEI) followed suit and was compiled on the basis of spoken interviews of intermediate to advanced EFL learners from fourteen mother tongue backgrounds. The first version was released in 2010 (Gilquin, De Cock & Granger).
  • The Longitudinal Database of Learner English (LONGDALE) is an ongoing project (initiated in January 2008), the aim of which is to meet the need for truly longitudinal data in learner corpus research. Although compilation has started for learner argumentative writing, the database aims to include a variety of data types from essays to summaries, picture descriptions and oral interviews. The data are gathered from the same learners who are followed over a period of two or three years so as to be able to control for evolution in proficiency level. The EFL learners are representative of a wide range of mother tongue backgrounds and usually start off at an intermediate level of proficiency.
  • The Varieties of English for Specific Purposes Database (VESPA) is another ongoing project (also initiated in January 2008) whose aim is to develop a large corpus of texts written for specific purposes by students of L2 English. The corpus targets a range of disciplines (e.g. linguistics, law, medicine, biology) and genres (e.g. reports, papers, MA dissertations) from students at different proficiency levels (from learners in their first year at university to PhD students).
  • The French Interlanguage Database (FRIDA) corpus includes texts written by intermediate to advanced learners of French as a Foreign Language.
  • The Multilingual Student Translation (MUST) corpus, launched in March 2016, aims to build a bridge between Learner Corpus Research and Corpus-Based Translation Studies. It will contain translations produced by intermediate and/or advanced foreign language learners and trainee translators. 

When collecting corpus data, the CECL always takes care to record a series of important variables that are known to influence learner production. Information on each variable is obtained via learner questionnaires and subsequently encoded in the learner corpora, where the variables can be used as search criteria. Researchers can thus compile a customised learner corpus by selecting variables such as the learners’ L1, the assigned topic, the task setting (i.e. whether the task was timed or untimed and whether reference tools were allowed), the time spent in an English-speaking country, or the learning context (i.e. the amount of exposure to English in the learners’ native countries). They can then compare, for instance, the number and type of errors in timed and untimed essays, or the use of academic vocabulary by learners from different mother tongue backgrounds.
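The selection of a customised sub-corpus by learner variables can be pictured with a minimal Python sketch. It does not reproduce the CECL's actual search interface or file format: the `LearnerText` record, its field names, the `select` helper and the sample texts are all hypothetical and serve only to illustrate filtering on metadata.

```python
from dataclasses import dataclass


@dataclass
class LearnerText:
    """Hypothetical record: one learner text plus questionnaire metadata."""
    text_id: str
    l1: str                 # learner's mother tongue
    timed: bool             # was the task timed?
    reference_tools: bool   # were dictionaries etc. allowed?
    months_abroad: int      # time spent in an English-speaking country
    text: str


def select(corpus, **criteria):
    """Return the texts whose metadata match every given criterion."""
    return [
        t for t in corpus
        if all(getattr(t, field) == value for field, value in criteria.items())
    ]


# Invented sample data, for illustration only.
corpus = [
    LearnerText("FR001", "French", True, False, 0, "People is often saying that"),
    LearnerText("FR002", "French", False, True, 6, "It is often claimed that"),
    LearnerText("DE001", "German", True, False, 3, "In my opinion it is clear"),
]

timed_essays = select(corpus, timed=True)
untimed_essays = select(corpus, timed=False)
french_timed = select(corpus, l1="French", timed=True)

print([t.text_id for t in timed_essays])
print([t.text_id for t in french_timed])
```

Each keyword argument passed to `select` plays the role of one search criterion, so combining several variables (here L1 and task setting) narrows the sub-corpus in the same way as combining questionnaire variables.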

Going hand in hand with the aforementioned learner corpora are two learner-corpus-based methodologies that are widely used by the CECL, viz. contrastive interlanguage analysis (CIA) (Granger 1996, 2015) and computer-aided error analysis (CEA) (Dagneaux et al. 1998) (see learner corpus bibliography). CIA involves comparing either learner data with native or expert data (L2 vs L1) or different types of learner data (L2 vs L2), while CEA aims to carry out a detailed analysis of the authentic errors found in learner corpora. These two methodologies have brought to light a number of findings concerning three learner phenomena, namely overuse and underuse, i.e. elements which learners use significantly more or significantly less than their native speaker counterparts, as well as misuse, i.e. learners’ authentic errors. CEA is largely dependent upon prior annotation of the data in the form of error tagging. The CECL has developed its own error tagging system for both L2 English and L2 French. The English 'error toolkit' contains a comprehensive error tagging manual (Dagneaux et al. 2008) which explains each of the 50-plus error tags, and the Université catholique de Louvain Error Editor (UCLEE) software which helps with the insertion of the error tags and the corrections in the data.
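The overuse/underuse comparison at the heart of CIA can be illustrated with a short Python sketch. The text above does not say which significance test the CECL applies; the sketch assumes the log-likelihood (G2) statistic, which is widely used in corpus linguistics for comparing frequencies across two corpora, with the conventional critical value of 3.84 (p < 0.05, one degree of freedom). The function names and the frequencies in the usage lines are invented for illustration.

```python
import math


def log_likelihood(freq_learner, size_learner, freq_ref, size_ref):
    """G2 statistic for one item's frequency in two corpora.

    freq_* are raw counts of the item; size_* are corpus sizes in tokens.
    """
    combined = freq_learner + freq_ref
    total = size_learner + size_ref
    # Expected counts if the item were equally frequent in both corpora.
    exp_learner = size_learner * combined / total
    exp_ref = size_ref * combined / total
    g2 = 0.0
    for observed, expected in ((freq_learner, exp_learner), (freq_ref, exp_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def classify(freq_learner, size_learner, freq_ref, size_ref, critical=3.84):
    """Label an item as overused, underused, or not significantly different.

    The default critical value is an assumption (p < 0.05, 1 d.f.).
    """
    g2 = log_likelihood(freq_learner, size_learner, freq_ref, size_ref)
    if g2 < critical:
        return "no significant difference"
    learner_rate = freq_learner / size_learner
    ref_rate = freq_ref / size_ref
    return "overuse" if learner_rate > ref_rate else "underuse"


# Invented counts: an item occurring 300 times in 100,000 learner tokens
# versus 100 times in 100,000 native-speaker tokens.
print(classify(300, 100_000, 100, 100_000))
print(classify(100, 100_000, 300, 100_000))
print(classify(100, 100_000, 105, 100_000))
```

The same function covers both CIA comparison types: the reference corpus can be native or expert data (L2 vs L1) or a second learner corpus (L2 vs L2).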

Learner corpora, especially if they are error-tagged, are a highly valuable resource for teaching, lexicographical or testing purposes. The get-it-right boxes in the Macmillan English Dictionary for Advanced Learners (MED2), which were developed by the CECL on the basis of an error-tagged version of ICLE, as well as the warnings included in the Louvain English for Academic Purposes Dictionary (LEAD), are testimony to this.

Learner-corpus events:

The CECL organized the first learner corpus symposium in Louvain-la-Neuve in 1995 and co-organized a second symposium with the Chinese University of Hong Kong in Hong Kong in 1998. With a view to helping researchers analyze learner corpus data, the CECL organized four summer schools (2004, 2006, 2007, 2014) which brought together both senior and junior researchers from a wide range of countries. In 2008 the CECL organized a colloquium entirely devoted to the collection and analysis of spoken learner corpora, and in September 2011 the CECL hosted an international conference entitled “20 years of learner corpus research: looking back, moving ahead” to mark the 20th anniversary of its creation (Granger, Gilquin & Meunier, 2013).