Learner corpora around the world

CECL

This list is still a work in progress. We would like it to be as comprehensive as possible. If you have a learner corpus or know of one that is not listed on this webpage, send a message to Magali Paquot and we will add it to the list. We hope you will find the list useful for your research!

The list below only contains learner corpora, i.e. electronic collections of continuous written or spoken data produced by foreign or second language learners.

For a list of learner corpus-based datasets (treebanks, error lists, etc.), click here.

To refer to this list:

Centre for English Corpus Linguistics (date of access): Learner Corpora around the World. Louvain-la-Neuve: Université catholique de Louvain. https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html

 

© 2019, Université catholique de Louvain

Learner corpora

Use the query box below to search for specific keywords (e.g. languages, task type, medium).

Corpus

Target
language

First
language

Medium

Text type /
task type

Proficiency level

Size
in words

Project director

Availability

The Arabic Learner Corpus (ALC)

Arabic

66 languages

Written and spoken

Narrative and discussion

Intermediate and advanced

Written: c. 283,000

Audio: c. 3h30

Abdullah Alfaifi
Al Imam University, Saudi Arabia

Eric Atwell
University of Leeds, UK

Available

The Pilot Arabic Learner Corpus 

Arabic

English

Written

Narrative

Intermediate and advanced

c. 9,000

Ghazi Abuhakema
Reem Faraj
Anna Feldman
Eileen Fitzpatrick
Montclair State University, USA

/

The Jinan Chinese Learner Corpus (JCLC) 

Chinese

50 languages

Written

Exams and assignments

Beginners, intermediate and advanced

c. 6 m. Chinese characters

c. 9,000 texts

Maolin Wang
Jinan University, China

Shervin Malmasi
Macquarie University, Australia

Mingxuan Huang
Guangxi University of Finance and Economics, China

Free download upon contact with researchers

Croatian Learner Text Corpus (CroLTeC) 

Croatian

36 languages

Written

Exam essays, argumentative and literary essays, letters, diaries, picture descriptions, book reviews, short dialogues, etc.

A1-C2

c. 1 million

Nives Mikelić Preradović
University of Zagreb, Croatia 

Freely available

The AKCES/CZESL (Acquisition corpora of Czech/Czech as a second language) corpus 

Czech

Various

Written and spoken

Student essays and interviews

Various

2 m.

Karel Šebesta
Charles University/Technical University of Liberec, Czech Republic

Available

Leerdercorpus Nederlands als Vreemde Taal

Dutch

French

Written

 

 

 

Liesbeth Degand
Université catholique de Louvain, Belgium

 

Arab Learner English Corpus (ALEC)

English

Arabic

Written

Essays written by freshman students as part of first level college writing course

University students (second language learners)

Analysis: 184,749
Narrative: 67,527
Synthesis: 66,015
Argumentation: 192,298

Inas Mahfouz
American University of Kuwait, Kuwait

Available upon request for research purposes

The Aachen Corpus of Academic Writing (ACAW)

English

German

Written

Academic research writing

Advanced

c. 240,000 words

c. 225,000 words (L1 component)

Elma Kerz
RWTH Aachen University, Germany

Under development

The Advanced Learner English Corpus (ALEC)

English

Mainly Swedish

Written

Essays written by university students of English linguistics and English literature

Advanced

c. 1,3 m.

Tove Larsson
Uppsala University, Sweden
Université catholique de Louvain, Belgium

Not freely available

The ANGLISH corpus

English

French

Spoken

Readings of texts and sentences, spontaneous oral language

Various

c. 5h30

Anne Tortel
University of Provence, France

Freely available 

Asao Kojiro’s Learner Corpus Data

English

Japanese

Written

Essays and stories written or reproduced by Japanese college students

 

 

Asao Kojiro
 Ritsumeikan University, Japan

Texts available for download

The Barcelona English Language Corpus (BELC)

English

Spanish
Catalan

Spoken and written

4 tasks: written composition, oral narrative, oral interview, role-play

Longitudinal data (children and young adults learning English)

 Various

 

Carmen Muñoz
University of Barcelona, Spain

Available

The BATMAT Corpus

English

Swedish
Finnish

Written

BA dissertations
MA dissertations

Advanced

c. 2,5 m.

Tuija Virtanen-Ulfhielm
Åbo Akademi University, Finland

Not publicly available

Belarusian Learner Corpus of English (BELLCE)

English

Russian; Belarusian

Written

Argumentative essays

High intermediate to advanced

unknown

Anastasia Rakhuba

 

The Bilingual Corpus of Chinese English Learners (BICCEL)

English

Chinese

Spoken and written

Spoken: National Oral English test

Written: in-class assignments

 

c. 2 m.

Wen Qiufang
Beijing Foreign Studies University, China

 

The Brazilian Spoken Corpus of English Learners (BraSCEL)

English

Portuguese

Spoken

Informal interview + thought-provoking picture discussion

A1-C2 benchmarked to the CEFR

Under development

Mateus Miranda
Mary Immaculate College/University of Limerick, Ireland

The corpus (transcriptions of audio files) will be available to the scientific community upon request.

The British Academic Written English (BAWE) corpus

English

Mainly L1 speakers

Also includes data produced by L2 speakers

Written

ESP papers 

4 levels of study (from undergraduate levels to final year and taught masters level)

 

c. 6,5 m.

Hilary Nesi
Sheena Gardner
Coventry University, UK

Paul Thompson
University of Birmingham, UK

Paul Wickens
Oxford Brookes, UK

The BAWE corpus can be accessed through the corpus analysis interface, Sketch Engine.

The BUiD Arab Learner Corpus (BALC) 

English

Arabic

Written

School examination essays

Various

c. 290,000

Mick Randall
The British University in Dubai, United Arab Emirates

Nicholas Groom
University of Birmingham, UK

At present, copies of the current version of the corpus are available on request from mick.randall@buid.ac.ae

The Cambridge Learner Corpus (CLC)

English

Various

Written

Exam scripts

Various

c. 50 m.

Cambridge University Press and Cambridge ESOL
Cambridge University, UK

Commercial

The Corpus of Academic Learner English (CALE)

English

German

Written

Various academic text types that are typically produced in university courses of English (e.g. term papers, reading reports, research plans, abstract, reviews, and summaries)

Advanced

under development

Marcus Callies
University of Bremen, Germany

 

The Corpus of English Essays Written by Asian University Students (CEEAUS)

English

Various

Written

Student essays

Various

c. 200,000

Shin Ishikawa
Kobe University, Japan

 

The Chinese Academic Written English corpus
(CAWE)

English

Chinese

Written

Dissertations written by Chinese undergraduates majoring in English linguistics or applied linguistics

 

c. 400,000

David Yong Wey Lee
City University of Hong Kong, Hong Kong

 

The Chinese Learner English Corpus (CLEC) 

English

Chinese

Written

 

Various

c. 1 m.

Gui Shichun
Guangdong University of Foreign Studies

Yang Huizhong
Shanghai Jiao Tong University, China

The corpus can only be accessed by users in the Department of English at HKPU

The City University Corpus of Academic Spoken English (CUCASE)

English

Chinese

Also includes data produced by L1 speakers

Multimedia

 

 

c. 2 m.

David Yong Wey Lee
City University of Hong Kong, Hong Kong

 

The Cologne-Hanover Advanced Learner Corpus (CHALC) 

English

German

Written

Term papers and essays

Advanced

c. 210,000

Ute Römer
University of Michigan, USA

 

The College Learners’ Spoken English Corpus (COLSEC)

English

Chinese

Spoken

National spoken English test for non-English majors

 

c. 700,000

Yang Huizhong
Shanghai Jiao Tong University, China

Wei Naixing
Beihang University, China

The Corpus Archive of Learner English in Sabah/Sarawak (CALES) 

English

Malay

Written

Argumentative essays

Various

c. 400,000

Simon Botley @ Faizal Hakim
Doreen Dillah
Universiti Teknologi MARA Sarawak, Malaysia

 

Corpus Oral de Português como Língua Adicional-Brasil (CoPLA-BR)/Oral Corpus of Brazilian Portuguese as an Additional Language

Portuguese

Various

Spoken

Informal interview + thought-provoking picture discussion

Basic
Intermediate
Advanced

Under development

Mateus Miranda
Mary Immaculate College/University of Limerick, Ireland

The corpus (transcriptions of audio files) will be available to the scientific community upon request.

Corpus Escrito de Aprendices de Inglés como Lengua Extranjera en Ecuador (COREAILE)

English

Spanish (Ecuadorian)

Written

Narrative

Beginners and intermediate

44,352 (210 texts)

Miguel A. Macías Loor
Universidad Técnica de Manabí

Available upon contact with researcher (mamacias@utm.edu.ec)

CORpus del ESPañol de los Italianos (CORESPI)

Spanish

Italian

Written

Written compositions

A1 to B2

c. 125,000

Sonia Bailini
Università Cattolica del Sacro Cuore, Italy

Online access

CORpus del ITaliano de los Españoles (CORITE)

Italian

Spanish

Written

Written compositions

A1 to B2

c. 103,000

Sonia Bailini
Università Cattolica del Sacro Cuore, Milan, Italy

Online access

The Corpus of Business Letters

English

Italian

Written

Tagged part: BEC1 writing tests (letters, emails, faxes, memos, reports)

Untagged part: business writing exam tests

 

c. 32,000

Anna Romagnuolo
University of La Tuscia, Italy

 

The Corpus of Multilingual Opinion Essays by College Students (MOECS)

English

varied

Written

Opinion essays

College students

unknown

Megumi Okugiri
University of the Sacred Heart, Japan

Available

Corpus of writing, pronunciation, reading, and listening by learners of English as a Foreign Language 

English

Japanese

Written and spoken

Varied

Beginners to advanced

29h audio + 30,000 words

Katsunori Kotani
Kansai Gaidai University, Japan

Takehiko Yoshimi
Hiroaki Nanjo
Ryukoku University, Japan

Hitoshi Isahara
Toyohashi University of Technology, Japan

 

Corpus of Written Spanish, L2 and Heritage Speakers (COWS-L2H)

Spanish

English, Mandarin, Other

Written

Personal essays

Beginner, intermediate, advanced, and heritage

1,138,097

Claudia H. Sánchez Gutiérrez (chsanchez@ucdavis.edu)

Available on GitHub (https://github.com/ucdaviscl/cowsl2h)

The Corpus of Young Learner Interlanguage (CYLIL)

English

Dutch
French
Greek
Italian

Spoken

English L2 data elicited from European School pupils. Longitudinal data

Various

c. 500,000

Alex Housen
Vrije Universiteit Brussel, Belgium

 

Corpus and Repository of Writing (Crow)

English

24 languages; Predominantly Chinese and Arabic

Written

Analysis, Narrative, Literature Review, Argument, Empathy Writing, Proposal, Reflection

High intermediate/advanced (TOEFL overall score 80-105); international undergraduate students in first-year writing classes

9 million words (in March 2020)

Shelley Staples and Bradley Dilger

Open access after registration

The DiSKo (Deutsch im Studium: Lernerkorpus/German at the University: Learner Corpus)

German

Various

Written

Standardized writing task from university admission language test (TestDaF), approx. 400 tokens per text

B1-C2

Longitudinal, under development; targeted word number ~ 180,000 

Katrin Wisniewski
University of Leipzig, Germany

Will be freely available online under the ANNIS architecture

The Eastern European English learner corpus 

English

Russian
Ukrainian
Polish
Slovak

Spoken

Spontaneous spoken production data elicited by means of a semi-structured interview

Various

c. 60,000

Elena Salakhian
Eberhard Karls University of Tübingen, Germany

 

The EFL Teacher Corpus (ETC) 

English

Korean

Spoken

Teacher talks in language classrooms

Upper-intermediate to advanced

c. 123,000

Ye-eun Kwon
Eun-Joo Lee

Under development

The English of Malaysian School Students corpus (EMAS)

English

Malay

Written

Student essays and oral interviews

various

c. 500,000

Arshad Abd. Samad
Universiti Putra Malaysia, Malaysia

 

The English Speech Corpus of Chinese Learners (ESCCL) 

English

Chinese

Spoken

Dialogue reading-aloud

Middle school and college

 

Chen Hua
Nantong University, China

Wen Qiufang
Beijing Foreign Studies University, China

Li Aijun
Chinese Academy of Social Sciences, China

 

The ETS Corpus of Non-Native Written English

English

11 languages

Written

12,100 TOEFL English essays

/

 

Daniel Blanchard

Information about the score level is available for each essay

Samples are available

The Europarl corpus of Native, Non-native and Translated Texts
(ENNTT)
 

English

24 EU languages

Written

Proceedings of the European Parliament

Advanced

NNS: c. 780,000

NS: c. 3 m.

Translated: c. 22 m.

Sergiu Nisioi
University of Bucharest, Romania

Available

English Students’ Oral Corpus in Chile (ESOC-Chile)

English

Spanish

Spoken

Student Interviews

B1 - B2 - C1

73,631

Chinger Zapata
Universidad Católica del Norte, Chile

The corpus (audio files or plain transcriptions of audio files in .txt format) will be available to the scientific community upon request to czapata@ucn.cl

The EVA Corpus of Norwegian School English 

English

Norwegian

Spoken

Picture-based tasks

 /

c. 35,000

Angela Hasselgren
University of Bergen, Norway

 

The FUSE (The Finnish Upper Secondary School Corpus of Spoken English)

English

Finnish (possibly other L1s too, information not collected)

Spoken

Role-tasks or mind-map tasks as part of a low-stakes, course examination in Finnish upper secondary/high schools

CEFR: A2-C1

N/A

 Lasse Ehrnrooth
University of Helsinki, Finland

Online access

The Gachon Learner Corpus

English

Korean (+ a few Chinese & Spanish speaking students) 

Written

Written Journal Assignments

Lower intermediate

c. 2,5 m.

Brian Carlstrom
Gachon University, South Korea

Freely available

The Gesprochene Wissenschaftssprache kontrastiv/Multilingual corpus of spoken academic language (GeWiss)

German

English, Polish, Bulgarian & diverse other L1 languages

Spoken

Academic papers, student presentations and academic oral examinations in German philology / Applied Linguistics / Language pedagogy as well as in Polish, English, and Italian philology

B2, C1

1.4 m.

Christian Fandrych
Leipzig University, Germany

Freely available upon registration: https://gewiss.uni-leipzig.de/index.php?id=home&L=1

The GICLE corpus (German component of ICLE)

English

German

Written

Mainly non-academic argumentative essays

Advanced

c. 234,000

 

 

The Giessen-Long Beach Chaplin Corpus (GLBCC)

English

German

Spoken

Transcribed interactions between native English speakers, ESL and EFL speakers

Various

c. 350,000

Andreas Jucker
Sara Smith
University of Giessen, Germany

Restricted use: apply for approval to get a copy.

The Hong Kong University of Science & Technology (HKUST) learner corpus

English

Chinese - mostly Cantonese

Written

Untimed assignments written for EFL courses and school leaving exams

University and advanced high school students

c. 25 m.

John Milton
Hong Kong University of Science & Technology, Hong Kong

 

The Indianapolis Business Learner Corpus (IBLC)

English

Various

Written

Job application letters and résumés of business communication students from the U.S., Belgium, Finland, Germany, and Thailand, spanning the years 1990-1998

 

 

Ulla Connor
Kristen Precht
Thomas Albin Upton
Indiana University, USA

 

The International Corpus of Crosslinguistic Interlanguage (ICCI)

English

Various

Written

Essays (20-min in-class tasks without the use of a dictionary)

 Beginner to lower-intermediate

9,000 essays

Yukio Tono
Tokyo University of Foreign Studies, Japan

Freely available

The International Corpus Network of Asian Learners of English (ICNALE)

English

Chinese
Indonesian
Japanese
Korean
Malay
etc.

Written and spoken

Controlled speeches and essays

L1 productions by 350 NS

Various

c. 1,8 m.

Shin'ichiro Ishikawa
Kobe University, Japan

Freely available

The International Corpus of Learner English (ICLE)

English

Various

Written

Argumentative and literary essays

High-intermediate to advanced

c. 3 m.

Sylviane Granger
Centre for English Corpus Linguistics
Université catholique de Louvain, Belgium

CD-Rom + handbook: order online

The International Teaching Assistants corpus (ITAcorp)

English

Various

Spoken

Learner language from a variety of spoken classroom tasks: office hours role plays, presentations, discussions

 

c. 500,000

Steven L. Thorne
Paula Golombek
Jonathon Reinhardt
Pennsylvania State University, USA

 

The « Interphonology of Contemporary English » corpus

English

French
Italian
Chinese
Spanish

Spoken

Reading aloud, repeating words, guided interviews, interactions between two learners

Various

Under development

Nadine Herry-Bénit (Université Paris Nanterre)
Stéphanie Lopez (Northwestern Polytechnical University)
Jeff Tennant (University of Western Ontario)

Under development; samples available

The Iranian Corpus of Learner English

English

Farsi

Written

Expository essays

University students (English majors)

436,035

Parviz Maftoon
Parviz Birjandi
Hossein Khazaee
Islamic Azad University, Iran

CD-ROM, data gathered for PhD dissertation by Hossein Khazaee; this corpus is the intellectual property of the Science and Research Branch, Islamic Azad University, Tehran, Iran

The ISLE speech corpus 

English

German
Italian

Spoken

Recorded sentences from several blocks of differing types (reading simple sentences, using minimal pairs, giving answers to multiple choice questions)

Intermediate

 c. 18h

ecisle@nats.informatik.uni-hamburg.de
University of Hamburg, Germany

CD-Rom

The Israeli Learner Corpus of Written English

English

Hebrew

Written

Argumentative and descriptive essays

 

c. 750,000

Tina Waldman
Kibbutzim College of Education, Israel

 

The Janus Pannonius University (JPU) Corpus

English

Hungarian

Written

Essays and research papers

University students

c. 500,000

József Horváth
University of Pécs, Hungary

Searchable online

The Japanese English as a Foreign Language Learner (JEFLL) Corpus

English

Japanese

Written

Student essays

From beginning to intermediate

c. 700,000

Yukio Tono
Meikai University, Japan

jefll.inquiry@corpuscobo.net

The JEFLL Corpus will be freely available for research, first via the web query system (already available in Japanese); the entire dataset will later be distributed under license

Lancaster Corpus of Academic Written English (LANCAWE)

English

various

Written

IELTS academic writing tests (descriptive and argumentative tasks); assignments.
Longitudinal data.

 

 

 

 

The Lang-8 Learner Corpora

English

Various

Written

Texts from Lang-8, a social networking site for language learning

/

/

Toshikazu Tajiri
Mamoru Komachi
Nara Institute of Science and Technology, Japan

Available here

The LeaP (Learning Prosody in a Foreign Language) Corpus 

English

German

Spoken

Four types of speech styles were recorded: nonsense word lists, readings of a short story, retellings of the story, free speech in an interview situation

Various

 c. 12h

Ulrike Gut
Albert-Ludwigs-University Freiburg, Germany

The annotated corpus is available to the scientific community. Please contact Ulrike Gut at the University of Augsburg

The Learner Corpus of Engineering Abstracts (LCEA)

English

Malaysian

Written

Abstracts of the Computer and Communication Systems Engineering Final Year Projects

Various

c. 550,000

998 abstracts

Helen Tan
Ain Nadzimah Abdullah
Syamsiah bt Mashohor
Universiti Putra Malaysia, Malaysia

Chan Swee Heng
Taylor's University, Malaysia

Available. Contact: Helen Tan

The Learner Corpus of English for Business Communication

English

Chinese

Written

Different types of business correspondence written for simulated business situations, including memos, faxes, reports, letters of enquiry and complaint letters

 

c. 117,500

Li Lan
Hong Kong Polytechnic University, Hong Kong

Searchable online

The Learner Corpus of Essays and Reports

English

 Chinese

Written

Essays and project reports covering a range of topics from Science, IT and New Media to Nursing, Business and Economics, and the Social Sciences

 

c. 188,000

Sima Sengupta
Hong Kong Polytechnic University, Hong Kong

 

Searchable online

A Learners' Corpus of Reading Texts

English

French

Spoken

Unprepared reading of English texts (the texts are short abstracts of fiction or made-up dialogues)

 University students

 

Sophie Herment 
Valérie Kerfelec
Laetitia Leonarduzzi
Gabor Turcsan

Freely available

The LONGDALE (LONGitudinal DAtabase of Learner English) project

English

Various

Spoken and written

Range of text types/task types. Longitudinal data

From intermediate to advanced

 

Fanny Meunier
Université catholique de Louvain, Belgium

Under development

The Longman Learners' Corpus

English

Various

Written

Essays and exam scripts

Various

c. 10 m.

Longman

Commercial

The Louvain International Database of Spoken English Interlanguage (LINDSEI)

English

Various

Spoken

Interviews and picture descriptions

High-intermediate to advanced

c. 800,000

Gaëtanelle Gilquin
Centre for English Corpus Linguistics
Université catholique de Louvain, Belgium

CD-Rom and handbook: order online

Multilingual Academic Corpus of Assignments - Writing and Speech (MACAWS)

Portuguese and Russian

15 languages; Predominantly English and Spanish

Written and spoken

Classroom assignments and exams organized by Macrogenre (e.g., Analysis, Description, Evaluation, Exposition, Narration) and Topic (e.g., Art, Culture, Literature, Family, Food, Future Plans, Trip)

Beginner, intermediate and advanced

212,064 (in March 2020)

Shelley Staples, Aleksey Novikov, Adriana Picoral, Bruna Sommer-Farias

Open access after registration

The Malaysian Corpus of Learner English (MACLE)

English

Malay

Written

 

 

 

Gerry Knowles
Zuraidah Mohd. Don
University of Malaya, Malaysia

/

The Malaysian Corpus of Students' Argumentative Writing (MCSAW)

English

Malay
Chinese
Indian

Written

Argumentative essays

Form 4
Form 5
College

c. 565,500

Seyed Ali Rezvani Kalajahi
Jayakaran Mukundan
Universiti Putra Malaysia, Malaysia

Available from developers

The Michigan Corpus of Academic Spoken English (MICASE)

English

Mainly L1 speakers but also includes data produced by L2 speakers

Spoken

Transcripts of academic speech events

 

c. 1,8 m.

Ute Römer
Georgia State University, USA

micase@umich.edu

Searchable online

The Michigan Corpus of Upper-level Student Papers (MICUSP)

English

Semi-balanced sample of native and non-native speakers of English

Written

ESP papers
A-grade papers or ungraded papers that have been assessed and accepted (such as research proposals), but not published

 

c. 2,6 m.

Ute Römer
Georgia State University, USA

micusp@umich.edu

Searchable online

The Montclair Electronic Language Database (MELD) 

English

Various

Written

Student essays

Various

c. 100,000

Eileen Fitzpatrick
Milton S. Seegmiller
Montclair State University, USA

Contact Eileen Fitzpatrick

Includes error annotations

The Multimedia Adult ESL Learner Corpus (MAELC)

English

ESL environment

Multimedia

Video of classroom interaction and associated written materials

Beginner to upper-intermediate

 

Stephen Reder
Kathryn Harris
Kristen Setzler
Portland State University, USA

labschool@pdx.edu

The Lab School would like to share the extensive resources from MAELC with interested researchers and teacher trainers. Those interested should make inquiries to the Lab School by e-mail

The Neungyule Interlanguage Corpus of Korean Learners of English (NICKLE)

English

Korean

Spoken and written

Written part: student essays
Spoken part: transcriptions of student interviews and oral speech tests

Mainly from beginning to intermediate 

Written:
c. 890,000

Spoken:
c. 100,000

 Ji-Myoung Choi
Yonsei University, Seoul, Korea

The corpus will be available to the scientific community for research purposes upon request

The Japanese Learner English Corpus (NICT JLE) 

English

Japanese

Spoken

English oral proficiency interview test

various

2 m.

Emi Izumi
Kiyotaka Uchimoto
Hitoshi Isahara
National Institute of Information and Communications Technology, Japan

Freely available (downloadable)

The NOn-native Spanish corpus of English (NOSE) 

English

Spanish

Written

Argumentative and descriptive student essays

Intermediate and upper-intermediate

c. 300,000 words

 Ana Diaz-Negrillo
Universidad de Granada, Spain

 

The NUS Corpus of Learner English (NUCLE) 

English

Several East Asian languages, predominantly Chinese

Written

Student essays on a wide range of topics including environmental pollution, healthcare, etc.  

various

c. 1 m.

Hwee Tou Ng
Siew Mei Wu
Daniel Dahlmeier
National University of Singapore, Singapore

Freely available

The PELCRA Learner English Corpus (PLEC)

English

Polish

Spoken and written

Written: Argumentative, descriptive, narrative and quasi-academic essays; formal letters

From beginning to post-advanced

Under development

Aim spoken:
c. 200,000

Aim written:
c. 2,8 m.

Piotr Pęzik
Barbara Lewandowska-Tomaszczyk
University of Lodz, Poland

Online search engine and corpus analysis tools

The PICLE corpus (Polish component of ICLE)

English

Polish

Written

Student essays

Advanced

c. 330,000

Przemyslaw Kaszubski
Adam Mickiewicz University, Poland

Searchable online

The Qatar learner corpus

English

Arabic (mostly from Qatar)

Spoken

Spoken interviews with Qatari learners of English

 

 

Yun Zhao Helen
Carnegie Mellon University, USA

Freely available

The Québec learner corpus 

English

French (from Québec)

Written

Argumentative essays

Intermediate and advanced

c. 250,000

Tom Cobb
Université du Québec à Montréal, Canada

 /

The Romanian Corpus of Learner English (RoCLE)

English

Romanian

Written

Student essays

 

 

Chitez Madalina
Zurich University, Switzerland

 

Russian Error-Annotated English Learner Corpus

English

Russian

Written

Examination essays of the kind similar to IELTS Task 1 and Task 2, with errors annotated manually

Intermediate to Advanced

c. 800,000 by November 2017 and growing (together with the old part of the corpus, which is less consistently annotated or not annotated, available at http://realec.org/index.xhtml#/ - c. 2,000,000)

Olga Vinogradova
National Research University Higher School of Economics

Freely available

The Russian Learner Translator Corpus (RusLTC)

English
Russian

Russian

Written

Translations produced by trainee translators

Trainee translators

c. 1.5 m. tokens

Andrey Kutuzov
University of Oslo, Norway

Maria Kunilovskaya
Tyumen State University, Russia

Freely available

The Santiago University Learner of English Corpus (SULEC)

English

Spanish

Spoken and written

Written: compositions or argumentative essays

Spoken: semi-structured interviews, short oral presentations and brief story descriptions

Various

Aim: c. 1 m. words

Ignacio M. Palacios Martínez
University of Santiago de Compostela

Available after registration

The Scientext English Learner Corpus

English

French

Written

Academic argumentative texts

 

 c. 1.1 m.

scientext@u-grenoble3.fr

Searchable online

Second Language Research Tasks (SLRT)

English

Various

Written and spoken

Written paragraphs
Various oral tasks

Various

c. 300,000

Bill Crawford
Northern Arizona University, USA

Kim McDonough
Concordia University, Canada

Under development

The Seoul National University Korean-speaking English Learner Corpus (SKELC)

English

Korean

Written

Student essays

Various

c. 900,000

Heokseung Kwon
Seoul National University, Korea

 /

The SILS Learner Corpus of English

English

Various (mainly Japanese)

Written

Student essays

Basic, intermediate and advanced

 c. 3.2 m.

(first and second drafts included)

Victoria Muehleisen
Waseda University, Japan

 

The Soochow Colber Student Corpus (SCSC)

English

Chinese

Written

Student essays

 

c. 227,000

Colman Bernath
Soochow University, Taiwan

 

The Spoken and Written English Corpus of Chinese Learners (SWECCL)

English

Chinese

Written (WECCL) and spoken (SECCL)

Written: argumentative and narrative essays
Spoken: National Spoken English Test – longitudinal data

 

c. 2 m.

Wen Qiufang
Liang Maocheng
Wang Lifei
Beijing Foreign Studies University

CD-Rom

The Taiwanese Corpus of Learner English (TLCE)

English

Chinese

Written

Journals and essays (descriptive, narrative, expository, argumentative)

Intermediate to advanced

c. 2 m.

Rebecca Hsue-Huch Shih
Sun Yat-sen University, Taiwan

 

The Taiwanese learner academic writing corpus (TaiwanLAWC)

English

Chinese

Written

Theses and dissertations written by Taiwanese graduate students.

 

 

Howard Chen
National Taiwan Normal University, Taiwan

 

The TELEC Secondary Learner Corpus (TSLC) 

English

Chinese

Written and spoken

Compositions from secondary classroom

 

c. 2 m.

Quentin Allan
University of Hong Kong, Hong Kong

 

The Telecollaborative Learner Corpus of English and German (Telekorp)

English

German

Written

Bilingual, longitudinal database comprising computer-mediated NS-NNS interactions between approximately 200 Americans and Germans collected during six different telecollaborative partnerships from 2000-2005

 

c. 1,5 m.

Julie Belz
Pennsylvania State University, USA

Not publicly available

The Ten-Thousand English Compositions of Chinese Learners (TECCL)

English

Chinese

Written

Essays (various topics) written in and after class, and in testing context. Also contains some collaborative writing samples

Various (mainly undergraduates)

c. 1,8 m.

Jiajin Xu
Beijing Foreign Studies University, China

Raw texts and part-of-speech tagged texts are available

Tracking Written Learner Language (TRAWL)

Multilingual (English, French, German, Spanish)

Norwegian

Written

Texts written as part of regular class work (tests, in-school writing, homework)

Longitudinal corpus (beginners/advanced)

 

Hildegunn Dirdal
University of Oslo, Norway

 

The Tswana Learner English Corpus (TLEC)

English

Tswana

Written

Argumentative essays

Advanced

c. 200,000

Bertus Van Rooy
North-West University, South Africa

Available in ICLE

The Undergraduate Learner Translator Corpus (ULTC)

Bidirectional;

English-Arabic or

French-Arabic

Bidirectional; English-Arabic or French-Arabic

 

Arabic is the native language of the learners and the main target language

Written and spoken

Translations produced by learners of translation from and into Arabic and a reference subcorpus of published translations

From beginners to advanced levels

Under development

Reem Alfuraih
Princess Nora bint Abdul Rahman University, Saudi Arabia

Available via https://arabicparallelultc.com/

The Uppsala Student English Corpus (USE)

English

Swedish

Written

Student essays

Various

c. 1,200,000

Ylva Berglund Prytz
Margareta Westergren Axelsson
Uppsala University, Sweden

The corpus can be used for research and educational purposes. It can be accessed on the Internet from the Oxford Text Archive

The Uppsala WordReference Corpus

English, Spanish, French, Italian

Various

Written

Forum posts

 

 

English learner subcorpus: 38 m.
English native subcorpus: 50 m.
Spanish learner subcorpus: 5 m.
Spanish native subcorpus: 22 m.
French learner subcorpus: 4 m.
French native subcorpus: 7 m.
Italian learner subcorpus: 1 m.
Italian native subcorpus: 3 m.

Aleksandrs Berdicevskis
Uppsala University

Freely available 

The UPF Learner Translation Corpus

English

Catalan

Written

Translations written by the students of the Translation and Interpreting degree at UPF

 

 c. 200,000

Anna Espunya
Pompeu Fabra University, Spain 

 

The UPV Learner Corpus

English

Catalan

Written

Essays

Various

c. 150,000

Universitat Politècnica de València, Spain

 

The Varieties of English for Specific Purposes dAtabase learner corpus (VESPA)

English

Various

Written

ESP texts (term papers, reports, MA dissertations)

Various

c. 220,000 (under development)

Magali Paquot
Centre for English Corpus Linguistics
Université catholique de Louvain, Belgium

 

The Written Corpus of Learner English (WriCLE)

English

Spanish

Written

Essays

Various

c. 750,000

Paul Rollinson
Universidad Autónoma de Madrid, Spain

The corpus is available for free, and can be downloaded from this website. There is also a search interface to retrieve sentences and clauses

The Yonsei English Learner Corpus (YELC)

English

Korean

Written

Yonsei University English Diagnostic Tests (Part 1: Descriptive task, max. 100 words; Part 2: Argumentative task, max. 300 words)

9 levels
(A1, A1+, A2, B1, B1+, B2, B2+, C1, C2)

c. 1 m.

Seok-Chae Rhee
CK Jung
Yonsei University, Korea

The YELC corpus will be available to the scientific community for research purposes from 31 March 2012

The Young Learner Corpus of English (YOLECORE)

English

Greek

Spoken

Pedagogic Corpus of video-recorded EFL language classes

 

170 school hours (126 hours of videotaped material)

1.5 m. types

Marina Mattheoudakis
Aristotle University of Thessaloniki, Greece

Thomas Zapounidis

 

The Estonian Interlanguage Corpus of Tallinn University (EIC)

Estonian

Russian
Finnish
English
German
Latvian
Lithuanian
Ukrainian
Belorussian

Written

Spontaneously produced texts in language learning situations: argumentative and literary essays, written stories, letters, term papers, reading reports.

A1-C2

c. 1 m.

Pille Eslon
Tallinn University, Estonia

Restricted online access

Linguistic Basis of the Common European Framework for L2 English and L2 Finnish (CEFLING)

Finnish
English

Various

Written

Various

Various

 

Maisa Martin
University of Jyväskylä, Finland

 Download the corpus data set here

Paths in Second Language Acquisition (TOPLING)

Finnish
English
Swedish

Various

Written

Various

Various

 

Maisa Martin
University of Jyväskylä, Finland

Available (see here for instructions on how to access the corpora) 

The Advanced Finnish Learner Corpus (LAS2)

Finnish

 Russian
Czech
Swedish
Estonian
Lithuanian
Komi
English
Hungarian
German
Icelandic
Japanese

Written

Exam essays, theses, essays and writings

Advanced

c. 630,000

Kirsti Siitonen
Ilmari Ivaska
University of Turku, Finland

 

The Finnish National Foreign Language Certificate Corpus (YKI)

Finnish

English
Finnish
French
German
Italian
Lappish (Sami)
Spanish
Swedish
Russian

Written and spoken

Various

Beginner, intermediate and advanced

 

Ari Maijanen
Tiina Lammervo
University of Jyväskylä, Finland

Available with user ID and Password

The International Corpus of Learner Finnish (ICLFI)

Finnish

Various

Written

Finnish learners’ spontaneously produced texts in language learning situations, large variety of text types

Beginner, intermediate and advanced

Under development

Jarmo Harri Jantunen
University of Oulu, Finland

Free download after applying for a user licence

The Chy-FLE (Cypriot Learner Corpus of French)

French

Modern Greek
(and Cypriot Greek)

Written

Argumentative and descriptive essays

From intermediate to advanced

c. 250,000 (under development)

Freiderikos Valetopoulos
Université de Poitiers, France
(In collaboration with the University of Cyprus)

 

The COREIL corpus 

French
English

 

Spoken

 

 

 

Elisabeth Delais-Roussarie
Hiyon Yoo
Université Paris-Diderot, France

 

The "Dire Autrement" corpus

French (Second Language)

Mainly L1 speakers of English

Written

Narrative, injunctive, persuasive and informative texts

 

c. 50,000

Marie-Josée Hamel
Jasmina Milicevic
Dalhousie University, Canada

Available after registration

French Interlanguage Database (FRIDA)

French

Various

Written

Free compositions: descriptive, argumentative and narrative texts, news & mail

 Intermediate

 

Sylviane Granger
Centre for English Corpus Linguistics
Université catholique de Louvain, Belgium

 

French Learner Language Oral Corpora (FLLOC)

French

Various

Spoken

See description of the 7 corpora

Various

 

Florence Myles
Newcastle University, UK

Rosamund Mitchell
University of Southampton, UK

The contents of the database are being made freely available to the research community, in the form of digital sound files and related transcripts formatted using CHILDES software.

Searchable online

The InterFra corpus

French

Swedish

Spoken

Interviews, retellings of video clips and picture stories

Various

 

Inge Bartning
Stockholm University, Sweden

interfra@fraita.su.se

Available

The "Interphonologie du Français Contemporain" corpus (IPFC)

French

Cypriot Greek
Dutch
English (Canada)
German
Japanese
Norwegian
Spanish
 

Spoken

Reading aloud, repeating words, guided interviews, interactions between two learners

Various

Under development

Sylvain Detey
Waseda University, Japan
Université de Rouen, France

Isabelle Racine
Université de Genève, Switzerland

Yuji Kawaguchi
Tokyo University of Foreign Studies, Japan

Under development; samples available

The Learner Corpus French (LCF)

French

Dutch

Written

Argumentative essays
Informative texts 
Journalistic texts
Formal letters
Summaries

Written compositions by Flemish students of French

Intermediate to advanced

c. 500,000

Hans Paulussen
KU Leuven/UGent/Lessius, Belgium

Under development

The Lund CEFLE Corpus (Corpus Écrit de Français Langue Étrangère)

French

Swedish

Written

Descriptive and narrative essays; picture-based stories

Various

c. 100,000

Malin Ågren
Lund University, Sweden

A sub-part of the corpus is available online.

The University of the West Indies learner corpus (UWi)

French

English

Jamaican Creole

Spoken

Conversations during oral exams and in informal contexts

Various

 

Hugues Peters
University of New South Wales, Australia

Corpus is available freely here (last updated 2017)

Comasan Labhairt ann an Gàidhlig (CLAG)

Gaelic Adult Proficiency (GAP)

Gaelic

Various

Spoken

Conversation task
Narrative
Elicited oral imitation task
Question and answer activity

Various

 

Roibeard Ó Maolalaigh
Nicola Carty
University of Glasgow, UK

 

The AleSKO corpus

German

Chinese

Also German L1 data from the FALKO corpus

Written

Argumentative essays

 

 c. 13,600

Heike Zinsmeister
University of Konstanz, Germany

Margrit Breckle
Vilnius Pedagogical University, Lithuania

 

Analyzing Discourse Strategies: A Computer Learner Corpus

 

German

English
(mainly American English)

Written

Threaded Discussion
Chat
Essays
Longitudinal data

From beginner to intermediate-mid

Under development

Christina Frei
Edward Nixon
University of Pennsylvania, USA

 

The Corpus of Learner German (CLEG13) 

German

English

Written

Argumentative, free compositions
Longitudinal over 4 years, undergraduate students

Intermediate to advanced

c. 320,000

 

Ursula Maden-Weinberger

 

 

Online access through the FALKO platform.
The corpus is also available as txt files to the scientific community. Please contact U. Maden-Weinberger at uschi@miralis.co.uk

The deL1L2IM corpus

German

Russian-Belorussian bilinguals

Written

Instant messaging dialogues

Advanced

c. 52,000

Sviatlana Höhn
University of Luxembourg, Luxembourg

Available

The Fehlerannotiertes Lernerkorpus (‘error annotated learner corpus’) (FALKO)

German

Learner subcorpus: various

Native subcorpus: German

Written

1. Summaries
2. Essays
3. Letters, fiction writing, journal articles, book reviews (= longitudinal data from American learners)

1. Advanced
2. Advanced
3. Beginners - advanced

 

1. c. 40,000 (learner subcorpus) + c. 20,000 (native subcorpus)

2. c. 150,000 (learner corpus) + c. 70,000 (native subcorpus)

3. c. 78,000 (learner subcorpus)

Anke Lüdeling
Maik Walter
Humboldt-Universität zu Berlin, Germany

falko-korpus@hu-berlin.de

Online access

The KOLIPSI corpus

German

Italian

Written

Two written language production tasks of a standardized test (email/letter)

A2-C1

Under development

Andrea Abel
Aivars Glaznieks
European Academy of Bolzano/Bozen, Italy

 

The Learning the Prosody of a Foreign Language (LeaP)

German

Various

Spoken

The LeaP corpus covers four different types of speech: read speech, prepared speech, free speech, nonsense word lists

Various

 62 speakers

Ulrike Gut
University of Augsburg, Germany

The annotated corpus is available to the scientific community. Please contact Ulrike Gut at the University of Augsburg

Manual

The LeKo (Lernerkorpus) corpus

German

 

 

 

 

c. 55,000

Anke Lüdeling
Humboldt-Universität zu Berlin, Germany

Online access (password protected)

Register here

The LINCS Corpus

1. German
2. German
3. German

1. English
2. German

1. Written
2. Written
3. Written

1. Essays, examination answers (longitudinal and cross-sectional data)
2. Essays
3. Teaching output

1. Intermediate to Advanced
2. Advanced

Under development

Elizabeth Thoday
Heriot-Watt University Edinburgh, UK

Not currently publicly available

Multilingual Platform for the European Reference Levels: Exploring Interlanguage in Context (MERLIN)

German

Italian

Czech

Various

Written

Writing tasks from standardized tests (telc/UJOP)

A1 to C1

c. 280,000

Katrin Wisniewski
Leipzig University, Germany

Available

Rhodes University Deutsch als Fremdsprache (RUDaF)

German

 English, Afrikaans, isiXhosa, XiTsonga

Written

Short descriptive and argumentative writing paragraphs (300 words each)

A2-B2

34,000

Gwyndolen Ortner
Undine S. Weber
Rhodes University, South Africa

Not available

The Telecollaborative Learner Corpus of English and German (Telekorp)

German

English

Written

Bilingual, longitudinal database comprising computer-mediated NS-NNS interactions between approximately 200 Americans and Germans collected during six different telecollaborative partnerships from 2000-2005

 

c. 1.5 m.

Julie Belz
Pennsylvania State University, USA

 

Not publicly available

The Langman corpus

Hungarian

Chinese

Spoken

Interviews conducted in 1994 with 11 Chinese immigrants living in Hungary.
Interviews focused on issues related to their arrival in Hungary as well as their daily life activities

 

 

Juliet Langman
University of Texas at San Antonio, USA

Freely available

Corpus di Apprendenti di Italiano L2 (CAIL2)

Italian

Various

Written

Essays

Intermediate to advanced

c. 237,000

Stefania Spina
Università per Stranieri di Perugia, Italy

Searchable via CQPweb

Corpus parlato di italiano L2

Italian

English
German
Japanese

Spoken

Transcriptions of interviews

Various

 

Stefania Spina
Silvio Pazzaglia
Mirco Perini
Università per Stranieri di Perugia, Italy

Searchable online

The KOLIPSI corpus

Italian

German

Written

Two written language production tasks of a standardized test (email/letter)

A2-C1

Under development

Andrea Abel
European Academy of Bolzano/Bozen, Italy

 

The Lexicon of Spoken Italian by Foreigners (LIPS)

Italian

Various

Spoken

Proficiency exams of the Certification of Italian as a Foreign Language (CILS)

A1-C2

c. 700,000

Francesca Gallina
Università per Stranieri di Siena, Italy

Freely available

MISTiC (Multiple Italian Student TranslatIon Corpus)

Italian

English, French

Written

Translations produced by trainee translators (mainly specialised texts)

Post-graduate trainee translators

c. 125,000 (English-Italian), c. 50,000 (French-Italian)

Sara Castagnoli
University of Bologna, Italy

Not available

Varietà di Apprendimento della Lingua Italiana: Corpus Online (VALICO)

Italian

Various

Written

 

Various

c. 570,000

Manuel Barbera
Carla Marello
Elisa Corino
University of Turin, Italy

Freely available and searchable online.

Longitudinal Corpus of Chinese Learners of Italian (LOCCLI)

Italian

Chinese

Written

Essays

Beginners and pre-intermediate

97,000 

Stefania Spina
Università per Stranieri di Perugia, Italy

Anna Siyanova-Chanturia
Victoria University of Wellington, New Zealand

It is freely searchable via CQPweb (registration required) from https://www.unistrapg.it/cqpweb/

Corpus of Chinese Learners of Italian (COLI)

Italian

Chinese

Written and spoken

Essays and answers to open questions, interviews

Intermediate and advanced

82,300

 

Stefania Spina
Università per Stranieri di Perugia, Italy

The COLI is freely searchable via CQPweb (registration required) from https://www.unistrapg.it/cqpweb/

The Korean learner corpus

Korean

Various

Written

Various: letters, essays, formal writing, etc.

Beginner and intermediate

c. 10,000

Seok Bae Jang
Georgetown University, USA

Sun Hee Lee
Wellesley College, USA

Sang kyu Seo
Yonsei University, South Korea

 

ESAM

Latvian and Lithuanian

Latvian and Lithuanian

Written

 

Beginner

52,000

Inga Znotiņa
Rīga Stradiņš University, Latvia

Available online 

The ASK corpus

Norwegian

German
Dutch
English
Spanish
Russian
Polish
Bosnian-Croatian-Serbian
Albanian
Vietnamese
Somali

Written

Essays from language tests

 B1 and B2

 

Kari Tenfjord
University of Bergen, Norway

Apply for a licence here

The Persian Learner Corpus (PLC)

Persian (Farsi)

Various

Written

Narratives and essays

Intermediate and advanced

Academic/Restricted online access

Saeed Safari
University of Belgrade, Serbia

Academic/Restricted online access

The Salam Farsi Learner Corpus (SFLC)

Persian (Farsi)

Serbian

Written

Narratives, descriptive essays

Beginner and upper-intermediate

Under development

Saeed Safari
University of Belgrade, Serbia

Academic, under development

Learner Corpus of Portuguese L2 (COPLE2)

Portuguese

15 languages: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian and Swedish

Written and spoken

Exams and assignments

A1-C1

Written: 171,461
Oral: 25,783

Iria del Río
Universidade de Lisboa, Portugal

Available

Russian Learner Corpus

Russian

Various

Written and spoken

Academic and non-academic

Teachers and heritage speakers

Unknown

Ekaterina Rakhilina
National Research University Higher School of Economics, Russia

Available online

The PIKUST pilot learner corpus 

Slovene

Various

Written

Mostly argumentative essays

Majority advanced – but also intermediate and beginner

c. 35,000

Mojca Stritar
University of Ljubljana, Slovenia

 

The Anglia Polytechnic University (APU) Learner Spanish Corpus

Spanish

Various

Written

 

 

c. 120,000

Anne Ife
Anglia Ruskin University, UK

 

Aprescrilov ("Aprender a Escribir en Lovaina")

Spanish

Dutch

Written

Written assignments and tests; several text types (letters, expository, descriptive, argumentative, narrative)

A1 to C1

c. 1 m.

Kris Buyse
KU Leuven, Belgium

Restricted online access

The Corpus de aprendices de español (CAES)

Spanish

Various

Written

 

A1 to C1

c. 575,000

CAES team
Universidade de Santiago de Compostela, Spain

Online access

Corpus Escrito del Español L2 (CEDEL2 version 1.0)

Spanish

English, Greek

Written

Written compositions by learners of Spanish

All proficiency levels (lower beginner to upper advanced)

802,019 words coming from 2,578 participants

Cristobal Lozano
Universidad de Granada, Spain

 Downloadable/browsable via the CEDEL2 webpage: http://cedel2.learnercorpora.com/

Corpus de textos escritos para el análisis de errores de aprendices de E/LE (CORANE)

Spanish

Various

Written

Essays

A2 to C1

/

Ana M. Cestero Mancera 
Inmaculada Penadés Martínez
Universidad de Alcalá Henares

CD-ROM available

The Corpus of Taiwanese Learners of Spanish (Corpus de Aprendices Taiwaneses de Español) (CATE)

Spanish

Chinese

Written

Student essays

Various

c. 340,000

hclu@mail.ncku.edu.tw

Under development

The DIAZ corpus

Spanish

German
Swedish
Icelandic
Korean
Chinese

Spoken

Semi-spontaneous (structured interviews) and experimental (structured questionnaires) Adult Spanish L2/L3 oral data

Various

 

Lourdes Diaz Rodriguez
Universitat Pompeu Fabra, Spain

Freely available

The Japanese learner corpus of Spanish

Spanish

Japanese

Written

Student essays

 

c. 83,400

Yoshihito Kamakura
University of Birmingham, UK

 

The Spanish Corpus Proficiency Level Training (SPT)

Spanish

English (heritage language learners)

Spoken

Dialogues about a given set of questions

Beginner to advanced

 

Dale Koike
Austin Liberal Arts Instructional Technology Center/University of Texas, USA

Videos are available

Spanish Learner Language Oral Corpus (SPLLOC)

Spanish

English

Spoken

Learner narratives, interviews and picture description tasks

Beginner to advanced

c. 50,000

Laura Dominguez
University of Southampton, UK

Searchable online
Data freely available for download

Spanish Learner Oral Corpus

Spanish

Various
(9+ languages - especially Portuguese, French, Italian)

Spoken

Semi-spontaneous interviews, narrative and descriptive tasks

A2-B1

c. 50,000 words

Leonardo Campillos Llanos
Universidad Autónoma de Madrid, Spain

Online access

The Tartu Learner Corpus of Spanish as an L3+

Spanish

Estonian

Written

Academic research writing

Advanced

c. 885,000

Mari Kruse
University of Tartu, Estonia

 

The ASU corpus

Swedish

 Chinese
English
German
Greek
Polish
Portuguese
Spanish
...

Spoken and written

Transcribed audio-recorded conversations and written texts from adult learners of Swedish – longitudinal data

 

c. 490,000 words
(c. 415,000 spoken and c. 75,000 written)

Björn Hammarberg
Stockholm University, Sweden

Available

Leiden Learner Corpus

Multilingual (Dutch, French, Italian, Portuguese and Spanish)

Various

Written and spoken

Written data: short essays; oral data: picture-based story telling

Various

200 participants

M. Carmen Parafita Couto
University of Leiden, Netherlands

 

The European Science Foundation Second Language Database (ESF database)

Multilingual:

Dutch
English
French
German
Swedish

Punjabi
Italian
Turkish
Arabic
Spanish
Finnish

Spoken

Spontaneous second language acquisition of forty adult immigrant workers living in Western Europe, and their communication with native speakers in the respective host countries

Various

 

Wolfgang Klein
Clive Perdue
Max Planck Institut, Netherlands

Freely available

The Foreign Language Examination Corpus (FLEC)

Multilingual

Polish

Written

Data from the Warsaw University
Certification Exams

Various

Under development

Piotr Banski
Romuald Gozdawa-Golebiowski
Warsaw University, Poland

 

The MeLLANGE Learner Translator Corpus (LTC)

Multilingual

Various

Written

Legal, technical, administrative and journalistic texts

Trainee translators

 

Natalie Kübler
Université Paris Diderot, France

mellange_p7@eila.univ-paris-diderot.fr

Searchable online

The MiLC Corpus

Multilingual:

Catalan
English
French
Spanish

Catalan

Written

Formal and informal letters, summaries, curriculum vitae, essays, reports, translations, synchronous and asynchronous communication exchanges, business letters

 

c. 150,000

Angeles Andreu Andrés
Universidad Politécnica de Valencia, Spain

 

The Multilingual Learner Corpus (MLC)

Multilingual:

English
German
Italian
Spanish

Brazilian Portuguese

Written

Argumentative and narrative essays

 

 Aim: c. 200,000

Stella E. O. Tagnin
University of São Paulo, Brazil

Accessible online to registered researchers

The Padova Learner Corpus

Multilingual:

English
French
Spanish

Italian

CMC
(Computer-Mediated Communication)

Student work produced in blended language courses using FirstClass conferencing software.
Variety of genres: diaries, debate contributions, formal reports, résumés, etc. 
Longitudinal data

 

 

Under development

Fiona Dalziel
Francesca Helm
University of Padua, Italy

 

The corpus PARallèle Oral en Langue Etrangère (PAROLE)

 

Multilingual:

English
French
Italian

(Mainly L2 speakers but also includes data produced by L1 speakers)

Various

Spoken

5 oral production tasks

Various

 

Heather Hilton
John Osborne
Marie-Jo Derive
Nejma Succo
Jean O'Donnell
Sandra Billard
Sandrine Rutigliano-Daspet
Université de Savoie, France

Manual

The University of Toronto Romance Phonetics Database (RPD)

Multilingual:

English
French
Italian
Portuguese
Romanian
Spanish

Various
(including English, Mandarin, Russian, Spanish, etc.)

Spoken

Elicited production: sentence and passage reading, story narration, description of favourite meal

Various

 

Laura Colantoni
Jeffrey Steele
University of Toronto, Canada

Password available from directors

  

Learner corpus-based datasets

 

  

Corpus Target language First language Medium Text type / task type Proficiency level Size in words Project director Availability

The Treebank of Learner English (TLE)

 English

Various

Written

 Sentences from the CLC FCE (annotated with syntactic trees)

 Upper-intermediate

 97,681
(5,124 sentences)

Yevgeni Berzak

Publicly available through the UD repository ('English-ESL')