This list is still a work in progress. We would like it to be as comprehensive as possible. If you have a learner corpus or know of one that is not listed on this webpage, send a message to Magali Paquot and we will add it to the list. We hope you will find the list useful for your research!
The list below only contains learner corpora, i.e. electronic collections of continuous written or spoken data produced by foreign or second language learners.
For a list of learner corpus-based datasets (treebanks, error lists, etc.), click here.
To refer to this list: Centre for English Corpus Linguistics (date of access): Learner Corpora around the World. Louvain-la-Neuve: Université catholique de Louvain. https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html
|
Learner corpora
Last updated 4 April 2023
Corpus | Target language | First language | Medium | Text type / task type | Proficiency level | Size in words | Project director | Availability |
---|---|---|---|---|---|---|---|---|
The Arabic Learner Corpus (ALC) | Arabic | 66 languages | Written and spoken | Narrative and discussion | Intermediate and advanced | Written: c. 283,000 Audio: c. 3h30 |
Abdullah Alfaifi (Al Imam University, Saudi Arabia) Eric Atwell (University of Leeds, UK) |
Available |
The Pilot Arabic Learner Corpus | Arabic | English | Written | Narrative | Intermediate and advanced | c. 9,000 | Ghazi Abuhakema (College of Charleston, USA) Reem Faraj (Columbia University, USA) Anna Feldman (Montclair State University, USA) Eileen Fitzpatrick (Montclair State University, USA) |
/ |
The Jinan Chinese Learner Corpus (JCLC) | Chinese | 50 languages | Written | Exams and assignments | Beginners, intermediate and advanced | c. 6 m. Chinese characters c. 9,000 texts |
Maolin Wang (Jinan University, China) Shervin Malmasi (Macquarie University, Australia) Mingxuan Huang (Guangxi University of Finance and Economics, China) |
Free download upon contact with researchers |
Croatian Learner Text Corpus (CroLTeC) | Croatian | 36 languages | Written | Exam essays, argumentative and literary essays, letters, diaries, picture descriptions, book reviews, short dialogues, etc. | A1-C2 | c. 1 m. | Nives Mikelić Preradović (University of Zagreb, Croatia) | Freely available |
The AKCES/CZESL (Acquisition corpora of Czech/Czech as a second language) corpus | Czech | Various | Written and spoken | Student essays and interviews | Various | 2 m. | Karel Šebesta (Charles University/Technical University, Czech Republic) | Available |
Leerdercorpus Nederlands als Vreemde Taal | Dutch | French | Written | Liesbeth Degand (Université catholique de Louvain, Belgium) | ||||
Arab Learner English Corpus (ALEC) | English | Arabic | Written | Essays written by freshman students as part of first level college writing course | University students (second language learners) | Analysis: 184,749 Narrative: 67,527 Synthesis: 66,015 Argumentation: 192,298 |
Inas Mahfouz (American University of Kuwait, Kuwait) |
Available upon request for research purposes A part-of-speech tagged version of the corpus is now available |
The Aachen Corpus of Academic Writing (ACAW) | English | German | Written | Academic research writing | Advanced | c. 240,000 c. 225,000 (L1 component) |
Elma Kerz (RWTH Aachen University, Germany) | Under development |
The Advanced Learner English Corpus (ALEC) | English | Mainly Swedish | Written | Essays written by university students of English linguistics and English literature | Advanced | c. 1.3 m. | Tove Larsson (Uppsala University, Sweden; Université catholique de Louvain, Belgium) | Not freely available |
The ANGLISH corpus | English | French | Spoken | Readings of texts and sentences, spontaneous oral language | Various | c. 5h30 | Anne Tortel (University of Provence, France) | Freely available |
Asao Kojiro’s Learner Corpus Data | English | Japanese | Written | Essays and stories written or reproduced by Japanese college students | Asao Kojiro (Ritsumeikan University, Japan) | Texts available for download | ||
The Barcelona English Language Corpus (BELC) | English | Spanish Catalan |
Spoken and written | 4 tasks: written composition, oral narrative, oral interview, role-play Longitudinal data (children and young adults learning English) |
Various | Carmen Muñoz (University of Barcelona, Spain) | Available | |
The BATMAT Corpus | English | Swedish Finnish |
Written | BA dissertations, MA dissertations | Advanced | c. 2.5 m. | Tuija Virtanen-Ulfhielm (Åbo Akademi University, Finland) | Not publicly available |
Belarusian Learner Corpus of English (BELLCE) |
English | Russian Belarusian |
Written | Argumentative essays | High intermediate to advanced | Unknown | Anastasia Rakhuba | |
The Bilingual Corpus of Chinese English Learners (BICCEL) | English | Chinese | Spoken and written | Spoken: National Oral English test Written: in-class assignments |
c. 2 m. | Wen Qiufang (Beijing Foreign Studies University, China) | ||
The Brazilian Spoken Corpus of English Learners (BraSCEL) | English | Portuguese | Spoken | Informal interview + thought-provoking picture discussion | A1-C2 benchmarked to the CEFR | Under development | Mateus Miranda (Mary Immaculate College/University of Limerick, Ireland) | The corpus (transcriptions of audio files) will be available to the scientific community upon request |
The British Academic Written English (BAWE) corpus | English | Mainly L1 speakers Also includes data produced by L2 speakers |
Written | ESP papers | 4 levels of study (from undergraduate levels to final year and taught masters level) | c. 6.5 m. | Hilary Nesi (Coventry University, UK) Sheena Gardner (Coventry University, UK) Paul Thompson (University of Birmingham, UK) Paul Wickens (Oxford Brookes, UK) |
The BAWE corpus can be accessed through the corpus analysis interface, Sketch Engine. |
The BUiD Arab Learner Corpus (BALC) | English | Arabic | Written | School examination essays | Various | c. 290,000 | Mick Randall (The British University in Dubai, United Arab Emirates) Nicholas Groom (University of Birmingham, UK) |
At present, copies of the current version of the corpus are available on request from mick.randall@buid.ac.ae |
The Cambridge Learner Corpus (CLC) | English | Various | Written | Exam scripts | Various | c. 50 m. | Cambridge University Press and Cambridge ESOL (Cambridge University, UK) |
Commercial; available on Sketch Engine |
Canadian job cover letter corpus | English | Multiple L1s for permanent residents of Canada. Self-reported L1s for those who consented to archive their letters include: Mandarin (33), Farsi (25), Punjabi (17), Korean (15), Chinese (11), Spanish (9), Arabic (5), Tagalog (5), Cantonese (4), Russian (3), French (2), Hindi (2), Taiwanese (2), Turkish (2), Albanian (1), Armenian (1), Bengali (1), Bulgarian (1), Cebuano (1), Czech (1), Hakka (1), Ilocano (1), Japanese (1), Karen (1), Khmer (1), Kurdish (1), Pashto (1), Serbian (1), Vietnamese (1), and Waray (1). | Written | Job application cover letters (obtained in a simulated task conducted 2015 to 2016 in ESL classrooms in British Columbia, Canada. 201 letters were collected, and 151 were archived with consent) | Low to high intermediate English (ranging from CLB 3 to 8) | circa 29,000 (for the archived 151 letters) | Dr. Terri Everest, teverest@alumni.ubc.ca (under PhD supervisor Dr. Grisel Garcia Perez). Dr. Everest obtained English cover letters from learners and NES (as well as model letters from L1 English books) in both a pilot study and dissertation study. | Creative Commons Non-Commercial ShareAlike 4.0 International license. See https://open.library.ubc.ca/cIRcle/collections/ubctheses/33426/items/1.0417287 (folder 5, Everest_cover_letter_corpora_meta_data, in zip file). Users may adapt materials but must attribute Dr. Everest and share it under the same terms. Commercial use is not permitted [Three L1 English cover letter corpora from the same project are also available: (1) 21 NES letters from pilot study participants in one community; (2) 40 NES letters from dissertation study participants, university students and alumni; and (3) 100 sample/model letters from Canadian and American books on cover letter writing, with 10 letters each from 10 job fields, all by different authors or editors.]. |
CEFR-ASAG corpus | English | French | Written | Short answers to an open-ended question targeting different proficiency levels | A1-C | 712 learner texts | ALTISSIA International & CENTAL, UCLouvain, Belgium · cental@uclouvain.be | Available |
The CELI corpus | Italian | Various | Written | Written tasks from language certification exams (article, blog, email, letter, story, report, essay) | B1-C2 | ca. 600,000 | Stefania Spina, Irene Fioravanti, Luciana Forti, Valentino Santucci, Angela Scerra, Fabio Zanda (University for Foreigners of Perugia, Italy) | Freely searchable via CQPweb (registration required) from https://www.unistrapg.it/cqpwebnew/celi/ |
Corpus de ELE en Japón (CELEN) | Spanish | Japanese | Written |
Texts from a) University courses of Spanish in Japan (exams and assignments) and b) Informal learning contexts on the Internet (electronic blogs and forums) |
A1 to C2 | c. 658,000 | Pilar Valverde Ibañez (Kansai Gaidai University, Japan) | Online access |
The Chinese/English Political Interpreting Corpus (CEPIC) | English/Chinese [Cantonese/Putonghua] | Chinese [Cantonese/Putonghua] / English | Spoken and Written | Political speeches | Professional (Near-Native) | 6,393,994 | Jun PAN (janicepan@hkbu.edu.hk) | Open Access |
The Corpus of Academic Learner English (CALE) | English | German | Written | Various academic text types that are typically produced in university courses of English (e.g. term papers, reading reports, research plans, abstract, reviews, and summaries) | Advanced | Under development | Marcus Callies (University of Bremen, Germany) | |
The Chinese Academic Written English corpus (CAWE) | English | Chinese | Written | Dissertations written by Chinese undergraduates majoring in English linguistics or applied linguistics | c. 400,000 | David Yong Wey Lee (City University of Hong Kong, Hong Kong) | ||
The Chinese Learner English Corpus (CLEC) | English | Chinese | Written | Various | c. 1 m. | Gui Shichun (Guangdong University of Foreign Studies, China) Yang Huizhong (Shanghai Jiao Tong University, China) |
The corpus can only be accessed by users in the Department of English at HKPU | |
The City University Corpus of Academic Spoken English (CUCASE) | English | Chinese Also includes data produced by L1 speakers |
Multimedia | c. 2 m. | David Yong Wey Lee (City University of Hong Kong, Hong Kong) | |||
The Cologne-Hanover Advanced Learner Corpus (CHALC) | English | German | Written | Term papers and essays | Advanced | c. 210,000 | Ute Römer (University of Michigan, USA) | |
The College Learners’ Spoken English Corpus (COLSEC) | English | Chinese | Spoken | National spoken English test for non-English majors | c. 700,000 | Yang Huizhong (Shanghai Jiao Tong University, China) Wei Naixing (Beihang University, China) |
||
The Corpus Archive of Learner English in Sabah/Sarawak (CALES) | English | Malay | Written | Argumentative essays | Various | c. 400,000 | Simon Botley@Faizal Hakim, Doreen Dillah (Universiti Teknologi MARA Sarawak, Malaysia) | |
Corpus Oral de Português como Língua Adicional-Brasil (CoPLA-BR)/Oral Corpus of Brazilian Portuguese as an Additional Language | Portuguese | Various | Spoken | Informal interview + thought-provoking picture discussion | Basic Intermediate Advanced | Under development | Mateus Miranda (Mary Immaculate College/University of Limerick, Ireland) | The corpus (transcriptions of audio files) will be available to the scientific community upon request. |
Corpus Escrito de Aprendices de Inglés como Lengua Extranjera en Ecuador (COREAILE) | English | Spanish (Ecuadorian) | Written | Narrative | Beginners and intermediate | 44,352 (210 texts) | Miguel A. Macías Loor (Universidad Técnica de Manabí, Ecuador) | Available upon contact with researcher (miguel.macias@utm.edu.ec) |
CORpus del ESPañol de los Italianos (CORESPI) | Spanish | Italian | Written | Written compositions | A1 to B2 | c. 125,000 | Sonia Bailini (Università Cattolica del Sacro Cuore, Italy) | Online access |
CORpus del ITaliano de los Españoles (CORITE) | Italian | Spanish | Written | Written compositions | A1 to B2 | c. 103,000 | Sonia Bailini (Università Cattolica del Sacro Cuore, Italy) | Online access |
The Corpus of Business Letters | English | Italian | Written | Tagged part: BEC1 writing tests (letters, emails, faxes, memos, reports) Untagged part: business writing exam tests |
c. 32,000 | Anna Romagnuolo (University of La Tuscia, Italy) | ||
The Corpus of Multilingual Opinion Essays by College Students (MOECS) | English | Varied | Written | Opinion essays | College students | Unknown | Megumi Okugiri (University of the Sacred Heart, Japan) | Available |
Corpus of writing, pronunciation, reading, and listening by learners of English as a Foreign Language | English | Japanese | Written and spoken | Varied | Beginners to advanced | 29h audio 30,000 words |
Katsunori Kotani (Kansai Gaidai University, Japan) Takehiko Yoshimi (Ryukoku University, Japan) Hiroaki Nanjo (Ryukoku University, Japan) Hitoshi Isahara (Toyohashi University of Technology, Japan) |
|
Corpus of Written Spanish, L2 and Heritage Speakers (COWS-L2H) | Spanish | English Mandarin Other |
Written | Personal essays | Beginner, intermediate, advanced, and heritage | 1,138,097 | Claudia H. Sánchez Gutiérrez (University of California, Davis, USA) | Available on Github (https://github.com/ucdaviscl/cowsl2h) |
The Corpus of Young Learner Interlanguage (CYLIL) | English | Dutch French Greek Italian |
Spoken | English L2 data elicited from European School pupils – longitudinal data | Various | c. 500,000 | Alex Housen (Vrije Universiteit Brussel, Belgium) | |
Corpus and Repository of Writing (Crow) | English | 24 languages, predominantly Chinese and Arabic | Written | Analysis, narrative, literature review, argument, empathy writing, proposal, reflection | High intermediate/advanced (TOEFL overall score 80-105); international undergraduate students in first-year writing classes | 9 m. (in March 2020) | Shelley Staples (University of Arizona, USA) Bradley Dilger (Purdue University, USA) |
Open access after registration |
DISKO (Deutsch im Studium: Lernerkorpus/German at university: Learner Corpus) | German | Various | Written | Standardized writing task from university admission language test (TestDaF), app. 400 tokens per text | B1-C2 | c. 240,000 (DISKO_L2), c. 55,000 (DISKO L1), c. 12,000 (DISKO_DSH), c. 90,000 (DISKO_WebTestDaF) | Katrin Wisniewski (University of Leipzig, Germany) | Available online under the ANNIS architecture, please refer to the corpus handbook |
The Eastern European English learner corpus | English | Russian Ukrainian Polish Slovak |
Spoken | Spontaneous spoken production data elicited by means of a semi-structured interview | Various | c. 60,000 | Elena Salakhian (Eberhard Karls University of Tübingen, Germany) | |
The EFL Teacher Corpus (ETC) | English | Korean | Spoken | Teacher talks in language classrooms | Upper-intermediate to advanced | c. 123,000 | Ye-eun Kwon, Eun-Joo Lee (Kunsan National University, South Korea) | Complete. Available at https://www.lextutor.ca/conc/eng/ |
The English Language Learner Insight, Proficiency and Skills Evaluation (ELLIPSE) Corpus | English | Various | Written | Essays written during state-wide standardized annual testing in the United States | Various | 6,482 texts of 427.793 words on average |
|
|
The English of Malaysian School Students corpus (EMAS) | English | Malay | Written | Student essays and oral interviews | Various | c. 500,000 | Arshad Abd. Samad (Universiti Putra Malaysia, Malaysia) | |
The English Speech Corpus of Chinese Learners (ESCCL) | English | Chinese | Spoken | Dialogue reading-aloud | Middle school and college | Chen Hua (Nantong University, China) Wen Qiufang (Beijing Foreign Studies University, China) Li Aijun (Chinese Academy of Social Sciences, China) |
||
The ETS Corpus of Non-Native Written English | English | 11 languages | Written | 12,100 TOEFL English essays | / | Daniel Blanchard | Information about the score level is available for each essay Samples are available | |
The Europarl corpus of Native Non-native and Translated Texts (ENNTT) | English | 24 EU languages | Written | Proceedings of the European Parliament | Advanced | NNS: c. 780,000 NS: c. 3 m. Translated: c. 22 m. |
Sergiu Nisioi (University of Bucharest, Romania) | Available |
English Students’ Oral Corpus in Chile (ESOC-Chile) | English | Spanish | Spoken | Student Interviews | B1, B2, C1 | 73,631 | Chinger Zapata (Universidad Católica del Norte, Chile) | The corpus (audio files or plain transcriptions of audio files in txt format) will be available to the scientific community upon request to Chinger Zapata |
The EVA Corpus of Norwegian School English | English | Norwegian | Spoken | Picture-based tasks | / | c. 35,000 | Angela Hasselgren (University of Bergen, Norway) | |
The FUSE (The Finnish Upper Secondary School Corpus of Spoken English) | English | Finnish (possibly other L1s too, information not collected) | Spoken | Role-tasks or mind-map tasks as part of a low-stakes, course examination in Finnish upper secondary/high schools | CEFR: A2-C1 | N/A | Lasse Ehrnrooth (University of Helsinki, Finland) | Online access |
The Gachon Learner Corpus | English | Korean (+ a few Chinese and Spanish speaking students) | Written | Written Journal Assignments | Lower intermediate | c. 2.5 m. | Brian Carlstrom (Gachon University, South Korea) | Freely available |
The Gesprochene Wissenschaftssprache kontrastiv/Multilingual corpus of spoken academic language (GeWiss) | German | English, Polish, Bulgarian and diverse other L1 languages | Spoken | Academic papers, student presentations and academic oral examinations in German philology / Applied Linguistics / Language pedagogy as well as in Polish, English, and Italian philology | B2, C1 | 1.4 m. | Christian Fandrych (Leipzig University, Germany) | Freely available upon registration: https://gewiss.uni-leipzig.de/index.php?id=home&L=1 |
The GICLE corpus (German component of ICLE) | English | German | Written | Mainly non-academic argumentative essays | Advanced | c. 234,000 | ||
The Giessen-Long Beach Chaplin Corpus (GLBCC) | English | German | Spoken | Transcribed interactions between native English speakers, ESL and EFL speakers | Various | c. 350,000 | Andreas Jucker, Sara Smith (University of Giessen, Germany) | Restricted use: apply for approval to get a copy |
The Hong Kong University of Science & Technology (HKUST) learner corpus | English | Chinese (mostly Cantonese) | Written | Untimed assignments written for EFL courses and school leaving exams | University and advanced high school students | c. 25 m. | John Milton (Hong Kong University of Science & Technology, Hong Kong) | |
The Indianapolis Business Learner Corpus (IBLC) | English | Various | Written | Job application letters and résumés of business communication students from the U.S., Belgium, Finland, Germany, and Thailand, spanning the years 1990-1998 | Ulla Connor, Kristen Precht, Thomas Albin (Upton Indiana University, USA) | |||
The International Corpus of Crosslinguistic Interlanguage (ICCI) | English | Various | Written | Essays (20-min in-class tasks without the use of a dictionary) | Beginner to lower-intermediate | 9,000 essays | Yukio Tono (Tokyo University of Foreign Studies, Japan) | Freely available |
The Icelandic L2 Error Corpus (IceL2EC) | Icelandic | 13 languages | Written | Student essays and assignments | Various | c. 125,000 | Anton Karl Ingason, Lilja Björk Stefánsdóttir, Xindan Xu, Isidora Glišić (University of Iceland, Iceland) | Open access |
The International Corpus Network of Asian Learners of English (ICNALE) | English | Chinese, Filipino, Indonesian, Japanese, Korean, Malay, Thai, and Urdu | Written and spoken | Essays/ Monologues/ Dialogues | A2, B1, B2+ | c. 3.5 m. | Shin'ichiro Ishikawa (Kobe University, Japan) | Open access |
The International Corpus of Learner English (ICLE) | English | Various | Written | Argumentative and literary essays | High-intermediate to advanced | c. 3 m. | Sylviane Granger (Centre for English Corpus Linguistics, Université catholique de Louvain, Belgium) | CD-Rom + handbook: order online |
The International Teaching Assistants corpus (ITAcorp) | English | Various | Spoken | Learner language from a variety of spoken classroom tasks: office hours role plays, presentations, discussions | c. 500,000 | Steven L. Thorne, Paula Golombek, Jonathon Reinhardt (Pennsylvania State University, USA) | ||
The Interphonology of Contemporary English corpus (IPCE-IPAC) | English | French Italian Chinese Spanish |
Spoken | Reading aloud, repeating words, guided interviews, interactions between two learners | Various | Under development | Nadine Herry-Bénit (Université Paris Nanterre, France) Stéphanie Lopez (Northwestern Polytechnical University, China) Jeff Tennant (University of Western Ontario, Canada) |
Under development; samples available |
The Iranian Corpus of Learner English | English | Farsi | Written | Expository essays | University students (English majors) | 436,035 | Parviz Maftoon, Parviz Birjandi, Hossein Khazaee (Islamic Azad University, Iran) | CD-ROM, data gathered for PhD dissertation by Hossein Khazaee; this corpus is an intellectual property of Science and Research Branch, Islamic Azad University, Tehran, Iran |
The ISLE speech corpus | English | German Italian |
Spoken | Recorded sentences from several blocks of differing types (reading simple sentences, using minimal pairs, giving answers to multiple choice questions) | Intermediate | c. 18h | ecisle@nats.informatik.uni-hamburg.de (University of Hamburg, Germany) | CD-Rom |
The Israeli Learner Corpus of Written English | English | Hebrew | Written | Argumentative and descriptive essays | c. 750,000 | Tina Waldman (Kibbutzim College of Education, Israel) | ||
The Janus Pannonius University (JPU) Corpus | English | Hungarian | Written | Essays and research papers | University students | c. 500,000 | József Horváth (University of Pécs, Hungary) | Searchable online |
The Japanese English as a Foreign Language Learner (JEFLL) Corpus | English | Japanese | Written | Student essays | From beginning to intermediate | c. 700,000 | Yukio Tono (Meikai University, Japan) jefll.inquiry@corpuscobo.net |
The JEFLL Corpus will be freely available for research, first via the web query system (already available in Japanese) and then the entire data will be distributed under license in the future |
Kansas Developmental Learner corpus (KANDEL) | German | English | Written | Short essays: personal narratives and personal accounts (e.g. your family, your daily routine), argumentative tasks (e.g. a book review) | At or below A2 | c. 420,000 | Nina Vyatkina | Freely and publicly available for online searches and for download using the Creative Commons license |
Korean English Learners’ Spoken Corpus (KELSC) |
English | Korean | Spoken |
1. Two speaking tests using real-time video conferencing software. 2. Integrated Tasks. 2.1 Listen to a passage (60 seconds, 90~100 words) and summarize the context of the listened to passage. Preparation time: 60 seconds, Response time: 60 seconds. 2.2 Read a passage (60 seconds, 110~120 words) and summarize the context of the read passage. Preparation time: 60 seconds, Response time: 60 seconds. |
CEFR: A1, A2, B1, B2, C1, C2 | 36,588 |
CK Jung & Kory Lauzon (Institute for Corpus Research, Incheon National University, South Korea) Email: ckjung@inu.ac.kr |
Available upon request for research purposes: Institute for Corpus Research, Incheon National University URL: http://icr.or.kr |
Kolipsi Corpus Family | Italian German |
Italian German |
Written | Written productions from upper secondary school pupils (narrative and argumentative texts) | c. 1 m. |
|
All sub-corpora of the Kolipsi Corpus Family can be queried via the ANNIS interface or downloaded on the Eurac Research Clarin Repository (from mid 2021 onwards) | |
Korpus slovenščine kot tujega jezika (KOST 1.0) | Slovene | Albanian, Bosnian, Chinese, Croatian, Czech, English, French, German, Greek, Hungarian, Italian, Japanese, Korean, Macedonian, Polish, Russian, Slovak, Serbian, Spanish, Ukrainian | Written | Essays (homework assignments and exams) | Various | 1,000,000 (6311 texts) |
Mojca Stritar Kučuk |
Available here: https://www.clarin.si/repository/xmlui/handle/11356/1753 |
The L2 component of the Spoken Chinese Corpus | Chinese (Putonghua used in mainland China) | English (12 New Zealanders and two Australians who were native English speakers of non-Chinese ethnicity) | Spoken | Informal interaction (non-task/test settings) | Intermediate to advanced | 220,792 | Lin Li | Available via GitHub https://github.com/blculyn |
Ladder: a corpus for pragmatic competences in Italian L1/L2 | Italian | German | Written and oral (monological) | DCT requests, cancellations, apologies (Emails, Instant messages, Voice messages) | Intermediate and upper-intermediate university students (B1, B2, C1) | c. 73,000 | Nicola Brocca (Universität Innsbruck) https://www.uibk.ac.at/de/ifd/sprachen/team/nicola-brocca/ |
Freely available: https://arche.acdh.oeaw.ac.at/browser/oeaw_detail/531126 |
Lancaster Corpus of Academic Written English (LANCAWE) | English | Various | Written | IELTS academic writing tests (descriptive and argumentative tasks); assignments – longitudinal data | ||||
LANGSNAP | Spanish and French | English | Spoken and written |
Oral interviews, story retelling, argumentative writing |
700,000 | Nicole Tracy-Ventura & colleagues | ||
The Lang-8 Learner Corpora | English | Various | Written | Texts from Lang-8, a social networking site for language learning | / | / | Toshikazu Tajiri, Mamoru Komachi (Nara Institute of Science and Technology, Japan) | Available here |
Learner Corpus of Latvian (LaVA) | Latvian | 35 different languages (German (37%), Swedish (11%), Finnish (9%), Norwegian, Italian, Arabic, Turkish, Portuguese, Russian, Persian, Urdu, Spanish, Sinhala, French, Tamil, Hindi, Punjabi, Chinese, Flemish, Hebrew etc.) | Written (handwritten texts) | Student essays | A1, A2 | 192K words | Ilze Auziņa |
1) freely available on corpus website: https://lava.korpuss.lv/en/ 2) noSketchEngine: 3) CLARIN-lv: |
The LeaP (Learning Prosody in a Foreign Language) Corpus | English | German | Spoken | Four types of speech styles were recorded: nonsense word lists, readings of a short story, retellings of the story, free speech in an interview situation | Various | c. 12h | Ulrike Gut (Albert-Ludwigs-University Freiburg, Germany) | The annotated corpus is available to the scientific community. Please contact Ulrike Gut at the University of Augsburg |
The Learner Corpus of Engineering Abstracts (LCEA) | English | Malaysian | Written | Abstracts of the Computer and Communication Systems Engineering Final Year Projects | Various | c. 550,000 998 abstracts |
Helen Tan (University Putra Malaysia, Malaysia) Ain Nadzimah Abdullah (University Putra Malaysia, Malaysia) Syamsiah bt Mashohor (University Putra Malaysia, Malaysia) Chan Swee Heng (Taylor's University, Malaysia) |
Available. Contact: Helen Tan |
The Learner Corpus of English for Business Communication | English | Chinese | Written | Different types of business correspondence written for simulated business situations, including memos, faxes, reports, letters of enquiry and complaint letters | c. 117,500 | Li Lan (Hong Kong Polytechnic University, Hong Kong) | Searchable online | |
The Learner Corpus of Essays and Reports | English | Chinese | Written | Essays and project reports covering a range of topics from Science, IT and New Media to Nursing, Business and Economics, and the Social Sciences | c. 188,000 | Sima Sengupta (Hong Kong Polytechnic University, Hong Kong) | Searchable online | |
A Learners' Corpus of Reading Texts | English | French | Spoken | Unprepared reading of English texts (the texts are short abstracts of fiction or made-up dialogues) | University students | Sophie Herment, Valérie Kerfelec, Laetitia Leonarduzzi, Gabor Turcsan (Aix-Marseille University, France) | Freely available | |
The Longitudinal LEarner COrpus iN Italiano, Deutsch, English (LEONIDE) | Italian German English |
Italian German |
Written | Written productions from secondary school pupils (narrative and opinion texts) | 237,000 | The Corpus can be queried via the ANNIS interface. It will be available for download on the Eurac Research Clarin Repository in summer 2021. | ||
The LONGDALE (LONGitudinal DAtabase of Learner English) project | English | Various | Spoken and written | Range of text types/task types – longitudinal data | From intermediate to advanced | Fanny Meunier (Centre for English Corpus Linguistics, Université catholique de Louvain, Belgium) | Under development | |
The Longman Learners' Corpus | English | Various | Written | Essays and exam scripts | Various | c. 10 m. | Longman | Commercial |
Learner of Persian Spoken Corpus (LoPSC) | Persian | English | Spoken | Informal conversations | Upper-Intermediate | c. 30,000 (ongoing) |
Sepideh Daghbandan (University of Edinburgh, UK) |
Please contact project director for access to the corpus |
The Louvain International Database of Spoken English Interlanguage (LINDSEI) | English | Various | Spoken | Interviews and picture descriptions | High-intermediate to advanced | c. 800,000 | Gaëtanelle Gilquin (Centre for English Corpus Linguistics, Université catholique de Louvain, Belgium) | CD-Rom and handbook: order online |
The MERLIN Corpus | Italian German Czech |
Various | Written | Various (informal and formal email/letter for different purposes, opinion text on different topics), based on standardised language tests | c. 340,000 | The MERLIN Corpus can be queried via the ANNIS interface or downloaded on the Eurac Research Clarin Repository. | ||
Mexican Learners Corpus MexLeC | English | Mexican Spanish | Spoken | Semi-structured interview on spare time, occupation, friends and family. Monologue: narratives and opinion questions | A1-B1 Longitudinal (1st. Stage) | Up to 200,000 (Under development) | Abigahil Flores. Conacyt PostDoc Researcher / Pauline Moore. Universidad Autónoma del Estado de México | Available soon at: MexLeC |
Moroccan Learner English Corpus (MoLEC) | English | Various | Written | Argumentative essays | Undergraduate EFL students | 44,783 (185 texts) | Ennaciri El Mehdi, Iabdounane Yassine | |
Multilingual Academic Corpus of Assignments - Writing and Speech (MACAWS) | Portuguese Russian |
15 languages (predominantly English and Spanish) | Written and spoken | Classroom assignments and exams organized by Macrogenre (e.g., Analysis, Description, Evaluation, Exposition, Narration) and Topic (e.g., Art, Culture, Literature, Family, Food, Future Plans, Trip) | Beginner, intermediate and advanced | 212,064 (in March 2020) | Shelley Staples (University of Arizona, USA) Aleksey Novikov Adriana Picoral Bruna Sommer-Farias |
Open access after registration |
Multilingual Corpus of Second Language Speech (MuSSeL) | Mandarin Chinese, French, Portuguese, Spanish | Mainly English | Spoken | Recordings in response to the Interpersonal Listening/Speaking (ILS) section of ACTFL Assessment of Performance toward Proficiency in Languages (AAPPL) and ACTFL’s Oral Proficiency Interview by Computer (OPIc) | Novice to Advanced | 111,267 words (2,597 texts) collected from 152 learners (as of Nov 17, 2021) | Fernando Rubio | Publicly available via L2TReC and Talkbank |
The Malaysian Corpus of Learner English (MACLE) | English | Malay | Written | Gerry Knowles, Zuraidah Mohd. Don (University of Malaya, Malaysia) | / | |||
The Malaysian Corpus of Students' Argumentative Writing (MCSAW) | English | Malay Chinese Indian |
Written | Argumentative essays | Form 4 Form 5 College | c. 565,500 | Seyed Ali Rezvani Kalajahi, Jayakaran Mukundan (University Putra Malaysia, Malaysia) | Available from developers |
The Michigan Corpus of Academic Spoken English (MICASE) | English | Mainly L1 speakers but also includes data produced by L2 speakers | Spoken | Transcripts of academic speech events | c. 1.8 m. | Ute Römer (University of Michigan, USA) micase@umich.edu |
Searchable online | |
The Michigan Corpus of Upper-level Student Papers (MICUSP) | English | Semi-balanced sample of native and non-native speakers of English | Written | ESP papers A-grade papers or ungraded papers that have been assessed and accepted (such as research proposals), but not published | c. 2.6 m. | Ute Römer (University of Michigan, USA) micase@umich.edu |
Searchable online | |
The Montclair Electronic Language Database (MELD) | English | Various | Written | Student essays | Various | c. 100,000 | Eileen Fitzpatrick, Milton S. Seegmiller (Montclair State University, USA) | Contact Eileen Fitzpatrick Includes error annotations |
The Multimedia Adult ESL Learner Corpus (MAELC) | English | ESL environment | Multimedia | Video of classroom interaction and associated written materials | Beginner to upper-intermediate | Stephen Reder, Kathryn Harris, Kristen Setzler (Portland State University, USA) labschool@pdx.edu |
The Lab School would like to share the extensive resources from MAELC with interested researchers and teacher trainers. Those interested should make inquiries to the Lab School by e-mail | |
The Neungyule Interlanguage Corpus of Korean Learners of English (NICKLE) | English | Korean | Spoken and written | Written part: student essays Spoken part: student interviews and oral speech tests transcriptions | Mainly from beginning to intermediate | Written: c. 890,000 Spoken: c. 100,000 |
Ji-Myoung Choi (Yonsei University, South Korea) | The corpus will be available to the scientific community for research purposes upon request |
The Japanese Learner English Corpus (NICT JLE) | English | Japanese | Spoken | English oral proficiency interview test | Various | 2 m. | Emi Izumi, Kiyotaka Uchimoto, Hitoshi Isahara (National Institute of Information and Communications Technology, Japan) | Freely available (downloadable) |
The NOn-native Spanish corpus of English (NOSE) | English | Spanish | Written | Argumentative and descriptive student essays | Intermediate and upper-intermediate | c. 300,000 | Ana Diaz-Negrillo (Universidad de Granada, Spain) | |
The NORINT Corpus. The NORINT Corpus consists of three sub-corpora: NORINT Speech, NORINT Recited, NORINT Text |
Norwegian | Various | Spoken and written |
NORINT Speech: interviews with and conversations between informants NORINT Recited: the informants read out a short story as well as 60 non-contextualized sentences NORINT Text: written language corpus comprising exam papers |
B1 or higher |
NORINT Speech: 103,719 tokens NORINT Recited: 36,873 tokens NORINT Text: 53,247 tokens |
Annely Tomson https://www.hf.uio.no/iln/english/people/aca/norwegian-for-international-students/tenured/annelyt/index.html |
Glossa (search and post-processing tool) supports login with CLARIN and Feide. Contact the Text Laboratory (tekstlab-post@iln.uio.no) if you cannot access it via Feide or CLARIN
|
The NUS Corpus of Learner English (NUCLE) | English | Several East Asian languages, predominantly Chinese | Written | Student essays on a wide range of topics including environmental pollution, healthcare, etc. | Various | c. 1 m. | Hwee Tou Ng, Siew Mei Wu, Daniel Dahlmeier (National University of Singapore, Singapore) | Freely available |
The PELCRA Learner English Corpus (PLEC) | English | Polish | Spoken and written | Written: argumentative, descriptive, narrative and quasi-academic essays; formal letters | From beginning to post-advanced | Under development Aim spoken: c. 200,000 Aim written: c. 2.8 m. |
Piotr Pęzik, Barbara Lewandowska-Tomaszczyk (University of Lodz, Poland) | Online search engine and corpus analysis tools |
The PICLE corpus (Polish component of ICLE) | English | Polish | Written | Student essays | Advanced | c. 330,000 | Przemyslaw Kaszubski (Adam Mickiewicz University, Poland) | Searchable online |
The Polish Learner Corpus PoLKo | Polish | Various | Written | Essays, descriptions, argumentative essays, private and official letters, reviews, short messages, interviews etc. | Various | c. 8000 (26.03.21) Under development |
Adrian Jan Zasina (Charles University, Czech Republic) Elżbieta Kaczmarska (University of Warsaw, Poland) |
Available upon request |
The Qatar learner corpus | English | Arabic (mostly from Qatar) | Spoken | Spoken interviews with Qatari learners of English | Yun Zhao Helen (Carnegie Mellon University, USA) | Freely available | ||
The Québec learner corpus | English | French (from Québec) | Written | Argumentative essays | Intermediate and advanced | c. 250,000 | Tom Cobb (Université du Québec à Montréal, Canada) | / |
The Romanian Corpus of Learner English (RoCLE) | English | Romanian | Written | Student essays | Chitez Madalina (Zurich University, Switzerland) | |||
Russian Error-Annotated English Learner Corpus | English | Russian | Written | Examination essays similar to IELTS Task 1 and Task 2, with errors annotated manually | Intermediate to Advanced | c. 800,000 by November 2017 and growing (together with the old part of the corpus less consistently annotated or not annotated, available at http://realec.org/index.xhtml#/, c. 2,000,000) | Olga Vinogradova (National Research University Higher School of Economics, Russia) | Freely available |
The Russian Learner Translator Corpus (RusLTC) | English Russian |
Russian | Written | Translations produced by trainee translators | Trainee translators | c. 1.5 m. tokens | Andrey Kutuzov (University of Oslo, Norway) Maria Kunilovskaya (Tyumen State University, Russia) |
Freely available |
The Santiago University Learner of English Corpus (SULEC) |
English | Spanish | Spoken and written | Written: compositions or argumentative essays Spoken: semi-structured interviews, short oral presentations and brief story descriptions |
Various | Aim: c. 1 m. | Ignacio M. Palacios Martínez (University of Santiago de Compostela, Spain) | Available after registration |
The Scientext English Learner Corpus |
English | French | Written | Academic argumentative texts | c. 1.1 m. | scientext@u-grenoble3.fr (Université Stendhal/Grenoble-III, France) | Searchable online | |
Second Language Research Tasks (SLRT) | English | Various | Written and spoken | Written paragraphs Various oral tasks | Various | c. 300,000 | Bill Crawford (Northern Arizona University, USA) Kim McDonough (Concordia University, Canada) |
Under development |
The Seoul National University Korean-speaking English Learner Corpus (SKELC) | English | Korean | Written | Student essays | Various | c. 900,000 | Heokseung Kwon (Seoul National University, South Korea) | / |
The SILS Learner Corpus of English | English | Various (mainly Japanese) | Written | Student essays | Basic, intermediate and advanced | c. 3.2 m. (first and second drafts included) | Victoria Muehleisen (Waseda University, Japan) | |
The Soochow Colber Student Corpus (SCSC) | English | Chinese | Written | Student essays | c. 227,000 | Colman Bernath (Soochow University, Taiwan) | ||
The Spoken and Written English Corpus of Chinese Learners (SWECCL) | English | Chinese | Written (WECCL) and spoken (SECCL) | Written: argumentative and narrative essays Spoken: National Spoken English Test – longitudinal data |
c. 2 m. | Wei Qiufang, Liang Maocheng, Wang Lifei (Beijing Foreign Studies University, China) | CD-rom | |
The Taiwanese Corpus of Learner English (TLCE) | English | Chinese | Written | Journals and essays (descriptive, narrative, expository, argumentative) | Intermediate to advanced | c. 2 m. | Rebecca Hsue-Huch Shih (Sun Yat-sen University, Taiwan) | |
The Taiwanese learner academic writing corpus (TaiwanLAWC) | English | Chinese | Written | Theses and dissertations written by Taiwanese graduate students. | Howard Chen (National Taiwan Normal University, Taiwan) | |||
The TELEC Secondary Learner Corpus (TSLC) | English | Chinese | Written and spoken | Compositions from secondary classroom | c. 2 m. | Quentin Allan (University of Hong Kong, Hong Kong) | ||
The Telecollaborative Learner Corpus of English and German Telekorp | English | German | Written | Bilingual, longitudinal database comprising computer-mediated NS-NNS interactions between approximately 200 Americans and Germans collected during six different telecollaborative partnerships from 2000-2005 | c. 1.5 m. | Julie Belz (Pennsylvania State University, USA) | Not publicly available | |
The Ten-Thousand English Compositions of Chinese Learners (TECCL) | English | Chinese | Written | Essays (various topics) written in and after class, and in testing context. Also contains some collaborative writing samples | Various (mainly undergraduates) | c. 1.8 m. | Jiajin Xu (Beijing Foreign Studies University, China) | Raw texts and part-of-speech tagged texts are available |
Tracking Written Learner Language (TRAWL) | Multilingual: English French German Spanish |
Norwegian | Written | Texts written as part of regular class work (tests, in-school writing, homework) | Longitudinal corpus (beginners/advanced) | Hildegunn Dirdal (University of Oslo, Norway) | ||
Trinity Lancaster Learner Spoken Corpus | English | Various | Spoken |
Presentation |
B1-C2 | c. 4 million | Dana Gablasova, Vaclav Brezina | |
The Tswana Learner English Corpus (TLEC) | English | Tswana | Written | Argumentative essays | Advanced | c. 200,000 | Bertus Van Rooy (University of Amsterdam, Netherlands) | Available in ICLE |
The Undergraduate Learner Translator Corpus (ULTC) | Bidirectional: English-Arabic French-Arabic |
Bidirectional: English-Arabic French-Arabic Arabic is the native language of the learners and the main target language |
Written and spoken | Translations produced by learners of translation from and into Arabic and a reference subcorpus of published translations | From beginners to advanced levels | Under development | Reem Alfuraih (Princess Nora bint Abdul Rahman University, Saudi Arabia) | Available via https://arabicparallelultc.com/ |
The Uppsala Student English Corpus (USE) | English | Swedish | Written | Student essays | Various | c. 1,200,000 | Ylva Berglund Prytz, Margareta Westergren Axelsson (Uppsala University, Sweden) | The corpus can be used for research and educational purposes. It can be accessed on the Internet from the Oxford Text Archive. |
The Uppsala WordReference Corpus | English Spanish French Italian |
Various | Written | Forum posts | English learner subcorpus: 38 m., English native subcorpus: 50 m., Spanish learner subcorpus: 5 m., Spanish native subcorpus: 22 m., French learner subcorpus: 4 m., French native subcorpus: 7 m., Italian learner subcorpus: 1 m., Italian native subcorpus: 3 m. | Aleksandrs Berdicevskis (Uppsala University, Sweden) | Freely available | |
The UPF Learner Translation Corpus | English | Catalan | Written | Translations written by the students of the Translation and Interpreting degree at UPF | c. 200,000 | Anna Espunya (Pompeu Fabra University, Spain) | ||
The UPV Learner Corpus | English | Catalan | Written | Essays | Various | c. 150,000 | Angeles Andreu Andrés (Universitat Politècnica de València, Spain) | |
The Varieties of English for Specific Purposes dAtabase learner corpus (VESPA) | English | Various | Written | ESP texts (term papers, reports, MA dissertations) | Various | c. 220,000 (under development) | Magali Paquot (Centre for English Corpus Linguistics, Université catholique de Louvain, Belgium) | |
The Written Corpus of Learner English corpus (WriCLE) | English | Spanish | Written | Essays | Various | c. 750,000 | Paul Rollinson (Universidad Autonoma de Madrid, Spain) | |
The Yonsei English Learner Corpus (YELC) | English | Korean | Written | Yonsei University English Diagnostic Tests (Part 1: descriptive task, max. 100 words; Part 2: argumentative task, max. 300 words) | 9 levels (A1, A1+, A2, B1, B1+, B2, B2+, C1, C2) | c. 1 m. | Seok-Chae Rhee (Yonsei University, South Korea), CK Jung (Incheon National University, South Korea) | The YELC corpus will be available to the scientific community for research purposes from 31 March 2012 |
The Young Learner Corpus of English (YOLECORE) | English | Greek | Spoken | Pedagogic Corpus of video-recorded EFL language classes | 170 school hours (126 hours of videotaped material) 1.5 m. types |
Marina Mattheoudakis, Thomas Zapounidis (Aristotle University of Thessaloniki, Greece) | ||
The Estonian Interlanguage Corpus of Tallinn University (EIC) | Estonian | Russian Finnish English German Latvian Lithuanian Ukrainian Belorussian |
Written | Spontaneously produced texts in language learning situations: argumentative and literary essays, written stories, letters, term papers, reading reports. | A1-C2 | c. 1 m. | Pille Eslon (Tallinn University, Estonia) | |
Linguistic Basis of the Common European Framework for L2 English and L2 Finnish (CEFLING) | Finnish English |
Various | Written | Various | Various | Maisa Martin (University of Jyväskylä, Finland) | ||
Paths in Second Language Acquisition (TOPLING) | Finnish English Swedish |
Various | Written | Various | Various | Maisa Martin (University of Jyväskylä, Finland) | Available (see here for instructions on how to access the corpora) | |
The Advanced Finnish Learner Corpus (LAS2) | Finnish | Russian Czech Swedish Estonian Lithuanian Komi English Hungarian German Icelandic Japanese |
Written | Exam essays, theses, essays and writings | Advanced | c. 630,000 | Kirsti Siitonen, Ilmari Ivaska (University of Turku, Finland) | Available |
The Finnish National Foreign Language Certificate Corpus (YKI) | Finnish | English Finnish French German Italian Lappish (Sami) Spanish Swedish Russian |
Written and spoken | Various | Beginner, intermediate and advanced | Ari Maijanen, Tiina Lammervo (University of Jyväskylä, Finland) | Available with user ID and Password | |
The International Corpus of Learner Finnish (ICLFI) | Finnish | Various | Written | Finnish learners’ spontaneously produced texts in language learning situations, large variety of text types | Beginner, intermediate and advanced | Under development | Jarmo Harri Jantunen (University of Oulu, Finland) | Free download after applying for a user licence |
The Chy-FLE (Cypriot Learner Corpus of French) | French | Modern Greek (and Cypriot Greek) | Written | Argumentative and descriptive essays | From intermediate to advanced | c. 250,000 (under development) | Freiderikos Valetopoulos (Université de Poitiers, France, in collaboration with the University of Cyprus) | |
The COREIL corpus | French English | Spoken | Elisabeth Delais-Roussarie, Hiyon Yoo (Université Paris-Diderot, France) | |||||
The "Dire Autrement" corpus | French (Second Language) | Mainly L1 speakers of English | Written | Narrative, injunctive, persuasive and informative texts | c. 50,000 | Marie-Josée Hamel, Jasmina Milićević (Dalhousie University, Canada) | Available after registration | |
French Interlanguage Database (FRIDA) | French | Various | Written | Free compositions: descriptive, argumentative and narrative texts, news & mail | Intermediate | Sylviane Granger (Centre for English Corpus Linguistics, Université catholique de Louvain, Belgium) | ||
French Learner Language Oral Corpora (FLLOC) | French | Various | Spoken | See description of the 7 corpora | Various | Florence Myles (Newcastle University, UK) Rosamund Mitchell (University of Southampton, UK) |
The contents of the database are being made freely available to the research community, in the form of digital sound files and related transcripts formatted using CHILDES software. Searchable online | |
The InterFra corpus | French | Swedish | Spoken | Interviews, retellings of video clips and picture stories | Various | Inge Bartning (Stockholm University, Sweden) interfra@fraita.su.se |
Available | |
The "Interphonologie du Français Contemporain" corpus (IPFC) | French | Cypriot Greek Dutch English (Canada) German Japanese Norwegian Spanish |
Spoken | Reading aloud, repeating words, guided interviews, interactions between two learners | Various | Under development | Sylvain Detey (Waseda University, Japan; Université de Rouen, France) Isabelle Racine (Université de Genève, Switzerland) Yuji Kawaguchi (Tokyo University of Foreign Studies, Japan) |
Under development; samples available |
The Learner Corpus French (LCF) | French | Dutch | Written | Argumentative essays, informative texts, journalistic texts, formal letters, summaries, written compositions by Flemish students of French | Intermediate to advanced | c. 500,000 | Hans Paulussen (K.U.Leuven/Ugent/Lessius, Belgium) | Under development |
The Lund CEFLE Corpus (Corpus Écrit de Français Langue Étrangère) | French | Swedish | Written | Descriptive and narrative essays; picture-based stories | Various | c. 100,000 | Malin Ågren (Lund University, Sweden) | A sub-part of the corpus is available online. |
The University of the West Indies learner corpus (UWi) | French | English Jamaican Creole |
Spoken | Conversations during oral exams and in informal contexts | Various | Hugues Peters (University of New South Wales, Australia) | Corpus is available freely here (last updated 2017) | |
Comasan Labhairt ann an Gàidhlig (CLAG) Gaelic Adult Proficiency (GAP) | Gaelic | Various | Spoken | Conversation task Narrative Elicited oral imitation task Question and answer activity | Various | Roibeard Ó Maolalaigh, Nicola Carty (University of Glasgow, UK) | ||
The AleSKO corpus | German | Chinese Also German L1 data from the FALKO corpus |
Written | Argumentative essays | c. 13,600 | Heike Zinsmeister (University of Konstanz, Germany) Margrit Breckle (Vilnius Pedagogical University, Lithuania) |
||
Analyzing Discourse Strategies: A Computer Learner Corpus | German | English (mainly American English) | Written | Threaded discussion, chat, essays – longitudinal data | From beginner to intermediate-mid | Under development | Christina Frei, Edward Nixon (University of Pennsylvania, USA) | |
The Corpus of Learner German (CLEG13) | German | English | Written | Argumentative, free compositions Longitudinal over 4 years, undergraduate students |
Intermediate to advanced | c. 320,000 | Ursula Maden-Weinberger (Edge Hill University, UK) | Online access through the FALKO platform. The corpus is also available as txt files to the scientific community. Please contact U. Maden-Weinberger at uschi@miralis.co.uk |
The deL1L2IM corpus | German | Russian-Belorussian bilinguals | Written | Instant messaging dialogues | Advanced | c. 52,000 | Sviatlana Höhn (University of Luxembourg, Luxembourg) | Available |
The Fehlerannotiertes Lernerkorpus (‘error annotated learner corpus’) (FALKO) | German | Learner subcorpus: various Native subcorpus: German |
Written | 1. Summaries 2. Essays 3. Letters, fiction writing, journal articles, book reviews (= longitudinal data from American learners) |
1. Advanced 2. Advanced 3. Beginners - advanced |
1. c. 40,000 (learner subcorpus) + c. 20,000 (native subcorpus) 2. c. 150,000 (learner corpus) + c. 70,000 (native subcorpus) 3. c. 78,000 (learner subcorpus) |
Anke Lüdeling, Maik Walter (Humboldt-Universität zu Berlin, Germany) falko-korpus@hu-berlin.de |
Online access |
The KOLIPSI corpus | German | Italian | Written | Two written language production tasks of a standardized test (email/letter) | A2-C1 | Under development | Andrea Abel, Aivars Glaznieks (European Academy Bolzano/Bozen, Italy) | |
The Learning the Prosody of a Foreign Language (LeaP) | German | Various | Spoken | The LeaP corpus covers four different types of speech: read speech, prepared speech, free speech, nonsense word lists | Various | 62 speakers | Ulrike Gut (University of Augsburg, Germany) | The annotated corpus is available to the scientific community. Please contact Ulrike Gut at the University of Augsburg Manual |
The LeKo (Lernerkorpus) corpus | German | c. 55,000 | Anke Lüdeling (Humboldt-Universität zu Berlin, Germany) | Online access (password protected) Register here | ||||
The LINCS Corpus | 1. German 2. German 3. German |
1. English 2. German |
1. Written 2. Written 3. Written |
1. Essays, examination, answers (longitudinal and cross-sectional data) 2. Essays 3. Teaching output |
1. Intermediate to Advanced 2. Advanced |
Under development | Elizabeth Thoday (Heriot-Watt University Edinburgh, UK) | Not currently publicly available |
Multilingual Platform for the European Reference Levels: Exploring Interlanguage in Context (MERLIN) | German Italian Czech |
Various | Written | Writing tasks from standardized tests (telc/UJOP) | A1 to C1 | c. 280,000 | Katrin Wisniewski (Leipzig University, Germany) | Available |
Rhodes University Deutsch als Fremdsprache (RUDaF) | German | English Afrikaans isiXhosa XiTsonga |
Written | Short descriptive and argumentative writing paragraphs (300 words each) | A2-B2 | 34,000 | Gwyndolen Ortner, Undine S. Weber (Rhodes University, South Africa) | Not available |
The Telecollaborative Learner Corpus of English and German Telekorp | German | English | Written | Bilingual, longitudinal database comprising computer-mediated NS-NNS interactions between approximately 200 Americans and Germans collected during six different telecollaborative partnerships from 2000-2005 | c. 1.5 m. | Julie Belz (Pennsylvania State University, USA) | Not publicly available | |
The Langman corpus | Hungarian | Chinese | Spoken | Interviews conducted in 1994 with 11 Chinese immigrants living in Hungary. Interviews focused on issues related to their arrival in Hungary as well as their daily life activities | Juliet Langman (University of Texas at San Antonio, USA) | Freely available | ||
Corpus di Apprendenti di Italiano L2 (CAIL2) | Italian | Various | Written | Essays | Intermediate to advanced | c. 237,000 | Stefania Spina (Università per Stranieri di Perugia, Italy) | |
Corpus parlato di italiano L2 | Italian | English German Japanese |
Spoken | Transcriptions of interviews | Various | Stefania Spina, Silvio Pazzaglia, Mirco Perini (Università per Stranieri di Perugia, Italy) | ||
The KOLIPSI corpus | Italian | German | Written | Two written language production tasks of a standardized test (email/letter) | A2-C1 | Under development | Andrea Abel (European Academy Bolzano/Bozen, Italy) | |
The Lexicon of Spoken Italian by Foreigners (LIPS) | Italian | Various | Spoken | Proficiency exams of the Certification of Italian as a Foreign Language (CILS) | A1-C2 | c. 700,000 | Francesca Gallina (Università per Stranieri di Siena, Italy) | Freely available |
MISTiC (Multiple Italian Student TranslatIon Corpus) | Italian | English French |
Written | Translations produced by trainee translators (mainly specialised texts) | Post-graduate trainee translators | ca. 125,000 (English-Italian), ca. 50,000 (French-Italian) | Sara Castagnoli (University of Macerata, Italy) | Not available |
Varietà di Apprendimento della Lingua Italiana: Corpus Online (VALICO) | Italian | Various | Written | Various | c. 570,000 | Manuel Barbera, Carla Marello, Elisa Corino (University of Turin, Italy) | ||
Longitudinal Corpus of Chinese Learners of Italian (LOCCLI) | Italian | Chinese | Written | Essays | Beginners and pre-intermediate | 97,000 | Stefania Spina (Università per Stranieri di Perugia, Italy) Anna Siyanova-Chanturia (Victoria University of Wellington, New Zealand) |
It is freely searchable via CQPweb (registration required) from https://www.unistrapg.it/cqpweb/ |
Corpus of Chinese Learners of Italian (COLI) | Italian | Chinese | Written and spoken | Essays and answers to open questions, interviews | Intermediate and advanced | 82,300 | Stefania Spina (Università per Stranieri di Perugia, Italy) | |
The Korean learner corpus | Korean | Various | Written | Various: letters, essays, formal writing, etc. | Beginner and intermediate | c. 10,000 |
Jungyeul Park (University of British Columbia, Vancouver) |
Available through Jungyeul Park's GitHub: https://github.com/jungyeul/korean-learner-corpus |
ESAM | Latvian and Lithuanian | Latvian and Lithuanian | Written | Beginner | 52,000 | Inga Znotiņa (Rīga Stradiņš University, Latvia) | Available online | |
The ASK corpus | Norwegian | German Dutch English Spanish Russian Polish Bosnian-Croatian-Serbian Albanian Vietnamese Somali |
Written | Essays from language tests | B1 and B2 | Kari Tenfjord (University of Bergen, Norway) | Apply for a licence here | |
The Persian Learner Corpus (PLC) | Persian (Farsi) | Various | Written | Narratives and essays | Intermediate and advanced | Academic/Restricted online access | Saeed Safari (University of Belgrade, Serbia) | |
The Salam Farsi Learner Corpus (SFLC) | Persian (Farsi) | Serbian | Written | Narratives, descriptive essays | Beginner and upper-intermediate | Under development | Saeed Safari (University of Belgrade, Serbia) | Academic, under development |
Learner Corpus of Portuguese L2 (COPLE2) | Portuguese | 15 languages: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian and Swedish | Written and spoken | Exams and assignments | A1-C1 | Written: 171,461 Oral: 25,783 |
Iria del Río (Universidade de Lisboa, Portugal) | Available |
Russian Learner Corpus | Russian | Varied | Written and spoken | Academic and non-academic | Teachers and heritage speakers | Unknown | Ekaterina Rakhilina (National Research University Higher School of Economics, Russia) | Available online |
The University of Pittsburgh English Language Institute Corpus (PELIC) | English | 30 languages | Written (spoken data to be released in the future) | Variety of General English and EAP tasks and text types | Pre-Intermediate to Advanced | 4.2 m. | Alan Juffs (University of Pittsburgh, USA) | Publicly available on GitHub |
The Anglia Polytechnic University (APU) Learner Spanish Corpus | Spanish | Various | Written | c. 120,000 | Anne Ife (Anglia Ruskin University, UK) | |||
Aprescrilov ("Aprender a Escribir en Lovaina") | Spanish | Dutch | Written | Written assignments and tests; several text types (letters, expository, descriptive, argumentative, narrative) | A1 to C1 | c. 1 m. | Kris Buyse (KU Leuven, Belgium) | Restricted online access |
The Corpus de aprendices de español (CAES) | Spanish | Various | Written | A1 to C1 | c. 575,000 | CAES team (Universidade de Santiago de Compostela, Spain) | Online access | |
Corpus Escrito del Español L2 (CEDEL2 version 2.0) | Spanish | English German Dutch French Portuguese Italian Greek Russian Japanese Chinese Arabic |
Written (and some spoken) | Written (and some spoken) compositions by learners of Spanish | All proficiency levels (lower beginner to upper advanced) | 1,105,936 words coming from 4,399 participants | Cristobal Lozano (Universidad de Granada, Spain) | Downloadable/browsable via the CEDEL2 webpage: http://cedel2.learnercorpora.com/ |
Corpus de textos escritos para el análisis de errores de aprendices de E/LE (CORANE) | Spanish | Various | Written | Essays | A2 to C1 | / | Ana M. Cestero Mancera, Inmaculada Penadés Martínez (Universidad de Alcalá Henares, Spain) | CD-ROM available |
The Corpus of Taiwanese Learners of Spanish (Corpus de Aprendices Taiwaneses de Español) (CATE) | Spanish | Chinese | Written | Student essays | Various | c. 340,000 | hclu@mail.ncku.edu.tw (National Cheng Kung University, Taiwan) | Under development |
The DIAZ corpus | Spanish | German Swedish Icelandic Korean Chinese |
Spoken | Semi-spontaneous (structured interviews) and experimental (structured questionnaires) Adult Spanish L2/L3 oral data | Various | Lourdes Díaz Rodríguez (Universitat Pompeu Fabra, Spain) | ||
The Japanese learner corpus of Spanish | Spanish | Japanese | Written | Student essays | c. 87,000 | Yoshihito Kamakura (University of Birmingham, UK) | Online access | |
The Spanish Corpus Proficiency Level Training (SPT) | Spanish | English (heritage language learners) | Spoken | Dialogues about a given set of questions | Beginner to advanced | Dale Koike (University of Texas at Austin/Liberal Arts Instructional Technology Center, USA) | Videos are available | |
Spanish Learner Language Oral Corpus (SPLLOC) | Spanish | English | Spoken | Learner narratives, interviews and picture description tasks | Beginner to advanced | c. 50,000 | Laura Domínguez (University of Southampton, UK) | Searchable online Data freely available for download |
Spanish Learner Oral Corpus | Spanish | Various (9+ languages, especially Portuguese, French, Italian) | Spoken | Semi-spontaneous interviews, narrative and descriptive tasks | A2-B1 | c. 50,000 | Leonardo Campillos Llanos (Universidad Autonoma de Madrid, Spain) | Online access |
The Tartu Learner Corpus of Spanish as a L3+ | Spanish | Estonian | Written | Academic research writing | Advanced | c. 885,000 | Mari Kruse (University of Tartu, Estonia) | |
The ASU corpus | Swedish | Chinese English German Greek Polish Portuguese Spanish ... |
Spoken and written | Transcribed audio-recorded conversations and written texts from adult learners of Swedish – longitudinal data | c. 490,000 (c. 415,000 spoken and c. 75,000 written) | Björn Hammarberg (Stockholm University, Sweden) | Available | |
Leiden Learner Corpus | Multilingual: Dutch French Italian Portuguese Spanish |
Various | Written and spoken | Written data: short essays; oral data: picture-based story telling | Various | 200 participants | M. Carmen Parafita Couto (University of Leiden, Netherlands) | |
The European Science Foundation Second Language Database (ESF database) | Multilingual: Dutch English French German Swedish |
Punjabi Italian Turkish Arabic Spanish Finnish |
Spoken | Spontaneous second language acquisition of forty adult immigrant workers living in Western Europe, and their communication with native speakers in the respective host countries | Various | Wolfgang Klein, Clive Perdue (Max Planck Institut, Netherlands) | Freely available | |
The Foreign Language Examination Corpus (FLEC) | Multilingual | Polish | Written | Data from the Warsaw University Certification Exams | Various | Under development | Piotr Banski, Romuald Gozdawa-Golebiowski (Warsaw University, Poland) | |
The MeLLANGE Learner Translator Corpus (LTC) | Multilingual | Various | Written | Legal, technical, administrative and journalistic texts | Trainee translators | Natalie Kübler (Université Paris Diderot, France) mellange_p7@eila.univ-paris-diderot.fr |
Searchable online | |
The MiLC Corpus | Multilingual: Catalan English French Spanish |
Catalan | Written | Formal and informal letters, summaries, curriculum vitae, essays, reports, translations, synchronous and asynchronous communication exchanges, business letters | c. 150,000 | Angeles Andreu Andrés (Universidad Politécnica de Valencia, Spain) | ||
The Multilingual Learner Corpus (MLC) | Multilingual: English German Italian Spanish |
Brazilian Portuguese | Written | Argumentative and narrative essays | Aim: c. 200,000 | Stella E.O. Tagnin (University of São Paulo, Brazil) | Accessible online to registered researchers | |
The Padova Learner Corpus | Multilingual: English French Spanish |
Italian | CMC (Computer-Mediated Communication) | Student work produced in blended language courses using FirstClass conferencing software. Variety of genres: diaries, debate contributions, formal reports, résumés, etc. Longitudinal data | Under development | Fiona Dalziel, Francesca Helm (University of Padua, Italy) | ||
The corpus PARallèle Oral en Langue Etrangère (PAROLE) | Multilingual: English French Italian (Mainly L2 speakers but also includes data produced by L1 speakers) |
Various | Spoken | 5 oral production tasks | Various | Heather Hilton, John Osborne, Marie-Jo Derive, Nejma Succo, Jean O'Donnell, Sandra Billard, Sandrine Rutigliano-Daspet (Université de Savoie, France) | ||
The University of Toronto Romance Phonetics Database (RPD) | Multilingual: English French Italian Portuguese Romanian Spanish |
Various (including English, Mandarin, Russian, Spanish, etc.) | Spoken | Elicited production: sentence and passage reading, story narration, description of favourite meal | Various | Laura Colantoni, Jeffrey Steele (University of Toronto, Canada) | Password available from directors |
Learner corpus-based datasets
Corpus | Target language | First language | Medium | Text type / task type | Proficiency level | Size in words | Project director | Availability |
---|---|---|---|---|---|---|---|---|
The Treebank of Learner English (TLE) | English | Various | Written | Sentences from the CLC FCE (annotated with syntactic trees) | Upper-intermediate | 97,681 (5,124 sentences) | Yevgeni Berzak | Publicly available through the UD repository ('English-ESL') |
VALICO-UD | Italian | German, English, French, Spanish | Written | Comic strip elicited texts | From first year to fourth year of study of Italian (various proficiencies) | 6,784 (learner texts) + 6,832 (target hypotheses) | Elisa Di Nuovo, Cristina Bosco, Elisa Corino | Released in the Universal Dependencies repository |