- ICLEv2 includes 5 new sub-corpora of essays written by native speakers of Chinese, Japanese, Norwegian, Turkish and Tswana. It consists of 6,085 essays and totals c. 3.7 million words.
- The software package on the new CD-ROM also includes an in-built concordancer, Unitex, which allows researchers to query the learner sub-corpora. Unitex was developed by Sébastien Paumier at the Institut Gaspard-Monge (IGM), University of Paris-Est Marne-la-Vallée (France). It is an open-source collection of programs inspired by the features of Intex, a non-open-source software tool developed by Max Silberztein (e.g. Silberztein 1993).
- All learner essays were lemmatized and POS-tagged with CLAWS thanks to Paul Rayson (Lancaster University). It is thus possible to search for all occurrences of a lemma (e.g. the lemma ‘use’ used as a noun), a POS-tag (e.g. all the adverbs used in the corpus) or a sequence of POS-tags (e.g. a plural noun followed by a lexical verb) via the in-built concordancer. It is, however, not possible to build a POS-tagged sub-corpus and analyze it with the help of other software tools.
- The corpus is available from www.i6doc.com
- For more information on Unitex, see http://www-igm.univ-mlv.fr/~unitex/index.html
- We conducted a small pilot study at the Centre for English Corpus Linguistics in which we tagged 51 learner essays (c. 42,000 words) representing the 16 mother tongue backgrounds available in the International Corpus of Learner English and examined the success rate of the CLAWS tagger. Tagging accuracy for the individual essays ranged from 95% to 99.1%.
- Multiword units can be retrieved with the < > symbols. To extract all occurrences of the multiword unit "as far as", search for <as far as>. To check whether every occurrence of "as far as" has been tagged as a multiword unit by CLAWS, also search for "as far as" without angle brackets and compare the two sets of results.
- Click on ‘Tools’ and ‘View word lists’. All sequences of words tagged as multiword units by CLAWS are listed under ‘compound lexical entries’.
- “While lemmatizers are potentially very useful for lexical analyses of interlanguage, researchers have to be aware that only the standard realisations of a lemma will be retrieved, i.e. for the lemma LOSE, the standard forms lose/loses/losing/lost, but not the (sometimes equally frequent!) non-standard forms loose/looses/loosing/loosed” (Granger 2008).
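One way to work around this limitation outside the concordancer is to search the raw text for the standard and the non-standard forms together. The following is only a minimal Python sketch under stated assumptions: the function name `count_forms` and the sample sentence are hypothetical, and the form lists are simply the ones quoted above for the lemma LOSE. Note that 'loose' is also a legitimate adjective, so hits still have to be checked manually.

```python
import re
from collections import Counter

# Forms quoted in Granger (2008) for the lemma LOSE.
STANDARD = {"lose", "loses", "losing", "lost"}
NON_STANDARD = {"loose", "looses", "loosing", "loosed"}

# Whole-word, case-insensitive pattern covering both sets of forms.
PATTERN = re.compile(
    r"\b(?:" + "|".join(sorted(STANDARD | NON_STANDARD)) + r")\b",
    re.IGNORECASE,
)

def count_forms(text: str) -> Counter:
    """Count each matched form (lower-cased) in a plain-text essay."""
    return Counter(m.group(0).lower() for m in PATTERN.finditer(text))

# Hypothetical sample sentence, not taken from the corpus.
counts = count_forms("They loose hope and lose time. She looses it; he lost.")
non_standard_total = sum(counts[form] for form in NON_STANDARD)
```

Keeping the two sets separate makes it easy to report how many hits a lemma search alone would have missed.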
- It is important to bear in mind that a search for the preposition 'up' (search query: <up.Prep>) will not retrieve instances of 'up' that are part of compound lexical entries. The tokens of a compound lexical entry share a single POS tag (e.g. 'fed up' is tagged as an adjective and 'up to' as an adverb). To retrieve all compound lexical entries that include the token 'up', use the following query, which involves a morphological filter:
<CDIC><<up>>: matches any multiword unit that is present in the corpus and which includes the token 'up'
To access the complete list of sequences of words tagged as compound lexical entries in ICLEv2, click on 'Tools' and select 'View word lists'. All sequences of words tagged as multiword units by CLAWS are listed under ‘compound lexical entries’.
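The difference between the two queries can be illustrated with a small Python sketch. This is a toy model only, not Unitex code: the entries, the tag labels other than 'Prep', and the function names `match_word_with_tag` and `compounds_containing` are all invented for the illustration. It only mimics the idea that a compound lexical entry is stored as one unit with one tag.

```python
from typing import List, Tuple

# (surface form, POS label); a compound entry is a single unit with a single tag.
Entry = Tuple[str, str]

# Invented toy data, not real CLAWS output.
tagged: List[Entry] = [
    ("up", "Prep"),     # simple token tagged as a preposition
    ("the", "Det"),
    ("hill", "N"),
    ("fed up", "A"),    # compound entry tagged as an adjective
    ("up to", "Adv"),   # compound entry tagged as an adverb
]

def match_word_with_tag(entries: List[Entry], word: str, tag: str) -> List[Entry]:
    """Rough analogue of <up.Prep>: single-token entries with this form and tag."""
    return [e for e in entries if e[0].lower() == word and e[1] == tag]

def compounds_containing(entries: List[Entry], word: str) -> List[Entry]:
    """Rough analogue of <CDIC><<up>>: multiword entries that include the token."""
    return [e for e in entries if " " in e[0] and word in e[0].lower().split()]

simple_hits = match_word_with_tag(tagged, "up", "Prep")
compound_hits = compounds_containing(tagged, "up")
```

In this toy data the first search returns only the free-standing preposition, while the second returns 'fed up' and 'up to', which is why both queries are needed for a complete picture of 'up'.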
Granger, S. (2008). Learner Corpora. In Lüdeling, A. and M. Kytö (eds), Handbook on Corpus Linguistics. Mouton de Gruyter.