ICLEv2 CD-ROM: Frequently asked questions

General questions

What are the differences between ICLEv1 and ICLEv2?
Where can I buy ICLEv2?
Where can I get more information about Unitex?
Which tagger was used to lemmatize and part-of-speech tag the International Corpus of Learner English?
What is the success rate of the CLAWS tagger when used on learner data?

Search syntax

How do I search for multiword units?
How do I know that a word sequence has been tagged as a multiword unit by CLAWS?
If I search for the lemma LOSE, will I retrieve all occurrences of the verb?
When I search for the preposition UP, not all occurrences of the preposition are retrieved.

General questions

What are the differences between ICLEv1 and ICLEv2?

ICLEv2 includes five new sub-corpora of essays written by native speakers of Chinese, Japanese, Norwegian, Turkish and Tswana. It contains 6,085 essays, totalling c. 3.7 million words.
The new CD-ROM also includes a built-in concordancer, Unitex, which allows researchers to query the learner sub-corpora. Unitex was developed by Sébastien Paumier at the Institut Gaspard-Monge (IGM), University of Paris-Est Marne-la-Vallée (France). It is an open-source collection of programs inspired by the features of Intex, a non-open-source software tool developed by Max Silberztein (e.g. Silberztein 1993).
All learner essays were lemmatized and POS-tagged with CLAWS, thanks to Paul Rayson (Lancaster University). It is thus possible to use the built-in concordancer to search for all occurrences of a lemma (e.g. the lemma ‘use’ used as a noun), a POS tag (e.g. all the adverbs used in the corpus) or a sequence of POS tags (e.g. a plural noun followed by a lexical verb). It is not possible, however, to build a POS-tagged sub-corpus and analyze it with other software tools.

Where can I buy ICLEv2?

The corpus is available from www.i6doc.com

Where can I get more information about Unitex?

Have a look at http://www-igm.univ-mlv.fr/~unitex/index.html

Which tagger was used to lemmatize and part-of-speech tag the International Corpus of Learner English?

CLAWS, with the C7 tagset.

What is the success rate of the CLAWS tagger when used on learner data?

We conducted a small pilot study at the Centre for English Corpus Linguistics in which we tagged 51 learner essays (c. 42,000 words) representing the 16 mother tongue backgrounds available in the International Corpus of Learner English and examined the success rate of the CLAWS tagger. The accuracy rate for each essay fell between 95% and 99.1%.

Search syntax

How do I search for multiword units?

Multiword units can be retrieved with angle brackets (< >). To extract all occurrences of the multiword unit "as far as", search for <as far as>. CLAWS may not have tagged every occurrence of "as far as" as a multiword unit, so you should also search for "as far as" without angle brackets to check for occurrences that the first query missed.
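The check described above can be sketched outside the concordancer. The following Python snippet is only a toy model, not a description of how Unitex works internally: it assumes that a query with angle brackets matches sequences tagged as a single multiword unit, and that a query without angle brackets matches the word sequence whatever its tagging. The data and the function names (bracket_hits, plain_hits) are hypothetical.

```python
# Toy model: each unit is (text, tagged_as_multiword_unit).
# The sample data below is invented for illustration.
units = [
    ("As far as", True),   # tagged as one multiword unit
    ("I", False), ("know", False), ("he", False), ("ran", False),
    ("as", False), ("far", False), ("as", False),  # same words, tagged separately
    ("the", False), ("river", False),
]


def bracket_hits(units, phrase):
    """Count occurrences tagged as a single multiword unit (assumed <...> behaviour)."""
    return sum(1 for text, is_mwu in units if is_mwu and text.lower() == phrase)


def plain_hits(units, phrase):
    """Count occurrences of the word sequence, ignoring tagging (assumed plain search)."""
    words = [w.lower() for text, _ in units for w in text.split()]
    target = phrase.split()
    n = len(target)
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)


tagged = bracket_hits(units, "as far as")
total = plain_hits(units, "as far as")
untagged = total - tagged  # occurrences a <as far as> query would miss
print(tagged, total, untagged)
```

Under these assumptions, a difference between the two counts signals occurrences that were not tagged as a multiword unit and need to be retrieved separately.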

How do I know that a word sequence has been tagged as a multiword unit by CLAWS?

Click on ‘Tools’ and ‘View word lists’. All sequences of words tagged as multiword units by CLAWS are listed under ‘compound lexical entries’.

If I search for the lemma LOSE, will I retrieve all occurrences of the verb?

“While lemmatizers are potentially very useful for lexical analyses of interlanguage, researchers have to be aware that only the standard realisations of a lemma will be retrieved, i.e. for the lemma LOSE, the standard forms lose/loses/losing/lost, but not the (sometimes equally frequent!) non-standard forms loose/looses/loosing/loosed” (Granger 2008).
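One way to work around this is to search for the non-standard forms explicitly alongside the standard ones. The Python sketch below illustrates the idea on plain text, using only the forms listed in the quotation; the function name count_forms and the sample sentence are hypothetical. Note that 'loose' is also a legitimate adjective and verb, so hits for the non-standard forms still need to be checked manually.

```python
import re
from collections import Counter

# Forms taken from the quotation above (Granger 2008).
STANDARD = {"lose", "loses", "losing", "lost"}
NON_STANDARD = {"loose", "looses", "loosing", "loosed"}


def count_forms(text):
    """Count standard and non-standard forms of LOSE in a plain-text string."""
    forms = sorted(STANDARD | NON_STANDARD, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(forms) + r")\b", re.IGNORECASE)
    counts = Counter(m.group(1).lower() for m in pattern.finditer(text))
    return {
        "standard": sum(counts[f] for f in STANDARD),
        "non_standard": sum(counts[f] for f in NON_STANDARD),
        "by_form": dict(counts),
    }


# Invented sample sentence for illustration only.
sample = "They loose their jobs and lose hope. He lost it; she is loosing time."
result = count_forms(sample)
print(result["standard"], result["non_standard"])
```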

When I search for the preposition UP, not all occurrences of the preposition are retrieved.

Bear in mind that a search for the preposition 'up' (search query: <up.Prep>) does not retrieve instances of 'up' that are part of compound lexical entries, because the tokens of a compound lexical entry share a single POS tag (e.g. 'fed up' is tagged as an adjective and 'up to' as an adverb). To retrieve all compound lexical entries that include the token 'up', use the following query, which involves a morphological filter:

<CDIC><<up>>: matches any multiword unit that is present in the corpus and which includes the token 'up'

To access the complete list of sequences of words tagged as compound lexical entries in ICLEv2, click on 'Tools' and select 'View word lists'. All sequences of words tagged as multiword units by CLAWS are listed under ‘compound lexical entries’.
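The reason why the two queries return different results can be illustrated with a small Python sketch. It is a simplified model of the tagged corpus, not of Unitex itself: each unit carries exactly one tag, and a compound lexical entry counts as one unit. The tag labels other than 'Prep', the sample data and the function names are assumptions made for the example.

```python
# Each unit is (text, tag); a compound lexical entry is one unit with one tag.
# Invented sample data; tag labels other than "Prep" are hypothetical.
units = [
    ("She", "Pron"), ("was", "V"), ("fed up", "Adj"),
    ("and", "Conj"), ("walked", "V"), ("up", "Prep"),
    ("the", "Det"), ("hill", "N"), ("up to", "Adv"),
    ("the", "Det"), ("top", "N"),
]


def search_word_with_tag(units, word, tag):
    """Model of a query such as <up.Prep>: single units with that word and tag."""
    return [u for u in units if u[0].lower() == word and u[1] == tag]


def search_compounds_with_token(units, token):
    """Model of a compound query: multiword units that contain the token."""
    return [
        u for u in units
        if len(u[0].split()) > 1 and token in u[0].lower().split()
    ]


simple = search_word_with_tag(units, "up", "Prep")
compounds = search_compounds_with_token(units, "up")
print(simple)
print(compounds)
```

In this model the first search finds only the free-standing preposition, while 'fed up' and 'up to' surface only in the second search, which mirrors the behaviour described above.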

References

Granger, S. (2008) Learner Corpora. In Lüdeling, A. and M. Kytö (eds) Handbook on Corpus Linguistics. Mouton de Gruyter.