Corpus collection guidelines

1. Request students to fill in a learner profile

The VESPA learner profile provides researchers with information about contributors, so that meaningful conclusions can be drawn when the corpus is analysed. Using the profile, it will be possible both to draw general conclusions about advanced learner writing and to examine subsets of the corpus, e.g. Spanish mother tongue learners, learners who speak some English at home, or learners for whom German is the second language and English the third. It will also be possible to examine more sociolinguistic aspects, such as male/female comparisons. If the corpus is used as a basis for developing specifically adapted teaching tools, the potential advantages of this facility are clear.
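As a rough illustration of how profile information could be used to select such subsets, here is a minimal Python sketch. The `LearnerProfile` class, its field names and the `select` helper are hypothetical and do not reflect the actual VESPA questionnaire or database layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearnerProfile:
    """Hypothetical subset of profile fields, for illustration only."""
    student_code: str
    mother_tongue: str
    sex: str
    english_at_home: bool


def select(profiles, **criteria):
    """Return the profiles whose fields equal every given criterion."""
    return [
        p for p in profiles
        if all(getattr(p, field) == value for field, value in criteria.items())
    ]


# Invented example data, not taken from the corpus.
profiles = [
    LearnerProfile("UCL0001", "French", "F", False),
    LearnerProfile("UCL0002", "Spanish", "M", True),
    LearnerProfile("UCL0003", "Spanish", "F", False),
]

spanish_learners = select(profiles, mother_tongue="Spanish")
spanish_female = select(profiles, mother_tongue="Spanish", sex="F")
```

Criteria can be combined freely, so the same helper covers both a single-variable subset (mother tongue) and a cross-classification (mother tongue and sex).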

The VESPA learner profile is available in two forms. International partners can either:

ask their students to fill in a paper version of the questionnaire and a permission form, in which case VESPA partners have to encode the students' answers in an Excel file; or
ask their students to fill in an online questionnaire (including the permission form) hosted on one of the Université catholique de Louvain's servers (contact the VESPA project director for more details).

Each partner will have to assign a code to each student and ask them to use this code when they fill in the learner profile (and to be very careful to type it correctly!). A student code consists of 3 letters for the institution followed by 4 digits for the student. Thus, at the Université catholique de Louvain, we give students codes such as:

UCL0001
UCL0002
UCL0003

A student who contributes several texts to the VESPA corpus should be given only one code (and not one code per course!). This is the only way to identify several texts written by the same student while ensuring anonymity.
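Because mistyped codes would break the link between a student's texts, partners may find it useful to check codes automatically. The following Python sketch validates the format described above (3 letters + 4 digits) and groups texts by student code. It assumes that the institution letters are uppercase ASCII, as in the UCL examples; the function names and file names are hypothetical.

```python
import re

# Assumption: 3 uppercase ASCII letters (institution) + 4 digits (student).
STUDENT_CODE = re.compile(r"[A-Z]{3}[0-9]{4}")


def is_valid_student_code(code: str) -> bool:
    """Return True if the whole string matches the expected code format."""
    return STUDENT_CODE.fullmatch(code) is not None


def group_texts_by_student(texts):
    """Group (student_code, filename) pairs by student code.

    Raises ValueError on a malformed code, so that typing errors are
    caught before the texts enter the corpus.
    """
    grouped = {}
    for code, filename in texts:
        if not is_valid_student_code(code):
            raise ValueError(f"Malformed student code: {code!r}")
        grouped.setdefault(code, []).append(filename)
    return grouped


# Hypothetical file names, for illustration only.
grouped = group_texts_by_student([
    ("UCL0001", "report_a.docx"),
    ("UCL0002", "paper_b.docx"),
    ("UCL0001", "dissertation_c.docx"),
])
```

Here both texts by student UCL0001 end up under the same key, which is only possible if the student kept a single code across courses.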

2. Collect the right type of material

The corpus will consist entirely of L2 academic writing in a wide range of:

disciplines (linguistics, business, medicine, law, biology, etc.),
genres (papers, reports, MA dissertations), and
degrees of writer expertise in academic settings (from first-year students to PhD students).

Texts should be at least 500 words long (e.g. lab reports) but may be much longer (e.g. MA dissertations). They should be submitted in electronic format, which reduces the time spent typing up student texts and minimises the risk of introducing errors into the text.
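The 500-word minimum can be checked quickly before a text is accepted. The Python sketch below counts whitespace-separated tokens, which is only an approximation of a word count and is an assumption of this example, not the project's official counting method; the function names are hypothetical.

```python
MIN_WORDS = 500  # minimum text length stated in the guidelines


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (a rough approximation of words)."""
    return len(text.split())


def meets_minimum_length(text: str, minimum: int = MIN_WORDS) -> bool:
    """Return True if the text has at least `minimum` tokens."""
    return word_count(text) >= minimum
```

A word processor may report slightly different figures (e.g. for hyphenated words or footnotes), so borderline texts are best checked by hand.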

Work should be entirely the students' own, i.e. no help should be sought from third parties, but reference tools such as dictionaries and grammar books are acceptable (use of reference tools should be indicated on the learner profile questionnaire). Texts produced by more than one student (e.g. collaborative work) and revised versions of texts (e.g. following teachers' comments) should not be included in the corpus.

Argumentative, descriptive and narrative texts written on general subjects in language courses are not of interest. For this reason, the following types of titles should be avoided:

"Crime does not pay"
"Feminism has done more harm to the cause of women than good"
"Pollution: a silent conspiracy"
"The joys of the English countryside"
"My year in America"

Texts should only be collected in disciplinary content courses.

3. Text format

Student texts are usually submitted to the VESPA corpus as Microsoft Word documents. However, this format proves impractical for efficient processing of a corpus. The documents need to be converted to plain text format, which in turn requires pre-processing them to avoid loss of relevant information.

A number of computer tools (Word macros and Perl scripts) for semi-automatic and automatic processing of the collected texts were developed by A. Heuboeck (Reading University, UK) to facilitate the encoding and mark-up process. The VESPA macros and Perl scripts are largely based on those developed for the British Academic Written English (BAWE) corpus (cf. Ebeling & Heuboeck 2007; Heuboeck et al. 2008).

The VESPA corpus is encoded according to the standard proposed by the Text Encoding Initiative (TEI).

There are 3 main steps involved in the preparation of student texts for the VESPA corpus:

Step 1: Manual annotation of titles, sections, quotes, examples, etc.
Step 2: Automatic conversion to XML format.
Step 3: Finalization of the formatting by a cascade of Perl scripts: normalization of hyphens and dashes, transformation of Microsoft XML input to TEI-conformant tags, importation of contextual information from external spreadsheets, mark-up of sentence boundaries, etc.
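To give an idea of what two of the Step 3 operations involve, here is a minimal Python sketch of dash normalization and sentence-boundary mark-up. It is not the VESPA Perl code: the choice to map every dash variant to an ASCII hyphen, the naive punctuation-based sentence splitter, and the numbered `<s>` elements are all assumptions made for illustration.

```python
import re
from xml.sax.saxutils import escape

# Assumption: every Unicode hyphen/dash variant is mapped to an ASCII hyphen.
_DASH_TABLE = str.maketrans({
    "\u2010": "-",  # hyphen
    "\u2011": "-",  # non-breaking hyphen
    "\u2012": "-",  # figure dash
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
})


def normalize_dashes(text: str) -> str:
    """Replace typographic hyphens and dashes with a plain ASCII hyphen."""
    return text.translate(_DASH_TABLE)


def mark_sentences(text: str) -> str:
    """Wrap each sentence in a numbered <s> element.

    Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace,
    so abbreviations such as 'e.g.' will be split incorrectly.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return "".join(
        f'<s n="{i}">{escape(s)}</s>' for i, s in enumerate(sentences, start=1)
    )


marked = mark_sentences(normalize_dashes("A well\u2013known case. Is it? Yes!"))
```

Real academic prose needs a far more careful splitter (abbreviations, citations, decimal numbers), which is one reason the actual processing relies on a dedicated cascade of scripts.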

An interface for interactive manual annotation (Step 1) was developed in the form of a series of Word macros, written in Visual Basic and making use of graphical user interface possibilities. This interface has been set up to guide the tagger through the annotation process step by step.

As Ebeling & Heuboeck put it, the interface facilitates the human tagger’s task in various respects:

Operating within Word, the human tagger still has the original formatting available during the tagging process. Interpreting formatted text involves considerably less effort than interpreting unformatted text;
Tags can be selected from options, thus avoiding any typing. The options appear as checkboxes, radio buttons, drop down lists or labelled keys;
By organising the tagging process in two layers, i.e., first selecting from the functions available and then annotating these functions, the tagging interface, changing throughout the process, is always focussed on the function being annotated. The tagger only chooses relevant options for this function; and,
Thus, the tagging interface is tailored to the requirements of the [VESPA] corpus: both layers of annotation, functions and specific options describing their realisation, are designed to direct and limit the annotator’s choice […] (Ebeling & Heuboeck 2007: 251-252).

Step 2 relies on a Word macro that partners just need to run on a batch of VESPA files to convert them to XML format.

When they have a batch of VESPA texts that have gone through Steps 1 and 2, partners should run the files through the Perl scripts.

The corpus processing is described in detail in the VESPA manual.

References

Ebeling, S.O. & Heuboeck, A. (2007). Encoding document information in a corpus of student writing: the British Academic Written English Corpus. Corpora 2(2): 241-256.
Heuboeck, A., Holmes, J. & Nesi, H. (2008). The BAWE Corpus Manual. http://www.reading.ac.uk/AcaDepts/ll/app_ling/internal/bawe/BAWE.documentation.pdf