The TeMa corpus: brief description

The TeMa corpus was collected as part of a research project on phraseology in language learning and teaching carried out at the CECL. The textbooks used to compile it were selected from recent best sellers on the international ELT market, in similar proportions from the most renowned publishers. Thirty-two volumes of English for General Purposes coursebooks were chosen for inclusion in TeMa; they correspond to the student’s book and workbook of ten textbook series, all at the advanced level and seven of them at the intermediate level too. In total, the TeMa corpus contains 724,174 words.

The TeMa corpus is divided into subcorpora corresponding to its various components: a first subdivision is based on the textbook series, the levels, the student’s book and the workbook; a second subdivision then classifies the types of pedagogical material (texts, tapescripts, vocabulary exercises and guidelines to the exercises). The first stage of the annotation process is to assign an identification number to each section of the corpus. Each textbook series is first given a code number. New Headway, for instance, has been assigned number 6. The two levels, advanced and intermediate, are then given a two-digit code: 61 for New Headway advanced and 62 for New Headway intermediate. Each level is then further split into student’s book and workbook, which are assigned the additional digits 1 and 2 respectively. Finally, each student’s book and each workbook is again broken down into four subcorpora: the texts (1), the tapescripts (2), the vocabulary exercises (3) and the guidelines to these exercises (4). Each coursebook series is thus divided into 8 subcorpora when all these components are available. Such mark-up makes it possible to link any excerpt to its origin: 6123, for example, identifies the vocabulary exercises of the New Headway advanced workbook.
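
The numbering scheme described above can be illustrated with a short Python sketch. It is not part of the TeMa tooling: the names `SectionId` and `decode_section_id` are hypothetical, and the sketch assumes that the series code is a single digit. Since only one series code is documented here (6 for New Headway), the series is returned as a raw digit rather than a title.

```python
from typing import NamedTuple

# Digit meanings taken from the description above.
LEVELS = {"1": "advanced", "2": "intermediate"}
BOOKS = {"1": "student's book", "2": "workbook"}
MATERIALS = {
    "1": "texts",
    "2": "tapescripts",
    "3": "vocabulary exercises",
    "4": "guidelines to the exercises",
}


class SectionId(NamedTuple):
    series: str    # raw series digit (assumption: one digit per series)
    level: str
    book: str
    material: str


def decode_section_id(code: str) -> SectionId:
    """Split a four-digit identification number into its components."""
    if len(code) != 4 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected four digits, got {code!r}")
    series, level, book, material = code
    try:
        return SectionId(series, LEVELS[level], BOOKS[book], MATERIALS[material])
    except KeyError as exc:
        raise ValueError(f"unknown digit {exc} in {code!r}") from None


print(decode_section_id("1213"))
```

Keeping the digit tables separate from the decoding function means that a series table could be added later without touching the parsing logic.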

The second stage of the annotation process, namely pedagogical tagging, has been applied to the vocabulary subcorpus. The main purpose of this tagging is to label the pedagogical tasks the learner has to perform. All the vocabulary exercises were tagged on the basis of the learning activities learners have to engage in and of the pedagogical status of the lexical items focused on. Figures 1 and 2 present a sample pedagogically annotated exercise from the student’s book of Clockwise Intermediate (Forsyth 2003). Figure 1 is a scan of the actual page of the book, whilst Figure 2 presents its annotated corpus version.

Figure 1: Vocabulary exercise as it appears in the textbook, before coding and inclusion in the corpus

1213(BC)that #$


1213(BC)too #$



1213(CB)He’s completely different 1213(AB)from# her$

1213(CB)They’re quite similar 1213(AB)to# each other in age$

1213(CB)I think she’s 1213(AB)too# young for him. She’ll get bored with him$

1213(CB)They’ve got a lot 1213(AB)in# common$

1213(CB)I think they’re quite a good couple : they look 1213(AB)very# similar$

1213(CB)the single woman looks quite like 1213(AB)no word# the older man-except 1213(AB)that# she’s a woman of course !$

1213(CB)there are so many differences 1213(AB)between# them ; they’ll split up before long !$

1213(CB)She looks about the same height 1213(AB)as# him$

Figure 2: Vocabulary exercise as it appears in the corpus, after annotation

As can be seen from Figure 2, each exercise is given a unique reference. The example presented, <CLISB-U6-P24-E1>, is taken from CLockwise Intermediate Student’s Book, Unit 6, Page 24, Exercise 1. The four-digit tag before each word or sentence (1213 in this case) is the identification number assigned to the corpus section. The two-letter tags in brackets, e.g. (BC), indicate the “pedagogical status” of the lexical items presented: (BC) is used when words are presented in a box (B) to be used to complete (C) exercise sentences, hence “box to complete”. The introductory tag at the start of each exercise line gives information on the pedagogical task to be carried out; in this example, (CB) means “complete the sentence with words from a box”. The tags within the exercise sentences, e.g. (AB), refer to the status of the lexical items: “from”, “to”, “too”, etc. are answers from a box, hence (AB). Each sentence ends with a dollar sign ($), and within the sentences the elements that are focused on (e.g. answers, highlighted elements) are followed by a hash (#). These additional signs make it easier to spot the beginning and the end of sentences as well as the exact lexical items practised.
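
The annotation format lends itself to automatic processing. The following Python sketch shows one way an annotated line could be split into its tagged segments. It is an illustration only, not the tool used to build or query TeMa: the names `Segment` and `parse_line` are hypothetical, and the sketch assumes that a focused item runs from the preceding tag up to the hash, as in the sample shown in Figure 2.

```python
import re
from typing import List, NamedTuple, Optional

# Four-digit identification number followed by a two-letter tag in brackets.
TAG = re.compile(r"(\d{4})\(([A-Z]{2})\)")


class Segment(NamedTuple):
    section_id: str        # e.g. "1213"
    tag: str               # e.g. "CB" or "AB"
    text: str              # segment text with the hash removed
    focus: Optional[str]   # item marked by '#', or None if there is none


def parse_line(line: str) -> List[Segment]:
    """Split one annotated line into its tagged segments."""
    line = line.strip()
    if not line.endswith("$"):
        raise ValueError("an annotated line must end with '$'")
    body = line[:-1]
    matches = list(TAG.finditer(body))
    segments = []
    for i, match in enumerate(matches):
        # A segment runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
        raw = body[match.end():end]
        # Assumption: the focused item is the text between the tag and '#'.
        before, hash_sign, after = raw.partition("#")
        focus = before.strip() if hash_sign else None
        text = (before + after).strip()
        segments.append(Segment(match.group(1), match.group(2), text, focus))
    return segments


for segment in parse_line("1213(CB)She looks about the same height 1213(AB)as# him$"):
    print(segment)
```

With segments of this kind, listing all the answers of an exercise amounts to collecting the `focus` values of the (AB) segments.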

The research goals are both descriptive and didactic.

On a descriptive level, using pedagogically tagged corpora makes it possible:

  1. to provide a solid empirical description of the material under analysis;
  2. to carry out longitudinal studies (e.g. comparing vocabulary selection from beginning to advanced levels);
  3. to perform cross-sectional analyses (e.g. what does ‘advanced’ mean in terms of vocabulary selection, grammatical content, or discourse focus?);  
  4. and even to raise editors’ or textbook writers’ awareness of, for instance, the types of exercises they propose, in what proportion, to what audience, etc.

On a more didactically-oriented level, results of corpus analyses carried out on pedagogically tagged corpora will enable researchers:

  1. to suggest ways of improving existing material by taking the best of many different worlds. For example, since cognitively oriented SLA research has demonstrated the importance of noticing, extrapolation and rehearsal (see, among others, De Bot et al. 2005), a contrastive analysis of pedagogically tagged corpora and reference corpora will help define areas where noticing could be favoured;
  2. to suggest possible targeted add-ons to existing textbooks. Learners’ dictionaries now almost invariably include a CD-ROM version containing extra material such as concordance lines, a thesaurus, extra examples or exercises. It would be reasonable to expect a similar evolution for textbooks, where the accompanying CD-ROM could include not only the audio files for listening comprehension but also more texts, more exercises and even easy-to-search corpora: in a word, more opportunities for both explicit and implicit learning.

Corpus availability:

TeMa cannot be distributed. If you are interested in accessing sections of TeMa for research purposes only, please contact Fanny Meunier.