‘Corpus Linguistics with R’ bootcamp

The ‘Corpus Linguistics with R’ bootcamp (12-16 Aug 2019) is a hands-on introduction to using the programming language R to analyze textual data (mostly corpora, but in principle also literary works, web data, etc.). It is based on the second edition (2016) of Gries’s textbook Quantitative corpus linguistics with R and introduces a variety of programming constructs required for text processing and corpus exploration, including but not limited to

  • building word frequency lists and computing type-token ratios;
  • computing dispersion and keyword statistics;
  • extracting concordance lines.

To that end, we will discuss the relevant functions and data structures, control flow structures such as loops and conditionals, and a sizable number of regular expressions; time permitting, we will also cover the basics of data visualization. The data used in this course come from a variety of differently formatted/annotated corpora and will also include 1-2 examples of literary works and/or XML processing.
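As a rough illustration of the kind of code involved (not taken from the course materials or the textbook), a frequency list, a type-token ratio, and a simple concordance can be sketched in base R as follows. The toy sentence and the helper name kwic are made up for this example; a real corpus would need more careful tokenization.

```r
# Toy text (an assumption for illustration); a real corpus file would
# typically be loaded with readLines() or scan().
text <- "The cat sat on the mat. The dog sat on the log."

# Lowercase, then split on runs of non-letter characters to get word tokens.
tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
tokens <- tokens[nzchar(tokens)]  # drop any empty strings left by the split

# Word frequency list, most frequent word first.
freq_list <- sort(table(tokens), decreasing = TRUE)

# Type-token ratio: number of distinct word forms divided by number of tokens.
ttr <- length(unique(tokens)) / length(tokens)

# Hypothetical helper: one concordance line per hit of `node`, with up to
# `span` words of context on each side.
kwic <- function(tokens, node, span = 2) {
  hits <- which(tokens == node)
  lines <- character(length(hits))
  for (i in seq_along(hits)) {
    h <- hits[i]
    # Guard the edges so a hit at the first or last token gets empty context.
    left <- if (h > 1) tokens[max(1, h - span):(h - 1)] else character(0)
    right <- if (h < length(tokens)) {
      tokens[(h + 1):min(length(tokens), h + span)]
    } else {
      character(0)
    }
    lines[i] <- paste(paste(left, collapse = " "),
                      paste0("[", node, "]"),
                      paste(right, collapse = " "))
  }
  lines
}

print(freq_list)
print(ttr)
print(kwic(tokens, "sat"))
```

The loop and the edge-case conditionals in kwic are deliberately explicit, since control flow of this kind is among the constructs the course covers.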

‘Statistics for linguistics with R’ bootcamp

The ‘Statistics for linguistics with R’ bootcamp (19-23 Aug 2019) is a hands-on introduction to statistical methods for both graduate students and seasoned researchers and is based on the second edition (2013) of Gries’s textbook Statistics for linguistics with R. The course is intended for linguists who already have a basic knowledge of statistics and some experience using R, and who wish to improve their proficiency in the statistical analysis of linguistic data. Using the open source software and programming language R, we will:

  • briefly recap basic aspects of statistical evaluation as well as several descriptive statistics;
  • briefly discuss a selection of monofactorial statistical tests for frequencies, means, and correlations, and how they constitute special (limiting) cases of regression methods;
  • explore different kinds of multifactorial and multivariate methods, in particular different kinds of regression approaches (fixed-effects only and mixed-effects modelling) as well as classification trees and random forests.

For all statistical methods explored, we will discuss how to test their assumptions and how to visualize their results with clear, annotated statistical graphs; in some cases we will reanalyze published data from corpus-linguistic studies. Participants will also receive small functions they can use in their own statistical applications. Time permitting, there will also be a short section on how to write such statistical/visualization functions yourself.
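To illustrate the point that monofactorial tests are special cases of regression, the following sketch (simulated toy data, not from the course or the textbook) compares an equal-variance two-sample t-test with a linear model that has a single binary predictor. The data frame toy and its column names are assumptions made up for this example.

```r
# Simulated toy data (an assumption for illustration): two groups of 20 scores.
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("a", "b"), each = 20)),
  score = c(rnorm(20, mean = 10), rnorm(20, mean = 11))
)

# Classical two-sample t-test. var.equal = TRUE requests the pooled-variance
# version; R's default is the Welch test, which does not match lm() exactly.
tt <- t.test(score ~ group, data = toy, var.equal = TRUE)

# The same comparison expressed as a linear regression with one binary predictor.
fit <- lm(score ~ group, data = toy)
coefs <- summary(fit)$coefficients

# The test of the slope for group "b" is the same test as the t-test above:
# same p-value, same t statistic up to its sign.
p_lm <- coefs["groupb", "Pr(>|t|)"]
t_lm <- coefs["groupb", "t value"]

print(c(t_test = tt$p.value, lm = p_lm))
```

The t statistics differ only in sign, because t.test() reports the difference as group a minus group b while the regression coefficient is group b minus group a.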

Typical schedule

Monday

  • 9.00-9.30 Welcome
  • 9.30-12.30 Class
  • 2.00-5.00 Class
  • 7.00 Welcome dinner

Tuesday to Friday

  • 9.00-12.15 Class
  • 1.45-5.00 Class

Class sessions of more than two hours include a 15-minute break.