

The OpenCorpusCollection project is co-funded by the MiiL and the Cental.

Its aim is to provide research projects and researchers with open resources from social media (Twitter, Reddit, Instagram, TikTok...) whose sampling methods are clearly described and scientifically grounded. The theoretical framework underlying this project is the grounded theory approach from Lai and To (2015) and Tromble et al. (2017).

The corpora are text- and image-based and cover various languages, including English, French, Norwegian and Dutch.

These resources are provided with metadata detailing the sampling methods and other information.

OpenCorpusCollection is also developing a query tool for non-technical users (currently available in French only).

Project supervisors:

  • Cougnon Louise-Amélie (MiiL)
  • Watrin Patrick (Cental)

References:

  • Cougnon, L.-A., de Viron, L. and Watrin, P. (2022). Collection of Twitter Corpora for Human and Social Sciences: Sampling Methodology and Evaluation. White paper published on SocArXiv, 7 pages.
  • Lai, L. S. L., and To, W. M. (2015). Content analysis of social media: A grounded theory approach. Journal of Electronic Commerce Research, 16(2), 138-152.
  • Tromble, R., Storz, A., and Stockmann, D. (2017). We don’t know what we don’t know: When and how the use of Twitter’s public APIs biases scientific inference. Social Science Research Network, n°3079927.