**Population of interest and sampling**

A major step in preparing the data collection is to identify the **population of interest**. This means identifying which units or observations should be studied, as well as the timeframe, geographical location and conditions. For instance, the population of interest could be rainbow trout over 20 cm long found in the Mackenzie River between 2016 and 2018, or Belgian men who are patients at a specific hospital and take medication for high blood pressure.

In some cases, the whole population can be studied, but often this is not possible due to logistical, time, budget or ethical constraints. A **sampling** stage is then needed. Sampling consists of selecting a subset of the population of interest, which is then used to estimate the characteristics of the entire population.

In order to sample the population, a **sampling frame** needs to be identified. This is a list of all units in the population from which a sample can be drawn. For instance, possible sampling frames for the two examples above are all trout that can be caught in the Mackenzie River and meet the requirements, or a list of patient addresses obtained from the hospital.

**Representativeness**

An important characteristic of a sample is its **representativeness**. If the information collected on the sample is to be used to estimate population characteristics, the sample needs to be representative of the population. This means that all observations in the sample need to be part of the population of interest and that, taken together, they need to reflect its characteristics.

Ideally, a sample should be representative of the population in terms of all possible characteristics. Drawing a **random sample** is a way to achieve this, since random selection tends to balance all characteristics, including unmeasured ones, on average. In some cases, however, this type of sampling is not possible or suitable, and a sample representative of the population on **key parameters** is selected instead. For instance, a sample of rainbow trout could be representative of the population of rainbow trout in terms of weight, sex or age. A sample of male patients with high blood pressure could be representative of the population in terms of age, body-mass index and education level.

**Sample size**

The sample size affects the accuracy and validity of the research results. It is important to determine, before the data collection starts, what would constitute a sufficient sample.

It is possible to compute an ideal sample size before the data collection, based for instance on the expected effect size, the variability in the population, and the desired significance level or margin of error. Generally, a larger sample size leads to more precise estimates, but requires more resources for data collection. The ideal sample size is therefore a trade-off between competing constraints.
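As an illustration of such a calculation, the sketch below uses two common textbook approximations based on the normal distribution: one for estimating a proportion within a given margin of error, and one for comparing the means of two groups given a standardized effect size. The function names and default values (95% confidence, 80% power, a worst-case proportion of 0.5) are assumptions chosen for this example, not recommendations for a specific study.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin_of_error, confidence=0.95, p=0.5):
    """Smallest n that estimates a proportion within +/- margin_of_error.

    Uses n = z^2 * p * (1 - p) / e^2, ignoring any finite-population correction.
    p = 0.5 is the most conservative (largest n) assumption.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided comparison of two means.

    effect_size is the expected difference divided by the standard deviation.
    Uses the normal approximation n = 2 * ((z_alpha + z_beta) / d)^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


print(sample_size_for_proportion(0.05))  # 385
print(sample_size_per_group(0.5))        # 63
```

Halving the margin of error roughly quadruples the required sample size, which makes the trade-off between precision and resources concrete.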

**Designing the data collection phase**

Different approaches can be used to design a data collection phase, depending on the research field, population of interest and research objectives. Experimental planning consists of designing experiments in the most efficient way. Randomization is used to set up randomized controlled trials. Sampling methods aim at drawing samples from a sampling frame.

**Experimental planning**

To obtain the maximum benefit from a series of experiments, they must be properly designed. How can the experimental program be designed to achieve the experimental objectives in the simplest manner with the minimum number of measurements and the least expense? A successfully designed experiment is a series of organized trials which enables one to obtain the most experimental information with the least amount of effort. Three important questions to consider when designing experiments are:

· What are the types of errors to avoid?

· What is the minimum number of experiments that must be performed?

· When should we consider repeating experiments?

**Randomization**

Randomization is the process of assigning participants to groups at random, so that each participant has the same chance of being assigned to any given group. Randomization is the best method for removing selection bias between groups of patients.

Randomization is often used in medical research. It ensures that the different groups being studied have similar characteristics, on average, when the study begins, allowing a fair comparison.
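A minimal sketch of random allocation is shown below. It shuffles the list of participants and deals them out to the groups in turn, which keeps group sizes balanced; this is only one of several possible randomization schemes. The function name, group labels and participant identifiers are hypothetical.

```python
import random


def randomize(participants, groups=("treatment", "control"), seed=None):
    """Randomly allocate participants to groups of (nearly) equal size."""
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)  # random order, so position carries no information

    allocation = {group: [] for group in groups}
    for i, participant in enumerate(shuffled):
        # Deal participants out to the groups in turn, like cards
        allocation[groups[i % len(groups)]].append(participant)
    return allocation


# Hypothetical participant identifiers
participants = [f"P{i:02d}" for i in range(1, 21)]
allocation = randomize(participants, seed=2024)
for group, members in allocation.items():
    print(group, len(members), sorted(members))
```

Recording the seed alongside the allocation allows the assignment to be audited later.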

**Sampling methods**

When the data collection can rely on a sampling frame, different sampling methods can be used to select the sample. Common sampling methods include:

Simple random sampling: units are randomly selected in the population – all samples of the same size have the same probability of being selected and all individuals have the same probability of being selected.

Systematic sampling: based on a sampling interval k, every kth unit of the sampling frame is selected, starting from a unit chosen at random among the first k.

Stratified sampling: units are sampled independently in homogeneous sub-groups of the population called strata, for instance regions in a country or age groups.

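Cluster sampling: the population is divided into heterogeneous, naturally occurring sub-groups called clusters, for instance classes in a school or departments in a company. A random selection of clusters is drawn, and all units (or a sample of units) within the selected clusters are studied.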

These sampling methods are probabilistic and aim at estimating parameters of interest in the population. They can be combined to produce complex multi-level designs.
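The four methods can be sketched on a toy sampling frame as follows. The frame, its fields (`age_group`, `ward`) and all function names are hypothetical, and each function is a simplified illustration: the stratified version samples the same fraction in every stratum, and the cluster version draws whole clusters and keeps all of their units (one-stage cluster sampling).

```python
import random


def simple_random_sample(frame, n, rng):
    """Draw n units without replacement; every unit is equally likely."""
    return rng.sample(frame, n)


def systematic_sample(frame, k, rng):
    """Take every k-th unit, starting from a random unit among the first k."""
    start = rng.randrange(k)
    return frame[start::k]


def stratified_sample(frame, stratum_of, fraction, rng):
    """Sample the same fraction independently within each stratum."""
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for units in strata.values():
        n = max(1, round(fraction * len(units)))
        sample.extend(rng.sample(units, n))
    return sample


def cluster_sample(frame, cluster_of, n_clusters, rng):
    """Randomly select whole clusters and keep all units inside them."""
    clusters = {}
    for unit in frame:
        clusters.setdefault(cluster_of(unit), []).append(unit)
    chosen = rng.sample(list(clusters), n_clusters)
    return [unit for c in chosen for unit in clusters[c]]


# Hypothetical frame: 100 patients in two age groups and ten wards
frame = [
    {"id": i, "age_group": "under_50" if i < 60 else "50_plus", "ward": i % 10}
    for i in range(100)
]
rng = random.Random(42)

print(len(simple_random_sample(frame, 10, rng)))
print([u["id"] for u in systematic_sample(frame, 10, rng)])
print(len(stratified_sample(frame, lambda u: u["age_group"], 0.1, rng)))
print(len(cluster_sample(frame, lambda u: u["ward"], 2, rng)))
```

In this sketch every call returns a list of units from the frame, so the designs can be combined, for example by drawing a simple random sample within each selected cluster.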

**Useful resources:**

SMCS trainings

Sampling

https://sites.uclouvain.be/smcs-gateway/index.php?page=documentation&spage=methodes&id=450&l=fr

Sample size calculation

https://sites.uclouvain.be/smcs-gateway/index.php?page=documentation&spage=methodes&id=436&l=fr

Randomization

https://sites.uclouvain.be/smcs-gateway/index.php?page=documentation&spage=methodes&id=456&l=fr