3 Population and samples
Statistical inference is the process of drawing conclusions about a population based on a random sample taken from it. A population refers to the complete set of units we aim to analyze and summarize. In a broader sense, a population can also be understood as a process that generates data — or, more precisely, the entire set of possible data that the process could produce if repeated infinitely many times. In such cases, the population may be infinite and/or inherently inaccessible. A sample, on the other hand, is a subset of the population, selected either randomly or in a manner intended to approximate randomness.
The following examples help build the necessary intuition.
3.1 Population and sample: examples
We draw a sample of 1000 American voters and ask about their preferences in the upcoming elections. Based on this, we estimate support for the Republican candidate among all voters in the USA. Why don't we analyze the entire population?
We randomly select 20 points on a globe (our sample) and use them to estimate the share (the proportion) of land on Earth's surface (the population). This example illustrates that while the sample has a fixed size, the population can be infinitely large.
We draw 20 bottles of wine (sample) from a new vintage (population) and check how much the selected wine tasters like them. (Note that if we were to send the entire population to the tasters, we would obtain highly precise information, which would be of little use to us.)
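The voter example can be sketched as a small simulation. This is only an illustration under stated assumptions: the "true" support of 0.48 is a made-up number (in a real poll it is exactly what we do not know), the names `TRUE_SUPPORT` and `draw_poll` are hypothetical, and the population is modelled as a data-generating process rather than as a finite list of voters.

```python
import random

random.seed(42)  # fixed seed so that the sketch is reproducible

TRUE_SUPPORT = 0.48  # hypothetical population share; unknown in a real poll
SAMPLE_SIZE = 1000   # number of voters asked, as in the example above


def draw_poll(true_support, n):
    """Simulate asking n randomly chosen voters; return the share who support the candidate."""
    answers = [random.random() < true_support for _ in range(n)]
    return sum(answers) / n


estimate = draw_poll(TRUE_SUPPORT, SAMPLE_SIZE)
print(f"population share: {TRUE_SUPPORT:.3f}, sample estimate: {estimate:.3f}")
```

Running the sketch with different seeds shows that the estimate changes from sample to sample, while the population value stays fixed.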
Sometimes it is more convenient to think, and speak, about the process that generates the data rather than about the population.
Further examples in which a sample and a population appear:
We roll a die 100 times (the sample). The number six comes up 40 times. Based on this, we try to determine whether the die (or, more precisely, the process that generates the results) is fair (unbiased, balanced).
We take 6 KLT crates produced by an injection molding machine and conduct strength tests, determining the maximum compressive force (as in the production plant of Schoeller Allibert in Zabrze). What is the population here?
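For the die example, one way to build intuition is to ask how likely 40 or more sixes in 100 rolls would be if the die were fair. The sketch below computes this probability from the binomial distribution. The function name `prob_at_least` is hypothetical, and the calculation assumes independent rolls with a probability of 1/6 for a six.

```python
from math import comb


def prob_at_least(k_min, n, p):
    """Probability of at least k_min successes in n independent trials with success probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


expected_sixes = 100 / 6                  # about 16.7 sixes expected from a fair die
p_tail = prob_at_least(40, 100, 1 / 6)    # chance of 40 or more sixes if the die is fair

print(f"expected number of sixes: {expected_sixes:.1f}")
print(f"P(at least 40 sixes | fair die) = {p_tail:.2e}")
```

A probability this small suggests that the observed sample would be very surprising if the data-generating process were a fair die.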
3.2 Sampling with and without replacement
When selecting objects from a set (e.g., drawing observations from a population into a sample), we can use sampling with replacement (where each selected object is returned to the pool and may be drawn again) or sampling without replacement (where selected objects are not returned and cannot be chosen again).
The easiest way to explain this is with examples:
If we roll a typical die three times, we are drawing integers from 1 to 6 with replacement.
If we have cards numbered from 1 to 6 and randomly choose three of them (simultaneously), we are dealing with sampling without replacement. In this second case, we will not get the number six twice.
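The two examples above can be sketched with Python's standard `random` module: `random.choices` draws with replacement and `random.sample` draws without replacement. The variable names are illustrative only.

```python
import random

random.seed(1)  # fixed seed so that the sketch is reproducible

numbers = [1, 2, 3, 4, 5, 6]

# Three rolls of a die: every roll can give any number, so repeats are possible.
with_replacement = random.choices(numbers, k=3)

# Three cards chosen at once from six: no card can appear twice.
without_replacement = random.sample(numbers, k=3)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```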
Note! It is not always possible to clearly determine whether we are dealing with sampling with or without replacement. Moreover, later on we will often assume by default that the population is large relative to the sample taken. In that case the mathematical formulas for sampling with and without replacement give very similar results, and we will not worry about the distinction.
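The claim that the two sampling schemes give similar results for a large population can be checked numerically. The sketch below compares the probability of getting k "successes" in a sample of 20 when drawing with replacement (binomial distribution) and without replacement (hypergeometric distribution). The population sizes (50 and 100,000) and the 40% share of successes are arbitrary values chosen for illustration, and the function names are hypothetical.

```python
from math import comb


def pmf_with_replacement(k, n, p):
    """Binomial probability of k successes in n draws with replacement."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def pmf_without_replacement(k, n, successes, population):
    """Hypergeometric probability of k successes in n draws without replacement."""
    failures = population - successes
    return comb(successes, k) * comb(failures, n - k) / comb(population, n)


def max_difference(population, share=0.4, n=20):
    """Largest gap between the two probabilities over all possible k."""
    successes = round(population * share)
    return max(
        abs(pmf_with_replacement(k, n, share)
            - pmf_without_replacement(k, n, successes, population))
        for k in range(n + 1)
    )


diff_small = max_difference(50)        # sample is 40% of the population
diff_large = max_difference(100_000)   # sample is a tiny fraction of the population

print(f"population of 50:      largest difference = {diff_small:.4f}")
print(f"population of 100,000: largest difference = {diff_large:.6f}")
```

When the sample is a large fraction of the population, the two schemes differ noticeably. When the population is much larger than the sample, the difference becomes negligible.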
3.3 Questions
Question 3.1 Which of the examples described in section 3.1 is sampling with replacement, and which without replacement?
Question 3.2 Find in the media or come up with other examples where inference about a population is made based on a sample.
Question 3.3 On a street in a certain city, random passers-by were stopped and asked the question: "Are you happy?" What can be considered the population in this example?