7 Internal validity

So far, you have learnt to ask a RQ, select a study type, and select a sample.

In this chapter, you will learn about internal validity for experimental studies. You will learn to:

maximise the internal validity of studies.
manage confounding in studies.
explain, identify and manage the Hawthorne, observer, placebo and carry-over effects in studies.
explain different types of blinding.

7.1 Introduction

A well-designed study is needed to draw solid conclusions: a study with high external validity (Sect. 3.1) and high internal validity (Sect. 3.2). Some research design decisions to maximise internal validity are discussed in this chapter.

Example 7.1 (Importance of internal validity) Beaman et al. (2013) describe an experiment where free fertilizer was provided to a sample of female farmers in Mali (at the recommended rate, or at half the recommended rate).

All farmers knew they were part of a study, so changed their farm management: they employed more hired labour and used more herbicide than usual. Consequently, the yields for all farmers improved. Knowing if changes in yield were the result of applying the fertilizer is difficult, as the study had poor internal validity.

Specific design strategies for maximising internal validity include:

managing confounding (Sect. 7.2).
managing the Hawthorne effect by blinding individuals (Sect. 7.3).
managing the observer effect by blinding the researchers (Sect. 7.4).
managing the placebo effect by using controls, objective measures and blinding (Sect. 7.5).
managing the carry-over effect by using washouts (Sect. 7.6).

Not all of these strategies will be relevant to every study.

7.2 Managing confounding

Example 7.2 (Himalaya study) Consider this relational RQ (based on Bird et al. (2008)):

Among Australians, is the average faecal weight the same for people eating provided food made from wholegrain Himalaya 292 compared to eating provided food made from refined cereal?

Suppose that the researchers created two groups of individuals for this experimental study:

Group A: women recruited from a female-only gym.
Group B: men recruited from a local nursing home.

The researchers gave Himalaya 292 to Group A, and the refined cereal to Group B. If a difference in faecal weight was detected between the two groups, the difference may be due to:

the different diets (the explanatory variable) for each group;
the different sexes in each group (Group A was all women; Group B was all men);
the different ages in each group (Group A is likely to be younger on average than those in Group B); or
the different overall health in each group (Group A would generally be healthier than those in Group B).

Any difference in faecal weight detected between the two groups may not be due to the diets (Table 7.1): the study has very poor internal validity, due to poor research design.

Sex, age and overall health are confounding variables (Def. 3.6): they are associated with the type of diet (explanatory variable) and faecal weight (response variable). For example, the age of the subject may be associated with faecal weight (older people tend to eat less, and eat differently, than younger people), and the research design means older people are more likely to be consuming the refined cereal. This is an extreme case of confounding (Fig. 7.1); usually, confounding is more subtle (and more difficult to detect) than in this example.

TABLE 7.1: Comparing Groups A and B: an extreme example of confounding.

Group A	Characteristic	Group B
Women	Sex	Men
Younger (in general)	Age	Older (in general)
Himalaya 292	Cereal	Refined
Very fit (in general)	Fitness	Less fit (in general)

FIGURE 7.1: An extreme example of confounding.

The groups being compared should be as similar as possible, apart from the difference being studied.

Since the groups being compared should be as similar as possible, apart from what is being studied, researchers often compare the comparison groups on potential confounding variables.

In experimental studies, an excellent way to manage confounding is:

Randomly allocating individuals to the comparison groups.

Random allocation should ensure that the values of potential confounding variables are approximately evenly distributed between the comparison groups. This is true for identified potential confounders (such as age), and also for variables that were not even considered as confounders or that are hard to measure or observe (such as genetic conditions). One of the comparison groups is often a control group (Def. 2.16).

Example 7.3 Lian et al. (2024) studied alleviating post-operative thirst experienced by patients admitted to the intensive care unit, by comparing standard procedures with the use of ice-water spray. To use random allocation of patients to the two groups, the researchers:

... assigned unique numbers from \(1\) to \(56\) according to [patients'] admission order [...] two-digit numbers were read from the random number table's rows and columns, generating random values that were matched with the respective admission numbers [...]

Any patient assigned a number from \(1\) to \(28\) (inclusive) was allocated to the control group, while patients assigned numbers \(29\) to \(56\) were allocated to the experimental group.
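Random allocation like that in Example 7.3 can also be done with software rather than a random number table. The following is only a minimal sketch, not the procedure used by Lian et al. (2024): the function name `allocate` and the seed value are assumptions made for illustration.

```python
import random

def allocate(n_individuals=56, seed=2024):
    """Randomly split admission numbers 1..n_individuals into two equal groups."""
    rng = random.Random(seed)      # the seed is arbitrary; it only makes the result repeatable
    ids = list(range(1, n_individuals + 1))
    rng.shuffle(ids)               # the random order plays the role of the random number table
    half = n_individuals // 2
    control = sorted(ids[:half])
    experimental = sorted(ids[half:])
    return control, experimental

control, experimental = allocate()
print(len(control), len(experimental))
```

Every individual ends up in exactly one group, and neither the researchers nor the individuals choose the group.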

Example 7.4 Witmer and Pipas (2020) studied using bear faeces to prevent bears damaging trees in an Idaho forest. The researchers painted the bear faeces on a sample of trees. As a control, researchers could take observations from trees that they had not approached, and hence had no bear faeces applied. However, if a difference was found between the trees with bear faeces and trees they had not approached, the difference may have been due to the presence of humans near the trees rather than the treatment (i.e., poor internal validity).

For this reason, the control group comprised trees on which the researchers applied water. This is a better control, since trees in both groups (faeces, water) had been approached by humans. Now, if a difference was found between the faeces and water-sprayed trees, the presence-of-humans explanation has been eliminated.

Randomly allocating individuals to comparison groups is not possible in observational or quasi-experimental studies. For this reason, confounding is often a major threat to internal validity in these studies, as individuals who are in one comparison group may be different, in general, to those who are in another group.

Fortunately, other (though less effective) means for managing confounding also exist.

Restricting the study to a certain subgroup of the population.

Sometimes, specifically excluding or including members of the population is helpful for reducing confounding. For the Himalaya 292 study, for example, age is a potential confounder: older people have different dietary needs, general health and gut health when compared to younger people. Hence, the researchers may decide to use an inclusion criterion, restricting the study to people aged from \(30\) to \(50\).

In addition, some people may have specific conditions or diseases that mean participating in the study will be problematic. For instance, coeliacs have an autoimmune disorder which results in an intolerance to gluten (found in wheat, barley and rye). Hence, the researchers may decide to use exclusion criteria, excluding coeliacs from participating in the study. Those individuals that are excluded from the population are not less important than those individuals that are included.

Inclusion and exclusion criteria may be applied for other reasons too; for example, to clarify a population of interest, to address ethical concerns (e.g., by excluding children) or to exclude rare and unusual individuals.

Definition 7.1 (Inclusion and exclusion criteria) Inclusion criteria are characteristics that individuals must meet explicitly to be included in the study.

Exclusion criteria are characteristics that explicitly disqualify potential individuals from being included in the study.

Exclusion and inclusion criteria clarify which individuals are explicitly included in, or excluded from, the population for the purposes of the study, and should be explained when their purpose is not obvious. Exclusion and inclusion criteria are not necessary; none, one or both may be used. The variables used in these criteria are a type of control variable (Def. 3.5).
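Applying inclusion and exclusion criteria amounts to filtering the potential participants. The sketch below uses the Himalaya 292 criteria suggested above (aged \(30\) to \(50\); not coeliac). The candidates, and the names `Candidate` and `eligible`, are invented for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    age: int
    coeliac: bool

def eligible(c, min_age=30, max_age=50):
    """Return True if the candidate meets the inclusion criterion and no exclusion criterion."""
    meets_inclusion = min_age <= c.age <= max_age   # inclusion: aged 30 to 50
    is_excluded = c.coeliac                         # exclusion: coeliacs
    return meets_inclusion and not is_excluded

# Hypothetical candidates (not data from any study)
candidates = [
    Candidate("P1", 34, False),
    Candidate("P2", 62, False),   # outside the age range
    Candidate("P3", 41, True),    # coeliac
    Candidate("P4", 50, False),
]

enrolled = [c.name for c in candidates if eligible(c)]
print(enrolled)  # ['P1', 'P4']
```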

Example 7.5 (Inclusion and exclusion criteria) In a strength study where the population is 'concrete test cylinders', cylinders with severe cracks may be excluded.

In a study of exercise regimes for people over \(60\), severe asthmatics may be excluded from the study for health reasons.

Example 7.6 (Inclusion, exclusion criteria) Mackowiak, Wasserman, and Levine (1992) studied men and women aged \(18\) to \(40\); this is the population. The exclusion criteria include people under \(18\) years of age and over \(40\) years of age; alternatively, the inclusion criteria are people aged between \(18\) and \(40\) years of age. Either of these can be stated; both are not needed.

In a study on the influenza vaccine, Kheok et al. (2008) listed the population as 'health-care workers' (Kheok et al. 2008, 466), and the sample comprised healthcare workers at two specific hospitals. The population was refined using exclusion criteria: those (p. 466)

...declining to give consent, a history of egg protein allergy, and neurological or immunological conditions that are contraindications to the influenza vaccine.

Example 7.7 (Inclusion and exclusion criteria) Guirao et al. (2017) studied the walking abilities of amputees. Inclusion criteria included (p. 27):

... length of the femur of the amputated limb of at least \(15\,\text{cm}\) measured from the greater trochanter; use of the prosthesis for at least \(12\) months prior to enrollment and more than \(6\,\text{h}\)/day...

Exclusion criteria included (p. 27) people with:

... cognitive impairment hindering the ability to follow instructions and/or perform the tests; body weight over \(100\,\text{kg}\)...

Blocking, when units of analysis are arranged into different groups containing individuals that are similar to each other (see Sect. 29.7 for an example).

For the Himalaya 292 study, for example, subjects may be paired (i.e., groups of two). That is, each person is paired with another person of the same sex and of a similar age; one of the pair is given the Himalaya 292 diet, and the other is given the refined cereal diet.

Definition 7.2 (Blocking) Blocking occurs when units of analysis are analysed as separate groups of similar units (called blocks).
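The pairing described for the Himalaya 292 study can be sketched in code. This is only an illustration under stated assumptions: the subjects are invented, the function name `paired_allocation` is hypothetical, and each sex is assumed to have an even number of subjects so that every pair contains two people of the same sex.

```python
import random

# Hypothetical subjects: (id, sex, age). Invented values, for illustration only.
subjects = [
    ("s1", "F", 31), ("s2", "F", 33), ("s3", "F", 47), ("s4", "F", 45),
    ("s5", "M", 38), ("s6", "M", 36), ("s7", "M", 49), ("s8", "M", 50),
]

def paired_allocation(subjects, seed=1):
    """Form pairs (blocks) of similar subjects, then randomly split each pair between the diets."""
    rng = random.Random(seed)
    # Sort by sex, then age, so that neighbours in the list are similar
    ordered = sorted(subjects, key=lambda s: (s[1], s[2]))
    allocation = {}
    for i in range(0, len(ordered) - 1, 2):
        pair = [ordered[i], ordered[i + 1]]
        rng.shuffle(pair)                       # random order within the pair
        allocation[pair[0][0]] = "Himalaya 292"
        allocation[pair[1][0]] = "Refined cereal"
    return allocation

allocation = paired_allocation(subjects)
for subject_id, diet in sorted(allocation.items()):
    print(subject_id, diet)
```

Within each pair, sex is the same and age is similar, so neither variable can explain a difference between the two diets within that pair.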

Analysing using special methods (beyond this book), after recording the values of potential confounding variables.

To use this approach, recording all potential extraneous variables is important. Most studies involving people record the participants' age and sex if possible, as these two variables are common confounders. Once a sample is obtained, recording this extra information usually requires little extra effort. Then, these extraneous variables can be factored into the analysis.

Restricting and blocking are useful if one or two confounding variables are suspected. Multiple approaches can be used, such as randomly allocating individuals to groups, and recording other variables that can be managed through analysis.

Randomly allocating is superior when possible, because confounding is reduced for variables not even suspected as being confounders. Hence, experimental studies should use random allocation whenever possible.

For any study (but especially for observational and quasi-experimental studies), recording the values of any potential confounding variables is useful, so special analysis methods can be used to manage confounding.

Record all the extraneous variables likely to be important for understanding the data (Sect. 7.8). This may include information about the individuals in the study, and the circumstances of the individuals in the study (that is, the circumstances the individuals find themselves in; these may not be measured on the individuals themselves).

Example 7.8 (Managing confounding: experimental study) For the Himalaya study, different methods can be used to manage confounding due to age.

The study could be restricted to people under \(30\). Age would be a control variable.

Blocking could be used by finding similar pairs of subjects (e.g., pairs of subjects of the same sex, with similar age and weight). One of each pair is given the refined cereal diet, and one given the Himalaya 292 diet. The differences in faecal weight for each pair can be analysed using special methods (see Chap. 29 for example).

Information about the individuals could be recorded, such as age and pre-study weight. Information about the circumstances of the individuals could also be recorded, such as where they live. Then, special methods of analysis could be used to analyse the data.

Since the study is experimental, participants could be randomly allocated into one of two groups, so both groups would have a similar distribution of ages (and other potential confounders). Then groups could be randomly allocated to receive one of the diets (Fig. 7.2).

In the Himalaya 292 study, individuals were randomly allocated to the diets (p. \(1\,033\)), which manages confounding due to age and other potential confounding variables also.

FIGURE 7.2: Random allocation can occur in two places for the Himalaya study.

An experiment to study the effect of using ginkgo to enhance memory (Solomon et al. 2002) compared two groups: one using ginkgo (\(n = 111\)), and one using a fake, non-active supplement (\(n = 108\)). The authors randomly allocated participants to each group, then compared the two groups to ensure that no obvious differences initially existed between the groups that might explain differences in the response variable (Table 7.2).

The two groups are similar in terms of age, education and gender distribution. Any difference in outcome between the groups is probably due to the treatment.

TABLE 7.2: Comparing the two groups in the ginkgo-memory study.
Characteristic	Group A (ginkgo)	Group B (Fake)
Average age (in years)	68.7	69.9
Men (number; percentage)	46 (41%)	45 (42%)
Average years of education	14.4	14.0
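The comparison in Table 7.2 involves only simple arithmetic, which can be checked directly. The sketch below uses the values from Table 7.2; the variable and function names are chosen for illustration, and no formal test of the differences is attempted.

```python
# Values copied from Table 7.2
ginkgo = {"n": 111, "mean_age": 68.7, "men": 46, "mean_education": 14.4}
fake = {"n": 108, "mean_age": 69.9, "men": 45, "mean_education": 14.0}

def pct_men(group):
    """Percentage of the group who are men."""
    return 100 * group["men"] / group["n"]

age_gap = fake["mean_age"] - ginkgo["mean_age"]
education_gap = ginkgo["mean_education"] - fake["mean_education"]

print(f"Men: {pct_men(ginkgo):.0f}% (ginkgo) vs {pct_men(fake):.0f}% (fake)")
print(f"Difference in average age: {age_gap:.1f} years")
print(f"Difference in average education: {education_gap:.1f} years")
```

The percentages of men (about \(41\)% and \(42\)%) agree with the table, and the gaps in average age and education are small relative to the values themselves.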

Researchers explored the use of dominant and non-dominant hands for chest compression in student paramedics using an experimental study (Cross et al. 2019). Students were randomly divided into two groups: DHOC (dominant hand on chest) and NDHOC (non-dominant hand on chest). The two groups were then compared:

Demographic	All participants (\(n = 75\))	DHOC (\(n = 37\))	NDHOC (\(n = 38\))
Average age (years)	\(23.4\)	\(22.5\)	\(24.3\)
Gender: percentage Female	\(51\)%	\(53\)%	\(47\)%

The two groups appear to be very similar in terms of average age of participants, and the percentage of female participants. If differences are observed in the study between the DHOC and NDHOC groups, they are probably due to the treatment. The study should have reasonable internal validity.

Example 7.9 (Managing confounding: observational study) Froud, Beresford, and Cogger (2018) studied \(2\,599\) kiwifruit orchards using an observational study, exploring the relationship between the time since a bacterial canker was first detected (in weeks) as the explanatory variable, and the orchard productivity (in tray-equivalents per hectare) as the response variable.

The researchers also recorded extraneous variables such as 'whether the farm was organic', 'elevation of the orchard' and 'whether general fungicides were used'. These variables were used in their analysis to manage the potential effects of confounding.

Example 7.10 (Comparing study groups: observational study) An observational study compared the iron levels of active and sedentary women aged \(18\) to \(35\) (Woolf et al. 2009). The active women (\(n = 28\)) and sedentary women (\(n = 28\)) were compared on a variety of characteristics (Table 7.3). The active women were similar to the sedentary women on these characteristics, but were (in general) slightly younger, slightly heavier, and slightly more likely to use hormonal contraceptives.

TABLE 7.3: The demographic information for those in the study of iron levels in women.
Characteristic	Active women	Sedentary women
Average age (in years)	\(20\)	\(24\)
Average weight (in kg)	\(68\)	\(62\)
Percentage using hormonal contraceptives	\(13\)	\(11\)

A study (Gunnarsson et al. 2017) examined the difference between two types of helicopter transfer (physician-staffed; non-physician-staffed) of patients with a specific type of myocardial infarction (STEMI). The purpose of the study was:

...to evaluate the characteristics and outcomes of physician-staffed HEMS (Physician-HEMS) versus non-physician-staffed (Standard-HEMS) in patients with STEMI.

--- Gunnarsson et al. (2017), p. 1

The researchers

...studied \(398\) STEMI patients transferred by either Physician-HEMS (\(n = 327\)) or Standard-HEMS (\(n = 71\)) for [...] intervention at \(2\) hospitals between 2006 and 2014.

--- Gunnarsson et al. (2017), p. 1

Since the study is an observational study (patients were not allocated by the researchers to the type of helicopter transport), the researchers recorded information about the patients being transported. They compared the patients in both groups, and found (for example) that both groups had similar average ages, and similar percentages of females and smokers, and so on. They also compared information about the transportation, and found (for example) that both groups had similar average flight times and flight distances.

One conclusion from the study was that 'Patients with STEMI transported by Standard-HEMS had longer transport times' (p. 1), but one limitation of the study was that:

The patient cohorts received treatment by \(2\) different care teams at two hospitals, which is a potential confounder despite similar baseline characteristics

--- Gunnarsson et al. (2017), p. 5

In other words, the difference between hospitals and the staff may have been a confounding variable.

Observational studies can (and often do) have control groups. Indeed, one specific type of observational study is called a case-control study (Sect. 4.6.2). However, individuals are not allocated to the control group by the researchers in observational studies, so the control and study groups may be very different, which may explain any differences in the outcome.

Random sampling and random allocation are different concepts (Fig. 7.3), with different purposes, but are often confused:

Random sampling impacts external validity. Its purpose is finding individuals to study, and is possible in both observational and experimental studies.
Random allocation impacts internal validity. Its purpose is allocating treatments to individuals, which helps manage confounding by distributing possible confounders across the treatment groups. It is only possible in experimental studies, since treatments are not allocated in observational studies.

FIGURE 7.3: Comparing random allocation and random sampling.
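The difference between the two ideas is perhaps clearest when both appear in the same sketch. The population, sample size and seed below are all invented for illustration.

```python
import random

rng = random.Random(7)   # arbitrary seed, so the sketch is repeatable

# A hypothetical population of 1000 people
population = [f"person_{i}" for i in range(1, 1001)]

# Random sampling: decides WHO is studied (relevant to external validity)
sample = rng.sample(population, 40)

# Random allocation: decides WHICH TREATMENT each sampled person receives
# (relevant to internal validity; only possible in experimental studies)
shuffled = sample[:]
rng.shuffle(shuffled)
treatment_group = shuffled[:20]
control_group = shuffled[20:]

print(len(sample), len(treatment_group), len(control_group))
```

An observational study could use the first step (random sampling) but not the second, since the researchers do not allocate treatments.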

7.3 Hawthorne effect and blinding individuals

People, and perhaps animals, may behave differently if they know (or think) they are being watched, which could compromise the internal validity of the study. This is called the Hawthorne effect.

Definition 7.3 (Hawthorne effect) The Hawthorne effect is the tendency of individuals to change their behaviour if they know (or think) they are being observed.

Example 7.11 (Hawthorne effect: observational study) Wu et al. (2018) examined hand hygiene (HH) of staff in a tertiary teaching hospital, using covert observers (observers not obviously watching the HH practices) and overt observers (observers obviously watching the HH practices). HH compliance was higher with overt observation (\(78\)%) than with covert observation (\(55\)%).

The impact of the Hawthorne effect can be minimized by blinding the individuals in the experiment, so that:

the individuals do not know that they are participating in a study; and/or
the individuals do not know the aims of the study; and/or
the individuals do not know which comparison group they are in.

In experimental studies, people are often informed that they are in a study, due to ethics requirements (Sect. 5.2); they may not, however, know which treatment they have received. In observational studies, individuals may or may not know they are being observed. For instance, in an observational study where subjects' blood pressure is measured, subjects clearly know they are being observed, which has the potential to alter the subjects' behaviour (for example, people become tense, called 'white-coat hypertension'). As far as possible, efforts should be made to ensure that individuals do not know that they are being observed (the participants are blinded).

Example 7.12 (Hawthorne effect: experimental study) For the Himalaya study (Example 7.2), the article reports that (p. \(1\,033\)):

The study was explained fully to the subjects, both verbally and in writing, and each gave their written, informed consent...

That is, the subjects knew they were in a study, and knew the aims of the study, so the Hawthorne effect may influence the results in this study. However, the subjects did not know which diet they were given.

Example 7.13 (Hawthorne effect: experimental study) People are more health-conscious if they know they will be examined regularly. For example, a study aiming to increase fruit and vegetable intake in young adults (Clark et al. 2019) noted that the observed increases in intake 'could be explained by the Hawthorne effect' as adults 'know they are being observed...' (p. 96).

Example 7.14 (Hawthorne effect: observational study) During the COVID-19 lockdowns in Denmark, Olesen and Feldthaus (2021) covertly observed adults entering a large mall in Copenhagen. They noticed that (p. 1)

Almost all subjects [\(340/345\) (\(99\)%)] wore a personal protective face mask, but only \(141\) (\(41\)%) made use of the hand sanitizer.

Using masks and hand sanitizer were recommended by the Danish Health Authority, but adherence to the two safety measures was very different. The authors surmised (p. 1):

... wearing a face mask corresponded to being observed continuously [...] hand hygiene takes moments to perform, and no one can see whether or not it has been done.

In other words, wearing a face mask is obvious (that is, others could observe whether the subject was adhering to this guideline) but hand hygiene is not (so others could not observe whether the subject was adhering to this guideline). The authors conclude that 'the Hawthorne effect may explain why almost all subjects wore a face mask'.

7.4 Observer effect and blinding researchers

Perhaps surprisingly, researchers' expectations or hopes may unconsciously influence how the researchers interact with the individuals and record observations. In addition, this may (unconsciously) influence the behaviour of the individuals in the study. This is called the observer effect. (In experiments, it is sometimes called the experimenter effect.) This could compromise the internal validity of the study.

Definition 7.4 (Observer effect) The observer effect occurs when the researchers unconsciously change their behaviour to conform to expectations because they know what values of the explanatory variable apply to the individuals. This may then cause the individuals to change their behaviour or reporting also.

The impact of the observer effect can be minimized by blinding the researchers, so that they do not know which treatments the individuals are receiving. The researchers giving the treatment and the researchers evaluating the treatment can both be blinded, by using a third party. For example, the researchers may give an assistant two drugs, labelled A and B. The assistant administers the drug and evaluates the participants' response to the treatments. Later, the assistant tells the researchers whether Drug A or Drug B performed better, but only the researchers know which drugs the labels A and B refer to (Fig. 7.4).

FIGURE 7.4: Using a third party to avoid the observer effect.
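The labelling scheme with a third party can be sketched as follows. This is only an illustration: the function name `make_blinding_key`, the treatment names, the seed and the assistant's report are all hypothetical.

```python
import random

def make_blinding_key(treatments, seed=None):
    """Randomly attach the neutral labels 'A', 'B', ... to the treatments."""
    rng = random.Random(seed)
    shuffled = list(treatments)
    rng.shuffle(shuffled)
    labels = [chr(ord("A") + i) for i in range(len(shuffled))]
    return dict(zip(labels, shuffled))

# The researchers create and keep the key (the treatment names are hypothetical)
key = make_blinding_key(["Drug X", "Drug Y"], seed=3)

# The assistant works only with the labels, and reports a label back
labels_seen_by_assistant = sorted(key)
better_label = "A"   # hypothetical report from the assistant

# Only now do the researchers use the key to interpret the report
better_treatment = key[better_label]
print(labels_seen_by_assistant, "->", better_treatment)
```

The assistant never sees the key, so the assistant's expectations about either drug cannot influence how the treatments are given or evaluated.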

Example 7.15 (Observer effect: experimental study) Seo et al. (2020) examined the impact of an injection to alleviate post-operative umbilical pain, and stated (p. 392):

...the postoperative pain scores were gathered by a nurse practitioner who was blinded to the usage of bupivacaine to avoid observer-expectancy bias [i.e., the observer effect].

The observer effect does not just apply to situations with people as individuals.

Example 7.16 (Observer effect) 'Clever Hans' was a horse that seemed to perform simple mental arithmetic. By using an experiment where the people interacting with the horse were blinded, Carl Stumpf realised that the horse was responding to involuntary (and unconscious) cues from the trainer.

The same effect has been observed in narcotic sniffer dogs (Bambauer 2012), who may respond to their handlers' unconscious cues.

The observer effect occurs when the researchers unconsciously influence the individuals, and are not aware it is occurring. Intentionally influencing the individuals is fraud.

The observer effect can impact observational as well as experimental studies. For example, consider a study measuring the blood pressure of smokers and non-smokers (Verdecchia et al. 1995). This study is observational (individuals cannot be allocated to be a smoker or non-smoker), but if the researchers know if an individual is a smoker when they measure blood pressure, then the observer effect could still impact the results (recalling that the observer effect is an unconscious effect). For example, the researchers may expect smokers to have a high blood pressure.

The observer effect could be managed by first measuring the blood pressure, and then asking if the individual was a smoker or not. That is, the researchers may be blinded to whether the subject is a smoker when they measure blood pressure. This may only be partially successful; the researcher may see the subject carrying cigarettes, or smell smoke on their breath, for example. Nonetheless, since it may prove at least partially successful and is easy to implement, this strategy should form part of the research design.

Example 7.17 (Observer effect: observational study) Zimova et al. (2020) took photos of snowshoe hares, at various stages of moulting and in various environmental conditions. Eighteen independent observers rated the moult stage from the photographs (p. 4):

... images were randomly named and sorted, with the dates [...] removed to minimize observer expectancy bias [i.e., the observer effect].

Blinding the observer is not always possible, but should be used when possible to improve the internal validity of the study.

A study of the scats of gray wolves was used to study their diet (Spaulding, Krausman, and Ballard 2000). A scat analysis is where humans examine the scat of carnivores to determine the prey. However, the accuracy of the results was questioned, due to 'perpetuation of the assumption that wolf scats contain only \(1\) prey item/scat' (p. 949).

The observers might be seeing what they expect to see: that "wolf scats contain only \(1\) prey item/scat".

7.5 Placebo effect, controls, objective data, and blinding

Perhaps surprisingly, individuals in a study may report effects of a treatment, even if they have not received an active treatment. This could compromise the internal validity of the study. This is called the placebo effect, which generally only impacts people as individuals.

Definition 7.5 (Placebo effect) The placebo effect occurs when individuals report perceived or actual effects, despite not receiving an active treatment.

For example, people who attend therapy expect a positive outcome; this expectation may result in temporary or subjective (or sometimes even real) improvements in their condition. This is the placebo effect.

To manage the placebo effect, researchers should record objective data rather than patient-reported (subjective) outcomes when possible (Enck et al. 2013). Using a control group (Def. 2.16), if possible, is also useful: it acts as a benchmark for detecting changes in the outcome due to the treatment of interest. In addition, blinding the individuals and the researchers may help manage the placebo effect, as then the individuals cannot know which group they are in.

Example 7.18 (Placebo effect) Three active pain relievers were compared to different-coloured placebo (Huskisson 1974) in \(22\) patients. The most pain relief was experienced by those taking red placebos (Fig. 7.5), who experienced even more pain relief than those given true pain relievers. Note that the outcome is subjective: a patient-reported outcome.

FIGURE 7.5: Pain relief, for various pain relief medicine and different-coloured placebos.

Since the placebo effect is concerned with individuals' responses to allocated treatments, it is not directly relevant to observational studies.

Example 7.19 (Placebo effect) In the Himalaya study, the individuals 'were not told the identity of the test cereal in the foods provided' (Bird et al. (2008), p. \(1\,033\)). The subjects were blinded to the diet they were exposed to. However, some may think they are on the refined cereal or Himalaya diet, and respond accordingly (perhaps unconsciously). The use of the refined cereal was acting as a control (Def. 2.16). Researchers measured faecal weight, an objective outcome, to minimise the placebo effect.

A study of placebos (Waber et al. 2008) gave half the subjects a placebo, but told them the pill was an expensive (implying 'effective') painkiller. The other half were also given a placebo, but were told the pill was a discount (implying 'less effective') painkiller. About \(85\)% of participants in the first group reported a pain reduction, yet only \(61\)% in the second group reported a pain reduction. Remember: both groups actually received a placebo! Again, 'pain relief' is subjective.

7.6 Carry-over effect and washouts

In the Himalaya study (Example 7.2), the diet is a between-individuals comparison: one group of patients was given the refined cereal diet (the control), and a different group of people was given Himalaya 292. The study also used a within-individuals comparison: each person in the study was actually placed on both diets at different times.

Suppose all patients spent four weeks on the Himalaya 292 diet, then the next four weeks on the refined cereal diet. Potentially, the first diet could still be impacting the subjects' faecal weight for a little while after stopping the first diet. This could compromise the internal validity of the study. This is an example of the carry-over effect: when the influence of one treatment or condition on the response variable carries over to influence the value of the response variable for the next treatment or condition. The carry-over effect is only a concern for within-individuals comparisons.

Definition 7.6 (Carry-over effect) The carry-over effect occurs when the influence of one treatment or condition on the response variable influences the response variable for subsequent treatments or conditions (in a repeated-measures study).

The impact of the carry-over effect may be minimized by using a washout or similar between treatments or conditions. For example, after tasting a food sample, participants may rinse their mouth with water before tasting another food sample. For the Himalaya study, the participants could spend two weeks on their usual (before-study) diet, before starting each of the diets in the study. This is called a washout period.

Example 7.20 (Carry-over effect: experimental study) In the Himalaya study, 'there was no washout period' (Bird et al. (2008), p. \(1\,033\)) since the response variable was only recorded after individuals spent four weeks on each diet. Since faecal weight was not measured until the end of the four-week periods, the carry-over effect is essentially irrelevant.

In Jaskiewicz et al. (2020), student paramedics performed chest compression in real-life (RL), and also using virtual reality (VR). Researchers were assessing the relaxation percentage of the students while undertaking the compression (a relaxation percentage of about \(50\)% is ideal).

When used by itself, the VR method produced an average relaxation percentage of \(45.5\)%. However, when the RL method was used first, and then followed by the VR method, the average VR relaxation percentage was \(74.7\)%.

The response of the individuals was different depending on whether the RL method was used first. This is an example of the carry-over effect.

Sometimes, in experimental studies, researchers can randomly allocate the order in which the treatments are used (a crossover study). That is, some participants start by spending four weeks on the Himalaya 292 diet, then four weeks on the refined cereal diet; meanwhile, other participants start by spending four weeks on the refined cereal diet, then four weeks on the Himalaya 292 diet.
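Random allocation of the treatment order can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the participant IDs are hypothetical, the function name `allocate_orders` is invented here, and the sketch balances the two orders (half the participants in each), which the source does not specify.

```python
import random


def allocate_orders(participants, seed=None):
    """Randomly allocate participants to one of two treatment orders.

    Half the participants (rounded down) start on Himalaya 292;
    the remainder start on the refined cereal diet.
    """
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    orders = {}
    for i, person in enumerate(shuffled):
        if i < half:
            orders[person] = ("Himalaya 292", "Refined cereal")
        else:
            orders[person] = ("Refined cereal", "Himalaya 292")
    return orders


participants = [f"P{i:02d}" for i in range(1, 9)]  # hypothetical IDs
orders = allocate_orders(participants, seed=1)
for person in participants:
    first, second = orders[person]
    print(f"{person}: {first}, then {second}")
```

Because the order is determined by chance rather than by the researchers or participants, any carry-over from the first diet is spread across both diets rather than always affecting the same one.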

Example 7.21 (Carry-over effect) In the Himalaya study (Example 7.2), subjects were allocated randomly to begin the study on the Himalaya 292 diet or the refined cereal diet.

Example 7.22 (Washout periods: experimental study) R. D. MacDonald et al. (2006) required paramedics to conduct eight different tasks (such as electrical defibrillation and intravenous cannulation). Each of the paramedics began the series of tasks at a random task, to mitigate the carry-over effect. A washout period between tasks (i.e., a rest time) was also used.
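A random starting task can be sketched as follows. The task labels are placeholders (only two of the eight tasks are named above), and the sketch assumes that the tasks are then completed in a fixed cyclic order from the random starting point, which the description above does not confirm.

```python
import random

# Placeholder labels: the eight actual tasks are not all listed in the text.
TASKS = [f"Task {i}" for i in range(1, 9)]


def task_order(tasks, rng):
    """Rotate the task list so that it begins at a randomly chosen task."""
    start = rng.randrange(len(tasks))
    return tasks[start:] + tasks[:start]


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
order = task_order(TASKS, rng)
print(" -> ".join(order))
```

Each paramedic would receive their own call to `task_order`, so different paramedics begin at different tasks.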

The carry-over effect is also a potential threat to internal validity in observational studies involving a within-individuals comparison. Since treatments are not allocated in observational studies, carry-over effects may be difficult to prevent: washouts cannot be imposed, and neither can the order of the conditions. However, observing some individuals exposed to Condition A then Condition B, and other individuals exposed to Condition B then Condition A, may be possible.

Example 7.23 (Carry-over effects: observational study) Norris (2005) studied the carry-over effect in ecological observational studies of animals (p. 181):

...individuals occupying poor quality winter habitat may experience reduced reproductive success the following breeding season when compared to individuals occupying high quality winter habitat.

7.7 Describing blinding

Blinding occurs when those involved in the study do not know information about the study. The individuals in the study may be blinded (to help manage the Hawthorne effect) to

whether they are involved in a study;
the aims of the study in which they are participants; and/or
which comparison group they are in.

The researchers and the analysts can be blinded to which comparison groups apply to the individuals (to help manage the observer effect).

When blinding is used in as many ways as possible, the internal validity of the study is increased and bias reduced. However, when people are the individuals, ethics requirements may mean that the individuals need to know they are in a study (especially if experimental), and the purpose of the study.

If only the individuals are blinded to the comparison groups, the study is called single-blind. If both the researchers and participants are blinded to the comparison groups, the study is called double-blind. If the researchers, participants and the analyst are blinded to the comparison groups, the study is sometimes called triple-blind. Rather than using these terms, explicitly stating who or what is blinded to which parts of the study is clearer.
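One common way to blind an analyst is to replace the real group names with neutral codes before the data are handed over. The sketch below is a minimal illustration, not a procedure taken from any study cited here; the unit IDs, group names and the function name `blind_labels` are all hypothetical.

```python
import random


def blind_labels(allocations, seed=None):
    """Replace real group names with neutral codes.

    Returns the blinded allocations and the key linking real names to codes.
    The key would be held by someone not involved in the analysis.
    """
    rng = random.Random(seed)
    groups = sorted(set(allocations.values()))
    codes = [f"Group {chr(ord('A') + i)}" for i in range(len(groups))]
    rng.shuffle(codes)  # so the code gives no hint about the real group
    key = dict(zip(groups, codes))
    blinded = {unit: key[group] for unit, group in allocations.items()}
    return blinded, key


# Hypothetical allocations of four units to two comparison groups.
allocations = {"U01": "treatment", "U02": "control",
               "U03": "treatment", "U04": "control"}
blinded, key = blind_labels(allocations, seed=7)
print(blinded)  # what the analyst sees
```

The analyst can still compare 'Group A' with 'Group B', but cannot tell which is the treatment group until the key is revealed after the analysis.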

Blinding should be considered in all studies when possible (it is not always possible). Blinding participants does not just apply to people; it also may apply to animals (Example 7.16).

Example 7.24 (Double-blinding) Bulte et al. (2014) compared yields from modern and traditional cowpea crops in Tanzania. The two seed types ('traditional' and 'modern') were made similar in appearance, so the farmers were blinded to which group they were in (control or treatment). The seed type would eventually become obvious as the crop grew, but 'key inputs were already provided' by then (p. 817). In addition, the researchers interacting with the farmers were not informed about the type of seed distributed.

In observational studies, blinding individuals may be easier than in experimental studies (Sect. 7.3). Blinding the researchers may be difficult, since the researchers need to record the value of the explanatory variable.

Example 7.25 (Blinding: observational studies) Emerson et al. (2010) studied Achilles tendinopathy in gymnasts, by comparing \(40\) elite gymnasts with \(41\) similar controls who were non-gymnasts. The authors state (p. 38) that

Although the primary investigator was blind to the clinical status of the subjects, there was no blinding to whether each subject was in the gymnast or control group during image collection [...]

When the images were reviewed, however, the article explains that the examiner was unaware of the clinical state and group of the subjects.

The paper explains who was blinded and to what parts of the study they were blinded.

7.8 Recording extraneous variables

One way to design a high-quality study is to record information about many (potential) extraneous variables. Various reasons for doing this have been given:

To evaluate external validity, by comparing the sample and population to determine if the sample is representative of the population (Sect. 6.6).
To improve internal validity, by helping to manage confounding:
- by avoiding lurking variables (Sect. 3.4).
- by determining if the comparison groups are similar (Sect. 7.2).
- by using the information in analysis (Sect. 7.2).
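To illustrate how recorded extraneous variables can be used to check whether comparison groups are similar, the sketch below compares the mean age in two groups. The records are entirely hypothetical (invented for illustration), as is the function name `mean_by_group`.

```python
from statistics import mean

# Hypothetical records: each individual's group and one extraneous variable (age).
records = [
    {"group": "Treatment", "age": 34},
    {"group": "Treatment", "age": 41},
    {"group": "Treatment", "age": 29},
    {"group": "Treatment", "age": 38},
    {"group": "Control", "age": 36},
    {"group": "Control", "age": 40},
    {"group": "Control", "age": 31},
    {"group": "Control", "age": 37},
]


def mean_by_group(records, variable):
    """Return the mean of an extraneous variable within each comparison group."""
    values = {}
    for record in records:
        values.setdefault(record["group"], []).append(record[variable])
    return {group: mean(vals) for group, vals in values.items()}


age_means = mean_by_group(records, "age")
for group, avg in age_means.items():
    print(f"{group}: mean age {avg:.1f} years")
```

If the group means were very different, age would be a potential confounder that needs managing (for example, in the analysis). This check is only possible because age was recorded.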

Record the values of all extraneous variables that may be important in the study!

Example 7.26 (Poor internal validity) In the 1800s, Semmelweis recorded mortality rates of women after childbirth over many years (P. M. Dunn 2005) at two clinics:

In Clinic 1, with male doctors delivering babies: \(9.9\)%.
In Clinic 2, with female midwives delivering babies: \(3.4\)%.

Was the difference in mortality rate (the outcome) due to the sex of the person delivering the babies (the comparison)?

One possible confounder was the clinic; however, the clinic was eliminated as an explanation. For example, Clinic 2 was actually more overcrowded than Clinic 1, and the climate was similar for both clinics.

However, an important lurking variable was present. At the time, the benefits of hand-washing were not understood, and hand-washing was not commonplace. Many (male) doctors performed autopsies immediately before delivering babies, without washing their hands between procedures. In contrast, autopsies were not performed by the (female) midwives.

The lurking variable was 'whether the baby was delivered by someone with clean hands', which was related to the mortality rate and to the sex of the person delivering the baby. The female midwives had clean hands, and hence the mortality rate was (relatively) low. The male doctors did not have clean hands, and hence the mortality rate was high.

After instituting hand-washing for doctors, the mortality rate in Clinic 1 reduced to a rate similar to that in Clinic 2.

7.9 Recording objective data

Recording objective data is often more reliable than recording subjective data, as subjective data can be influenced by the Hawthorne, observer or placebo effects. Perceptions are also often unreliable. However, sometimes recording objective data is not possible, and sometimes the researchers are explicitly interested in the subjective responses of people to certain treatments or conditions.

If possible, objective data should be recorded.

Example 7.27 (Subjective and objective data) Ueberham et al. (2019) studied cyclists using everyday routes over one week in Leipzig (Germany). Sixty-six cyclists wore sensors that objectively recorded particle number counts (i.e., pollution), noise, humidity and temperature. The cyclists also subjectively recorded similar information.

The researchers concluded that (p. 1)

Except for heat, no significant associations between the objective and subjective data were found.

That is, the subjective and objective data generally did not agree, except for heat. The perceptions of heat may have been influenced by the Hawthorne effect (p. 7):

...most people are pre-informed about the daily temperature by the weather forecast and expect a certain degree of heat, which ultimately also affects their perception of it to a great extent.

Example 7.28 (Subjective and objective data) E. Johnson, Millar, and Shiely (2021) asked \(70\) people in south-west Ireland to subjectively self-report their body mass index (BMI) category, as 'Normal' or 'Overweight'. Thirty-six subjects self-reported their BMI category as 'Normal'.

The researchers also objectively recorded the BMI of the same subjects. Twenty-nine subjects were objectively categorised as 'Normal'.
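The size of the disagreement can be checked with a little arithmetic, sketched here in Python using only the counts reported above (the variable names are illustrative).

```python
# Counts reported in the example.
n_subjects = 70
self_reported_normal = 36   # subjective: self-reported as 'Normal'
measured_normal = 29        # objective: categorised as 'Normal' from measurements

pct_self = 100 * self_reported_normal / n_subjects
pct_measured = 100 * measured_normal / n_subjects

# At least this many subjects self-reported 'Normal' without being
# objectively categorised as 'Normal'.
min_mismatch = self_reported_normal - measured_normal

print(f"Self-reported 'Normal':     {pct_self:.1f}%")
print(f"Objectively 'Normal':       {pct_measured:.1f}%")
print(f"Minimum number mismatched:  {min_mismatch}")
```

That is, more subjects placed themselves in the 'Normal' category than the objective measurements supported, so the subjective and objective data did not fully agree.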

7.10 Chapter summary

Designing effective studies (Fig. 7.6) requires researchers to manage or minimise confounding where possible, by restricting the study to certain groups; blocking individuals into similar groups; using special analysis methods; and/or randomly allocating the units of analysis. Random allocation is only possible for experimental studies.

Well-designed studies manage the Hawthorne effect (e.g., by blinding participants); the observer effect (e.g., by blinding the researchers); the placebo effect (experimental studies only; e.g., by using controls, objective outcomes and blinding subjects); and the carry-over effect (e.g., by using a washout, or randomly allocating the treatment order). Recording objective data is usually better than recording subjective data.

Often, however, not all of these strategies can be used. For instance, people usually know they are involved in an experimental study, so the Hawthorne effect may impact conclusions. In these cases, the possible impacts should be minimised as far as possible, and then the likely impact on the conclusions discussed. The impact of these issues is often reported as a limitation of the study (Chap. 8).

FIGURE 7.6: Design considerations for designing studies. Note: lurking variables become confounding variables when recorded in the study, and so can be managed as confounding variables. The arrows indicate the main design strategies to (perhaps partially) manage the indicated potential bias. Not all strategies are possible for every study.

Example 7.29 (Research design) Cross et al. (2019) (p. 3) compared chest compressions by student paramedics using dominant and non-dominant hands, and stated:

...participants were allocated randomly to one of two groups: 'dominant hand on chest' or 'non-dominant hand on chest'. Group allocation was determined by a computer-generated randomisation schedule...

The participants were blinded to the purpose of the study, but not to which group they were allocated. The analyst was also blinded to the group allocations. This study used many good design features.

7.11 Quick review questions

Doosti-Irani et al. (2016) wanted to determine the relationship between the depth of bruising on apples and the size of the impact force. The researchers purposefully hit apples with three different forces (\(200\), \(700\) and \(1200\,\text{mJ}\)) to inflict bruises on the apples. The researchers then recorded the depth of the bruising. The study was conducted separately for three different regions of the apple (lower; middle; upper), and each apple was only used once.

Are the following statements true or false?

The response variable is 'the depth of bruising'.
The explanatory variable is 'the force used on the apples'.
The variable 'location of the bruising' would be classified as a confounding variable.
The researchers could minimise the effects of confounding by using potential confounding variables in the analysis.
The researchers could use random allocation of the treatments to the apples to minimise confounding.
The carry-over effect is likely to be a big problem in this study.
The Hawthorne effect is likely to be a big problem in this study.
The placebo effect is likely to be a big problem in this study.
The observer effect is likely to be a big problem in this study.

Scientific Research and Methodology: An introduction to quantitative research and statistics