11 Summarising quantitative data

So far, you have learnt to ask a RQ, design a study, collect the data, and classify the data. In this chapter, you will learn to:

  • summarise quantitative data using the appropriate graphs.
  • summarise quantitative data using shape, average, variation and unusual features.

11.1 Introduction

Many quantitative research studies involve quantitative variables. Except for very small amounts of data, understanding the data is difficult without a summary. Quantitative data can be understood by knowing how often values of the variables appear. This is called the distribution of the data.

Definition 11.1 (Distribution) The distribution of a variable describes what values are present in the data, and how often those values appear.

The distribution can be displayed using a frequency table (Sect. 11.2) or a graph (Sect. 11.3). The distribution of quantitative data can be summarised by the shape (Sect. 11.4), and summarised numerically by computing the average value (Sect. 11.5), computing the amount of variation (Sect. 11.6), and identifying outliers (Sect. 11.7).

11.2 Frequency tables for quantitative data

Quantitative data can be collated in a frequency table by grouping the variables into appropriate intervals ('bins'). The intervals should be exhaustive (cover all values) and exclusive (observations belong to one and only one category). While not essential, usually the categories are of equal size.

For discrete quantitative data, bins are defined to contain single discrete values, or a small number of discrete values (Example 11.1).

Example 11.1 The data in Fig. 11.1 show the number of severe cyclones (also called hurricanes or typhoons) in the Australian region, for each year from 16969 to 2005.

A frequency table can be constructed by binning each discrete value individually (Table 11.1, left table) or grouped in pairs (Table 11.1, right table).

FIGURE 11.1: The cyclone data

TABLE 11.1: The number of severe cyclones per year, displayed in a frequency table
Number cyclones Number of years Percentage of years
\(3\) cyclones \(\phantom{0}8\) \(22\)
\(4\) cyclones \(10\) \(27\)
\(5\) cyclones \(\phantom{0}3\) \(\phantom{0}8\)
\(6\) cyclones \(\phantom{0}5\) \(14\)
\(7\) cyclones \(\phantom{0}2\) \(\phantom{0}5\)
\(8\) cyclones \(\phantom{0}4\) \(11\)
\(9\) cyclones \(\phantom{0}4\) \(11\)
\(10\) cyclones \(\phantom{0}0\) \(\phantom{0}0\)
\(11\) cyclones \(\phantom{0}1\) \(\phantom{0}3\)

Care is needed when creating frequency tables for continuous quantitative data, since all continuous data are rounded. The bins should be defined to ensure no values lie on the border between bins (Example 11.2).

Example 11.2 The data in Fig. 11.2 give the weights of \(44\) babies born in a hospital on one day (P. K. Dunn 1999; Steele 1997), plus the gender of each baby, and the number of minutes after midnight of the birth. The data are given in the order in which the births occurred.

To display the distribution, the weights can be grouped into clearly-defined weight intervals (Table 11.2, left column). Alternatively, the breaks between the bins can be given to one more decimal place than the data to avoid observations landing exactly on the bin divisions (final column).

The table also gives percentages in each bin; for example, the percentage of babies over \(4.0\) kg is \(1/44 \times 100 = 2.27\)%, or about \(2\)%. Most babies in the sample are between \(3\) and \(4\) kg at birth.

FIGURE 11.2: The baby-births data

TABLE 11.2: The baby-weights data, displayed in a frequency table
Weight group Number of babies Percentage of babies Alterative weight group
\(1.5\) kg to under \(2.0\) kg \(\phantom{0}1\) \(\phantom{0}2\) \(1.45\) kg to \(1.95\) kg
\(2.0\) kg to under \(2.5\) kg \(\phantom{0}4\) \(\phantom{0}9\) \(1.95\) kg to \(2.45\) kg
\(2.5\) kg to under \(3.0\) kg \(\phantom{0}4\) \(\phantom{0}9\) \(2.45\) kg to \(2.95\) kg
\(3.0\) kg to under \(3.5\) kg \(17\) \(39\) \(2.95\) kg to \(3.45\) kg
\(3.5\) kg to under \(4.0\) kg \(17\) \(39\) \(3.45\) kg to \(3.95\) kg
\(4.0\) kg to under \(4.5\) kg \(\phantom{0}1\) \(\phantom{0}2\) \(3.95\) kg to \(4.45\) kg

Sometimes trial and error is needed to find useful intervals for continuous data. Usually, but not universally, the intervals include values at the lower end of the interval, but exclude values at the upper end (as in Table 11.2).

11.3 Graphs

The graphs in this section are appropriate for continuous quantitative data, and sometimes for discrete quantitative data if many values are possible. Often, discrete quantitative data with very few recorded values are better displayed using the graphs designed for qualitative data (Sect. 12.3).

The purpose of a graph is to display the information in the clearest, simplest possible way, to facilitate understanding the message(s) in the data.

Graphs used to display the distribution of one quantitative variable include:

  • Histogram (Sect. 11.3.1): best for moderate to large amounts of data.
  • Stemplots (Sect. 11.3.2): best for small amounts of data; sometimes useful.
  • Dot chart (Sect. 11.3.3): used for small to moderate amounts of data.

11.3.1 Histograms

Histograms are a series of boxes, where the width of the box represents an interval of values of the variable being graphed, and the height of the box represents the number (or percentage) of observations within that range of values2. The height of the histogram bars indicate the number (or percentage) in each category. Histograms are essentially a picture of a frequency table. The vertical axis can be counts (labelled as 'Counts', 'Frequency', or similar) or percentages.

The axis displaying the counts (or percentages) should start from zero, since the height of the bars visually implies the frequency of those observations (see Example 17.1).

When the quantitative variable is discrete, the labels usually are placed on the axis aligned with the centre of the bar (e.g., Example 11.3).

Example 11.3 (Histograms (discrete data)) Consider again the number of severe cyclones in the Australian region (Fig. 11.1). A histogram can be constructed from either frequency table in Table 11.1; see below. For example, the histogram shows that there were eight years where three severe cyclones were recorded.

Notice that the different bin sizes change the appearance of the distribution (Example 11.6).

Two histograms of the severe-cyclone data

FIGURE 11.3: Two histograms of the severe-cyclone data

When the quantitative variable is continuous, care is needed when constructing the histogram. Since the data are continuous, the data must be rounded. (For instance, the birthweights in Fig. 11.2 are rounded to one decimal place of a kilogram.) This means care is needed when defining boundaries between bins, and ensuring clarity about which bin contains observations when they lie on (or near) a boundary. One way to do this is to create boundaries between bins to one more decimal place than the given data.

The choice of bin size and bin boundaries can substantially change how a histogram displays the data (Examples 11.4 and 11.6).

When observations lie on the boundary of the boxes, some software includes these observations in the lower box and some in the higher box.

Example 11.4 (Histograms (continuous data)) Consider again the weights (in kg) of babies born in a Brisbane hospital in one day (Fig. 11.2). A histogram can be constructed for these data; see below.

An observation on a boundary between the bins may be placed it in the higher box (i.e., \(2.5\) kg is in the '\(2.5\) to \(3.0\) kg' box, not the '\(2.0\) to \(2.5\) kg' box): see the left panel. Alternatively, a boundary observation may be placed in the lower box; see the centre panel. This histogram is a picture of the frequency table in Table 11.2.

To avoid confusion, the boundaries can be defined to one more decimal place than the data (right panel), which is equivalent to counting the observations in the lower box (as in the left panel). Notice that the choice impacts the look of the histogram.

Histograms can be constructed in different ways to manage observations on the boundary of bins. Left: boundary values counted in the higher box. Centre: boundary values counted in the lower box. Right: define boundaries with one more decimal places that the data.

FIGURE 11.4: Histograms can be constructed in different ways to manage observations on the boundary of bins. Left: boundary values counted in the higher box. Centre: boundary values counted in the lower box. Right: define boundaries with one more decimal places that the data.

Example 11.5 (Histograms) Mages et al. (2017) recorded the length of 'brain freezes' after consuming cold food or drink. A histogram of the data (Fig. 11.5), shows \(11\) people experience symptoms less than \(5\) s in length; nine people experienced symptoms for at least \(5\) but less than \(10\) s; and \(1\) person experienced symptoms for at least \(35\) s but under \(40\) s.

Histogram of the duration of brain-freeze symptoms after drinking ice water. Boundary observations are counted in the lower box.

FIGURE 11.5: Histogram of the duration of brain-freeze symptoms after drinking ice water. Boundary observations are counted in the lower box.

Software usually makes sensible default choices for the number of bins, and width of the bins. However, the bin size can substantially change the appearance of the histogram. Software makes it easy to try different bin sizes to find one that displays the overall distribution well.

Example 11.6 (Bin width) A histogram for the time between eruptions (Härdle et al. 1991) of the Old Faithful geyser in Yellowstone National Park (USA) is shown below. Try changing the number of bins in the interaction below to see the impact.

FIGURE 11.6: Changing the bin width can change the impression of the distribution

11.3.2 Stemplots

Stemplots (or stem-and-leaf plots) are best described and explained using an example. Consider again the data in Fig. 11.2, the weights of babies born in a Brisbane hospital on one day.

In a stemplot, part of each number is placed to the left of a vertical line (the stem), and the rest of each number to the right of the line (the leaf). The weights in Fig. 11.2 are given to one decimal place of a kilogram, so the whole number of kilograms is placed to the left of the line (as the stem), and the first decimal place is placed on the right of the line (as a leaf). The animation below shows how the stemplot is constructed. The first weight, of \(1.7\) kg, is entered with the \(1\) to the left of the line, and the \(7\) to the right: 1 | 7. Similarly, \(2.1\) kg is entered as 2 | 1 and \(2.2\) kg is entered as 2 | 2, sharing the same stem as for \(2.1\) kg.⚄ The plots shows that most birthweights are \(3\)-point-something kilograms.

In stemplots, the original data remain visible. For stemplots:

  • place the larger digit (e.g., kilograms) on the left (stems).
  • place the smaller digit (e.g., first decimal of a kilogram) on the right (leaves).
  • some data do not work well with stemplots.
  • sometimes, data may need suitable rounding before creating the stemplot (the baby weights were originally given to three decimal places).
  • the numbers in each row should be evenly spaced, with the numbers in the columns under each other, so the length of each stem is proportional to the number of observations.
  • within each stem, the observations are ordered so patterns can be seen.
  • add an explanation for reading the stemplot. For example, the stemplot for the baby-birth data says '\(2\) | \(6\) means \(2.6\) kg' (rather than, say, \(0.26\) kg, or \(2\) lb \(6\) oz).

Example 11.7 (Stemplots) Wright et al. (2021) recorded the number chest-beats by gorillas. The stemplot in the animation below shows the stemplot being constructed, and for gorillas aged under 20 years shows a lot of variation in the beating rate, but most are under 20.

The following short video may help explain some of these concepts:

11.3.3 Dot charts (quantitative data)

Dot charts show the original data on a single (usually horizontal) axis, with each observation represented by a dot (or other symbol). Sometimes, observations are identical, or nearly so, and to avoid points being plotted on top of other points (called overplotting), are jittered (placed with some added randomness in the vertical direction) or stacked (plotted above each other).

The axis displaying the counts (or percentages) should start from zero, since the distance of the dots from the axis visually implies the frequency of those observations (see Example 17.1).

Example 11.8 (Dot charts) Consider the weights (in kg) of babies born in a Brisbane hospital (Fig. 11.2). A dot chart (Fig. 11.8) shows that most babies were born between \(3\) and \(4\) kg. The points have been jittered.

Example 11.9 (Dot charts) The chest-beating rate of young gorillas, seen in Example 11.7), can be displayed using a dot chart (Fig. 11.7, top panel). The points have been stacked.

Large orders of French fries: Mass measurements (top panel) and percentage of target mass (below panel)

FIGURE 11.7: Large orders of French fries: Mass measurements (top panel) and percentage of target mass (below panel)

A dot chart of the baby-weight data

FIGURE 11.8: A dot chart of the baby-weight data

11.3.4 Describing the distribution

Graphs are constructed to help readers understand the data. Hence, after producing a graph, the distribution of the data should be described using:

  1. The shape of the distribution. That is, are most of the values smaller or larger, or about even distributed between smaller and larger values?
  2. The average of the data. What is an average, central or typical value?
  3. The variation in the bulk of the data.
  4. Any outliers (unusually large or small observations) or unusual features.

These can be described in rough terms. The average, variation and outliers are usually described numerically, too (as in Sect. 11.5 to Sect. 11.7).

Example 11.10 (Describing quantitative data) The weights of babies (Example 11.4) are typically between about \(2.5\) kg and \(3\) kg (the average), with most between \(1.5\) kg and \(4.5\) kg (variation). A few babies have very lower weights, probably premature births (shape). No unusual values are present.

Importantly, these numerical quantities are computed from a sample, even though the whole population is of interest. As a result, distinguishing parameters and statistics is important.

Definition 11.2 (Parameter) A parameter is a number, usually unknown, describing some feature of a population.

Definition 11.3 (Statistic) A statistic is a number describing some feature of a sample (to estimate the unknown value of the population parameter).

A statistic is a numerical value estimating an unknown population value. However, countless samples are possible (Sect. 6.2), and so countless possible values for the statistic---all of which are estimates of the value of the parameter---are possible. The observed value of the statistic depends on which one of the countless possible samples is (randomly) selected.

The RQ identifies the population, but in practice only one of the many possible samples is studied. Statistics are estimates of parameters, and the value of the statistic is not the same for every possible sample. We only observe one value of the statistic from our single observed sample.

11.4 Describing shape

The shape of a distribution can described using some terminology:

  • Right (or positively) skewed: most data are small, with some larger values.
  • Left (or negatively) skewed: most data are large, with some smaller values.
  • Symmetric data: the left and right sides of the graph are similar.
  • Bimodal data: the distribution has two peaks.

The carousel below (click the left and right arrows to move through the example plots) shows typical shapes. Sometimes, no short descriptions (like above) are suitable.

FIGURE 11.9: Some common shapes of the distribution of qualitative data

Example 11.11 (Bimodal data) The Old Faithful geyser in Yellowstone National Park (USA) erupts regularly (Härdle et al. 1991). The original histogram for the time between eruptions (Fig. 11.6, left panel) is bimodal, with peaks near \(55\) min and \(80\) min.

11.5 Numerical summary: averages

The average (or location, or central value) for quantitative sample data can be described numerically in many ways. The most common are:

  • the sample mean (or sample arithmetic mean), which estimates the population mean (Sect. 11.5.1); and
  • the sample median, which estimates the population median (Sect. 11.5.2).

In both cases, the parameter is estimated by a statistic. Understanding whether to use the mean or median is important.

'Average' can refer to means, medians or other measures of centre. Use the precise term 'mean' or 'median', rather than 'average', when possible!

Example 11.12 (Averages) Consider the daily river flow volume ('streamflow') at the Mary River from 01 October 1959 to 17 January 2019. The 'average' daily streamflow in February could be described using either:

  • the mean daily streamflow, which is \(1\ 123.2\) ML.
  • the median daily streamflow, which is \(146.1\) ML.

Both summarise the same data, yet give very different 'averages'. This implies they measure the 'average' differently, with different meanings. Which is the best 'average' to use? To decide, both need to be studied.

11.5.1 Average: the mean

The mean of the population is denoted by \(\mu\), and its value is almost always unknown. The mean of the population is estimated by the mean of the sample, denoted \(\bar{x}\). In this context, the value of the unknown parameter is \(\mu\), and the value of the statistic is \(\bar{x}\).

The sample mean estimates the population mean, and every sample is likely to give a different sample mean. We usually only ever have one sample.

The Greek letter \(\mu\) is pronounced 'mew' or 'myoo'. \(\bar{x}\) is pronounced 'ex-bar'.

Example 11.13 (A small dataset) Consider a small dataset for answering this descriptive RQ: 'For gorillas aged under \(20\), what is the average chest-beating rate?' The population mean rate (denoted \(\mu\)) is to be estimated.

Clearly, every gorilla cannot be studied; a sample is studied. The unknown population mean is estimated using the sample mean (\(\bar{x}\)), and every sample can give a different value for \(\bar{x}\). Measurements were taken from \(14\) young gorillas (Fig. 11.7, top panel).

The sample mean is the 'balance point' of the observations. The animation below shows how the mean acts as the balance point for the gorilla data. Alternatively, the mean is the value such that the positive and negative distances of the observations from the mean add to zero, as in the animation below. Both of these explanations seem reasonable for identifying an 'average' for the data.

Definition 11.4 (Mean) The mean is one way to measure the 'average' value of quantitative data. The arithmetic mean is the 'balance point' of the data, and the value such that the positive and negative distances from the mean add to zero.

To find the value of the sample mean, add (denoted by \(\sum\)) all the observations (denoted by \(x\)) then divide by the number of observations (denoted by \(n\)). In symbols: \[ \bar{x} = \frac{\sum x}{n}. \]

Example 11.14 (Computing a sample mean) For the chest-beating data (Fig. 11.8), an estimate of the population mean (i.e., the sample mean) chest-beating rate is found by summing all \(n = 14\) observations and dividing by \(n\): \[ \overline{x} = \frac{\sum x}{n} = \frac{0.7 + 0.9 + \cdots + 4.4}{14} = \frac{31.1}{14} = 2.221429. \] The sample mean, the best estimate of the population mean, is \(2.22\) beats per \(10\) h.

The sample mean is usually calculated using software (such as jamovi) for large amounts of data, or a calculator (in Statistics Mode) for small amounts of data. However, knowing how these quantities are computed is important.

Software and calculators often produce numerical answers to many decimal places, not all of which may be meaningful or useful. A simple but useful rule-of-thumb is to round to one or two more significant figures than the original data. Software usually does not add measurement units to the answer either.

For example, the chest-beating data are given to one decimal place. The sample mean rate can be given as \(\bar{x} = 2.22\) beats per \(10\) hrs.

Example 11.15 (Computing the sample mean) Griffin, Webster, and Michael (1960) recorded the distance at which flies (Drosophila) were detected by bats for \(n = 11\) detections (Table 11.3). The population mean distance is estimated by the sample mean as \(\bar{x} = 532/11 = 48.4\) cm.

TABLE 11.3: The distance at which small fruit flies were detected by bats
Detection distance
\(62\) \(68\) \(34\) \(27\) \(83\) \(40\)
\(52\) \(23\) \(45\) \(42\) \(56\)

11.5.2 Average: the median

A median is a value separating the largest \(50\)% of the data from the smallest \(50\)% of the data. In a dataset with \(n\) values, the median is ordered observation number \((n + 1)\div 2\). (The median is not equal to \((n + 1)\div 2\), and is not halfway between the minimum and maximum values in the data.)

Many calculators cannot find the median. The median has no commonly-used symbol, though \(\tilde{\mu}\) and \(\tilde{x}\) are sometimes used for the population and sample medians respectively.

Definition 11.5 (Median) The median is one way to measure the 'average' value of data. A median is a value such that half the values are larger than the median, and half the values are smaller than the median.

Example 11.16 (Find a sample median) To find a sample median for the chest-beating data (Fig. 11.8), first arrange the data in numerical order (Table 11.4). The median separates the larger \(7\) numbers from the smaller \(7\) numbers. With \(n = 14\) observations, the median is the ordered observation located between the seventh and eighth observations (i.e., at position \((14 + 1)/2 = 7.5\); the median itself is not \(7.5\)).

The sample median, as estimate of the population median, is between \(1.7\) (ordered observation \(7\)) and \(1.7\) (observation \(8\)). Since these values are the same, the sample median is \(1.7\) beats per \(10\) h.

TABLE 11.4: The chest-beating rate of young gorillas, in increasing order
Chest-beating rate, per 10 h
0.7 0.9 1.3 1.5 1.5 1.5 1.7 1.7 1.8 2.6 3 4.1 4.4 4.4

To clarify:

  • if the sample size \(n\) is odd (see Example 11.17), the median is the middle number when the observations are ordered.
  • if the sample size \(n\) is even (such as the chest-beating data), the median is halfway between the two middle numbers, when the observations are ordered.

Software may use slightly different rules when \(n\) is even, producing slightly different values for the median.

The sample median estimates the population median, and every sample is likely to have a different sample median. We usually only ever have one sample.

Example 11.17 (Computing the median) For the bat data (Table 11.3), the estimate of the population median distance at which bats detect the flies is the sample median. With \(n = 11\), the median is the \((11 + 1)/2 = 6\)th ordered value, which is \(45\) cm.

11.5.3 Which average to use?

Consider the daily streamflow at the Mary River (Bellbird Creek) during February again (Example 11.12): the mean daily streamflow is \(1\ 123\) ML, and the median daily streamflow is \(146.1\) ML. Which is 'best' for measuring the average streamflow?

For these data, about \(86\)% of the observations are less than the mean, but \(50\)% the values are less than the median (by definition). The mean is hardly a central value...

A dot chart of the daily streamflow (Fig. 11.10; jittering) shows that the data are very highly right-skewed, with many very large outliers (presumably during flood events); the extreme outliers are seen in the frequency table too (Table 11.5).

A dot plot of the daily streamflow at Mary River from 1960 to 2017, for February ($n = 1650$). The vertical grey line is the mean value. Many large outliers exist, so the data near zero are all squashed together. Note: values have been jittered in the vertical direction, but overplotting is still present near $0$.\index{Overplotting}

FIGURE 11.10: A dot plot of the daily streamflow at Mary River from 1960 to 2017, for February (\(n = 1650\)). The vertical grey line is the mean value. Many large outliers exist, so the data near zero are all squashed together. Note: values have been jittered in the vertical direction, but overplotting is still present near \(0\).

TABLE 11.5: Mary River daily streamflow for February (in ML). The mean value is 1 123ML.
Daily streamflow (ML) Number of days
\(0\) to under \(20,000\) \(1636\)
\(20,000\) to under \(40,000\) \(\phantom{0}\phantom{0}\phantom{0}6\)
\(40,000\) to under \(60,000\) \(\phantom{0}\phantom{0}\phantom{0}3\)
\(60,000\) to under \(80,000\) \(\phantom{0}\phantom{0}\phantom{0}4\)
\(80,000\) to under \(100,000\) \(\phantom{0}\phantom{0}\phantom{0}0\)
\(100,000\) to under \(120,000\) \(\phantom{0}\phantom{0}\phantom{0}0\)
\(120,000\) to under \(140,000\) \(\phantom{0}\phantom{0}\phantom{0}0\)
\(140,000\) to under \(160,000\) \(\phantom{0}\phantom{0}\phantom{0}1\)

The streamflow data are very highly right skewed, which is important:

  • Means are best used for approximately symmetric data: the mean is influenced by outliers and skewness.
  • Medians are best used for data that are highly skewed or contain outliers: the median is not influenced by outliers and skewness.

Means tend to be too large if the data contains large outliers or severe right skewness, and too small if the data contains small outliers or severe left skewness. For the Mary River data, the large outliers---and because they are so extreme and abundant---cause the mean to be so much larger than the median. The median is the better measure of average for these data, though these data may be better described using percentiles (Sect. 11.6.4)

The mean is generally used if possible (for practical and mathematical reasons), and is the most commonly-used measure of location. However, the mean is not always appropriate; the median is not influenced by outliers and skewness. The mean and median are similar in approximately symmetric distributions. Sometimes, quoting both the mean and the median may be appropriate.

11.6 Numerical summary: variation

For quantitative data, the amount of variation in the bulk of the data should be described. Many ways exist to measure the variation in a dataset, including:

  • the range: very simple and simplistic, so not often used (Sect. 11.6.1).
  • the standard deviation: commonly used (Sect. 11.6.2).
  • the interquartile range (or IQR): commonly used (Sect. 11.6.3).
  • percentiles: useful in specific situations (Sect. 11.6.4).

As always, a value computed from a sample (the statistic) estimates the unknown value in the population (the parameter), and every sample can produce a different estimate.

11.6.1 Variation: the range

The range is the simplest measure of variation, but not often used.

Definition 11.6 (Range) The range is the maximum value minus the minimum value.

The range is not often used, as it only uses values which are extreme observations. As a result, the range is highly influenced by outliers. Sometimes, the range is given by stating both the maximum and the minimum value in the data instead of giving the difference between these values. The range is measured in the same measurement units as the data, and is usually quoted with the median.

Example 11.18 (The range) For the chest-beating data (Table 11.4), the largest value is \(4.4\), and the smallest value is \(0.7\); hence \[ \text{Range} = 4.4 - 0.7 = 3.7. \] The sample median chest-beating rate is \(1.7\) beats per \(10\) h, with a range of \(3.7\) beats per \(10\) h.

11.6.2 Variation: the standard deviation

The population standard deviation is denoted by \(\sigma\) (the parameter) and is estimated by the sample standard deviation \(s\) (the statistic). The standard deviation is the most commonly-used measure of variation, but is tedious to compute manually. The sample standard deviation is usually calculated using software (such as jamovi) for large amounts of data, or a calculator (in Statistics Mode) for small amounts of data.

The standard deviation is (approximately) the mean distance that observations are from the mean. This seems like a reasonable way to measure the amount of variation in data.

The Greek letter \(\sigma\) is pronounced 'sigma'.

The sample standard deviation estimates the population standard deviation, and every sample is likely to have a different sample standard deviation. We usually only ever have one sample.

Definition 11.7 (Standard deviation) The standard deviation is, approximately, the average distance of the observations from the mean.

Even though you do not have to use the formula to calculate \(s\), we will demonstrate to show exactly what \(s\) calculates. The formula is: \[ s = \sqrt{ \frac{\sum(x - \bar{x})^2}{n - 1} }, \] where \(\bar{x}\) is the sample mean, \(x\) represents the individual data values, \(n\) is the sample size, and the symbol '\(\sum\)' means to add (Sect. 11.5.1). Using the formula requires these steps:

  • calculate the sample mean: \(\overline{x}\);
  • calculate the deviation of each observation \(x\) from the sample mean: \(x - \bar{x}\);
  • square these deviations (to make them all positive values): \((x - \bar{x})^2\);
  • add these squared deviations: \(\sum(x - \bar{x})^2\);
  • divide the answer by \(n - 1\);
  • take the (positive) square root of the answer.

You do not need to use the formula! You must know how to use software or a calculator to find the standard deviation, what the standard deviation measures, and how to use it.

Example 11.19 (Standard deviation) For the chest-beating data (Table 11.4), the squared deviations of each observation from the mean of \(2.2214\) (using four decimal places in calculations) are shown in Fig. 11.11. The sum of the squared distances is \(20.9636\). Then, the sample standard deviation is: \[ s = \sqrt{\frac{20.9636}{14 - 1}} = \sqrt{ 1.612585} = 1.269876. \] The sample mean chest-beating rate is \(2.22\) per \(10\) h, with a sample standard deviation of \(1.27\) per \(10\) h.

The standard deviation is related to the sum of the squared-distances from the mean. The chest-beating data are being used.

FIGURE 11.11: The standard deviation is related to the sum of the squared-distances from the mean. The chest-beating data are being used.

The sample standard deviation \(s\) is:

  • positive (unless all observations are the same, when it is zero: no variation);
  • best used for (approximately) symmetric data;
  • usually quoted with the mean;
  • the most commonly-used measure of variation;
  • measured in the same units as the data;
  • influenced by skewness and outliers, like the mean.

11.6.3 Variation: the inter-quartile range (IQR)

The standard deviation uses the value of \(\bar{x}\), so is affected by skewness just like the sample mean. A measure of variation not affected by skewness is the inter-quartile range, or IQR. To understand the IQR, understanding quartiles is necessary.

Definition 11.8 (Quartiles) Quartiles describe the shape of the data:

  • The first quartile \(Q_1\) is a value separating the smallest \(25\)% of observations from the largest \(75\)%. The \(Q_1\) is like the median of the smaller half of the data, halfway between the minimum value and the median.
  • The second quartile \(Q_2\) is a value separating the smallest \(50\)% of observations from the largest \(50\)%. (This is the also the median.)
  • The third quartile \(Q_3\) is a value separating the smallest \(75\)% of observations from the largest \(25\)%. The \(Q_3\) is like the median of the larger half of the data, halfway between the median and the maximum value.

Quartiles divide the data into four parts of approximately equal numbers of observations. The inter-quartile range (or IQR) is the difference between \(Q_3\) and \(Q_1\).

Definition 11.9 (IQR) The IQR is the range in which the middle \(50\)% of the data lie: the difference between the third and the first quartiles.

Since the IQR measures the range of the central \(50\)% of the data, the IQR is not influenced by outliers. The IQR is measured in the same measurements units as the data.

The sample IQR estimates the population IQR, and every sample is likely to have a different sample IQR. We usually only ever have one sample.

For the chest-beating data (Table 11.4), the median is \(1.7\) (Example 11.16). The data then can be split into the smaller and the larger halves, each with seven values:

  • Smaller half:  \(0.7\)    \(0.9\)    \(1.3\)    \(1.5\)    \(1.5\)    \(1.5\)    \(1.7\)
  • Larger half:    \(1.7\)    \(1.8\)    \(2.6\)    \(3.0\)    \(4.1\)    \(4.4\)    \(4.4\)

Since each half has seven observations, the median of each half is the \((7 + 1)/2 = 4\)th value. Hence:

  • \(Q_1\), the first quartile, is the median of the smaller half: \(Q_1 = 1.5\). About \(25\)% of observations are smaller than \(1.5\).
  • \(Q_2\), the second quartile or median, is \(1.7\). About \(50\)% of observations are smaller than \(1.7\).
  • \(Q_3\), the third quartile, is the median of the larger half: \(Q_3 = 3.0\). About \(75\)% of observations are smaller than \(3.0\).

We say 'about' these values, as exact values cannot be found here; each quartile is required to have \(14/4 = 3.5\) observations, which is not possible. (Software often uses different rules to compute quartiles in these situations.) Using these values, the IQR is \(Q_3 - Q_1\) = \(3.0 - 1.5 = 1.5\), as shown in Fig. 11.12.

A dotchart for the chest-beating data for young gorillas showing the IQR

FIGURE 11.12: A dotchart for the chest-beating data for young gorillas showing the IQR

When \(n\) is odd, the median may or may not be included in each of these halves; we decide not to include the median in each half.

Example 11.20 (Computing the IQR for $n$ odd) The bat data given in Table 11.3) has \(n = 11\). The smaller and larger halves of the data, without the median of \(45\), are:

  • Smaller half:  \(23\)  \(27\)  \(34\)  \(40\)  \(42\)
  • Larger half:   \(52\)  \(56\)  \(62\)  \(68\)  \(83\)

The median of the smaller half is \(Q_1 = 34\); the median of the larger half is \(Q_3 = 62\). Hence, the IQR is \(62 - 34 = 28\) cm.

11.6.4 Variation: percentiles

Percentiles are similar in principle to quantiles.

Definition 11.10 (Percentiles) The \(p\)th percentile of the data is a value separating the smallest \(p\)% of the data from the rest.

For example:

  • the \(12\)th percentile separates the smallest \(12\)% of the data from the rest.
  • the \(67\)th percentile separates the smallest \(67\)% of the data from the rest.
  • the \(94\)th percentile separates the smallest \(94\)% of the data from the rest.

That is, the first quartile \(Q_1\) is the \(25\)th percentile, the second quartile \(Q_2\) is the \(50\)th percentile (and median), and the third quartile \(Q_3\) is the \(75\)th percentile.

Percentiles are especially useful for very skewed data in certain applications. For instance, scientists who monitor rainfall and stream heights, and engineers who use this information, are more interested in extreme weather events rather than the 'average' event. Structures are designed to withstand \(1\)-in-\(100\) year events (the \(99\)th percentile) or similar, rather than 'average' events. Percentiles are measured in the same measurements units as the data.

Example 11.21 (Percentiles) For the streamflow data at the Mary River (Example 11.12), the February data are highly right-skewed (Fig. 11.10). The median (50th percentile) is \(146.1\) ML. However, the 95th percentile is \(3\ 480\) ML and the 99th percentile is \(19\ 043\) ML.

Constructing infrastructure for the median streamflow would be inadequate.

11.6.5 Which measure of variation to use?

Which is the 'best' measure of variation for quantitative data? As with measures of location, it depends on the data.

Since the standard deviation formula uses the mean, it is impacted in the same way as the mean by outliers and skewness. Hence, the standard deviation is best used with approximately symmetric data. The IQR is best used when data are skewed or asymmetric. Sometimes, both the standard deviation and the IQR can be quoted.

11.7 Numerical summary: identifying outliers

Outliers are 'unusual' observations: those quite different (larger or smaller) than the bulk of the data. Outliers are 'unusual', but not necessarily 'wrong' or 'bad' observations. Rules for deciding if an observation is an outlier are always arbitrary.

Definition 11.11 (Outliers) An outlier is an observation that is 'unusual' (either larger or smaller) compared to the bulk of the data. Rules for identifying outliers are arbitrary.

Two rules for identifying outliers are:

  • the standard deviation rule, useful when the data have an approximately symmetric distribution (Sect. 11.7.1).
  • the IQR rule, useful in other situations (Sect. 11.7.2).

11.7.1 The standard deviation rule

This rule uses the mean and the standard deviation, so is suitable for approximately symmetric distributions (when means and standard deviations are sensible numerical summaries). The rationale behind this rule is explained in Sect. 21.3.

Definition 11.12 (Standard deviation rule for identifying outliers) For approximately symmetric distributions, an observation more than three standard deviations from the mean may be considered an outlier.

All rules for identifying outliers are arbitrary, and sometimes the standard deviation rule is given slightly differently. For example, outliers may be identified as observations more than \(2.5\) standard deviations from the mean. Both rules are acceptable, since the definition is arbitrary.

Example 11.22 (Standard deviation rule for identifying outliers) An engineering project (Hald 1952) studied a new building material, to estimate the average permeability. Permeability time (the time for water to permeate the sheets) was measured from \(81\) pieces of material (in seconds).

For these data, the mean is \(\bar{x} = 43.162\) and the standard deviation is \(s = 27.358\). Using the standard deviation rule, outliers are observations smaller than \(43.162 - (3\times 27.358)\) or larger than \(43.162 + (3\times 27.358)\); that is, smaller than \(-38.9\) (which is clearly not appropriate here), or larger than \(125.2\). This rule is shown in Fig. 11.13; two observations are identified as outliers using the standard deviation rule.

Outliers identified using the standard deviation rule for the permeability data

FIGURE 11.13: Outliers identified using the standard deviation rule for the permeability data

11.7.2 The IQR rule

Since the standard deviation rule for identifying outliers relies on the mean and standard deviation, it is not appropriate for non-symmetric distributions. Another rule is needed for identifying outliers in these situations: the IQR rule.

Definition 11.13 (IQR rule for identifying outliers) The IQR rule identifies mild and extreme outliers:

  • Extreme outliers: observations \(3\times \text{IQR}\) more unusual than \(Q_1\) or \(Q_3\).
  • Mild outliers: observations \(1.5\times \text{IQR}\) more unusual than \(Q_1\) or \(Q_3\) (that are not extreme outliers).

This definition is easier to understand using an example.

Example 11.23 (IQR rule for identifying outliers) Using the permeability data seen in Example 11.22, a computer shows that \(Q_1 = 24.7\) and \(Q_3 = 50.6\), so \(\text{IQR} = {50.6 - 24.7 = 25.9}\). Then, extreme outliers are observations \(3\times 25.9 = 77.7\) more unusual than \(Q_1\) or \(Q_3\). That is, extreme outliers are observations:

  • more unusual than \(24.7 - 77.7 = -53.0\) (that is, less than \(-53.0\)); or
  • more unusual than \(50.6 + 77.7 = 128.3\) (that is, greater than \(128.3\)).

Mild outliers are observations \(1.5\times 25.9 = 38.9\) more unusual than \(Q_1\) or \(Q_3\) (that are not extreme outliers). That is, mild outliers are

  • more unusual than \(24.7 - 38.9 = -14.2\) (that is, less than \(-14.2\)); or
  • more unusual than \(50.6 + 38.9 = 89.5\) (that is, greater than \(89.5\)).

Three observations are identified as outliers using the IQR rule (Fig. 11.14.

Mild and extreme outliers, using the IQR rule, for the permeability data

FIGURE 11.14: Mild and extreme outliers, using the IQR rule, for the permeability data

11.7.3 Which outlier rule to use?

The standard deviation rule is most appropriate for approximately symmetric distributions; the IQR rule can be used for any distribution, but primarily for those skewed or with outliers. But remember: all rules for identifying outliers are arbitrary!

11.7.4 What to do with outliers?

What should be done if outliers are identified in data? Remember that outliers are unusual observations; they are not necessarily wrong. Deleting or removing outliers simply because they are identified as outliers is a very bad idea. After all, the outliers were obtained from your study like all other observations... they deserve to be in your data as much as any other observation. In addition, the rules for identifying outliers are arbitrary: some observations may be identified as outliers using one rule, but not by another.

The strategy for managing outliers depends on the reason for the outlier (P. K. Dunn and Smyth (2018), p. 138):

  • The outlier is clearly a mistake (e.g., an age of \(222\)): If the mistake cannot be fixed (e.g., the completed survey form is lost), the observation can be deleted. Similarly, if the outlier comes from an error or mistake in the data collection (e.g., too much fertiliser was accidentally applied), the observation can be deleted.
  • The outlier represents a different population: Suppose an outlier is identified in a study of students, corresponding to a student aged 65. If the next oldest student in the data is aged \(39\), the outlier can be removed, since it belongs to a different population ('students aged over \(40\)') than the other observations ('students aged \(40\) and under'). The remaining observations can be analysed, but the results only apply to students aged under 40 (which should be clearly communicated).
  • The reason for the outlier is unknown: Discarding outliers routinely is not recommended; the outliers are probably real observations that are just as valid as the others. Perhaps a different analysis is necessary (e.g., using medians rather than means). Furthermore, very large datasets are expected to have observations identified as outliers using the above arbitrary rules.

In all cases, whenever observations are removed from a dataset, this should be clearly explained and documented.

Example 11.24 (Outliers) The Mary River dataset (Sect. 11.5) has many extremely large outliers identified by software, but each is reasonable. They probably correspond to flooding events. Removing these from the analysis would be inappropriate.

Example 11.25 (Outliers) The permeability data (Example 11.22) has large outliers, but all seem reasonable. Removing these from the analysis would be inappropriate.

11.8 Numerical summary tables

In studies with quantitative variables, these variables should be summarised in a table. The table should include, as a minimum, measures of average and variation. An example is given in the next section.

11.9 Example: water access

López-Serrano et al. (2022) recorded data about access to water for three rural communities in Cameroon. Three quantitative variables are recorded. Part of the understanding the data requires summarising the quantitative variables; histograms are shown in Fig. 11.15, and a summary table in Table 11.6).

A large number of households are coordinated by women in their late 50s. The number of people and number of children under 5 are both right-skewed. One household has over 30 people, and has 10 children in that household. (These are identified as outliers, but are not mistakes.) Some observations are missing, explaining the differences in sample sizes.

TABLE 11.6: Summarising the quantitative data in the water-access study
\(n\) Mean Median Std. dev. IQR Min. Max.
Woman's age (years) \(120\) \(41.6\) \(40.5\) \(14.56\) \(30.25\) \(19\) \(61\)
Household size \(121\) \(\phantom{0}7.0\) \(\phantom{0}6.0\) \(\phantom{0}4.80\) \(\phantom{0}4.00\) \(\phantom{0}0\) \(32\)
Children under \(5\) \(120\) \(\phantom{0}1.6\) \(\phantom{0}1.0\) \(\phantom{0}1.65\) \(\phantom{0}2.00\) \(\phantom{0}0\) \(10\)
The age of the women, the number of people in the household, and the number of children in the household for the water-access study.

FIGURE 11.15: The age of the women, the number of people in the household, and the number of children in the household for the water-access study.

11.10 Chapter summary

Quantitative data can be graphed using a histogram, stemplot (in special circumstances), or dot charts. Quantitative data can be summarised numerically; the most common techniques are indicated in Table 11.7. The mean and standard deviation are usually used whenever possible, for practical and mathematical reasons. Sometimes quoting both the mean and median (and the standard deviation and IQR) may be appropriate.

The following short video may help explain some of these concepts:

TABLE 11.7: Summarising quantitative data
For distributions that with a shape that is:
Feature Approximately symmetric Not symmetric, or has outliers
Average Mean Median
Variation Standard deviation IQR
Outliers Standard deviation rule IQR rule

11.11 Quick review questions

Are the following statements true or false?

  1. The IQR measures the amount of variability in data.
  2. The mean and the median can both be called an 'average'.
  3. The mean and the median are not always the same value.
  4. The range is a simple measure of variation in a set of data.
  5. The standard deviation measures the amount of variability in data.
  6. Another name for the median is \(Q_2\).
  7. \(Q_3\) is the median of the largest half of the data.
  8. The IQR is a useful measure of variation in data that are skewed.
  9. The IQR is the difference between the first and second quartiles.
  10. Another name for the \(75\)th percentile is \(Q_3\).
  11. The units of the standard deviation and the IQR are the same as for the original data.

11.12 Exercises

Answers to odd-numbered exercises are available in App. E.

Exercise 11.1 The Australian Bureau of Statistics (ABS) records the age at death of Australians. The histogram of the age of death for females in 2012 is shown in Fig. 11.16 (left panel). Describe the distribution.

Left: histograms of age at death for female Australians in 2012. Right: the oven-dried foliage biomass for naturally-grown lime trees.

FIGURE 11.16: Left: histograms of age at death for female Australians in 2012. Right: the oven-dried foliage biomass for naturally-grown lime trees.

Exercise 11.2 [Dataset: Lime] Schepaschenko et al. (2017b) measured the oven-dried foliage biomass of lime trees grown in natural environments. The histogram of the foliage biomass is shown in Fig. 11.16 (right panel). Describe the distribution.

Exercise 11.3 [Dataset: NHANES] The histogram of the direct HDL cholesterol concentration from the American National Health and Nutrition Examination Survey (NHANES) (Pruim 2015) from \(1999\)--\(2004\) is shown in Fig. 11.17 (left panel).

  1. Should the mean or median be used to measure the 'average' HDL cholesterol concentration? Explain.
  2. Describe the distribution.
Left: the histogram of direct HDL cholesterol concentration from the NHANES study (large outliers exist but are hard to see). Right: the weights of `Fun Size' Cherry Ripe chocolate bars.

FIGURE 11.17: Left: the histogram of direct HDL cholesterol concentration from the NHANES study (large outliers exist but are hard to see). Right: the weights of `Fun Size' Cherry Ripe chocolate bars.

Exercise 11.4 [Dataset: CherryRipe] The histogram of the weights of 'Fun Size' Cherry Ripe chocolate bars after 2016 is shown in Fig. 11.17 (right panel).

  1. Should the mean or median be used to measure the 'average' HDL cholesterol concentration? Explain.
  2. Describe the distribution.

Exercise 11.5 Levenson (2005) recorded the number of fatalities at amusement rides in the US from \(1994\) to \(2003\) (Table 11.8). Using software or a calculator, compute:

  1. the mean number of fatalities per year over this period.
  2. the median number of fatalities per year over this period.
  3. the standard deviation number of fatalities per year over this period.
TABLE 11.8: Fatalities at amusement park rides in the US
1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
Fatalities: \(2\) \(4\) \(3\) \(4\) \(7\) \(6\) \(1\) \(3\) \(2\) \(5\)

Exercise 11.6 Furness and Bryant (1996) studied fulmars (a seabird) . The mass of the female birds were (in grams): \(635\);\(635\);\(668\);\(640\);\(645\);\(635\).

  1. Construct a stemplot (using the first two digits as the stems).
  2. Using your calculator, find the value of the sample mean.
  3. Using your calculator, the value of the sample standard deviation.
  4. Find the value of the sample median.
  5. Find the value of the population standard deviation.

Exercise 11.7 Draw a stemplot of the average monthly SOI in August from \(1995\) to \(2000\) (Table 11.9). Then, use your calculator (where possible) to calculate the:

  1. sample mean
  2. sample median.

 

  1. range.
  2. sample standard deviation.
TABLE 11.9: The average monthly SOI values in August from 1995 to 2000
1995 1996 1997 1998 1999 2000
Monthly average SOI: \(0.8\) \(4.6\) \(-19.8\) \(9.8\) \(2.1\) \(5.3\)

Exercise 11.8 [Dataset: FriesWt] Wetzel (2005) weighed orders of French fries to determine how they matched the target weight of \(171\) g (Table 11.10).

  1. Produce graphs to summarise the data.
  2. Use software to produce numerical summary information.

Do you think the weights meet the target weight, on average?

Weight of large orders of fries (in g)
TABLE 11.10: The weight of servings of French fries
\(117.0\) \(132.0\) \(134.0\) \(139.0\) \(141.0\) \(143.0\) \(146.0\) \(152.0\) \(154.0\) \(155.0\) \(157.0\)
\(126.0\) \(133.0\) \(137.0\) \(139.0\) \(142.0\) \(143.5\) \(146.0\) \(152.0\) \(154.5\) \(156.0\) \(176.0\)
\(128.0\) \(133.0\) \(138.0\) \(140.0\) \(142.5\) \(145.0\) \(151.0\) \(154.0\) \(154.5\) \(156.5\) \(117.0\)

Exercise 11.9 [Dataset: Orthoses] In a study of the influence of using ankle-foot orthoses in children with cerebral palsy (Swinnen et al. 2018), the data in Table 10.3 describe the \(15\) subjects.

  1. Compute the mean, median, standard deviation and IQR for the heights.
  2. Produce a stemplot of the children's heights.
  3. Produce a dot chart of the children's heights.
  4. Produce a histogram of the children's heights.
  5. Describe the distribution of the children's heights.

Exercise 11.10 An article studied patients who had been admitted to Castle Hill Hospital (Jenner et al. 2022). The total number of microplastics found in the lungs of each patients are shown in Table 11.11. For these patients:

  1. Draw a stemplot, using the numbers as (say) \(8.0\), with the decimals as the leaves.
  2. What is the mean and median number of microplastics found?
  3. What is the standard deviation of the number of microplastics found?
  4. What is the IQR of the number of microplastics found?
Number of microplastics
TABLE 11.11: The number of microplastics found in \(11\) patients
\(8\) \(3\) \(5\) \(2\) \(0\) \(2\) \(1\) \(7\) \(5\) \(1\) \(0\)

Exercise 11.11 Describe the histogram in Fig. 11.5.

Exercise 11.12 The standard deviation for Dataset A in Fig. 11.18 is \(s = 2\). Will the standard deviation of Dataset B be smaller than or greater than \(2\)? Why?

Dotplots of two datasets

FIGURE 11.18: Dotplots of two datasets

Exercise 11.13 A tutor has recorded the marks (as a percentage) for all students in her four classes for an assignment. All classes have the same number of students. The corresponding histograms are shown in Fig. 11.20.

  1. In which class would the median be the largest?
  2. In which class would the median be the smallest?
  3. In which class would the standard deviation be the largest?
  4. In which class would the standard deviation be the smallest?
Histogram of marks for four classes

FIGURE 11.19: Histogram of marks for four classes

Exercise 11.14 Consider the four histograms in Fig. 11.20. Which histogram is most likely to describe the shape of the following data? Why?

  1. The time that students remain in the examination room for a short or easy two-hour examination.
  2. The heights of females at a local adults dance club.
  3. The starting salaries of new science graduates employed full-time.
  4. The volume of drink in \(375\) ml cans of soft drink.
  5. The time that students remain in the examination room for a hard or long two-hour examination.

(The first bar of the histogram is not necessarily at zero; it is the shape of the histogram that is of interest here: right skewed, left skewed, symmetric, etc.)

Four histograms: where would they be useful?

FIGURE 11.20: Four histograms: where would they be useful?