The refresher course consists of ten chapters:

  1. Set Theory
  2. Mathematical Operations
  3. Equations & Inequalities
  4. Functions & Graphs
  5. Statistical Operations
  6. The Language of Statistics
  7. Working with Numbers, and Data Display
  8. The Normal Distribution
  9. Central Tendency
  10. Variability

The first five chapters of this refresher course are based on:

Franken, W.M. & Bouts, R.A. (2002). Wiskunde voor statistiek: Een voorbereiding. Bussum, Netherlands: Uitgeverij Coutinho

Chapters 6 to 10 are based on:

Landers, R.N. (2019). A Step-by-Step Introduction to Statistics for Business. Sage Publications

The module illustrates all topics discussed in the textbooks, and shows how they can be implemented in R.

Along the way, you will learn many of the basic functions of R, as well as functions from handy R packages.

This file is about chapter 1: sets in mathematics.

1. Set Theory

1.1 Sets in Mathematics

Intuitively, we understand the concept of sets. In dictionaries, you will find definitions like a group or collection of things that belong together or resemble one another.

In mathematics, sets have a pretty precise definition:

  • A set is a well-defined collection of objects.
  • Each object in a set is called an element of the set.
  • Two sets are equal if they have exactly the same elements in them.
  • A set that contains no elements is called a null set or an empty set.
  • If every element in Set A is also in Set B, then Set A is a subset of Set B.

Example: if A is the set of the first five letters of the alphabet, then: A = {a,b,c,d,e}

In R we can create this set as follows.

We can decide on the name of the set. Here, we use SetA.

The two characters “<” and “-” look like an arrow.

We combine the characters on the right of the arrow, using the c() function.

We then assign this combination to SetA, on the left-hand side.

Once we have defined SetA, we can check if certain elements, like “a” and “f” are elements of the set.

SetA <- c("a","b","c","d","e")
"a" %in% SetA
## [1] TRUE
"f" %in% SetA
## [1] FALSE

Likewise, we can create a set of numbers, from 1 to 100.

SetN <- c(1:100)  
length(SetN)
## [1] 100
88 %in% SetN  # Is 88  an element of our set? [TRUE, it is!]
## [1] TRUE
105 %in% SetN  # Is 105 an element of our set? [FALSE, it is not!]
## [1] FALSE

1.2 Relations between Sets

Empty Set

An empty set is a set without elements.

SetEmpty <- NULL
length(SetEmpty)   # the length of an empty set is zero
## [1] 0
"2" %in% SetEmpty  # is 2 part of our empty set?
## [1] FALSE

Identical Sets

Apart from the sequence of elements, the two sets below are identical.

All elements of A occur in B, and all elements of B are part of A.

SetA <- c("1","3","4","6")
SetB <- c("1","4","6","3")
SetB %in% SetA
## [1] TRUE TRUE TRUE TRUE
SetA %in% SetB
## [1] TRUE TRUE TRUE TRUE

We can test the differences and similarities using the commands below.

The overlap between two sets is called the intersection. The intersection of two sets is written as \(A \cap B\).

[Figure: the intersection of two sets]

All elements that are part of either of the two sets (A, B, or both) form the union of the two sets. The union of two sets is written as \(A \cup B\).

[Figure: the union of two sets]

setdiff(SetA, SetB)
## character(0)
setequal(SetA, SetB)
## [1] TRUE
intersect(SetA, SetB)
## [1] "1" "3" "4" "6"
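
The difference between comparing order and comparing membership can be made explicit. The sketch below redefines SetA and SetB so that it runs on its own, and contrasts the base R function identical(), which also compares the order of the elements, with setequal(), which only looks at membership.

```r
SetA <- c("1","3","4","6")
SetB <- c("1","4","6","3")

identical(SetA, SetB)              # FALSE: same elements, but in a different order
setequal(SetA, SetB)               # TRUE: the order of elements is ignored
identical(sort(SetA), sort(SetB))  # TRUE: after sorting, the two vectors match exactly
```

For set comparisons, setequal() is usually the safer choice, as sets have no inherent order.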

Let’s look at some other examples.

The two sets below are not identical, and each contains duplicates.

The result of union() shows that duplicates are removed! The element “a” appears twice in SetA, but only once in the union of both sets.

SetA <- c("a","a","b","c","d")
SetB <- c("c","c","d","e")
union(SetA, SetB)
## [1] "a" "b" "c" "d" "e"

There are various functions in R that help you detect unique and duplicate cases.

Some of these functions are unique() and duplicated().

Applying unique() to a set filters out any duplicates.

The function duplicated() indicates, for each element, starting from the first element in the set, if it is the same as any of the preceding elements. Since the first element, by definition, has no predecessor, the indication has to be FALSE. If, like in SetA, the second element is identical to the first element, it will be TRUE. Going through the entire set, we generate a list of duplicates: elements with indicator TRUE.

Note that duplication means that the same element occurs twice or more. That is, triplication, quadruplication, and so on, are just subsets of duplication: elements appearing three times or more are a subset of the elements appearing twice or more.

We can use the result of duplicated() to index SetA. Since SetA has five elements (including duplicates), we can refer to elements using square brackets.

For example, SetA[1] returns the first element of SetA, and SetA[3:5] returns the third to fifth elements of SetA. An alternative way to obtain the latter would be to use a list of TRUE and FALSE indicators, here c(F,F,T,T,T).

SetA[1]
## [1] "a"
SetA[3:5]
## [1] "b" "c" "d"
SetA[c(F,F,T,T,T)] # Equivalent to previous!
## [1] "b" "c" "d"

Since the duplicated() function produces such a list, we can use it to index SetA. However, duplicated() returns TRUE for duplicated elements. If we want to detect unique elements, we use the NOT operator (the exclamation mark, !) to change the TRUE indicators to FALSE, and FALSE indicators to TRUE.

(SetA)
## [1] "a" "a" "b" "c" "d"
unique(SetA)             # Removes duplicates
## [1] "a" "b" "c" "d"
duplicated(SetA)
## [1] FALSE  TRUE FALSE FALSE FALSE
SetA[!duplicated(SetA)]  # Removes duplicates, in a slightly more complicated way
## [1] "a" "b" "c" "d"
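
To list which values occur more than once, and how often, duplicated() can be combined with unique() and table(). This is a minimal sketch; the vector x and the name dup_values are made up for the illustration.

```r
x <- c("a","a","b","c","c","c")          # hypothetical vector: "a" twice, "c" three times

duplicated(x)                            # FALSE TRUE FALSE FALSE TRUE TRUE
dup_values <- unique(x[duplicated(x)])   # values that occur at least twice
dup_values                               # "a" "c"
table(x)[dup_values]                     # number of occurrences of each duplicated value
```

Note that "c" is flagged twice by duplicated(), which is why unique() is applied to the flagged elements.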

Differences between sets can be checked using the setdiff() function. The function only compares two sets at a time, and the order is important, as you see below!

setdiff(SetA, SetB)  # a and b occur in A, but not in B
## [1] "a" "b"
setdiff(SetB, SetA)  # e occurs in B, but not in A
## [1] "e"
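
Because setdiff() is one-directional, the elements that occur in exactly one of the two sets (the symmetric difference) need both calls. Base R has no dedicated function for this, so the helper sym_diff() below is a hypothetical name introduced for this sketch.

```r
SetA <- c("a","a","b","c","d")
SetB <- c("c","c","d","e")

# Hypothetical helper: elements that are in x or in y, but not in both
sym_diff <- function(x, y) union(setdiff(x, y), setdiff(y, x))

sym_diff(SetA, SetB)                                   # "a" "b" "e"
setequal(sym_diff(SetA, SetB), sym_diff(SetB, SetA))   # TRUE: here the order does not matter
```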

Advanced

To challenge your skills, we can define the union of sets A and B as follows:

  • Elements unique to A, plus:
  • Elements unique to B, plus:
  • Elements in the intersection of A and B.

We can sort the combination of these three parts alphabetically, and check if indeed the result is equal to what we have defined as the union:

sort(c(setdiff(SetA, SetB), setdiff(SetB, SetA), intersect(SetB, SetA)))
## [1] "a" "b" "c" "d" "e"
all(sort(c(setdiff(SetA, SetB), setdiff(SetB, SetA), intersect(SetB, SetA))) == union(SetA, SetB))
## [1] TRUE

The all() function can be used to check whether all elements of one set are contained in the other.

All elements of set C (a duplicated element “a”) are part of D.

One element of D is not contained in C.

SetC <- c("a","a"); SetD <- c("a","b")
SetC %in% SetD
## [1] TRUE TRUE
all(SetC %in% SetD)
## [1] TRUE
SetD %in% SetC
## [1]  TRUE FALSE
all(SetD %in% SetC)
## [1] FALSE
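
The subset test can be wrapped in a small function. The names is_subset() and is_proper_subset() are hypothetical helpers for this sketch, not base R functions.

```r
# Hypothetical helpers built on %in%, all() and setequal()
is_subset        <- function(x, y) all(x %in% y)
is_proper_subset <- function(x, y) is_subset(x, y) && !setequal(x, y)

SetC <- c("a","a"); SetD <- c("a","b")

is_subset(SetC, SetD)         # TRUE
is_subset(SetD, SetC)         # FALSE
is_subset(NULL, SetD)         # TRUE: the empty set is a subset of every set
is_proper_subset(SetD, SetD)  # FALSE: a set is a subset of itself, but not a proper subset
```

The third call works because all() applied to an empty logical vector returns TRUE.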

1.3 Sets of Numbers

Sets can contain any type of element: characters, binary indicators (like TRUE or FALSE), numbers, and more.

It can be tedious to create sets of characters or words, as you have to do a lot of typing.

Sets of numbers are often easier, especially if it’s a sequence of numbers, say 1, 2, ... up to 99.

You can use the length() function to return the number of elements (here: numbers).

You can access specific elements using indexing, between square brackets. Note that round brackets () do not work for indexing. Computers can be very picky.

(SetN <- c(-10:+10)) # Putting the expression between brackets, prints the object in the console!
##  [1] -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8
## [20]   9  10
length(SetN)
## [1] 21
SetN[3]
## [1] -8
SetN[3] < SetN[4]
## [1] TRUE
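
Indexing with TRUE and FALSE indicators also lets you select elements by a condition. The following sketch, using the same range of numbers, shows a few such selections; the conditions are chosen only for illustration.

```r
SetN <- -10:10

SetN[SetN > 0]                    # the positive numbers: 1 2 3 4 5 6 7 8 9 10
SetN[SetN > 0 & SetN %% 2 == 0]   # positive and even: 2 4 6 8 10
seq(-10, 10, by = 5)              # a sequence with steps of 5: -10 -5 0 5 10
```

The operator %% returns the remainder of a division, so SetN %% 2 == 0 is TRUE for even numbers.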

1.4 Application of Set Theory, and Sets

You may wonder why all of the above would be of interest to a data scientist like yourself.

There are at least three reasons we would like to stress here.

  • Logical thinking

First of all, dealing with data and data analysis will force you to think logically about what it all means. Even working with just one data set, say, a (partial) list of employees in your organization who have filled out a survey questionnaire, requires you to first assess the quality of the list. How many employees have filled out the survey twice or more (and what do we do with duplicates, especially if the responses of duplicate respondents are inconsistent)? Is the list of (unique) respondents a subset of the employees? Is the response rate the same for all departments of the organization? And so on and so forth.

Answers to such questions can be obtained using the tools discussed above, or many other tools. But it is the thinking that counts!
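
As a rough sketch of how the survey questions above could be answered with the tools from this chapter, consider the made-up lists below. The employee IDs and all object names are hypothetical.

```r
employees   <- c("E01","E02","E03","E04","E05")   # hypothetical list of all employees
respondents <- c("E02","E04","E04","E09")         # hypothetical list of survey respondents

dup     <- respondents[duplicated(respondents)]           # who filled out the survey more than once?
unknown <- setdiff(unique(respondents), employees)        # respondents who are not on the employee list
is_sub  <- all(respondents %in% employees)                # are the respondents a subset of the employees?
rate    <- length(intersect(respondents, employees)) / length(employees)  # response rate

dup      # "E04"
unknown  # "E09"
is_sub   # FALSE
rate     # 0.4
```

Each of these results raises a follow-up question (which of the two E04 responses do we keep, and who is E09?), which is exactly the kind of thinking meant here.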

  • Comparing data sets

One of the starting points in the data mining process is data understanding. Often, your analysis is based on data from various sources. Before you start your analysis, you have to understand all sources in depth. Combining, or merging, data sets from various sources almost without exception reveals errors, inconsistencies and gaps. Applying the basics of set theory will help you in preparing high-quality, analyzable data sets. Or to put it negatively: data science follows the garbage-in-garbage-out principle!

  • Probability theory

In some of the modules, we will use probability theory. Concepts like conditional probability have close links to set theory. It helps if you’re familiar with the terminology, and with notations like \(\cup\) and \(\cap\).
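
To give an impression of that link, the sketch below treats the outcomes of one roll of a fair die as a set, and computes a conditional probability from an intersection. The example and all names (omega, Even, High, p) are assumptions made for this illustration.

```r
omega <- 1:6          # all possible outcomes of one roll of a fair die
Even  <- c(2, 4, 6)   # event: the outcome is even
High  <- c(4, 5, 6)   # event: the outcome is larger than 3

# Probability of an event: number of outcomes in the event / number of possible outcomes
p <- function(event) length(event) / length(omega)

p_even_and_high   <- p(intersect(Even, High))   # 2/6
p_even_or_high    <- p(union(Even, High))       # 4/6
p_even_given_high <- p_even_and_high / p(High)  # (2/6) / (3/6) = 2/3

# The probability of the union equals P(Even) + P(High) - P(Even and High)
isTRUE(all.equal(p_even_or_high, p(Even) + p(High) - p_even_and_high))  # TRUE
```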

1.5 Exercises

1.5.1 Exercise 1

Given are the following sets:

A = {2,4,6,8,10,12,14,16}

B = {1,2,3,4,5,6,7}

C = {3,6,9,12,15,18}

  1. A \(\cap\) B = (Intersection of A and B)
  2. A \(\cap\) C =
  3. B \(\cap\) C =
  4. A \(\cap\) B \(\cap\) C =
  5. A \(\cup\) B = (Union of A and B)
  6. A \(\cup\) B \(\cup\) C = (Union of A, B and C)

Solutions

Let’s first create the sets A, B and C

A <- seq(2, 16, 2)
B <- c(1:7)
C <- seq(3, 18, 3)
A; B; C  # Prints the three sets
## [1]  2  4  6  8 10 12 14 16
## [1] 1 2 3 4 5 6 7
## [1]  3  6  9 12 15 18
  1. A \(\cap\) B
(intersect(A, B)) # Putting the command between brackets, prints the result
## [1] 2 4 6
  2. A \(\cap\) C
(intersect(A, C))
## [1]  6 12
  3. B \(\cap\) C
(intersect(B, C))
## [1] 3 6
  4. A \(\cap\) B \(\cap\) C

intersect(A, B, C) will give an error message, as intersect() only works on two sets!

We can do it in two steps: first, we determine the intersection of A and B, and then the intersection of that result with C!

You can imagine that this is doable for three sets. But if we were to find the intersection of many sets, then the expression would get very lengthy!

A better alternative is to use the Reduce() function.

Often, you can find solutions to these kinds of challenges by just googling.

The Reduce() function, for example, we found on link

intersect(intersect(A,B),C)
## [1] 6
Reduce(intersect, list(A,B,C))
## [1] 6
  5. A \(\cup\) B
sort(union(A, B))
##  [1]  1  2  3  4  5  6  7  8 10 12 14 16
  6. A \(\cup\) B \(\cup\) C
sort(Reduce(union, list(A, B, C)))
##  [1]  1  2  3  4  5  6  7  8  9 10 12 14 15 16 18

Are you in for a challenge?

Which elements are in intersections of sets, but not in all sets?

(AB <- intersect(A,B)) # Pairwise intersection A and B; stored as set AB
## [1] 2 4 6
(AC <- intersect(A,C)) # The same, for A and C
## [1]  6 12
(BC <- intersect(B,C)) # The same, for B and C
## [1] 3 6
(sort(pairs <- Reduce(union, list(AB,AC,BC)))) # All pairwise intersections combined, stored as "pairs"
## [1]  2  3  4  6 12
(trios <- Reduce(intersect, list(A,B,C))) # Elements present in the intersection of all three sets
## [1] 6
sort(setdiff(pairs, trios)) # Elements in pairs, but not in trios
## [1]  2  3  4 12

You see that elements 2, 3, 4 and 12 are in pairwise intersections, but not in the three-way intersection!

Graphically, just to convince you:

1.5.2 Exercise 2

Are the following statements true or false?

  1. {2,4} \(\subset\) (A \(\cap\) B)
  2. 6 \(\in\) (A \(\cap\) B \(\cap\) C)
  3. {6} \(\subset\) (A \(\cap\) B)
  4. (A \(\cap\) B) \(\subset\) (A \(\cup\) B)

Solutions

  1. {2,4} \(\subset\) (A \(\cap\) B)
SetTest <- c(2,4)
all(SetTest %in% intersect(A, B)) # Test against the intersection of A and B, not against A alone
## [1] TRUE
  2. 6 \(\in\) (A \(\cap\) B \(\cap\) C)
  3. {6} \(\subset\) (A \(\cap\) B)
# In the formulation of statement 3: a set of one element

SetTest <- c(6)
trios
## [1] 6
all(SetTest %in% trios) # Remember we stored the three-way intersection as "trios"!
## [1] TRUE
# Alternatively (as in statement 2), define 6 as a single element rather than a set of one element

6 %in% trios
## [1] TRUE
# Note that statements 2 and 3 are basically the same! Every element of trios is also in the intersection of A and B.
  4. (A \(\cap\) B) \(\subset\) (A \(\cup\) B)

This is true in general. Elements in the intersection of A and B are by definition part of both A and B. The intersection is therefore a subset of all elements in A or B!

Just to train the formulation of this exercise in R:

all(intersect(A, B) %in% union(A, B))
## [1] TRUE

1.5.3 Exercise 3

A = {1,2,3,4,5,6}

B = {5,6}

C = {1,2,5,6}

D = {2,3,4}

E = {2,3,4,5}

Complete the statement with one of the symbols:

  • \(\in\) (element of)
  • \(\notin\) (not an element of)
  • \(\subset\) (subset of)
  • \(\supset\) (superset; if A is subset of B, then B is superset of A)
  • \(\cap\) (intersection)
  • \(\cup\) (union)
  1. B….C
  2. B….C = B
  3. B….C = C
  4. B….D = \(\emptyset\) (empty set)
  5. C….D = A
  6. D….E = D
  7. 4….C \(\cap\) B
  8. D….E….A

Solutions

Use reason to answer each of the questions!

As an additional challenge, formulate the sets in R, and use any of the commands introduced in this chapter to check your answer!

  1. \(\subset\) (B is obviously a subset of C, as all elements of B are also in C)
  2. \(\cap\) (the intersection of B and C is equal to B, as B is a subset of C)
  3. \(\cup\) (the combined elements of B and C are equal to C, as C is a superset of B)
  4. \(\cap\) (as B and D have no elements in common, the intersection is the empty set)
  5. \(\cup\) (all elements of C and D combined match A)
  6. \(\cap\) (D is a subset of E, so the intersection of D and E is equal to D)
  7. \(\notin\) (4 is not part of the intersection of B and C; it is not even part of the union of B and C)
  8. \(\subset\) (D is a subset of E, which in turn is a subset of A; you can conclude that therefore D is a subset of A)

Working with data files (in data science) requires logical thinking. Set theory is a good exercise in logical thinking!
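
One possible way to take up the additional challenge is sketched below. Each line translates one completed statement into an R expression that should return TRUE; the vector name checks is an assumption for this sketch.

```r
A <- c(1,2,3,4,5,6); B <- c(5,6); C <- c(1,2,5,6); D <- c(2,3,4); E <- c(2,3,4,5)

checks <- c(
  s1 = all(B %in% C),                     # 1. B is a subset of C
  s2 = setequal(intersect(B, C), B),      # 2. the intersection of B and C equals B
  s3 = setequal(union(B, C), C),          # 3. the union of B and C equals C
  s4 = length(intersect(B, D)) == 0,      # 4. the intersection of B and D is empty
  s5 = setequal(union(C, D), A),          # 5. the union of C and D equals A
  s6 = setequal(intersect(D, E), D),      # 6. the intersection of D and E equals D
  s7 = !(4 %in% intersect(C, B)),         # 7. 4 is not an element of the intersection of C and B
  s8 = all(D %in% E) && all(E %in% A)     # 8. D is a subset of E, which is a subset of A
)
all(checks)   # TRUE: all eight statements hold
```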

1.5.4 Exercise 4

Consider the following sets.

  • A = {x | x is an even number and x<20; x is a positive natural number}
  • B = {x | x is a multiple of 3 and x<20; x is a positive natural number}
  • C = {x | x is an odd number and x<20; x is a positive natural number}

This looks cryptic. Set A, for example, reads like the set of numbers x conditioned by the following rules: x is a positive natural number (1, 2, 3 to infinity), smaller than 20 and divisible by 2. We can enumerate these numbers easily: 2, 4, 6 .. up to 18.

Set B then is 3, 6, 9 .. up to 18.

Set C is 1, 3, 5 .. up to 19.

Determine:

  1. A \(\cap\) B
  2. A \(\cap\) B \(\cap\) C
  3. A \(\cup\) B

Again, create the sets in R and use the proper functions to get the solutions!

Solutions

  1. {6, 12, 18}
  2. \(\emptyset\) (empty set)
  3. {2,3,4,6,8,9,10,12,14,15,16,18}
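
A minimal sketch of how the sets could be created in R, assuming we start from the positive natural numbers below 20 and filter them with the remainder operator %%:

```r
x <- 1:19                 # the positive natural numbers smaller than 20

A <- x[x %% 2 == 0]       # even numbers
B <- x[x %% 3 == 0]       # multiples of 3
C <- x[x %% 2 == 1]       # odd numbers

intersect(A, B)                    # 6 12 18
Reduce(intersect, list(A, B, C))   # integer(0): the empty set, as no number is both even and odd
sort(union(A, B))                  # 2 3 4 6 8 9 10 12 14 15 16 18
```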

2. Mathematical Operations

2.1 Addition

Probably the most basic mathematical operation is adding two or more numbers.

You can use R as a calculator. Type your mathematical expression in the console, and get the result instantaneously.

A better way to use R is to write your code in R scripts. Most often, you will assign values to objects. Below we assign the value 2 to a, and 5 to b. You can then assign the sum \(a+b\) to another object, \(c\).

2+5
## [1] 7
a = 2; b = 5
c = a + b
cat(a,"+",b,"=",c,"\n")
## 2 + 5 = 7
# if c = a + b, then b = c - a
c-a
## [1] 5
b
## [1] 5

2.2 Multiplication and Division

Multiplication is like adding a number several times. Adding 5 four times is equivalent to multiplying 5 by 4. See below.

Often, in data science, you will multiply non-integer numbers, like \(4.5*3.88\). The analogy (adding 4.5 3.88 times) is somewhat harder to envision.

The operator for multiplication is the asterisk (*). Although in mathematical textbooks you may find \(xy\) as shorthand for x times y, that doesn’t work in R and other software. You have to use \(x*y\) in your code!

5+5+5+5
## [1] 20
4*5
## [1] 20

\(a+a+a+a = 4*a\) (or 4a, for short)

\(4a = 20\)

We can divide both sides by 4 to find a:

\(4a/4 = 20/4\) \(\Rightarrow\) \(a=5\)
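
The same reasoning can be replayed in R. This is a small sketch using rep() to repeat a value and sum() to add the repeats; the object names are chosen only for illustration.

```r
a <- 5

sum(rep(a, 4))            # adding a four times: 20
4 * a                     # multiplying a by 4: 20
sum(rep(a, 4)) == 4 * a   # TRUE: both routes give the same result

20 / 4                    # dividing by 4 recovers a: 5
```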

2.3 Exponentiation

Exponentiation is equivalent to multiplying by the same number, several times. For instance, \(5*5\) is the same as 5 raised to the power 2, or \(5^{2}\). Exponentiation in R uses the operator ^, like in the example below.

Exponents do not have to be integers (1, 2, …), but can be non-integer numbers (e.g. 1.2, 2.8, …). A special case is an exponent of \(0.5\). Raising a number to the power \(0.5\) is the same as taking its square root. Taking the square root of a number is the reverse of taking the square.

If \(x^{2} = y\), then \(y^{0.5} = x\). For example, the square of 5 is \(5*5 = 25\); reversely, the square root of 25 is 5.

Other special cases are exponents of 0 and 1.

\(x^{0} = 1\)

\(x^{1} = x\)

5*5*5*5
## [1] 625
5^4
## [1] 625
5*5
## [1] 25
5^2
## [1] 25
sqrt(25)
## [1] 5
25^(1/2)
## [1] 5
5^0
## [1] 1
5^1
## [1] 5

Exponentiation has the following structure:

\(a^b=c\)

In this formula:

  • a is the base
  • b is the exponent
  • c is the power

2.4 Rooting and Logarithms

There is a relationship between exponentiation, rooting and logarithms.

In a simple example, 10 squared (or \(10^{2}\)) is \(10*10 = 100\).

That is:

\(10^2 = 100\)

Rooting is:

\(\sqrt{100} = 10\) (the square root of 100)

Logarithm:

\(\log_{10}(100) = 2\) (using 10 as the base for the logarithm).

Note that the three numbers (2; 10; and 100) keep coming back, in different settings!

In R this would look like:

cat("The square of 10, or 10*10, equals",10^2,"\n")
## The square of 10, or 10*10, equals 100
cat("The square root of 100 equals",sqrt(100),"\n")
## The square root of 100 equals 10
cat("The logarithm of 100 (base 10) equals", log10(100),"\n")
## The logarithm of 100 (base 10) equals 2

Since base 10 is an exceptional base, and squaring and square roots are special cases, let’s look at a more general example.

Suppose we do the same for \(2^3 = 2*2*2 = 8\).

cat("2 raised to the power 3 equals",2^3,"\n")
## 2 raised to the power 3 equals 8
cat("The cubic root of 8 equals",8^(1/3),"\n")
## The cubic root of 8 equals 2
cat("The logarithm of 8 (base 2) equals", log(8, 2),"\n")
## The logarithm of 8 (base 2) equals 3
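
One point worth knowing when you use logarithms in R: without a second argument, log() uses the natural logarithm (base e), not base 10. The sketch below contrasts the variants.

```r
log(100)              # natural logarithm (base e), approximately 4.605
log10(100)            # base 10: 2
log(100, base = 10)   # the same, with the base given explicitly: 2
log2(8)               # base 2: 3
exp(log(100))         # exp() reverses the natural logarithm: 100
```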

For an easy explanation of the links between roots and exponents, have a look at this video.

2.5 The Order of Operations

The order of operations is governed by the principle of PEMDAS.

  1. Parentheses
  2. Exponents
  3. Multiplication and Division
  4. Addition and Subtraction

As a general rule, in programming for data science and statistics it is best to use parentheses (brackets) in order to avoid confusion.

Some examples:

2+5*8 
## [1] 42
2+(5*8)
## [1] 42
(2+5)*8
## [1] 56
3+2^2
## [1] 7
(3+2)^2
## [1] 25
12-24-34+12
## [1] -34
12-(24-34)+12
## [1] 34
12-24-(34+12)
## [1] -58
12-(24-34+12)
## [1] 10

As you see, formulas are prone to errors!

2.6 Negative Numbers in Multiplication

As a rule:

  • A positive number times a positive number gives a positive number
  • A positive number times a negative number gives a negative number
  • A negative number times a negative number gives a positive number

Examples:

8*7
## [1] 56
8*-7
## [1] -56
-8*7
## [1] -56
-8*-7
## [1] 56

2.7 Rules for Operations

  • Commutative law

\(a+b = b+a\)

  • Associative law

\((a+b)+c = a+(b+c)\)

This holds for addition, not for subtraction!

(8+2)+3 
## [1] 13
8+(2+3)  # Same as above
## [1] 13
(8-2)-3 
## [1] 3
8-(2-3)  # Not the same as above!!
## [1] 9
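
The commutative law can be checked in the same way. It holds for addition and multiplication, but not for subtraction and division, as this short sketch with arbitrary values shows.

```r
a <- 8; b <- 2

a + b == b + a   # TRUE:  10 and 10
a * b == b * a   # TRUE:  16 and 16
a - b == b - a   # FALSE: 6 versus -6
a / b == b / a   # FALSE: 4 versus 0.25
```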
  • Distributive law

\(a*(b+c) = (a*b) + (a*c)\)

8*(2+3) 
## [1] 40
8*2 + 8*3
## [1] 40

We can use one command line to evaluate if the latter two expressions are identical.

For these evaluations, you have to use the \(a == b\) format (double =), which evaluates whether \(a\) and \(b\) are the same. \(a = b\) (single =) would assign the value of \(b\) to \(a\), which is not what we want.

8*(2+3) == 8*2 + 8*3 # Evaluate if the expressions are identical
## [1] TRUE

Another rule:

\((b+c) / a = (b/a)+(c/a)\)

(8+7)/3
## [1] 5
(8/3) + (7/3)
## [1] 5
(8+7)/3 == (8/3) + (7/3)
## [1] TRUE

But: \((a) / (b+c) \neq (a/b) + (a/c)\)

15/(2+3)
## [1] 3
15/2 + 15/3
## [1] 12.5
15/(2+3) == 15/2 + 15/3
## [1] FALSE
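
A word of caution on comparing results with ==: computers store most non-integer numbers with a tiny rounding error, so two expressions that are mathematically equal can still differ in the last digits. The sketch below shows a well-known case, and the base R function all.equal(), which compares numbers with a small tolerance.

```r
0.1 + 0.2 == 0.3                    # FALSE, although mathematically equal
print(0.1 + 0.2, digits = 17)       # shows the tiny rounding error
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a small tolerance
```

For comparisons of non-integer results, isTRUE(all.equal(...)) is therefore usually the safer test.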

2.8 Rules for Fractions

Rule 1: \((a/p) + (b/p) = (a+b)/p\)

8/4 + 3/4 == (8 + 3)/4
## [1] TRUE

Rule 2: \((a/p)*(b/q) = (a*b)/(p*q)\)

(8/4)*(6/3) == (8*6)/(4*3)
## [1] TRUE

Rule 3: \((a/p)/(b/q) = (a/p)*(q/b)\)

(8/4)/(6/3) == (8/4)*(3/6)
## [1] TRUE
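
The rules are not limited to the numbers used above. As a sketch, the checks below use arbitrary (non-integer) values, and compare with all.equal() rather than == to allow for rounding errors in the last digits.

```r
a <- 7; b <- 2.5; p <- 3; q <- 1.7   # arbitrary values, chosen only for illustration

isTRUE(all.equal((a/p) + (b/p), (a + b)/p))       # Rule 1: TRUE
isTRUE(all.equal((a/p) * (b/q), (a*b)/(p*q)))     # Rule 2: TRUE
isTRUE(all.equal((a/p) / (b/q), (a/p) * (q/b)))   # Rule 3: TRUE
```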

2.9 Rules for Exponentiation

Rule 1: \(a^n = a * a * a * ...\) (n times)

Rule 2: \(a^n\) is positive if \(a>0\)

Rule 3: \(a^n\) is positive if \(a<0\) and n is an even number

Rule 4: \(a^n\) is negative if \(a<0\) and n is an odd number

Examples:

8^2    # Rule 2
## [1] 64
(-8)^2 # Rule 3
## [1] 64
(-8)^3 # Rule 4
## [1] -512
-8^2   # Is the outcome what you expected??
## [1] -64

In the expression \(-8^2\), the PEMDAS rule forces R to first evaluate \(8^2\) (E, for exponentiation), before multiplying (M) by -1!!

If you intend to square -8 (\(-8*-8=64\)), then the code should read:

(-8)^2   # The parentheses make sure that -8 itself is squared
## [1] 64

Parentheses make all the difference!
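
Rules 2 to 4 can be seen at a glance by raising a base to a series of exponents. This sketch uses the base R function sign(), which returns -1 for negative and 1 for positive numbers.

```r
n <- 1:6          # exponents 1 to 6

(-8)^n            # -8 64 -512 4096 -32768 262144
sign((-8)^n)      # -1 1 -1 1 -1 1: negative for odd n, positive for even n
sign(8^n)         # 1 1 1 1 1 1: always positive for a positive base
```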