3  Functions and Loops

3.1 Introduction

In programming, we often perform the same tasks repeatedly. Functions and loops help us write cleaner, shorter, and more efficient code.

  • A function is a block of code that can be called whenever it is needed to perform a specific task.
  • A loop runs the same code repeatedly without rewriting it.

3.2 What Is a Function?

A function is a named block of code designed to perform a specific task. It can take inputs (arguments) and return a result, so the same logic does not have to be written out more than once.

3.2.1 Adding Two Numbers

This function takes two numbers as inputs and returns their sum.

Python Code

# Function to add two numbers
def add_numbers(a, b):
    return a + b

print(add_numbers(5, 3))  # Output: 8
8

R Code

# Function to add two numbers
add_numbers <- function(a, b) {
  return(a + b)
}

print(add_numbers(5, 3))  # Output: 8
[1] 8

3.2.2 Rectangle Properties

This function calculates the area and perimeter of a rectangle, given its length and width.

Python Code

# Function to calculate area and perimeter of a rectangle
def rectangle_properties(length, width):
    area = length * width
    perimeter = 2 * (length + width)
    return {"area": area, "perimeter": perimeter}

print(rectangle_properties(5, 3))
{'area': 15, 'perimeter': 16}
# Output: {'area': 15, 'perimeter': 16}

R Code

# Function to calculate area and perimeter of a rectangle
rectangle_properties <- function(length, width) {
  area <- length * width
  perimeter <- 2 * (length + width)
  return(list(area = area, perimeter = perimeter))
}

print(rectangle_properties(5, 3))
$area
[1] 15

$perimeter
[1] 16
# Output: $area [1] 15, $perimeter [1] 16

3.2.3 Comparing Two Datasets

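This function summarizes two datasets by their mean, median, and sample standard deviation (the \(n - 1\) form, which is what statistics.stdev in Python and sd() in R compute), so the two groups can be compared side by side.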

Python Code

import statistics
from tabulate import tabulate

# Function to compare two datasets
def compare_data(group1, group2):
    return {
        "group1": {
            "mean": statistics.mean(group1),
            "median": statistics.median(group1),
            "std_dev": statistics.stdev(group1)
        },
        "group2": {
            "mean": statistics.mean(group2),
            "median": statistics.median(group2),
            "std_dev": statistics.stdev(group2)
        }
    }

# Sample datasets
data1 = [10, 20, 30, 40, 50]
data2 = [15, 25, 35, 45, 55]

# Get results
results = compare_data(data1, data2)

# Convert results to a table format
table = [
    ["Metric", "Group 1", "Group 2"],
    ["Mean", results["group1"]["mean"], results["group2"]["mean"]],
    ["Median", results["group1"]["median"], results["group2"]["median"]],
    ["Standard Deviation", results["group1"]["std_dev"], results["group2"]["std_dev"]]
]

# Print table
print(tabulate(table, headers="firstrow", tablefmt="grid"))
+--------------------+-----------+-----------+
| Metric             |   Group 1 |   Group 2 |
+====================+===========+===========+
| Mean               |   30      |   35      |
+--------------------+-----------+-----------+
| Median             |   30      |   35      |
+--------------------+-----------+-----------+
| Standard Deviation |   15.8114 |   15.8114 |
+--------------------+-----------+-----------+

R Code

# Load library
library(knitr)

# Function to compare two datasets
compare_data <- function(group1, group2) {
  data.frame(
    Statistic = c("Mean", "Median", "Std Dev"),
    Group1 = round(c(mean(group1), median(group1), sd(group1)), 2),
    Group2 = round(c(mean(group2), median(group2), sd(group2)), 2)
  )
}

# Sample data
data1 <- c(10, 20, 30, 40, 50)
data2 <- c(15, 25, 35, 45, 55)

# Print as formatted table
kable(compare_data(data1, data2))
|Statistic | Group1| Group2|
|:---------|------:|------:|
|Mean      |  30.00|  35.00|
|Median    |  30.00|  35.00|
|Std Dev   |  15.81|  15.81|

Functions save time by allowing code reuse, improve program organization and readability, and make debugging and future development easier.

3.3 What Is a Loop?

Loops allow us to execute the same code multiple times without rewriting it, which makes them well suited to repetitive calculations in mathematical analysis and data processing. There are two main types of loops:

  • For Loop – Used when the number of repetitions is known.
  • While Loop – Used when repetitions depend on a condition.
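The difference between the two loop types can be sketched with a small example. The variable names and numbers here are illustrative only: the for loop adds the integers 1 to 5 (a fixed number of repetitions), while the while loop halves a value until it is no longer greater than 1 (a condition decides when to stop).

```python
# For loop: the number of repetitions (5) is known in advance.
total = 0
for i in range(1, 6):
    total += i
print(total)  # 15

# While loop: keep going for as long as the condition holds.
value = 100
steps = 0
while value > 1:
    value = value / 2
    steps += 1
print(steps)  # 7
```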

3.3.1 Fibonacci Sequence

The Fibonacci sequence is a series of numbers where each number is the sum of the two preceding ones, starting from \(F(0) = 0\) and \(F(1) = 1\):

\[F(n) = F(n-1) + F(n-2)\]

Example: $0,1,1,2,3,5,8,13,21,\dots$

Python Code

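def fibonacci(n):
    # Build the first n Fibonacci numbers, starting from 0 and 1
    fib_series = [0, 1]
    for i in range(2, n):
        fib_series.append(fib_series[-1] + fib_series[-2])
    return fib_series[:n]  # Slice so that n = 0 or n = 1 returns 0 or 1 terms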

print(fibonacci(10))  # Output: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

R Code

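fibonacci <- function(n) {
  fib_series <- c(0, 1)
  # seq_len() yields an empty sequence when n <= 2, whereas 3:n would count down
  for (k in seq_len(max(n - 2, 0))) {
    i <- k + 2
    fib_series <- c(fib_series, fib_series[i-1] + fib_series[i-2])
  }
  return(head(fib_series, n))  # Trim so that n = 0 or n = 1 returns 0 or 1 terms
}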

print(fibonacci(10))  # Output: 0 1 1 2 3 5 8 13 21 34
 [1]  0  1  1  2  3  5  8 13 21 34
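Both listings above use a for loop because the number of terms is known. When the stopping point depends on the values themselves, a while loop fits better. The following is a minimal sketch; the function name fibonacci_below and its limit parameter are hypothetical and not part of the examples above.

```python
def fibonacci_below(limit):
    """Return all Fibonacci numbers strictly less than limit."""
    series = []
    a, b = 0, 1
    while a < limit:          # Stop as soon as the next term reaches the limit
        series.append(a)
        a, b = b, a + b       # Advance to the next pair of terms
    return series

print(fibonacci_below(50))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```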

3.3.2 Standard Deviation

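Standard deviation measures how spread out the data is around its mean. The formula below is the population standard deviation, which divides by \(n\); the sample standard deviation computed by statistics.stdev and sd() in Section 3.2.3 divides by \(n - 1\), which is why the same data gives 14.14 here and 15.81 there: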

\[\sigma = \sqrt{\frac{1}{n} \sum (x_i - \bar{x})^2}\]

Python Code

def standard_deviation(data):
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / len(data)
    return variance ** 0.5

data = [10, 20, 30, 40, 50]

print(f"Standard Deviation: {standard_deviation(data):.2f}")
Standard Deviation: 14.14
# Output: 14.14

R Code

standard_deviation <- function(data) {
  mean_value <- mean(data)
  variance <- sum((data - mean_value) ^ 2) / length(data)
  return(sqrt(variance))
}

data <- c(10, 20, 30, 40, 50)

print(paste("Standard Deviation:", round(standard_deviation(data), 2)))
[1] "Standard Deviation: 14.14"
# Output: 14.14
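The result here (14.14) differs from the 15.8114 reported for the same data in Section 3.2.3 because the two sections use different divisors. As a sketch of that difference, the hypothetical helper sample_standard_deviation below uses an explicit loop and divides by \(n - 1\); the standard library functions statistics.stdev (sample) and statistics.pstdev (population) are shown for comparison.

```python
import statistics

def sample_standard_deviation(data):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    n = len(data)
    mean = sum(data) / n
    squared_diffs = 0.0
    for x in data:                      # Explicit loop over every observation
        squared_diffs += (x - mean) ** 2
    return (squared_diffs / (n - 1)) ** 0.5

data = [10, 20, 30, 40, 50]
print(round(sample_standard_deviation(data), 2))  # 15.81
print(round(statistics.stdev(data), 2))           # 15.81 (sample, n - 1)
print(round(statistics.pstdev(data), 2))          # 14.14 (population, n)
```

Which divisor is appropriate depends on whether the data is the whole population or a sample drawn from it.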

3.3.3 Simple Linear Regression

Linear regression is used to find the relationship between an independent variable \(X\) and a dependent variable \(Y\):

\[Y = aX + b\]

where:

  • \(a\) is the slope
  • \(b\) is the intercept
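
For reference, the slope and intercept can be obtained from the standard least-squares formulas, written here in the same summation form that the listings in this section compute:

\[a = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}, \qquad b = \frac{\sum y_i - a \sum x_i}{n}\]

For the sample data used in this section, \(n = 5\), \(\sum x_i = 15\), \(\sum y_i = 20\), \(\sum x_i y_i = 66\), and \(\sum x_i^2 = 55\), so

\[a = \frac{5 \cdot 66 - 15 \cdot 20}{5 \cdot 55 - 15^2} = \frac{30}{50} = 0.6, \qquad b = \frac{20 - 0.6 \cdot 15}{5} = 2.2\]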

Python Code

import numpy as np

# Data (X: study hours, Y: exam scores)
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 5, 4, 5])

# Calculate slope (a) and intercept (b)
n = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(X * Y)
sum_x2 = sum(X ** 2)

a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - a * sum_x) / n

print(f"Linear Regression: Y = {a:.2f}X + {b:.2f}")
Linear Regression: Y = 0.60X + 2.20

R Code

# Data
X <- c(1, 2, 3, 4, 5)
Y <- c(2, 4, 5, 4, 5)

# Calculate slope (a) and intercept (b)
n <- length(X)
sum_x <- sum(X)
sum_y <- sum(Y)
sum_xy <- sum(X * Y)
sum_x2 <- sum(X^2)

a <- (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x^2)
b <- (sum_y - a * sum_x) / n

print(paste("Linear Regression: Y =", round(a, 2), "X +", round(b, 2)))
[1] "Linear Regression: Y = 0.6 X + 2.2"
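
The calculation above can also be wrapped in a reusable function that accumulates the sums in a loop, which ties it back to the two ideas of this chapter. This is a minimal sketch using plain Python lists; the function name linear_regression is hypothetical, and the sketch assumes the x values are not all identical (otherwise the denominator is zero).

```python
def linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sum_x = sum_y = sum_xy = sum_x2 = 0.0
    for x, y in zip(xs, ys):   # One pass accumulates all four sums
        sum_x += x
        sum_y += y
        sum_xy += x * y
        sum_x2 += x * x
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b

a, b = linear_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"Y = {a:.2f}X + {b:.2f}")  # Y = 0.60X + 2.20
```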

Functions and loops help us create simpler and more efficient code. By understanding these two concepts, we can write better and more readable programs.

3.4 Applications of Functions and Loops

Let’s apply functions and loops to real-world data science tasks, such as:

  • ✅ Creating a dataset dynamically using functions and loops.
  • ✅ Categorizing employees based on salary.
  • ✅ Computing aggregated statistics (average salary, experience, etc.).
  • ✅ Finding top-paid employees per job position.

3.4.1 Creating a Dataset

Python Code

import pandas as pd
import random

def create_employee_dataset(num_employees):
    positions = {
        "Staff": (3000, 5000, 1, 5),
        "Supervisor": (5000, 8000, 5, 10),
        "Manager": (8000, 12000, 10, 15),
        "Director": (12000, 15000, 15, 25)
    }
    
    data = {
        "ID_Number": [],
        "Position": [],
        "Salary": [],
        "Age": [],
        "Experience": []
    }
    
    for _ in range(num_employees):
        id_number = random.randint(10000, 99999)
        position = random.choice(list(positions.keys()))
        salary = random.randint(positions[position][0], positions[position][1])
        experience = random.randint(positions[position][2], positions[position][3])
        age = experience + random.randint(22, 35)  # Age is experience plus a random 22-35, so age always exceeds experience
        
        data["ID_Number"].append(id_number)
        data["Position"].append(position)
        data["Salary"].append(salary)
        data["Age"].append(age)
        data["Experience"].append(experience)
    
    return pd.DataFrame(data)

df = create_employee_dataset(20)
print(df)
    ID_Number    Position  Salary  Age  Experience
0       88908     Manager   11456   35          11
1       77421       Staff    3091   27           5
2       20801     Manager    8532   46          15
3       16678    Director   14792   50          22
4       50771     Manager   11642   39          15
5       11268     Manager   11710   41          15
6       96478  Supervisor    5595   38           5
7       47405       Staff    3645   32           3
8       88978  Supervisor    5431   28           5
9       48778     Manager    9012   42          14
10      65712    Director   12400   51          19
11      81515       Staff    4517   29           2
12      62467  Supervisor    7467   43          10
13      72510       Staff    3429   27           5
14      47992     Manager   11216   45          12
15      92462    Director   12137   48          19
16      40928    Director   13114   50          21
17      46011       Staff    4939   27           5
18      72698  Supervisor    5115   42           9
19      79496  Supervisor    7885   38          10

R Code

# set.seed(123)  # Uncomment to make the random draws reproducible

create_employee_dataset <- function(num_employees) {
  positions <- list(
    "Staff" = c(3000, 5000, 1, 5),
    "Supervisor" = c(5000, 8000, 5, 10),
    "Manager" = c(8000, 12000, 10, 15),
    "Director" = c(12000, 15000, 15, 25)
  )
  
  data <- data.frame(ID_Number = integer(), 
                     Position = character(), 
                     Salary = numeric(), 
                     Age = integer(), 
                     Experience = integer(), 
                     stringsAsFactors = FALSE)
  
  for (i in seq_len(num_employees)) {
    id_number <- sample(10000:99999, 1)
    position <- sample(names(positions), 1)
    salary <- sample(positions[[position]][1]:positions[[position]][2], 1)
    experience <- sample(positions[[position]][3]:positions[[position]][4], 1)
    age <- experience + sample(22:35, 1)
    
    data <- rbind(data, data.frame(ID_Number = id_number, 
                                   Position = position, 
                                   Salary = salary, 
                                   Age = age, 
                                   Experience = experience, 
                                   stringsAsFactors = FALSE))
  }
  
  return(data)
}

df <- create_employee_dataset(20)
print(df)
   ID_Number   Position Salary Age Experience
1      33322   Director  12149  57         24
2      55467    Manager  11101  42         15
3      14901      Staff   3146  36          5
4      58400    Manager  10076  32         10
5      86313    Manager  10960  46         13
6      51471      Staff   4346  31          5
7      52938    Manager   8867  44         14
8      12476    Manager  11343  44         12
9      81267    Manager  11441  40         13
10     78883   Director  12362  45         18
11     78554   Director  14523  53         24
12     96369 Supervisor   6140  39          6
13     91749    Manager   8450  45         14
14     47989      Staff   3677  34          4
15     91192 Supervisor   7238  39          6
16     39667    Manager  11576  32         10
17     41553   Director  12123  53         25
18     46500    Manager  11537  42         11
19     92026      Staff   4158  32          5
20     27634 Supervisor   6429  34          5
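
Because neither listing fixes a random seed, the tables above change on every run. In Python, calling random.seed with a fixed value before create_employee_dataset makes the draws repeatable, much like set.seed in R. The sketch below shows the same idea with a separate generator object; the helper draw_salaries and its parameters are hypothetical and only serve to illustrate seeding.

```python
import random

def draw_salaries(count, low=3000, high=5000, seed=None):
    """Draw `count` random salaries; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)  # Independent generator; global state is untouched
    return [rng.randint(low, high) for _ in range(count)]

first = draw_salaries(5, seed=123)
second = draw_salaries(5, seed=123)
print(first == second)  # True
```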

3.4.2 Filtering Data

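We can wrap a filter in a function to select employees based on salary or experience levels. The pandas version below uses a boolean mask instead of an explicit loop: the comparison df['Salary'] > threshold is evaluated for every row at once.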

Python Code

def filter_high_salary(df, threshold=10000):
    return df[df['Salary'] > threshold]

high_salary_df = filter_high_salary(df, 10000)
print(high_salary_df)
    ID_Number  Position  Salary  Age  Experience
0       88908   Manager   11456   35          11
3       16678  Director   14792   50          22
4       50771   Manager   11642   39          15
5       11268   Manager   11710   41          15
10      65712  Director   12400   51          19
14      47992   Manager   11216   45          12
15      92462  Director   12137   48          19
16      40928  Director   13114   50          21

3.4.3 Aggregating Data

Using loops and functions, we can compute key statistics (e.g., average salary and experience) for different employee groups. In the pandas version below, groupby performs the per-group iteration for us.

Python Code

def compute_averages(df):
    # Group the rows by position, then aggregate each column within its group
    return df.groupby("Position").agg({
        "ID_Number": "first",   # Take one example ID per position
        "Salary": "mean",
        "Age": "mean",
        "Experience": "mean"
    }).reset_index().round(2)

avg_stats = compute_averages(df)
print(avg_stats)
     Position  ID_Number    Salary    Age  Experience
0    Director      16678  13110.75  49.75       20.25
1     Manager      88908  10594.67  41.33       13.67
2       Staff      77421   3924.20  28.40        4.00
3  Supervisor      96478   6298.60  37.80        7.80
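
What groupby does can be sketched with an explicit loop that keeps a running total and count per position. The records below are the Staff and Director salaries from the Python table in Section 3.4.1; the function name average_salary_by_position is hypothetical.

```python
# (position, salary) pairs for Staff and Director, copied from Section 3.4.1
records = [
    ("Staff", 3091), ("Director", 14792), ("Staff", 3645), ("Director", 12400),
    ("Staff", 4517), ("Staff", 3429), ("Director", 12137), ("Director", 13114),
    ("Staff", 4939),
]

def average_salary_by_position(rows):
    """Average salary per position, using running totals and counts."""
    totals = {}
    counts = {}
    for position, salary in rows:
        totals[position] = totals.get(position, 0) + salary
        counts[position] = counts.get(position, 0) + 1
    return {p: round(totals[p] / counts[p], 2) for p in totals}

print(average_salary_by_position(records))
# {'Staff': 3924.2, 'Director': 13110.75}
```

The two averages match the Staff and Director rows of the pandas output above.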

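3.4.4 Finding Top Earners

Loops can be used to determine the highest-paid employee in each position category. The pandas version below uses groupby with idxmax, which returns the row label of the maximum salary in each group, so no explicit loop is needed.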

Python Code

def top_earners(df):
    return df.loc[df.groupby("Position")["Salary"].idxmax()]

top_paid_employees = top_earners(df)
print(top_paid_employees)
    ID_Number    Position  Salary  Age  Experience
3       16678    Director   14792   50          22
5       11268     Manager   11710   41          15
17      46011       Staff    4939   27           5
19      79496  Supervisor    7885   38          10
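
The same result can be sketched with an explicit loop that remembers the best row seen so far for each position. The records are a subset of the Python table in Section 3.4.1, and the function name top_earners_loop is hypothetical. On a tie, this sketch keeps the first row it encounters.

```python
# (ID_Number, Position, Salary) rows copied from Section 3.4.1
records = [
    (16678, "Director", 14792),
    (65712, "Director", 12400),
    (77421, "Staff", 3091),
    (46011, "Staff", 4939),
    (47405, "Staff", 3645),
]

def top_earners_loop(rows):
    """Map each position to the (ID_Number, Salary) of its highest-paid row."""
    best = {}
    for id_number, position, salary in rows:
        # Replace the stored row only when a strictly higher salary appears
        if position not in best or salary > best[position][1]:
            best[position] = (id_number, salary)
    return best

print(top_earners_loop(records))
# {'Director': (16678, 14792), 'Staff': (46011, 4939)}
```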