In programming, we often perform the same tasks repeatedly. Functions and loops help us write cleaner, shorter, and more efficient code.
A function is a block of code designed to perform a specific task. Using functions helps us avoid redundant code.
This function takes two numbers as inputs and returns their sum.
# Function to add two numbers
def add_numbers(a, b):
    return a + b

print(add_numbers(5, 3)) # Output: 8
8
# Function to add two numbers
add_numbers <- function(a, b) {
  return(a + b)
}
print(add_numbers(5, 3)) # Output: 8
[1] 8
This function calculates the area and perimeter of a rectangle, given its length and width.
# Function to calculate area and perimeter of a rectangle
def rectangle_properties(length, width):
    area = length * width
    perimeter = 2 * (length + width)
    return {"area": area, "perimeter": perimeter}

print(rectangle_properties(5, 3))
{'area': 15, 'perimeter': 16}
# Output: {'area': 15, 'perimeter': 16}
# Function to calculate area and perimeter of a rectangle
rectangle_properties <- function(length, width) {
  area <- length * width
  perimeter <- 2 * (length + width)
  return(list(area = area, perimeter = perimeter))
}
print(rectangle_properties(5, 3))
$area
[1] 15
$perimeter
[1] 16
# Output: $area [1] 15, $perimeter [1] 16
This function analyzes two datasets by calculating their mean, median, and standard deviation, useful in data analysis.
import statistics
from tabulate import tabulate
# Function to compare two datasets
def compare_data(group1, group2):
    return {
        "group1": {
            "mean": statistics.mean(group1),
            "median": statistics.median(group1),
            "std_dev": statistics.stdev(group1)
        },
        "group2": {
            "mean": statistics.mean(group2),
            "median": statistics.median(group2),
            "std_dev": statistics.stdev(group2)
        }
    }

# Sample datasets
data1 = [10, 20, 30, 40, 50]
data2 = [15, 25, 35, 45, 55]

# Get results
results = compare_data(data1, data2)

# Convert results to a table format
table = [
    ["Metric", "Group 1", "Group 2"],
    ["Mean", results["group1"]["mean"], results["group2"]["mean"]],
    ["Median", results["group1"]["median"], results["group2"]["median"]],
    ["Standard Deviation", results["group1"]["std_dev"], results["group2"]["std_dev"]]
]

# Print table
print(tabulate(table, headers="firstrow", tablefmt="grid"))
+--------------------+-----------+-----------+
| Metric | Group 1 | Group 2 |
+====================+===========+===========+
| Mean | 30 | 35 |
+--------------------+-----------+-----------+
| Median | 30 | 35 |
+--------------------+-----------+-----------+
| Standard Deviation | 15.8114 | 15.8114 |
+--------------------+-----------+-----------+
# Load library
library(knitr)
# Function to compare two datasets
compare_data <- function(group1, group2) {
  data.frame(
    Statistic = c("Mean", "Median", "Std Dev"),
    Group1 = round(c(mean(group1), median(group1), sd(group1)), 2),
    Group2 = round(c(mean(group2), median(group2), sd(group2)), 2)
  )
}

# Sample data
data1 <- c(10, 20, 30, 40, 50)
data2 <- c(15, 25, 35, 45, 55)

# Print as formatted table
kable(compare_data(data1, data2))
| Statistic | Group1 | Group2 |
|---|---|---|
| Mean | 30.00 | 35.00 |
| Median | 30.00 | 35.00 |
| Std Dev | 15.81 | 15.81 |
Functions save time by allowing code reuse, improve program organization and readability, and make debugging and future development easier.
Loops let us execute the same code multiple times without rewriting it, which makes them well suited to repetitive calculations in mathematical analysis and data processing. Types of Loops:
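As a brief illustration, the two loop forms most commonly used in Python are `for` (iterate over a known collection) and `while` (repeat until a condition stops holding). The sketch below is only an assumed example; the variable names and values are made up for illustration and do not come from the original text.

```python
# for loop: visit every item in a collection whose size is known up front
scores = [70, 85, 90]  # illustrative values
total = 0
for score in scores:
    total += score
print(total)  # 245

# while loop: keep going until the condition becomes false
n = 100
halvings = 0
while n > 1:
    n //= 2        # integer-halve n on each pass
    halvings += 1
print(halvings)  # 6
```

A `for` loop is usually the better fit when the number of iterations is known in advance; a `while` loop fits cases where the stopping point depends on the values computed along the way.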
The Fibonacci sequence is a series of numbers where each number is the sum of the two preceding ones:
\[F(n) = F(n-1) + F(n-2)\]
Example: $0,1,1,2,3,5,8,13,21,\dots$
def fibonacci(n):
    fib_series = [0, 1]
    for i in range(2, n):
        fib_series.append(fib_series[-1] + fib_series[-2])
    return fib_series

print(fibonacci(10)) # Output: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
fibonacci <- function(n) {
  fib_series <- c(0, 1)
  for (i in 3:n) {
    fib_series <- c(fib_series, fib_series[i-1] + fib_series[i-2])
  }
  return(fib_series)
}
print(fibonacci(10)) # Output: 0 1 1 2 3 5 8 13 21 34
[1] 0 1 1 2 3 5 8 13 21 34
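Both versions above start from the two seed values 0 and 1, so they implicitly assume that at least two terms are requested. The following is a minimal sketch of one way to also handle smaller inputs; `fibonacci_safe` is a hypothetical name introduced here for illustration, not part of the original examples.

```python
def fibonacci_safe(n):
    """Return the first n Fibonacci numbers; hypothetical variant that tolerates n < 2."""
    if n <= 0:
        return []
    fib_series = [0, 1]
    # Grow the list until it holds at least n terms
    while len(fib_series) < n:
        fib_series.append(fib_series[-1] + fib_series[-2])
    # Slicing trims the seed list when only one term was requested
    return fib_series[:n]

print(fibonacci_safe(1))   # [0]
print(fibonacci_safe(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

The `while` loop here plays the same role as the `for` loop in the original function; it simply expresses the stopping rule in terms of the list length.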
Standard deviation measures how spread out the data is in a distribution:
\[\sigma = \sqrt{\frac{1}{n} \sum (x_i - \bar{x})^2}\]
def standard_deviation(data):
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / len(data)
    return variance ** 0.5

data = [10, 20, 30, 40, 50]

print(f"Standard Deviation: {standard_deviation(data):.2f}")
Standard Deviation: 14.14
# Output: 14.14
standard_deviation <- function(data) {
  mean_value <- mean(data)
  variance <- sum((data - mean_value) ^ 2) / length(data)
  return(sqrt(variance))
}

data <- c(10, 20, 30, 40, 50)

print(paste("Standard Deviation:", round(standard_deviation(data), 2)))
[1] "Standard Deviation: 14.14"
# Output: 14.14
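The formula above divides by \(n\), which gives the population standard deviation (14.14 for this data). The comparison table earlier reports 15.81 for the same values because `statistics.stdev` in Python and `sd()` in R divide by \(n - 1\) instead (the sample standard deviation). A short sketch using Python's standard library shows both side by side:

```python
import statistics

data = [10, 20, 30, 40, 50]

# Population standard deviation: divides the squared deviations by n
population_sd = statistics.pstdev(data)

# Sample standard deviation: divides the squared deviations by n - 1
sample_sd = statistics.stdev(data)

print(f"Population SD: {population_sd:.2f}")  # Population SD: 14.14
print(f"Sample SD: {sample_sd:.2f}")          # Sample SD: 15.81
```

Which one is appropriate depends on whether the data is the whole population of interest or a sample drawn from it.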
Linear regression is used to find the relationship between an independent variable \(X\) and a dependent variable \(Y\):
\[Y = aX + b\]
where \(a\) is the slope and \(b\) is the intercept.
import numpy as np
# Data (X: study hours, Y: exam scores)
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 5, 4, 5])

# Calculate slope (a) and intercept (b)
n = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(X * Y)
sum_x2 = sum(X ** 2)

a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - a * sum_x) / n

print(f"Linear Regression: Y = {a:.2f}X + {b:.2f}")
Linear Regression: Y = 0.60X + 2.20
# Data
X <- c(1, 2, 3, 4, 5)
Y <- c(2, 4, 5, 4, 5)

# Calculate slope (a) and intercept (b)
n <- length(X)
sum_x <- sum(X)
sum_y <- sum(Y)
sum_xy <- sum(X * Y)
sum_x2 <- sum(X^2)

a <- (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x^2)
b <- (sum_y - a * sum_x) / n

print(paste("Linear Regression: Y =", round(a, 2), "X +", round(b, 2)))
[1] "Linear Regression: Y = 0.6 X + 2.2"
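The same slope and intercept can also be obtained from deviations around the means, which is algebraically equivalent to the sums used above. The sketch below wraps that calculation in a function and uses the fitted line for one prediction; `fit_line` and the input value 6 are hypothetical choices made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = aX + b (hypothetical helper, plain Python)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of X
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    a = s_xy / s_xx           # slope
    b = mean_y - a * mean_x   # intercept
    return a, b

a, b = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"Linear Regression: Y = {a:.2f}X + {b:.2f}")  # Linear Regression: Y = 0.60X + 2.20

# Use the fitted line to estimate Y for an assumed new value X = 6
predicted = a * 6 + b
print(f"Predicted Y at X = 6: {predicted:.2f}")  # Predicted Y at X = 6: 5.80
```

Putting the fit in a function means it can be reused on any pair of equal-length lists without repeating the arithmetic.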
Functions and loops help us create simpler and more efficient code. By understanding these two concepts, we can write better and more readable programs.
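Let’s apply functions and loops to real-world data science tasks, such as generating a sample employee dataset, filtering records, computing group averages, and finding the top earner in each group: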
import pandas as pd
import random
def create_employee_dataset(num_employees):
    # Each position maps to (min salary, max salary, min experience, max experience)
    positions = {
        "Staff": (3000, 5000, 1, 5),
        "Supervisor": (5000, 8000, 5, 10),
        "Manager": (8000, 12000, 10, 15),
        "Director": (12000, 15000, 15, 25)
    }

    data = {
        "ID_Number": [],
        "Position": [],
        "Salary": [],
        "Age": [],
        "Experience": []
    }

    for _ in range(num_employees):
        id_number = random.randint(10000, 99999)
        position = random.choice(list(positions.keys()))
        salary = random.randint(positions[position][0], positions[position][1])
        experience = random.randint(positions[position][2], positions[position][3])
        age = experience + random.randint(22, 35)  # Ensuring age aligns with experience

        data["ID_Number"].append(id_number)
        data["Position"].append(position)
        data["Salary"].append(salary)
        data["Age"].append(age)
        data["Experience"].append(experience)

    return pd.DataFrame(data)

df = create_employee_dataset(20)
print(df)
ID_Number Position Salary Age Experience
0 88908 Manager 11456 35 11
1 77421 Staff 3091 27 5
2 20801 Manager 8532 46 15
3 16678 Director 14792 50 22
4 50771 Manager 11642 39 15
5 11268 Manager 11710 41 15
6 96478 Supervisor 5595 38 5
7 47405 Staff 3645 32 3
8 88978 Supervisor 5431 28 5
9 48778 Manager 9012 42 14
10 65712 Director 12400 51 19
11 81515 Staff 4517 29 2
12 62467 Supervisor 7467 43 10
13 72510 Staff 3429 27 5
14 47992 Manager 11216 45 12
15 92462 Director 12137 48 19
16 40928 Director 13114 50 21
17 46011 Staff 4939 27 5
18 72698 Supervisor 5115 42 9
19 79496 Supervisor 7885 38 10
# set.seed(123)
create_employee_dataset <- function(num_employees) {
  # Each position maps to c(min salary, max salary, min experience, max experience)
  positions <- list(
    "Staff" = c(3000, 5000, 1, 5),
    "Supervisor" = c(5000, 8000, 5, 10),
    "Manager" = c(8000, 12000, 10, 15),
    "Director" = c(12000, 15000, 15, 25)
  )

  data <- data.frame(ID_Number = integer(),
                     Position = character(),
                     Salary = numeric(),
                     Age = integer(),
                     Experience = integer(),
                     stringsAsFactors = FALSE)

  for (i in 1:num_employees) {
    id_number <- sample(10000:99999, 1)
    position <- sample(names(positions), 1)
    salary <- sample(positions[[position]][1]:positions[[position]][2], 1)
    experience <- sample(positions[[position]][3]:positions[[position]][4], 1)
    age <- experience + sample(22:35, 1)

    data <- rbind(data, data.frame(ID_Number = id_number,
                                   Position = position,
                                   Salary = salary,
                                   Age = age,
                                   Experience = experience,
                                   stringsAsFactors = FALSE))
  }

  return(data)
}

df <- create_employee_dataset(20)
print(df)
ID_Number Position Salary Age Experience
1 33322 Director 12149 57 24
2 55467 Manager 11101 42 15
3 14901 Staff 3146 36 5
4 58400 Manager 10076 32 10
5 86313 Manager 10960 46 13
6 51471 Staff 4346 31 5
7 52938 Manager 8867 44 14
8 12476 Manager 11343 44 12
9 81267 Manager 11441 40 13
10 78883 Director 12362 45 18
11 78554 Director 14523 53 24
12 96369 Supervisor 6140 39 6
13 91749 Manager 8450 45 14
14 47989 Staff 3677 34 4
15 91192 Supervisor 7238 39 6
16 39667 Manager 11576 32 10
17 41553 Director 12123 53 25
18 46500 Manager 11537 42 11
19 92026 Staff 4158 32 5
20 27634 Supervisor 6429 34 5
We can use functions and loops to filter employees based on salary or experience levels.
def filter_high_salary(df, threshold=10000):
    return df[df['Salary'] > threshold]

high_salary_df = filter_high_salary(df, 10000)
print(high_salary_df)
ID_Number Position Salary Age Experience
0 88908 Manager 11456 35 11
3 16678 Director 14792 50 22
4 50771 Manager 11642 39 15
5 11268 Manager 11710 41 15
10 65712 Director 12400 51 19
14 47992 Manager 11216 45 12
15 92462 Director 12137 48 19
16 40928 Director 13114 50 21
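The pandas filter above works on salary alone. To show the same idea with an explicit loop and a second condition on experience, here is a minimal plain-Python sketch. The function name `filter_employees`, its parameters, and the list-of-dictionaries layout are assumptions made for illustration; the four records are copied from the sample output shown earlier.

```python
def filter_employees(records, min_salary=0, min_experience=0):
    """Keep records meeting both thresholds (hypothetical helper)."""
    selected = []
    for rec in records:
        if rec["Salary"] > min_salary and rec["Experience"] >= min_experience:
            selected.append(rec)
    return selected

# A few rows taken from the sample dataset printed above
records = [
    {"ID_Number": 88908, "Position": "Manager", "Salary": 11456, "Experience": 11},
    {"ID_Number": 77421, "Position": "Staff", "Salary": 3091, "Experience": 5},
    {"ID_Number": 16678, "Position": "Director", "Salary": 14792, "Experience": 22},
    {"ID_Number": 96478, "Position": "Supervisor", "Salary": 5595, "Experience": 5},
]

senior_high_earners = filter_employees(records, min_salary=10000, min_experience=15)
print([rec["ID_Number"] for rec in senior_high_earners])  # [16678]
```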
Using loops and functions, we can compute key statistics (e.g., average salary and experience) for different employee groups.
def compute_averages(df):
    return df.groupby("Position").agg({
        "ID_Number": "first",  # Take one example ID
        "Salary": "mean",
        "Age": "mean",
        "Experience": "mean"
    }).reset_index().round(2)

avg_stats = compute_averages(df)
print(avg_stats)
Position ID_Number Salary Age Experience
0 Director 16678 13110.75 49.75 20.25
1 Manager 88908 10594.67 41.33 13.67
2 Staff 77421 3924.20 28.40 4.00
3 Supervisor 96478 6298.60 37.80 7.80
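We can also determine the highest-paid employee in each position category. The function below does this with a grouped idxmax lookup, which performs the per-group iteration internally instead of using an explicit loop.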
def top_earners(df):
    return df.loc[df.groupby("Position")["Salary"].idxmax()]

top_paid_employees = top_earners(df)
print(top_paid_employees)
ID_Number Position Salary Age Experience
3 16678 Director 14792 50 22
5 11268 Manager 11710 41 15
17 46011 Staff 4939 27 5
19 79496 Supervisor 7885 38 10
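The pandas version above relies on `groupby` and `idxmax`. The same result can be reached with an explicit loop that remembers the best record seen so far for each position. This is only a sketch under stated assumptions: `top_earners_loop` is a hypothetical name, and the records are a small subset of the sample output shown earlier, stored as plain dictionaries.

```python
def top_earners_loop(records):
    """Return the highest-paid record per position (hypothetical loop-based helper)."""
    best = {}
    for rec in records:
        position = rec["Position"]
        # Keep the record if it is the first for this position or beats the current best
        if position not in best or rec["Salary"] > best[position]["Salary"]:
            best[position] = rec
    return best

# A few rows taken from the sample dataset printed above
records = [
    {"ID_Number": 88908, "Position": "Manager", "Salary": 11456},
    {"ID_Number": 11268, "Position": "Manager", "Salary": 11710},
    {"ID_Number": 77421, "Position": "Staff", "Salary": 3091},
    {"ID_Number": 46011, "Position": "Staff", "Salary": 4939},
]

top = top_earners_loop(records)
for position, rec in top.items():
    print(position, rec["ID_Number"], rec["Salary"])
# Manager 11268 11710
# Staff 46011 4939
```

For large datasets the grouped pandas call is generally the more convenient choice; the loop is shown to make the underlying logic visible.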