Customer | Income | Expenditure |
---|---|---|
Customer 1 | 7000 | 3000 |
Customer 2 | 4500 | 2000 |
Customer 3 | 8000 | 4000 |
Customer 4 | 5500 | 3500 |
Customer 5 | 6000 | 2500 |
2 Vectors
In data science, understanding the foundational concepts of vectors and matrices is essential. Both are fundamental to a wide range of operations in machine learning, statistics, optimization, and various algorithms.
2.1 Definition
A vector is a fundamental concept in mathematics and physics that represents a quantity with both magnitude (size) and direction. In the context of data science, vectors are used to represent data points, parameters, and relationships between variables in a structured format. Vectors are particularly useful because they allow for efficient manipulation of multidimensional data.
Vectors are often represented as:
Column vectors: \[ \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ \vdots \\ v_n \end{bmatrix} \]
Row vectors: \[ \bigg[v_1, v_2, v_3, \ldots, v_n\bigg] \]
2.2 Properties
Vectors are fundamental objects in mathematics and physics, defined as quantities possessing both magnitude (size) and direction. Understanding their properties is essential for various applications, particularly in fields such as data science, physics, and engineering.
2.2.1 Dimension
The dimension of a vector is determined by the number of components it contains. A vector with \(n\) elements is said to exist in \(n\)-dimensional space.
- A vector in \(2D\) space, such as \(\mathbf{v} = [v_1, v_2]\), has a dimension of 2.
- A vector in \(3D\) space, like \(\mathbf{v} = [v_1, v_2, v_3]\), has a dimension of 3.
- A vector in \(nD\) space, like \(\mathbf{v} = [v_1, v_2, \cdots, v_n]\), has a dimension of \(n\).
2.2.2 Types of Vectors
- Zero Vector: A vector where all components are zero, denoted as \(\mathbf{0}\). The zero vector is unique and acts as the additive identity in vector addition: \(\mathbf{v} + \mathbf{0} = \mathbf{v}\).
- Unit Vector: A vector with a magnitude (length) of 1. Given a vector \(\mathbf{v}\), the unit vector \(\hat{\mathbf{v}}\) is calculated as:
\[ \hat{\mathbf{v}} = \frac{\mathbf{v}}{\|\mathbf{v}\|} \] where \(\|\mathbf{v}\|\) is the magnitude of \(\mathbf{v}\). Unit vectors are often used to specify direction without regard to magnitude.
- Position Vector: A vector that represents the position of a point in space relative to a fixed origin. In 3D space, the position vector of a point \(P(x, y, z)\) can be represented as:
\[ \mathbf{p} = \begin{bmatrix} x \\ y \\ z \end{bmatrix} \]
2.2.3 Addition and Subtraction
Two vectors can be added or subtracted if they have the same dimension. The resultant vector is obtained by adding or subtracting the corresponding components:
\[ \mathbf{u} \pm \mathbf{v} = \begin{bmatrix} u_1 \pm v_1 \\ u_2 \pm v_2 \\ \vdots \\ u_n \pm v_n \end{bmatrix} \]
Properties of Addition and Subtraction:
- Addition is commutative: \(\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}\). Subtraction is not: \(\mathbf{u} - \mathbf{v} = -(\mathbf{v} - \mathbf{u})\).
- Addition is associative: \(\mathbf{u} + (\mathbf{v} + \mathbf{w}) = (\mathbf{u} + \mathbf{v}) + \mathbf{w}\)
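As a quick sketch (assuming NumPy is available), component-wise addition and subtraction look like this:

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

# Addition and subtraction act component by component;
# both vectors must have the same dimension.
print(u + v)  # [5 7 9]
print(u - v)  # [-3 -3 -3]

# Addition is commutative
print(np.array_equal(u + v, v + u))  # True
```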
2.2.4 Scalar Multiplication
A vector can be multiplied by a scalar (a real number), resulting in a new vector that scales each component:
\[ c \cdot \mathbf{v} = \begin{bmatrix} c \cdot v_1 \\ c \cdot v_2 \\ \vdots \\ c \cdot v_n \end{bmatrix} \]
Properties of Scalar Multiplication:
- If \(c > 1\), the vector is stretched.
- If \(0 < c < 1\), the vector is shrunk.
- If \(c < 0\), the vector is flipped in direction.
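The three cases can be illustrated with a small NumPy sketch:

```python
import numpy as np

v = np.array([2.0, 4.0])

print(2.0 * v)   # stretched: [4. 8.]
print(0.5 * v)   # shrunk:    [1. 2.]
print(-1.0 * v)  # flipped:   [-2. -4.]
```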
2.2.5 Magnitude
The magnitude (length) of a vector \(\mathbf{v} = [v_1, v_2, \ldots, v_n]\) is given by:
\[ \|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \ldots + v_n^2} \]
Properties of Magnitude:
- Magnitude is always non-negative: \(\|\mathbf{v}\| \geq 0\).
- The magnitude of the zero vector is zero: \(\|\mathbf{0}\| = 0\).
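The magnitude formula can be checked directly, both by hand and with NumPy's built-in norm:

```python
import numpy as np

v = np.array([3.0, 4.0])

# Magnitude as the square root of the sum of squared components
print(np.sqrt(np.sum(v ** 2)))  # 5.0
# Equivalent built-in
print(np.linalg.norm(v))        # 5.0
```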
2.2.6 Dot Product
The dot product of two vectors \(\mathbf{u}\) and \(\mathbf{v}\) is calculated as:
\[ \mathbf{u} \cdot \mathbf{v} = u_1 v_1 + u_2 v_2 + \ldots + u_n v_n \]
Properties of the Dot Product:
- The dot product is commutative: \(\mathbf{u} \cdot \mathbf{v} = \mathbf{v} \cdot \mathbf{u}\)
- It provides a measure of the angle \(\theta\) between two vectors:
\[\mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \cos(\theta)\]
- If \(\mathbf{u} \cdot \mathbf{v} = 0\), the vectors are orthogonal (perpendicular).
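Both facts, orthogonality and the angle formula, can be sketched in NumPy:

```python
import numpy as np

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])

# A zero dot product means the vectors are orthogonal
print(np.dot(u, v))  # 0.0

# Recovering the angle between two vectors from the dot-product identity
a = np.array([1.0, 1.0])
b = np.array([1.0, 0.0])
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta)))  # ~45 degrees
```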
2.2.7 Cross Product
The cross product of two vectors \(\mathbf{u}\) and \(\mathbf{v}\) results in a vector that is orthogonal to both, defined only in three-dimensional space:
\[ \mathbf{u} \times \mathbf{v} = \begin{bmatrix} u_2 v_3 - u_3 v_2 \\ u_3 v_1 - u_1 v_3 \\ u_1 v_2 - u_2 v_1 \end{bmatrix} \]
The magnitude of the cross product gives the area of the parallelogram formed by the two vectors:
\[\|\mathbf{u} \times \mathbf{v}\| = \|\mathbf{u}\| \|\mathbf{v}\| \sin(\theta)\]
The cross product is anti-commutative:
\[\mathbf{u} \times \mathbf{v} = -(\mathbf{v} \times \mathbf{u})\]
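A short NumPy sketch confirms the orthogonality and area properties on the standard basis vectors:

```python
import numpy as np

u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])

w = np.cross(u, v)
print(w)  # [0. 0. 1.]

# w is orthogonal to both inputs
print(np.dot(w, u), np.dot(w, v))  # 0.0 0.0

# Its magnitude equals the area of the parallelogram spanned by u and v
print(np.linalg.norm(w))  # 1.0
```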
2.3 Simple Applications
The following problems illustrate how these vector operations apply, with geometric interpretations in 2D and 3D space.
2.3.1 Vectors in 2D
Problem 1: Vector Addition
Given the following five vectors representing customer expenditures in different categories:
- Vector: \(\mathbf{A} = [1000, 1500]\) (expenditure for food and entertainment)
- Vector: \(\mathbf{B} = [700, 300]\) (expenditure for transportation and others)
- Vector: \(\mathbf{C} = [1200, 800]\) (expenditure for clothing and accessories)
- Vector: \(\mathbf{D} = [900, 400]\) (expenditure for utilities)
- Vector: \(\mathbf{E} = [500, 600]\) (expenditure for health and fitness)
Calculate the sum of all vectors:
\[ \mathbf{T} = \mathbf{A} + \mathbf{B} + \mathbf{C} + \mathbf{D} + \mathbf{E} \]
Calculating each component:
\[ \mathbf{T} = [1000 + 700 + 1200 + 900 + 500,\ 1500 + 300 + 800 + 400 + 600] = [4300, 3600] \]
The resulting vector \(\mathbf{T} = [4300, 3600]\) represents the total expenditure summed component-wise across all five category vectors, indicating the overall spending in food, entertainment, transportation, clothing, utilities, and health.
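The component-wise sum can be verified with a few lines of NumPy:

```python
import numpy as np

# Category expenditure vectors from the problem statement
A = np.array([1000, 1500])
B = np.array([700, 300])
C = np.array([1200, 800])
D = np.array([900, 400])
E = np.array([500, 600])

T = A + B + C + D + E
print(T)  # [4300 3600]
```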
Problem 2: Magnitude
Given the income and expenses of five customers, visualize these data points as vectors. The following are their income and expense data:
- Customer 1: \(\mathbf{P_1} = [7000, 3000]\)
- Customer 2: \(\mathbf{P_2} = [4500, 2000]\)
- Customer 3: \(\mathbf{P_3} = [8000, 4000]\)
- Customer 4: \(\mathbf{P_4} = [5500, 3500]\)
- Customer 5: \(\mathbf{P_5} = [6000, 2500]\)
Magnitude Calculation:
Magnitude of Customer 1: \[ \|\mathbf{P_1}\| = \sqrt{7000^2 + 3000^2} \approx 7615.77 \]
Magnitude of Customer 2: \[ \|\mathbf{P_2}\| = \sqrt{4500^2 + 2000^2} \approx 4924.43 \]
Magnitude of Customer 3: \[ \|\mathbf{P_3}\| = \sqrt{8000^2 + 4000^2} \approx 8944.27 \]
Magnitude of Customer 4: \[ \|\mathbf{P_4}\| = \sqrt{5500^2 + 3500^2} \approx 6519.20 \]
Magnitude of Customer 5: \[ \|\mathbf{P_5}\| = \sqrt{6000^2 + 2500^2} \approx 6500.00 \]
These magnitudes measure the distance of each customer's (income, expenditure) point from the origin, giving a rough indication of the overall scale of each customer's finances.
Problem 3: Cluster Analysis
To perform cluster analysis, we first calculate the Euclidean distances between each pair of customers using the formula:
\[ d(\mathbf{P_i}, \mathbf{P_j}) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \]
Where \((x_i, y_i)\) and \((x_j, y_j)\) are the coordinates of the vectors.
Distance Calculations:
Distance between Customer 1 and Customer 2: \[ d(\mathbf{P_1}, \mathbf{P_2}) = \sqrt{(7000 - 4500)^2 + (3000 - 2000)^2} \approx 2692.58 \]
Distance between Customer 1 and Customer 3: \[ d(\mathbf{P_1}, \mathbf{P_3}) = \sqrt{(7000 - 8000)^2 + (3000 - 4000)^2} \approx 1414.21 \]
Distance between Customer 1 and Customer 4: \[ d(\mathbf{P_1}, \mathbf{P_4}) = \sqrt{(7000 - 5500)^2 + (3000 - 3500)^2} \approx 1581.14 \]
Distance between Customer 1 and Customer 5: \[ d(\mathbf{P_1}, \mathbf{P_5}) = \sqrt{(7000 - 6000)^2 + (3000 - 2500)^2} \approx 1118.03 \]
Distance between Customer 2 and Customer 3: \[ d(\mathbf{P_2}, \mathbf{P_3}) = \sqrt{(4500 - 8000)^2 + (2000 - 4000)^2} \approx 4031.13 \]
Distance between Customer 2 and Customer 4: \[ d(\mathbf{P_2}, \mathbf{P_4}) = \sqrt{(4500 - 5500)^2 + (2000 - 3500)^2} \approx 1802.78 \]
Distance between Customer 2 and Customer 5: \[ d(\mathbf{P_2}, \mathbf{P_5}) = \sqrt{(4500 - 6000)^2 + (2000 - 2500)^2} \approx 1581.14 \]
Distance between Customer 3 and Customer 4: \[ d(\mathbf{P_3}, \mathbf{P_4}) = \sqrt{(8000 - 5500)^2 + (4000 - 3500)^2} \approx 2549.51 \]
Distance between Customer 3 and Customer 5: \[ d(\mathbf{P_3}, \mathbf{P_5}) = \sqrt{(8000 - 6000)^2 + (4000 - 2500)^2} \approx 2500.00 \]
Distance between Customer 4 and Customer 5: \[ d(\mathbf{P_4}, \mathbf{P_5}) = \sqrt{(5500 - 6000)^2 + (3500 - 2500)^2} \approx 1118.03 \]
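All ten pairwise distances can be computed in one pass with NumPy:

```python
import numpy as np

# (income, expenditure) vectors for the five customers
P = np.array([
    [7000, 3000],  # Customer 1
    [4500, 2000],  # Customer 2
    [8000, 4000],  # Customer 3
    [5500, 3500],  # Customer 4
    [6000, 2500],  # Customer 5
])

# Euclidean distance between every pair of customers
for i in range(len(P)):
    for j in range(i + 1, len(P)):
        print(f"d(P{i + 1}, P{j + 1}) = {np.linalg.norm(P[i] - P[j]):.2f}")
```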
Clustering the Customers: Based on the calculated distances, we can group the customers into clusters. A common method is hierarchical clustering or a simple distance threshold. Using the distances above, one plausible grouping is:
- Cluster 1:
- Customers 1, 4, and 5: These customers are closer to each other based on their financial vectors, indicating similar income and expenditure patterns.
- Cluster 2:
- Customer 2: This customer is more distanced from the others, indicating a different financial behavior.
- Cluster 3:
- Customer 3: This customer is also distanced from Cluster 1 and 2, showing a distinct pattern.
Summary of Clusters:
- Cluster 1: \(\{\mathbf{P_1}, \mathbf{P_4}, \mathbf{P_5}\}\)
- Cluster 2: \(\{\mathbf{P_2}\}\)
- Cluster 3: \(\{\mathbf{P_3}\}\)
This clustering approach helps identify groups of customers with similar financial states, which can be beneficial for targeted marketing strategies or financial planning.
Problem 4: Vector Normalization
The unit vector for each customer vector \(\mathbf{P_i}\) can be calculated using the formula:
\[ \hat{\mathbf{P_i}} = \frac{\mathbf{P_i}}{\|\mathbf{P_i}\|} \]
where \(\|\mathbf{P_i}\|\) is the magnitude of the vector \(\mathbf{P_i}\). Calculations:
Magnitude of Customer 1: \[ \|\mathbf{P_1}\| = \sqrt{7000^2 + 3000^2} \approx 7615.77 \]
Unit Vector of Customer 1: \[ \hat{\mathbf{P_1}} = \frac{\mathbf{P_1}}{\|\mathbf{P_1}\|} \approx \left[\frac{7000}{7615.77}, \frac{3000}{7615.77}\right] \approx [0.919, 0.394] \]
Magnitude of Customer 2: \[ \|\mathbf{P_2}\| = \sqrt{4500^2 + 2000^2} \approx 4924.43 \]
Unit Vector of Customer 2: \[ \hat{\mathbf{P_2}} = \frac{\mathbf{P_2}}{\|\mathbf{P_2}\|} \approx [0.914, 0.406] \]
Magnitude of Customer 3: \[ \|\mathbf{P_3}\| = \sqrt{8000^2 + 4000^2} \approx 8944.27 \]
Unit Vector of Customer 3: \[ \hat{\mathbf{P_3}} = \frac{\mathbf{P_3}}{\|\mathbf{P_3}\|} \approx [0.894, 0.447] \]
Magnitude of Customer 4: \[ \|\mathbf{P_4}\| = \sqrt{5500^2 + 3500^2} \approx 6519.20 \]
Unit Vector of Customer 4: \[ \hat{\mathbf{P_4}} = \frac{\mathbf{P_4}}{\|\mathbf{P_4}\|} \approx [0.844, 0.537] \]
Magnitude of Customer 5: \[ \|\mathbf{P_5}\| = \sqrt{6000^2 + 2500^2} \approx 6500.00 \]
Unit Vector of Customer 5: \[ \hat{\mathbf{P_5}} = \frac{\mathbf{P_5}}{\|\mathbf{P_5}\|} \approx [0.923, 0.385] \]
Normalization is crucial in data analysis and machine learning because:
- It ensures that all features have the same scale, which is essential for algorithms that rely on distance calculations, such as K-means clustering and K-nearest neighbors.
- It improves the convergence speed of gradient descent algorithms.
- It helps mitigate the effects of bias due to varying ranges of feature values, leading to more balanced contributions during model training.
Notes: In summary, normalization enhances the effectiveness and accuracy of machine learning models by ensuring that all input vectors contribute equally to the analysis.
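Normalizing all five customer vectors at once is a one-liner in NumPy, dividing each row by its own magnitude:

```python
import numpy as np

P = np.array([
    [7000, 3000],
    [4500, 2000],
    [8000, 4000],
    [5500, 3500],
    [6000, 2500],
], dtype=float)

# Divide each row by its own magnitude to obtain unit vectors
norms = np.linalg.norm(P, axis=1, keepdims=True)
P_hat = P / norms
print(np.round(P_hat, 3))

# Every normalized vector now has magnitude 1
print(np.allclose(np.linalg.norm(P_hat, axis=1), 1.0))  # True
```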
2.3.2 Vectors in 3D
Problem 1: Vector Addition
Suppose we have data on income, expenditure, and savings from five customers. We can represent this as a vector in 3D space.
Customer | Income | Expenditure | Savings |
---|---|---|---|
Customer 1 | 7000 | 3000 | 4000 |
Customer 2 | 4500 | 2000 | 2500 |
Customer 3 | 8000 | 4000 | 4000 |
Customer 4 | 5500 | 3500 | 2000 |
Customer 5 | 6000 | 2500 | 3500 |
Let’s perform the vector addition for all customers by summing their components one by one.
If we define the total vector as:
\[ \mathbf{P}_{\text{Total}} = \mathbf{P}_1 + \mathbf{P}_2 + \mathbf{P}_3 + \mathbf{P}_4 + \mathbf{P}_5 \]
Then the components of \(\mathbf{P_{\text{Total}}}\) can be calculated as follows:
Total Income: \[ \text{Total Income} = 7000 + 4500 + 8000 + 5500 + 6000 = 31000 \]
Total Expenditure: \[ \text{Total Expenditure} = 3000 + 2000 + 4000 + 3500 + 2500 = 15500 \]
Total Savings: \[ \text{Total Savings} = 4000 + 2500 + 4000 + 2000 + 3500 = 16000 \]
Problem 2: Magnitude
Magnitude Calculation:
The magnitude of each customer’s financial profile will be calculated using the formula:
\[ \| P \| = \sqrt{x^2 + y^2 + z^2} \]
Customer | Income | Expenditure | Savings | Magnitude |
---|---|---|---|---|
Customer 1 | 7000 | 3000 | 4000 | 8602.325 |
Customer 2 | 4500 | 2000 | 2500 | 5522.681 |
Customer 3 | 8000 | 4000 | 4000 | 9797.959 |
Customer 4 | 5500 | 3500 | 2000 | 6819.091 |
Customer 5 | 6000 | 2500 | 3500 | 7382.412 |
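The 3D magnitudes can be reproduced with NumPy by taking the norm of each row:

```python
import numpy as np

P = np.array([
    [7000, 3000, 4000],
    [4500, 2000, 2500],
    [8000, 4000, 4000],
    [5500, 3500, 2000],
    [6000, 2500, 3500],
], dtype=float)

# One magnitude per customer (rows), i.e. sqrt(x^2 + y^2 + z^2)
print(np.linalg.norm(P, axis=1))
```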
Problem 3: Cluster Analysis
The Euclidean distance between two vectors \(P_i\) and \(P_j\) can be calculated using the formula:
\[ d(P_i, P_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} \]
Where:
- \(P_i \) and \( P_j\) are the two points in space,
- \(x_i, y_i, z_i\) are the coordinates of point \(P_i\),
- \(x_j, y_j, z_j\) are the coordinates of point \(P_j\).
Let’s calculate the Euclidean distance between the customers in our dataset. We will use the previously defined customer data:
 | Customer 1 | Customer 2 | Customer 3 | Customer 4 | Customer 5 |
---|---|---|---|---|---|
Customer 1 | 0.000 | 3082.207 | 1414.214 | 2549.510 | 1224.745 |
Customer 2 | 3082.207 | 0.000 | 4301.163 | 1870.829 | 1870.829 |
Customer 3 | 1414.214 | 4301.163 | 0.000 | 3240.370 | 2549.510 |
Customer 4 | 2549.510 | 1870.829 | 3240.370 | 0.000 | 1870.829 |
Customer 5 | 1224.745 | 1870.829 | 2549.510 | 1870.829 | 0.000 |
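The full symmetric distance matrix can be built with NumPy broadcasting rather than a double loop:

```python
import numpy as np

P = np.array([
    [7000, 3000, 4000],
    [4500, 2000, 2500],
    [8000, 4000, 4000],
    [5500, 3500, 2000],
    [6000, 2500, 3500],
], dtype=float)

# diff[i, j] holds the component-wise difference P[i] - P[j];
# summing squares over the last axis and taking the root gives
# the 5x5 Euclidean distance matrix.
diff = P[:, None, :] - P[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=2))
print(np.round(D, 3))
```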
These pairwise distances show which customers have similar financial profiles; they serve as a reference point when we apply K-Means clustering in the next section.
Notes: To visualize the customer data after normalization and apply K-Means clustering, you need to normalize the data before performing clustering.
Problem 4: Vector Normalization
Normalization is the process of transforming a vector into a unit vector that has a magnitude of 1. The formula to calculate the unit vector \(\hat{P}\) is:
\[ \hat{P} = \frac{P}{\|P\|} \]
Where \(\|P\|\) is the magnitude of vector \(P\). Consider the following table of normalized customer vectors:
Customer | Income | Expenditure | Savings |
---|---|---|---|
Customer 1 | 0.8137335 | 0.3487429 | 0.4649906 |
Customer 2 | 0.8148217 | 0.3621430 | 0.4526787 |
Customer 3 | 0.8164966 | 0.4082483 | 0.4082483 |
Customer 4 | 0.8065591 | 0.5132649 | 0.2932942 |
Customer 5 | 0.8127426 | 0.3386427 | 0.4740998 |
2.4 K-Means Clustering
In this document, we will manually calculate the K-Means clustering for a dataset containing customer data. The dataset consists of three features: Income, Expenditure, and Savings. We will follow the K-Means clustering algorithm steps, including initialization, assignment, and update of centroids.
2.4.1 Step 1: Data Preparation
The customer data is as follows:
Customer | Income | Expenditure | Savings |
---|---|---|---|
Customer 1 | 7000 | 3000 | 4000 |
Customer 2 | 4500 | 2000 | 2500 |
Customer 3 | 8000 | 4000 | 4000 |
Customer 4 | 5500 | 3500 | 2000 |
Customer 5 | 6000 | 2500 | 3500 |
2.4.2 Step 2: Initialization
Let’s assume we randomly select the following points as initial centroids:
- Centroid 1: Customer 1 \((7000, 3000, 4000)\)
- Centroid 2: Customer 2 \((4500, 2000, 2500)\)
- Centroid 3: Customer 3 \((8000, 4000, 4000)\)
2.4.3 Step 3: Assignment
We will calculate the Euclidean distance from each customer to each centroid and assign each customer to the nearest centroid. The formula for Euclidean distance is:
\[ d(P_i, P_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} \]
Calculate Distances:
- Customer 1 \((7000, 3000, 4000)\):
- Distance to Centroid 1: \[ d = 0 \]
- Distance to Centroid 2: \[ d = \sqrt{(7000-4500)^2 + (3000-2000)^2 + (4000-2500)^2} \approx 3082.21 \]
- Distance to Centroid 3: \[ d = \sqrt{(7000-8000)^2 + (3000-4000)^2 + (4000-4000)^2} \approx 1414.21 \]
- Customer 2 \((4500, 2000, 2500)\):
- Distance to Centroid 1: \[ d \approx 3082.21 \]
- Distance to Centroid 2: \[ d = 0 \]
- Distance to Centroid 3: \[ d = \sqrt{(4500-8000)^2 + (2000-4000)^2 + (2500-4000)^2} \approx 4301.16 \]
- Customer 3 \((8000, 4000, 4000)\):
- Distance to Centroid 1: \[ d \approx 1414.21 \]
- Distance to Centroid 2: \[ d \approx 4301.16 \]
- Distance to Centroid 3: \[ d = 0 \]
- Customer 4 \((5500, 3500, 2000)\):
- Distance to Centroid 1: \[ d \approx 2549.51 \]
- Distance to Centroid 2: \[ d \approx 1870.83 \]
- Distance to Centroid 3: \[ d \approx 3240.37 \]
- Customer 5 \((6000, 2500, 3500)\):
- Distance to Centroid 1: \[ d \approx 1224.74 \]
- Distance to Centroid 2: \[ d \approx 1870.83 \]
- Distance to Centroid 3: \[ d \approx 2549.51 \]
2.4.4 Step 4: Assign Customers to Clusters
Based on the distances calculated, we assign each customer to the nearest centroid:
- Customer 1: Cluster 1 (Centroid 1)
- Customer 2: Cluster 2 (Centroid 2)
- Customer 3: Cluster 3 (Centroid 3)
- Customer 4: Cluster 2 (Centroid 2 is its nearest centroid)
- Customer 5: Cluster 1 (Centroid 1 is its nearest centroid)
2.4.5 Step 5: Update Centroids
Next, we calculate the new centroids for each cluster:
Cluster 1 (Customers 1 and 5):
- New Income = \(\frac{7000 + 6000}{2} = 6500\)
- New Expenditure = \(\frac{3000 + 2500}{2} = 2750\)
- New Savings = \(\frac{4000 + 3500}{2} = 3750\)
- New Centroid = \((6500, 2750, 3750)\)
Cluster 2 (Customers 2 and 4):
- New Income = \(\frac{4500 + 5500}{2} = 5000\)
- New Expenditure = \(\frac{2000 + 3500}{2} = 2750\)
- New Savings = \(\frac{2500 + 2000}{2} = 2250\)
- New Centroid = \((5000, 2750, 2250)\)
Cluster 3 (Customer 3):
- Centroid = \((8000, 4000, 4000)\)
2.4.6 Repeat Steps
You would repeat the assignment and update steps until the centroids no longer change significantly.
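The full assign-then-update loop can be sketched in NumPy (a minimal illustration under the same initialization as Step 2, not a production implementation; it assumes no cluster ever becomes empty):

```python
import numpy as np

P = np.array([
    [7000, 3000, 4000],  # Customer 1
    [4500, 2000, 2500],  # Customer 2
    [8000, 4000, 4000],  # Customer 3
    [5500, 3500, 2000],  # Customer 4
    [6000, 2500, 3500],  # Customer 5
], dtype=float)

# Step 2: initial centroids are Customers 1, 2, and 3
centroids = P[[0, 1, 2]].copy()

for _ in range(100):
    # Step 3: assign each point to its nearest centroid
    dists = np.linalg.norm(P[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 5: move each centroid to the mean of its assigned points
    new_centroids = np.array([P[labels == k].mean(axis=0) for k in range(3)])
    # Stop once the centroids no longer change
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)     # cluster index for each customer
print(centroids)  # final centroids
```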
Notes: This manual calculation provides a basic understanding of how K-Means clustering works, including the assignment of points to clusters and the update of centroids based on the mean of the assigned points. This process can be complex, especially for larger datasets, and is typically done using algorithms implemented in software like R and Python.