2  Vectors

In data science, understanding the foundational concepts of vectors and matrices is essential. Both are fundamental to a wide range of operations in machine learning, statistics, optimization, and various algorithms.

2.1 Definition

A vector is a fundamental concept in mathematics and physics that represents a quantity with both magnitude (size) and direction. In the context of data science, vectors are used to represent data points, parameters, and relationships between variables in a structured format. Vectors are particularly useful because they allow for efficient manipulation of multidimensional data.

Vectors are often represented as:

  • Column vectors: \[ \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ \vdots \\ v_n \end{bmatrix} \]

  • Row vectors: \[ \begin{bmatrix} v_1 & v_2 & v_3 & \cdots & v_n \end{bmatrix} \]

2.2 Properties

As defined above, a vector is a quantity possessing both magnitude (size) and direction. Understanding the properties below is essential for applications in data science, physics, and engineering.

2.2.1 Dimension

The dimension of a vector is determined by the number of components it contains. A vector with \(n\) elements is said to exist in \(n\)-dimensional space.

  • A vector in \(2D\) space, such as \(\mathbf{v} = [v_1, v_2]\), has a dimension of 2.
  • A vector in \(3D\) space, like \(\mathbf{v} = [v_1, v_2, v_3]\), has a dimension of 3.
  • A vector in \(nD\) space, like \(\mathbf{v} = [v_1, v_2, \cdots, v_n]\), has a dimension of \(n\).

2.2.2 Types of Vectors

  • Zero Vector: A vector where all components are zero, denoted as \(\mathbf{0}\). The zero vector is unique and acts as the additive identity in vector addition: \(\mathbf{v} + \mathbf{0} = \mathbf{v}\).

  • Unit Vector: A vector with a magnitude (length) of 1. Given a vector \(\mathbf{v}\), the unit vector \(\hat{\mathbf{v}}\) is calculated as:

\[ \hat{\mathbf{v}} = \frac{\mathbf{v}}{\|\mathbf{v}\|} \] where \(\|\mathbf{v}\|\) is the magnitude of \(\mathbf{v}\). Unit vectors are often used to specify direction without regard to magnitude.

  • Position Vector: A vector that represents the position of a point in space relative to a fixed origin. In 3D space, the position vector of a point \(P(x, y, z)\) can be represented as:

\[ \mathbf{p} = \begin{bmatrix} x \\ y \\ z \end{bmatrix} \]

2.2.3 Addition and Subtraction

Two vectors can be added or subtracted if they have the same dimension. The result is obtained component-wise:

\[ \mathbf{u} \pm \mathbf{v} = \begin{bmatrix} u_1 \pm v_1 \\ u_2 \pm v_2 \\ \vdots \\ u_n \pm v_n \end{bmatrix} \]

Properties of Addition:

  • Commutative: \(\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}\)
  • Associative: \(\mathbf{u} + (\mathbf{v} + \mathbf{w}) = (\mathbf{u} + \mathbf{v}) + \mathbf{w}\)

Note that subtraction is neither commutative nor associative; it is simply addition of a negated vector: \(\mathbf{u} - \mathbf{v} = \mathbf{u} + (-1)\mathbf{v}\).
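Vector addition and subtraction can be sketched in Python with NumPy (assumed here as the array library), where both operations are element-wise:

```python
import numpy as np

# Two vectors of the same dimension
u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

print(u + v)  # [5 7 9]
print(u - v)  # [-3 -3 -3]

# Addition is commutative: u + v == v + u
print(np.array_equal(u + v, v + u))  # True
```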

2.2.4 Scalar Multiplication

A vector can be multiplied by a scalar (a real number), resulting in a new vector that scales each component:

\[ c \cdot \mathbf{v} = \begin{bmatrix} c \cdot v_1 \\ c \cdot v_2 \\ \vdots \\ c \cdot v_n \end{bmatrix} \]

Properties of Scalar Multiplication:

  • If \(c > 1\), the vector is stretched.
  • If \(0 < c < 1\), the vector is shrunk.
  • If \(c < 0\), the vector is flipped in direction.
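A minimal NumPy sketch of scalar multiplication, illustrating the three cases above:

```python
import numpy as np

v = np.array([2.0, 4.0])

print(3 * v)    # [ 6. 12.]  stretched, since c > 1
print(0.5 * v)  # [1. 2.]    shrunk, since 0 < c < 1
print(-1 * v)   # [-2. -4.]  direction flipped, since c < 0
```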

2.2.5 Magnitude

The magnitude (length) of a vector \(\mathbf{v} = [v_1, v_2, \ldots, v_n]\) is given by:

\[ \|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \ldots + v_n^2} \]

Properties of Magnitude:

  • Magnitude is always non-negative: \(\|\mathbf{v}\| \geq 0\).
  • The magnitude of the zero vector is zero: \(\|\mathbf{0}\| = 0\).
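The magnitude formula maps directly to NumPy's `np.linalg.norm` (a sketch, using the classic 3-4-5 example):

```python
import numpy as np

v = np.array([3.0, 4.0])

# Euclidean norm: sqrt(3^2 + 4^2) = 5
print(np.linalg.norm(v))        # 5.0

# Equivalent manual computation from the formula above
print(np.sqrt(np.sum(v ** 2)))  # 5.0

# The zero vector has magnitude 0
print(np.linalg.norm(np.zeros(2)))  # 0.0
```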

2.2.6 Dot Product

The dot product of two vectors \(\mathbf{u}\) and \(\mathbf{v}\) is calculated as:

\[ \mathbf{u} \cdot \mathbf{v} = u_1 v_1 + u_2 v_2 + \ldots + u_n v_n \]

The dot product is commutative:

\[\mathbf{u} \cdot \mathbf{v} = \mathbf{v} \cdot \mathbf{u}\]

It also provides a measure of the angle \(\theta\) between two vectors:

\[\mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \cos(\theta)\]

  • If \(\mathbf{u} \cdot \mathbf{v} = 0\), the vectors are orthogonal (perpendicular).
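A short NumPy sketch of the dot product and the angle formula (the example vectors are chosen so the first pair is orthogonal):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, -5.0, 2.0])

# 1*4 + 2*(-5) + 3*2 = 0 -> orthogonal vectors
print(np.dot(u, v))  # 0.0

# Recover the angle from u.v = |u||v|cos(theta)
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta)))  # 90.0 (up to floating point)
```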

2.2.7 Cross Product

The cross product of two vectors \(\mathbf{u}\) and \(\mathbf{v}\) results in a vector that is orthogonal to both, defined only in three-dimensional space:

\[ \mathbf{u} \times \mathbf{v} = \begin{bmatrix} u_2 v_3 - u_3 v_2 \\ u_3 v_1 - u_1 v_3 \\ u_1 v_2 - u_2 v_1 \end{bmatrix} \]

The magnitude of the cross product gives the area of the parallelogram formed by the two vectors:

\[\|\mathbf{u} \times \mathbf{v}\| = \|\mathbf{u}\| \|\mathbf{v}\| \sin(\theta)\]

The cross product is anti-commutative:

\[\mathbf{u} \times \mathbf{v} = -(\mathbf{v} \times \mathbf{u})\]
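The 3D cross product and its properties can be checked with `np.cross` (a sketch using the standard basis vectors):

```python
import numpy as np

u = np.array([1.0, 0.0, 0.0])  # x-axis unit vector
v = np.array([0.0, 1.0, 0.0])  # y-axis unit vector

print(np.cross(u, v))  # [0. 0. 1.] -- the z-axis unit vector
print(np.cross(v, u))  # [ 0.  0. -1.] -- anti-commutativity

# |u x v| = |u||v|sin(theta): area of the unit square here
print(np.linalg.norm(np.cross(u, v)))  # 1.0
```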

2.3 Simple Applications

The following problems illustrate how these vector operations apply to data in 2D and 3D space.

2.3.1 Vectors in 2D

Problem 1: Vector Addition

Given the following five vectors representing customer expenditures in different categories:

  • Vector: \(\mathbf{A} = [1000, 1500]\) (expenditure for food and entertainment)
  • Vector: \(\mathbf{B} = [700, 300]\) (expenditure for transportation and others)
  • Vector: \(\mathbf{C} = [1200, 800]\) (expenditure for clothing and accessories)
  • Vector: \(\mathbf{D} = [900, 400]\) (expenditure for utilities)
  • Vector: \(\mathbf{E} = [500, 600]\) (expenditure for health and fitness)

Calculate the sum of all vectors:

\[ \mathbf{T} = \mathbf{A} + \mathbf{B} + \mathbf{C} + \mathbf{D} + \mathbf{E} \]

Calculating each component:

\[ \mathbf{T} = [1000 + 700 + 1200 + 900 + 500,\ 1500 + 300 + 800 + 400 + 600] = [4300, 3600] \]

The resulting vector \(\mathbf{T} = [4300, 3600]\) represents the total expenditure across all categories for the customers, indicating the overall spending in food, entertainment, transportation, clothing, utilities, and health.
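The total can be verified with a short NumPy sketch:

```python
import numpy as np

A = np.array([1000, 1500])
B = np.array([700, 300])
C = np.array([1200, 800])
D = np.array([900, 400])
E = np.array([500, 600])

T = A + B + C + D + E
print(T)  # [4300 3600]
```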

Problem 2: Magnitude

Given the income and expenses of five customers, visualize these data points as vectors. The following are their income and expense data:

Customer Income and Expenditure

   Customer   Income Expenditure
 Customer 1     7000        3000
 Customer 2     4500        2000
 Customer 3     8000        4000
 Customer 4     5500        3500
 Customer 5     6000        2500

  • Customer 1: \(\mathbf{P_1} = [7000, 3000]\)
  • Customer 2: \(\mathbf{P_2} = [4500, 2000]\)
  • Customer 3: \(\mathbf{P_3} = [8000, 4000]\)
  • Customer 4: \(\mathbf{P_4} = [5500, 3500]\)
  • Customer 5: \(\mathbf{P_5} = [6000, 2500]\)

Magnitude Calculation:

  • Magnitude of Customer 1: \[ \|\mathbf{P_1}\| = \sqrt{7000^2 + 3000^2} \approx 7615.77 \]

  • Magnitude of Customer 2: \[ \|\mathbf{P_2}\| = \sqrt{4500^2 + 2000^2} \approx 4924.43 \]

  • Magnitude of Customer 3: \[ \|\mathbf{P_3}\| = \sqrt{8000^2 + 4000^2} \approx 8944.27 \]

  • Magnitude of Customer 4: \[ \|\mathbf{P_4}\| = \sqrt{5500^2 + 3500^2} \approx 6519.20 \]

  • Magnitude of Customer 5: \[ \|\mathbf{P_5}\| = \sqrt{6000^2 + 2500^2} = 6500 \]

These magnitudes represent the overall financial status (considering both income and expenditure) of each customer, showing their relative financial strengths.
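These magnitudes can be reproduced in NumPy (a sketch; the names `P1` to `P5` mirror the vectors above):

```python
import numpy as np

customers = {
    "P1": np.array([7000, 3000]),
    "P2": np.array([4500, 2000]),
    "P3": np.array([8000, 4000]),
    "P4": np.array([5500, 3500]),
    "P5": np.array([6000, 2500]),
}

# Magnitude of each customer vector
for name, p in customers.items():
    print(name, round(np.linalg.norm(p), 2))
# P1 7615.77, P2 4924.43, P3 8944.27, P4 6519.2, P5 6500.0
```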

Problem 3: Cluster Analysis

To perform cluster analysis, we first calculate the Euclidean distances between each pair of customers using the formula:

\[ d(\mathbf{P_i}, \mathbf{P_j}) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \]

Where \((x_i, y_i)\) and \((x_j, y_j)\) are the coordinates of the vectors.

Distance Calculations:

  • Distance between Customer 1 and Customer 2: \[ d(\mathbf{P_1}, \mathbf{P_2}) = \sqrt{(7000 - 4500)^2 + (3000 - 2000)^2} \approx 2692.58 \]

  • Distance between Customer 1 and Customer 3: \[ d(\mathbf{P_1}, \mathbf{P_3}) = \sqrt{(7000 - 8000)^2 + (3000 - 4000)^2} \approx 1414.21 \]

  • Distance between Customer 1 and Customer 4: \[ d(\mathbf{P_1}, \mathbf{P_4}) = \sqrt{(7000 - 5500)^2 + (3000 - 3500)^2} \approx 1581.14 \]

  • Distance between Customer 1 and Customer 5: \[ d(\mathbf{P_1}, \mathbf{P_5}) = \sqrt{(7000 - 6000)^2 + (3000 - 2500)^2} \approx 1118.03 \]

  • Distance between Customer 2 and Customer 3: \[ d(\mathbf{P_2}, \mathbf{P_3}) = \sqrt{(4500 - 8000)^2 + (2000 - 4000)^2} \approx 4031.13 \]

  • Distance between Customer 2 and Customer 4: \[ d(\mathbf{P_2}, \mathbf{P_4}) = \sqrt{(4500 - 5500)^2 + (2000 - 3500)^2} \approx 1802.78 \]

  • Distance between Customer 2 and Customer 5: \[ d(\mathbf{P_2}, \mathbf{P_5}) = \sqrt{(4500 - 6000)^2 + (2000 - 2500)^2} \approx 1581.14 \]

  • Distance between Customer 3 and Customer 4: \[ d(\mathbf{P_3}, \mathbf{P_4}) = \sqrt{(8000 - 5500)^2 + (4000 - 3500)^2} \approx 2549.51 \]

  • Distance between Customer 3 and Customer 5: \[ d(\mathbf{P_3}, \mathbf{P_5}) = \sqrt{(8000 - 6000)^2 + (4000 - 2500)^2} = 2500.00 \]

  • Distance between Customer 4 and Customer 5: \[ d(\mathbf{P_4}, \mathbf{P_5}) = \sqrt{(5500 - 6000)^2 + (3500 - 2500)^2} \approx 1118.03 \]
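The pairwise distances can be generated with a short loop (a sketch using NumPy and `itertools.combinations`):

```python
import numpy as np
from itertools import combinations

points = {
    "P1": np.array([7000, 3000]),
    "P2": np.array([4500, 2000]),
    "P3": np.array([8000, 4000]),
    "P4": np.array([5500, 3500]),
    "P5": np.array([6000, 2500]),
}

# Euclidean distance for every pair of customers
for (a, pa), (b, pb) in combinations(points.items(), 2):
    print(a, b, round(np.linalg.norm(pa - pb), 2))
```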

Clustering the Customers: Based on the calculated distances, we can group the customers into clusters. A common method is to use hierarchical clustering or a distance threshold. Using the calculated distances, we can cluster the customers as follows:

  • Cluster 1:
    • Customers 1, 4, and 5: These customers are closer to each other based on their financial vectors, indicating similar income and expenditure patterns.
  • Cluster 2:
    • Customer 2: This customer is more distanced from the others, indicating a different financial behavior.
  • Cluster 3:
    • Customer 3: This customer is also distanced from Cluster 1 and 2, showing a distinct pattern.

Summary of Clusters:

  • Cluster 1: \(\{\mathbf{P_1}, \mathbf{P_4}, \mathbf{P_5}\}\)
  • Cluster 2: \(\{\mathbf{P_2}\}\)
  • Cluster 3: \(\{\mathbf{P_3}\}\)

This clustering approach helps identify groups of customers with similar financial states, which can be beneficial for targeted marketing strategies or financial planning.

Problem 4: Vector Normalization

The unit vector for each customer vector \(\mathbf{P_i}\) can be calculated using the formula:

\[ \hat{\mathbf{P_i}} = \frac{\mathbf{P_i}}{\|\mathbf{P_i}\|} \]

where \(\|\mathbf{P_i}\|\) is the magnitude of the vector \(\mathbf{P_i}\). Calculations:

  • Magnitude of Customer 1: \[ \|\mathbf{P_1}\| = \sqrt{7000^2 + 3000^2} \approx 7615.77 \]

  • Unit Vector of Customer 1: \[ \hat{\mathbf{P_1}} = \frac{\mathbf{P_1}}{\|\mathbf{P_1}\|} \approx \left[\frac{7000}{7615.77}, \frac{3000}{7615.77}\right] \approx [0.919, 0.394] \]

  • Magnitude of Customer 2: \[ \|\mathbf{P_2}\| = \sqrt{4500^2 + 2000^2} \approx 4924.43 \]

  • Unit Vector of Customer 2: \[ \hat{\mathbf{P_2}} = \frac{\mathbf{P_2}}{\|\mathbf{P_2}\|} \approx [0.914, 0.406] \]

  • Magnitude of Customer 3: \[ \|\mathbf{P_3}\| = \sqrt{8000^2 + 4000^2} \approx 8944.27 \]

  • Unit Vector of Customer 3: \[ \hat{\mathbf{P_3}} = \frac{\mathbf{P_3}}{\|\mathbf{P_3}\|} \approx [0.894, 0.447] \]

  • Magnitude of Customer 4: \[ \|\mathbf{P_4}\| = \sqrt{5500^2 + 3500^2} \approx 6519.20 \]

  • Unit Vector of Customer 4: \[ \hat{\mathbf{P_4}} = \frac{\mathbf{P_4}}{\|\mathbf{P_4}\|} \approx [0.844, 0.537] \]

  • Magnitude of Customer 5: \[ \|\mathbf{P_5}\| = \sqrt{6000^2 + 2500^2} = 6500 \]

  • Unit Vector of Customer 5: \[ \hat{\mathbf{P_5}} = \frac{\mathbf{P_5}}{\|\mathbf{P_5}\|} \approx [0.923, 0.385] \]

Normalization is crucial in data analysis and machine learning because:

  • It ensures that all features have the same scale, which is essential for algorithms that rely on distance calculations, such as K-means clustering and K-nearest neighbors.
  • It improves the convergence speed of gradient descent algorithms.
  • It helps mitigate the effects of bias due to varying ranges of feature values, leading to more balanced contributions during model training.

Notes: In summary, normalization enhances the effectiveness and accuracy of machine learning models by ensuring that all input vectors contribute equally to the analysis.
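Normalization itself is a one-liner once the magnitude is available (a sketch; `unit_vector` is a hypothetical helper name):

```python
import numpy as np

def unit_vector(p):
    """Scale p to magnitude 1 (undefined for the zero vector)."""
    return p / np.linalg.norm(p)

p1 = np.array([7000.0, 3000.0])
print(np.round(unit_vector(p1), 3))               # [0.919 0.394]
print(round(np.linalg.norm(unit_vector(p1)), 6))  # 1.0
```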

2.3.2 Vectors in 3D

Problem 1: Vector Addition

Suppose we have data on income, expenditure, and savings from five customers. We can represent this as a vector in 3D space.

Income, Expenditure, and Savings of Customers
   Customer   Income Expenditure Savings
 Customer 1     7000        3000    4000
 Customer 2     4500        2000    2500
 Customer 3     8000        4000    4000
 Customer 4     5500        3500    2000
 Customer 5     6000        2500    3500

Let’s perform the vector addition for all customers by summing their components one by one.

If we define the total vector as:

\[ \mathbf{P}_{\text{Total}} = \mathbf{P}_1 + \mathbf{P}_2 + \mathbf{P}_3 + \mathbf{P}_4 + \mathbf{P}_5 \]

Then the components of \(\mathbf{P_{\text{Total}}}\) can be calculated as follows:

  • Total Income: \[ \text{Total Income} = 7000 + 4500 + 8000 + 5500 + 6000 = 31000 \]

  • Total Expenditure: \[ \text{Total Expenditure} = 3000 + 2000 + 4000 + 3500 + 2500 = 15000 \]

  • Total Savings: \[ \text{Total Savings} = 4000 + 2500 + 4000 + 2000 + 3500 = 16000 \]
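The component-wise totals follow from summing the rows of a matrix whose rows are the customer vectors (a NumPy sketch):

```python
import numpy as np

# Rows: customers; columns: income, expenditure, savings
P = np.array([
    [7000, 3000, 4000],
    [4500, 2000, 2500],
    [8000, 4000, 4000],
    [5500, 3500, 2000],
    [6000, 2500, 3500],
])

print(P.sum(axis=0))  # [31000 15000 16000]
```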

Problem 2: Magnitude

Magnitude Calculation:

The magnitude of each customer’s financial profile will be calculated using the formula:

\[ \| P \| = \sqrt{x^2 + y^2 + z^2} \]

    Customer Income Expenditure Savings Magnitude
1 Customer 1   7000        3000    4000  8602.325
2 Customer 2   4500        2000    2500  5522.681
3 Customer 3   8000        4000    4000  9797.959
4 Customer 4   5500        3500    2000  6819.091
5 Customer 5   6000        2500    3500  7382.412

Problem 3: Cluster Analysis

The Euclidean distance between two vectors \(P_i\) and \(P_j\) can be calculated using the formula:

\[ d(P_i, P_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} \]

Where:

  • \(P_i \) and \( P_j\) are the two points in space,
  • \(x_i, y_i, z_i\) are the coordinates of point \(P_i\),
  • \(x_j, y_j, z_j\) are the coordinates of point \(P_j\).

Let’s calculate the Euclidean distance between the customers in our dataset. We will use the previously defined customer data:

           Customer 1 Customer 2 Customer 3 Customer 4 Customer 5
Customer 1      0.000   3082.207   1414.214   2549.510   1224.745
Customer 2   3082.207      0.000   4301.163   1870.829   1870.829
Customer 3   1414.214   4301.163      0.000   3240.370   2549.510
Customer 4   2549.510   1870.829   3240.370      0.000   1870.829
Customer 5   1224.745   1870.829   2549.510   1870.829      0.000
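The full distance matrix above can be computed in one broadcasting step (a NumPy sketch):

```python
import numpy as np

X = np.array([
    [7000, 3000, 4000],
    [4500, 2000, 2500],
    [8000, 4000, 4000],
    [5500, 3500, 2000],
    [6000, 2500, 3500],
], dtype=float)

# D[i, j] = ||X[i] - X[j]|| via broadcasting to a 5x5x3 difference array
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(np.round(D, 3))
```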

Now, we can apply K-Means clustering to these customer vectors; Section 2.4 walks through the algorithm step by step.

Notes: Before visualizing the customer data or applying K-Means clustering, normalize the data so that all features share the same scale.

Problem 4: Vector Normalization

Normalization is the process of transforming a vector into a unit vector that has a magnitude of 1. The formula to calculate the unit vector \(\hat{P}\) is:

\[ \hat{P} = \frac{P}{\|P\|} \]

Where \(\|P\|\) is the magnitude of vector \(P\). Consider the following table:

Unit Vectors of Customers

   Customer    Income Expenditure   Savings
 Customer 1 0.8137335   0.3487429 0.4649906
 Customer 2 0.8148217   0.3621430 0.4526787
 Customer 3 0.8164966   0.4082483 0.4082483
 Customer 4 0.8065591   0.5132649 0.2932942
 Customer 5 0.8127426   0.3386427 0.4740998

2.4 K-Means Clustering

In this document, we will manually calculate the K-Means clustering for a dataset containing customer data. The dataset consists of three features: Income, Expenditure, and Savings. We will follow the K-Means clustering algorithm steps, including initialization, assignment, and update of centroids.

2.4.1 Step 1: Data Preparation

The customer data is as follows:

   Customer   Income Expenditure Savings
 Customer 1     7000        3000    4000
 Customer 2     4500        2000    2500
 Customer 3     8000        4000    4000
 Customer 4     5500        3500    2000
 Customer 5     6000        2500    3500

2.4.2 Step 2: Initialization

Let’s assume we randomly select the following points as initial centroids:

  • Centroid 1: Customer 1 \((7000, 3000, 4000)\)
  • Centroid 2: Customer 2 \((4500, 2000, 2500)\)
  • Centroid 3: Customer 3 \((8000, 4000, 4000)\)

2.4.3 Step 3: Assignment

We will calculate the Euclidean distance from each customer to each centroid and assign each customer to the nearest centroid. The formula for Euclidean distance is:

\[ d(P_i, P_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} \]

Calculate Distances:

  • Customer 1 \((7000, 3000, 4000)\):
    • Distance to Centroid 1: \[ d = 0 \]
    • Distance to Centroid 2: \[ d = \sqrt{(7000-4500)^2 + (3000-2000)^2 + (4000-2500)^2} \approx 3082.21 \]
    • Distance to Centroid 3: \[ d = \sqrt{(7000-8000)^2 + (3000-4000)^2 + (4000-4000)^2} \approx 1414.21 \]
  • Customer 2 \((4500, 2000, 2500)\):
    • Distance to Centroid 1: \[ d \approx 3082.21 \]
    • Distance to Centroid 2: \[ d = 0 \]
    • Distance to Centroid 3: \[ d = \sqrt{(4500-8000)^2 + (2000-4000)^2 + (2500-4000)^2} \approx 4301.16 \]
  • Customer 3 \((8000, 4000, 4000)\):
    • Distance to Centroid 1: \[ d \approx 1414.21 \]
    • Distance to Centroid 2: \[ d \approx 4301.16 \]
    • Distance to Centroid 3: \[ d = 0 \]
  • Customer 4 \((5500, 3500, 2000)\):
    • Distance to Centroid 1: \[ d \approx 2549.51 \]
    • Distance to Centroid 2: \[ d \approx 1870.83 \]
    • Distance to Centroid 3: \[ d \approx 3240.37 \]
  • Customer 5 \((6000, 2500, 3500)\):
    • Distance to Centroid 1: \[ d \approx 1224.74 \]
    • Distance to Centroid 2: \[ d \approx 1870.83 \]
    • Distance to Centroid 3: \[ d \approx 2549.51 \]

2.4.4 Step 4: Assign Customers to Clusters

Based on the distances calculated, we assign each customer to the nearest centroid:

  • Customer 1: Cluster 1 (Centroid 1)
  • Customer 2: Cluster 2 (Centroid 2)
  • Customer 3: Cluster 3 (Centroid 3)
  • Customer 4: Cluster 2 (Centroid 2)
  • Customer 5: Cluster 1 (Centroid 1)

2.4.5 Step 5: Update Centroids

Next, we calculate the new centroids for each cluster:

  1. Cluster 1 (Customers 1 and 5):

    • New Income = \(\frac{7000 + 6000}{2} = 6500\)
    • New Expenditure = \(\frac{3000 + 2500}{2} = 2750\)
    • New Savings = \(\frac{4000 + 3500}{2} = 3750\)
    • New Centroid = \((6500, 2750, 3750)\)
  2. Cluster 2 (Customers 2 and 4):

    • New Income = \(\frac{4500 + 5500}{2} = 5000\)
    • New Expenditure = \(\frac{2000 + 3500}{2} = 2750\)
    • New Savings = \(\frac{2500 + 2000}{2} = 2250\)
    • New Centroid = \((5000, 2750, 2250)\)
  3. Cluster 3 (Customer 3):

    • Centroid = \((8000, 4000, 4000)\)

2.4.6 Repeat Steps

You would repeat the assignment and update steps until the centroids no longer change significantly.

Notes: This manual calculation provides a basic understanding of how K-Means clustering works, including the assignment of points to clusters and the update of centroids based on the mean of the assigned points. For larger datasets the process quickly becomes tedious, and it is typically performed with library implementations in languages such as R and Python.
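The steps above can be sketched end-to-end in NumPy (an illustrative implementation, not a production one; ties and empty clusters are not handled):

```python
import numpy as np

# Rows: customers; columns: income, expenditure, savings
X = np.array([
    [7000, 3000, 4000],   # Customer 1
    [4500, 2000, 2500],   # Customer 2
    [8000, 4000, 4000],   # Customer 3
    [5500, 3500, 2000],   # Customer 4
    [6000, 2500, 3500],   # Customer 5
], dtype=float)

# Step 2: initial centroids = Customers 1, 2 and 3
centroids = X[[0, 1, 2]].copy()

for _ in range(100):
    # Steps 3-4: assign each point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    labels = dists.argmin(axis=1)
    # Step 5: recompute each centroid as the mean of its cluster
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(3)])
    if np.allclose(new_centroids, centroids):
        break  # centroids stopped moving: converged
    centroids = new_centroids

print(labels)                  # [0 1 2 1 0]
print(np.round(centroids, 2))
```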

2.5 Use Vector in Python

Click here
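In case the linked page is unavailable, here is a minimal self-contained sketch of this chapter's vector operations in Python with NumPy:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

print(u + v)                  # addition:        [5. 7. 9.]
print(u - v)                  # subtraction:     [-3. -3. -3.]
print(2 * u)                  # scalar multiple: [2. 4. 6.]
print(np.dot(u, v))           # dot product:     32.0
print(np.cross(u, v))         # cross product:   [-3.  6. -3.]
print(np.linalg.norm(u))      # magnitude:       about 3.742
print(u / np.linalg.norm(u))  # unit vector
```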