Let me walk you through each step of PCA with this example:
- Data Preparation
  - We start with a 10×2 dataset: 10 observations, each with two features
  - Each row represents one observation with two measurements
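The dataset is just a small hard-coded NumPy array; the snippet below creates it on its own (it is the same array used in the full script at the end):

```python
import numpy as np

# 10 observations x 2 features
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
print(X.shape)  # (10, 2)
```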
- Standardization (Step 1)
  - Center the data by subtracting each feature's mean
  - Scale each feature to unit variance by dividing by its standard deviation
  - This ensures all features contribute equally to the analysis
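As a minimal sketch of this step, z-scoring by hand matches what scikit-learn's StandardScaler (used in the full script below) produces:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Z-score each column: subtract the mean, divide by the standard deviation.
# StandardScaler uses the population std (ddof=0), so we match that here.
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

print(X_standardized.mean(axis=0))  # ~[0, 0] up to floating-point error
print(X_standardized.std(axis=0))   # [1, 1]
```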
- Covariance Matrix Calculation (Step 2)
  - Compute the covariance matrix to understand relationships between variables
  - For our 2D data, this results in a 2×2 matrix
  - The diagonal elements represent variances
  - Off-diagonal elements represent covariances between variables
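A sketch of this step, reusing the z-scoring from Step 1; note that `np.cov` expects variables in rows, so the standardized matrix is transposed:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# np.cov treats each row as a variable, hence the transpose.
covariance_matrix = np.cov(X_standardized.T)

print(covariance_matrix)           # 2x2 symmetric matrix
print(np.diag(covariance_matrix))  # variances (n/(n-1) ~ 1.11 here, since
                                   # np.cov divides by n-1)
print(covariance_matrix[0, 1])     # off-diagonal: covariance between features
```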
- Eigendecomposition (Step 3)
  - Calculate eigenvalues and eigenvectors of the covariance matrix
  - Eigenvalues tell us the amount of variance explained by each principal component
  - Eigenvectors give us the direction of the principal components
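A sketch of the decomposition itself. Since a covariance matrix is always symmetric, `np.linalg.eigh` is a natural choice (the full script uses the more general `np.linalg.eig` plus an explicit sort, which gives the same result here). The matrix values below are placeholders, not the actual Step 2 output:

```python
import numpy as np

# A 2x2 covariance matrix (placeholder values; use the one from Step 2).
covariance_matrix = np.array([[1.11, 1.02],
                              [1.02, 1.11]])

# eigh is built for symmetric matrices: real eigenvalues and orthonormal
# eigenvectors, returned in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

# Reverse so the largest eigenvalue (PC1) comes first.
eigenvalues = eigenvalues[::-1]
eigenvectors = eigenvectors[:, ::-1]

print(eigenvalues)         # variance along each principal axis
print(eigenvectors[:, 0])  # columns are eigenvectors; this is PC1's direction
```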
- Principal Components (Step 4)
  - Project the standardized data onto the principal components
  - The first principal component (PC1) points in the direction of maximum variance
  - The second principal component (PC2) is orthogonal to PC1 and captures the remaining variance
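The projection itself is a single matrix product. A sketch chaining the previous steps, with two sanity checks:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_standardized.T))
eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]

# Each row of X_standardized is re-expressed in the eigenvector basis.
PC = X_standardized @ eigenvectors

# Sanity checks: per-component variance equals the eigenvalues, and the
# components are uncorrelated.
print(PC.var(axis=0, ddof=1))        # matches eigenvalues
print(np.cov(PC.T)[0, 1].round(10))  # ~0: PC1 and PC2 are uncorrelated
```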
Key Results Interpretation:
- Explained Variance Ratio
  - Shows how much variance each principal component explains
  - Helps determine how many components to keep
  - In this example, if PC1 explains more than 80% of the variance, we might keep only one component (see the sketch below)
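A sketch of that rule of thumb, assuming the eigenvalues are already sorted in descending order; the values below are placeholders, not the actual output for this dataset:

```python
import numpy as np

eigenvalues = np.array([1.9, 0.3])  # placeholder values from Step 3

explained_variance_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative ratio reaches 80%.
n_components = int(np.searchsorted(cumulative, 0.80) + 1)

print(explained_variance_ratio)  # e.g. [0.86 0.14]
print(n_components)              # 1
```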
- Transformed Data
  - The final coordinates in the new principal component space
  - Can be used for dimensionality reduction by keeping fewer components (see the sketch below)
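To actually reduce dimensionality, keep only the first k columns of the sorted eigenvector matrix before projecting. A sketch with k = 1:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_standardized.T))
order = eigenvalues.argsort()[::-1]
eigenvectors = eigenvectors[:, order]

# Keep only the first principal component: 2D -> 1D.
k = 1
X_reduced = X_standardized @ eigenvectors[:, :k]
print(X_reduced.shape)  # (10, 1)
```

The complete script below puts all four steps together, including the plotting helper and printed intermediate results.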
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Create sample dataset
np.random.seed(42)
X = np.array([
    [2.5, 2.4],
    [0.5, 0.7],
    [2.2, 2.9],
    [1.9, 2.2],
    [3.1, 3.0],
    [2.3, 2.7],
    [2.0, 1.6],
    [1.0, 1.1],
    [1.5, 1.6],
    [1.1, 0.9]
])

def plot_data(X, title):
    plt.figure(figsize=(8, 6))
    plt.scatter(X[:, 0], X[:, 1], c='blue', alpha=0.5)
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.grid(True)
    plt.axis('equal')
    return plt

# Step 1: Standardize the data
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# Step 2: Calculate covariance matrix
covariance_matrix = np.cov(X_standardized.T)

# Step 3: Calculate eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)

# Sort eigenvalues and eigenvectors in descending order
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Step 4: Project data onto principal components
PC = X_standardized.dot(eigenvectors)

# Calculate explained variance ratio
explained_variance_ratio = eigenvalues / np.sum(eigenvalues)

# Print results
print("Original Data:")
print(X)
print("\nStandardized Data:")
print(X_standardized)
print("\nCovariance Matrix:")
print(covariance_matrix)
print("\nEigenvalues:")
print(eigenvalues)
print("\nEigenvectors:")
print(eigenvectors)
print("\nExplained Variance Ratio:")
print(explained_variance_ratio)
print("\nTransformed Data (Principal Components):")
print(PC)
```
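As a sanity check (and a practical alternative), scikit-learn's built-in PCA should reproduce the same transformed data up to the sign of each component, which is arbitrary. A sketch, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

pca = PCA(n_components=2)
PC_sklearn = pca.fit_transform(StandardScaler().fit_transform(X))

print(PC_sklearn)                     # matches PC above up to column signs
print(pca.explained_variance_ratio_)  # matches the manual ratio
```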
Read also: How to code a binary classifier in Python