Let me walk you through each step of PCA with this example:
- Data Preparation
  - We start with a 10×2 dataset: 10 observations, each with two features
  - Each row represents one observation with two measurements
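The dataset is just a small hard-coded NumPy array; the snippet below creates it on its own (it is the same array used in the full script at the end):

```python
import numpy as np

# 10 observations x 2 features
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
print(X.shape)  # (10, 2)
```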
- Standardization (Step 1)
  - Center the data by subtracting each feature's mean
  - Scale each feature to unit variance by dividing by its standard deviation
  - This ensures all features contribute equally to the analysis
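As a minimal sketch of this step, z-scoring by hand matches what scikit-learn's StandardScaler (used in the full script below) produces:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Z-score each column: subtract the mean, divide by the standard deviation.
# StandardScaler uses the population std (ddof=0), so we match that here.
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

print(X_standardized.mean(axis=0))  # ~[0, 0] up to floating-point error
print(X_standardized.std(axis=0))   # [1, 1]
```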
- Covariance Matrix Calculation (Step 2)
  - Compute the covariance matrix to understand relationships between variables
  - For our 2D data, this results in a 2×2 matrix
  - The diagonal elements represent variances
  - Off-diagonal elements represent covariances between variables
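A sketch of this step, reusing the z-scoring from Step 1; note that `np.cov` expects variables in rows, so the standardized matrix is transposed:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# np.cov treats each row as a variable, hence the transpose.
covariance_matrix = np.cov(X_standardized.T)

print(covariance_matrix)           # 2x2 symmetric matrix
print(np.diag(covariance_matrix))  # variances (n/(n-1) ~ 1.11 here, since
                                   # np.cov divides by n-1)
print(covariance_matrix[0, 1])     # off-diagonal: covariance between features
```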
- Eigendecomposition (Step 3)
  - Calculate eigenvalues and eigenvectors of the covariance matrix
  - Eigenvalues tell us the amount of variance explained by each principal component
  - Eigenvectors give us the direction of the principal components
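A sketch of the decomposition itself. Since a covariance matrix is always symmetric, `np.linalg.eigh` is a natural choice (the full script uses the more general `np.linalg.eig` plus an explicit sort, which gives the same result here). The matrix values below are placeholders, not the actual Step 2 output:

```python
import numpy as np

# A 2x2 covariance matrix (placeholder values; use the one from Step 2).
covariance_matrix = np.array([[1.11, 1.02],
                              [1.02, 1.11]])

# eigh is built for symmetric matrices: real eigenvalues and orthonormal
# eigenvectors, returned in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

# Reverse so the largest eigenvalue (PC1) comes first.
eigenvalues = eigenvalues[::-1]
eigenvectors = eigenvectors[:, ::-1]

print(eigenvalues)         # variance along each principal axis
print(eigenvectors[:, 0])  # columns are eigenvectors; this is PC1's direction
```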
- Principal Components (Step 4)
  - Project the standardized data onto the principal components
  - The first principal component (PC1) points in the direction of maximum variance
  - The second principal component (PC2) is orthogonal to PC1 and captures the remaining variance
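The projection itself is a single matrix product. A sketch chaining the previous steps, with two sanity checks:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_standardized.T))
eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]

# Each row of X_standardized is re-expressed in the eigenvector basis.
PC = X_standardized @ eigenvectors

# Sanity checks: per-component variance equals the eigenvalues, and the
# components are uncorrelated.
print(PC.var(axis=0, ddof=1))        # matches eigenvalues
print(np.cov(PC.T)[0, 1].round(10))  # ~0: PC1 and PC2 are uncorrelated
```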
Key Results Interpretation:
- Explained Variance Ratio
  - Shows how much variance each principal component explains
  - Helps determine how many components to keep
  - In this example, if PC1 explains more than 80% of the variance, we might keep only one component (see the sketch below)
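A sketch of that rule of thumb, assuming the eigenvalues are already sorted in descending order; the values below are placeholders, not the actual output for this dataset:

```python
import numpy as np

eigenvalues = np.array([1.9, 0.3])  # placeholder values from Step 3

explained_variance_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative ratio reaches 80%.
n_components = int(np.searchsorted(cumulative, 0.80) + 1)

print(explained_variance_ratio)  # e.g. [0.86 0.14]
print(n_components)              # 1
```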
- Transformed Data
  - The final coordinates in the new principal component space
  - Can be used for dimensionality reduction by keeping fewer components (see the sketch below)
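To actually reduce dimensionality, keep only the first k columns of the sorted eigenvector matrix before projecting. A sketch with k = 1:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_standardized.T))
order = eigenvalues.argsort()[::-1]
eigenvectors = eigenvectors[:, order]

# Keep only the first principal component: 2D -> 1D.
k = 1
X_reduced = X_standardized @ eigenvectors[:, :k]
print(X_reduced.shape)  # (10, 1)
```

The complete script below puts all four steps together, including the plotting helper and printed intermediate results.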
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Create sample dataset
np.random.seed(42)
X = np.array([
    [2.5, 2.4],
    [0.5, 0.7],
    [2.2, 2.9],
    [1.9, 2.2],
    [3.1, 3.0],
    [2.3, 2.7],
    [2.0, 1.6],
    [1.0, 1.1],
    [1.5, 1.6],
    [1.1, 0.9]
])

def plot_data(X, title):
    plt.figure(figsize=(8, 6))
    plt.scatter(X[:, 0], X[:, 1], c='blue', alpha=0.5)
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.grid(True)
    plt.axis('equal')
    return plt

# Step 1: Standardize the data
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# Step 2: Calculate covariance matrix
covariance_matrix = np.cov(X_standardized.T)

# Step 3: Calculate eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)

# Sort eigenvalues and eigenvectors in descending order
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Step 4: Project data onto principal components
PC = X_standardized.dot(eigenvectors)

# Calculate explained variance ratio
explained_variance_ratio = eigenvalues / np.sum(eigenvalues)

# Print results
print("Original Data:")
print(X)
print("\nStandardized Data:")
print(X_standardized)
print("\nCovariance Matrix:")
print(covariance_matrix)
print("\nEigenvalues:")
print(eigenvalues)
print("\nEigenvectors:")
print(eigenvectors)
print("\nExplained Variance Ratio:")
print(explained_variance_ratio)
print("\nTransformed Data (Principal Components):")
print(PC)
```
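As a sanity check (and a practical alternative), scikit-learn's built-in PCA should reproduce the same transformed data up to the sign of each component, which is arbitrary. A sketch, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

pca = PCA(n_components=2)
PC_sklearn = pca.fit_transform(StandardScaler().fit_transform(X))

print(PC_sklearn)                     # matches PC above up to column signs
print(pca.explained_variance_ratio_)  # matches the manual ratio
```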
Read also: How to code a binary classifier in Python