Let me walk you through each step of PCA with this example:

- Data Preparation
  - We start with a 10×2 dataset containing two features
  - Each row represents one observation with two measurements
- Standardization (Step 1)
  - Center the data by subtracting the mean of each feature
  - Scale each feature to unit variance
  - This ensures all features contribute equally to the analysis
- Covariance Matrix Calculation (Step 2)
  - Compute the covariance matrix to capture the relationships between variables
  - For our 2D data, this results in a 2×2 matrix
  - The diagonal elements are the variances of the individual features
  - The off-diagonal elements are the covariances between features
- Eigendecomposition (Step 3)
  - Calculate the eigenvalues and eigenvectors of the covariance matrix
  - Eigenvalues give the amount of variance explained by each principal component
  - Eigenvectors give the directions of the principal components
- Principal Components (Step 4)
  - Project the standardized data onto the eigenvectors
  - The first principal component (PC1) captures the maximum variance
  - The second principal component (PC2) is orthogonal to PC1 and captures the remaining variance
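As an aside, the same four steps can be collapsed into a single SVD of the standardized data, which is how most libraries compute PCA in practice. A minimal sketch under that equivalence (variable names here are illustrative, and the toy data matches the listing below):

```python
import numpy as np

# Same 10x2 toy dataset as in the full listing below
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Standardize: zero mean, unit variance per feature
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD of the standardized data: the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)

# Eigenvalues of the covariance matrix relate to singular values as S**2 / (n - 1)
eigvals = S**2 / (Xs.shape[0] - 1)

# Projection onto the principal components (equivalently, U * S)
PC = Xs @ Vt.T
```

The SVD route skips forming the covariance matrix explicitly, which is numerically more stable for wide or ill-conditioned data.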
Key Results Interpretation:

- Explained Variance Ratio
  - Shows what fraction of the total variance each principal component explains
  - Helps determine how many components to keep
  - In this example, if PC1 explains >80% of the variance, we might keep only one component
- Transformed Data
  - The coordinates of each observation in the new principal component space
  - Keeping only the leading components gives a lower-dimensional representation of the data
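To make the "how many components" decision concrete, a common rule of thumb is to keep the smallest number of components whose cumulative explained variance crosses a chosen threshold. A short sketch (the ratios and the 0.80 threshold are illustrative, not computed from the data):

```python
import numpy as np

# Example explained-variance ratios (illustrative values)
explained_variance_ratio = np.array([0.85, 0.15])

# Cumulative share of variance explained by the first k components
cumulative = np.cumsum(explained_variance_ratio)

# Keep the smallest k whose cumulative share reaches the threshold
threshold = 0.80
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(n_components)  # 1 here, since PC1 alone explains 85%
```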
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
# Create sample dataset
np.random.seed(42)
X = np.array([
    [2.5, 2.4],
    [0.5, 0.7],
    [2.2, 2.9],
    [1.9, 2.2],
    [3.1, 3.0],
    [2.3, 2.7],
    [2.0, 1.6],
    [1.0, 1.1],
    [1.5, 1.6],
    [1.1, 0.9]
])
def plot_data(X, title):
    plt.figure(figsize=(8, 6))
    plt.scatter(X[:, 0], X[:, 1], c='blue', alpha=0.5)
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.grid(True)
    plt.axis('equal')
    return plt
# Step 1: Standardize the data
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
# Step 2: Calculate covariance matrix
# (note: np.cov divides by n-1 while StandardScaler divides by n, so the
# diagonal comes out as n/(n-1) ≈ 1.11 here rather than exactly 1)
covariance_matrix = np.cov(X_standardized.T)
# Step 3: Calculate eigenvalues and eigenvectors
# eigh is the appropriate routine for a symmetric matrix; it always returns
# real eigenvalues (in ascending order, which we re-sort below)
eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)
# Sort eigenvalues and eigenvectors in descending order
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Step 4: Project data onto principal components
PC = X_standardized.dot(eigenvectors)
# Calculate explained variance ratio
explained_variance_ratio = eigenvalues / np.sum(eigenvalues)
# Print results
print("Original Data:")
print(X)
print("\nStandardized Data:")
print(X_standardized)
print("\nCovariance Matrix:")
print(covariance_matrix)
print("\nEigenvalues:")
print(eigenvalues)
print("\nEigenvectors:")
print(eigenvectors)
print("\nExplained Variance Ratio:")
print(explained_variance_ratio)
print("\nTransformed Data (Principal Components):")
print(PC)
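Since scikit-learn is already imported above, the manual result can be cross-checked against sklearn.decomposition.PCA. The components may differ in sign, because an eigenvector's direction is only defined up to ±1, so the comparison is made on absolute values:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

X_standardized = StandardScaler().fit_transform(X)

# scikit-learn's PCA on the same standardized data
pca = PCA(n_components=2)
PC_sklearn = pca.fit_transform(X_standardized)

# Manual eigendecomposition for comparison, sorted to descending eigenvalues
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_standardized.T))
idx = eigenvalues.argsort()[::-1]
PC_manual = X_standardized @ eigenvectors[:, idx]

# The projections agree up to a per-component sign flip
print(np.allclose(np.abs(PC_manual), np.abs(PC_sklearn)))  # True
```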