Linear Discriminant Analysis (LDA) with Scikit

November 14, 2017 / May 16, 2021 / RP / 1 Comment

Linear Discriminant Analysis (LDA) is similar to Principal Component Analysis (PCA) in that both reduce dimensionality. However, there are certain nuances with LDA that we should be aware of-

LDA is supervised (it needs a categorical dependent variable): it finds the linear combinations of the original variables that give the maximum separation among the different groups. PCA, on the other hand, is unsupervised.
LDA can also be used for classification, whereas PCA is generally used for unsupervised learning.
LDA doesn’t need the number of discriminants to be passed ahead of time. Generally speaking, the number of discriminants will be the lower of the number of variables and the number of categories minus 1.
LDA can, in certain cases, be conducted without standardizing or normalizing the variables, because rescaling a variable rescales its discriminant coefficient without changing the separation between the groups.
LDA is generally preferred for bigger data sets and machine learning work.

Let the action begin now-

[Images: lda1, LDA2, LDA3, LDA4, LDA5]
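As a rough illustration of the points above, here is a minimal sketch that fits LDA on scikit-learn's bundled Iris data. The dataset, the split and the variable names are assumptions made for illustration; this is not the original case study.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Assumption: Iris stands in for the data used in the original post
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# LDA is supervised: fit() needs the class labels as well as the features.
# n_components is not passed, so scikit-learn keeps
# min(number of variables, number of categories - 1) discriminants.
lda = LinearDiscriminantAnalysis()
X_train_lda = lda.fit_transform(X_train, y_train)
n_discriminants = X_train_lda.shape[1]
print("Number of discriminants:", n_discriminants)
print("Share of between-group variance:", lda.explained_variance_ratio_)

# The same fitted object can also be used as a classifier
accuracy = lda.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

With 4 variables and 3 categories, the sketch ends up with 2 discriminants, which is the lower of 4 and 3 - 1.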

Cheers!

Principal Component Analysis (PCA) using Scikit

November 14, 2017 / May 16, 2021 / RP / 2 Comments

Principal Component Analysis (PCA) is generally used as an unsupervised algorithm for reducing data dimensions to address the Curse of Dimensionality, and also for detecting outliers, removing noise, speech recognition and other such areas.

The underlying algorithm in PCA is generally a linear algebra technique called Singular Value Decomposition (SVD). PCA takes the original data and creates orthogonal (uncorrelated) components that capture most of the information contained in the original data with a significantly smaller number of components.

Either the components themselves or the key loadings of the components can be plugged into any further modeling work in place of the original data, to minimize information redundancy and noise.

There are three main ways to select the right number of components-

The selected components should together explain at least 80% of the original data variance or information [Preferred One]
The eigenvalue of each retained component should be greater than or equal to 1. On standardized data this means that each component should express at least one variable's worth of information
Elbow or Scree method- plot the percentage of variance explained by each component and keep the components up to the point where an elbow or kink is visible.

You can use any one of the above or a combination of them to select the right number of components. It is critical to standardize or normalize the data before conducting PCA.

In the case study below we will use the first criterion shown above, i.e. the selected components should explain 80% or more of the original data variance.

[Images: PCA1, PCA2, PCA3, PCA4, PCA5, PCA6]
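The 80% criterion can be sketched as follows. This is only an illustration: scikit-learn's bundled Wine data stands in for the case study data, and the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the Wine data stands in for the original case study data
X = load_wine().data

# Standardize first, otherwise variables on large scales dominate
X_std = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance
pca_full = PCA().fit(X_std)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Criterion 1: smallest number of components explaining at least 80%
n_components = int(np.argmax(cum_var >= 0.80)) + 1
print("Components needed for 80% variance:", n_components)

# Criterion 2, for comparison: components with eigenvalue >= 1
n_kaiser = int(np.sum(pca_full.explained_variance_ >= 1.0))
print("Components with eigenvalue >= 1:", n_kaiser)

# Refit with the selected number of components
X_reduced = PCA(n_components=n_components).fit_transform(X_std)
print("Reduced shape:", X_reduced.shape)
```

scikit-learn also accepts a fraction, e.g. PCA(n_components=0.80), which picks the number of components needed to reach that share of variance.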

Decision Tree using Python Scikit

November 4, 2017 / May 16, 2021 / RP / 1 Comment

If you are not familiar with Decision Trees, please read this article first.

First let’s look at a very simple example on the Iris data-

Decision Tree in Python
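A simple Iris example of this kind might look like the sketch below. The depth limit, the split and the variable names are assumptions made for illustration, not the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# Assumption: a shallow tree (max_depth=3) keeps the rules readable
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

accuracy = tree.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")

# Print the learned splits as plain-text rules
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```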

Now let’s look at slightly more complex data-

Let’s first build a logistic regression model in Python using machine learning library Scikit. Please read here about the dataset and dummy coding.

[Images: clf1, clf2, clf3, clf4, clf5, clf6, clf7]

[Images: dt1, dt2, dt3, dt4]

Cheers!

Logistic Regression using Scikit Python

November 4, 2017 / May 16, 2021 / RP / 1 Comment

If you are not familiar with logistic regression, please read this article first. Moreover, if you are not familiar with the sklearn machine learning model building process, please read this article also.

Assuming you are now familiar, this is how you can build a logistic regression model in Python using machine learning library Scikit. Please read here about the dataset and dummy coding.

[Images: clf1, clf2, clf3, clf4, clf5, clf6, clf7]

[Images: clf8, clf9, clf10]
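For readers who want something to run, here is a minimal sketch of the same model building flow. It uses scikit-learn's bundled breast cancer data as a stand-in for the churn data, so the dataset, the pipeline and the variable names are all assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a bundled binary dataset stands in for the churn data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale the features, then fit the logistic regression
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Predict classes and class probabilities on the test data
y_pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print("Confusion matrix:\n", cm)
```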

Cheers!

Categorical Variables Dummy Coding

November 4, 2017 / May 16, 2021 / RP / 3 Comments

Converting categorical variables into numerical dummy coded variables is generally a requirement in machine learning libraries such as Scikit, as they mostly work on numpy arrays.

In this blog, let’s look at how we can convert a bunch of categorical variables into numerical dummy coded variables using four different methods-

Scikit learn preprocessing LabelEncoder
Pandas get_dummies
Looping
Mapping

We will work with a dataset from IBM Watson blog as this has plenty of categorical variables. You can find the data here. In this data, we are trying to build a model to predict “churn”, which has two levels “Yes” and “No”.

We will convert the dependent variable using Scikit LabelEncoder and the independent categorical variables using Pandas get_dummies. Please note that LabelEncoder replaces the labels with integers and does not create additional columns, whereas get_dummies creates additional columns in the data (one per category level). We will see that in the below example-

[Images: clf1, clf2, clf3, clf4, clf5, clf6, clf7]
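The difference between the two encoders can be seen on a tiny example. The small DataFrame below is hypothetical (its column names only imitate a churn dataset) and is not the IBM data itself.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "PaperlessBilling": ["Yes", "No", "No", "Yes"],
    "Churn": ["Yes", "No", "No", "Yes"],
})

# LabelEncoder: one column in, one integer column out.
# Classes are sorted alphabetically, so No -> 0 and Yes -> 1.
y = LabelEncoder().fit_transform(df["Churn"])
print(y)

# get_dummies: one new 0/1 column per category level.
# drop_first=True drops one level per variable to avoid redundancy.
X = pd.get_dummies(df.drop(columns="Churn"), drop_first=True, dtype=int)
print(X)
```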

Here are a few other ways to do dummy coding-

[Images: dummy_coding1, dummy_coding2, dummy_coding3]
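Mapping and looping can be sketched as below. Again the toy DataFrame and the new column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["One year", "Two year", "One year"],
})

# Mapping: an explicit dictionary from label to number
df["Partner_num"] = df["Partner"].map({"No": 0, "Yes": 1})

# Looping: build one 0/1 column per category level by hand
for level in sorted(df["Contract"].unique()):
    df[f"Contract_{level}"] = (df["Contract"] == level).astype(int)

print(df)
```

Mapping gives full control over which number each label receives, which is handy for ordered categories.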

Here is an excellent Kaggle Kernel for detailed feature engineering.

Cheers!

Hierarchical Clustering with Python

November 4, 2017 / May 16, 2021 / RP / 2 Comments

As highlighted in the article, clustering and segmentation play an instrumental role in Data Science. In this blog, we will show you how to build a hierarchical clustering model with Python.

For this purpose, we will work with an R dataset called “Cheese”. Please install the package called “bayesm” in R and export this data set in csv format to be imported in Python. More on this dataset can be found here.

Let’s begin with the clustering in Python then.

[Images: hclust1, hclust2, hclust3, hclust4, hclust5, hclust6, hclust7, hclust8]
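Since the Cheese data has to be exported from R first, here is a minimal sketch on synthetic data instead. The data, the Ward linkage and the choice of 3 clusters are assumptions made for illustration.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic blobs stand in for the Cheese data
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=42)

# Standardize so that no variable dominates the distance computation
X_std = StandardScaler().fit_transform(X)

# Build the hierarchy bottom-up; each row of Z records one merge
Z = linkage(X_std, method="ward")

# Cut the tree so that at most 3 clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")
print("Cluster sizes:", {int(c): int((labels == c).sum()) for c in set(labels)})
```

The merge history in Z can be drawn with scipy.cluster.hierarchy.dendrogram(Z) to choose the number of clusters visually.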

Cheers!

KMeans Clustering: Core Concepts, Assumptions, and Key Equations

November 4, 2017 / August 10, 2025 / RP / 3 Comments

Overview:
KMeans is an unsupervised machine learning algorithm used to partition data into a specified number of clusters (k). Each cluster is defined by its centroid, and the algorithm aims to minimize the distance between data points and their assigned cluster centroids.

Core Concepts:

Clusters and Centroids:
- A cluster is a group of data points that are similar to each other.
- The centroid is the mean position of all the points in a cluster.
Assignment and Update Steps:
- Assignment: Each data point is assigned to the nearest centroid.
- Update: The centroids are recalculated as the mean of all points assigned to each cluster.
Iterative Optimization:
- The assignment and update steps are repeated until the centroids no longer change significantly or a maximum number of iterations is reached.

Assumptions:

The number of clusters (k) is known and fixed in advance.
Clusters are roughly spherical and equally sized.
Data points are closer to their own cluster centroid than to others.
The algorithm is sensitive to the initial placement of centroids.

Key Equations:

Distance Calculation:
- The most common distance metric is Euclidean distance.
- For a data point x and centroid c:
  Distance = sqrt( (x1 – c1)^2 + (x2 – c2)^2 + … + (xn – cn)^2 )
Centroid Update:
- For each cluster, the new centroid is the mean of all points assigned to that cluster.
- Centroid for cluster j:
  cj = (1 / Nj) * sum(xi)
  where Nj is the number of points in cluster j, and xi are the points in cluster j.
Objective Function (Inertia):
- KMeans minimizes the sum of squared distances (inertia) between each point and its assigned centroid.
- Inertia = sum over all clusters j [ sum over all points i in cluster j (distance(xi, cj))^2 ]

Algorithm Steps:

Choose k initial centroids (randomly or using a method like k-means++).
Assign each data point to the nearest centroid.
Recalculate centroids as the mean of assigned points.
Repeat steps 2 and 3 until centroids stabilize.
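
The four steps above can be sketched from scratch in NumPy. This is a teaching sketch under simple assumptions (random initial centroids, Euclidean distance, hypothetical function and variable names), not scikit-learn's implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain KMeans: returns labels, centroids and inertia."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and inertia (sum of squared distances)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float((dists[np.arange(len(X)), labels] ** 2).sum())
    return labels, centroids, inertia

# Two well-separated groups of 20 points each
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
demo_labels, demo_centroids, demo_inertia = kmeans(X_demo, k=2)
print(demo_labels)
print(demo_centroids.round(2))
print(f"Inertia: {demo_inertia:.3f}")
```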

Limitations:

Sensitive to outliers and noise.
May converge to a local minimum (results can vary with different initializations).
Not suitable for clusters with non-spherical shapes or very different sizes.

Applications:

Market segmentation
Image compression
Document clustering
Anomaly detection

# Simple KMeans Clustering Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate synthetic data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Elbow method to find optimal k
inertia = []
k_range = range(1, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
plt.figure(figsize=(6,4))
plt.plot(k_range, inertia, 'bo-')
plt.axvline(x=4, color='red', linestyle='--', label='Optimal k=4')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.legend()
plt.grid(True)
plt.show()

# Fit KMeans with optimal k (choose visually, e.g., k=4)
k_opt = 4
kmeans = KMeans(n_clusters=k_opt, random_state=42)
labels = kmeans.fit_predict(X)

# Plot clusters
plt.figure(figsize=(7,5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', s=200, alpha=0.75, marker='X', label='Centers')
plt.title(f'KMeans Clustering (k={k_opt})')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

# Silhouette score
score = silhouette_score(X, labels)
print(f'Silhouette Score (k={k_opt}): {score:.3f}')

Silhouette Score (k=4): 0.876

# KMeans Clustering on Iris Dataset
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import pandas as pd

# Load Iris data
iris = load_iris()
X = iris.data

# Elbow method to find optimal k
inertia = []
k_range = range(1, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
k_opt = 3  # Set optimal k explicitly for Iris data
plt.figure(figsize=(6,4))
plt.plot(k_range, inertia, 'bo-')
plt.axvline(x=k_opt, color='red', linestyle='--', label='Optimal k=3')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k (Iris)')
plt.legend()
plt.grid(True)
plt.show()

# Fit KMeans with optimal k (choose visually, e.g., k=3)
kmeans = KMeans(n_clusters=k_opt, random_state=42)
labels = kmeans.fit_predict(X)

# Plot clusters (using first two features for visualization)
plt.figure(figsize=(7,5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', s=200, alpha=0.75, marker='X', label='Centers')
plt.title(f'Iris KMeans Clustering (k={k_opt})')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.legend()
plt.show()

# Plot clusters (using petal length and petal width for visualization)
plt.figure(figsize=(7,5))
plt.scatter(X[:, 2], X[:, 3], c=labels, cmap='viridis', s=50)
plt.scatter(kmeans.cluster_centers_[:, 2], kmeans.cluster_centers_[:, 3], c='red', s=200, alpha=0.75, marker='X', label='Centers')
plt.title(f'Iris KMeans Clustering (k={k_opt}) - Petal Length vs Petal Width')
plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
plt.legend()
plt.show()

# Silhouette score
score = silhouette_score(X, labels)
print(f'Silhouette Score (k={k_opt}): {score:.3f}')

# Number of observations in each cluster
unique, counts = np.unique(labels, return_counts=True)
for i, count in zip(unique, counts):
    print(f"Cluster {i}: {count} data points")

# Descriptive summary of each cluster (mean feature values)
df = pd.DataFrame(X, columns=iris.feature_names)
df['cluster'] = labels
print("\nCluster feature means:")
print(df.groupby('cluster').mean())

Cheers!

Python Machine Learning Linear Regression with Scikit-learn

October 31, 2017 / June 13, 2026 / RP / 3 Comments

What is a “Linear Regression”-

Linear regression is one of the most powerful and yet simplest machine learning algorithms. It is used for cases where the relationship between the dependent variable and one or more independent variables is assumed to be linear, in the following fashion-

Y = b0 + b1*X1 + b2*X2 + b3*X3 + …..

Here Y is the dependent variable and X1, X2, X3 etc. are independent variables. The purpose of building a linear regression model is to estimate the coefficients b0, b1, b2 etc. that give the least error in the prediction. More on the error will be discussed later in this article.

In the above equation, b0 is the intercept, b1 is the coefficient for variable X1, b2 is the coefficient for the variable X2 and so on…
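
To make the equation concrete, here is a small sketch that simulates data from known coefficients and then recovers them by least squares. The coefficient values, noise level and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate Y = b0 + b1*X1 + b2*X2 + noise with known coefficients
X = rng.normal(size=(200, 2))
y = 1.5 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Add a column of ones so that the intercept b0 is estimated too
A = np.column_stack([np.ones(len(X)), X])

# Least squares picks the coefficients that minimize the squared error
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coef
print(f"b0={b0:.3f}, b1={b1:.3f}, b2={b2:.3f}")
```

The estimates should land close to the true values 1.5, 2.0 and -3.0, because the simulated noise is small.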

What is a “Simple Linear Regression” and a “Multiple Linear Regression”?

When we have only one independent variable, the resulting regression is called a “Simple Linear Regression”. When we have 2 or more independent variables, the resulting regression is called a “Multiple Linear Regression”.

What are the requirements for the dependent and independent variables in the regression analysis?

The dependent variable in linear regression is generally numerical and continuous, such as sales in dollars, GDP, unemployment rate, pollution level, amount of rainfall etc. On the other hand, the independent variables can be either numeric or categorical. However, please note that the categorical variables will need to be dummy coded before we can use these variables for building a regression model in the sklearn library of Python.

What are some of the real world uses of linear regression?

As we discussed earlier, this is one of the most commonly used algorithms in ML. Some of the use cases are listed below-

Example 1-

Predict sales amount of a car company as a function of the # of models, new models, price, discount, GDP, interest rate, unemployment rate, competitive prices etc.

Example 2-

Predict weight gain/loss of a person as a function of calorie intake, junk food, genetics, exercise time and intensity, sleep, festival time, diet plans, medicines etc.

Example 3-

Predict house prices as a function of sqft, # of rooms, interest rate, parking, pollution level, distance from city center, population mix etc.

Example 4-

Predict GDP growth rate as a function of inflation, unemployment rate, investment, new business, weather pattern, resources, population

How do we evaluate linear regression model’s performance?

There are many metrics that can be used to evaluate a linear regression model’s performance and choose the best model. Some of the most commonly used metrics are-

Mean Square Error (MSE)- This is an error measure, so the lower the value the better. It is the average of the squared differences between the actual and the predicted values: MSE = (1 / n) * sum( (yi - ŷi)^2 ), where yi is the actual value, ŷi is the predicted value and n is the number of observations.

R Square- This is called the coefficient of determination and provides a gauge of the model’s explanatory power. For example, a linear regression model with an R Square of 0.70 or 70% would imply that 70% of the variation in the dependent variable can be explained by the model that has been built.
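
Both metrics can be computed by hand and checked against scikit-learn. The four actual and predicted values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

# MSE: average of the squared errors
mse = np.mean((y_true - y_pred) ** 2)

# R Square: 1 - (unexplained variation / total variation)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE by hand: {mse:.4f}  sklearn: {mean_squared_error(y_true, y_pred):.4f}")
print(f"R2 by hand : {r2:.4f}  sklearn: {r2_score(y_true, y_pred):.4f}")
```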

Assumptions of Linear Regression

The five assumptions

1. Linearity — E(Y|X) should follow a straight line, not a curve. Check with scatter plots and residual vs. fitted plots. Fix with transforms, polynomial terms, or nonlinear models.

2. Independence — Errors should not correlate across observations (common in time series or repeated measures). Check with Durbin–Watson or residuals vs. order. Fix with GLS, mixed models, or cluster-robust standard errors.

3. Homoscedasticity — Residual spread should stay constant across X. A funnel shape in the residual plot is a red flag. Fix with robust standard errors, WLS, or log transforms.

4. Normality — Residuals should be roughly bell-shaped. Matters most for small samples; large samples are often more forgiving. Check with Q–Q plots.

5. No multicollinearity — Predictors should not be almost redundant. High VIF can make individual coefficients unstable even when overall prediction is fine. Fix by dropping or combining predictors, or using ridge regression.
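
As one example of these checks, the VIF mentioned in assumption 5 can be sketched as follows. The helper name vif and the synthetic data are assumptions; the calculation regresses each predictor on the others and uses VIF = 1 / (1 - R²).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Variance inflation factor for each column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        # Regress predictor j on all the other predictors
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic predictors: x3 is almost a copy of x1, x2 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + rng.normal(scale=0.1, size=500)

vifs = vif(np.column_stack([x1, x2, x3]))
print(dict(zip(["x1", "x2", "x3"], vifs.round(1))))
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of multicollinearity; here x1 and x3 should score far above that while x2 stays near 1.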

How do we build a linear regression model in Python?

In this exercise, we will build a linear regression model on the California housing data set, which can be loaded through the scikit-learn library of Python (fetch_california_housing). However, before we go down the path of building a model, let’s talk about some of the basic steps in any machine learning model in Python.

In most cases, any machine learning algorithm in the sklearn library will follow these steps-

Split original data into features and label. In other words, create the dependent variable and the set of independent variables in two separate arrays. Please note this requirement exists only for supervised learning (where a dependent variable is present). For unsupervised learning, we don’t have a dependent variable and hence there is no need to split the data into features and label
Scale or normalize the features and label data. Please note that this is not a necessity for all algorithms and/or datasets. Also we are assuming that all the data cleaning and feature engineering such as missing value treatment, outlier treatment, bogus value fixes and dummy coding of the categorical variables have been done before doing this step
Create training and test data sets from the original data. The training data set will be used for training the model, whereas the test data set will be used for validating the accuracy or the prediction power of the model on a new dataset. We need to split both the features and the labels into training and test sets.
Create an instance of the model object that will be used for the modelling exercise. This process is called “Instantiation”. In simpler words, during this step we create the model object, with its settings, that will be trained in the next step.
“Fit” the model instance on the training data. During this step, the model is leveraging both the features and the label information provided in the training data to connect the features to the label. Please note that we are going with all the default options during fitting of the model. As you gain more expertise you may want to play with some parameter optimization, however we are just going with the defaults for now.
“Predict” using the model instance on test data. During this step, the model is only using the features information to predict the label.
Based on the predictions generated on the test data, we generate key performance indicators of model performance. For classification models this generally includes metrics such as Precision, Recall, F score, Confusion Matrix, Accuracy and Area Under the Curve (AUC); for regression models it includes Mean Square Error (MSE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) etc.
Once the model performance is evaluated and it is deemed to be satisfactory for the purpose of the business uses, we implement the model for new unseen data
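
The steps above can be sketched end to end on a small dataset that ships with scikit-learn. The diabetes data, the split and the variable names are assumptions for illustration; the full California housing example follows separately.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features and label in two separate arrays
X, y = load_diabetes(return_X_y=True)

# Training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling; this sketch learns the scaling from the training split only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Instantiate the model with default options, then fit on training data
model = LinearRegression()
model.fit(X_train_s, y_train)

# Predict on the test features only
y_pred = model.predict(X_test_s)

# Evaluate with regression metrics
r2 = r2_score(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
print(f"R2: {r2:.3f}  RMSE: {rmse:.2f}")

# Score "new" rows: apply the same scaler, then predict
new_pred = model.predict(scaler.transform(X_test[:3]))
print(new_pred.round(1))
```

Fitting the scaler on the training split is one common way to keep test set information out of the preprocessing; for plain linear regression the predictions do not depend on this choice, but for other algorithms it can matter.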

So let’s get started with building this model-

Overview

The dataset has 20,640 rows — one row per census block group in California (1990 U.S. Census). The goal is to predict median house value in a block from local demographic and housing features.

Target variable

Column	Description
MedHouseVal	Median house value in the block group, in $100,000 units (e.g. 2.5 ≈ $250,000). Values are capped at 5.0 ($500,000).

Features (8 predictors)

Column	Description
MedInc	Median income in the block group
HouseAge	Median age of houses in the block group
AveRooms	Average number of rooms per household
AveBedrms	Average number of bedrooms per household
Population	Total population in the block group
AveOccup	Average number of household members
Latitude	Block group latitude
Longitude	Block group longitude

			
import warnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
warnings.filterwarnings("ignore")
sns.set_theme(style="whitegrid")
# --- Load data & EDA ---
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df["MedHouseVal"] = housing.target
print("Shape:", df.shape)
print(df.head())
print(df.describe())
print("Missing values:\n", df.isnull().sum())
corr = df.corr()
print("\nCorrelation matrix:\n", corr.round(3))
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0, linewidths=0.5)
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show()
# --- STEP 1: features & label ---
X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]
# --- STEP 2: scale ---
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
# --- STEP 3: train/test split ---
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
# --- STEP 4 & 5: instantiate & fit ---
model = LinearRegression()
model.fit(X_train, y_train)
print(f"\nIntercept: {model.intercept_:.4f}")
coef_df = pd.DataFrame({"Feature": X.columns, "Coefficient": model.coef_})
print(coef_df.sort_values("Coefficient", key=abs, ascending=False))
plt.figure(figsize=(9, 5))
coef_df.set_index("Feature")["Coefficient"].plot(kind="bar", color="steelblue")
plt.title("Feature Coefficients")
plt.ylabel("Coefficient")
plt.axhline(0, color="black", linewidth=0.8)
plt.tight_layout()
plt.show()
# --- STEP 6: predict & evaluate ---
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"\nR²   : {r2:.4f}")
print(f"MSE  : {mse:.4f}")
print(f"RMSE : {np.sqrt(mse):.4f}")
print(f"MAE  : {mae:.4f}")
results = pd.DataFrame({
    "Actual": y_test.values,
    "Predicted": y_pred,
})
results["Error"] = results["Actual"] - results["Predicted"]
results["Abs_Error"] = results["Error"].abs()
print("\nSample results:\n", results.head(10).round(4))
print("\nError summary:\n", results[["Error", "Abs_Error"]].describe().round(4))
residuals = results["Error"]
abs_errors = results["Abs_Error"]
# Actual vs predicted
plt.figure(figsize=(7, 6))
plt.scatter(results["Actual"], results["Predicted"], alpha=0.3, s=10, color="steelblue")
lims = [results["Actual"].min(), results["Actual"].max()]
plt.plot(lims, lims, "r--", label="Perfect prediction")
plt.xlabel("Actual ($100k)")
plt.ylabel("Predicted ($100k)")
plt.title(f"Actual vs Predicted (R² = {r2:.3f})")
plt.legend()
plt.tight_layout()
plt.show()
# Error distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.histplot(residuals, bins=40, kde=True, ax=axes[0], color="steelblue")
axes[0].axvline(0, color="red", linestyle="--")
axes[0].set_title("Residual Distribution")
axes[0].set_xlabel("Actual − Predicted")
sns.histplot(abs_errors, bins=40, kde=True, ax=axes[1], color="seagreen")
axes[1].set_title("Absolute Error Distribution")
axes[1].set_xlabel("|Actual − Predicted|")
plt.tight_layout()
plt.show()
# Residuals vs fitted & Q-Q plot
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
axes[0].scatter(results["Predicted"], residuals, alpha=0.3, s=10, color="steelblue")
axes[0].axhline(0, color="red", linestyle="--")
axes[0].set_xlabel("Predicted ($100k)")
axes[0].set_ylabel("Residual")
axes[0].set_title("Residuals vs Fitted")
stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title("Q-Q Plot")
plt.tight_layout()
plt.show()

		

Output from the above code-

Shape: (20640, 9)
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422
             MedInc      HouseAge      AveRooms     AveBedrms    Population      AveOccup      Latitude     Longitude  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671     28.639486      5.429000      1.096675   1425.476744      3.070655     35.631861   -119.569704   
std        1.899822     12.585558      2.474173      0.473911   1132.462122     10.386050      2.135952      2.003532   
min        0.499900      1.000000      0.846154      0.333333      3.000000      0.692308     32.540000   -124.350000   
25%        2.563400     18.000000      4.440716      1.006079    787.000000      2.429741     33.930000   -121.800000   
50%        3.534800     29.000000      5.229129      1.048780   1166.000000      2.818116     34.260000   -118.490000   
75%        4.743250     37.000000      6.052381      1.099526   1725.000000      3.282261     37.710000   -118.010000   
max       15.000100     52.000000    141.909091     34.066667  35682.000000   1243.333333     41.950000   -114.310000   

        MedHouseVal  
count  20640.000000  
mean       2.068558  
std        1.153956  
min        0.149990  
25%        1.196000  
50%        1.797000  
75%        2.647250  
max        5.000010  
Missing values:
 MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64

Correlation matrix:
              MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
MedInc        1.000    -0.119     0.327     -0.062       0.005     0.019    -0.080     -0.015        0.688
HouseAge     -0.119     1.000    -0.153     -0.078      -0.296     0.013     0.011     -0.108        0.106
AveRooms      0.327    -0.153     1.000      0.848      -0.072    -0.005     0.106     -0.028        0.152
AveBedrms    -0.062    -0.078     0.848      1.000      -0.066    -0.006     0.070      0.013       -0.047
Population    0.005    -0.296    -0.072     -0.066       1.000     0.070    -0.109      0.100       -0.025
AveOccup      0.019     0.013    -0.005     -0.006       0.070     1.000     0.002      0.002       -0.024
Latitude     -0.080     0.011     0.106      0.070      -0.109     0.002     1.000     -0.925       -0.144
Longitude    -0.015    -0.108    -0.028      0.013       0.100     0.002    -0.925      1.000       -0.046
MedHouseVal   0.688     0.106     0.152     -0.047      -0.025    -0.024    -0.144     -0.046        1.000

Intercept: 2.0679
      Feature  Coefficient
6    Latitude    -0.896635
7   Longitude    -0.868927
0      MedInc     0.852382
3   AveBedrms     0.371132
2    AveRooms    -0.305116
1    HouseAge     0.122382
5    AveOccup    -0.036624
4  Population    -0.002298

R²   : 0.5758
MSE  : 0.5559
RMSE : 0.7456
MAE  : 0.5332

Sample results:
    Actual  Predicted   Error  Abs_Error
0   0.477     0.7191 -0.2421     0.2421
1   0.458     1.7640 -1.3060     1.3060
2   5.000     2.7097  2.2904     2.2904
3   2.186     2.8389 -0.6529     0.6529
4   2.780     2.6047  0.1753     0.1753
5   1.587     2.0118 -0.4248     0.4248
6   1.982     2.6455 -0.6635     0.6635
7   1.575     2.1688 -0.5938     0.5938
8   3.400     2.7407  0.6593     0.6593
9   4.466     3.9156  0.5504     0.5504

Error summary:
            Error  Abs_Error
count  4128.0000  4128.0000
mean      0.0035     0.5332
std       0.7457     0.5212
min      -9.8753     0.0001
25%      -0.4609     0.1968
50%      -0.1224     0.4102
75%       0.3124     0.6886
max       4.1484     9.8753

As you can see from the above metrics, this plain vanilla regression model is doing a decent job overall: it explains about 58% of the variation in the test data (R² of 0.5758), with an RMSE of about 0.75, i.e. roughly $75,000. However, it can be significantly improved upon, either through feature engineering (such as binning, and fixing multicollinearity and heteroscedasticity) or by leveraging more robust techniques such as Elastic Net, Ridge Regression, SGD Regression or non-linear models.

Building Linear Model using statsmodels module

Image 9- Fitting Linear Regression Model using statsmodels

Image 11- Fitting Linear Regression Model with Significant Variables

Image 12- Heteroscedasticity Consistent Linear Regression Estimates

More details on the metrics can be found at the below links-

Wiki

Here is a blog with excellent explanation of all metrics

Cheers!

Data Standardization or Normalization

October 27, 2017 / June 13, 2026 / RP / 1 Comment

Data standardization or normalization plays a critical role in most statistical analysis and modeling. Let’s spend some time on the difference between standardization and normalization first.

Standardization rescales a variable so that it has mean = 0 and standard deviation = 1. It does not change the shape of the distribution, so the result follows the standard normal distribution only if the variable was normally distributed to begin with. Normalization, on the other hand, rescales a variable to fit within a certain range (generally between 0 and 1). Here are more details of the above.

Let’s now talk about why we need to standardize or normalize before many statistical analyses.

In a multivariate analysis when variables have widely different scales, variable(s) with higher range may overshadow the other variables in analysis. For example, let’s say variable X has a range of 0-1000 and variable Y has a range of 0-10. In all likelihood, variable X will outweigh variable Y due to its higher range. However, if we standardize or normalize the variables, then we can overcome this issue.
Any algorithm that is based on distance or variance computations, such as clustering, k nearest neighbour (KNN) and principal component analysis (PCA), will be greatly affected if you don’t normalize the data
Neural networks and deep learning networks also generally benefit from normalized variables, which help them converge faster and give more accurate results
Multivariate models may become more stable and the coefficients more reliable if you normalize the data
Standardization makes extreme values easier to spot (as large z-scores), but neither technique removes outliers; min-max normalization in particular is sensitive to them, because a single extreme value sets the range
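
The difference between the two rescalings is easy to see on a tiny made-up column that contains one extreme value. The numbers and variable names below are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single column with one extreme value (100)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Standardization: (x - mean) / standard deviation
z = StandardScaler().fit_transform(x)
z_manual = (x - x.mean()) / x.std()  # population std, as StandardScaler uses

# Normalization: (x - min) / (max - min)
m = MinMaxScaler().fit_transform(x)
m_manual = (x - x.min()) / (x.max() - x.min())

print("Standardized:", z.ravel().round(3))
print("Normalized  :", m.ravel().round(3))
```

In the normalized output the four ordinary values are squeezed into a narrow band near 0 while the extreme value sits at 1, which shows how strongly min-max scaling reacts to a single outlier.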

Let’s look at a Python example on how we can normalize data-

			
# MinMaxScaler on California Housing — before vs after
# Formula: x_scaled = (x - min) / (max - min)  ->  values land in [0, 1]
# In Jupyter: add `%matplotlib inline` at the top for inline charts.
import warnings
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import MinMaxScaler
warnings.filterwarnings("ignore")
sns.set_theme(style="whitegrid")
# 1. Load features
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
print("BEFORE MinMaxScaler (original values):\n")
print(X.head().round(3))
print("\n", X.describe().round(3))
# 2. Fit MinMaxScaler and transform
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(X),
    columns=X.columns,
)
print("\n" + "=" * 50)
print("AFTER MinMaxScaler (scaled to [0, 1]):\n")
print(X_scaled.head().round(3))
print("\n", X_scaled.describe().round(3))
# 3. Worked example — one feature, first 5 rows
feature = "MedInc"
lo = scaler.data_min_[X.columns.get_loc(feature)]
hi = scaler.data_max_[X.columns.get_loc(feature)]
example = pd.DataFrame({
    "raw": X[feature].head(),
    "min": lo,
    "max": hi,
})
example["scaled = (raw - min) / (max - min)"] = (
    (example["raw"] - example["min"]) / (example["max"] - example["min"])
).round(3)
example["sklearn output"] = X_scaled[feature].head().round(3).values
print("\n" + "=" * 50)
print(f"How scaling works for {feature}:\n")
print(example.to_string())
# 4. Quick before / after summary
print("\n" + "=" * 50)
print("Before vs after (min and max per feature):\n")
comparison = pd.DataFrame({
    "before_min": X.min(),
    "before_max": X.max(),
    "after_min": X_scaled.min(),
    "after_max": X_scaled.max(),
}).round(3)
print(comparison.to_string())
# 5. Plots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.boxplot(data=X, ax=axes[0], color="steelblue", fliersize=2)
axes[0].set_title("Before scaling")
plt.setp(axes[0].get_xticklabels(), rotation=45, ha="right")
sns.boxplot(data=X_scaled, ax=axes[1], color="darkorange", fliersize=2)
axes[1].set_title("After MinMaxScaler [0, 1]")
plt.setp(axes[1].get_xticklabels(), rotation=45, ha="right")
plt.suptitle("California Housing — all features", y=1.02)
plt.tight_layout()
plt.show()
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(X[feature], bins=40, kde=True, ax=axes[0], color="steelblue")
axes[0].set_title(f"Before — {feature}")
axes[0].set_xlabel(feature)
sns.histplot(X_scaled[feature], bins=40, kde=True, ax=axes[1], color="darkorange")
axes[1].set_title(f"After — {feature}")
axes[1].set_xlabel(f"{feature} (scaled)")
plt.tight_layout()
plt.show()
print("\nTakeaway: every feature is rescaled to the same [0, 1] range.")
print("  x_scaled = (x - column_min) / (column_max - column_min)")

		

BEFORE MinMaxScaler (original values):

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0   8.325      41.0     6.984      1.024       322.0     2.556     37.88   
1   8.301      21.0     6.238      0.972      2401.0     2.110     37.86   
2   7.257      52.0     8.288      1.073       496.0     2.802     37.85   
3   5.643      52.0     5.817      1.073       558.0     2.548     37.85   
4   3.846      52.0     6.282      1.081       565.0     2.181     37.85   

   Longitude  
0    -122.23  
1    -122.22  
2    -122.24  
3    -122.25  
4    -122.25  

           MedInc   HouseAge   AveRooms  AveBedrms  Population   AveOccup  \
count  20640.000  20640.000  20640.000  20640.000   20640.000  20640.000   
mean       3.871     28.639      5.429      1.097    1425.477      3.071   
std        1.900     12.586      2.474      0.474    1132.462     10.386   
min        0.500      1.000      0.846      0.333       3.000      0.692   
25%        2.563     18.000      4.441      1.006     787.000      2.430   
50%        3.535     29.000      5.229      1.049    1166.000      2.818   
75%        4.743     37.000      6.052      1.100    1725.000      3.282   
max       15.000     52.000    141.909     34.067   35682.000   1243.333   

        Latitude  Longitude  
count  20640.000  20640.000  
mean      35.632   -119.570  
std        2.136      2.004  
min       32.540   -124.350  
25%       33.930   -121.800  
50%       34.260   -118.490  
75%       37.710   -118.010  
max       41.950   -114.310  

==================================================
AFTER MinMaxScaler (scaled to [0, 1]):

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0   0.540     0.784     0.044      0.020       0.009     0.001     0.567   
1   0.538     0.392     0.038      0.019       0.067     0.001     0.565   
2   0.466     1.000     0.053      0.022       0.014     0.002     0.564   
3   0.355     1.000     0.035      0.022       0.016     0.001     0.564   
4   0.231     1.000     0.039      0.022       0.016     0.001     0.564   

   Longitude  
0      0.211  
1      0.212  
2      0.210  
3      0.209  
4      0.209  

           MedInc   HouseAge   AveRooms  AveBedrms  Population   AveOccup  \
count  20640.000  20640.000  20640.000  20640.000   20640.000  20640.000   
mean       0.232      0.542      0.032      0.023       0.040      0.002   
std        0.131      0.247      0.018      0.014       0.032      0.008   
min        0.000      0.000      0.000      0.000       0.000      0.000   
25%        0.142      0.333      0.025      0.020       0.022      0.001   
50%        0.209      0.549      0.031      0.021       0.033      0.002   
75%        0.293      0.706      0.037      0.023       0.048      0.002   
max        1.000      1.000      1.000      1.000       1.000      1.000   

        Latitude  Longitude  
count  20640.000  20640.000  
mean       0.329      0.476  
std        0.227      0.200  
min        0.000      0.000  
25%        0.148      0.254  
50%        0.183      0.584  
75%        0.549      0.631  
max        1.000      1.000  

==================================================
How scaling works for MedInc:

      raw     min      max  scaled = (raw - min) / (max - min)  sklearn output
0  8.3252  0.4999  15.0001                               0.540           0.540
1  8.3014  0.4999  15.0001                               0.538           0.538
2  7.2574  0.4999  15.0001                               0.466           0.466
3  5.6431  0.4999  15.0001                               0.355           0.355
4  3.8462  0.4999  15.0001                               0.231           0.231

==================================================
Before vs after (min and max per feature):

            before_min  before_max  after_min  after_max
MedInc           0.500      15.000        0.0        1.0
HouseAge         1.000      52.000        0.0        1.0
AveRooms         0.846     141.909        0.0        1.0
AveBedrms        0.333      34.067        0.0        1.0
Population       3.000   35682.000        0.0        1.0
AveOccup         0.692    1243.333        0.0        1.0
Latitude        32.540      41.950        0.0        1.0
Longitude     -124.350    -114.310        0.0        1.0

Takeaway: every feature is rescaled to the same [0, 1] range.
  x_scaled = (x - column_min) / (column_max - column_min)
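
To make the formula concrete without any library, here is a minimal sketch of the same arithmetic in plain Python. The helper names `min_max_scale` and `inverse_min_max` and the sample `incomes` values are made up for illustration, and mapping a constant column to 0.0 is an assumption of this sketch rather than a rule from the post.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]; returns (scaled, lo, hi)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant column has no spread, so map every value to 0.0
        return [0.0 for _ in values], lo, hi
    return [(v - lo) / span for v in values], lo, hi


def inverse_min_max(scaled, lo, hi):
    """Undo the scaling: x = x_scaled * (max - min) + min."""
    return [s * (hi - lo) + lo for s in scaled]


# Hypothetical values, not taken from the California Housing data
incomes = [0.5, 3.5, 8.0, 15.5]
scaled, lo, hi = min_max_scale(incomes)
print("scaled:  ", scaled)

restored = inverse_min_max(scaled, lo, hi)
print("restored:", restored)
```

Returning the minimum and maximum alongside the scaled values matters in practice: the same two numbers are needed later to scale new data consistently or to convert scaled values back to the original units.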

Cheers!

Basic Statistics and Data Visualization

October 24, 2017 / May 16, 2021 / RP / 1 Comment

Exploratory, diagnostic, and descriptive statistics are the first and a crucial part of any data analytics project.

Here are some more details on each of the steps involved in Exploratory Data Analysis (EDA).

Let’s now look at examples of how to accomplish these tasks in Python.

You can list all the built-in datasets in the seaborn library with the following command:

seaborn.get_dataset_names()

The following datasets are available:

['anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds',
 'dots', 'exercise', 'flights', 'fmri', 'gammas', 'iris', 'mpg',
 'planets', 'tips', 'titanic']
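
Once a dataset is chosen, a first pass of descriptive statistics might look like the sketch below. It assumes pandas is installed and uses a tiny hand-made DataFrame (`tips_sample`, with made-up values shaped like the `tips` dataset) so that it runs without a download. With network access, `seaborn.load_dataset("tips")` would return the real data instead.

```python
import pandas as pd

# Hypothetical sample shaped like seaborn's "tips" dataset (values are made up)
tips_sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 5.0, 7.0],
    "day": ["Thur", "Fri", "Fri", "Sun"],
})

# Descriptive statistics (count, mean, std, quartiles) for the numeric columns
summary = tips_sample.describe()
print(summary.round(2))

# Frequency table for the categorical column
day_counts = tips_sample["day"].value_counts()
print(day_counts)

# Pairwise Pearson correlation between the numeric columns
corr = tips_sample[["total_bill", "tip"]].corr()
print(corr.round(2))
```

These three calls cover the usual starting questions: how each numeric variable is distributed, how often each category occurs, and how strongly the numeric variables move together.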


Cheers!

RP’s Blog on AI

Connect with RP- https://www.linkedin.com/in/ratnakarpandey/

