Decision Tree Regression using Scikit

August 21, 2018August 7, 2025 / RP / 1 Comment

Clean, text-free image representing decision tree regression concept

Decision Tree for Regression: Complete Guide and Implementation

Core Concepts and How It Works

Decision tree regression is a non-parametric supervised learning method used to predict continuous numerical values. Unlike classification trees that predict categories, regression trees predict continuous outcomes like prices, temperatures, or scores.

How Decision Trees Work for Regression

The algorithm works by recursively splitting the dataset into smaller, more homogeneous subsets based on feature values. Here’s the step-by-step process:

Start with the Root Node: The algorithm begins with the entire dataset at the root node
Find the Best Split: It evaluates all possible splits across all features to find the one that best reduces prediction error
Recursive Partitioning: The dataset is divided into subsets based on the chosen split, and the process repeats for each subset
Create Leaf Nodes: The process continues until stopping criteria are met (e.g., minimum samples per leaf, maximum depth)
Make Predictions: For any new data point, it follows the decision path down the tree and outputs the mean value of the training samples in the corresponding leaf node

Key Mathematical Concepts

Standard Deviation Reduction

Decision trees for regression use standard deviation reduction instead of information gain to determine the best splits. The goal is to minimize the variance within each subset after splitting.

Sum of Squared Errors (SSE)

The algorithm minimizes the Residual Sum of Squares (RSS) given by:

RSS = Sum over all regions j of Sum over observations i in region j of (yi – y_hat_Rj)^2

Where:

yi is the actual value for observation i
y_hat_Rj is the predicted value (mean) for region Rj

Mean Squared Error (MSE)

The quality of predictions is commonly evaluated using Mean Squared Error:

MSE = (1/n) * Sum of (actual_value – predicted_value)^2

Where n is the number of observations.

Core Algorithm Components

Splitting Criteria

For regression, the algorithm uses variance reduction or mean squared error reduction to evaluate splits. The feature and threshold that result in the maximum reduction in variance are chosen for each split.

Stopping Criteria

The tree stops growing when:

Maximum depth is reached
Minimum number of samples per leaf is reached
No further improvement in variance reduction is possible

Pruning

To prevent overfitting, cost complexity pruning is often applied, which removes branches that don’t significantly improve model performance.

Key Assumptions and Highlights

Non-linear Relationships: Decision trees can capture non-linear relationships between features and target variables
Feature Independence: The algorithm doesn’t assume independence between features
Local Prediction: Predictions are made locally within each leaf node using the mean of training samples
Greedy Algorithm: Uses a greedy approach, making the best split at each step without looking ahead

Advantages and Limitations

Advantages:

Easy to understand and interpret
Handles both numerical and categorical features
Requires minimal data preprocessing
Can capture non-linear patterns

Limitations:

Prone to overfitting, especially with deep trees
Can be unstable (small changes in data can result in different trees)
May create biased trees if some classes dominate

Complete Python Implementation

Here’s a comprehensive implementation that demonstrates decision tree regression with a real-world dataset:

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("=== Decision Tree Regression Implementation ===\n")

# 1. LOAD AND EXPLORE DATASET
print("1. Loading California Housing Dataset...")
# Load the California housing dataset - perfect for regression
california_housing = fetch_california_housing()
X = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
y = pd.Series(california_housing.target, name='MedHouseValue')

print(f"Dataset shape: {X.shape}")
print(f"Target variable range: ${y.min():.2f} - ${y.max():.2f} (in hundreds of thousands)")
print("\nFeature descriptions:")
for i, feature in enumerate(california_housing.feature_names):
    print(f"- {feature}: {california_housing.DESCR.split('Attribute Information:')[1].split(':')[i+1].split('-')[0].strip()}")

# Display basic statistics
print("\nDataset Info:")
print(X.describe().round(2))
print(f"\nTarget variable statistics:")
print(y.describe().round(2))

# 2. DATA PREPROCESSING
print("\n2. Data Preprocessing...")

# Check for missing values
print(f"Missing values in features: {X.isnull().sum().sum()}")
print(f"Missing values in target: {y.isnull().sum()}")

# Feature correlation analysis
print("\nFeature correlation with target:")
correlations = X.corrwith(y).sort_values(ascending=False)
print(correlations.round(3))

# 3. TRAIN-TEST SPLIT
print("\n3. Splitting data into train and test sets...")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

# 4. MODEL TRAINING WITH HYPERPARAMETER TUNING
print("\n4. Training Decision Tree Regressor with Hyperparameter Tuning...")

# Define hyperparameter grid for optimization
param_grid = {
    'max_depth': [3, 5, 7, 10, 15, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10],
    'max_features': ['auto', 'sqrt', 'log2', None]
}

# Initialize the regressor
dt_regressor = DecisionTreeRegressor(random_state=42)

# Perform grid search with cross-validation
print("Performing Grid Search with 5-fold Cross Validation...")
grid_search = GridSearchCV(
    estimator=dt_regressor,
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=0
)

# Fit the grid search
grid_search.fit(X_train, y_train)

# Get the best model
best_dt = grid_search.best_estimator_
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score (negative MSE): {grid_search.best_score_:.4f}")

# Train a simple model for comparison
simple_dt = DecisionTreeRegressor(max_depth=5, random_state=42)
simple_dt.fit(X_train, y_train)

# 5. MODEL EVALUATION
print("\n5. Model Evaluation...")

# Make predictions
y_train_pred_best = best_dt.predict(X_train)
y_test_pred_best = best_dt.predict(X_test)
y_train_pred_simple = simple_dt.predict(X_train)
y_test_pred_simple = simple_dt.predict(X_test)

# Calculate metrics for both models
def calculate_metrics(y_true, y_pred, model_name, dataset_type):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    
    print(f"\n{model_name} - {dataset_type} Set Metrics:")
    print(f"  Mean Squared Error (MSE): {mse:.4f}")
    print(f"  Root Mean Squared Error (RMSE): {rmse:.4f}")
    print(f"  Mean Absolute Error (MAE): {mae:.4f}")
    print(f"  R² Score: {r2:.4f}")
    
    return mse, rmse, mae, r2

# Evaluate both models
calculate_metrics(y_train, y_train_pred_best, "Best Tuned Model", "Training")
calculate_metrics(y_test, y_test_pred_best, "Best Tuned Model", "Test")
calculate_metrics(y_train, y_train_pred_simple, "Simple Model", "Training")
calculate_metrics(y_test, y_test_pred_simple, "Simple Model", "Test")

# Cross-validation scores
cv_scores_best = cross_val_score(best_dt, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
cv_scores_simple = cross_val_score(simple_dt, X_train, y_train, cv=5, scoring='neg_mean_squared_error')

print(f"\nCross-validation RMSE (Best Model): {np.sqrt(-cv_scores_best.mean()):.4f} (+/- {np.sqrt(cv_scores_best.std() * 2):.4f})")
print(f"Cross-validation RMSE (Simple Model): {np.sqrt(-cv_scores_simple.mean()):.4f} (+/- {np.sqrt(cv_scores_simple.std() * 2):.4f})")

# 6. FEATURE IMPORTANCE ANALYSIS
print("\n6. Feature Importance Analysis...")
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance_best': best_dt.feature_importances_,
    'importance_simple': simple_dt.feature_importances_
}).sort_values('importance_best', ascending=False)

print("Top 5 Most Important Features (Best Model):")
print(feature_importance.head())

# 7. VISUALIZATIONS
print("\n7. Generating Visualizations...")

# Set up the plotting style
plt.style.use('default')
fig = plt.figure(figsize=(20, 15))

# Plot 1: Actual vs Predicted (Best Model)
plt.subplot(3, 3, 1)
plt.scatter(y_test, y_test_pred_best, alpha=0.6, color='blue', s=30)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values (Best Model)')
plt.grid(True, alpha=0.3)

# Plot 2: Residuals Plot (Best Model)
plt.subplot(3, 3, 2)
residuals = y_test - y_test_pred_best
plt.scatter(y_test_pred_best, residuals, alpha=0.6, color='green', s=30)
plt.axhline(y=0, color='red', linestyle='--', linewidth=2)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals Plot (Best Model)')
plt.grid(True, alpha=0.3)

# Plot 3: Feature Importance
plt.subplot(3, 3, 3)
plt.barh(feature_importance['feature'], feature_importance['importance_best'], color='skyblue')
plt.xlabel('Importance')
plt.title('Feature Importance (Best Model)')
plt.gca().invert_yaxis()

# Plot 4: Prediction Distribution
plt.subplot(3, 3, 4)
plt.hist(y_test, bins=30, alpha=0.7, label='Actual', color='blue', edgecolor='black')
plt.hist(y_test_pred_best, bins=30, alpha=0.7, label='Predicted', color='red', edgecolor='black')
plt.xlabel('House Value')
plt.ylabel('Frequency')
plt.title('Distribution: Actual vs Predicted')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 5: Model Comparison (MSE)
plt.subplot(3, 3, 5)
models = ['Simple Model\n(max_depth=5)', 'Best Tuned Model']
train_mse = [mean_squared_error(y_train, y_train_pred_simple), mean_squared_error(y_train, y_train_pred_best)]
test_mse = [mean_squared_error(y_test, y_test_pred_simple), mean_squared_error(y_test, y_test_pred_best)]

x = np.arange(len(models))
width = 0.35

plt.bar(x - width/2, train_mse, width, label='Train MSE', color='lightblue', edgecolor='black')
plt.bar(x + width/2, test_mse, width, label='Test MSE', color='lightcoral', edgecolor='black')

plt.xlabel('Model')
plt.ylabel('Mean Squared Error')
plt.title('Model Comparison: MSE')
plt.xticks(x, models)
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 6: Learning Curve (Tree Depth)
plt.subplot(3, 3, 6)
depths = range(1, 21)
train_scores = []
test_scores = []

for depth in depths:
    dt_temp = DecisionTreeRegressor(max_depth=depth, random_state=42)
    dt_temp.fit(X_train, y_train)
    train_scores.append(mean_squared_error(y_train, dt_temp.predict(X_train)))
    test_scores.append(mean_squared_error(y_test, dt_temp.predict(X_test)))

plt.plot(depths, train_scores, 'o-', color='blue', label='Training MSE', linewidth=2)
plt.plot(depths, test_scores, 'o-', color='red', label='Test MSE', linewidth=2)
plt.xlabel('Tree Depth')
plt.ylabel('Mean Squared Error')
plt.title('Learning Curve: Effect of Tree Depth')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 7: Correlation Heatmap
plt.subplot(3, 3, 7)
correlation_matrix = X.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, square=True, fmt='.2f')
plt.title('Feature Correlation Heatmap')

# Plot 8: Cross-validation Scores
plt.subplot(3, 3, 8)
cv_results = pd.DataFrame({
    'Fold': range(1, 6),
    'Best Model': -cv_scores_best,
    'Simple Model': -cv_scores_simple
})

plt.plot(cv_results['Fold'], cv_results['Best Model'], 'o-', label='Best Model', linewidth=2, markersize=8)
plt.plot(cv_results['Fold'], cv_results['Simple Model'], 's-', label='Simple Model', linewidth=2, markersize=8)
plt.xlabel('CV Fold')
plt.ylabel('Mean Squared Error')
plt.title('Cross-Validation Performance')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 9: Tree Visualization (Simple Model)
plt.subplot(3, 3, 9)
plot_tree(simple_dt, max_depth=3, feature_names=X.columns, filled=True, fontsize=8)
plt.title('Decision Tree Structure (Simplified View)')

plt.tight_layout()
plt.show()

# 8. DETAILED ANALYSIS AND INSIGHTS
print("\n8. Model Analysis and Insights...")

# Analyze overfitting
train_r2_best = r2_score(y_train, y_train_pred_best)
test_r2_best = r2_score(y_test, y_test_pred_best)
train_r2_simple = r2_score(y_train, y_train_pred_simple)
test_r2_simple = r2_score(y_test, y_test_pred_simple)

print(f"\nOverfitting Analysis:")
print(f"Best Model - Train R²: {train_r2_best:.4f}, Test R²: {test_r2_best:.4f}")
print(f"Simple Model - Train R²: {train_r2_simple:.4f}, Test R²: {test_r2_simple:.4f}")

overfitting_best = train_r2_best - test_r2_best
overfitting_simple = train_r2_simple - test_r2_simple
print(f"Overfitting Gap (Best): {overfitting_best:.4f}")
print(f"Overfitting Gap (Simple): {overfitting_simple:.4f}")

# Model complexity analysis
print(f"\nModel Complexity:")
print(f"Best Model - Tree Depth: {best_dt.get_depth()}, Leaves: {best_dt.get_n_leaves()}")
print(f"Simple Model - Tree Depth: {simple_dt.get_depth()}, Leaves: {simple_dt.get_n_leaves()}")

# Feature importance insights
print(f"\nKey Insights:")
print(f"• Most important feature: {feature_importance.iloc[0]['feature']} ({feature_importance.iloc[0]['importance_best']:.3f})")
print(f"• Least important feature: {feature_importance.iloc[-1]['feature']} ({feature_importance.iloc[-1]['importance_best']:.3f})")

# Performance summary
print(f"\nFinal Model Performance Summary:")
print(f"• Best model RMSE on test set: ${np.sqrt(mean_squared_error(y_test, y_test_pred_best)):.2f} (hundreds of thousands)")
print(f"• This represents an average prediction error of ~${np.sqrt(mean_squared_error(y_test, y_test_pred_best))*100000:.0f}")
print(f"• Model explains {test_r2_best:.1%} of the variance in house prices")

print("\n=== Analysis Complete ===")

Thanks for reading!

Support Vector Machine (SVM)

August 10, 2018May 16, 2021 / RP / 2 Comments

What is Support Vector Machine?

Support Vector Machine are supervised machine learning algorithms used mainly for classification and regression tasks. If a SVM is used for classification, it’s called Support Vector Classifier (SVC). Similarly, for regression it’s called Support Vector Regressor (SVR)

Where is SVM used?

SVM can be used wherever we use other machine learning techniques such as Logistic Regression, Decision Trees, Linear Regression, Naive Bayes Classifier etc. However, SVM may be particularly more suitable for following cases-

Sparse data
High Dimensional data
Text Classification
Data is nonlinear
Image classification
Data has complex patterns
Etc.

How does an SVM work?

A support vector machine, works to separate the pattern in the data by drawing a linear separable hyperplane in high dimensional space. For example in the 2D image below, we need to separate the green points from the red points. We can draw many hyperplanes such as H1, H2, H3 and H4. They all help in separating the points.

However, since there are many possible hyperplanes as denoted in the image below, which hyperplane should be chosen? The answer is- the plane which maximizes the separation between the green and the red points. In this case, it happens to be H3.

What happens if the data is not linearly separable?

Kernel trick or Kernel function helps transform the original non-linearly separable data into a higher dimension space where it can be linearly transformed. See image below-

What is the best hyperplane?

As we discussed earlier, the best hyperplane is the one that maximizes the distance (you can think about the width of the road) between the classes as shown below

How to build this in Python?

Here is an excellent link for the hyper-parameter optimisation-

https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769

For more info on sklearn library, refer below links-

http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

http://scikit-learn.org/stable/modules/svm.html

Thanks for reading!

How to Change Browser for Jupyter Notebook

August 10, 2018May 16, 2021 / RP / 1 Comment

Here are the step by step directions on how to open Jupyter Notebook in the browser of your preference.

Step 1- Go to Anaconda Navigator and start Jupyter Notebook

Step 2- Go to Ananconda Prompt to grab URL

Step 3- Put the URL (will be unique for your application) in browser of your choice

That’s it. You will have Jupyter notebook open in the browser of your choice.

For clearing up anaconda prompt screen, simply type – ‘cls’ on the command prompt

Thanks for reading!

Overview of Banking and Financial Services Industry

August 2, 2018May 16, 2021 / RP / Leave a comment

What is BFSI?

BFSI is an acronym for Banking, Financial Services and Insurance. This covers a whole gamut of activities and business models.
Wiki defines – “ BFSI comprises commercial banks, insurance companies, non-banking financial companies, cooperatives, pensions funds, mutual funds and other smaller financial entities. Banking may include core banking, retail, private, corporate, investment, cards and the like ”

Activity- Explore the below pages. List down different products and services that you see?

Thank you!

Exploratory Data Analysis and kNN Classification on Iris Dataset

July 21, 2018May 16, 2021 / RP / Leave a comment

Auto Regressive Integrated Moving Average (ARIMA) Time Series Forecasting

July 19, 2018May 16, 2021 / RP / 4 Comments

WORDCLOUD

Autoregressive Integrated Moving Average (ARIMA) is one of the most popular technique for time series modeling. This is also called Box-Jenkins method, named after the statisticians who pioneered some of the latest developments on this technique.

We will focus on following broad areas-

What is a time series? We have covered this in another article. Click here
Explore a time series data. Please refer to the slides 2 to 7 of the below deck and Click here
What is an ARIMA modeling
Discuss stationarity of a time series
Fit an ARIMA model, evaluate model’s accuracy and forecast for future

What is an ARIMA modeling-

An ARIMA model has following main components. However, not all models need to have all of the below mentioned components.

Autoregressive (AR)

Value of a time series at time period t (yt) is a function of values of time series at previous time periods ‘p’

yt = Linear function of yt-1, yt-2,….., yt-p + error

Integrated (I)

To make a time series stationary (discussed below), sometimes we need to difference successive observation and model that. This process is known as integration and differencing order is represented as ‘d’ in an ARIMA model.

Moving Average (MA)

Value of a time series at time period t (yt) is a function of errors at previous time periods ‘q’

yt = Linear function of Et-1, Et-2,….., Et-q + error

Based on the combinations of the above factors, we can have following and other models-

AR- Only autoregressives terms
MA- Only moving averages terms
ARMA- Both autoregressive and moving average terms
ARIMA- Autoregressive, moving average terms and integration terms. After the differencing step, the model becomes ARMA

A general ARIMA model is represented as ARIMA(p,d,q) where p, d and q represent AR, Integrated and moving averages respectively. Whereas each of p,d and q are integers higher than or equal to zero.

Stationarity of a time series-

A time series is called stationary where it has a constant mean and variance across the time period, i.e. mean and variance don’t depend on time. It other words, it should not have any trend and dispersion in variance of the data over a period of time. This is also called white noise.

Please refer to slides 8 to 11 of the below deck for live examples of this discussion

From the plot of our air passengers time series, we can tell that the time series is not stationary. Moreover, a time series needs to be stationary or made stationary before being fed into ARIMA modeling.

Statistically, Augmented Dickey–Fuller test is used for testing the stationarity of a time series. Generally speaking the null hypothesis (H0) is that the series is “Non-Stationary” and the alternative hypothesis (Ha) is that series is “Stationary”.

If the p statistics generated from the test is less than <0.05 we can reject the null hypothesis. Otherwise, we need to accept the null hypothesis.

From the ADF test we can see that the p values is close to 0.78 and which is more than 0.05 and hence we need to accept the null hypothesis that is the series is “Non Stationary”

How do we make a time series stationary? Well, we can do it two ways-

Manual- Transformation and Differecing etc. Let’s look at an example.
Automated- The Integrated term (d)in the ARIMA will make it stationary. This we will do in the model fitting phase. Generally speaking we don’t require d>1 to make a time series stationary
Auto.arima ( ) will take care of this automatically and fit the best model

Fit a model, evaluate model’s accuracy and forecast

We will use auto.arima ( ) to fit the best model and evaluate model fitment and performance using following main parameters.

Please refer to slides 12-18 of the below deck

A good time series model should have following characteristics-

Residuals shouldn’t show any trends over time.
Auto correlation Factors(ACF) and Partial Auto correlation Factor (PACF) shouldn’t have large values (beyond significance level)for any lags. ACF indicates correlation between the current value to all the previous values in a range. PACF is an extension of ACF, where it removes the correlation of the intermediate lags. You can read more on this here.
Errors shouldn’t show any seasonality
Errors should be normally distributed
Error (MAE, MAPE, MSE etc.) should be low
AIC, BIC should be relatively lower compared to other alternative models.

The codes and presentation

For those who would like to read more about the time series analysis in R, here is an excellent free book.

Thank you!

Holt Winters Time Series Forecasting

July 9, 2018May 16, 2021 / RP / 1 Comment

What is a time series?

When we track a certain variable over an interval of time (generally at an equal interval of time) the resulting process is called a time series.

Let’s Look at some examples of time series in our daily life

1. Closing price of Apple stock on a daily basis will be a time series

Example of Time Series- Apple Stock Price Trend Pulled from Google Finance

2. GDP of the world over last several decades so will again be a time series again-

Example of Time Series- World GDP Trend Over Last Several Decades from World Bank

3. Similarly, the hourly movement of the Bitcoin prices in a day will be a time series

Example of Time Series- Hourly Bitcoin Prices from Coindesk

As you can see from the above examples, the duration of the time can vary for a time series. It can be minutes, days, hours, weeks. months, quarters, years or any other time period. However, one thing that will be common in all time series will be that a particular variable is being measured over a period of time.

What is a time series modeling?

A time series modelling is a statistical exercise where we try to achieve following two main objectives,

1. Visualize and understand the pattern of a particular time series. For example, if you are looking at the sales of an eCommerce company you would like to understand how it has performed over a period of time, which months it goes higher and lower etc.

2. By looking into the historical pattern, forecast what may happen in the future in that particular time series

What are the business usage of a time series modeling?

Time series modelling is used for a variety of different purpose. Some examples are listed below-

1. Forecast sales of an eCommerce company for the next quarter and next one year for financial planning and budgeting

2. Forecast call volume on a given day to efficiently plan resources in a call center

3. Predict trends in the future stock price movement for technical trading of that stock in a stock market

How is a time series forecasting different from a regression modeling?

One of the biggest difference between a time series and regression modeling is that a time series leverages the past value of the same variable to predict what is going to happen in the future.

On the other hand, a regression modeling such as a multiple linear regression will predict the value of a certain variable as a function of other variables

Let’s take an example to make this point more clear. If you are trying to protect the sales of an E-Commerce company as a function of what has been the sales in the past quarter this is a time series modelling

On the other hand, if you are trying to predict the sales of the same E-Commerce company as a function of other variables such as the marketing spend, price of the product and other such contributing factors, it is a regression modelling

What are the constituents of a time series?

A time series could be made up of following main parts

1. Trend- A systematic pattern of how the time series is behaving over a period of time. For example- GDP of emerging economies such as India is growing over a period of time

2. Seasonality- Peaks and troughs which happen during the same time. For example- sales of retailers in US goes higher during Thanksgiving and Black Friday

3. Random noise- As the name suggests, this is the random pattern in a time series

4. Cyclical- Cycles such as Fuel prices go low during certain time and higher at other times. Generally speaking a cycle is long in duration.

Please note that not all time series will have all these components.

Let’s look at example of the time series components. This has been done in R using the decompose function.

Additive Seasonal Model-

This model is used when the time series shows additive seasonality. For example, an eCommerce company sales in October of each year is $2MM USD higher than the base level sales regardless of what is the base level sales in that particular year. In very simplified mathematical equation it can be represented as

Observed = Trend + Seasonal + Random

Please take a look at the slide 2 and 3 of the below presentation

Multiplicative Seasonal Model-

This model is used when the time series shows multiplicative seasonality. For example, for an eCommerce company sales in October of each year is 1.2 times the base level sales in the year. If a particular year has low base level sales, the sales in October will be lower in absolute sense, however it will be 1.2x of the base level sales. In very simplified mathematical equation it can be represented as

Observed = Trend x Seasonal x Random

Please take a look a the slide 4 of the below presentation

Let’s now fit Exponential Smoothing to the above data example. Holt Winters is one of the most popular technique for doing exponential smoothing of a time series data. Moreover, we can fit both additive and multiplicative seasonal time series using HoltWinters() function in R.

There are many parameters that one can pass on this method, however one doesn’t need to pick these parameters as R will automatically pick the best settings to minimize the Square Error between the predicted and the actual values for the forecast.

The three most important parameters that one needs to pay attention to are-

alpha = Value of smoothing parameter for the base level.
beta = Value of smoothing parameter for the trend.
gamma = Value of smoothing parameter for the seasonal component.

All three of the above parameters range between 0 and 1

If beta and gamma are both zero and alpha is non zero, this is known as Single Exponential Smoothing
If gamma is zero but both beta and alpha are non zero, this is known as Double Exponential Smoothing with trend
If all three of them are non zero, this is knows as Triple Exponential Smoothing or Holt Winters with trend and seasonality.

In the below example, we will let R choose the optimized parameters for us.

Additive Seasonal Holt Winters Model

Let’s fit an additive model first and compute MAE. The general form of an additive model is shown below.

yt = base + linear * t + St + Random Error

Where

yt = forecast at time period t

base = Base signal

linear = linear trend component

t= time period t

St = Additive seasonal factor

This is the model that R has fitted for us-

HoltWinters(x = fl, seasonal = “additive”)

Smoothing parameters:
alpha: 0.2479595
beta : 0.03453373
gamma: 1

Coefficients:
[,1]
a 477.827781
b 3.127627
s1 -27.457685
s2 -54.692464
s3 -20.174608
s4 12.919120
s5 18.873607
s6 75.294426
s7 152.888368
s8 134.613464
s9 33.778349
s10 -18.379060
s11 -87.772408
s12 -45.827781

See Slide # 11 on how to use the above model output to compute forecast for any given time period. However, you don’t have to do it by hand as R will do it for you. Nevertheless, good to know how to use the above model output.

Finally, let’s notice that MAE of the additive model comes out to be 9.774438

Multiplicative Seasonal Holt Winters Model

The general form of a multiplicative model is shown below-

yt = (base + linear * t )* St + Random Error

Where

yt = forecast at time period t

base = Base signal

linear = linear trend component

t= time period t

St = Additive seasonal factor

This is the model that R has fitted for us-

Call:
HoltWinters(x = fl, seasonal = “multiplicative”)

Smoothing parameters:
alpha: 0.2755925
beta : 0.03269295
gamma: 0.8707292

Coefficients:
[,1]
a 469.3232206
b 3.0215391
s1 0.9464611
s2 0.8829239
s3 0.9717369
s4 1.0304825
s5 1.0476884
s6 1.1805272
s7 1.3590778
s8 1.3331706
s9 1.1083381
s10 0.9868813
s11 0.8361333
s12 0.9209877

As you can see from the above output, the seasonality shows that demand for the air travel is the highest in July and August of each year and lowest in November.

Moreover the MAE for this model is 8.393662. Therefore, in this case a multiplicative Holt Winters seasonal model is able to provide us a better forecast compared to an additive model.

All the codes and output can be found here and in the below presentation.

Here is the forecast generated from the model-

HoltWinters Timeseries in R- Forecast for next 20 months using Multiplicative Model

You can do the HoltWinters Forecast in Excel as well using the below simple steps-

Thank you!

Analytical Problem Solving- Types of Reasoning

March 18, 2018May 16, 2021 / RP / Leave a comment

To solve any problem we need to have some way of breaking down the problem. There are two main ways of reasoning to that effect-

Deductive Reasoning– This is also called as “Top Down” approach or “Formal Logic” approach. The key here is to form hypotheses to explain a certain phenomenon and then go to reject or accept related hypotheses. The conclusions and recommendation coming out from this sort of reasoning are more certain and factual in nature.
- For example, let’s say you are trying to explain why a certain car gives lower miles per gallon. Because you know the business and have more context on this problem, you can start with potential hypotheses-
  - Weight of the car is high
  - Car has higher number or cylinders
  - Car has higher horse power
  - and so on…

You will check each of the above hypotheses and reach to a definite conclusion.

Inductive Reasoning– On the other hand, this is a “Bottom Up” approach or “Informal Logic” approach. This sort of reasoning is more exploratory in nature. The end goal is to form some hypotheses to give possible reasons to explain certain phenomenon.
- For example, let’s say you are trying to explain why sales of an eCommerce company has gone down in a particular quarter. You may begin by an exploratory analysis where you can begin with potential driver factors such as-
  - Marketing spend of the company
  - Pricing
  - Competitive landscape
  - Macro economic factors

You will do data analysis to correlate each of the above factors to the sales and find potential reasons or build potential hypotheses to be tested further.

Cheers!

Lasso, Ridge and Elastic Net Regularization

March 18, 2018August 5, 2025 / RP / 1 Comment

Regularized Regression Techniques: Lasso, Ridge, and Elastic Net

Simple image showing regression types comparison without text

Regularized regression methods are essential tools in machine learning and statistics that help prevent overfitting and improve model generalization by adding penalty terms to the standard linear regression cost function. These techniques are particularly valuable when dealing with high-dimensional data or when the number of features approaches or exceeds the number of observations.

Ridge regression, also known as L2 regularization, adds a penalty term proportional to the sum of squared coefficients to the ordinary least squares (OLS) cost function. This method shrinks the regression coefficients toward zero but never makes them exactly zero.

Ridge Regression ( L2 )

Objective
minimize: Σ ( yᵢ − ŷᵢ )² + λ · Σ βⱼ²

ŷᵢ = β₀ + Σ βⱼ xᵢⱼ
λ ≥ 0 controls the amount of shrinkage
All predictors stay in the model, but coefficients are pulled toward 0.

Lasso Regression ( L1 )

Lasso regression (Least Absolute Shrinkage and Selection Operator) uses L1 regularization, adding a penalty term proportional to the sum of absolute values of coefficients. Unlike Ridge regression, Lasso can drive coefficients to exactly zero, effectively performing feature selection.

Objective
minimize: Σ ( yᵢ − ŷᵢ )² + λ · Σ |βⱼ|

Same ŷᵢ and λ as above
The absolute-value penalty can push some βⱼ exactly to 0, giving automatic feature selection.

Elastic Net ( L1 + L2 )

Elastic Net regression combines both L1 and L2 penalties, leveraging the strengths of both Ridge and Lasso regression methods. This hybrid approach is particularly effective when dealing with groups of correlated features.

Objective
minimize: Σ ( yᵢ − ŷᵢ )² + λ · [ α · Σ |βⱼ| + (1 − α) · Σ βⱼ² ]

0 ≤ α ≤ 1 mixes the two penalties
– α = 1 → pure Lasso
– α = 0 → pure Ridge
Balances variable selection (L1) with coefficient shrinkage (L2), handling groups of correlated features well.

How to choose

Ridge – keep all features, tame multicollinearity.
Lasso – need a sparse, easily interpretable model.
Elastic Net – expect correlated predictors or want a middle ground.

Tune λ (and α for Elastic Net) with cross-validation for best performance.

Here is a working example code on the Housing data. Please note, generally before doing regularized GLM regression it is advised to scale variables. However, in the below example we are working with the variables on the original scale to demonstrate each algorithms working.

# Import necessary libraries
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the California Housing dataset
data = fetch_california_housing(as_frame=True)
df = data.frame

# Preprocess the data
X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Fit Linear Regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
y_pred_linear = linear_model.predict(X_test)

# Fit Lasso Regression model with hyperparameter tuning
lasso_params = {'alpha': [0.01, 0.1, 1, 10, 100]}
lasso_model = GridSearchCV(Lasso(), lasso_params, cv=5, scoring='r2')
lasso_model.fit(X_train, y_train)
y_pred_lasso = lasso_model.best_estimator_.predict(X_test)

# Fit Ridge Regression model with hyperparameter tuning
ridge_params = {'alpha': [0.01, 0.1, 1, 10, 100]}
ridge_model = GridSearchCV(Ridge(), ridge_params, cv=5, scoring='r2')
ridge_model.fit(X_train, y_train)
y_pred_ridge = ridge_model.best_estimator_.predict(X_test)

# Fit Elastic Net Regression model with hyperparameter tuning
elastic_params = {'alpha': [0.01, 0.1, 1, 10, 100], 'l1_ratio': [0.1, 0.5, 0.9]}
elastic_model = GridSearchCV(ElasticNet(), elastic_params, cv=5, scoring='r2')
elastic_model.fit(X_train, y_train)
y_pred_elastic = elastic_model.best_estimator_.predict(X_test)

# Evaluate models
results = pd.DataFrame({
    'Model': ['Linear', 'Lasso', 'Ridge', 'Elastic Net'],
    'MSE': [mean_squared_error(y_test, y_pred_linear),
            mean_squared_error(y_test, y_pred_lasso),
            mean_squared_error(y_test, y_pred_ridge),
            mean_squared_error(y_test, y_pred_elastic)],
    'MAE': [mean_absolute_error(y_test, y_pred_linear),
            mean_absolute_error(y_test, y_pred_lasso),
            mean_absolute_error(y_test, y_pred_ridge),
            mean_absolute_error(y_test, y_pred_elastic)],
    'R²': [r2_score(y_test, y_pred_linear),
           r2_score(y_test, y_pred_lasso),
           r2_score(y_test, y_pred_ridge),
           r2_score(y_test, y_pred_elastic)]
})
print(results)

# Plot model performance
results.set_index('Model').plot(kind='bar', figsize=(10, 6))
plt.title('Model Performance Comparison')
plt.ylabel('Metric Value')
plt.show()

# Display coefficients for each model
coefficients = pd.DataFrame({
    'Feature': data.feature_names,
    'Linear': linear_model.coef_,
    'Lasso': lasso_model.best_estimator_.coef_,
    'Ridge': ridge_model.best_estimator_.coef_,
    'Elastic Net': elastic_model.best_estimator_.coef_
})
print(coefficients)

# Plot coefficients
coefficients.set_index('Feature').plot(kind='bar', figsize=(12, 8))
plt.title('Feature Coefficients for Each Model')
plt.ylabel('Coefficient Value')
plt.show()

Model Persistence Using Python Pickle

March 9, 2018May 16, 2021 / RP / 1 Comment

After you have built a machine learning model which is doing a great job in prediction, you don’t have to retrain your model again and again for future usage. Instead, you can use Python pickle serialization for reusing this model in future and transferring it into a production environment where non modelers can also use this model to make predictions.

By Renee Comet (photographer) [Public domain], via Wikimedia Commons

First let’s look at how Wikipedia defines a pickle

Pickling is the process of preserving or expanding the lifespan of food by either anaerobic fermentation in brine or immersion in vinegar. The resulting food is called a pickle.

Python pickling is the same process without brine or vinegar, whereas you will pickle your model for longer usage without the need for you to recook your models. In a “Pickling” process a Python object is converted into a byte stream. On the other hand, in an “Unpickling” process a byte stream is converted back into Python object.

I strongly recommend that you read Python Official Documentation on this topic before moving forward.

Now let’s see this live in action. We will first look at a simple example and then look at a model example.

Example 1- In this we will pickle and un-pickle a simple Python list

pickle1

Example 2- In this we will pickle and un-pickle a Decision Tree classifier and use it later for making predictions on a new data

pickle2 pickle3

For more details, do check out this excellent presentation.

Cheers!

RP’s Blog on AI

Connect with RP- https://www.linkedin.com/in/ratnakarpandey/

Author: RP

Decision Tree Regression using Scikit

Decision Tree for Regression: Complete Guide and Implementation

Core Concepts and How It Works

How Decision Trees Work for Regression

Key Mathematical Concepts

Standard Deviation Reduction

Sum of Squared Errors (SSE)

Mean Squared Error (MSE)

Core Algorithm Components

Splitting Criteria

Stopping Criteria

Pruning

Key Assumptions and Highlights

Advantages and Limitations

Complete Python Implementation

Support Vector Machine (SVM)

How to Change Browser for Jupyter Notebook

Overview of Banking and Financial Services Industry

Exploratory Data Analysis and kNN Classification on Iris Dataset

Auto Regressive Integrated Moving Average (ARIMA) Time Series Forecasting

Holt Winters Time Series Forecasting

Analytical Problem Solving- Types of Reasoning

Lasso, Ridge and Elastic Net Regularization

Regularized Regression Techniques: Lasso, Ridge, and Elastic Net

Ridge Regression ( L2 )

Lasso Regression ( L1 )

Elastic Net ( L1 + L2 )

Model Persistence Using Python Pickle

Decision Tree for Regression: Complete Guide and Implementation

Core Concepts and How It Works

How Decision Trees Work for Regression

Key Mathematical Concepts

Standard Deviation Reduction

Sum of Squared Errors (SSE)

Mean Squared Error (MSE)

Core Algorithm Components

Splitting Criteria

Stopping Criteria

Pruning

Key Assumptions and Highlights

Advantages and Limitations

Complete Python Implementation

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Regularized Regression Techniques: Lasso, Ridge, and Elastic Net

Ridge Regression ( L2 )

Lasso Regression ( L1 )

Elastic Net ( L1 + L2 )

Share this:

Share this: