Machine Learning for Online Retail Analytics & Customer Segmentation

This repository contains the Machine Learning Project, which is part of my MSc in Business Mathematics.
The project applies machine learning to customer segmentation, product-description categorization, association rules mining and network graph analysis.
It segments customers by purchasing behavior, categorizes products with NLP techniques, uncovers purchasing patterns through association rules and analyzes invoice networks with graph algorithms, using the Online Retail Data Set available at the UCI Machine Learning Repository.

Supervising Professor: D. Panagopoulos

Achievement 🏆

Evaluation: Excellent.
The project was recognized for its innovative approach and effective application of advanced machine learning methods.

Overview

The Data Set
Part I
Part II
- 2.1 RFM Analysis
- 2.2 Customer Segmentation using RFM Quantile Scores and Category Spending Patterns
Part III

The Data Set

The project uses the Online Retail Data Set from the UCI Machine Learning Repository. This is a transnational data set that contains all the transactions occurring between 1 December 2010 and 9 December 2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

Access the Data Set: Online Retail UCI Machine Learning Repository
Variables Table

Variable Name	Role	Type	Description
InvoiceNo	ID	Categorical	A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation or a reversal.
StockCode	ID	Categorical	A 5-digit integral number uniquely assigned to each distinct product.
Description	Feature	Categorical	Product name.
Quantity	Feature	Integer	The quantities of each product (item) per transaction.
InvoiceDate	Feature	Date	The day and time when each transaction was generated.
UnitPrice	Feature	Continuous	Product price per unit.
CustomerID	ID	Categorical	A 5-digit integral number uniquely assigned to each customer.
Country	Feature	Categorical	The name of the country where each customer resides.

1. Part I

View Part I Notebook

1.1 Data Preparation & Feature Engineering

`Deduplication, Missing CustomerID Removal and Data Type Conversion`

Remove duplicate entries to ensure the uniqueness of each transaction in the dataset.
Entries without a 'CustomerID' are removed since IDs are essential for customer-specific analyses.
Convert data types for better compatibility with analysis tools. For instance, ensuring that dates are in datetime format and categorical data are treated as such.
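The three steps above can be sketched with pandas roughly as follows. This is a minimal illustration on made-up rows, not the notebook's code: the `clean` helper is a hypothetical name, and casting 'CustomerID' to a string and 'Country' to a categorical type are assumptions about what "better compatibility" means here.

```python
import pandas as pd

# Toy rows in the shape of the Online Retail data (values are made up).
raw = pd.DataFrame(
    {
        "InvoiceNo": ["536365", "536365", "536366", "C536379"],
        "StockCode": ["85123A", "85123A", "22633", "D"],
        "Quantity": [6, 6, 6, -1],
        "InvoiceDate": ["2010-12-01 08:26:00", "2010-12-01 08:26:00",
                        "2010-12-01 08:28:00", "2010-12-01 09:41:00"],
        "UnitPrice": [2.55, 2.55, 1.85, 27.50],
        "CustomerID": [17850.0, 17850.0, None, 14527.0],
        "Country": ["United Kingdom"] * 4,
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: deduplicate, drop rows without a customer, fix dtypes."""
    out = df.drop_duplicates()                       # exact duplicate rows
    out = out.dropna(subset=["CustomerID"]).copy()   # customer-level analyses need an ID
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    out["CustomerID"] = out["CustomerID"].astype(int).astype(str)  # an ID, not a quantity
    out["Country"] = out["Country"].astype("category")
    return out.reset_index(drop=True)


cleaned = clean(raw)
```

On the toy frame, one duplicate row and one row without a 'CustomerID' are dropped, leaving two rows.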

`Handling Cancellations and Returns`

The process for handling cancellations and returns is crucial to ensure data integrity and accuracy in analysis:

A transaction with negative 'quantity' is treated as a return when the same 'customer_id' has an earlier positive transaction for the same 'product_id'. This check confirms that the negative entry reverses a recorded purchase rather than reflecting an error or irregularity. A negative transaction with no earlier matching purchase is treated as a cancellation. Some of these cancellations may be genuine returns whose original purchase was never recorded, but because they cannot be matched, we remove them from the dataset to avoid skewing the analysis.
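A small sketch of this matching rule, in plain Python so the logic stays visible. The `split_returns` helper, the `t` ordering field and the `kind` label are hypothetical, and the rows are made up; the notebook may implement the rule differently.

```python
# Toy line items; 't' is a stand-in for the invoice timestamp.
rows = [
    {"customer_id": "A", "product_id": "P1", "quantity": 5, "t": 1},
    {"customer_id": "A", "product_id": "P1", "quantity": -2, "t": 2},  # earlier purchase exists
    {"customer_id": "B", "product_id": "P2", "quantity": -1, "t": 3},  # nothing to match
    {"customer_id": "B", "product_id": "P2", "quantity": 4, "t": 4},
]


def split_returns(rows):
    """Keep purchases and matched returns; set aside unmatched negative rows."""
    seen = set()            # (customer, product) pairs with a purchase so far
    kept, dropped = [], []
    for row in sorted(rows, key=lambda r: r["t"]):
        key = (row["customer_id"], row["product_id"])
        if row["quantity"] > 0:
            seen.add(key)
            kept.append({**row, "kind": "purchase"})
        elif key in seen:
            kept.append({**row, "kind": "return"})
        else:
            dropped.append(row)  # cancellation without a recorded purchase
    return kept, dropped


kept, dropped = split_returns(rows)
```

Processing rows in time order matters: the purchase by customer B at `t=4` does not rescue the negative row at `t=3`, because the rule asks for a prior purchase.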

`Data Enhancement`

Date Features
Extract additional features from 'InvoiceDate' such as day of the week, month, and hour to uncover patterns related to time.
Invoice Total
Calculate the total amount for each invoice to facilitate revenue analysis.
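As an illustration, both enhancements take a few lines of pandas. The new column names ('DayOfWeek', 'Month', 'Hour', 'LineTotal', 'InvoiceTotal') are assumptions chosen for this sketch, and the rows are made up.

```python
import pandas as pd

lines = pd.DataFrame(
    {
        "InvoiceNo": ["536365", "536365", "536366"],
        "Quantity": [6, 2, 3],
        "UnitPrice": [2.50, 1.00, 4.00],
        "InvoiceDate": pd.to_datetime(
            ["2010-12-01 08:26:00", "2010-12-01 08:26:00", "2010-12-05 14:05:00"]
        ),
    }
)

# Date features taken from the timestamp.
lines["DayOfWeek"] = lines["InvoiceDate"].dt.dayofweek  # Monday is 0
lines["Month"] = lines["InvoiceDate"].dt.month
lines["Hour"] = lines["InvoiceDate"].dt.hour

# Line total first, then the per-invoice sum broadcast back to every line.
lines["LineTotal"] = lines["Quantity"] * lines["UnitPrice"]
lines["InvoiceTotal"] = lines.groupby("InvoiceNo")["LineTotal"].transform("sum")
```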

1.2 Exploratory Analysis

This phase involved a deep dive into the dataset to understand the distribution of variables.

`Key Areas of Focus`

Sales Distribution by Time
Analyzed sales data to uncover trends across different timescales (hourly, daily, and monthly). This helps in understanding peak shopping hours, busiest shopping days, and seasonal trends, which are essential for inventory and marketing strategies.

`Sales by Time of the Day`

`Sales by Day of the Month`

`Sales by Month`

Customer Demographics
Explored demographic data, enhancing the understanding of our customer base.

Revenue by Country
Mapped out revenue generation across different countries to pinpoint high-value markets and assess the global reach of the business.

Product Popularity
Investigated the most frequently purchased items by analyzing the 'Description' field.

1.3 NLP-Driven Product Categorization

Organized 3,878 product descriptions into well-defined categories. The goal of this categorization is to uncover inherent groupings that reveal patterns and similarities not immediately apparent in the raw data.

`Methodology`

Our approach involves a process of text preprocessing and key term extraction to prepare the data for robust clustering:

`Text Preprocessing & Key Term Extraction`

Identifying Key Parts of Speech
We specifically target nouns in product descriptions as they often indicate key features or elements.
Condensing Words to Their Roots
Applying stemming techniques, we reduce words to their root forms.
Extracting and Counting Terms
From the stemmed nouns, we compile a frequency map to measure the significance of each term within the set of texts.
Selecting Representative Terms
For groups of words sharing the same root, we choose the shortest term as the representative for simplicity and clarity.
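The counting and representative-selection steps might look like the sketch below. It assumes the nouns have already been extracted by a part-of-speech tagger, and `crude_stem` is a toy stand-in that only strips a trailing "s"; a real stemmer would be used in practice. Both helper names are hypothetical.

```python
from collections import Counter, defaultdict


def crude_stem(word: str) -> str:
    """Toy stand-in for a real stemmer: drops a single trailing 's' only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def keyword_table(noun_lists):
    """Count stems, then label each stem with its shortest surface form."""
    counts = Counter()
    variants = defaultdict(set)
    for nouns in noun_lists:
        for noun in nouns:
            word = noun.lower()
            root = crude_stem(word)
            counts[root] += 1
            variants[root].add(word)
    # Shortest variant represents the root; ties are broken alphabetically.
    return {
        min(variants[root], key=lambda w: (len(w), w)): n
        for root, n in counts.items()
    }


# Nouns per product description, assumed to come from a POS tagger.
nouns_per_description = [["HEART", "HOLDER"], ["HEARTS", "BAG"], ["BAGS", "HEART"]]
table = keyword_table(nouns_per_description)
```

Here "HEART" and "HEARTS" share a root, so their counts are merged and the shorter form "heart" is kept as the label.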

`Words Occurrence`

`Keyword Filtering Strategy`

Exclusion Criteria
We filter out terms based on their commonality, rarity, and informativeness. Only terms that contribute meaningfully to product differentiation are retained.
Threshold Settings
We apply a minimum word-length threshold to keep the focus on substantial terms, and we exclude words containing non-contributive characters such as '+' or '/'.

`Constructing a Data Matrix for Clustering`

Binary Variable Transformation
Each qualifying keyword is converted into a binary variable for each product description, indicating the presence or absence of that keyword.
Integration of Price Range Data
By including price segmentation, we add an economic dimension to the clustering process, aligning it with both qualitative and quantitative attributes.
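A minimal sketch of such a matrix is shown below. The keyword list, the price cut points in `price_edges` and the `build_matrix` helper are all hypothetical, and keywords are matched against raw words here rather than stems to keep the example short.

```python
def build_matrix(products, keywords, price_edges):
    """One row per product: keyword indicators, then one-hot price-range columns."""
    rows = []
    for description, unit_price in products:
        words = set(description.lower().split())
        keyword_bits = [int(k in words) for k in keywords]
        # Bucket index is the number of cut points the price exceeds.
        bucket = sum(unit_price > edge for edge in price_edges)
        price_bits = [int(i == bucket) for i in range(len(price_edges) + 1)]
        rows.append(keyword_bits + price_bits)
    return rows


keywords = ["heart", "bag", "vintage"]
price_edges = [1.0, 5.0]  # hypothetical ranges: up to 1, above 1 up to 5, above 5
products = [("RED HEART BAG", 0.85), ("VINTAGE BAG", 4.95), ("GLASS JAR", 12.50)]
matrix = build_matrix(products, keywords, price_edges)
```

Each row has three keyword columns followed by three price-range columns, so a product with no matching keyword is still placed by its price.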

`K-Means Clustering`

`Clustering Output`

The (scaled) data matrix, comprising rows of individual products and columns of features (keywords and price ranges), serves as the input for clustering algorithms. The outcome is a set of clusters, each labeled with an indicative name reflecting the thematic essence captured by the cluster:

Cluster Number	Cluster Name
0	Vintage Design
1	Classic Artistry
2	Urban Home & Jewellery
3	Accessories
4	Holiday Essentials

`t-SNE Visualization of Clusters`

`Word Clouds of Clusters`

Note: The names assigned to each cluster are not absolute but serve as labels to facilitate easier analysis and discussion.

`Product Categories`

`Further Handling of Negative Values using Product Categories`

To further address negative total prices in the dataset, we identify transactions with negative amounts, group the data by customer and product category, and flag the customer-category pairs whose total spending is negative. The specific invoices responsible for these negative totals are isolated, and their total prices are set to zero. This adjustment keeps returns, discounts, or data entry errors from distorting later analysis.
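One possible reading of this adjustment is sketched below: within any customer and category pair whose summed spending is negative, the negative rows are set to zero. The field names, the `zero_out_negative_groups` helper and the data are assumptions made for illustration, and the notebook may isolate the offending invoices differently.

```python
from collections import defaultdict


def zero_out_negative_groups(rows):
    """Zero the negative rows of every (customer, category) group with a negative sum."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["customer"], r["category"])] += r["total"]
    fixed = []
    for r in rows:
        group_is_negative = totals[(r["customer"], r["category"])] < 0
        if group_is_negative and r["total"] < 0:
            fixed.append({**r, "total": 0.0})
        else:
            fixed.append(dict(r))
    return fixed


rows = [
    {"customer": "A", "category": "Accessories", "total": 10.0},
    {"customer": "A", "category": "Accessories", "total": -4.0},     # group sum stays positive
    {"customer": "B", "category": "Vintage Design", "total": 5.0},
    {"customer": "B", "category": "Vintage Design", "total": -20.0},  # drives the group negative
]
fixed = zero_out_negative_groups(rows)
```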

2. Part II

View Part II Notebook

2.1 RFM Analysis

Applied Recency, Frequency, Monetary (RFM) metrics to segment customers based on their purchasing patterns, identifying key customer groups.

R (Recency)
Recency measures how recently a customer made a purchase. This metric helps to identify customers who have engaged with the brand recently, under the assumption that the more recent the purchase, the more likely the customer will remain engaged.
F (Frequency)
Frequency measures how often a customer makes a purchase within a defined time period. A higher frequency indicates a higher engagement level and loyalty.
M (Monetary Value)
Monetary value measures how much money a customer has spent with the brand over a period of time. It helps in identifying the highest spending customers who are contributing more to the revenue.
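To make the three definitions concrete, here is a small sketch that derives them from invoice-level records. The `rfm` helper and the tuple layout are hypothetical, the invoices are made up, and the snapshot date (the day after the last date in the data set) is an assumption.

```python
from datetime import date


def rfm(invoices, snapshot):
    """invoices: iterable of (customer_id, invoice_no, invoice_date, amount)."""
    table = {}
    for customer, invoice_no, day, amount in invoices:
        entry = table.setdefault(
            customer, {"last": day, "invoices": set(), "monetary": 0.0}
        )
        entry["last"] = max(entry["last"], day)
        entry["invoices"].add(invoice_no)
        entry["monetary"] += amount
    return {
        customer: {
            "recency": (snapshot - e["last"]).days,  # days since the last purchase
            "frequency": len(e["invoices"]),          # number of distinct invoices
            "monetary": e["monetary"],                # total amount spent
        }
        for customer, e in table.items()
    }


invoices = [
    ("17850", "536365", date(2010, 12, 1), 17.0),
    ("17850", "541000", date(2011, 11, 30), 33.0),
    ("14527", "536379", date(2011, 3, 1), 27.5),
]
scores = rfm(invoices, snapshot=date(2011, 12, 10))
```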

`RFM Distributions`

`Outlier Handling - Upper Clipping`

`K-Means Clustering`

`Clustering Results`

Customer Category	Average Recency	Average Frequency	Average Monetary	Customer Count
At-Risk	249.5	1.7	408.8	1019
Loyal Customers	30.4	9.1	3224.9	714
Potential Loyalists	48.1	2.9	720.2	2404
VIPs	10.6	25.5	5356.5	202

2.2 Customer Segmentation using RFM Quantile Scores and Category Spending Patterns

In this stage of our analysis, we further refine our customer segmentation by applying quantile scoring to the RFM metrics and analyzing spending patterns across various product categories. This dual approach allows us to gain a more nuanced understanding of customer behaviors and preferences.

`Features`

Revenue Percentage per Category per Customer
We calculate the revenue percentage for each product category per customer by dividing the revenue generated by a customer in a specific category by their total spending across all categories. This metric is expressed as a percentage, highlighting the proportion of total spending dedicated to each category.
"Dominant" Category per Customer
The 'Dominant' Category for each customer is determined by identifying which product category has the highest total spending or transaction count for that customer.
Recency, Frequency & Monetary Scores (0-99) using Quantiles
Each customer receives a score from 0 to 99 for each RFM metric based on their quantile rank. These scores put the RFM metrics on a scale comparable to the percentage-based features above, so that all features can be combined in a single analysis.
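The three feature types can be illustrated with a short sketch. Both helpers are hypothetical: `category_shares` picks the dominant category by spending (the transaction-count variant is not shown), and `quantile_scores` uses the share of customers strictly below a value as the quantile rank, which is only one of several possible conventions. Reversing the score for recency, so that recent buyers score high, is also an assumption.

```python
def category_shares(spend_by_category):
    """Percent of a customer's spending per category, plus the top category.

    Assumes the customer's total spending is positive.
    """
    total = sum(spend_by_category.values())
    shares = {c: 100.0 * v / total for c, v in spend_by_category.items()}
    dominant = max(spend_by_category, key=spend_by_category.get)
    return shares, dominant


def quantile_scores(values, higher_is_better=True):
    """Score each value 0-99 by the share of values strictly below it."""
    n = len(values)
    scores = []
    for v in values:
        below = sum(other < v for other in values)
        pct = int(100 * below / n)  # at most 99 because v is never below itself
        scores.append(pct if higher_is_better else 99 - pct)
    return scores


shares, dominant = category_shares({"Accessories": 30.0, "Vintage Design": 70.0})
monetary_scores = quantile_scores([100.0, 250.0, 400.0, 900.0])
recency_scores = quantile_scores([10, 284, 40, 5], higher_is_better=False)
```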

`Clustering Evaluation Methods`

Feature Importance Analysis using Random Forest
Trained a Random Forest classifier to predict the cluster labels from the input features. Its feature importances indicate which attributes are most influential in defining customer segments.
Performance Metrics
Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index
t-SNE Visualization
Visualization of High-Dimensional data in two dimensions, highlighting the distribution of clusters
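For readers unfamiliar with these metrics, the sketch below computes the mean Silhouette Score from its definition in plain Python. It is an illustration of the metric only; a library implementation would normally be used, and the `silhouette` helper and toy points are made up.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # compact, well separated clusters
bad = silhouette(points, [0, 1, 0, 1])   # clusters that cut across the gap
```

Values near 1 indicate tight, well separated clusters, while negative values suggest points sit closer to another cluster than to their own.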

`Clustering Techniques Applied`

Following the evaluation of feature importance and performance metrics, we adopted a multi-step clustering approach to optimize our segmentation strategy. This involved the sequential application of different clustering algorithms to refine our clusters and improve the granularity of our customer segmentation. All features were scaled appropriately before applying the algorithms and the results were then assessed back on the initial dataset to ensure consistency and reliability.

K-Means Clustering
Initially, we applied the K-Means clustering algorithm to establish a baseline for segmentation. K-Means was chosen for its efficiency and effectiveness in grouping large data sets into k distinct clusters based on attribute similarity.

Hierarchical Clustering
To further refine our cluster definitions and potentially identify a more optimal number of clusters, we then employed Hierarchical Clustering. This method allowed us to visualize and assess different cluster possibilities through a dendrogram, providing insight into how data points are grouped at various levels of granularity.

Revised K-Means Clustering
Based on the insights gained from Hierarchical Clustering, specifically the number of clusters suggested by the dendrogram, we performed a second round of K-Means clustering. This time, we used the cluster count obtained from the hierarchical method as the input for 'k'. This refined approach allowed us to fine-tune our segmentation, leading to more distinct and actionable customer groups.

This iterative clustering strategy (K-Means first, then Hierarchical Clustering, then a revised K-Means) combines the strengths of both methods: the computational efficiency of K-Means and the detailed view of cluster structure provided by Hierarchical Clustering. The final K-Means run, using the value of 'k' informed by the dendrogram, yielded the most meaningful and practical customer segments.
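The final step, K-Means on scaled features with a given 'k', can be sketched in plain Python as below. The hierarchical step is not reproduced: `k=3` simply stands in for a value read off a dendrogram. The farthest-point initialisation is an assumption chosen to keep the example deterministic, and the customer scores are made up.

```python
from math import dist


def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = []
    for c, m in zip(cols, means):
        s = (sum((x - m) ** 2 for x in c) / len(c)) ** 0.5
        stds.append(s if s > 0 else 1.0)
    return [tuple((x - m) / s for x, m, s in zip(r, means, stds)) for r in rows]


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda i: dist(p, centers[i])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[i])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# (recency score, frequency score, monetary score) per customer, made up.
customers = [(90, 85, 80), (88, 90, 86), (20, 15, 10),
             (25, 10, 12), (55, 50, 52), (50, 55, 48)]
labels, centers = kmeans(standardize(customers), k=3)
```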

Below is a breakdown of our customer segmentation analysis, showcasing the distribution of average percentage spending per category, average dominant category indicators, and average RFM scores for each defined customer segment:

Customer Segment	% Accessories	% Classic Artistry	% Holiday Essentials	% Urban Home & Jewellery	% Vintage Design	Dominant Accessories	Dominant Classic Artistry	Dominant Holiday Essentials	Dominant Urban Home & Jewellery	Dominant Vintage Design	Recency Score	Frequency Score	Monetary Score	Number of Customers
Active Urban Home & Jewellery Enthusiasts	14.93	16.67	9.15	43.54	15.72	0.0	0.0	0.0	1.0	0.0	51.78	47.07	46.82	554
High-Value Accessory Enthusiasts	40.13	18.10	8.84	16.30	16.64	1.0	0.0	0.0	0.0	0.0	66.37	72.63	70.53	534
Low-Engagement Accessory Seekers	54.44	13.18	6.51	12.78	12.08	1.0	0.0	0.0	0.0	0.0	26.66	26.70	28.70	594
Low-Engagement Vintage Design Shoppers	11.90	16.72	9.28	11.08	51.01	0.0	0.0	0.0	0.0	1.0	33.92	30.36	30.28	453
Occasional Holiday Essentials Shoppers	9.89	19.77	47.86	9.38	13.10	0.0	0.0	1.0	0.0	0.0	45.52	42.58	38.49	392
Premium Vintage Design Lovers	14.70	21.52	10.52	14.48	38.78	0.0	0.0	0.0	0.0	1.0	70.68	76.25	76.77	455
Understated Classic Artistry Enthusiasts	11.16	50.08	13.13	10.46	15.17	0.0	1.0	0.0	0.0	0.0	33.59	26.80	29.05	635
Valued Classic Artistry Connoisseurs	13.87	37.18	15.01	15.43	18.51	0.0	1.0	0.0	0.0	0.0	66.64	71.89	71.95	722

3. Part III

View Part III Notebook

3.1 Association Rules Mining

Association rules mining is a key technique used to discover interesting relationships between variables in large databases. We utilized this method to uncover patterns and associations between different product categories in our dataset. Here's an overview of the metrics and the algorithm used:

`Metrics Explained`

Support: This metric measures how frequently an itemset appears in the dataset. A higher support indicates that the itemset is more common.
Confidence: Confidence assesses the likelihood that an item Y is purchased when item X is purchased, expressed as a percentage. It indicates the strength of the implication found in the data.
Lift: Lift compares the likelihood of Y being purchased when X is purchased against the likelihood of Y being purchased independently. It helps identify itemsets that are more likely to be bought together than separately.
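Written out, these are the standard definitions, where T denotes the set of transactions and X and Y are itemsets (the symbols are introduced here for illustration):

```latex
\begin{align}
\operatorname{supp}(X) &= \frac{\lvert \{\, t \in T : X \subseteq t \,\} \rvert}{\lvert T \rvert} \\
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} \\
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
  = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)\,\operatorname{supp}(Y)}
\end{align}
```

A lift of 1 means X and Y are purchased together exactly as often as independence would predict; values above 1 indicate a positive association.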

`FP-Growth Algorithm`

For our analysis, we chose the FP-Growth Algorithm due to its efficiency in mining frequent itemsets without candidate generation, which is particularly useful for large datasets.

`Configuration Parameters`

Minimum Support: 0.005
This threshold filters out itemsets that appear in less than 0.5% of all transactions, focusing analysis on more commonly purchased items.
Minimum Lift Threshold: 1.2
Sets a baseline to find item pairs at least 20% more likely to be purchased together than independently, highlighting potentially strong associations.
Max Itemset Length: 2
Restricts itemsets to pairs, simplifying the complexity of analysis and making the findings more actionable for marketing strategies like product bundling or promotions.
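To show how these three settings interact, the sketch below mines pair rules by brute-force pair counting. This is not FP-Growth, which reaches the same frequent itemsets more efficiently on large data; the `pair_rules` helper and the baskets are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.005, min_lift=1.2):
    """Rules X -> Y over item pairs that pass the support and lift thresholds."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))  # itemsets of length 2 only
    rules = []
    for (a, b), together in pair_counts.items():
        support = together / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = together / item_counts[x]
            lift = confidence / (item_counts[y] / n)
            if lift >= min_lift:
                rules.append({"antecedent": x, "consequent": y, "support": support,
                              "confidence": confidence, "lift": lift})
    return rules


baskets = [["A", "B"], ["A", "B"], ["A"], ["C"], ["C", "D"], ["D"]]
# A higher support threshold than 0.005 is used because the toy data is tiny.
rules = pair_rules(baskets, min_support=0.3, min_lift=1.2)
```

With only six toy baskets, the pair C and D appears once (support of about 0.17) and is filtered out, while A and B appear together twice and yield one rule in each direction.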

3.2 Simple Recommendation System

A straightforward recommendation system was developed, utilizing the insights from association rules mining to suggest products to customers based on their historical purchasing patterns, aiming to enhance customer satisfaction and increase sales potential.

`"Because you Liked this Item" Recommendations Overview`

Products in Basket	Recommendations
JAM MAKING SET WITH JARS	JAM MAKING SET PRINTED (Confidence: 32.25%, Lift: 8); SET OF 3 CAKE TINS PANTRY DESIGN (Confidence: 25.75%, Lift: 4.5); RECIPE BOX PANTRY YELLOW DESIGN (Confidence: 20.29%, Lift: 5); SET OF 4 PANTRY JELLY MOULDS (Confidence: 19.45%, Lift: 5)
RED KITCHEN SCALES	IVORY KITCHEN SCALES (Confidence: 58%, Lift: 20); MINT KITCHEN SCALES (Confidence: 26.62%, Lift: 18)
HAND WARMER OWL DESIGN	HAND WARMER RED LOVE HEART (Confidence: 48%, Lift: 20)
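A lookup of this kind can be sketched as follows, using three of the rules from the table. The `recommend` helper and the rule dictionary layout are hypothetical, and ranking by confidence first with lift as a tie-breaker is an assumption.

```python
def recommend(basket, rules, top_n=3):
    """Rank consequents of rules whose antecedent is in the basket."""
    best = {}
    for rule in rules:
        if rule["antecedent"] in basket and rule["consequent"] not in basket:
            current = best.get(rule["consequent"])
            if current is None or rule["confidence"] > current["confidence"]:
                best[rule["consequent"]] = rule
    ranked = sorted(best.values(), key=lambda r: (-r["confidence"], -r["lift"]))
    return [(r["consequent"], r["confidence"], r["lift"]) for r in ranked[:top_n]]


# Confidence and lift values taken from the table above.
rules = [
    {"antecedent": "RED KITCHEN SCALES", "consequent": "IVORY KITCHEN SCALES",
     "confidence": 0.58, "lift": 20.0},
    {"antecedent": "RED KITCHEN SCALES", "consequent": "MINT KITCHEN SCALES",
     "confidence": 0.2662, "lift": 18.0},
    {"antecedent": "HAND WARMER OWL DESIGN", "consequent": "HAND WARMER RED LOVE HEART",
     "confidence": 0.48, "lift": 20.0},
]
suggestions = recommend({"RED KITCHEN SCALES"}, rules)
```

Items already in the basket are never suggested, which matters once a basket holds more than one product.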

3.3 Network Graph Analysis of Product Invoices and Recommendations

In our analysis, we employed association rules mining for both products and product categories to construct a directed graph that encapsulates the relationships between products based on invoice data. The graph is constructed using both direct product-to-product associations and broader category-to-category relationships as a fallback mechanism. This approach aids in identifying key influencers within the product network and simplifies the understanding of complex interdependencies in purchasing behavior.

Detailed Process

Initialize Nodes
- Description: Initializes graph nodes using unique product descriptions from the data.
- Implementation: Iterates over unique descriptions and adds each as a node to the graph.
Add Edge Attributes
- Description: Adds edges between nodes with attributes such as confidence, support, and lift, which signify the direction, strength and significance of the relationships.
- Implementation: For each pair of connected nodes, adds an edge and assigns attributes based on derived association rules. If attributes are missing, defaults to predefined minimal values.
Precompute Descriptions to Categories
- Description: Caches category information for each product description to speed up graph construction.
- Implementation: Compiles a dictionary mapping each product description to its respective category, utilizing data from the dataset to facilitate quick access during graph operations.
Build Graph
- Description: Constructs the network graph using both direct product-to-product associations and broader category-to-category relationships.
- Implementation:
  - Processes groups of products by invoices to determine co-purchase relationships.
  - Directly connects products with strong association rules.
  - Where direct connections are lacking, uses category-level associations as a fallback to ensure comprehensive network connectivity.
Connect Isolated Nodes
- Description: Integrates nodes that remain isolated after the initial construction into the broader network to maintain connectivity.
- Implementation:
  - Identifies isolated nodes using NetworkX functionalities.
  - For each isolated node, evaluates potential category connections and links each to the most central node in the most connected category, determined by the highest confidence measures and using degree centrality to assess centrality within categories.
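The last step, connecting isolated nodes, might be sketched as below. A plain dictionary stands in for the NetworkX graph so the example has no dependencies; the `connect_isolated` helper, the `source` edge attribute and the toy data are assumptions. Raw degree is used in place of degree centrality, which ranks nodes identically because centrality only divides the degree by a constant.

```python
def connect_isolated(graph, category_of, category_rules):
    """graph: node -> {neighbour: attrs}, modified in place. Returns linked nodes."""

    def degree(node):
        incoming = sum(node in nbrs for nbrs in graph.values())
        return len(graph[node]) + incoming

    linked = []
    for node in [n for n in graph if degree(n) == 0]:
        # Category rules leaving this node's category: (confidence, target category).
        options = [(conf, dst) for (src, dst), conf in category_rules.items()
                   if src == category_of[node]]
        if not options:
            continue
        confidence, target = max(options)  # strongest category association
        candidates = [n for n in graph if category_of[n] == target and n != node]
        if not candidates:
            continue
        hub = max(candidates, key=degree)  # most connected product in that category
        graph[node][hub] = {"confidence": confidence, "source": "category"}
        linked.append(node)
    return linked


graph = {
    "A": {"B": {"confidence": 0.5}},
    "B": {},
    "C": {"B": {"confidence": 0.3}},
    "D": {},  # no product-level rule touches D
}
category_of = {"A": "X", "B": "X", "C": "Y", "D": "Y"}
category_rules = {("Y", "X"): 0.4, ("Y", "Y"): 0.1}
linked = connect_isolated(graph, category_of, category_rules)
```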

Personalized PageRank Recommendations

Personalized PageRank adapts the original PageRank algorithm to focus on a specific node (product), emphasizing paths that start from this node. This provides a personalized ranking of all nodes in the graph based on their relevance to the starting node. By tailoring the PageRank to a starting node, we ensure that the recommendations are highly relevant to the user's current interest or recent activity. Personalized PageRank considers the entire graph's structure, which helps identify both direct and indirect relationships between products. This holistic view can reveal less obvious but potentially valuable recommendations.
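The idea can be illustrated with a short power-iteration sketch in which every teleport returns to the starting product. NetworkX offers this through the `personalization` argument of its `pagerank` function; the plain-Python version below, its `personalized_pagerank` name, the damping factor of 0.85 and the toy edge weights are all assumptions for illustration.

```python
def personalized_pagerank(graph, start, alpha=0.85, iterations=100):
    """graph: node -> {neighbour: weight}; every neighbour must also be a key."""
    rank = {n: 1.0 if n == start else 0.0 for n in graph}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in graph}
        for node, out in graph.items():
            total = sum(out.values())
            if total == 0:
                nxt[start] += alpha * rank[node]  # dangling node: mass returns to start
                continue
            for nbr, weight in out.items():
                nxt[nbr] += alpha * rank[node] * weight / total
        nxt[start] += 1.0 - alpha                 # teleport always lands on start
        rank = nxt
    return rank


# Toy co-purchase graph; weights play the role of rule confidence.
graph = {
    "A": {"B": 0.6, "C": 0.2},
    "B": {"C": 0.5},
    "C": {"A": 0.3},
    "D": {"A": 0.4},
}
rank = personalized_pagerank(graph, start="A")
ranked = sorted((n for n in graph if n != "A"), key=rank.get, reverse=True)
```

In this toy graph, C ranks above B even though the direct edge from A to B is stronger, because C also receives score indirectly through B. D cannot be reached from A and therefore receives no score.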

Example

A* Graph Search Algorithm Recommendations ("Path" of recommended products)

A* search algorithm finds the shortest path between nodes in a graph. In this context, the "shortest path" is based on the confidence of co-purchase edges, with higher confidence leading to lower costs. The path offers a personalized sequence of products that a customer is likely to purchase. By dynamically generating product paths starting from a user-selected product, this approach not only enhances the shopping experience by suggesting highly relevant products but also provides actionable insights for optimizing product placements.

Example

Start Node: 'LUNCH BAG RED RETROSPOT'
Goal Node: 'HAND WARMER OWL DESIGN'
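A compact sketch of such a path search is given below. The edge cost of `1 - confidence` is an assumed way of turning higher confidence into lower cost, and the default zero heuristic makes the search behave like uniform-cost search; a more informative heuristic would need to stay admissible. The intermediate products and all confidence values are made up.

```python
import heapq
from itertools import count


def a_star(graph, start, goal, heuristic=lambda node: 0.0):
    """graph: node -> {neighbour: confidence in (0, 1]}. Edge cost is 1 - confidence."""
    tie = count()  # tie-breaker so the heap never has to compare paths
    frontier = [(heuristic(start), next(tie), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while frontier:
        _, _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        if cost > best_cost.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, confidence in graph.get(node, {}).items():
            new_cost = cost + (1.0 - confidence)
            if new_cost < best_cost.get(nbr, float("inf")):
                best_cost[nbr] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(nbr), next(tie), new_cost, nbr, path + [nbr]),
                )
    return None, float("inf")


graph = {
    "LUNCH BAG RED RETROSPOT": {"PRODUCT A": 0.6, "HAND WARMER OWL DESIGN": 0.05},
    "PRODUCT A": {"PRODUCT B": 0.7},
    "PRODUCT B": {"HAND WARMER OWL DESIGN": 0.8},
}
path, cost = a_star(graph, "LUNCH BAG RED RETROSPOT", "HAND WARMER OWL DESIGN")
```

With these toy weights, the three-step route through strongly associated products costs less than the weak direct edge, so the recommended path passes through the two intermediate products.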
