Analysis Modules
Plots
Line Plot
Line plots are particularly good for visualizing sequences that are ordered or sequential, but not necessarily categorical, such as:
- Days since an event (e.g., -2, -1, 0, 1, 2)
- Months since a competitor opened
- Tracking how metrics change across key events
They are often used to compare trends across categories, show the impact of events on performance, and visualize changes over time-like sequences.
Note: While this module can handle datetime values on the x-axis, the plots.time_line plot module has additional features that make working with datetimes easier, such as easily resampling the data to alternate time frames.
Example:
import pandas as pd
from pyretailscience.plots import line
df = pd.DataFrame({
"months_since_event": range(-5, 6),
"category A": [10000, 12000, 13000, 15000, 16000, 17000, 18000, 20000, 21000, 20030, 25000],
"category B": [9000, 10000, 11000, 13000, 14000, 15000, 10000, 7000, 3500, 3000, 2800],
})
line.plot(
df=df,
value_col=["category A", "category B"],
x_label="Months Since Event",
y_label="Revenue (£)",
title="Revenue Trends across Categories",
x_col="months_since_event",
group_col=None,
source_text="Source: PyRetailScience - 2024",
move_legend_outside=True,
)
Period-on-Period Plot
Period-on-period plots help compare the same metric across two or more time intervals, all aligned to a common starting point. This is useful when you want to:
- Compare different promotional weeks
- Analyze performance across multiple holiday seasons
- Benchmark key metrics across repeated events (e.g., monthly product launches)
Each period is overlaid on the same plot, allowing for easy visual comparison of trends across intervals.
Note: Dates are automatically realigned to a reference start year, so all lines start at the same x=0 point, regardless of calendar time.
Example
import pandas as pd
from pyretailscience.plots.period_on_period import plot
periods = [
("2022-01-01", "2022-04-01"),
("2023-01-01", "2023-04-01"),
]
data = {
'date': [
'2022-01-02', '2022-01-09', '2022-01-16', '2022-01-23', '2022-01-30',
'2022-02-06', '2022-02-13', '2022-02-20', '2022-02-27', '2022-03-06',
'2022-03-13', '2022-03-20', '2022-03-27',
'2023-01-03', '2023-01-08', '2023-01-15', '2023-01-22', '2023-01-29',
'2023-02-05', '2023-02-12', '2023-02-19', '2023-02-26', '2023-03-05',
'2023-03-12', '2023-03-19', '2023-03-26',
],
'sales': [
1024, 1199, 1214, 1295, 1249, 1194, 988, 973, 1029, 910, 952, 976, 1099,
1195, 1316, 1317, 1361, 1403, 1240, 1164, 1053, 984, 1051, 1079, 1141, 1169,
]
}
df = pd.DataFrame(data)
plot(
df=df,
x_col="date",
value_col="sales",
periods=periods,
x_label=" ",
y_label="Sales",
title="Period on Period Comparison",
legend_title="Periods",
source_text="Source: PyRetailScience - Sales FY2024",
move_legend_outside=True,
)
Area Plot
Area plots are useful for visualizing cumulative trends, showing relative contributions, and comparing multiple data series over time. They are often used for:
- Visualizing stacked contributions (e.g., market share over time)
- Comparing cumulative sales or revenue
- Showing growth trends across multiple categories
Similar to line plots, area plots can display time-series data, but they emphasize the area under the curve, making them ideal for tracking proportions and cumulative metrics.
Example:
import pandas as pd
import numpy as np
from pyretailscience.plots import area
periods = 6
rng = np.random.default_rng(42)
data = {
"transaction_date": np.repeat(pd.date_range("2023-01-01", periods=periods, freq="ME"), 3),
"unit_spend": rng.integers(1, 6, size=3 * periods),
"category": ["Jeans", "Shoes", "Dresses"] * periods,
}
df = pd.DataFrame(data)
df_pivoted = df.pivot(index="transaction_date", columns="category", values="unit_spend").reset_index()
area.plot(
df=df_pivoted,
value_col=["Jeans", "Dresses", "Shoes"],
x_label="",
y_label="Sales",
title="Sales Trends by Product Category",
x_col="transaction_date",
source_text="Source: PyRetailScience - 2024",
move_legend_outside=True,
alpha=0.5,
)
Scatter Plot
Scatter plots are useful for visualizing relationships between two numerical variables, detecting patterns, and identifying outliers. They are often used for:
- Exploring correlations between variables
- Identifying clusters in data
- Spotting trends and outliers
Scatter plots are particularly useful when analyzing distributions and understanding how one variable influences another. They can also be enhanced with colors and sizes to represent additional dimensions in the data.
Example:
import random
import pandas as pd
from pyretailscience.plots import scatter
months = [
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
]
categories = ["Electronics", "Clothing", "Home Decor", "Sports", "Books"]
data = {
"month": months * len(categories),
"sales": [random.randint(500, 5000) for _ in range(12 * len(categories))],
"profit": [random.randint(100, 2000) for _ in range(12 * len(categories))],
"expenses": [random.randint(300, 4000) for _ in range(12 * len(categories))],
"category": categories * 12,
}
df = pd.DataFrame(data)
scatter.plot(
df=df,
value_col=["sales", "profit", "expenses"],
x_col="month",
x_label="",
y_label="Sales",
title="Sales, Profit & Expenses Scatter Plot",
source_text="Source: PyRetailScience - 2024",
move_legend_outside=True,
alpha=0.8,
)
Venn Diagram
Venn diagrams are useful for visualizing overlaps and relationships between multiple categorical sets. They help in:
- Identifying commonalities and differences between groups
- Understanding intersections between two or three sets
- Highlighting exclusive and shared elements
Venn diagrams provide a clear way to analyze how different groups relate to each other. They are often used in market segmentation, user behavior analysis, and set comparisons.
Example:
import pandas as pd
from pyretailscience.plots import venn
df = pd.DataFrame({
"groups": [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)],
"percent": [0.119403, 0.089552, 0.238806, 0.208955, 0.134328, 0.208955, 0.104111]
})
labels = ["Frequent Buyers", "High-Spenders", "Loyal Members"]
venn.plot(
df,
labels=labels,
title="E-commerce Customer Segmentation",
source_text="Source: PyRetailScience - 2024",
vary_size=False,
subset_label_formatter=lambda v: f"{v:.1%}"
)
Histogram Plot
Histograms are particularly useful for visualizing the distribution of data, allowing you to see how values in one or more metrics are spread across different ranges. This module also supports grouping by categories, enabling you to compare the distributions across different groups. When grouping by a category, multiple histograms are generated on the same plot, allowing for easy comparison across categories.
Histograms are commonly used to analyze:
- Sales, revenue or other metric distributions
- Distribution of customer segments (e.g., by age, income)
- Comparing metric distributions across product categories
This module allows you to customize legends, axes, and other visual elements, as well as apply clipping or filtering on the data values to focus on specific ranges.
Example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pyretailscience.plots import histogram
df = pd.DataFrame({
'first_purchase_revenue': np.concatenate([
np.random.normal(70, 10, 50000),
np.random.normal(90, 15, 50000)
]),
'product': ['Product A'] * 50000 + ['Product B'] * 50000
})
histogram.plot(
df=df,
value_col='first_purchase_revenue',
group_col='product',
title="First Purchase Revenue by Product (£)",
x_label="Revenue (£)",
y_label="Number of Customers",
source_text="Source: PyRetailScience - 2024",
move_legend_outside=True,
use_hatch=True
)
Bar Plot
Bar plots are ideal for visualizing comparisons between categories or groups, showing how metrics such as revenue, sales, or other values vary across different categories. This module allows you to easily group bars by different categories and stack them when comparing multiple metrics. You can also add data labels to display absolute or percentage values for each bar.
Bar plots are frequently used to compare:
- Product sales across regions or quarters
- Revenue across product categories or customer segments
- Performance metrics side by side
This module provides flexibility in customizing legends, axes, and other visual elements, making it easy to represent data across different dimensions, either as grouped or single bar plots.
Example:
import pandas as pd
from pyretailscience.plots import bar
# Example DataFrame with sales data for different product categories
df = pd.DataFrame({
"product": ["A", "B", "C", "D"],
"sales_q1": [25000, 18000, 22000, 15000],
"sales_q2": [35000, 50000, 2000, 5000]
})
# Plot grouped bar chart to show sales across different products and quarters
bar.plot(
df=df,
value_col=["sales_q1", "sales_q2"],
x_col="product",
title="Sales by Product (Q1 vs Q2)",
x_label="Product",
y_label="Sales (£)",
data_label_format="percentage_by_bar_group",
source_text="Source: PyRetailScience - 2024",
move_legend_outside=True,
num_digits=3
)
Waterfall Plot
Waterfall plots are particularly good for showing how different things add or subtract from a starting number. For instance,
- Changes in sales figures from one period to another
- Breakdown of profit margins
- Impact of different product categories on overall revenue
They are often used to identify key drivers of financial performance, highlight areas for improvement, and communicate complex data stories to stakeholders in an intuitive manner.
Example:
from pyretailscience.plots import waterfall
labels = ["New", "Continuning", "Churned"]
amounts = [660000, 420000, -382000]
waterfall.plot(
labels=labels,
amounts=amounts,
title="New customer growth hiding churn issue",
source_text="Source: PyRetailScience - Sales FY2024 vs FY2023",
display_net_bar=True,
rot=0,
)
Index Plots
Index plots are visual tools used in retail analytics to compare different categories or segments against a baseline or average value, typically set at 100. Index plots allow analysts to:
Quickly identify which categories over- or underperform relative to the average Compare performance across diverse categories on a standardized scale Highlight areas of opportunity or concern in retail operations Easily communicate relative performance to stakeholders without revealing sensitive absolute numbers
In retail contexts, index plots are valuable for:
Comparing sales performance across product categories Analyzing customer segment behavior against the overall average Evaluating store or regional performance relative to company-wide metrics Identifying high-potential areas for growth or investment
By normalizing data to an index, these plots facilitate meaningful comparisons and help focus attention on significant deviations from expected performance, supporting more informed decision-making in retail strategy and operations.
Example:
from pyretailscience.plots import index
import pandas as pd
import numpy as np
np.random.seed(42)
categories = ["Music", "Electronics", "Books", "Clothing", "Food", "Home", "Sports", "Beauty"]
segments = ["Light", "Medium", "Heavy"]
data = []
for segment in segments:
for category in categories:
base_price = np.random.uniform(10, 100)
for quarter in ["Q1", "Q2", "Q3", "Q4"]:
data.append({
"segment_name": segment,
"category_0_name": category,
"unit_price": base_price * (1 + np.random.uniform(-0.2, 0.3)),
"quarter": quarter
})
df = pd.DataFrame(data)
index.plot(
df,
value_col="unit_price",
group_col="category_0_name",
index_col= "segment_name",
value_to_index="Light",
agg_func="mean",
title="Music an opportunity category for Light?",
y_label="Categories",
x_label="Indexed Spend",
source_text="Source: Transaction data financial year 2023",
sort_by="value",
sort_order="descending",
legend_title="Quarter",
highlight_range=None
)
Cohort Plot
Cohort plots are essential for understanding customer retention and behavior over time. These visualizations help identify trends in customer engagement, repeat purchases, and churn rates by grouping customers based on their initial interaction or purchase period. They are particularly useful for:
- Analyzing customer retention patterns over time
- Understanding the effectiveness of marketing campaigns in retaining customers
- Identifying the impact of seasonality on repeat purchases
- Evaluating long-term customer engagement with products or services
Example:
import pandas as pd
import numpy as np
from pyretailscience.plots import cohort
cohort_start_dates = [
"2022-12", "2023-01", "2023-02", "2023-03", "2023-04",
"2023-05", "2023-06", "2023-07", "2023-08", "2023-09",
"2023-10", "2023-11", "2023-12"
]
def generate_retention():
values = [1.0]
for _ in range(11):
values.append(max(values[-1] - np.random.uniform(0.05, 0.12), np.random.uniform(0.10, 0.25)))
return values
cohort_data = {"min_period_shopped": cohort_start_dates}
for i in range(12):
cohort_data[i] = [generate_retention()[i] for _ in cohort_start_dates]
df = pd.DataFrame(cohort_data)
df = df.set_index("min_period_shopped").reset_index()
df = df.melt(id_vars=["min_period_shopped"], var_name="period_since", value_name="retention")
df_pivot = df.pivot(index="min_period_shopped", columns="period_since", values="retention")
cohort.plot(
df=df_pivot,
x_label="Months Since Initial Purchase",
y_label="Cohort Start Date",
title="Customer Retention Cohort Analysis",
source_text="Source: PyRetailScience - 2024",
cbar_label="Number of Retained Customers",
percentage=True,
figsize=(8,8),
)
Timeline Plot
Timeline plots are a fundamental tool for interpreting transactional data within a temporal context. By presenting data in a chronological sequence, these visualizations reveal patterns and trends that might otherwise remain hidden in raw numbers, making them essential for both historical analysis and forward-looking insights. They are particularly useful for:
- Tracking sales performance across different periods (e.g., daily, weekly, monthly)
- Identifying seasonal patterns or promotional impacts on sales
- Comparing the performance of different product categories or store locations over time
- Visualizing customer behavior trends, such as purchase frequency or average transaction value
Example:
import numpy as np
import pandas as pd
from pyretailscience.plots import time
# Create a sample DataFrame with 3 groups
rng = np.random.default_rng(42)
df = pd.DataFrame(
{
"transaction_date": pd.concat(
[pd.Series(pd.date_range(start="2022-01-01", periods=200, freq="D"))] * 3
),
"total_price": np.concatenate(
[rng.integers(1, 1000, size=200) * multiplier for multiplier in range(1, 4)]
),
"group": ["Group A"] * 200 + ["Group B"] * 200 + ["Group C"] * 200,
},
)
time.plot(
df,
period="M",
group_col="group",
value_col="total_price",
agg_func="sum",
title="Monthly Sales by Customer Group",
y_label="Sales",
legend_title="Customer Group",
source_text="Source: PyRetailScience - Sales FY2024",
move_legend_outside=True,
)
Analysis Modules
Cohort Analysis
The cohort analysis module provides functionality for analyzing customer retention patterns over time. It helps businesses understand customer behavior by tracking groups of users (cohorts) based on their first interaction and observing their activity over subsequent periods.
Cohort analysis is useful in multiple business applications:
- Customer Retention Analysis: Identifies how long users stay engaged with a product or service.
- Churn Rate Measurement: Helps determine at which stage customers tend to drop off.
- Marketing Performance Evaluation: Measures the long-term impact of marketing campaigns.
- Revenue Analysis: Tracks spending behavior over time to optimize pricing strategies.
- User Engagement Trends: Understands how different user segments behave based on their joining time.
This module calculates cohort tables using various aggregation functions such as nunique
, sum
, and mean
, allowing
flexible analysis of customer data.
The following key metrics are used in the analysis:
- Aggregation Column: Defines the metric to track (e.g., unique customers, total spend).
- Aggregation Function: Determines how values are aggregated (e.g., sum, mean, count).
- Cohort Period: Defines the period granularity (year, quarter, month, week, or day).
- Retention Percentage: Calculates retention rates as a percentage of the first-period cohort.
Example:
import pandas as pd
import datetime
from pyretailscience.analysis.cohort import CohortAnalysis
data = {
"transaction_id": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
"customer_id": [1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5, 4],
"transaction_date": [
datetime.date(2023, 1, 15),
datetime.date(2023, 1, 20),
datetime.date(2023, 2, 5),
datetime.date(2023, 2, 10),
datetime.date(2023, 3, 1),
datetime.date(2023, 3, 15),
datetime.date(2023, 3, 20),
datetime.date(2023, 4, 10),
datetime.date(2023, 4, 25),
datetime.date(2023, 5, 5),
datetime.date(2023, 5, 20),
datetime.date(2023, 6, 10),
],
"unit_spend": [100, 150, 200, 120, 160, 210, 130, 170, 220, 140, 180, 230]
}
df = pd.DataFrame(data)
cohort = CohortAnalysis(
df=df,
aggregation_column="unit_spend",
agg_func="sum",
period="month",
percentage=True,
)
cohort.table.head()
min_period_shopped | 0 | 1 | 2 | 3 |
---|---|---|---|---|
2023-01-01 | 1.00 | 1.00 | 1.00 | 1.00 |
2023-02-01 | 0.80 | 1.75 | 0.76 | 0.00 |
2023-03-01 | 0.00 | 0.00 | 0.00 | 0.00 |
2023-04-01 | 0.00 | 0.00 | 0.00 | 0.00 |
2023-05-01 | 1.28 | 1.92 | 0.00 | 0.00 |
Product Association Rules
The product association module implements functionality for generating product association rules, a powerful technique in retail analytics and market basket analysis.
Product association rules are used to uncover relationships between different products that customers tend to purchase together. These rules provide valuable insights into consumer behavior and purchasing patterns, which can be leveraged by retail businesses in various ways:
-
Cross-selling and upselling: By identifying products frequently bought together, retailers can make targeted product recommendations to increase sales and average order value.
-
Store layout optimization: Understanding product associations helps in strategic product placement within stores, potentially increasing impulse purchases and overall sales.
-
Inventory management: Knowing which products are often bought together aids in maintaining appropriate stock levels and predicting demand.
-
Marketing and promotions: Association rules can guide the creation ofeffective bundle offers and promotional campaigns.
-
Customer segmentation: Patterns in product associations can reveal distinct customer segments with specific preferences.
-
New product development: Insights from association rules can inform decisions about new product lines or features.
The module uses metrics such as support, confidence, and uplift to quantifythe strength and significance of product associations:
- Support: The frequency of items appearing together in transactions.
- Confidence: The likelihood of buying one product given the purchase of another.
- Uplift: The increase in purchase probability of one product when another is bought.
Example:
from pyretailscience.analysis.product_association import ProductAssociation
pa = ProductAssociation(
df,
value_col="product_name",
group_col="transaction_id",
)
pa.df.head()
product_name_1 | product_name_2 | occurrences_1 | occurrences_2 | cooccurrences | support | confidence | uplift |
---|---|---|---|---|---|---|---|
100 Animals Book | 100% Organic Cold-Pressed... | 78 | 78 | 1 | 0.000039 | 0.0128205 | 4.18 |
100 Animals Book | 20K Sousaphone | 78 | 81 | 3 | 0.000117 | 0.0384615 | 12.10 |
100 Animals Book | 360 Sport 2.0 Boxer Briefs | 78 | 79 | 1 | 0.000039 | 0.0128205 | 4.13 |
100 Animals Book | 4-Series 4K UHD | 78 | 82 | 1 | 0.000039 | 0.0128205 | 3.98 |
100 Animals Book | 700S Eterna Trumpet | 78 | 71 | 1 | 0.000039 | 0.0128205 | 4.60 |
Cross Shop
Cross Shop analysis visualizes the overlap between different customer groups or product categories, helping retailers understand cross-purchasing behaviors. This powerful visualization technique employs Venn or Euler diagrams to show how customers interact across different product categories or segments.
Key applications include:
- Identifying opportunities for cross-selling and bundling
- Evaluating product category relationships
- Analyzing promotion cannibalization
- Understanding customer shopping patterns across departments
- Planning targeted marketing campaigns based on complementary purchasing behavior
The module provides options to visualize both the proportional size of each group and the percentage of overlap, making it easy to identify significant patterns in customer shopping behavior.
Example:
import pandas as pd
from pyretailscience.analysis import cross_shop
data = {
"customer_id": [1, 2, 3, 4, 5, 5, 6, 9, 7, 7, 8, 9, 5, 8],
"category_name" = [
"Electronics", "Clothing", "Home", "Sports", "Clothing", "Electronics", "Electronics"
"Clothing", "Home", "Electronics", "Clothing", "Electronics", "Home", "Home"
]
"unit_spend": [100, 200, 300, 400, 200, 500, 100, 200, 300, 350, 400, 500, 250, 360]
}
df = pd.DataFrame(data)
cs_customers = cross_shop.CrossShop(
df,
group_1_col="category_name",
group_1_val="Electronics",
group_2_col="category_name",
group_2_val="Clothing",
group_3_col="category_name",
group_3_val="Home",
labels=["Electronics", "Clothing", "Home"],
)
cs_customers.plot(
title="Customer Spend Overlap Across Categories",
source_text="Source: PyRetailScience",
)
Gain Loss
The Gain Loss module (also known as switching analysis) helps analyze changes in customer behavior between two time periods. It breaks down revenue or customer movement between a focus group and a comparison group by:
- New customers: Customers who didn't purchase in period 1 but did in period 2
- Lost customers: Customers who purchased in period 1 but not in period 2
- Increased/decreased spending: Existing customers who changed their spending level
- Switching: Customers who moved between the focus and comparison groups
This module is particularly valuable for:
- Analyzing promotion cannibalization
- Understanding customer migration between brands or categories
- Evaluating the effectiveness of marketing campaigns
- Quantifying the sources of revenue changes
Example:
import pandas as pd
import numpy as np
from pyretailscience.analysis.gain_loss import GainLoss
np.random.seed(42)
n_customers = 30
df = pd.DataFrame({
"customer_id": [f"C{i:03d}" for i in range(n_customers)] * 2,
"unit_spend": np.random.randint(10, 100, size=n_customers * 2),
"brand": np.random.choice(["Brand A", "Brand B"], size=n_customers * 2),
"period": ["p1"] * n_customers + ["p2"] * n_customers,
})
gain_loss = GainLoss(
df=df,
p1_index= df["period"] == "p1",
p2_index= df["period"] == "p2",
focus_group_index=df["brand"] == "Brand A",
focus_group_name="Brand A",
comparison_group_index=df["brand"] == "Brand B",
comparison_group_name="Brand B",
)
gain_loss.plot(
title="Brand A vs Brand B: Customer Movement Analysis",
x_label="Revenue Change",
source_text="Source: PyRetailScience",
move_legend_outside=True,
)
Customer Decision Hierarchy
A Customer Decision Hierarchy (CDH), also known as a Customer Decision Tree, is a powerful tool in retail analytics that visually represents the sequential steps and criteria customers use when making purchase decisions within a specific product category. Here's a brief summary of its purpose and utility:
CDHs allow analysts to:
- Map out the hierarchical structure of customer decision-making processes
- Identify key product attributes that drive purchase decisions
- Understand product substitutions and alternatives customers consider
- Prioritize product attributes based on their importance to customers
In retail contexts, CDHs are valuable for:
- Optimizing product assortments and shelf layouts
- Developing targeted marketing strategies
- Identifying opportunities for new product development
- Understanding competitive dynamics within a category
By visualizing the decision-making process, CDHs help retailers align their offerings and strategies with customer preferences, potentially increasing sales and customer satisfaction. They provide insights into how customers navigate choices, enabling more effective category management and merchandising decisions.
Example:
from pyretailscience.analysis.customer_decision_hierarchy import CustomerDecisionHierarchy
cdh = CustomerDecisionHierarchy(df)
ax = cdh.plot(
orientation="right",
source_text="Source: Transactions 2024",
title="Snack Food Substitutions",
)
Revenue Tree
The Revenue Tree is a hierarchical breakdown of factors contributing to overall revenue, allowing for detailed analysis of sales performance and identification of areas for improvement.
Key Components of the Revenue Tree:
-
Revenue: The top-level metric, calculated as Customers * Revenue per Customer.
-
Revenue per Customer: Average revenue generated per customer, calculated as: Orders per Customer * Average Order Value.
-
Orders per Customer: Average number of orders placed by each customer.
-
Average Order Value: Average monetary value of each order, calculated as: Items per Order * Price per Item.
-
Items per Order: Average number of items in each order.
-
Price per Item: Average price of each item sold.
Example:
import pandas as pd
import numpy as np
from pyretailscience.analysis import revenue_tree
np.random.seed(42)
# Generate 100 records
num_records = 100
df = pd.DataFrame({
"group_id": np.random.choice([1, 2], size=num_records),
"customer_id": np.random.randint(1, 31, size=num_records),
"transaction_id": np.arange(1, num_records + 1),
"unit_spend": np.random.uniform(50, 500, size=num_records).round(2),
"unit_quantity": np.random.randint(1, 6, size=num_records),
"transaction_date": pd.to_datetime(
np.random.choice(pd.date_range("2023-01-01", "2023-01-10"), size=num_records)
)
})
df["period"] = df["transaction_date"].apply(lambda x: "P1" if x < pd.Timestamp("2023-01-04") else "P2")
rev_tree = revenue_tree.RevenueTree(
df,
period_col="period",
p1_value = "P1",
p2_value = "P2",
)
HML Segmentation
Heavy, Medium, Light (HML) is a segmentation that places customers into groups based on their percentile of spend or the number of products they bought. Heavy customers are the top 20% of customers, medium are the next 30%, and light are the bottom 50% of customers. These values are chosen based on the proportions of the Pareto distribution. Often, purchase behavior follows this distribution, typified by the expression "20% of your customers generate 80% of your sales." HML segmentation helps answer questions such as:
- How much more are your best customers worth?
- How much more could you spend acquiring your best customers?
- What is the concentration of sales with your top (heavy) customers?
The module also handles customers with zero spend, with options to include them with light customers, exclude them entirely, or place them in a separate "Zero" segment.
Example:
from pyretailscience.plots import bar
from pyretailscience.segmentation.hml import HMLSegmentation
seg = HMLSegmentation(df, zero_value_customers="include_with_light")
bar.plot(
seg.df.groupby("segment_name")["unit_spend"].sum(),
value_col="unit_spend",
source_text="Source: PyRetailScience",
sort_order="descending",
x_label="",
y_label="Segment Spend",
title="What's the value of a Heavy customer?",
rot=0,
)
Threshold Segmentation
Threshold Segmentation offers a flexible approach to customer grouping based on custom-defined percentile thresholds. Unlike the fixed 20/30/50 split in HML segmentation, Threshold Segmentation allows you to specify your own thresholds and segment names, making it adaptable to various business needs.
This flexibility enables you to:
- Create quartile segmentations (e.g., top 25%, next 25%, etc.)
- Define custom tiers based on your specific business model
- Segment customers based on alternative metrics beyond spend, such as visit frequency or product variety
Like HML segmentation, the module provides options for handling customers with zero values, allowing you to include them with the lowest segment, exclude them entirely, or place them in a separate segment.
Example:
from pyretailscience.plots import bar
from pyretailscience.segmentation.threshold import ThresholdSegmentation
# Create custom segmentation with quartiles
# Define thresholds at 25%, 50%, 75%, and 100% (quartiles)
thresholds = [0.25, 0.50, 0.75, 1.0]
segments = ["Bronze", "Silver", "Gold", "Platinum"]
seg = ThresholdSegmentation(
df=df,
thresholds=thresholds,
segments=segments,
zero_value_customers="separate_segment",
)
bar.plot(
seg.df.groupby("segment_name")["unit_spend"].sum(),
value_col="unit_spend",
source_text="Source: PyRetailScience",
sort_order="descending",
x_label="",
y_label="Segment Spend",
title="Customer Value by Segment",
rot=0,
)
Segmentation Stats
The Segmentation Stats module provides functionality to calculate transaction statistics by segment for a particular segmentation. It makes it easy to compare key metrics across different segments, helping you understand how your customer (or transactions or promotions) groups differ in terms of spending behavior and transaction patterns. This module calculates metrics such as total spend, number of transactions, average spend per customer, and transactions per customer for each segment. It's particularly useful when combined with other segmentation approaches like HML segmentation.
Example:
from pyretailscience.segmentation.segstats import SegTransactionStats
from pyretailscience.segmentation.hml import HMLSegmentation
seg = HMLSegmentation(df, zero_value_customers="include_with_light")
# First, segment customers using HML segmentation
segmentation = HMLSegmentation(df)
# Add segment labels to the transaction data
df_with_segments = segmentation.add_segment(df)
# Calculate transaction statistics by segment
segment_stats = SegTransactionStats(df_with_segments)
# Display the statistics
segment_stats.df
segment_name | spend | transactions | customers | spend_per_customer | spend_per_transaction | transactions_per_customer | customers_pct |
---|---|---|---|---|---|---|---|
Heavy | 2927.21 | 30 | 10 | 292.721 | 97.5735 | 3 | 0.2 |
Medium | 1014.97 | 45 | 15 | 67.6644 | 22.5548 | 3 | 0.3 |
Light | 662.107 | 75 | 25 | 26.4843 | 8.82809 | 3 | 0.5 |
Total | 4604.28 | 150 | 50 | 92.0856 | 30.6952 | 3 | 1 |
RFM Segmentation
Recency, Frequency, Monetary (RFM) segmentation categorizes customers based on their purchasing behavior:
- Recency (R): How recently a customer made a purchase
- Frequency (F): How often a customer makes purchases
- Monetary (M): How much a customer spends
Each metric is typically scored on a scale, and the combined RFM score helps businesses identify loyal customers, at-risk customers, and high-value buyers.
RFM segmentation helps answer questions such as:
- Who are your most valuable customers?
- Which customers are at risk of churn?
- Which customers should be targeted for re-engagement?
Example:
import pandas as pd
from pyretailscience.segmentation.rfm import RFMSegmentation
data = pd.DataFrame({
"customer_id": [1, 1, 2, 2, 3, 3, 3],
"transaction_id": [101, 102, 201, 202, 301, 302, 303],
"transaction_date": ["2024-03-01", "2024-03-10", "2024-02-20", "2024-02-25", "2024-01-15", "2024-01-20", "2024-02-05"],
"unit_spend": [50, 75, 100, 150, 200, 250, 300]
})
data["transaction_date"] = pd.to_datetime(data["transaction_date"])
current_date = "2024-07-01"
rfm_segmenter = RFMSegmentation(df=data, current_date=current_date)
rfm_results = rfm_segmenter.df
customer_id | recency_days | frequency | monetary | r_score | f_score | m_score | rfm_segment | fm_segment |
---|---|---|---|---|---|---|---|---|
1 | 113 | 2 | 125 | 0 | 0 | 0 | 0 | 0 |
2 | 127 | 2 | 250 | 1 | 1 | 1 | 111 | 11 |
3 | 147 | 3 | 750 | 2 | 2 | 2 | 222 | 22 |
Purchases Per Customer
The Purchases Per Customer module analyzes and visualizes the distribution of transaction frequency across your customer base. This module helps you understand customer purchasing patterns by percentile and is useful for determining values like your churn window.
Example:
from pyretailscience.analysis.customer import PurchasesPerCustomer
ppc = PurchasesPerCustomer(transactions)
ppc.plot(
title="Purchases per Customer",
percentile_line=0.8,
source_text="Source: PyRetailScience",
)
Days Between Purchases
The Days Between Purchases module analyzes the time intervals between customer transactions, providing valuable insights into purchasing frequency and shopping patterns. This analysis helps you understand:
- How frequently your customers typically return to make purchases
- The distribution of purchase intervals across your customer base
- Which customer segments have shorter or longer repurchase cycles
- Where intervention might be needed to prevent customer churn
This information is critical for planning communication frequency, timing promotional campaigns, and developing effective retention strategies. The module can visualize both standard and cumulative distributions of days between purchases.
Example:
dbp = DaysBetweenPurchases(transactions)
dbp.plot(
bins=15,
title="Average Days Between Customer Purchases",
percentile_line=0.5, # Mark the median with a line
)
Transaction Churn
The Transaction Churn module analyzes how customer churn rates vary based on the number of purchases customers have made. This helps reveal critical retention thresholds in the customer lifecycle when setting a churn window
Example:
from pyretailscience.analysis.customer import TransactionChurn
tc = TransactionChurn(transactions, churn_period=churn_period)
tc.plot(
title="Churn Rate by Number of Purchases",
cumulative=True,
source_text="Source: PyRetailScience",
)
Composite Rank
The Composite Rank module creates a composite ranking of several columns by giving each column an individual rank and then combining those ranks together. Composite rankings are particularly useful for:
- Product range reviews when multiple factors need to be considered together
- Prioritizing actions based on multiple performance metrics
- Creating balanced scorecards that consider multiple dimensions
- Identifying outliers across multiple metrics
This module allows you to specify different sort orders for each column (ascending or descending) and supports various aggregation functions to combine the ranks, such as mean, sum, min, or max.
Key features:
- Supports both ascending and descending sort orders
- Handles ties in rankings with configurable options
- Combines multiple individual ranks into a single composite rank
- Works with both pandas DataFrames and ibis Tables
Example:
import pandas as pd
from pyretailscience.analysis.composite_rank import CompositeRank
# Create sample data for products
df = pd.DataFrame({
"product_id": [1, 2, 3, 4, 5],
"spend": [100, 150, 75, 200, 125],
"customers": [20, 30, 15, 40, 25],
"spend_per_customer": [5.0, 5.0, 5.0, 5.0, 5.0],
})
# Create CompositeRank with multiple columns
cr = CompositeRank(
df=df,
rank_cols=[
("spend", "desc"), # Higher spend is better
("customers", "desc"), # Higher customer count is better
("spend_per_customer", "desc") # Higher spend per customer is better
],
agg_func="mean", # Use mean to aggregate ranks
ignore_ties=False # Keep ties (rows with same values get same rank)
)
cr.df.sort_values("composite_rank")
product_id | spend | customers | spend_per_customer | spend_rank | customers_rank | spend_per_customer_rank | composite_rank |
---|---|---|---|---|---|---|---|
4 | 200 | 40 | 5.0 | 1 | 1 | 1 | 1.0 |
2 | 150 | 30 | 5.0 | 2 | 2 | 1 | 1.67 |
5 | 125 | 25 | 5.0 | 3 | 3 | 1 | 2.33 |
1 | 100 | 20 | 5.0 | 4 | 4 | 1 | 3.0 |
3 | 75 | 15 | 5.0 | 5 | 5 | 1 | 3.67 |
Utils
Filter and Label by Periods
The Filter and Label by Periods module allows you to:
- Filter transaction data to specific time periods (e.g., quarters, months, promotional periods)
- Add period labels to your data for easy segmentation and comparison
- Analyze before-and-after performance for events or promotions
- Compare metrics across different time frames consistently
This functionality is particularly useful for:
- Comparing KPIs across fiscal quarters or years
- Analyzing seasonal performance patterns
- Measuring the impact of promotions or events
- Creating period-based visualizations with consistent data preparation
Example:
import pandas as pd
import ibis
from pyretailscience.utils.date import filter_and_label_by_periods
# Create a sample transactions table
data = pd.DataFrame({
"transaction_id": range(1, 101),
"transaction_date": pd.date_range(start="2023-01-01", periods=100, freq="D"),
"customer_id": [f"C{i % 20 + 1}" for i in range(100)],
"amount": [float(i % 5 * 25 + 50) for i in range(100)]
})
transactions = ibis.memtable(data)
# Define period ranges for analysis
period_ranges = {
"Pre-Promotion": ("2023-01-01", "2023-01-31"),
"Promotion": ("2023-02-01", "2023-02-28"),
"Post-Promotion": ("2023-03-01", "2023-03-31")
}
# Filter transactions to the defined periods and add period labels
result_df = filter_and_label_by_periods(transactions, period_ranges).execute()
# Calculate KPIs by period
result_df.groupby("period_name").agg(
transaction_count=("transaction_id", "count"),
total_sales=("amount", "sum"),
avg_transaction_value=("amount", "mean")
)
period_name | transaction_count | total_sales | avg_transaction_value |
---|---|---|---|
Pre-Promotion | 31 | 1937.5 | 62.50 |
Promotion | 28 | 1750.0 | 62.50 |
Post-Promotion | 31 | 1937.5 | 62.50 |
Find Overlapping Periods
The Find Overlapping Periods module allows you to:
- Identify overlapping periods between a given start and end date.
- Split the date range into yearly periods that start from the given start date for the first period and then yearly thereafter, ending on the provided end date.
- Return results either as ISO-formatted strings (
"YYYY-MM-DD"
) or asdatetime
objects.
This functionality is particularly useful for:
- Analyzing seasonal or yearly patterns in datasets.
- Comparing data across specific date ranges.
- Structuring time-based segmentations efficiently.
Example:
from datetime import datetime
from pyretailscience.utils.date import find_overlapping_periods
# Example with string input
overlapping_periods = find_overlapping_periods("2022-06-15", "2025-03-10")
print(overlapping_periods)
Start Date | End Date |
---|---|
2022-06-15 | 2023-03-10 |
2023-06-15 | 2024-03-10 |
2024-06-15 | 2025-03-10 |
Filter and Label by Condition
The Filter and Label by Condition module allows you to:
- Filter data based on arbitrary conditions (e.g., category, region, price range)
- Add descriptive labels to filtered rows for easier segmentation
- Prepare labeled subsets for downstream analysis or visualization
- Combine multiple Boolean conditions into a single, labeled dataset
This functionality is particularly useful for:
- Segmenting customers or products by custom-defined rules
- Categorizing transactions based on business logic
- Creating labeled training data for machine learning
- Analyzing metrics across different business segments
Example:
import pandas as pd
import ibis
from pyretailscience.utils.filter_and_label import filter_and_label_by_condition
# Sample product table
df = pd.DataFrame({
"product_id": range(1, 9),
"category": ["toys", "shoes", "toys", "books", "electronics", "toys", "shoes", "books"],
"price": [15, 55, 25, 10, 200, 35, 60, 20]
})
products = ibis.memtable(df)
# Define filter conditions
conditions = {
"Toys": products["category"] == "toys",
"Shoes": products["category"] == "shoes",
"Premium Electronics": (products["category"] == "electronics") & (products["price"] > 100)
}
# Apply filtering and labeling
labeled_data = filter_and_label_by_condition(products, conditions).execute()
product_id | category | price | label |
---|---|---|---|
1 | toys | 15 | Toys |
2 | shoes | 55 | Shoes |
3 | toys | 25 | Toys |
5 | electronics | 200 | Premium Electronics |
6 | toys | 35 | Toys |
7 | shoes | 60 | Shoes |