Skip to content

Product Associations

Market Basket Analysis and Product Association Rules for Retail Optimization.

Business Context

Product association analysis (market basket analysis) uncovers the hidden relationships in customer purchasing behavior. This intelligence transforms how retailers approach merchandising, marketing, and operations by revealing which products naturally sell together and why.

The Business Problem

Retailers lose revenue from missed cross-selling opportunities and poor product placement. Without understanding product associations, stores might: - Place complementary items in different aisles, reducing impulse purchases - Miss bundling opportunities that could increase average transaction value - Stock-out on associated items when promoting a product - Fail to identify new product opportunities based on basket gaps

Real-World Applications

  1. Strategic Merchandising
  2. Place chips near beer when data shows strong association
  3. Position phone cases near phones based on attachment rates
  4. Create end-cap displays with products that sell together
  5. Optimize shelf space allocation using association strength

  6. Dynamic Bundle Pricing

  7. Create "Breakfast bundle": Coffee + Pastry when uplift shows synergy
  8. Design seasonal bundles based on historical associations
  9. Price bundles to incentivize larger baskets while maintaining margins
  10. Test bundle combinations using confidence metrics

  11. Personalized Recommendations

  12. Power "Customers who bought X also bought Y" suggestions
  13. Enhance cart abandonment recovery with associated items
  14. Design email campaigns based on previous purchase associations
  15. Improve search results by showing associated products

  16. Inventory Optimization

  17. Stock pasta sauce when pasta is promoted (if association exists)
  18. Prepare battery inventory when toys are featured
  19. Coordinate supply chain for products that sell together
  20. Reduce stockouts by understanding product relationships

  21. New Product Placement

  22. Place new organic items near existing organic purchases
  23. Position private label next to associated national brands
  24. Test new products within high-association categories
  25. Leverage existing associations to drive trial

Key Metrics Explained

Support (Frequency)

The proportion of all transactions containing both products. Higher support indicates a more common pairing. Use this to identify: - Core product relationships for everyday decisions - Sufficient sample size for confident conclusions - Opportunities worth marketing investment

Confidence (Conditional Probability)

The probability of buying product B given product A was purchased. This answers "If a customer buys A, how likely are they to buy B?" Use this for: - Recommendation engine rules - Promotional targeting decisions - Cross-sell prioritization

Uplift/Lift (Synergy Indicator)

Measures how much more likely products are bought together than would be expected by chance. Uplift = Observed probability / Expected probability. Interpretation: - Uplift > 1: Products have positive association (sell better together) - Uplift = 1: Products are independent (no relationship) - Uplift < 1: Products have negative association (rarely bought together)

Higher uplift values indicate stronger synergies worth exploiting.

Actionable Decision Framework

Retailers should consider multiple metrics together:

High Support + High Confidence - Strong, frequent relationship - Priority for permanent merchandising changes - Core bundle candidates

Low Support + High Confidence - Niche but reliable relationship - Targeted marketing opportunities - Specialized customer segments

High Support + Low Confidence - Common but weak relationship - Test before major changes - Monitor for shifts over time

High Uplift (regardless of support) - Strong synergy exists - Test for merchandising opportunities - Consider for promotional strategies

Implementation Considerations

  • Validate associations across time periods before major changes
  • Consider seasonality in association patterns
  • Test recommendations with A/B experiments
  • Monitor associations as product mix evolves
  • Account for external factors (promotions, events) affecting associations

ProductAssociation

A class for generating and analyzing product association rules.

This class calculates association rules between products based on transaction data, helping to identify patterns in customer purchasing behavior.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing transaction data.

required
value_col str

The name of the column in the input DataFrame that contains the product identifiers.

required
group_col str

The name of the column that identifies unique transactions or customers. Defaults to option column.column_id.

None
target_item str | float | list[str | float] | None

A specific product or list of products to focus the association analysis on. If None, associations for all products are calculated. Defaults to None.

None

Attributes:

Name Type Description
df DataFrame

A DataFrame containing the calculated association rules and their metrics.

Example

import pandas as pd transaction_df = pd.DataFrame({ ... 'customer_id': [1, 1, 2, 2, 3], ... 'product_id': ['A', 'B', 'B', 'C', 'A'] ... }) pa = ProductAssociation(df=transaction_df, value_col='product_id', group_col='customer_id') print(pa.df) # View the calculated association rules

Note

The resulting DataFrame (pa.df) contains the following columns: - product_1, product_2: The pair of products for which the association is calculated. - occurrences_1, occurrences_2: The number of transactions containing each product. - cooccurrences: The number of transactions containing both products. - support: The proportion of transactions containing both products. - confidence: The probability of buying product_2 given that product_1 was bought. - uplift: The ratio of the observed support to the expected support if the products were independent.

Source code in pyretailscience/analysis/product_association.py
class ProductAssociation:
    """A class for generating and analyzing product association rules.

    This class calculates association rules between products based on transaction data,
    helping to identify patterns in customer purchasing behavior.

    Args:
        df (pandas.DataFrame): The input DataFrame containing transaction data.
        value_col (str): The name of the column in the input DataFrame that contains
            the product identifiers.
        group_col (str, optional): The name of the column that identifies unique
            transactions or customers. Defaults to option column.column_id.
        target_item (str | float | list[str | float] | None, optional): A specific product or list of products to focus the
            association analysis on. If None, associations for all products are
            calculated. Defaults to None.

    Attributes:
        df (pandas.DataFrame): A DataFrame containing the calculated association
            rules and their metrics.

    Example:
        >>> import pandas as pd
        >>> transaction_df = pd.DataFrame({
        ...     'customer_id': [1, 1, 2, 2, 3],
        ...     'product_id': ['A', 'B', 'B', 'C', 'A']
        ... })
        >>> pa = ProductAssociation(df=transaction_df, value_col='product_id', group_col='customer_id')
        >>> print(pa.df)  # View the calculated association rules

    Note:
        The resulting DataFrame (pa.df) contains the following columns:
        - product_1, product_2: The pair of products for which the association is calculated.
        - occurrences_1, occurrences_2: The number of transactions containing each product.
        - cooccurrences: The number of transactions containing both products.
        - support: The proportion of transactions containing both products.
        - confidence: The probability of buying product_2 given that product_1 was bought.
        - uplift: The ratio of the observed support to the expected support if the products were independent.
    """

    _df: pd.DataFrame | None = None

    def __init__(
        self,
        df: pd.DataFrame | ibis.Table,
        value_col: str,
        group_col: str | None = None,
        target_item: str | None = None,
        min_occurrences: int = 1,
        min_cooccurrences: int = 1,
        min_support: float = 0.0,
        min_confidence: float = 0.0,
        min_uplift: float = 0.0,
    ) -> None:
        """Initialize the ProductAssociation object.

        Args:
            df (pd.DataFrame | ibis.Table) : The input DataFrame or ibis Table containing transaction data.
            value_col (str): The name of the column in the input DataFrame that contains the product identifiers.
            group_col (str, optional): The name of the column that identifies unique transactions or customers. Defaults
                to option column.unit_spend.
            target_item (str | float | list[str | float] | None, optional): A specific product or list of products to focus the association analysis on. If None,
                associations for all products are calculated. Defaults to None.
            min_occurrences (int, optional): The minimum number of occurrences required for each product in the
                association analysis. Defaults to 1. Must be at least 1.
            min_cooccurrences (int, optional): The minimum number of co-occurrences required for the product pairs in
                the association analysis. Defaults to 1. Must be at least 1.
            min_support (float, optional): The minimum support value required for the association rules. Defaults to
                0.0. Must be between 0 and 1.
            min_confidence (float, optional): The minimum confidence value required for the association rules. Defaults
                to 0.0. Must be between 0 and 1.
            min_uplift (float, optional): The minimum uplift value required for the association rules. Defaults to 0.0.
                Must be greater or equal to 0.

        Raises:
            ValueError: If the number of combinations is not 2 or 3, or if any of the minimum values are invalid.
            ValueError: If the minimum support, confidence, or uplift values are outside the valid range.
            ValueError: If the minimum occurrences or cooccurrences are less than 1.
            ValueError: If the input DataFrame does not contain the required columns or if they have null values.
        """
        group_col = group_col or get_option("column.customer_id")
        required_cols = [group_col, value_col]
        missing_cols = set(required_cols) - set(df.columns)
        if len(missing_cols) > 0:
            msg = f"The following columns are required but missing: {missing_cols}"
            raise ValueError(msg)

        self.table = self._calc_association(
            df=df,
            value_col=value_col,
            group_col=group_col,
            target_item=target_item,
            min_occurrences=min_occurrences,
            min_cooccurrences=min_cooccurrences,
            min_support=min_support,
            min_confidence=min_confidence,
            min_uplift=min_uplift,
        )

    @staticmethod
    def _validate_minimum_values(
        min_occurrences: int,
        min_cooccurrences: int,
        min_support: float,
        min_confidence: float,
        min_uplift: float,
    ) -> None:
        """Validate minimum value parameters.

        Args:
            min_occurrences (int): The minimum number of occurrences required for each product.
            min_cooccurrences (int): The minimum number of co-occurrences required for product pairs.
            min_support (float): The minimum support value required for association rules.
            min_confidence (float): The minimum confidence value required for association rules.
            min_uplift (float): The minimum uplift value required for association rules.

        Raises:
            ValueError: If any parameter is outside the valid range.
        """
        if min_occurrences < 1:
            raise ValueError("Minimum occurrences must be at least 1.")
        if min_cooccurrences < 1:
            raise ValueError("Minimum cooccurrences must be at least 1.")
        if min_support < 0.0 or min_support > 1.0:
            raise ValueError("Minimum support must be between 0 and 1.")
        if min_confidence < 0.0 or min_confidence > 1.0:
            raise ValueError("Minimum confidence must be between 0 and 1.")
        if min_uplift < 0.0:
            raise ValueError("Minimum uplift must be greater or equal to 0.")

    @staticmethod
    def _calc_association(
        df: pd.DataFrame | ibis.Table,
        value_col: str,
        group_col: str = get_option("column.customer_id"),
        target_item: str | float | list[str | float] | None = None,
        min_occurrences: int = 1,
        min_cooccurrences: int = 1,
        min_support: float = 0.0,
        min_confidence: float = 0.0,
        min_uplift: float = 0.0,
    ) -> pd.DataFrame:
        """Calculate product association rules based on transaction data.

        This method calculates association rules between products based on transaction data,
        helping to identify patterns in customer purchasing behavior.

        Args:
            df (pd.DataFrame | ibis.Table) : The input DataFrame or ibis Table containing transaction data.
            value_col (str): The name of the column in the input DataFrame that contains the product identifiers.
            group_col (str, optional): The name of the column that identifies unique transactions or customers. Defaults
                to option column.unit_spend.
            target_item (str | float | list[str | float] | None, optional): A specific product or list of products to focus the association analysis on. If None,
                associations for all products are calculated. Defaults to None.
            min_occurrences (int, optional): The minimum number of occurrences required for each product in the
                association analysis. Defaults to 1. Must be at least 1.
            min_cooccurrences (int, optional): The minimum number of co-occurrences required for the product pairs in
                the association analysis. Defaults to 1. Must be at least 1.
            min_support (float, optional): The minimum support value required for the association rules. Defaults to
                0.0. Must be between 0 and 1.
            min_confidence (float, optional): The minimum confidence value required for the association rules. Defaults
                to 0.0. Must be between 0 and 1.
            min_uplift (float, optional): The minimum uplift value required for the association rules. Defaults to 0.0.
                Must be greater or equal to 0.

        Returns:
            pandas.DataFrame: A DataFrame containing the calculated association rules and their metrics.

        Raises:
            ValueError: If the number of combinations is not 2 or 3, or if any of the minimum values are invalid.
            ValueError: If the minimum support, confidence, or uplift values are outside the valid range.
            ValueError: If the minimum occurrences or cooccurrences are less than 1.

        Note:
            The resulting DataFrame contains the following columns:
            - product_1, product_2: The pair of products for which the association is calculated.
            - occurrences_1, occurrences_2: The number of transactions containing each product.
            - cooccurrences: The number of transactions containing both products.
            - support: The proportion of transactions containing both products.
            - confidence: The probability of buying product_2 given that product_1 was bought.
            - uplift: The ratio of the observed support to the expected support if the products were independent.
        """
        ProductAssociation._validate_minimum_values(
            min_occurrences=min_occurrences,
            min_cooccurrences=min_cooccurrences,
            min_support=min_support,
            min_confidence=min_confidence,
            min_uplift=min_uplift,
        )

        # Normalize target_item to a list for consistent processing
        if target_item is not None:
            if not isinstance(target_item, list):
                target_item = [target_item]

            # Validate that all items in target_item are of supported types
            for item in target_item:
                if not isinstance(item, str | float):
                    msg = f"target_item must contain only str or float values. Got {type(item)}"
                    raise TypeError(msg)

            # Ensure target_item is not empty
            if len(target_item) == 0:
                raise ValueError("target_item cannot be an empty list")

        if isinstance(df, pd.DataFrame):
            df = ibis.memtable(df)

        unique_transactions = df.select(_[group_col], _[value_col]).distinct()
        total_transactions = unique_transactions.alias("t")[group_col].nunique().name("total_count")

        product_occurrences = (
            unique_transactions.group_by(value_col)
            .aggregate(occurrences=_[group_col].nunique())
            .mutate(occurrence_probability=_.occurrences / total_transactions)
            .filter(_.occurrences >= min_occurrences)
        )

        left_table = unique_transactions.rename({"item_1": value_col})
        right_table = unique_transactions.rename({"item_2": value_col})

        join_logic = [left_table[group_col] == right_table[group_col]]
        if target_item is None:
            join_logic.append(left_table.item_1 < right_table.item_2)
        else:
            join_logic.extend(
                [
                    left_table.item_1 != right_table.item_2,
                    left_table.item_1.isin(target_item),
                ],
            )
        merged_df = left_table.join(right_table, predicates=join_logic)

        cooccurrences = (
            merged_df.group_by(["item_1", "item_2"])
            .aggregate(cooccurrences=merged_df[group_col].nunique())
            .mutate(support=_.cooccurrences / total_transactions)
            .filter((_.cooccurrences >= min_cooccurrences) & (_.support >= min_support))
        )

        product_occurrences_1_rename = product_occurrences.rename(
            {"item_1": value_col, "occurrences_1": "occurrences", "prob_1": "occurrence_probability"},
        )
        product_occurrences_2_rename = product_occurrences.rename(
            {"item_2": value_col, "occurrences_2": "occurrences", "prob_2": "occurrence_probability"},
        )

        result = (
            cooccurrences.join(product_occurrences_1_rename, "item_1")
            .join(product_occurrences_2_rename, "item_2")
            .mutate(
                confidence=_.cooccurrences / _.occurrences_1,
                uplift=_.support / (_.prob_1 * _.prob_2),
            )
            .filter(_.uplift >= min_uplift)
        )

        if target_item is None:
            col_order = [
                "item_1",
                "item_2",
                "occurrences_1",
                "occurrences_2",
                "cooccurrences",
                "support",
                "confidence",
                "uplift",
            ]
            inverse_pairs = result.mutate(
                item_1=result.item_2,
                item_2=result.item_1,
                occurrences_1=result.occurrences_2,
                occurrences_2=result.occurrences_1,
                prob_1=result.prob_2,
                prob_2=result.prob_1,
                confidence=result.cooccurrences / result.occurrences_2,
            )
            result = result[col_order].union(inverse_pairs[col_order])

        result = (
            result.filter(result.confidence >= min_confidence)
            .order_by(["item_1", "item_2"])
            .rename({f"{value_col}_1": "item_1", f"{value_col}_2": "item_2"})
        )
        return result[
            [
                f"{value_col}_1",
                f"{value_col}_2",
                "occurrences_1",
                "occurrences_2",
                "cooccurrences",
                "support",
                "confidence",
                "uplift",
            ]
        ]

    @property
    def df(self) -> pd.DataFrame:
        """Returns the executed DataFrame."""
        if self._df is None:
            self._df = self.table.execute().reset_index(drop=True)
        return self._df

df: pd.DataFrame property

Returns the executed DataFrame.

__init__(df, value_col, group_col=None, target_item=None, min_occurrences=1, min_cooccurrences=1, min_support=0.0, min_confidence=0.0, min_uplift=0.0)

Initialize the ProductAssociation object.

Parameters:

Name Type Description Default
df pd.DataFrame | ibis.Table)

The input DataFrame or ibis Table containing transaction data.

required
value_col str

The name of the column in the input DataFrame that contains the product identifiers.

required
group_col str

The name of the column that identifies unique transactions or customers. Defaults to option column.unit_spend.

None
target_item str | float | list[str | float] | None

A specific product or list of products to focus the association analysis on. If None, associations for all products are calculated. Defaults to None.

None
min_occurrences int

The minimum number of occurrences required for each product in the association analysis. Defaults to 1. Must be at least 1.

1
min_cooccurrences int

The minimum number of co-occurrences required for the product pairs in the association analysis. Defaults to 1. Must be at least 1.

1
min_support float

The minimum support value required for the association rules. Defaults to 0.0. Must be between 0 and 1.

0.0
min_confidence float

The minimum confidence value required for the association rules. Defaults to 0.0. Must be between 0 and 1.

0.0
min_uplift float

The minimum uplift value required for the association rules. Defaults to 0.0. Must be greater or equal to 0.

0.0

Raises:

Type Description
ValueError

If the number of combinations is not 2 or 3, or if any of the minimum values are invalid.

ValueError

If the minimum support, confidence, or uplift values are outside the valid range.

ValueError

If the minimum occurrences or cooccurrences are less than 1.

ValueError

If the input DataFrame does not contain the required columns or if they have null values.

Source code in pyretailscience/analysis/product_association.py
def __init__(
    self,
    df: pd.DataFrame | ibis.Table,
    value_col: str,
    group_col: str | None = None,
    target_item: str | None = None,
    min_occurrences: int = 1,
    min_cooccurrences: int = 1,
    min_support: float = 0.0,
    min_confidence: float = 0.0,
    min_uplift: float = 0.0,
) -> None:
    """Initialize the ProductAssociation object.

    Args:
        df (pd.DataFrame | ibis.Table) : The input DataFrame or ibis Table containing transaction data.
        value_col (str): The name of the column in the input DataFrame that contains the product identifiers.
        group_col (str, optional): The name of the column that identifies unique transactions or customers. Defaults
            to option column.unit_spend.
        target_item (str | float | list[str | float] | None, optional): A specific product or list of products to focus the association analysis on. If None,
            associations for all products are calculated. Defaults to None.
        min_occurrences (int, optional): The minimum number of occurrences required for each product in the
            association analysis. Defaults to 1. Must be at least 1.
        min_cooccurrences (int, optional): The minimum number of co-occurrences required for the product pairs in
            the association analysis. Defaults to 1. Must be at least 1.
        min_support (float, optional): The minimum support value required for the association rules. Defaults to
            0.0. Must be between 0 and 1.
        min_confidence (float, optional): The minimum confidence value required for the association rules. Defaults
            to 0.0. Must be between 0 and 1.
        min_uplift (float, optional): The minimum uplift value required for the association rules. Defaults to 0.0.
            Must be greater or equal to 0.

    Raises:
        ValueError: If the number of combinations is not 2 or 3, or if any of the minimum values are invalid.
        ValueError: If the minimum support, confidence, or uplift values are outside the valid range.
        ValueError: If the minimum occurrences or cooccurrences are less than 1.
        ValueError: If the input DataFrame does not contain the required columns or if they have null values.
    """
    group_col = group_col or get_option("column.customer_id")
    required_cols = [group_col, value_col]
    missing_cols = set(required_cols) - set(df.columns)
    if len(missing_cols) > 0:
        msg = f"The following columns are required but missing: {missing_cols}"
        raise ValueError(msg)

    self.table = self._calc_association(
        df=df,
        value_col=value_col,
        group_col=group_col,
        target_item=target_item,
        min_occurrences=min_occurrences,
        min_cooccurrences=min_cooccurrences,
        min_support=min_support,
        min_confidence=min_confidence,
        min_uplift=min_uplift,
    )