Association rules - shopping cart analysis
Association rules are one type of descriptive data analysis. They represent a specific kind of descriptive model applicable to transactional data, i.e. data describing e.g. shopping or other transactions. The rows correspond to individual transactions (e.g. individual purchases) and the columns represent individual goods. The values in the individual columns then indicate whether or not the given item occurred in the given purchase.
We use association rules to uncover frequent combinations of items within a set of transactions. This principle therefore helps us to identify e.g. items that are often purchased together (hence the derived name - shopping cart analysis).
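The toy table below is only an illustration of this transactional form (it is not part of the dataset used later): each row is a purchase, each column a product, and the values mark whether the product appeared in the basket.
import pandas as pd
# toy transactional table: rows = purchases, columns = goods, 1/0 = purchased or not
toy = pd.DataFrame({'bread':  [1, 1, 0, 1],
                    'butter': [1, 0, 0, 1],
                    'milk':   [0, 1, 1, 1]},
                   index=['purchase 1', 'purchase 2', 'purchase 3', 'purchase 4'])
toy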
To search for association rules in Python, we first need to install a module containing the necessary algorithms. Since the Scikit-learn library does not provide them, we will install the mlxtend module, which implements the Apriori algorithm, often used for finding association rules.
To install it in the Anaconda distribution, switch to the home application (Anaconda Navigator). In the Environments tab, click the triangular symbol next to base (root), select Open Terminal and type pip install mlxtend on the command line. This installs the module so that it can be used in scripts and Jupyter notebooks.
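Alternatively, the module can be installed directly from a notebook cell; this is just a convenience sketch and assumes pip is available in the environment of the running kernel.
%pip install mlxtend   # install mlxtend into the active kernel's environment; restart the kernel afterwards if needed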
import pandas as pd
from mlxtend.frequent_patterns import apriori # function implementing the Apriori algorithm for frequent itemsets
from mlxtend.frequent_patterns import association_rules # function for generating association rules from frequent itemsets
As a sample dataset, we will use the Online Retail dataset, which contains records of purchases of various goods made through an online store in various countries. For demonstration purposes, the dataset used in this notebook has been reduced.
We load the data from the file and, after displaying the first rows, we can see the structure of the dataset:
- invoice number
- product identification number
- product name
- quantity
- purchase date
- price per unit of goods
- customer ID
- country
data = pd.read_excel('../data/retail.xlsx')
data.head()
As we can see, the data is not in the required - transactional - form. For this purpose, we need to preprocess the data and change its structure. We trim leading and trailing spaces in the Description attribute and drop rows that do not have a valid invoice number.
data['Description'] = data['Description'].str.strip() # trim unwanted spaces at the beginning and end of descriptions
data.dropna(axis=0, subset=['InvoiceNo'], inplace=True) # remove the rows that have a missing invoice
data.head()
data.tail(30)
data['InvoiceNo'] = data['InvoiceNo'].astype('str') # encode the invoice number as a string
data = data[~data['InvoiceNo'].str.contains('C')] # remove cancelled transactions (invoice numbers containing 'C')
We will then use the invoice number as the transaction identifier - we transform the input data into the basket data frame, whose rows represent purchases (identified by the invoice number) and whose columns hold the quantities of the purchased goods. We do this by grouping (groupby) by invoice number and item and summing the quantity. When selecting the data, we can also filter by the country attribute, and thus identify frequent combinations of purchases in different markets. The resulting table can be viewed by displaying its first rows.
basket = (data[data['Country'] == "France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
basket.head(20)
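The same transformation can be repeated for another market simply by changing the country filter; the sketch below builds an analogous table for Germany (assuming German transactions are present in the reduced dataset).
basket_de = (data[data['Country'] == "Germany"]        # select transactions from another market
             .groupby(['InvoiceNo', 'Description'])['Quantity']
             .sum().unstack().reset_index().fillna(0)
             .set_index('InvoiceNo'))
basket_de.head()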
For easier work with association rules, we binarize the data - the individual attribute values will only indicate whether the given item was purchased or not (we do not take the quantity into account). We therefore use a simple function to transform the data. At the same time, we also drop the POSTAGE attribute, which describes the postage rather than an actual product.
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1
basket_sets = basket.applymap(encode_units)
basket_sets.drop('POSTAGE', inplace=True, axis=1)
basket_sets.head()
To search for association rules, we first identify the so-called frequent itemsets. These are combinations of items (including single items) that occur in a sufficiently large share of transactions, and each is accompanied by its support - the share of transactions containing the itemset among all transactions. When generating these itemsets, the minimum support serves as a parameter to trim the number of identified patterns.
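As a small illustration of what support means, the sketch below computes it manually for a single column of the binarized table; the product name used is purely hypothetical and may not appear in the reduced dataset.
item = 'ALARM CLOCK BAKELIKE RED'                           # hypothetical item name, used only for illustration
if item in basket_sets.columns:
    support = basket_sets[item].sum() / len(basket_sets)    # share of transactions containing the item
    print(f"support({item}) = {support:.3f}")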
frequent_itemsets = apriori(basket_sets, min_support=0.13, use_colnames=True)
frequent_itemsets.head(20)
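Since the itemsets column contains frozensets, we can also inspect only the combinations of two or more items - a small illustrative filter, not part of the original notebook:
frequent_itemsets[frequent_itemsets['itemsets'].apply(len) >= 2]   # keep only multi-item itemsets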
We then generate association rules from the frequent itemsets in the form IF assumption (antecedent) THEN conclusion (consequent). For individual rules, the library also lists the support of the given rule (as well as the support of the antecedent and of the consequent), its confidence and its lift.
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=5)
rules
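To see how the listed metrics relate to each other, the following sketch recomputes the confidence and lift of the first generated rule from its support columns (assuming at least one rule was found):
first = rules.iloc[0]
conf_check = first['support'] / first['antecedent support']   # confidence = support(rule) / support(antecedent)
lift_check = conf_check / first['consequent support']         # lift = confidence / support(consequent)
print(conf_check, first['confidence'])
print(lift_check, first['lift'])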
With a large number of rules, it is of course possible to browse and trim the generated rules using filtering conditions on the rule metrics. For example, the expression below displays only those rules whose support is at least 0.13 and whose confidence is at least 0.8.
rules[ (rules['support'] >= 0.13) & (rules['confidence'] >= 0.8) ]
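The rules can also be ranked, for example by lift, to bring the strongest associations to the top (an illustrative step beyond the original filtering):
rules.sort_values('lift', ascending=False).head(10)   # ten rules with the highest lift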