Summary analysis of customer data
Association rule mining tries to find correlations between different items, or sets of
items, in a database.
- A rule-based machine learning method for discovering interesting relations between
variables in large databases.
- Initially used for market basket analysis to find how items purchased by customers
are related.
- Assumes all data are categorical. There is no good algorithm for numerical data, so
numerical attributes must be discretized first.
Terminology
Set of items I = {i1, i2, …, im}
Supermarket example: a list of all articles sold in a supermarket.
Transaction t (smaller set of items)
t ⊆ I: t is a subset of the entire set of items I.
Supermarket example: the articles that someone buys in a supermarket.
Transaction database T = {t1, t2, …, tn}
Collection of transactions.
Example transaction database:
Another example: you could analyze a set of documents to see what kind of words, concepts,
or tags co-occur often. At that point, each document could be correlated to a list of
keywords.
- The keywords are items.
- Each document is a transaction.
- All the documents put together make up a transaction dataset.
- All the words that appear in the entire transaction database make up I, the set of all
possible items.
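The document example can be sketched in a few lines of Python (the documents and keywords here are made up for illustration):

```python
# Hypothetical documents, each reduced to its keywords.
docs = [
    "apriori support confidence",
    "support confidence rules",
    "apriori rules",
]

# Each document is a transaction: the set of keywords it contains.
transactions = [set(doc.split()) for doc in docs]

# I, the set of all possible items, is the union of all transactions.
I = set().union(*transactions)
print(sorted(I))  # ['apriori', 'confidence', 'rules', 'support']
```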
Association rules
Association rules are "if-then" statements that show the probability of relationships
between data items within large data sets in various types of databases.
An association rule is an implication of the form:
- X → Y, where X, Y ⊂ I, and X ∩ Y = ∅
A single item is an itemset of size 1.
A k-itemset is an itemset with k items.
- E.g., {milk, bread, cereal} is a 3-itemset.
How do you determine which patterns are good/interesting or not interesting?
Support measure tells us how often a particular item occurs in our database or
how often a particular association rule holds in our dataset.
Sup = Pr(X ∪ Y)
It measures the probability of encountering an association rule or an itemset in
the dataset.
Confidence measure looks at the conditional probability. X implies Y, but that
doesn't mean that Y also implies X; it's a one-directional association rule.
Conf = Pr(Y | X)
The confidence tells us what proportion of transactions that contain X also
contain Y.
When you’ve computed the support and confidence of a particular association rule, you can
tell how frequently it occurs and you can tell how confident you are in that rule. So, an
association rule tells us that when X occurs, Y occurs with a certain probability.
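A minimal sketch of these two measures (the transactions and itemsets below are hypothetical):

```python
# Hypothetical transaction dataset.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "cereal"},
    {"bread"},
    {"milk", "cereal"},
]
n = len(transactions)

def support(itemset):
    # Relative support: proportion of transactions that contain the itemset.
    return sum(itemset <= t for t in transactions) / n

def confidence(X, Y):
    # Conditional probability Pr(Y | X) = sup(X ∪ Y) / sup(X).
    return support(X | Y) / support(X)

print(support({"milk", "bread"}))       # 0.5
print(confidence({"milk"}, {"bread"}))  # 2/3
```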
Support count = number of transactions in T that contain X.
- T = transaction dataset and X = itemset
n = the number of transactions in T.
Support and support count are often used as synonyms.
Absolute support = support count (number of transactions that contain the itemset)
Whole number
Relative support = support (proportion of transactions that contain the itemset)
Real number between 0 and 1
Goal = to find all rules that satisfy the user-specified minimum support (minsup) and
minimum confidence (minconf).
- Confidence measures how strong the rule is, support measures how reliable it is.
Example:
Transaction data:
Minsup = 30%
Minconf = 80%
An example frequent itemset:
- {Chicken, Clothes, Milk} [sup = 3/7]
Association rules from the itemset:
- {Clothes} → {Milk, Chicken} [sup = 3/7, conf = 3/3] confident
- {Milk, Chicken} → {Clothes} [sup = 3/7, conf = 3/4] not confident
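The original transaction table is not reproduced in these notes, so the sketch below uses a hypothetical 7-transaction dataset chosen to reproduce the counts in this example:

```python
from fractions import Fraction

# Hypothetical dataset: chosen so the example's counts come out the same.
T = [
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk"},
    {"Chicken", "Beef"},
    {"Milk", "Bread"},
    {"Bread"},
]

def sup(itemset):
    # Relative support as an exact fraction.
    return Fraction(sum(itemset <= t for t in T), len(T))

def conf(X, Y):
    # conf(X -> Y) = sup(X ∪ Y) / sup(X)
    return sup(X | Y) / sup(X)

print(sup({"Chicken", "Clothes", "Milk"}))     # 3/7
print(conf({"Clothes"}, {"Milk", "Chicken"}))  # 1   (3/3: confident)
print(conf({"Milk", "Chicken"}, {"Clothes"}))  # 3/4 (below minconf = 80%)
```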
Mining algorithms
There are many. They all use different strategies and data structures.
- Any algorithm should find the same set of rules although their computational
efficiencies and memory requirements may be different.
We’ll study only one:
The apriori algorithm
Has two steps:
1. Find all item sets that satisfy the minimum support threshold (frequent item sets).
2. Use frequent item sets to generate rules.
Step 1: mining all frequent itemsets
Frequent itemset = itemset whose support ≥ minsup.
The apriori property (the key idea of the apriori algorithm): all subsets of a frequent itemset
are also frequent itemsets.
Iterative algo (also called level-wise search):
- Find all 1-item frequent itemsets, then all 2-item itemsets, etc.
- In each iteration k, only consider candidate k-itemsets all of whose (k−1)-subsets are
frequent (by the apriori property).
Example:
- Minsup = 0.5
You can see that, for itemsets of each length, you only keep the ones that satisfy the
minsup value. That is why itemset {4} is dropped in the F1 step.
The items are sorted in lexicographical order.
- Also called dictionary order.
For example, the permutations of {1,2,3} in lexicographic order are 123, 132, 213,
231, 312, and 321.
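In Python, `itertools.permutations` yields its results in lexicographic order when the input is sorted, matching the example above:

```python
from itertools import permutations

# Permutations of a sorted input come out in lexicographic (dictionary) order.
perms = ["".join(map(str, p)) for p in permutations([1, 2, 3])]
print(perms)  # ['123', '132', '213', '231', '312', '321']
```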
Apriori candidate generation
Join step: generate all possible candidate itemsets Ck of length k.
Prune step: remove those candidates in Ck that cannot be frequent.
Simpler explanation:
1. Generate a list of all candidate itemsets of length 1: each item in the dataset is a
candidate itemset.
2. Prune infrequent 1-itemsets: any candidate 1-itemset that does not meet the
minsup threshold is discarded.
3. Generate a list of candidate 2-itemsets: the algorithm generates itemsets of length
2 by combining frequent 1-itemsets.
4. Prune infrequent 2-itemsets: any candidate 2-itemset that does not meet the
minsup threshold is discarded. The process repeats at lengths 3, 4, … until no new
frequent itemsets are found.
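The generate-and-prune steps above can be sketched as a level-wise loop. This is a minimal illustration, not an optimized implementation; the four transactions are hypothetical but consistent with the minsup = 0.5 example earlier, where item 4 is infrequent and dropped from F1:

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup):
    """Level-wise search: returns {frozenset: support} for all frequent itemsets."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # F1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]): s for i in items
            if (s := support(frozenset([i]))) >= minsup}
    result = dict(freq)
    k = 2
    while freq:
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        prev = list(freq)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # Keep only candidates that meet the minsup threshold.
        freq = {c: s for c in candidates if (s := support(c)) >= minsup}
        result.update(freq)
        k += 1
    return result

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori_frequent_itemsets(T, 0.5)
print(result)
```

With this data, {4} fails minsup at level 1 and the largest frequent itemset found is {2, 3, 5} with support 0.5.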
Step 2: generating rules from frequent itemsets
One more step is needed to generate association rules.
The algorithm generates association rules from the frequent itemsets, where each rule
consists of an antecedent (a set of items) and a consequent (a set of items). The confidence
of each rule is calculated as the support of the combined antecedent and consequent divided
by the support of the antecedent alone.
- A → B is an association rule if: confidence(A → B) > minconf
(support(A → B) = support(A ∪ B),
confidence(A → B) = support(A ∪ B) / support(A))
Generating rules example:
Suppose {2, 3, 4} is frequent, with sup=50%
- Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%,
75%, 75%, 75%, 75% respectively
These generate the following association rules:
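The list of generated rules is not reproduced in these notes; the sketch below derives them from the subset supports above, using the earlier minconf = 80% (the support dictionary is written out by hand):

```python
from itertools import combinations

def generate_rules(freq, minconf):
    """From {frozenset: support}, emit rules A -> B with conf(A -> B) >= minconf."""
    rules = []
    for itemset, sup_xy in freq.items():
        if len(itemset) < 2:
            continue
        # Every proper nonempty subset A can serve as an antecedent.
        for r in range(1, len(itemset)):
            for A in map(frozenset, combinations(itemset, r)):
                conf = sup_xy / freq[A]  # conf(A -> B) = sup(A ∪ B) / sup(A)
                if conf >= minconf:
                    rules.append((set(A), set(itemset - A), sup_xy, conf))
    return rules

# Supports from the {2, 3, 4} example above.
freq = {
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
    frozenset({2, 3}): 0.5, frozenset({2, 4}): 0.5, frozenset({3, 4}): 0.75,
    frozenset({2, 3, 4}): 0.5,
}
for A, B, s, c in generate_rules(freq, 0.8):
    print(A, "->", B, f"sup={s}, conf={c:.2f}")
```

Among the rules from {2, 3, 4} itself, only {2,3} → {4} and {2,4} → {3} reach conf = 1 and pass minconf; all six rules with a smaller antecedent have conf ≈ 0.67.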
The apriori algorithm is a computationally very expensive algorithm.
Clearly the space of all association rules is exponential, O(3^m), where m is the number of
items in I. The mining exploits sparseness of data, and high minimum support and high
minimum confidence values. Still, it typically produces a huge number of rules, thousands,
tens of thousands, millions, …
Itemsets are unordered collections of items, events, etc.
Sequential patterns: items, elements, events, etc. are ordered.
Transactional database vs. sequential database