Association rule mining - explained
posted on 14 Jun 2020 under category tutorial
Association rule mining is primarily focused on finding frequent co-occurring associations among a collection of items. It is sometimes referred to as “Market Basket Analysis”, since that was the original application area of association mining. The goal is to find associations of items that occur together more often than you would expect from a random sampling of all possibilities. The classic example of this is the famous Beer and Diapers association that is often mentioned in data mining books. The story goes like this: men who go to the store to buy diapers will also tend to buy beer at the same time.
An association rule has 2 parts: an antecedent (the “if” part) and a consequent (the “then” part). The antecedent is an item or itemset found in the data; the consequent is an item found in combination with the antecedent. An example is given below:
“If a customer buys bread, he’s 70% likely to buy milk.”
In the above association rule, bread is the antecedent and milk is the consequent. Simply put, rules like this help a retail store target its customers better. If the above rule is the result of a thorough analysis of some data sets, it can be used not only to improve customer service but also to increase the company’s revenue. Association rules are created by thoroughly analyzing data and looking for frequent if/then patterns.
Association mining is usually done on transaction data from a retail market or an online e-commerce store. Since most transaction data is large, the Apriori algorithm makes it possible to find these patterns or rules quickly.
So, what is a rule?
A rule is a notation that represents which item(s) are frequently bought with which other item(s). It has an LHS (left-hand side) and an RHS (right-hand side) part, and can be represented as follows:
itemset A => itemset B
This means that the item(s) on the right are frequently purchased along with the item(s) on the left.
The Apriori algorithm, the most common algorithm for association rule mining, generates the most relevant set of rules from given transaction data. It also reports the support, confidence and lift of those rules. These three measures can be used to decide the relative strength of a rule. So what do these terms mean?
Let’s consider the rule A => B in order to compute these metrics.
Support: Support indicates how frequently the if/then relationship appears in the database, i.e. the fraction of transactions that contain both A and B.
Confidence: Confidence tells how often the relationship has been found to be true: of the transactions that contain A, the fraction that also contain B.
Lift: Lift is the factor by which the co-occurrence of A and B exceeds the probability of A and B co-occurring had they been independent. So, the higher the lift, the higher the chance of A and B occurring together.
Other measures include Conviction, All-Confidence, Collective strength and Leverage.
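As an illustration, here is a small sketch of these measures in Python. The transaction data and the rule {bread} ⇒ {milk} below are made up for the example; the formulas follow the definitions above (conviction and leverage use their standard definitions).

```python
# Toy transaction database (hypothetical example data).
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"beer", "diapers"},
    {"milk", "bread", "diapers"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def measures(antecedent, consequent, db):
    """Compute the rule measures discussed above for A => B."""
    supp_a = support(antecedent, db)
    supp_b = support(consequent, db)
    supp_ab = support(antecedent | consequent, db)
    confidence = supp_ab / supp_a
    lift = supp_ab / (supp_a * supp_b)
    leverage = supp_ab - supp_a * supp_b
    # Conviction is infinite when confidence == 1 (division by zero).
    conviction = (1 - supp_b) / (1 - confidence) if confidence < 1 else float("inf")
    return {"support": supp_ab, "confidence": confidence,
            "lift": lift, "leverage": leverage, "conviction": conviction}

print(measures({"bread"}, {"milk"}, transactions))
```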
The problem can be stated as follows. Let I be a set of items and D a set of transactions, each of which is a subset of I. A simple rule looks like:
t_1 ⇒ t_2 (here, each t_i is generally a single item or a set of items; t_1 is the antecedent and t_2 the consequent)
Let I = {milk, bread, butter, beer, diapers} be the set of items. An example transaction database D is shown below:

| Transaction ID | milk | bread | butter | beer | diapers |
|----------------|------|-------|--------|------|---------|
| 1              | 1    | 1     | 0      | 0    | 0       |
| 2              | 0    | 0     | 1      | 0    | 0       |
| 3              | 0    | 0     | 0      | 1    | 1       |
| 4              | 1    | 1     | 1      | 0    | 0       |
| 5              | 0    | 1     | 0      | 0    | 0       |
Rule: {butter, bread} ⇒ {milk}, meaning that if butter and bread are bought, customers also buy milk.
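This rule can be checked numerically. The five transactions below are a small hypothetical database chosen so that the rule holds:

```python
# A small example database D over I (hypothetical transactions,
# chosen so that the rule {butter, bread} => {milk} holds).
D = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

antecedent = {"butter", "bread"}
consequent = {"milk"}

# Support of the antecedent alone, and of the whole rule (A union B).
supp_ante = sum(antecedent <= t for t in D) / len(D)
supp_rule = sum(antecedent | consequent <= t for t in D) / len(D)
confidence = supp_rule / supp_ante

print(supp_rule)    # 0.2 -- the rule appears in 1 of 5 transactions
print(confidence)   # 1.0 -- every basket with butter and bread also has milk
```

With a confidence of 1.0, every customer who bought butter and bread also bought milk, which is exactly what the rule claims.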
Thresholds used for Relations
- Support — An indication of how frequently the itemset appears in the database. It is defined as the fraction of records that contain X ∪ Y out of the total number of records in the database. If the support of an itemset is 0.1%, only 0.1% of the transactions contain that itemset.
Support(X ⇒ Y) = (Support count of X ∪ Y) / (Total number of transactions in D)
- Confidence — An indication of how often the rule has been found to be true: out of the transactions that contain X, the fraction that also contain Y.
Confidence(X ⇒ Y) = Support(X ∪ Y) / Support(X)
- Lift — The ratio of the observed support to that expected if X and Y were independent.
Lift(X ⇒ Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
Apriori uses a breadth-first search strategy to count the support of itemsets, and a candidate generation function that exploits the downward closure property of support: every subset of a frequent itemset must itself be frequent. Rules are then evaluated with the three measures we discussed above.
The algorithm then proceeds in the following steps:
1. Set a minimum support threshold.
2. Scan the database to count the support of each single item, and keep those that meet the threshold (the frequent 1-itemsets).
3. Join the frequent k-itemsets with each other to generate candidate (k+1)-itemsets, and prune any candidate that has an infrequent subset (downward closure).
4. Scan the database again to count the support of the surviving candidates and discard the infrequent ones.
5. Repeat steps 3–4 until no new frequent itemsets are found.
6. Generate rules from the frequent itemsets, keeping those that meet a minimum confidence threshold.
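A minimal sketch of the Apriori level-wise search described above, in Python with a made-up five-transaction database (it returns the frequent itemsets; rule generation is omitted for brevity):

```python
from itertools import combinations

def apriori(db, min_support):
    """Minimal Apriori sketch: returns {frozenset: support} for all
    frequent itemsets in `db` (a list of sets)."""
    n = len(db)

    def supp(itemset):
        return sum(itemset <= t for t in db) / n

    # Frequent 1-itemsets.
    items = {i for t in db for i in t}
    frequent = {frozenset([i]): supp(frozenset([i]))
                for i in items if supp(frozenset([i])) >= min_support}
    result = dict(frequent)
    k = 1
    while frequent:
        # Join step: combine frequent k-itemsets into (k+1)-candidates.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: downward closure -- every k-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        # Count step: one database scan per level (breadth-first).
        frequent = {c: supp(c) for c in candidates if supp(c) >= min_support}
        result.update(frequent)
        k += 1
    return result

freq = apriori([
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
], min_support=0.4)
print(freq)
```

Note how each level only needs one pass over the database, and pruning keeps the candidate set small.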
It must be kept in mind that the values of Support, Lift and Confidence may seem mathematical in the equations above, but choosing their thresholds is experimental in nature. We choose a value for the parameters, run the algorithm, then change the value of those parameters and run the algorithm again. We base these values on the empirical results, i.e. the set of rules obtained.
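This tuning loop can be sketched as follows; the transactions are hypothetical, and a brute-force enumeration stands in for a real miner since the database is tiny:

```python
from itertools import combinations

# Hypothetical transactions for the threshold experiment.
db = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def frequent_itemsets(db, min_support):
    """Brute-force frequent itemsets (fine for tiny examples only)."""
    items = sorted({i for t in db for i in t})
    found = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = sum(set(combo) <= t for t in db) / len(db)
            if s >= min_support:
                found.append((set(combo), s))
    return found

# Re-running with different thresholds shows how sensitive the
# output is to the choice of min_support.
for min_support in (0.2, 0.4, 0.6):
    print(min_support, "->", len(frequent_itemsets(db, min_support)), "itemsets")
```

Lowering the threshold floods you with itemsets; raising it leaves almost none, which is why the choice is made empirically.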
Pros
- Easy to understand and implement.
- The downward closure property prunes the candidate space aggressively, so the search stays tractable level by level.
In the Eclat model, we only have support. When we calculate support in an Eclat model, we consider the prevalence of a set of items, not of individual items. This makes sense because, with support alone, the support of an individual item is just the frequency of that item and nothing more.
The algorithm, as one would intuitively assume, works as follows:
1. Transform the database into its vertical format: for each item, store the set of transaction IDs (the TID set) in which it appears.
2. The support of an itemset is the size of the intersection of its items’ TID sets, divided by the total number of transactions.
3. Depth-first, extend each frequent itemset by intersecting its TID set with that of another item, keeping only the extensions that meet the minimum support.
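A minimal sketch of this TID-set intersection approach in Python, again on made-up transactions (for the same data and threshold it finds the same frequent itemsets as Apriori would):

```python
def eclat(db, min_support):
    """Minimal Eclat sketch: depth-first search over TID-set intersections.
    Returns {frozenset: support} for all frequent itemsets."""
    n = len(db)
    min_count = min_support * n

    # Step 1: vertical format -- item -> set of transaction IDs.
    tidsets = {}
    for tid, transaction in enumerate(db):
        for item in transaction:
            tidsets.setdefault(item, set()).add(tid)

    result = {}

    def search(prefix, prefix_tids, candidates):
        for i, (item, tids) in enumerate(candidates):
            # Step 2: support of prefix + item via TID-set intersection.
            new_tids = prefix_tids & tids
            if len(new_tids) >= min_count:
                itemset = prefix | {item}
                result[frozenset(itemset)] = len(new_tids) / n
                # Step 3: recurse depth-first on the remaining items.
                search(itemset, new_tids, candidates[i + 1:])

    search(set(), set(range(n)), sorted(tidsets.items()))
    return result

freq = eclat([
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
], min_support=0.4)
print(freq)
```

Because support comes from set intersections rather than repeated database scans, the database is read only once, during the vertical transformation.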
The Frequent-Pattern Growth algorithm works in the following manner:
1. Scan the database once to count the support of every item, and discard the infrequent items.
2. Sort the surviving items in each transaction by descending support.
3. Scan the database a second time, inserting each sorted transaction into a prefix tree (the FP-tree); transactions with shared prefixes share nodes, and each node keeps a count.
4. Mine the tree recursively: for each item, build its conditional pattern base (the prefix paths leading to it) and a conditional FP-tree, and grow longer patterns from it.
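A sketch of the two-scan FP-tree construction phase in Python (the recursive mining phase is omitted to keep the example short; the transactions are hypothetical):

```python
from collections import Counter

class Node:
    """A node of the FP-tree: an item, a count, and children by item."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(db, min_support):
    """Sketch of the two-scan FP-tree construction (no mining phase)."""
    n = len(db)
    # Scan 1: count item supports and drop infrequent items.
    counts = Counter(item for t in db for item in t)
    frequent = {i for i, c in counts.items() if c / n >= min_support}

    root = Node(None, None)
    # Scan 2: insert each transaction, items sorted by descending support
    # so that common prefixes are shared between transactions.
    for t in db:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root

tree = build_fp_tree([
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
], min_support=0.4)

# The three baskets containing bread share a single "bread" node.
print(tree.children["bread"].count)
```

The prefix sharing is the point of the structure: the whole database is compressed into a tree that is usually much smaller than the transaction list, and no candidate itemsets are ever generated.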
Pros
- Requires only two scans of the database, versus one scan per level for Apriori.
- No candidate generation step, which typically makes it faster than Apriori on large datasets.