Association Rules on Business Problem

Mehmet Akturk
5 min readFeb 4, 2021

Association analysis is one of the most common applications in data science, and you will often encounter it under the name “recommendation systems”. In this article I will try to explain this technique, which works in the background of the applications and sites we use almost every moment, through an example Kaggle notebook.

https://www.edugrad.com/build-recommendation-system-in-python

Association rule learning is a rule-based machine learning technique used to find patterns (relationships, structures) in data.


You may have encountered these applications in forms such as “customers who bought this product also bought that product”, “those who viewed this ad also viewed these ads”, “we created a playlist for you”, or “recommended next video”.

These are the scenarios most frequently encountered in e-commerce data science and data mining work.

The world’s largest e-commerce companies and many platforms like Spotify, Amazon, and Netflix rely on recommendation systems, so it is worth getting to know them a little more closely.

https://www.forbes.com/sites/jonmarkman/2019/07/31/spotify-shares-are-getting-killed-and-its-a-big-opportunity/?sh=2cd6734f5878

So how is this association analysis carried out, in summary?

Apriori Algorithm

It is the most used method in this field.

Association rule analysis is carried out by examining some metrics:

  • Support
    This measure gives an idea of how frequent an itemset is in all the transactions. Consider itemset1 = {bread} and itemset2 = {shampoo}. There will be far more transactions containing bread than those containing shampoo. So as you rightly guessed, itemset1 will generally have a higher support than itemset2. Now consider itemset1 = {bread, butter} and itemset2 = {bread, shampoo}. Many transactions will have both bread and butter on the cart but bread and shampoo? Not so much. So in this case, itemset1 will generally have a higher support than itemset2. Mathematically, support is the fraction of the total number of transactions in which the itemset occurs.
    Support(X, Y) = Freq(X, Y) / N
    X, Y: products
    N: total number of transactions
    Value of support helps us identify the rules worth considering for further analysis. For example, one might want to consider only the itemsets which occur at least 50 times out of a total of 10,000 transactions i.e. support = 0.005. If an itemset happens to have a very low support, we do not have enough information on the relationship between its items and hence no conclusions can be drawn from such a rule.
  • Confidence
    What do you think would be the confidence for {Butter} → {Bread}? That is, what fraction of transactions having butter also had bread? Very high i.e. a value close to 1? That’s right. What about {Yogurt} → {Milk}? High again. {Toothbrush} → {Milk}? Not so sure? Confidence for this rule will also be high since {Milk} is such a frequent itemset and would be present in every other transaction.
    Confidence(X, Y) = Freq(X, Y) / Freq(X)
    It does not matter what you have in the antecedent for such a frequent consequent. The confidence for an association rule having a very frequent consequent will always be high.
    For example, the confidence for {Toothbrush} → {Milk} will be 10/(10+4) ≈ 0.71.
    Looks like a high confidence value. But we know intuitively that these two products have a weak association and there is something misleading about this high confidence value. Lift is introduced to overcome this challenge.
  • Lift (how much the purchase of one product increases the purchase of the other)
    Lift(X, Y) = Support(X, Y) / (Support(X) * Support(Y))
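These three metrics can be sketched in a few lines of plain Python; the basket data below is purely illustrative:

```python
# Toy transactions; each basket is a set of purchased items (illustrative data).
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "shampoo"},
    {"milk", "toothbrush"},
    {"bread", "butter", "toothbrush"},
]
N = len(transactions)

def support(*items):
    """Fraction of all baskets that contain every item in `items`."""
    return sum(set(items) <= t for t in transactions) / N

def confidence(x, y):
    """Support(X, Y) / Support(X): how often Y appears given that X was bought."""
    return support(x, y) / support(x)

def lift(x, y):
    """Support(X, Y) / (Support(X) * Support(Y)); > 1 means a positive association."""
    return support(x, y) / (support(x) * support(y))

print(support("bread", "butter"))    # 3/5 = 0.6
print(confidence("butter", "bread")) # every butter basket also has bread: 1.0
print(lift("butter", "bread"))       # 0.6 / (0.6 * 0.8) ≈ 1.25
```

Note that confidence(butter → bread) is 1.0 while the lift is only ≈ 1.25: this is exactly the distinction the lift metric was introduced for.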

The stage of understanding the data and preparing it for processing (for example, converting categorical variables into 0/1 indicator columns) should be completed first. With the formulas above in hand, we will compute these metrics in order.

As the environment and conditions are all ready, there is not much left to do, don’t worry! First, frequent itemsets are extracted from the one-hot encoded dataframe with the help of the “apriori” function:

from mlxtend.frequent_patterns import apriori  # assuming the mlxtend implementation, whose signature matches this call

freq_items = apriori(ohe_df, min_support = 0.2, use_colnames = True, verbose = 1)
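The `ohe_df` above is assumed to be a one-hot encoded table: one row per basket, one True/False column per product (mlxtend’s `TransactionEncoder` builds exactly this kind of table). A minimal plain-Python sketch of that encoding step, with illustrative basket data:

```python
# Raw baskets as lists of product names (illustrative data).
baskets = [["bread", "butter"], ["bread", "milk"], ["butter", "milk", "bread"]]

# Collect the distinct products, then build one boolean row per basket:
# True where the basket contains that product, False otherwise.
items = sorted({item for basket in baskets for item in basket})
ohe_rows = [[item in basket for item in items] for basket in baskets]

print(items)  # ['bread', 'butter', 'milk']
for row in ohe_rows:
    print(row)
```

Wrapping `ohe_rows` in a DataFrame with `items` as the column names gives a table in the shape `apriori` expects.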

If you like, you can apply head() at this stage to display the frequent itemsets. And if you are following along, your hand will inevitably reach for head() to see what has been done so far.

Finally, by applying a confidence threshold to the dataframe we created, we get our final result:

from mlxtend.frequent_patterns import association_rules  # assuming the mlxtend implementation, whose signature matches this call

association_rules(freq_items, metric = "confidence", min_threshold = 0.6)
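Under the hood, rule generation from frequent itemsets can be sketched in plain Python. The itemsets and support values below are illustrative, and only single-item consequents are generated to keep the sketch short:

```python
# Frequent itemsets with their supports (illustrative values).
supports = {
    frozenset({"bread"}): 0.8,
    frozenset({"butter"}): 0.6,
    frozenset({"bread", "butter"}): 0.6,
}

def rules(supports, min_confidence):
    """Generate X -> Y rules from frequent itemsets, keeping those above min_confidence."""
    out = []
    for itemset, sup_xy in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one antecedent and one consequent
        for item in itemset:
            antecedent = itemset - {item}
            conf = sup_xy / supports[antecedent]  # Confidence(X, Y) = Support(X, Y) / Support(X)
            if conf >= min_confidence:
                out.append((antecedent, frozenset({item}), conf))
    return out

for x, y, conf in rules(supports, min_confidence=0.6):
    print(set(x), "->", set(y), round(conf, 2))
```

With these numbers, both {butter} → {bread} (confidence 1.0) and {bread} → {butter} (confidence 0.75) pass the 0.6 threshold.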

And the association analysis results are in front of you:

https://github.com/Mathchi/Association-Rules-on-Business-Problem/blob/master/association-rules-on-business-problem.ipynb

Association Rule Mining

Now that we understand how to quantify the importance of the association of products within an itemset, the last step is to generate rules from the entire list of items and identify the most important ones. This is not as simple as it might sound. Supermarkets will have thousands of different products in store; a simple calculation shows that just 10 products lead to about 57,000 possible rules, and this number grows exponentially with the number of items. Computing lift values for each of these becomes computationally very expensive. How do we deal with this problem? How do we come up with a set of the most important association rules to consider? The Apriori algorithm comes to our rescue here. You can find a dynamic example repository for this on my GitHub page.
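The pruning idea behind Apriori can be sketched in plain Python on illustrative basket data: candidate k-itemsets are built only from frequent (k-1)-itemsets, and a candidate is kept only if all of its (k-1)-subsets are themselves frequent.

```python
from itertools import combinations

# Illustrative basket data.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter", "milk"},
]

def apriori_frequent(transactions, min_support):
    """Level-wise search for frequent itemsets, pruning with the Apriori principle."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # Level 1: frequent single items.
    frequent = {frozenset({i}) for i in items
                if sum(i in t for t in transactions) / n >= min_support}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Candidates: unions of frequent (k-1)-itemsets of size exactly k...
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # ...pruned by the Apriori principle: every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Keep only the candidates that actually meet the support threshold.
        frequent = {c for c in candidates
                    if sum(c <= t for t in transactions) / n >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

print(apriori_frequent(transactions, min_support=0.5))
```

On this data, with min_support = 0.5, all three single items and all three pairs are frequent, but the triple {bread, butter, milk} appears in only one of four baskets and is dropped, so no level-4 candidates are ever generated.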

I have tried to cover all the important terms and concepts related to mining association rules in this post, going into detail wherever necessary. The following are one-line summaries of a few terms introduced in the process:

  1. Association rule mining:
    i) Itemset generation,
    ii) Rule generation
  2. Apriori principle: All subsets of a frequent itemset must also be frequent
  3. Apriori algorithm: Pruning to efficiently get all the frequent itemsets
  4. Maximal frequent itemset: none of the immediate supersets are frequent
  5. Closed frequent itemset: none of the immediate supersets have the same value of support

After the analysis, we can easily see which products frequently occur together and how strong those connections are.

I hope you enjoyed reading this and have more clarity in thoughts than before.

This is all I have written about “Association Rules on Business Problem”. If you want to know more about data science, big data, and related topics, you can check out my other articles in the series. For example:
What is the Big Data?

You can reach me on my LinkedIn account with all your questions and requests.

Hope to meet you in other series articles and articles…🖖🏼

References
1. https://www.kaggle.com/mathchi/association-rules-on-business-problem
2. https://www.edugrad.com/build-recommendation-system-in-python
3. https://www.forbes.com/sites/jonmarkman/2019/07/31/spotify-shares-are-getting-killed-and-its-a-big-opportunity/?sh=2cd6734f5878
4. https://github.com/Mathchi/Association-Rules-on-Business-Problem/blob/master/association-rules-on-business-problem.ipynb
5. https://towardsdatascience.com/association-rules-2-aa9a77241654
6. https://github.com/Mathchi/DS_Dynamics-Association-Rules-Learning_ARL
