

Summary Data Mining for Business & Governance full course


Summary of 133 pages for the course Data Mining For Business And Governance at UVT (Full course notes)


  • April 29, 2021
  • 2020/2021
DATA MINING FOR BUSINESS AND GOVERNANCE
Chris Emmery, Çiçek Güven & Gonzalo Nápoles



TABLE OF CONTENTS

Introduction to Data Mining
1. What is Data Mining?
1.1. Key aspects: Computation & Large data sets
1.2. Big Data
1.3. Applications
2. What makes prediction possible?
3. Data Mining as Applied Machine Learning
3.1. Supervised learning
3.2. Unsupervised learning

Introduction to Data Science
1. What is data science?
1.1. Example
1.2. Terminology
1.3. The algorithm
1.4. Evaluation
1.5. Computer hardware
2. Representing data
2.1. How do we get data?
2.2. File formats: raw-level representation of files
2.3. Databases: storing the data a bit more cleverly
2.4. Data science in practice: 80% vs. 20%
2.5. Representation of data

Articles week 1

Prediction (SL): regression & classification
1. What makes prediction possible?
1.1. Correlation Coefficient: Pearson’s r
2. Regression
3. Classification
3.1. Decision boundaries to label parts of the data as belonging to a certain category
3.2. ML algorithms for classification using decision boundaries
3.3. Multiclass classification (vs. binary classification)
4. Fitting and tuning
4.1. Fitting
4.2. Tuning
5. Evaluation
5.1. Metrics for evaluating a Regression Task
5.2. Metrics for evaluating a Classification Task
5.3. Schemes for applying metrics in model selection
5.4. Best practices & common pitfalls
6. Models
6.1. Model selection
6.2. What is ‘learning’?

Working with Text data
1. Representing text as vectors
1.1. Converting to numbers
2. Binary vectors for Decision Tree classification (ID3)
2.1. Inferring rules (decisions) by information gain (example: spam detection)
3. Using Vector Spaces and weightings
3.1. Binary vs. Frequency
3.2. Term frequencies
3.3. (Inverse) document frequency
3.4. Putting it together: tf * idf weighting
3.5. Normalizing vector representations
4. Document classification using k-NN
4.1. ℓ2 normalization
4.2. Cosine similarity
4.3. Using similarity in k-NN
5. Practical examples
5.1. Naive text cleaning
6. Document classification
6.1. Sentiment analysis
6.2. Build a model
6.3. Test our model

Dimensionality reduction
1. The importance of dimensions
2. Visualization
2.1. Box plots
2.2. Histogram
2.3. Scatter plots
3. Dimensionality reduction
3.1. Feature selection
3.2. Feature extraction
4. Deep neural networks

Unsupervised learning
1. Techniques
1.1. Crisp clustering through the k-means algorithm (most important method)
1.2. Fuzzy clustering through the fuzzy c-means algorithm
1.3. Hierarchical clustering
2. Distance function
3. Evaluation method
3.1. The Silhouette coefficient/score
3.2. Dunn index

Association mining
1. Measures: support & confidence
1.1. Support
1.2. Confidence
2. Mining association rules
3. Apriori algorithm
3.1. The algorithm
3.2. Considerations
3.3. Setting the support parameter (minsup)
3.4. Pattern evaluation
4. Itemset taxonomy
4.1. Maximal frequent itemset
4.2. Closed itemset
4.3. Maximal vs. closed
5. Quantitative association rules

Mining massive data
1. Parallelization
1.1. Requirements
1.2. How does parallelization work?
2. Bagging, Boosting, and Batching
2.1. Boosting (e.g. AdaBoost)
2.2. Averaging (e.g. Bagging, Random Forests)
2.3. Batching (online learning)
2.4. Drawbacks of ensemble methods
3. Distributed Computing
3.1. Distributing Machine Learning models
3.2. Distributed file storage
3.3. MapReduce

Deep learning
1. A brief history of AI
1.1. Alan Turing
1.2. Sci-project (1974)
1.3. The Sojourner Rover (1997)
1.4. “Sub-symbolic” AI (1988-2016)
2. Recognizing patterns
2.1. Neural networks
2.2. McCulloch-Pitts Neurons (1947)
2.3. Deep Learning (2015)
3. Many successes of DL
4. Conclusion
