RSM, Erasmus University
MSc BIM 2021-2022
Kathleen Gaillot
BM04BIM • 5 EC
Big Data Management & Analytics
Summary Notes/imagery based on lectures
Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking (2013) by Provost, F. and Fawcett, T., ISBN: 9781449361327
Table of Contents
1. Introduction & Data-Analytics Thinking
1.1 Data Science, Engineering & Data-Driven Decision Making
1.1.1 The Data Mining Process - CRISP-DM
1.2 Data as a Strategic Asset
1.2.1 Different Types of Analytics
2. Data Collection & Preparation
2.1 Data for a Data Mining Project
2.2 Data from Online Sources: Webscraping
2.3 Data from Online Sources: API
2.4 Types and Properties of Data
2.4.1 Volume & Velocity: Data Structure
2.4.2 Variety: Types of Data
2.4.3 Veracity: Missing Data
3. Predictive Modeling
3.1 Introduction to Predictive Modeling
3.1.1 Data Science Tasks: Two Groups
3.1.2 Predictive vs. Diagnostic: Causal/Explanatory Modeling
3.1.3 Predictive Modeling
3.2 Regression via Mathematical Functions: Linear Regression
3.3 Classification via Mathematical Functions: Logistic Regression
3.3.1 Logistic Regression
3.4 Classification via Rule-Based Formulas: Tree Induction
3.4.1 Decision Trees
3.5 Information & Entropy
3.5.1 Entropy
3.6 Model Evaluation
3.6.1 The Confusion Matrix
4. Model Fit, Overfit & Evaluation
4.1 Model Performance & Overfit
4.1.1 Generalization
4.1.2 Overfitting
4.1.3 Overfitting in Tree Induction
4.1.3 Avoiding Overfitting for Tree Induction
4.1.4 Overfitting in Mathematical Functions
4.2 Cross-Validation
4.3 Visualizing Model Performance
4.3.1 Variable Importance
4.4 ROC Curve & Area Under the Curve
4.4.1 The ROC Space
4.4.2 Area Under the ROC Curve (AUC)
4.5 Expected Value Framework
4.5.1 Comparing Models
4.5.2 Benefit/Cost Matrix
4.6 Variance-Bias Trade-off & Ensembles
4.6.1 Underfitting & Overfitting
4.6.2 Ensemble Methods
5. Big Data Architecture, Engineering & Tools
5.1 The Need for Big Data Technologies
5.1.1 Client / Server Model
5.1.2 Scaling Problems
5.1.3 Fault-Tolerance Issues & Data Corruption
5.1.4 Desired Properties of a Big Data System
5.2 Lambda Architecture
5.2.1 Data Representation in Lambda Architecture
5.3 Examples of Big Data Tools
5.3.1 Distributed File Systems
5.3.2 Hadoop MapReduce
5.3.3 Big Data Infrastructure as a Service
5.3.4 Machine Learning Tools
6. Additional Data Science Topics
6.1 Introduction to Clustering, Distance Measures
6.1.1 Clustering: What Is It?
6.1.2 Distance Measures: Euclidean Distance
6.1.3 Distance Measures: Manhattan Distance
6.1.4 Jaccard Distance
6.1.5 Cosine Distance
6.1.6 Edit Distance: Levenshtein Metric
6.2 Clustering Methods
6.2.1 Prototype-Based: K-Means Clustering
6.2.2 Hierarchical Clustering
6.3 Text Mining: How to Represent Text Numerically
6.3.1 What Is Text Mining
6.3.2 Why Text Mining
6.3.3 How to Represent Text Numerically
6.3.4 Determining Features of a Document
6.3.5 Assigning Values to Every Feature: Basic Measurements
6.3.6 Assigning Values to Every Feature: TFIDF
6.3.7 How to Represent Text Numerically: Summary
6.4 Text Mining: Cleaning and Pre-Processing Text
6.4.1 Case Normalization
6.4.2 Removing Punctuation
6.4.3 Removing Numbers
6.4.4 Removing Stopwords
6.4.5 Word Stemming & Stem Completion
6.4.6 Summary
6.4.7 Visualizing Customer Reviews
7. Advanced Topics
7.1 Fairness & Machine Learning
7.1.1 Sensitive Characteristics
7.1.3 Independence (R ⊥ A)
7.1.4 Separation (R ⊥ A | Y)
7.1.5 Sufficiency (Y ⊥ A | R)
7.2 Neural Networks & Deep Learning
7.2.1 Deep Learning
7.3 Ensemble Methods
7.4 Winning Data Science Competitions
7.5 Revisiting Predictive Modeling
7.5.1 Problems with Predictive Modeling
7.5.2 Solution: Experimental Data
7.6 Uplift Modeling
7.6.1 How to Build an Uplift Model
8. Examination Information
8.1 Course Recap & Learning Objectives
8.2 Practice
Learning Objectives
- Describe the main steps of the Cross-Industry Standard Process for Data Mining (CRISP-DM);
- Explain the main concepts and algorithms associated with predictive modeling;
- Distinguish among the four different Data Science methods covered in class;
- Given a specific Big Data problem, apply the adequate Data Science method to solve it, and evaluate its performance (both practical and statistical);
- Using the frameworks discussed in class, recognize the ethical dilemmas in collecting and analyzing Big Data (Fairness);
- Evaluate the ethical position of a firm in a specific data collection situation;
- Apply the Big Data tools architectural principles to formulate a solution for real-world scenarios.
1. Introduction & Data-Analytics Thinking
1.1 Data Science, Engineering & Data-Driven Decision Making
Two major components support data-driven decision-making:
1. Data science: the set of processes & methods that allow us to reach conclusions from data in an automated fashion [most of the course focuses on this aspect]
2. Data engineering & processing: the set of technologies and processes that support the data science component (includes the big data technologies along with other relevant communication and hardware infrastructures)
AI: set of paradigms that try to create intelligent behavior using computers.
Machine learning tends to be the most successful AI approach in the world: gather lots of data, apply statistical methods, and predict the future. Other methods exist; in the late 80s, reasoning algorithms (creating new information by logical inference) were attempted, a less popular and less successful AI approach.
Computation: e.g. Google indexes the internet – it knows which term is on which page. Essentially
‘processing’ the data = computation.
1.1.1 The Data Mining Process - CRISP-DM
CRoss Industry Standard Process for Data Mining (CRISP-DM)
Data Science in Steps: An Iterative Process
1. Business understanding: crucial to understand the problem to be solved (not trivial – what exactly is the problem? It shouldn't be too broad, else it is not realistic; define the scope as clearly as possible).
2. Data understanding: what data is available? What are its strengths and limitations? E.g. historical data is typically very useful, but in some situations it may not be appropriate (it may have been collected for a different purpose and may miss critical information or be unreliable). As we learn more about the available data and its potential issues, we may want to go back and rethink the problem we want to solve.
3. Data preparation: it is unlikely that the data we have can be used directly in our model. There are usually restrictions on the format of the data that we use as inputs: the data may need to be aggregated, its format changed, its values normalized, etc. Usually a lot of time is spent on this step.
4. Modelling: here we create the models that will help us answer the question defined in the business understanding stage. This is where the different data mining techniques will be applied. Different techniques require different data formats, which is why you may need to go back to the data preparation step.
5. Evaluation: the purpose of this stage is to assess the outcome of the modelling stage and determine whether the resulting models are useful for solving the problem. If the outcomes are not good, we may need to go back to the first (business understanding) stage and reassess whether we're trying to solve the problem in the best way. If the results are good, we move on to the last stage.
6. Deployment: the results of the data mining output are put into real use in production. When the model is deployed, it starts generating new information, which can be used to further refine the business understanding. This could then prompt a new iteration of the whole process.
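To make steps 3-5 concrete, here is a minimal Python sketch of one pass through data preparation, modelling and evaluation, assuming scikit-learn and a hypothetical file churn.csv with a binary churn column (the file and column names are illustrative, not from the course):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 3. Data preparation: load, drop missing values, split features/target
df = pd.read_csv("churn.csv").dropna()          # hypothetical data set
X, y = df.drop(columns=["churn"]), df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# 4. Modelling: fit a classifier on the training data
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 5. Evaluation: check performance on held-out data; if it is poor,
# iterate back to business/data understanding (steps 1-2)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

If the evaluation is unsatisfactory, the CRISP-DM loop sends us back to earlier stages rather than forward to deployment.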
1.2 Data as a Strategic Asset
1.2.1 Different Types of Analytics
[Figure: the four types of analytics plotted by difficulty to implement (x-axis) against the (potential) value you get in return (y-axis).]
Correlation vs. causation: correlations are the main drivers of predictive models – ‘what will happen?’ can be asked once you know the items are correlated. Causation adds the causal dimension: you have to know why things happen, i.e. what causes what.
Individual vs. aggregate: individual questions differ per person (has/will customer x bought/buy product y?), whereas aggregate questions concern the population as a whole (how many items were sold? how many people buy product x? averages).
Different Types of Analytics Have Different Roles for Humans
From left to right: an increasing value and difficulty. From information to optimization.
Descriptive/Diagnostic Analytics – What happened? Why did it happen? (Hindsight)
- Data Visualisation: display and summarize data in ways that convey crucial information about a large data set.
- Clustering: groups individuals in a population together by their similarity, but not driven by any specific purpose.
- Co-occurrence Grouping: attempts to find associations between entities based on transactions involving them.
Predictive Analytics – What will happen? (Insight)
- Classification: for each individual in a population, predict which of a (small) set of classes this individual belongs to.
- Regression: attempts to estimate or predict, for each individual, the numerical value of some variable for that individual.
- Link Prediction: attempts to predict connections between data items (e.g. social media: who wants to be friends with whom?).
Prescriptive Analytics – How can we make it happen? (Foresight)
- Uplift Modeling: predict how individuals behave contingent on the action performed upon them.
- Automation: determine the optimal action based on the predicted reaction of individuals.
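To make the descriptive and predictive task types concrete, a small illustrative Python sketch contrasting a descriptive task (clustering, no target variable) with a predictive task (classification, known labels), using scikit-learn's bundled iris data purely as a stand-in:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Descriptive: clustering groups similar individuals; it never sees y
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# Predictive: classification learns from known labels y to predict classes
classifier = LogisticRegression(max_iter=1000).fit(X, y)
print(clusters[:5], classifier.predict(X[:5]))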
Data is Mostly Unstructured
Traditional systems have been designed for transactional, not unstructured data.
Google example:
[Google File System] Redundant storage of massive amounts of data on cheap and unreliable computers; files are spread across multiple machines. The system is internal to Google itself, built to let it crawl and store web pages.
[MapReduce] A distributed computing paradigm: it aggregates the results from the individual computers to get an overview of all the results. This technology is also proprietary to (owned exclusively by) Google.
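A toy single-machine imitation of the MapReduce idea, using the canonical word-count example (in real MapReduce the map and reduce phases run distributed over many computers; this Python sketch only shows the logic):

from itertools import groupby

documents = ["big data big tools", "big data needs many computers"]

# Map phase: emit a (word, 1) pair for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: bring all pairs with the same key (word) together
mapped.sort(key=lambda pair: pair[0])

# Reduce phase: sum the counts per word to get the overview
counts = {word: sum(count for _, count in group)
          for word, group in groupby(mapped, key=lambda pair: pair[0])}
print(counts)  # {'big': 3, 'computers': 1, 'data': 2, ...}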
The term big data was first used by NASA researchers Michael Cox & David Ellsworth in 1997: “Data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data.” The term gained popularity in the past decade.
, 2. Data Collection & Preparation
2.1 Data for a Data Mining Project
Example: MegaTelCo wants to predict churn.
- MegaTelCo is a large telecommunication firm in the United States
o They are having a major problem with customer retention in their wireless business.
o Their goal is to identify customers that are going to churn (stop being clients) so that they can try to retain them.
- Which data would be useful to predict churn?
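Whatever sources we pick, they must end up as one row per customer with a churn label. A hedged sketch of what such a feature table might look like (all column names and values are hypothetical examples, not MegaTelCo's actual data):

import pandas as pd

customers = pd.DataFrame({
    "tenure_months": [2, 48, 13, 60],    # how long the customer has stayed
    "monthly_bill":  [80, 35, 55, 40],   # spending level
    "dropped_calls": [12, 1, 6, 0],      # service-quality proxy
    "support_calls": [5, 0, 2, 1],       # dissatisfaction proxy
    "churned":       [1, 0, 1, 0],       # target: did the customer leave?
})
print(customers.corr()["churned"])  # which features relate to churn?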
Overview of Technologies on a Typical Business Intelligence (BI) Stack
1. Data sources: Excel sheets from customers, CSV files from external systems, hardware, etc. Once this data is within the boundaries of the firm, it has to be processed.
2. Data movement & streaming engines: transform data coming from the previous sources into formats that are more manageable later on, using extract-transform-load (ETL) technology and complex event processing engines (which aggregate high-frequency events into more manageable ones).
3. Data warehouse servers: the transformed data is stored in relational database management systems (DBMS) and retrieved via SQL, or handled by a MapReduce engine to tackle big data.
4. Mid-tier servers: once the data is stored, it needs to be further manipulated so it may be more easily consumed; these servers support search queries, spreadsheets, and automated data analytics systems.
5. Front-end applications: search functions, displaying information on spreadsheets, dashboards and reports, and performing specific ad hoc queries (to get more knowledge).
Source: Chaudhuri, Dayal, Narasayya. An Overview of Business Intelligence Technology. Communications of the ACM (2011)
Q: From which of the above systems should we collect data for a Data Mining project?
- Almost any data source can be used in the context of a Data Mining project
- Data Mining is an exploratory process with uncertain outcomes
- Being able to collect data from different systems allows for fast prototyping and adjustments as needed
o Note: A proper engineering solution should be deployed once the prototype demonstrates its merits
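For such prototyping, a language like Python makes it cheap to pull data from heterogeneous systems into one table. A sketch combining a flat-file export with a relational source (the file name, table name and join key are made up for illustration):

import sqlite3
import pandas as pd

external = pd.read_csv("external_system_export.csv")   # flat-file source

with sqlite3.connect("warehouse.db") as conn:          # DBMS source via SQL
    internal = pd.read_sql("SELECT * FROM customers", conn)

combined = internal.merge(external, on="customer_id")  # one prototype table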
Example: Purchase behaviour in a VoD system.
- Context: Telecom operator, providing TV, internet, fixed and mobile phone. 1.5M households;
o Video-on-demand (VoD) service: users pay between 1.99 and 5.99 per movie.
- Goals: Learn how consumers react to a change in prices of movies available for renting;
o Understand how consumers influence each other in this environment.
Data sources used:
o Movie catalogue (in advance)
o Billing information (VoD purchases)
o CRM system (profile & demographics)
o Call Detail Records (social network analysis)
o IMDB Votes & Ranking (movie quality)
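Before modelling, these sources would have to be joined into a single analysis table. A minimal sketch of that preparation step (all identifiers and values are hypothetical):

import pandas as pd

purchases = pd.DataFrame({"household_id": [1, 2], "movie_id": [10, 11],
                          "price": [1.99, 5.99]})        # billing (VoD purchases)
catalogue = pd.DataFrame({"movie_id": [10, 11],
                          "imdb_rating": [7.2, 6.1]})    # catalogue + IMDB quality
crm = pd.DataFrame({"household_id": [1, 2],
                    "region": ["north", "south"]})       # CRM demographics

analysis = purchases.merge(catalogue, on="movie_id").merge(crm, on="household_id")
print(analysis)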