Kathleen Gaillot (kathleenou)
RSM, Erasmus University
MSc BIM 2021 - 2022




BM04BIM • 5 EC

Big Data Management & Analytics
Summary Notes/imagery based on lectures




Based on: Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking (2013) by Provost, F. and Fawcett, T., ISBN: 9781449361327

Table of Contents
1. Introduction & Data-Analytics Thinking .......... 1
1.1 Data Science, Engineering & Data-Driven Decision Making .......... 1
1.1.1 The Data Mining Process - CRISP-DM .......... 1
1.2 Data as a Strategic Asset .......... 2
1.2.1 Different Types of Analytics .......... 2
2. Data Collection & Preparation .......... 4
2.1 Data for a Data Mining Project .......... 4
2.2 Data from Online Sources: Webscraping .......... 5
2.3 Data from Online Sources: API .......... 6
2.4 Types and Properties of Data .......... 7
2.4.1 Volume & Velocity: Data Structure .......... 7
2.4.2 Variety: Types of Data .......... 8
2.4.3 Veracity: Missing Data .......... 9
3. Predictive Modeling .......... 10
3.1 Introduction to Predictive Modeling .......... 10
3.1.1 Data Science Tasks: Two Groups .......... 10
3.1.2 Predictive vs. Diagnostic: Causal/Explanatory Modeling .......... 10
3.1.3 Predictive Modeling .......... 11
3.2 Regression via Mathematical Functions: Linear Regression .......... 12
3.3 Classification via Mathematical Functions: Logistic Regression .......... 12
3.3.1 Logistic Regression .......... 13
3.4 Classification via Rule-based Formulas: Tree Induction .......... 13
3.4.1 Decision Trees .......... 14
3.5 Information & Entropy .......... 15
3.5.1 Entropy .......... 16
3.6 Model Evaluation .......... 18
3.6.1 The Confusion Matrix .......... 18
4. Model Fit, Overfit & Evaluation .......... 19
4.1 Model Performance & Overfit .......... 19
4.1.1 Generalization .......... 19
4.1.2 Overfitting .......... 19
4.1.3 Overfitting in Tree Induction .......... 20
4.1.3 Avoiding Overfitting for Tree Induction .......... 21
4.1.4 Overfitting in Mathematical Functions .......... 21
4.2 Cross-Validation .......... 23
4.3 Visualizing Model Performance .......... 23
4.3.1 Variable Importance .......... 23
4.4 ROC Curve & Area Under the Curve .......... 25
4.4.1 The ROC Space .......... 25
4.4.2 Area Under the ROC Curve (AUC) .......... 25
4.5 Expected Value Framework .......... 25
4.5.1 Comparing Models .......... 25
4.5.2 Benefit/Cost Matrix .......... 26
4.6 Variance-Bias Trade-off & Ensembles .......... 27
4.6.1 Underfitting & Overfitting .......... 27
4.6.2 Ensemble Methods .......... 27
5. Big Data Architecture, Engineering & Tools .......... 29
5.1 The Need for Big Data Technologies .......... 29
5.1.1 Client / Server Model .......... 29
5.1.2 Scaling Problems .......... 30
5.1.3 Fault-Tolerance Issues & Data Corruption .......... 30
5.1.4 Desired Properties of a Big Data System .......... 31
5.2 Lambda Architecture .......... 31
5.2.1 Data Representation in Lambda Architecture .......... 31
5.3 Examples of Big Data Tools .......... 33
5.3.1 Distributed File Systems .......... 33
5.3.2 Hadoop MapReduce .......... 34
5.3.3 Big Data Infrastructure as a Service .......... 35
5.3.4 Machine Learning Tools .......... 35
6. Additional Data Science Topics .......... 36
6.1 Introduction to Clustering, Distance Measures .......... 36
6.1.1 Clustering: What is it? .......... 36
6.1.2 Distance Measures: Euclidean Distance .......... 37
6.1.3 Distance Measures: Manhattan Distance .......... 37
6.1.4 Jaccard Distance .......... 37
6.1.5 Cosine Distance .......... 38
6.1.6 Edit Distance: Levenshtein Metric .......... 38
6.2 Clustering Methods .......... 38
6.2.1 Prototype-Based: K-Means Clustering .......... 38
6.2.2 Hierarchical Clustering .......... 40
6.3 Text Mining: How to Represent Text Numerically .......... 41
6.3.1 What is Text Mining .......... 41
6.3.2 Why Text Mining .......... 41
6.3.3 How to Represent Text Numerically .......... 41
6.3.4 Determining Features of a Document .......... 41
6.3.5 Assigning Values to Every Feature: Basic Measurements .......... 42
6.3.6 Assigning Values to Every Feature: TF-IDF .......... 43
6.3.7 How to Represent Text Numerically: Summary .......... 43
6.4 Text Mining: Cleaning and Pre-Processing Text .......... 43
6.4.1 Case Normalization .......... 43
6.4.2 Removing Punctuation .......... 44
6.4.3 Removing Numbers .......... 44
6.4.4 Removing Stopwords .......... 44
6.4.5 Word Stemming & Stem Completion .......... 45
6.4.6 Summary .......... 45
6.4.7 Visualizing Customer Reviews .......... 46
7. Advanced Topics .......... 48
7.1 Fairness & Machine Learning .......... 48
7.1.1 Sensitive Characteristics .......... 48
7.1.3 Independence (R ⊥ A) .......... 49
7.1.4 Separation (R ⊥ A | Y) .......... 49
7.1.5 Sufficiency (Y ⊥ A | R) .......... 49
7.2 Neural Networks & Deep Learning .......... 50
7.2.1 Deep Learning .......... 50
7.3 Ensemble Methods .......... 52
7.4 Winning Data Science Competitions .......... 53
7.5 Revisiting Predictive Modeling .......... 54
7.5.1 Problems with Predictive Modeling .......... 54
7.5.2 Solution: Experimental Data .......... 54
7.6 Uplift Modeling .......... 55
7.6.1 How to Build an Uplift Model .......... 55
8. Examination Information .......... 58
8.1 Course Recap & Learning Objectives .......... 58
8.2 Practice .......... 59


Learning Objectives
 Describe the main steps of the Cross-Industry Standard Process for Data Mining (CRISP-DM);
 Distinguish among the four different Data Science methods covered in class;
 Explain the main concepts and algorithms associated with predictive modeling;
 Given a specific Big Data problem, apply the adequate Data Science method to solve it, and evaluate its performance (both practical and statistical);
 Using the frameworks discussed in class, recognize the ethical dilemmas in collecting and analyzing Big Data (Fairness);
 Evaluate the ethical position of a firm in a specific data collection situation;
 Apply the Big Data tools architectural principles to formulate a solution for real-world scenarios.




1. Introduction & Data-Analytics Thinking
1.1 Data Science, Engineering & Data-Driven Decision Making
Two major components support data-driven decision-making:
1. Data science: set of processes & methods that allow us to reach conclusions from data
in an automated fashion [most of the course focuses on this aspect]
2. Data engineering & Processing: set of technologies and processes that support the data
science component (includes the big data technologies along other relevant communication
and hardware infrastructures)

AI: a set of paradigms that try to create intelligent behavior using computers.
Machine learning tends to be the most successful AI approach in the world today: gather lots of data, apply statistical methods, and predict the future. Other methods exist; in the late '80s the field attempted reasoning algorithms that create new information, a less popular and less successful AI approach.
Computation: e.g. Google indexes the internet – it knows which term is on which page. Essentially
‘processing’ the data = computation.

1.1.1 The Data Mining Process - CRISP-DM
CRoss Industry Standard Process for Data Mining (CRISP-DM)
Data Science in Steps: An Iterative Process
1. Business understanding: Crucial to understand the problem to be solved (not trivial – what
exactly is the problem? Shouldn’t be too broad, else not realistic. Define the scope as
clearly as possible.)
2. Data understanding: What data is available? What are its strengths and
limitations? E.g. historical data is typically very useful, but in some situations it may not
be appropriate (it may have been collected for a different purpose and may miss critical
information or be unreliable). As we learn more about the available data and its
potential issues, we may want to go back and rethink, and refine, the problem we want to
solve.
3. Data preparation: it is unlikely that the data we have can be used directly in
our model. There are usually restrictions on the format of the data that we use
as inputs. The data may need to be aggregated, its format manipulated,
its values normalized, etc. A lot of time is usually spent on this step.
4. Modelling: here we create the models that will help us answer the question
defined in the business understanding stage. This is where the different data mining
techniques are applied. Different techniques require different data formats, which is why you may need to go back to the data
preparation step.
5. Evaluation: the purpose of this stage is to assess the outcome of the modelling stage and determine whether the resulting
models are useful for solving the problem. If the outcomes are not good, we may need to go back to the first (business understanding)
stage and reassess whether we're trying to solve the problem in the best way. If the results are good, we move on to the last stage.
6. Deployment: the results of the data mining output are put into real use in production. Once deployed, the model starts
generating new information, which can be used to further refine the business understanding. This could then prompt a new
iteration of the whole process.
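The data-preparation step (3) can be sketched with a minimal, self-contained example. The records, field names, and the choice of min-max normalization here are purely illustrative, not from the course material:

```python
# Hypothetical raw call records are aggregated per customer, then the
# aggregated values are min-max normalized onto [0, 1] so features end
# up on a comparable scale before modelling.

raw_calls = [
    {"customer": "A", "minutes": 120},
    {"customer": "A", "minutes": 80},
    {"customer": "B", "minutes": 300},
]

# Aggregate: total minutes per customer
totals = {}
for rec in raw_calls:
    totals[rec["customer"]] = totals.get(rec["customer"], 0) + rec["minutes"]

# Normalize to [0, 1] (assumes the values are not all identical)
lo, hi = min(totals.values()), max(totals.values())
normalized = {c: (v - lo) / (hi - lo) for c, v in totals.items()}
print(normalized)  # {'A': 0.0, 'B': 1.0}
```

In practice this kind of reshaping is iterated on as the modelling stage reveals what format a given technique actually needs.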


1.2 Data as a Strategic Asset
1.2.1 Different Types of Analytics


[Figure: the types of analytics ordered by difficulty to implement vs. the (potential) value you get in return]




Correlation vs. causation: correlations are the main drivers of these models; "what will happen?"
can be asked once you know the items are correlated. Causation adds the causal dimension: you
have to know why things happen, i.e. what causes what.
Individual vs. aggregate: individual-level questions differ per individual (has/will customer x
bought/buy product y?), while aggregate questions concern the population as a whole (e.g. how
many items were sold? how many people buy product x? averages).

Different Types of Analytics Have Different Roles for Humans




From left to right: an increasing value and difficulty. From information to optimization.

Descriptive/Diagnostic Analytics – "What happened? / Why did it happen?" (Hindsight)
 Data Visualisation: display and summarize data in ways that convey crucial information about a large data set.
 Clustering: groups individuals in a population together by their similarity, but not driven by any specific purpose.
 Co-occurrence Grouping: attempts to find associations between entities based on transactions involving them.

Predictive Analytics – "What will happen?" (Insight)
 Classification: for each individual in a population, predict which of a (small) set of classes this individual belongs to.
 Regression: attempts to estimate or predict, for each individual, the numerical value of some variable for that individual.
 Link Prediction: attempts to predict connections between data items (e.g. social media: who wants to be friends with whom).

Prescriptive Analytics – "How can we make it happen?" (Foresight)
 Uplift Modeling: predict how individuals behave contingent on the action performed upon them.
 Automation: determine the optimal action based on the predicted reaction of individuals.



Data is Mostly Unstructured
Traditional systems have been
designed for transactional, not
unstructured data.

Google example:
[Google File System] redundant storage of massive amounts of data on cheap and unreliable
computers; data can be stored across multiple computers. The system is internal to
Google itself, used to store the pages it crawls.
[MapReduce] a distributed computing paradigm: it aggregates all the results from the computers to
get an overview of all the results. This technology is also proprietary to (owned exclusively by)
Google.
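The MapReduce paradigm described above can be sketched on a single machine. This word-count toy is only an illustration of the idea; a real MapReduce system distributes these phases across many machines:

```python
# Toy MapReduce: map emits (word, 1) pairs, shuffle groups them by key,
# reduce sums the counts per key.
from collections import defaultdict

documents = ["big data tools", "big data big value"]

# Map: emit (key, value) pairs from each input record
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: aggregate the values for each key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'big': 3, 'data': 2, 'tools': 1, 'value': 1}
```

The appeal of the paradigm is that map and reduce are independent per key, so both phases parallelize naturally across cheap, unreliable machines.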

The term big data was first used by NASA
researchers Michael Cox & David Ellsworth in 1997:
"Data sets are generally quite large, taxing the
capacities of main memory, local disk, and even
remote disk. We call this the problem of big data."
The term has gained in popularity in the past decade.





2. Data Collection & Preparation
2.1 Data for a Data Mining Project
Example: MegaTelCo wants to predict Churn.
- MegaTelCo is a large telecommunication firm in the United States
o They are having a major problem with customer retention in their wireless business.
o Their goal is to identify customers that are going to churn (stop being clients) so that they can try to retain them.
- Which data would be useful to predict churn?

Overview of Technologies on a Typical Business Intelligence (BI) Stack

1. Data sources: once this data is within the boundaries of the firm, it still has to be processed, e.g. Excel sheets from customers, CSV files from external systems, hardware…
2. Data movement / streaming engines: transform data coming from the previous sources into formats that are more manageable later on, using extract-transform-load (ETL) technology and complex event processing engines (which aggregate high-frequency events into more manageable ones).
3. Data warehouse / servers: the transformed data is stored in relational database management systems (DBMS) and retrieved via SQL, or via a MapReduce engine to tackle big data.
4. Mid-tier servers: once the data is stored, it needs to be further manipulated so it may be more easily consumed; it can then be used to support search queries, spreadsheets, and automated data analytics systems.
5. Front-end applications: search functions, displaying information on spreadsheets, dashboards, and reports, and performing specific ad hoc queries (to get more knowledge).

Source: Chaudhuri, Dayal, Narasayya. An Overview of Business Intelligence Technology. Communications of the ACM (2011)

Q: From which of the above systems should we collect data for a Data Mining project?
Almost any data source can be used in the context of a Data Mining project
 Data Mining is an exploratory process with uncertain outcomes
 Being able to collect data from different systems allows for fast prototyping and adjustments as needed
o Note: A proper engineering solution should be deployed once the prototype demonstrates its merits

Example: Purchase behaviour in a VoD system.
- Context: Telecom operator, providing TV, internet, fixed and mobile phone. 1.5M households;
o Video-on-demand (VoD) service: users pay between 1.99 and 5.99 per movie.
- Goals: Learn how consumers react to a change in prices of movies available for renting;
o Understand how consumers influence each other in this environment.
 Data sources used:
o Movie catalogue (in advance)
o Billing information (VoD purchases)
o CRM system (profile & demographics)
o Call Detail Records (social network analysis)
o IMDB Votes & Ranking (movie quality)
