The PowerPoint slides are put together here in a clear way. This document contains the learning material for the theory. It is handy to go through it alongside the Ufora document “learning material to know” and take what is useful from it.
Each data-driven business problem is unique. But there are sets of common tasks that underlie the
business problems.
E.g. Churn @ MegaTelco – Unique; Identifying which customers are more likely to terminate their
contracts – standard probability estimation problem.
Critical skill – Decompose problems into pieces such that each piece matches a known task.
1) Classification & class probability estimation
Attempt to predict, for each individual in a population, which class the individual belongs to.
Purpose
• Classification: which group does this individual belong to?
• Class probability estimation: how likely is it that this individual belongs to group X or Y?
2) Regression
Attempt to estimate or predict, for each individual, the numerical value of some variable.
Classification vs. regression
Classification: will something happen?
Regression: to what degree will something happen?
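The three kinds of answer (a class, a class probability, a numerical value) can be sketched on one small dataset. This is only an illustration: the customer records are made up, and the nearest-neighbour rule is an assumption chosen for brevity, not a method named in these notes.

```python
from collections import Counter

# Hypothetical historical customers: (months_with_company, churned, monthly_spend)
history = [
    (2, True, 20.0),
    (3, True, 25.0),
    (5, False, 40.0),
    (24, False, 55.0),
    (30, False, 60.0),
    (36, False, 65.0),
]

def nearest(months, k=3):
    """Return the k historical customers closest in tenure."""
    return sorted(history, key=lambda row: abs(row[0] - months))[:k]

def classify(months):
    """Classification: will this customer churn? (majority vote)"""
    votes = Counter(row[1] for row in nearest(months))
    return votes.most_common(1)[0][0]

def churn_probability(months):
    """Class probability estimation: share of churners among the neighbours."""
    rows = nearest(months)
    return sum(row[1] for row in rows) / len(rows)

def estimate_spend(months):
    """Regression: average numerical target of the neighbours."""
    rows = nearest(months)
    return sum(row[2] for row in rows) / len(rows)
```

For a customer with 4 months of tenure, `classify(4)` gives a yes/no answer, `churn_probability(4)` gives a score between 0 and 1, and `estimate_spend(4)` gives a number.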
3) Similarity matching
Look at two objects and the degree to which they are alike; identify similar objects
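One possible way to express "how alike are two objects" is a distance between their feature values. The sketch below assumes Euclidean distance and made-up customers with two numeric features each; neither comes from the slides.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical customers described by two numeric features
customers = {"A": (1.0, 2.0), "B": (4.0, 6.0), "C": (1.5, 2.5)}

def most_similar(name):
    """Return the other customer closest to the given one."""
    target = customers[name]
    others = (n for n in customers if n != name)
    return min(others, key=lambda n: euclidean(target, customers[n]))
```

Here `most_similar("A")` picks the customer whose features lie closest to those of A.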
4) Clustering
Group by similarity, without a specific purpose
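As an illustration of grouping without a target variable, here is a very small k-means sketch on one-dimensional values. K-means is only one of many clustering techniques and is not named in these notes; the data and the starting centroids are assumptions.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Alternate between assigning values to the nearest centroid
    and moving each centroid to the mean of its cluster."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            idx = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        # An empty cluster keeps its previous centroid
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical values (e.g. yearly purchases per customer) and starting centroids
centroids, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], [0.0, 5.0])
```

No labels are supplied: the two groups emerge only from how close the values are to each other.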
5) Co-occurrence grouping
Find associations between items, based on transactions involving them.
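A minimal sketch of the idea, assuming made-up shopping baskets: count how often each pair of items appears in the same transaction and look at the most frequent pair.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (one set of items per basket)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

pair_counts = Counter()
for basket in transactions:
    # Sort so that each pair is always stored in the same order
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
```

Pairs that occur together far more often than the others are candidates for an association such as "customers who buy X also buy Y".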
6) Profiling
Aims to gain better insight into the profile of customers
7) Link prediction
Try to predict a link between two people (Facebook: suggested friends)
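The friend-suggestion idea can be sketched with a simple "common neighbours" score: two people who are not yet linked but share many friends are likely candidates for a link. The network below is made up, and this scoring rule is an assumption, not necessarily what any real platform uses.

```python
# Hypothetical friendship network (each link is stored in both directions)
friends = {
    "Ann": {"Bob", "Cara"},
    "Bob": {"Ann", "Cara", "Dan"},
    "Cara": {"Ann", "Bob", "Dan"},
    "Dan": {"Bob", "Cara", "Eve"},
    "Eve": {"Dan"},
}

def suggest_friend(person):
    """Suggest the non-friend who shares the most friends with `person`."""
    candidates = [
        p for p in friends
        if p != person and p not in friends[person]
    ]
    return max(candidates, key=lambda p: len(friends[person] & friends[p]))
```

For Ann, the candidates are Dan and Eve; Dan shares two friends with her and Eve none, so Dan is suggested.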
8) Data reduction
Attempt to replace a large set of data with a smaller set of data that contains as much of the
information as possible
9) Causal modelling
Attempts to help us understand what events or actions influence others
Expensive techniques:
• Investment in data
• Randomized controlled experiments
Counterfactual analysis
Two high-level primary goals
Prediction
• Using some variables to predict unknown or future values of other variables
Description
• Using some variables to find human-interpretable patterns describing the data
• Often used to work toward a causal understanding of the data
“Do our customers naturally fall into different groups ?”
> Has no target variable (unsupervised)
“Can we find groups of customers who have a particularly high likelihood of…
(defaulting/denying) ?”
> Has target variable (supervised)
Supervised
• Has a target variable
• Requires target data
• More meaningful results
• E.g. cat or not a cat
Unsupervised
• Has no target variable
• No guarantee that results are meaningful or useful
• E.g. cat, dog, chicken, …
Classification vs. regression
Will something happen?
(Target is categorical variable, classification)
To what degree will something happen?
(Target is numerical variable, regression)
THE DATA MINING PROCESS
An important difference
1. Building the model: done on historical data; you create a model (e.g. what you do in
Weka) by means of classifiers (which you then translate into a model)
2. Using the model: applying the translated classifier in a company. For example,
software can use new data and the built model to make decisions that
management/marketing can work with.
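The two steps can be sketched as two separate functions: one that learns from historical data, and one that applies the result to new cases. The data and the simple threshold rule are assumptions for illustration only; a tool such as Weka would fit far richer models.

```python
def build_model(history):
    """Step 1: learn from historical (months, churned) records the tenure
    threshold that classifies the most records correctly."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({months for months, _ in history}):
        correct = sum(
            (months <= threshold) == churned for months, churned in history
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def use_model(threshold, months):
    """Step 2: apply the learned rule to a new customer."""
    return months <= threshold

# Hypothetical historical data: (months_with_company, churned)
history = [(2, True), (3, True), (5, False), (24, False), (30, False)]
model = build_model(history)
```

Once built, `use_model(model, 1)` or `use_model(model, 12)` can be called on customers that were never part of the historical data.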
Knowledge discovery in databases
‘Two’ biggest players:
CRISP-DM
Cross Industry Standard Process for Data Mining
SEMMA
Sample, Explore, Modify, Model and Assess
CRISP-DM
Iteration is the rule
Process is an exploration of data
Business understanding:
• Craft: importance of the analyst's creativity
• Toolset of techniques
• Think about use-scenario
Data understanding:
• Material from which solution will be
constructed
• Strengths and limitations
• Availability and cost of data
• Think about: fraud detection
Data preparation:
• Techniques impose certain requirements on data
• Think about: missing values, conversions, symbolic or categorical data, numerical values,
normalization of values
Think about: leaks (= a variable collected in the historical data that “gives information” about
the target variable but is not actually available when the decisions are made!)
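Two of the preparation steps mentioned above, handling missing values and normalizing numerical values, can be sketched as follows. Mean imputation and min-max scaling are common choices but are assumptions here; the notes do not prescribe a specific method, and the ages are made up.

```python
def impute_mean(values):
    """Replace missing values (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max(values):
    """Rescale values to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical customer ages with one missing value
ages = [20, None, 40, 60]
prepared = min_max(impute_mean(ages))
```

The order matters in this sketch: missing values are filled in first, because the scaling step needs a complete column.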
Modeling:
• Primary place to apply data mining techniques
Evaluation:
• Assess data mining results
• Test model
• Satisfy business goals?
• Sign-off by stakeholders > comprehensibility
Deployment:
• May be a model