Learning goals:
Data science fundamentals
Data science capability as strategic asset
The data mining process in business
Supervised versus unsupervised methods in data mining
Linking data science to the business world
Data-driven decision-making (DDD)
- Refers to the practice of basing decisions on the analysis of data, rather than purely
on intuition.
Data science
- Involves principles, processes, and techniques for understanding phenomena via the
(automated) analysis of data
The sort of decisions of interest
- Decisions that need discovery (non-obvious)
- Repetitive decisions
Big data
- Large data sets with 3 distinct characteristics (3 V’s)
1. Volume: The quantity of generated and stored data
2. Variety: The type and nature of the data
3. Velocity: The speed at which the data is generated
and processed
Data mining
- The extraction of knowledge from data, via
technologies that incorporate these principles
Data analytics
- The process of examining datasets in order to draw conclusions about the useful
information they may contain
Types of data analysis
1. Descriptive Analytics: What has happened?
a. Simple descriptive statistics, dashboard, charts and diagrams
2. Predictive Analytics: What could happen?
a. Segmentation, regression (Stata)
3. Prescriptive Analytics: What should we do?
a. Complex models for product planning and stock optimization (Weka)
Data and the capability to extract useful knowledge from data can be a strategic asset
Strategic Analytics – Jeroen Bodaan
From business problems to data mining tasks
- Decomposing a business problem into (solvable) subtasks
- Matching the subtasks with known tasks for which tools are available
- Solving the remaining non-matched subtasks (by creativity)
- Putting the subtasks together to solve the overall problem
Supervised learning: There is a specific target variable
Unsupervised learning: There is no specific target variable
Supervised learning
- The training data has one feature that holds the “outcome” (the target variable)
- The goal is to build a model to predict the outcome (Machine learning to predict)
- The outcome has a known value, so the model can be evaluated
o Split the data into a training and test set
o Model the training set / predict the test set
o Compare the predictions to the known values
- Algorithms:
o Model/ensemble
o Logistic regression
o Time series
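The evaluation steps listed under supervised learning (split, model, compare) can be sketched in a few lines of Python. This is only an illustration under assumptions: the loan records are made up, the "model" is a simple one-feature threshold rule rather than any algorithm named in the course, and `split_train_test`, `fit_threshold` and `accuracy` are hypothetical helper names.

```python
import random

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into a training and a test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def fit_threshold(train):
    """Pick the debt-ratio cutoff that classifies the training set best."""
    best_cutoff, best_correct = None, -1
    for cutoff in sorted({x for x, _ in train}):
        # Predict "default" when the debt ratio is at or above the cutoff.
        correct = sum((x >= cutoff) == y for x, y in train)
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff

def accuracy(cutoff, rows):
    """Share of rows whose predicted outcome equals the known outcome."""
    return sum((x >= cutoff) == y for x, y in rows) / len(rows)

# Made-up data: (debt ratio, defaulted?), purely illustrative.
loans = [(0.10, False), (0.15, False), (0.20, False), (0.25, False),
         (0.30, False), (0.35, False), (0.60, True), (0.65, True),
         (0.70, True), (0.75, True), (0.80, True), (0.90, True)]

train, test = split_train_test(loans)      # 1. split
cutoff = fit_threshold(train)              # 2. model the training set
test_accuracy = accuracy(cutoff, test)     # 3. compare predictions to known values
print(f"cutoff={cutoff}, test accuracy={test_accuracy:.2f}")
```

The key point is that the test rows play no part in choosing the cutoff, so the test accuracy is an honest estimate of how the rule behaves on unseen cases.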
Unsupervised learning
- The training data provides “examples” but no specific “outcome”
- The machine tries to find specific patterns in the data
- Because there is no “outcome”, the model cannot be evaluated against known values
- Algorithms:
o Clusters
o Anomaly detection
o Association discovery
o Topic modeling
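One of the listed techniques, anomaly detection, can be illustrated without any target variable. A minimal sketch, assuming made-up transaction amounts and a simple z-score rule with an arbitrary cutoff of 2 standard deviations; the helper `find_anomalies` is hypothetical, and real anomaly detectors are usually more elaborate.

```python
import statistics

def find_anomalies(values, z_cutoff=2.0):
    """Return the values lying more than z_cutoff standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / stdev > z_cutoff]

# Made-up transaction amounts; no "outcome" column is needed.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]
unusual = find_anomalies(amounts)
print(unusual)
```

Note that nothing in the data says which transactions are "really" unusual, which is why the result cannot be scored against known values the way a supervised model can.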
Supervised learning: example questions               | Training data
How much is this home worth?                         | Previous home sales
Will this customer default on a loan?                | Previous loans that were paid or defaulted
How many customers will apply for a loan next month? | Previous months of loan applications

Unsupervised learning: example questions             | Training data
Are these customers similar?                         | Customer profiles
Is this transaction unusual?                         | Previous transactions
Are the products purchased together?                 | Examples of previous purchases
The data mining process:
Business understanding
- This stage represents a part of the craft where the analysts’ creativity plays a large
role
o The design team should think carefully about the use scenario
This itself is one of the most important concepts of data science
- Business projects seldom come pre-packaged as clear and unambiguous data mining
problems
Data understanding
- Important to understand strengths and limitations of the data
- A critical part is estimating the costs and benefits of each data source and deciding whether
further investment is merited
Data preparation
- Data is manipulated and converted to forms that yield better results
- Quality of the data mining solution rests on how well the analysts structure the
problems and craft the variables
- Beware of ‘leaks’
- Leak: A situation where a variable collected in historical data gives information on
the target variable – information that appears in historical data but is not actually
available when the decision has to be made.
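A leak can often be caught by checking, for every candidate variable, whether its value is actually known at the moment the decision is made. A minimal sketch of that check, assuming a hypothetical loan-default setting; the column names and the hand-maintained `KNOWN_AT` table are illustrative assumptions, not part of the source material.

```python
# Hypothetical loan-default example: the decision is made at application time.
# For each candidate variable we record when its value becomes known.
KNOWN_AT = {
    "income": "application",
    "loan_amount": "application",
    "months_in_arrears": "after_decision",       # only observed once the loan is running
    "collection_agency_used": "after_decision",  # a consequence of the default itself
}

def usable_features(known_at, decision_point="application"):
    """Keep only variables whose value is available when the decision is made."""
    return [name for name, stage in known_at.items() if stage == decision_point]

def leaky_features(known_at, decision_point="application"):
    """Variables that appear in historical data but are unknown at decision time."""
    return [name for name, stage in known_at.items() if stage != decision_point]

features = usable_features(KNOWN_AT)
leaks = leaky_features(KNOWN_AT)
print("use:", features)
print("drop:", leaks)
```

A model trained with the leaky columns would look excellent on historical data and then fail in use, because those columns are empty when a new application arrives.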
Modeling
- Primary place where data mining techniques are applied to the data
Evaluation
- To assess the data mining results rigorously and to gain confidence that they are
valid and reliable
- Serves to help ensure that the model satisfies the original business goal
- Includes both quantitative and qualitative assessment
- Comprehensibility of the model to stakeholders
- Usually a data mining solution is only a piece of the larger solution and it needs to be
evaluated as such
Deployment
- Two main reasons for deploying the data mining system itself rather than the models
produced by the data mining system
o 1. The world changes faster than data scientists can adapt
o 2. A business has too many modelling tasks for its data science team to
manually curate each model individually
- Deploying a model into the business systems requires the
model to be coded
Implications for managing a data science team
- It is a mistake to view the data mining process as a software development
cycle
- Instead, analytics projects should prepare to invest in
information to reduce uncertainty in various ways
Week 2: chapters 3 & 4
Learning goals:
Concepts
Models, Induction, Deduction
Supervised Segmentation
Classification Trees
Entropy & Information Gain
Parametric Models
Linear discriminant function
Logistic regression
Support vector machine
Terminology
Synonyms for ‘dataset’ | Synonyms for ‘entity’
Sample                 | Object
Population             | Instance
Data                   | Observation
Set                    | Element
Work set               | Line
                       | Row
                       | Feature vector
Synonyms for ‘attribute’:
- Feature, characteristic, variable, column
Model: a simplified representation of reality created to serve a purpose
- Abstracts away irrelevant details
Models serve different purposes in data science:
- Unsupervised setting: to identify classes, groups, patterns, etc. (descriptive)
- Supervised setting: to predict, i.e. “to estimate an unknown value” (predictive)
Induction: “Generalizing from specific cases to general rules” (i.e. developing classification
and regression models)
Deduction: “Applying general rules and specific facts to create other specific facts” (i.e.
using classification and regression models)
Complications with supervised segmentation:
- Attributes rarely split a group perfectly
- Hard to tell whether a split produces the right subsets
- Not all attributes are binary; many have three or more distinctive values
- Some attributes take on numeric values (continuous or integer)
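The learning goals list entropy and information gain, the usual way to score how well an attribute splits a group that cannot be separated perfectly. A worked sketch using the standard definitions, entropy = -sum(p_i * log2(p_i)) over the class proportions p_i, and information gain = parent entropy minus the size-weighted entropy of the child groups. The tiny customer data set and its attribute names are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    parent = entropy([r[target] for r in rows])
    children = {}
    for r in rows:
        # Works for attributes with two, three or more distinct values.
        children.setdefault(r[attribute], []).append(r[target])
    weighted = sum(len(c) / len(rows) * entropy(c) for c in children.values())
    return parent - weighted

# Made-up customers; "churn" is the target variable.
customers = [
    {"contract": "monthly", "region": "north", "churn": "yes"},
    {"contract": "monthly", "region": "south", "churn": "yes"},
    {"contract": "monthly", "region": "north", "churn": "yes"},
    {"contract": "monthly", "region": "south", "churn": "no"},
    {"contract": "yearly",  "region": "north", "churn": "no"},
    {"contract": "yearly",  "region": "south", "churn": "no"},
    {"contract": "yearly",  "region": "north", "churn": "no"},
    {"contract": "yearly",  "region": "south", "churn": "yes"},
]

for attribute in ("contract", "region"):
    print(attribute, round(information_gain(customers, attribute, "churn"), 3))
```

In this toy data neither attribute splits the group perfectly, but "contract" yields purer subsets than "region" (which leaves both subsets at a 50/50 mix), so it has the higher information gain and would be the preferred split.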