ALL Lecture notes of P. Snoeren INCLUDING:
- Extra slides that he didn't include on canvas
- Notes of what he said during the lectures
- Exam info which he told us at the last lecture: which topics get how many questions
2 phenomena why data science is important
1. The possibility of data collection in every aspect of business
2. There is huge technological development
Big data = very large data set with 3 distinct characteristics
1. Volume = quantity of generated & stored data
2. Variety = type & nature of the data
3. Velocity = speed at which the data is generated & processed
You can own and recombine the data
Data science = involves principles, processes, and techniques for understanding phenomena
via the analysis of data
Business understanding → data collection → data storage → data analysis → implementation
➢ We focus on data analysis
Data mining = the extraction of knowledge from data, via technologies that incorporate
these principles
Data driven decision making (DDD) = the practice of basing decisions on the
analysis of data, rather than purely on intuition
2 decisions of interest
1. Need discovery (find patterns in the data that help you understand the business)
a. E.g. Walmart analyzed how demand changed after a hurricane. It saw that
water was in higher demand, so it kept more water in stock.
2. Repetitive decisions (happen on large scale)
a. E.g. when you have a contract with a telecom provider, at some point you
may want to switch to another provider for a better offer. If the first
provider can predict when you will switch, it can retain you with a better offer.
Marketing
- Online advertising (whenever you click on a link with an advert, and the page loads,
there is a bidding war going on over how much advertisers are willing to pay for your click)
- Recommendations for cross-selling (Amazon does this: when you want to buy a
camera, you are also offered an SD card). Things that are bought together
- Customer relationship management (Easyjet gives you info about how much
you travel to give you a warm feeling)
Retail
- Marketing (AH bonus weeks are determined by customer behavior in the store)
- Supply chain management (predict which products are going to be back ordered and
prevent this from happening)
Data analytics = the process of examining datasets in order to draw conclusions about the
useful info they may contain
3 types of data analytics
1. Descriptive analytics (BI): What has happened?
a. Simple descriptive statistics, dashboards, charts, diagrams
b. Simple correlational methods
2. Predictive analytics: What could happen?
a. Regression, classification
b. Advanced correlation methods
3. Prescriptive analytics: What should we do?
a. A-B testing, advanced econometric techniques
b. Causality
We focus on the first 2
Data science can help generate & sustain a competitive advantage (CA) if you align:
- Human capital
o Incentives
- Organization
o Center of excellence + local implementation (you need data scientists who can
do all the magic, and local implementation with people who can speak to both the
data scientists and the TM team)
- Culture
o Data science at core of strategy making
- Infrastructure
o No data, no DDD
Challenges in data science
From a large mass of data you can always find some pattern, but it is not always clear
whether it generalizes to the wider population
➢ Risk of over-fitting
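The risk of over-fitting can be made concrete with a small sketch. Everything in it is an assumption for illustration and not from the lecture: the data is synthetic (label = 1 when x > 0.5, with 20% of labels flipped as noise), and the "model" is a deliberately over-flexible memorizer that copies the label of the closest training point. It shows why a pattern found in the data should be checked on data the model has not seen.

```python
import random

def make_data(n, rng):
    """Hypothetical data: the true rule is label = 1 when x > 0.5,
    but 20% of the labels are flipped at random (noise)."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:
            label = 1 - label  # noisy observation
        rows.append((x, label))
    return rows

def memorizer_predict(train, x):
    """Overly flexible 'model': copy the label of the closest training point."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def accuracy(train, rows):
    """Share of rows whose label the memorizer predicts correctly."""
    hits = sum(memorizer_predict(train, x) == label for x, label in rows)
    return hits / len(rows)

rng = random.Random(42)
train = make_data(200, rng)
test = make_data(1000, rng)

train_acc = accuracy(train, train)  # scored on data the model has already seen
test_acc = accuracy(train, test)    # scored on held-out data
print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {test_acc:.2f}")
```

The memorizer is perfect on the training data by construction, because every training point is its own nearest neighbour. On the held-out data it scores lower, since it has also memorized the noise. That gap is the symptom of over-fitting.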
Data mining process
Cross-Industry Standard Process for Data Mining (CRISP-DM)
➢ Also the core of the course: make sure you structure your assignments according to
this model
Data analytic thinking
- Routinely transform business problems into data science problems
- Tacit skill that is only learned through trial & error
Supervised learning
Training data has one feature that is the target
Supervised = classification, regression, similarity matching
Unsupervised = clustering, profiling, co-occurrence grouping
Both = similarity matching, link prediction, data reduction
Boundaries
- Knowledge discovery and data mining (KDD) is a subfield of machine learning
- Data science (prediction) is not econometrics (correlation & causality), and it is not a
field of statistics (which is interested in whether an observed distribution is likely to
come from a random distribution)
o Therefore, rely heavily on business understanding
o Always separate training, test and use data
o Also, this is why we are not interested in R² or p-values (though we will use
other tools to evaluate models)
Case 1: Capital One
Now a very data driven company
Invest in high quality data
- Give customers random terms for their credit cards
- Collected data on customers who normally would not have been given credit cards
- These turned out to be very profitable, i.e. those who pay off their debt just enough
that they are not defaulting but Capital One still gets loads of interest
What can they do that other banks can’t?
- Customer acquisition
o Provide data driven services to customers before even speaking to them
- Product customization
o Differentiate interest rates for credit cards (make custom made products for
each individual customer)
- Customer retention
o Invested heavily in both IT and data analysts
What is required for Capital One to translate the business problem of fraud detection into a
data science task?
Drawbacks of data driven strategy
- Cost and risk in data acquisition
o Providing customers with random terms for their credit cards is risky and in
short term likely to lead to losses
o Signet bank incurred losses for several years
- Capital One found out nobody recognized their brand
o Target variables generally short-term
o What is profitable in the short run does not necessarily help in the long run
- Might weed out certain customers
o Reciprocators vs. self-regulating stakeholders
o Customers who are likely to leave if someone else gives a cheaper offer
Lecture 2
Datasets contain entities with certain attributes
Dataset = sample, population, data, set, work set
Entity = object, instance, observation, element, example, line, row, feature vector
Attribute = feature, characteristic, variable, column
- Predicted attribute = dependent, explained
- Predicting attribute = independent, explanatory
Model = a simplified representation of reality created to serve a purpose (abstraction of
irrelevant details)
Purpose
- Unsupervised setting: to identify (classes, group, patterns) → descriptive
- Supervised setting: to predict (try to estimate an unknown value) → predictive
o What is the value of this house?
Induction = generalizing from specific cases to general rules
e.g. developing classification and regression models
Deduction = applying general rules and specific facts to create other specific facts
e.g. using classification and regression models
Supervised & unsupervised are not directly related to induction/deduction; both can be either
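A minimal sketch of induction versus deduction, using the house value question from above. The house sizes and prices are made up for illustration, and the least-squares line is just one possible regression model, not necessarily the one used in the course. Fitting the line is the induction step (specific cases to a general rule); applying it to a new house is the deduction step.

```python
def fit_line(xs, ys):
    """Induction: derive a general rule (slope, intercept) from specific cases
    using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training cases: size in square metres, price in thousands.
sizes = [50, 80, 100, 120]
prices = [150, 240, 300, 360]

slope, intercept = fit_line(sizes, prices)

def predict(size):
    """Deduction: apply the general rule to one specific new case."""
    return slope * size + intercept

print(f"rule: price = {slope:.2f} * size + {intercept:.2f}")
print(f"predicted price of a 90 m2 house: {predict(90):.1f}")
```

The toy data lies exactly on a line, so the fitted rule reproduces it. With real data the rule would only approximate the cases it was induced from.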
Supervised segmentation
Objective: How can we segment the population into groups that differ from each other with
respect to some quantity of interest?
Inputs: Informative attributes (have to be knowable beforehand, you can’t use the
value of an acquisition that still has to happen as input)
Knowable attributes that correlate with the target of interest
Outputs: Segments that are pure/ less impure in the quantity of interest
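A rough sketch of how the purity of segments could be compared. The notes do not name an impurity measure at this point, so entropy and information gain are assumed here as one common choice. The customer rows, the attribute names (`contract`, `region`) and the target (`churn`) are all hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a segment: 0 when all labels are equal, 1 for a 50/50 binary split."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the whole population minus the size-weighted entropy of the
    segments obtained by splitting on `attribute`."""
    parent = [row[target] for row in rows]
    segments = {}
    for row in rows:
        segments.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(seg) / len(rows) * entropy(seg) for seg in segments.values())
    return entropy(parent) - weighted

# Hypothetical customers: did they switch provider (churn)?
customers = [
    {"contract": "monthly", "region": "north", "churn": "yes"},
    {"contract": "monthly", "region": "south", "churn": "yes"},
    {"contract": "monthly", "region": "north", "churn": "yes"},
    {"contract": "monthly", "region": "south", "churn": "no"},
    {"contract": "yearly", "region": "north", "churn": "no"},
    {"contract": "yearly", "region": "south", "churn": "no"},
    {"contract": "yearly", "region": "north", "churn": "no"},
    {"contract": "yearly", "region": "south", "churn": "yes"},
]

gain_contract = information_gain(customers, "contract", "churn")
gain_region = information_gain(customers, "region", "churn")
print(f"gain from splitting on contract: {gain_contract:.3f}")
print(f"gain from splitting on region:   {gain_region:.3f}")
```

In this toy data, splitting on `contract` gives segments that are less impure than the whole population, while splitting on `region` leaves both segments at a 50/50 mix. Under these assumptions `contract` is the more informative attribute.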