Lecture 1: Introduction
Definitions
Data mining = “The automatic extraction of patterns from large amounts of data”
→ via tools/technologies that incorporate the principles of data science
Data science = “A set of fundamental principles that guide the extraction of knowledge from data”
→ this is what underlies data mining
Importance of data science
You need more data-savvy managers than data scientists. Everyone in the company should
know, to some extent, what to do with data. Data-driven firms reportedly have 5-6% higher
productivity, higher market value, and higher return on equity.
“AI is the new electricity”: AI is expected to transform companies the way electricity once did.
Roles in data science:
• Data architect (focuses on the storage of data: what should the database look like?)
• Data analyst (won’t do data mining)
• Data & machine learning engineer (they put data science into production)
• Data scientist (works with the models and presents results)
Skills: a data scientist must be familiar with all 3 of these skills.
1.1 What is Data Mining?
Example: Hurricane Frances
Walmart wanted to find non-obvious buying patterns of customers ahead of a hurricane, so that it could
stock up on the products in demand and increase sales.
Terminology
Big Data = data sets so large that traditional data processing systems cannot handle them (both for
storage and for analysis)
Querying and reporting
• You know exactly what you are looking for: e.g. What is the profitability of my store in Brussels?
• SQL
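A minimal sketch of such a query, using Python's built-in sqlite3 module. The table name, column names and figures are all invented for illustration; the point is only that the question is fully specified before any data is touched.

```python
import sqlite3

# Hypothetical schema and figures, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, revenue REAL, cost REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Brussels", 1200.0, 900.0),
        ("Brussels", 800.0, 650.0),
        ("Antwerp", 1500.0, 1000.0),
    ],
)

# The question is known in advance: what is the profit of the Brussels store?
row = conn.execute(
    "SELECT SUM(revenue - cost) FROM sales WHERE store = ?", ("Brussels",)
).fetchone()
brussels_profit = row[0]
print(brussels_profit)
conn.close()
```

No pattern is discovered here: the query only returns what was explicitly asked for.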
OLAP (On-Line Analytical Processing) = advanced query and reporting
• Multidimensional analysis
• Nice visualization, data cubes, roll-up, slice and dice, ...
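To make roll-up and slicing concrete, here is a small sketch in plain Python. The fact table, its dimensions (city, country, quarter) and the helper names roll_up and slice_cube are assumptions made up for this example; real OLAP tools do this on precomputed data cubes.

```python
from collections import defaultdict

# Hypothetical fact table: (city, country, quarter, sales). Values are made up.
facts = [
    ("Brussels", "Belgium", "Q1", 100),
    ("Brussels", "Belgium", "Q2", 120),
    ("Antwerp", "Belgium", "Q1", 80),
    ("Amsterdam", "Netherlands", "Q1", 90),
    ("Amsterdam", "Netherlands", "Q2", 110),
]

def roll_up(rows):
    """Roll-up: aggregate sales from city level to country level, per quarter."""
    totals = defaultdict(int)
    for _city, country, quarter, sales in rows:
        totals[(country, quarter)] += sales
    return dict(totals)

def slice_cube(rows, quarter):
    """Slice: fix one dimension (the quarter) and keep the other dimensions."""
    return [r for r in rows if r[2] == quarter]

by_country = roll_up(facts)
q1_only = slice_cube(facts, "Q1")
```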
Business Intelligence = getting the right information to the right person at the right time.
Data warehousing: collect and coalesce data from across an enterprise, often from multiple
transaction-processing systems, each with its own database.
Machine Learning
• Improving the knowledge of a learning agent
• More than just data mining, also computer vision and robotics
Artificial Intelligence
• A computer interacts through data
• Learning from data leads to intelligence
• Big Data + Machine Learning = Artificial Intelligence
• Renewed interest from Deep Learning
• Most work in AI is on data mining
The separation between these 2 fields has blurred.
Data Mining
Example: credit scoring in banks
Bank: should I grant credit to this loan applicant? → We want to predict creditworthiness, based on
historical data.
All major banks use data mining for credit scoring. The model predicts whether you will be able to repay
the loan. The bank has information on all of its customers: income, profession, the amount of the
loan, the mortgage, the value of the house they want to buy, etc. It also knows whether they
repaid their loans or not. That is the dataset. You then give the dataset to the data mining algorithm, which
learns what predicts whether a loan will be repaid. So, you need to know the value of what you want
to predict (= the target variable) for some set of customers. In other words: you need to know which
people did and which people did not repay their loan. Only then can the algorithm learn.
We need:
• Input matrix X
• Target variable Y = the variable you want to predict
• Column (vector) = a feature/input variable
• Row = a data instance (e.g. a customer)
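As an illustration, the credit-scoring dataset could be laid out as follows. This is only a sketch: the feature names and all values are invented.

```python
# Hypothetical historical loan data; all names and values are invented.
feature_names = ["income", "loan_amount", "house_value"]

# Input matrix X: one row per customer, one column per feature.
X = [
    [3200, 150000, 250000],
    [1800, 200000, 210000],
    [4500, 100000, 300000],
]

# Target variable y: 1 = loan repaid, 0 = not repaid (known only for past customers).
y = [1, 0, 1]

n_instances = len(X)      # number of rows (customers)
n_features = len(X[0])    # number of columns (features)

# A single feature is one column of X, read across all customers.
income_column = [row[feature_names.index("income")] for row in X]
```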
The upper half of the figure illustrates the mining of historical data to produce a model. Importantly,
the historical data have the target value “class” specified. The bottom half shows the result of the data
mining, where the model is applied to new data for which we do not know the class value. The model
predicts both the class value and the probability that the class variable will take on that value.
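The two halves of the figure can be sketched in code. The example below uses logistic regression fitted with plain gradient descent, which is just one possible model and is not named in the lecture. The features (monthly income in thousands of euros, loan-to-value ratio), the data, and the learning rate and number of epochs are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Mining step: fit weights and bias with batch gradient descent on log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, target in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)
            err = p - target
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(model, row):
    """Use step: return the predicted class and the probability of that class."""
    w, b = model
    p_repaid = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)
    predicted_class = 1 if p_repaid >= 0.5 else 0
    prob = p_repaid if predicted_class == 1 else 1.0 - p_repaid
    return predicted_class, prob

# Historical data with known class (1 = repaid, 0 = not repaid); values invented.
# Columns: monthly income in thousands of euros, loan-to-value ratio.
X_hist = [[4.0, 0.5], [3.5, 0.6], [5.0, 0.4], [1.5, 0.95], [2.0, 0.9], [1.0, 1.0]]
y_hist = [1, 1, 1, 0, 0, 0]

model = train_logistic(X_hist, y_hist)

# New applicant: the class is unknown, so the model predicts it.
new_applicant = [4.2, 0.55]
pred_class, pred_prob = predict(model, new_applicant)
```

The historical rows need a known class to train on; the new applicant has none, which is exactly why the model is applied.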
Other examples:
• Market basket analysis
• Recommendation systems
• Facebook likes predict personality traits
o Study done at Cambridge: Can we use Facebook likes to predict personality
characteristics? For most of these characteristics, the prediction worked quite well.
o You build your dataset: each row is a user and each column is a Facebook
page that a user could like. The cell is ‘1’ if the user liked the page, otherwise ‘0’.
o What to predict? For example: gender or IQ. Some people were willing to give their
information to Facebook. The company used this dataset to develop patterns.
o A nice thing about a linear model is that you can ask things like: give me the top 10 pages
associated with the highest predicted IQ scores.
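A toy sketch of the likes matrix and of ranking pages with a linear model. The users, pages, weights and intercept are all made up; in particular the weights stand in for coefficients that a fitted model would provide and are not results from the study.

```python
# Hypothetical users, pages and likes, invented for illustration only.
pages = ["page_a", "page_b", "page_c", "page_d"]
likes = {
    "user_1": {"page_a", "page_c"},
    "user_2": {"page_b"},
    "user_3": {"page_a", "page_b", "page_d"},
}

# Binary input matrix: one row per user, one column per page (1 = liked).
users = sorted(likes)
X = [[1 if page in likes[user] else 0 for page in pages] for user in users]

# Assumed coefficients of an already-fitted linear model (made-up numbers).
weights = {"page_a": 2.5, "page_b": -1.0, "page_c": 4.0, "page_d": 0.5}
intercept = 100.0

def predict_score(row):
    """Linear model: intercept plus the weights of the liked pages."""
    return intercept + sum(weights[p] * x for p, x in zip(pages, row))

def top_pages(k):
    """Pages whose like raises the predicted score the most."""
    return sorted(weights, key=weights.get, reverse=True)[:k]

scores = {user: predict_score(row) for user, row in zip(users, X)}
best = top_pages(2)
```

Because each page has its own weight, ranking pages is simply sorting the weights, which is what makes the linear model easy to interrogate.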
• Clustering
• Predicting political preference with Twitter
1.2 Data Mining Process
CRISP-DM: Cross Industry Standard Process for Data Mining
Business understanding: understanding the problem to be solved.
The initial formulation may not be complete, so multiple iterations may be necessary to arrive at a good
solution formulation.
The analyst’s creativity plays a large role here.
What exactly do we want to do? How would we do it? Which parts of this use scenario
constitute possible data mining models?
Data understanding: where is the data coming from, what is the data?
Understand the strengths and limitations of the data. Historical data often are collected for purposes
unrelated to the current business problem.
Costs of data can also vary → estimate costs and benefits of each data source.
Uncover the structure of the business problem and the data that are available, and then match them
to one or more data mining tasks for which we may have substantial science and technology to apply.
Data preparation:
You might have outliers, amounts in both € and $, etc. So, data often has to be manipulated and converted
into another form that yields better results.
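A small sketch of two such preparation steps: converting all amounts to one currency and flagging outliers. The exchange rate is an arbitrary illustrative number, and the 1.5 x IQR rule is one common convention for outliers, not something prescribed in the lecture.

```python
import statistics

# Hypothetical raw records mixing currencies; the rate below is an assumption.
USD_TO_EUR = 0.9  # illustrative rate only, not a real quote
raw_amounts = [("EUR", 100.0), ("USD", 120.0), ("EUR", 95.0), ("EUR", 110.0),
               ("USD", 100.0), ("EUR", 105.0), ("EUR", 5000.0)]

def to_eur(currency, amount):
    """Convert every amount to a single currency (euro)."""
    if currency == "EUR":
        return amount
    if currency == "USD":
        return amount * USD_TO_EUR
    raise ValueError(f"unknown currency: {currency}")

amounts_eur = [to_eur(c, a) for c, a in raw_amounts]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

outliers = iqr_outliers(amounts_eur)
```

Whether a flagged value is an error or a genuine extreme customer is a judgment call that the code cannot make for you.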
Important → beware of leaks. A leak is a situation where a variable collected in the historical data gives
information on the target variable: information that appears in the historical data but is not actually
available when the decision must be made.
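One way to guard against leaks is to record, for every variable, when its value becomes known, and to keep only what is available at decision time. The feature names and timing labels below are hypothetical and chosen for the credit-scoring example.

```python
# Hypothetical loan features annotated with when their value becomes known.
# "application"    = known when the credit decision is made
# "after_decision" = only recorded later, so it leaks information on the outcome
feature_timing = {
    "income": "application",
    "loan_amount": "application",
    "profession": "application",
    "number_of_missed_payments": "after_decision",
    "collection_agency_contacted": "after_decision",
}

def split_leaky(timing):
    """Separate usable features from potential leaks."""
    usable = [name for name, when in timing.items() if when == "application"]
    leaky = [name for name, when in timing.items() if when != "application"]
    return usable, leaky

usable_features, leaky_features = split_leaky(feature_timing)
```

A model trained on the leaky columns would look excellent on historical data and be useless in practice, because those columns are empty when a new applicant walks in.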
Modeling: make the model and look for patterns in the dataset.
The output of this stage is a model/pattern capturing regularities in the data.
Evaluation: is this model good or not? It’s quite a difficult step!
Assess the data mining results and gain confidence that they are valid and reliable before moving on.
It is also used to help ensure that the model satisfies the original business goals.
Even if a model passes strict evaluation tests “in the lab”, there may be external considerations that
make it impractical.
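A very simple evaluation sketch: measure accuracy on instances the model has not seen and compare it with a naive baseline that always predicts the majority class. Both the held-out setup and the baseline are assumptions added here for illustration, and the labels and predictions are made up.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of instances whose predicted class equals the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def majority_baseline(y_train, n):
    """Always predict the most frequent class seen in the training labels."""
    majority_class = Counter(y_train).most_common(1)[0][0]
    return [majority_class] * n

# Made-up labels: 1 = repaid, 0 = not repaid.
y_train = [1, 1, 1, 0, 1, 0, 1, 1]
y_test = [1, 0, 1, 1, 0]     # held-out instances the model never saw
y_model = [1, 0, 1, 0, 0]    # hypothetical model predictions on the test set

model_acc = accuracy(y_test, y_model)
baseline_acc = accuracy(y_test, majority_baseline(y_train, len(y_test)))
```

A model is only interesting if it clearly beats such a baseline, and even then it still has to serve the original business goal.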
Deployment: if the model is good, you start using it in practice.
The results of the data mining are put into real use. For example: implementing a predictive model in
an information system or business process.
Which step would be the most difficult one?
Data preparation! It takes up the most time and is the least fun part, but it is really important.
Modeling is an easier step! It’s often a matter of milliseconds.
Craft: you learn by doing: the more you do it, the easier it becomes.
Creativity: it’s typically an inherent skill. What kind of variables could be useful? What can I do with
this model? Could I use it in other settings as well?
Common sense: if the evaluation says the model is correct in all of its predictions,
you should realize that this is too good to be true.