Summary: Data Mining

Comprehensive summary of slides and lesson notes. Got me a 15/20:)

  • October 3, 2022
  • 82
  • 2021/2022
  • Summary

Lecture 1: Introduction
Definitions
Data mining = “The automatic extraction of patterns from large amounts of data”
→ via tools/technologies that incorporate the principles of data science

Data science = “A set of fundamental principles that guide the extraction of knowledge from data”
→ this is what underlies data mining

Importance of data science
You need more data-savvy managers than data scientists. Everyone in the company should
know, to some extent, what to do with data. Data-driven firms have 5-6% higher
productivity, higher market value, and higher return on equity.

“AI is the new electricity”: companies will go through the same kind of transformation with AI as they did with electricity.

Roles in data science:
• Data architect (focuses on the storage of data; what should the database look like?)
• Data analyst (won’t do data mining)
• Data & machine learning engineer (they put data science into production)
• Data scientist (works with the models and presents results)

Skills: a data scientist must have all three of these skills.




1.1 What is Data Mining?
Example: Hurricane Frances
Walmart wanted to find non-obvious buying patterns of customers before a hurricane. They could then
stock up on those products and increase sales.

Terminology
Big Data = data sets so large that traditional data processing systems cannot deal with them (both
the storage and the analysis component)

Querying and reporting
• You know exactly what you are looking for: e.g. What is the profitability of my store in Brussels?
• SQL
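
As a minimal sketch of such a query, the Brussels question can be written in SQL and run through Python's built-in sqlite3 module. The table name `sales`, its columns, and all figures are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical sales table; store names and amounts are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, revenue REAL, cost REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Brussels", 1200.0, 800.0),
        ("Brussels", 900.0, 650.0),
        ("Antwerp", 1500.0, 1100.0),
    ],
)

# The question is fully specified in advance: profit of the Brussels store.
query = "SELECT SUM(revenue - cost) FROM sales WHERE store = ?"
(brussels_profit,) = conn.execute(query, ("Brussels",)).fetchone()
print(brussels_profit)
conn.close()
```

No pattern is discovered here: the query only answers the exact question that was asked.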

OLAP (On-Line Analytical Processing) = advanced query and reporting
• Multidimensional analysis
• Nice visualization, data cubes, roll-up, slice and dice, ...
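
Roll-up and slice can be illustrated with a tiny in-memory "cube". This is only a sketch in plain Python, not how an OLAP engine is implemented; the dimensions (city, product, quarter) and the sales figures are assumptions.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, product, quarter, sales amount).
facts = [
    ("Brussels", "water", "Q1", 100),
    ("Brussels", "water", "Q2", 150),
    ("Brussels", "snacks", "Q1", 80),
    ("Antwerp", "water", "Q1", 120),
    ("Antwerp", "snacks", "Q2", 60),
]

def roll_up(rows, dims):
    """Sum sales over all dimensions except the ones listed in dims."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[3]
    return dict(totals)

def slice_cube(rows, dim, value):
    """Keep only the rows where one dimension is fixed to a single value."""
    return [row for row in rows if row[dim] == value]

# Roll-up: aggregate away product and quarter, keep city.
by_city = roll_up(facts, dims=[0])

# Slice: fix quarter = Q1, then summarize the slice per product.
q1_only = slice_cube(facts, dim=2, value="Q1")
q1_by_product = roll_up(q1_only, dims=[1])

print(by_city)
print(q1_by_product)
```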



Business Intelligence = Getting the right information to the right person at the right time.

Data warehousing: collect and coalesce data from across an enterprise, often from multiple
transaction-processing systems, each with its own database.




Machine Learning
• Improving the knowledge of a learning agent
• More than just data mining, also computer vision and robotics
Artificial Intelligence
• A computer interacts through data
• Learning from data leads to intelligence
• Big Data + Machine Learning = Artificial Intelligence
• Renewed interest from Deep Learning
• Most work in AI is on data mining
The separation between these two fields has blurred.

Data Mining
Example: credit scoring in banks
Bank: should I grant credit to this loan applicant? → We want to predict creditworthiness, based on
historical data.
All major banks use data mining for credit scoring. The model predicts whether you will be able to repay
the loan. The bank has information on all of its customers: income, profession, the amount of the
loan, the mortgage, the value of the house they want to buy, etc. It also knows whether they
repaid their loans or not. That is the dataset. You then give the dataset to the data mining algorithm to
learn what predicts whether a loan will be repaid. So, you need to know the value of what you want
to predict (= the target variable) for some set of customers. In other words: you need to know which
people did and which people did not repay their loan. Only then can the algorithm learn.

We need:
• Input matrix X
• Target variable Y = the variable you want to predict
• Vector/column = feature/input variable
• Row = a data instance (e.g. a customer)
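
A minimal sketch of this layout for the credit scoring example, using plain Python lists. The feature names and all values are invented for illustration.

```python
# Hypothetical historical loan data; every number here is made up.
feature_names = ["income", "loan_amount", "house_value"]

# Input matrix X: one row per customer (data instance), one column per feature.
X = [
    [30000, 200000, 220000],
    [55000, 150000, 300000],
    [42000, 250000, 260000],
]

# Target variable y: 1 = loan repaid, 0 = not repaid (known for past customers).
y = [0, 1, 1]

n_instances = len(X)      # number of rows
n_features = len(X[0])    # number of columns

# A single feature is one column of X, read across all rows.
income_column = [row[feature_names.index("income")] for row in X]
print(n_instances, n_features, income_column)
```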





The upper half of the figure illustrates the mining of historical data to produce a model. Importantly,
the historical data have the target value “class” specified. The bottom half shows the result of the data
mining, where the model is applied to new data for which we do not know the class value. The model
predicts both the class value and the probability that the class variable will take on that value.
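
The two halves described above can be sketched in code: first fit a model on historical data with known classes, then apply it to new instances. Logistic regression trained by gradient descent is used here only as one possible model; the notes do not say which model the figure uses. The features (income in units of 10,000 and loan-to-income ratio) and all values are assumptions.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.05, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(model, x):
    """Return the predicted class and the probability of that class."""
    w, b = model
    p_repaid = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    label = 1 if p_repaid >= 0.5 else 0
    return label, (p_repaid if label == 1 else 1.0 - p_repaid)

# Upper half: historical data with known class (1 = repaid, 0 = not repaid).
X_hist = [[2.0, 8.0], [2.5, 7.0], [3.0, 6.5], [5.0, 3.0], [6.0, 2.5], [5.5, 2.0]]
y_hist = [0, 0, 0, 1, 1, 1]
model = train(X_hist, y_hist)

# Bottom half: new applicants whose class is unknown.
label_a, prob_a = predict(model, [6.0, 2.0])
label_b, prob_b = predict(model, [2.0, 8.0])
print(label_a, round(prob_a, 3), label_b, round(prob_b, 3))
```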

Other examples:
• Market basket analysis
• Recommendation systems
• Facebook likes predict personality traits
o Study done at Cambridge: Can we use Facebook likes to predict personality
characteristics? For most of these characteristics, the prediction worked quite well.
o You build your dataset: each row is a user and each column is a Facebook
page that a user might like. If the user liked the page it’s ‘1’, otherwise ‘0’.
o What to predict? For example: gender or IQ. Some people were willing to give their
information to Facebook. The company used this dataset to find patterns.
o A nice property of a linear model is that you can ask things like: give me the top 10 pages
with the highest predicted IQ scores.
• Clustering
• Predicting political preference with Twitter
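
For the Facebook likes example above, a minimal sketch of the binary user-by-page matrix and of ranking pages with a linear model. The page names are placeholders, and the coefficients are invented values standing in for an already fitted model; they do not come from the study.

```python
# Hypothetical users and the pages they liked; names are placeholders.
likes = {
    "user_1": {"page_a", "page_c"},
    "user_2": {"page_b"},
    "user_3": {"page_a", "page_b", "page_d"},
}
pages = sorted({page for liked in likes.values() for page in liked})

# Binary input matrix: one row per user, one column per page (1 = liked).
X = [[1 if page in liked else 0 for page in pages] for liked in likes.values()]

# Assumed coefficients of an already fitted linear model (invented values).
weights = {"page_a": 2.5, "page_b": -1.0, "page_c": 4.0, "page_d": 0.5}
intercept = 100.0

def predict_score(row):
    """Linear prediction: intercept plus the weights of the liked pages."""
    return intercept + sum(weights[p] * v for p, v in zip(pages, row))

def top_pages(k):
    """Pages whose like adds most to the predicted score."""
    return sorted(weights, key=weights.get, reverse=True)[:k]

print(X)
print(predict_score(X[0]))
print(top_pages(2))
```

Because each page contributes one additive coefficient, ranking the pages is just sorting the coefficients.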

1.2 Data Mining Process
CRISP-DM: Cross Industry Standard Process for Data Mining





Business understanding: understanding the problem to be solved.
The initial formulation may not be complete, so multiple iterations may be necessary for an optimal
solution formulation.
→ Analyst’s creativity plays a great role here
→ What exactly do we want to do? How would we do it? What parts of this use scenario
constitute possible data mining models?

Data understanding: where is the data coming from, what is the data?
Understand the strengths and limitations of the data. Historical data often are collected for purposes
unrelated to the current business problem.
Costs of data can also vary → estimate costs and benefits of each data source.
Uncover the structure of the business problem and the data that are available, and then match them
to one or more data mining tasks for which we may have substantial science and technology to apply.

Data preparation:
You might have outliers, numbers in € and in $ etc. So, often data has to be manipulated and converted
into another form that yields better results.
Important → beware of leaks: a leak = a situation where a variable collected in historical data gives
information on the target variable, but that information is not actually
available at the time the decision must be made.
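
A minimal sketch of both preparation steps: converting amounts to one currency and dropping a leaky variable. The exchange rate, the field names, and the records are assumptions for illustration; `days_overdue` plays the role of a variable that is only known after the loan has run.

```python
# Assumed exchange rate, for illustration only; use a real rate in practice.
USD_TO_EUR = 0.9

# Hypothetical raw records. "days_overdue" is only known after the loan ran,
# so it leaks information about the target "repaid".
raw = [
    {"income": 40000, "currency": "EUR", "days_overdue": 0, "repaid": 1},
    {"income": 50000, "currency": "USD", "days_overdue": 90, "repaid": 0},
]

# Only features that exist at the moment the credit decision is made.
AVAILABLE_AT_DECISION_TIME = {"income_eur"}

def prepare(record):
    """Convert income to euros and keep only non-leaky features."""
    rate = USD_TO_EUR if record["currency"] == "USD" else 1.0
    converted = {
        "income_eur": record["income"] * rate,
        "days_overdue": record["days_overdue"],
    }
    features = {
        name: value
        for name, value in converted.items()
        if name in AVAILABLE_AT_DECISION_TIME
    }
    return features, record["repaid"]

prepared = [prepare(r) for r in raw]
print(prepared)
```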

Modeling: make the model and look for patterns in the dataset.
The output of this stage is a model/pattern capturing regularities in the data.

Evaluation: is this model good or not? It’s quite a difficult step!
Assess the data mining results and gain confidence that they are valid and reliable before moving on.
It also helps ensure that the model satisfies the original business goals.
→ Even if a model passes strict evaluation tests “in the lab”, there may be external considerations that
make it impractical.
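
One simple "in the lab" check is accuracy on data that was held out from modeling. The sketch below assumes a hypothetical threshold rule as the model and invented held-out data; it shows only the mechanics, not a full evaluation procedure.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances where the prediction matches the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def threshold_model(income, cutoff=40000):
    """Hypothetical model: predict 'repaid' (1) when income reaches the cutoff."""
    return 1 if income >= cutoff else 0

# Invented held-out data that was not used to build the model.
test_incomes = [25000, 38000, 45000, 60000, 41000]
test_labels = [0, 1, 1, 1, 0]

predictions = [threshold_model(income) for income in test_incomes]
test_accuracy = accuracy(test_labels, predictions)
print(predictions, test_accuracy)
```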

Deployment: if the model is good, you start using it in practice.
The results of the data mining are put into real use. For example: implementing a predictive model in
an information system or business process.

Which step would be the most difficult one?
Data preparation! It takes up the most time and is the least fun part, but it is really important.
Modeling is an easier step! It’s often a matter of milliseconds.




Craft: you learn by doing: the more you do it, the easier it becomes.
Creativity: it’s typically an inherent skill. What kind of variables could be useful? What can I do with
this model? Could I use it in other settings as well?
Common sense: if the evaluation says the model is correct in all of its predictions,
you should realize that this is not realistic.

