Summary Data Mining | Midterm week 1-3


This summary covers all material from weeks 1-3 and is intended for the first midterm of this course. Lecture notes: 1-3.

Preview: 3 of 30 pages

  • 25 February 2020
  • 30 pages
  • 2019/2020
  • Summary

Seller: ioumi
Week 1
Slides
Data Mining for Business & Governance

Lecture 1: What is Data Mining?
Data mining is the computational process of discovering patterns in large data
sets, involving methods at the intersection of artificial intelligence, machine learning,
statistics, and database systems.

It is about extracting novel, interesting and potentially useful knowledge.

Main relations to other fields:
• Knowledge discovery in databases
• Machine learning → branch of computer science studying learning from data
• Statistics → branch of mathematics focused on data
• Artificial intelligence → interdisciplinary field aiming to develop intelligent
machines

Key aspects
• Computation vs large data sets: there is a trade-off to be made between
processing time and memory
• Computation enables analysis of large data sets: computers are the tool, and as data keeps growing we need
to design efficient computational methods that extract knowledge from the data and give it meaning.
• Data mining often implies knowledge discovery from databases: from unstructured data to
structured knowledge.
- Unstructured data: text
- Semi-structured data: an HTML page, because the tags give us some extra information
- Structured data: tables

What are large amounts of data, or big data? (The definition is always changing.)
→ Current opinion: we should have smaller datasets, so that we can enrich them and give them a higher quality

Volume
• Too big for manual analysis
• Too big to fit in RAM
• Too big to store on disk

Variety
• Range of values: variance
• Outliers, confounders and noise
• Different data types

Velocity
• Data changes quickly: require results before data changes
• Streaming data (no storage)

Application of data mining

Companies: business intelligence → market analysis and management
• Target marketing, CRM
• Risk analysis and management
• Forecasting, customer retention, quality control, competitive analysis
• Fraud detection and management
• AH bonus card, Amazon, Mastercard, Booking.com

Science: knowledge discovery → scientific discovery in large data
• DNA: sequence data
• SETI program, time series
• Electronic Health Records
• Social Network Analysis
• Text Mining (natural language processing): going from unstructured text → structured knowledge


What makes prediction possible?
There must be some structure in the data!
• Associations between the features and the target
• Numerical variables: association is measured with the correlation coefficient
• Categorical variables: association is measured with mutual information, i.e. the value of X1 contains information about the value of X2
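
To make the categorical case concrete, here is a minimal sketch of mutual information estimated from observed frequencies. The function name `mutual_information` and the toy variables (`smoker`, `cough`, `coin`) are made up for illustration and do not come from the lecture.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long lists of categories."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n            # joint relative frequency
        p_x = count_x[x] / n    # marginal relative frequency of x
        p_y = count_y[y] / n    # marginal relative frequency of y
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

# Hypothetical toy data
smoker = ["yes", "yes", "no", "no"]
cough = ["yes", "yes", "no", "no"]          # fully determined by smoker
coin = ["heads", "tails", "heads", "tails"]  # unrelated to smoker

mi_dependent = mutual_information(smoker, cough)
mi_independent = mutual_information(smoker, coin)
print(mi_dependent, mi_independent)
```

When knowing X1 tells you the value of X2 exactly, the mutual information is high; when the two variables are unrelated in the sample, it is zero, so X1 gives no help in predicting X2.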


Different types of learning
A program is said to learn from experience (E) with respect to a task (T) and a performance measure (P) if its
performance at tasks in T, as measured by P, improves with E.
• Supervised learning – labels
= You train the machine using data that is well ‘labeled’, so you are
learning a mapping from the input to the expected output

- Classification: because we have labels, we can train a model
to distinguish different classes of diseases.
- Regression: when the target is numerical, e.g. estimating the risk
of getting a disease




• Unsupervised learning – no labels
= We have no labels for the data; you are not aiming to produce an output in response to the input.
Instead, you want to discover patterns in the data.

- Dimensionality reduction: with a large number of attributes, we can try to reduce them to the most
relevant/interesting ones.
- Clustering: you look for groups of similar patients
Inductive learning: the algorithm learns from samples / training data / trial and error

Supervised learning workflow for algorithms

1. Collect data
• How do you select your sample?
• Reliability of measurement
• Privacy and other regulations

2. Label examples
• Annotation guidelines
• Measure inter-annotator agreement
• Crowdsourcing

3. Choose representation
• Features: attributes describing examples
- Numerical or categorical (binary)
• Possibly convert to feature vector
- A vector is a fixed-size list of numbers
- Feature vector: describes the object that you want to use.
- Some learning algorithms require examples represented as vectors → spectra representation

4. Train model(s)
• Keep some examples for final evaluation: test set
• Use the rest for:
- Learning: training set
- Tuning: validation set

Parameter or model tuning
• Learning algorithms typically have settings (aka hyperparameters)
• For each value of hyperparameters:
- Apply algorithm to training set to learn
- Check performance on validation set
- Find/choose best-performing setting

5. Evaluate
• Check performance of tuned model on test set
• Goal: estimate how well your model will do in the real world
• Keep evaluation realistic
• Decision tree models, neural networks etc.
• You want to have your data balanced, it’s bad if one group is overrepresented or
underrepresented → learn to create a representative sample, e.g. down sample data
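
Steps 4 and 5 of the workflow can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the learner is a simple 1-D k-nearest-neighbours classifier chosen for brevity, and the split sizes and candidate values of k are arbitrary.

```python
import random

def knn_predict(train, x, k):
    """Predict the majority label among the k training examples closest to x."""
    neighbours = sorted(train, key=lambda example: abs(example[0] - x))[:k]
    votes = [label for _, label in neighbours]
    return max(set(votes), key=votes.count)

def accuracy(train, examples, k):
    """Fraction of examples whose label is predicted correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in examples)
    return correct / len(examples)

# Steps 1-3: hypothetical labelled data, one numerical feature per example
rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), "A") for _ in range(50)]
data += [(rng.gauss(3.0, 1.0), "B") for _ in range(50)]
rng.shuffle(data)

# Step 4: keep a test set aside, split the rest into training and validation sets
test_set = data[:20]
validation_set = data[20:40]
training_set = data[40:]

# Tuning: for each hyperparameter value, learn on the training set, check on the validation set
candidate_ks = [1, 3, 5, 7]
validation_scores = {k: accuracy(training_set, validation_set, k) for k in candidate_ks}
best_k = max(candidate_ks, key=lambda k: validation_scores[k])

# Step 5: evaluate the tuned model once on the untouched test set
test_accuracy = accuracy(training_set, test_set, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```

The test set is used only once, after tuning is finished; if it were also used to pick k, the reported accuracy would be an optimistic estimate of real-world performance.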




Correlation Coefficient
Pearson’s r measures the strength of a linear relationship (dependency)




Pearson’s correlation coefficient
• Numerator: covariance → to what extent do the features change together?
• Denominator: product of standard deviations → makes correlations independent of unit
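
For reference, the standard textbook form of Pearson's r matches this numerator/denominator description. The symbols used here (x_i, y_i for the observations, x̄, ȳ for the means, s_x, s_y for the standard deviations) are assumed notation, not necessarily the notation of the slides.

```latex
\[
r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
       = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]
```

The normalising factors of the covariance and of the two standard deviations cancel, which is why they do not appear in the second form.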




Covariance and correlation
Covariance = indicates how two variables change
together. If an increase in one variable goes together
with an increase in the other variable, the two
variables are said to have a positive covariance.

→ The sign of the covariance gives the direction of
the linear relationship.



The magnitude of the covariance is not easy to
interpret, because it depends on the units of the variables.


The correlation coefficient is normalized and
corresponds to the strength of the linear relation.

Divide the covariance by the product of the variables'
standard deviations.
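
A small sketch shows why the normalization matters: changing the unit of one variable changes the covariance but not the correlation. The height and weight values are hypothetical, and the sample (n - 1) form of the covariance is an assumption of this example.

```python
import math

def covariance(xs, ys):
    """Sample covariance: summed products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def std_dev(xs):
    """Sample standard deviation: square root of the covariance of xs with itself."""
    return math.sqrt(covariance(xs, xs))

def pearson_r(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

# Hypothetical data: the same heights expressed in metres and in centimetres
height_m = [1.60, 1.70, 1.80, 1.90]
weight_kg = [55.0, 68.0, 72.0, 85.0]
height_cm = [h * 100 for h in height_m]

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)   # about 100 times larger than cov_m
r_m = pearson_r(height_m, weight_kg)
r_cm = pearson_r(height_cm, weight_kg)      # same as r_m up to rounding
print(cov_m, cov_cm, r_m, r_cm)
```

The covariance scales with the chosen unit, so its size says little on its own, while r stays between -1 and 1 regardless of units.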



