Data Mining 2017/2018 - Summary

Author: JHessels

Course
Data Mining

Institution
Tilburg University (UVT)

Extended summary (uitgebreide samenvatting) of Data Mining for Data Science: regression, classification, clustering, dimensionality reduction.

Uploaded on January 10, 2018
Number of pages 43
Written in 2017/2018
Type Summary

Education
Data Science

Data Mining W1
What is Data Mining?
“Data mining is the computational process of discovering patterns in large data sets involving methods at the intersection of:

• Statistics (branch of mathematics focused on data);
• Machine Learning (branch of Computer Science studying learning from data);
• Artificial Intelligence (interdisciplinary field aiming to develop intelligent machines);
• Database systems.”

Key aspects
• Computation vs large data sets (trade-off between processing time and memory)
• Computation enables analysis of large data sets (computers as a tool and with growing data)
• Data Mining often implies data discovery from databases (from unstructured data to structured knowledge)
• Text Mining (natural language processing): going from unstructured text to structured knowledge

What counts as large amounts of data, or big data?
• Volume (too big: for manual analysis, to fit in RAM, to store on disk)
• Variety (range of values: variance | Outliers, confounders and noise | Interactions, data is co-dependent)
• Velocity (data changes quickly: results are required before the data changes | Streaming data, no storage)

Applications of data mining
• Companies: Business Intelligence (Amazon, Booking, AH)
  o Market analysis and management
• Science: Knowledge Discovery (University, Laboratories)
  o Scientific discovery in large data

What makes prediction possible?
• Associations between features/target (Amazon)
• Numerical: correlation coefficient
• Categorical: mutual information (the value of x1 contains information about the value of x2)

→ Fitting data is easy, but prediction is hard!
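To make the categorical case concrete, here is a minimal sketch of mutual information between two categorical features. The function name and the toy data (weather, umbrella, coin) are hypothetical illustrations, not part of the course material.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length categorical sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in count_xy.items():
        p_joint = count / n
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y)
        mi += p_joint * math.log2(count * n / (count_x[x] * count_y[y]))
    return mi

# Hypothetical toy data: umbrella use follows the weather, the coin does not.
weather = ["sun", "sun", "rain", "rain"]
umbrella = ["no", "no", "yes", "yes"]
coin = ["h", "t", "h", "t"]

mi_dependent = mutual_information(weather, umbrella)
mi_independent = mutual_information(weather, coin)
print(mi_dependent, mi_independent)
```

In this toy data, knowing the weather fully determines the umbrella value (1 bit of shared information), while it says nothing about the coin (0 bits).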

Iris dataset

Pearson’s r (correlation coefficient)
• Numerator: covariance (to what extent the features change together)
• Denominator: product of standard deviations (makes correlations independent of units)
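The numerator/denominator description above can be sketched directly in code. This is an illustrative implementation using population (divide-by-n) covariance and standard deviations; the measurements are made-up numbers, not real Iris values.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: covariance (how much the two features change together)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Denominator: product of standard deviations (removes the units)
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

# Hypothetical measurements (assumed for illustration only)
length_cm = [1.0, 2.0, 3.0, 4.0, 5.0]
width_cm = [0.5, 1.0, 1.5, 2.0, 2.5]

r_cm = pearson_r(length_cm, width_cm)
# Same lengths expressed in millimetres: r should not change
length_mm = [10 * v for v in length_cm]
r_mm = pearson_r(length_mm, width_cm)
print(r_cm, r_mm)
```

Rescaling one feature from centimetres to millimetres changes the covariance and the standard deviation by the same factor, so the ratio stays the same.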

Pearson’s coefficient of Petal Length by Petal Width:

Caveats
• Pearson’s r only measures linear dependency
• Other types of dependency can also be used for prediction!
• Correlation does not imply causation, but it may still enable prediction.
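A small worked example of the first caveat, under the assumption of a purely quadratic relationship: y is fully determined by x, yet Pearson's r is zero because the dependency is not linear. The data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r with population covariance and standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

# y depends perfectly on x, but not linearly
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)
```

A model that can represent the curve could still predict y from x here, which is why a low r alone does not rule out prediction.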

What is machine learning?
“A program is said to learn from experience (E) with respect to a task (T) and a performance measure (P) if its performance at tasks in T, as measured by P, improves with E.”

Supervised Learning
INPUT → OUTPUT

• Classification: output is a class label
• Regression: output is a continuous value

[Figures: Classification | Regression]

Supervised learning workflow
1. Collect data (How do you select your sample? Consider reliability, privacy and other regulations.)
2. Label examples (annotation guidelines, measuring inter-annotator agreement, crowdsourcing)
3. Choose an example representation
• Features: attributes describing examples
  o Numerical
  o Categorical
• Possibly convert to feature vectors
  o A vector is a fixed-size list of numbers
  o Some learning algorithms require examples represented as vectors
4. Train model(s)
• Keep some examples for final evaluation: test set
• Use the rest for
  o Learning: training set
  o Tuning: validation set
5. Evaluate
• Check performance of the tuned model on the test set
• Goal: estimate how well your model will do in the real world
• Keep evaluation realistic!
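Step 4 of the workflow above can be sketched as a simple random split. The function name, the 60/20/20 proportions and the seed are assumptions chosen for illustration; the notes do not prescribe particular fractions.

```python
import random

def split_dataset(examples, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle the examples and split them into training, validation and test sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]                  # held out for the final evaluation
    val = shuffled[n_test:n_test + n_val]     # used for tuning
    train = shuffled[n_test + n_val:]         # used for learning
    return train, val, test

# Hypothetical data set: 100 example identifiers
examples = list(range(100))
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

The three sets are disjoint, so the test set stays untouched until the final evaluation.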

Parameter or model tuning
• Learning algorithms typically have settings (aka hyperparameters)
• For each value of the hyperparameters:
  o Apply the algorithm to the training set to learn a model
  o Check its performance on the validation set
• Choose the best-performing setting
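The tuning loop can be illustrated with a tiny k-nearest-neighbours classifier, where k is the hyperparameter. The choice of k-NN, the candidate values and the one-dimensional data are all assumptions made for this sketch; the notes do not tie tuning to a specific algorithm.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict the majority label among the k training points closest to x."""
    neighbours = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def accuracy(train, data, k):
    """Fraction of examples in data that are classified correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in data)
    return correct / len(data)

# Hypothetical 1-D data: (feature value, class label)
train = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (2.2, "b"),
         (6.0, "b"), (6.5, "b"), (7.0, "b")]
validation = [(1.2, "a"), (1.8, "a"), (2.4, "a"), (6.2, "b")]

# For each hyperparameter value: learn on the training set, score on the validation set
candidate_ks = [1, 3, 5]
scores = {k: accuracy(train, validation, k) for k in candidate_ks}

# Choose the best-performing setting (ties go to the first candidate)
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The chosen setting would then be evaluated once on the separate test set, which plays no role in this loop.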

Unsupervised learning
INPUT (no labelled OUTPUT)

• Clustering: group similar objects
• Dimensionality reduction: reduce the number of random variables (features)

[Figures: Clustering | Dimensionality reduction]

Clustering
Task of grouping a set of objects in such a way that objects in the same group (called a cluster) are
more similar (in some sense or another) to each other than to those in other groups (clusters).
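As one possible illustration of this definition, the sketch below uses k-means on one-dimensional values. k-means is a common clustering algorithm that is not named in this part of the notes, so treat the algorithm choice, the initial centroids and the data as assumptions.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on numbers, starting from the given initial centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical data with two obvious groups
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids, clusters = kmeans_1d(values, centroids=[0.0, 6.0])
print(centroids, clusters)
```

Here "similar" means close in value; other similarity measures would produce other groupings.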

Dimensionality reduction
• Feature selection: reduce the number of features in large data
  o Reduce complexity and ease interpretation
  o Reduce demand on resources (computation / memory)
  o Reduce the ‘curse of dimensionality’
  o Reduce the chance of over-fitting
• Feature extraction: often domain specific
  o Image processing: edge detection
  o From pixels to a reduced set of features
  o Often part of pre-processing, but might contain the hard problems