Week 1
Slides
Data Mining for Business & Governance

Lecture 1: What is Data Mining?
Data mining is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

It is about extracting novel, interesting and potentially useful knowledge.

Main relations to:
• Knowledge discovery in databases
• Machine learning → branch of computer science studying learning from data
• Statistics → branch of mathematics focused on data
• Artificial intelligence → interdisciplinary field aiming to develop intelligent
machines

Key aspects
• Computation vs. large data sets: there is a trade-off to be made between processing time and memory
• Computation enables analysis of large data sets: computers serve as a tool, and as data keeps growing we must design efficient computational methods that extract knowledge from the data and give it meaning.
• Data mining often implies knowledge discovery from databases: going from unstructured data to structured knowledge.
  - Unstructured data: text
  - Semi-structured data: an HTML page, whose tags give us some extra information
  - Structured data: tables

What are large amounts, or big data? (The definition is always changing.)
→ Current opinion: we should have smaller datasets, so we can enrich them and give them a higher quality.

Volume
• Too big for manual analysis
• Too big to fit in RAM
• Too big to store on disk

Variety
• Range of values: variance
• Outliers, confounders and noise
• Different data types

Velocity
• Data changes quickly: results are required before the data changes
• Streaming data (no storage)

Applications of data mining

Companies: business intelligence → market analysis and management
• Target marketing, CRM
• Risk analysis and management
• Forecasting, customer retention, quality control, competitive analysis
• Fraud detection and management
• AH bonus card, Amazon, Mastercard, Booking.com

Science: knowledge discovery → scientific discovery in large data
• DNA: sequence data
• SETI program, time series
• Electronic Health Records
• Social Network Analysis
• Text Mining (natural language processing): going from unstructured text → structured knowledge


What makes prediction possible?
Make use of some structure in the data!
• Associations between features and the target
• Numerical variables: association is measured by the correlation coefficient
• Categorical variables: mutual information → the value of X1 contains information about the value of X2 (see the sketch below)
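
To make the categorical case concrete, here is a minimal sketch (the function name and data are illustrative, not from the lecture) that estimates mutual information from co-occurrence counts; it is 0 when the two variables are independent and grows as one variable reveals more about the other.

```python
# Minimal sketch: mutual information between two categorical variables,
# estimated from co-occurrence counts (illustrative, not from the slides).
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    n = len(xs)
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    pxy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        p_indep = (px[x] / n) * (py[y] / n)
        mi += p_joint * log2(p_joint / p_indep)
    return mi  # in bits; 0 means X tells you nothing about Y

# X fully determines Y here, so MI equals the entropy of Y: 1 bit.
print(mutual_information(["a", "a", "b", "b"], [0, 0, 1, 1]))
```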


Different types of learning
A program is said to learn from experience E on tasks T with performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

• Supervised learning – labels
= You train the machine using data which is well "labeled" → you learn a mapping from the input to the desired output.

  - Classification: because we have a label, we can try to get a model to classify different classes of diseases.
  - Regression: when the target is numerical, e.g. specifying the risk of getting a disease (see the sketch below).
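
A minimal sketch of both flavors (the data and the scikit-learn models are my choices, not prescribed by the lecture): the same labeled-input setup drives both, only the type of target differs.

```python
# Minimal sketch: classification vs. regression on made-up patient data.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

X = [[60, 1], [80, 0], [95, 1], [70, 0]]  # hypothetical patient features

# Classification: the label is a disease class (categorical).
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, ["healthy", "sick", "sick", "healthy"])
print(clf.predict([[85, 1]]))

# Regression: the target is a numerical risk score.
reg = LinearRegression()
reg.fit(X, [0.1, 0.7, 0.9, 0.3])
print(reg.predict([[85, 1]]))
```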





• Unsupervised learning – no labels
= We don't know anything about the data; you are not aiming to produce output in response to the input. Instead, you want to discover patterns in the data.

  - Dimensionality reduction: with a large number of attributes, we can try to reduce them to the most relevant/interesting ones.
  - Clustering: you investigate similar groups of patients (see the sketch below).
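
A minimal sketch of both ideas together (the synthetic data and library choice are mine): reduce the attributes to two informative directions, then cluster the unlabeled examples into similar groups.

```python
# Minimal sketch: dimensionality reduction (PCA) + clustering (k-means)
# on synthetic, unlabeled "patient" data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two hidden groups of 50 "patients", each described by 10 attributes.
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(3, 1, (50, 10))])

# Dimensionality reduction: keep the 2 most informative directions.
X2 = PCA(n_components=2).fit_transform(X)

# Clustering: assign each patient to one of 2 groups, without labels.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X2)
print(labels[:5], labels[-5:])  # the two hidden groups separate cleanly
```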

Inductive learning: the algorithm learns from samples / training data / trial and error.

Supervised learning workflow

1. Collect data
• How do you select your sample?
• Reliability of measurement
• Privacy and other regulations

2. Label examples
• Annotation guidelines
• Measure inter-annotator agreement
• Crowdsourcing

3. Choose representation
• Features: attributes describing examples
  - Numerical or categorical (binary)
• Possibly convert to a feature vector
  - A vector is a fixed-size list of numbers
  - A feature vector describes the object that you want to use
  - Some learning algorithms require examples to be represented as vectors → spectra representation

4. Train model(s)
• Keep some examples for final evaluation: the test set
• Use the rest for:
  - Learning: training set
  - Tuning: validation set
• Decision tree models, neural networks, etc.

Parameter or model tuning (see the sketch after this workflow)
• Learning algorithms typically have settings (aka hyperparameters)
• For each value of the hyperparameters:
  - Apply the algorithm to the training set to learn
  - Check performance on the validation set
• Find/choose the best-performing setting

5. Evaluate
• Check performance of the tuned model on the test set
• Goal: estimate how well your model will do in the real world
• Keep evaluation realistic
• You want your data balanced; it is bad if one group is over- or underrepresented → learn to create a representative sample, e.g. by downsampling the data
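
The tuning loop in steps 4-5 can be sketched as follows (the dataset, model, and split sizes are illustrative choices, not prescribed by the lecture): hold out a test set, pick the hyperparameter value that performs best on the validation set, and evaluate the tuned model once on the test set.

```python
# Minimal sketch of the workflow: train / validation / test splits plus
# a hyperparameter search (illustrative model and data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Keep some examples for final evaluation: the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# Use the rest for learning (training set) and tuning (validation set).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# For each hyperparameter value: learn on the training set,
# then check performance on the validation set.
best_depth, best_acc = None, 0.0
for depth in (1, 2, 4, 8, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    acc = model.fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Evaluate the best-performing setting once on the untouched test set.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final.fit(X_train, y_train)
print("best depth:", best_depth, "test accuracy:", final.score(X_test, y_test))
```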





Correlation Coefficient
Pearson's r measures the strength of a linear relationship (dependency):

r = cov(X, Y) / (σ_X · σ_Y)

Pearson's correlation coefficient
• Numerator: covariance → to what extent do the features change together?
• Denominator: product of standard deviations → makes the correlation independent of units



Covariance and correlation

Covariance indicates the relationship between two variables when one of them changes: if an increase in one variable results in an increase in the other variable, the two variables are said to have a positive covariance.
→ The sign corresponds to the direction of the linear relationship.

The magnitude of the covariance is not easy to interpret, because it depends on the units of the variables.

The correlation coefficient is normalized and corresponds to the strength of the linear relation: divide the covariance by the product of the variables' standard deviations (see the sketch below).
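
A minimal numeric sketch (the data is made up): the covariance of two variables is hard to read on its own, but dividing it by the product of the standard deviations yields Pearson's r on the fixed scale [-1, 1].

```python
# Minimal sketch: covariance vs. the normalized correlation coefficient.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])  # roughly y = 2x

cov = np.mean((x - x.mean()) * (y - y.mean()))  # depends on the units
r = cov / (x.std() * y.std())                   # normalized to [-1, 1]

print(cov)                      # magnitude is hard to interpret
print(r)                        # close to +1: strong positive linear relation
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in value agrees
```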




