Summary DMfB&G

All the needed information

Preview 4 out of 47  pages

  • September 9, 2022
  • 47
  • 2021/2022
  • Summary
Lecture 1: Introduction to Data Mining
What is data mining?
“Data mining is the study of collecting, cleaning, processing, analyzing, and gaining useful
insights from data”.

It is an umbrella term, and the methods used relate to different disciplines:
- Knowledge discovery in databases
- Statistics
- Artificial intelligence (important)
- Machine learning perspective (important)

Key aspects:
- Computation vs large data sets:
Trade-off between processing time and memory
- Computation enables analysis of large data sets:
Computers are the tool that keeps analysis feasible as data grows
- Data mining often implies knowledge discovery from databases:
From unstructured data to structured knowledge

What are large amounts of data, or Big Data?
(It is not only about the size of the data, which is the volume, but also about its complexity.)
Volume:
- Too big for manual analysis
- Too big to fit in RAM
- Too big to store on disk

Variety:
- Range of values: variance
- Outliers, confounders and noise
- Different data types

Velocity:
- Data changes quickly: results are required before the data changes
- Streaming data (no storage)




Application of data mining




Overview of basic data types
Data points are described within a certain domain. The key question is whether there is any relationship (dependency) between them or not.




How does it work? The general data mining pipeline




The steps above depend on the problem as well as the approach. Some approaches do not
require an explicit feature extraction step.


What makes prediction possible?
Fitting data is easy, but predictions are hard.
- Associations between features and target (how are the points related/associated?)
- Numerical: correlation coefficient
- Categorical: mutual information. The value of x1 contains information about the value of x2
(for example, sports cars are often red, so knowing the car type tells you something about the colour)
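As a rough illustration of the categorical case, the sketch below estimates mutual information (in bits) from simple counts. The function name `mutual_information` and the car data are made up for this example; they do not come from the lecture.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Estimate mutual information (in bits) between two categorical sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data: car type and colour.
car_type = ["sport"] * 4 + ["family"] * 4

# Colour fully determined by car type: knowing x1 tells you x2.
colour_dependent = ["red"] * 4 + ["grey"] * 4
mi_dependent = mutual_information(car_type, colour_dependent)

# Colour spread evenly over both car types: x1 says nothing about x2.
colour_independent = ["red", "red", "grey", "grey"] * 2
mi_independent = mutual_information(car_type, colour_independent)

print(mi_dependent, mi_independent)
```

With the fully dependent colours the result is 1 bit, and with the evenly spread colours it is 0 bits.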

Statistical descriptions of data
1. Measures of central tendency:
- Mean: average
- Median: the middle value in a set of ordered data values
- Mode: the mode for a set of data is the value that occurs most frequently in the set
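The three measures above can be checked quickly with Python's standard `statistics` module. The numbers in `values` are made up for illustration.

```python
import statistics

# Hypothetical small data set.
values = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(values)      # sum / count = 30 / 6
median_value = statistics.median(values)  # average of the two middle values, 3 and 5
mode_value = statistics.mode(values)      # most frequent value

print(mean_value, median_value, mode_value)
```

Here the mean is 5, the median is 4.0 and the mode is 3, which shows that the three measures do not have to agree.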




2. Measuring the spread of data (five-number summary: min, Q1, median, Q3, max):
- Range: difference between the max() and min() values
- Quantiles: points taken at regular intervals of a data distribution, dividing it into
consecutive sets of essentially equal size. The 2-quantile is the median, the 4-quantiles
are quartiles (three cut points Q1, Q2, Q3), and the 100-quantiles are percentiles.
- Interquartile range: IQR = Q3 - Q1
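A small sketch of these spread measures, using `statistics.quantiles` from the standard library. The data is made up, and note that different quantile conventions can give slightly different quartiles; this example relies on the function's default method.

```python
import statistics

# Hypothetical data set.
data = [1, 2, 3, 4, 5, 6, 7]

data_range = max(data) - min(data)

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(data_range, q1, q2, q3, iqr)
```

For this data the range is 6, the quartiles are 2.0, 4.0 and 6.0, and the IQR is 4.0. Q2 is the same value as the median.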




Basic plots: box plot
A box plot shows Q1, the median, Q3, and the min and max values, as well as outliers: points
that lie more than 1.5 × IQR below Q1 or above Q3.
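The outlier rule used in a box plot can be sketched without any plotting. The data below is made up, and the quartiles come from the default method of `statistics.quantiles`; plotting libraries may compute quartiles slightly differently.

```python
import statistics

# Hypothetical data with one suspiciously large value.
data = [1, 2, 3, 4, 5, 6, 7, 30]

q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences: anything outside them is drawn as an individual outlier point.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
inliers = [x for x in data if lower_fence <= x <= upper_fence]

# The whiskers of the box plot end at the most extreme non-outlier values.
whisker_low, whisker_high = min(inliers), max(inliers)

print(outliers, whisker_low, whisker_high)
```

Only the value 30 falls outside the fences here, so the whiskers run from 1 to 7.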




3. Measuring the dispersion of data
- Variance σ²: the average squared distance of each value from the mean, which also reflects
how far the values lie from one another.
- Standard deviation σ: the square root of the variance, showing how dispersed the data is in
relation to the mean. A low standard deviation means the data are clustered around the mean,
and a high standard deviation indicates the data are more spread out.
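A minimal worked example of variance and standard deviation, using the population versions (`pvariance`, `pstdev`) from the standard `statistics` module, which match the σ notation. The data is made up for illustration.

```python
import statistics

# Hypothetical data set with mean 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = statistics.mean(values)

# Population variance: average of the squared distances to the mean.
variance = statistics.pvariance(values)

# Population standard deviation: square root of the variance.
std_dev = statistics.pstdev(values)

# The same variance written out by hand, to show what is being computed.
manual_variance = sum((x - mean_value) ** 2 for x in values) / len(values)

print(mean_value, variance, std_dev)
```

The squared distances sum to 32 over 8 values, so the variance is 4 and the standard deviation is 2. When the data is only a sample, `statistics.variance` and `statistics.stdev` divide by n - 1 instead.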




Basic plots: scatter plot




Correlation coefficient
Pearson's r measures the strength of the linear relationship (dependency) between two
features, i.e. how they move together (1 or -1 = perfectly aligned, 0 = no linear relationship).




Pearson's correlation coefficient
- Numerator: covariance. To what extent the features change together.
- Denominator: product of the standard deviations. Makes the correlation independent of
units.
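The numerator and denominator described above can be written out directly. This is only a sketch: the function name `pearson_r` and the height/weight numbers are made up, and the second call shows that changing the unit (cm to m) leaves r unchanged.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: covariance, how much the two features change together.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Denominator: product of the standard deviations, which removes the units.
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return covariance / (std_x * std_y)


# Hypothetical data: the same heights in two different units.
heights_cm = [150, 160, 170, 180, 190]
heights_m = [h / 100 for h in heights_cm]
weights_kg = [52, 58, 69, 74, 85]

r_cm = pearson_r(heights_cm, weights_kg)
r_m = pearson_r(heights_m, weights_kg)

print(r_cm, r_m)
```

Both calls give the same value up to rounding, and a perfectly linear pair such as [1, 2, 3] and [2, 4, 6] gives r = 1.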




Who am I buying these notes from?

Stuvia is a marketplace, so you are not buying this document from us, but from seller adata. Stuvia facilitates payment to the seller.
