Summary - Data Mining (880022-M-6): Comprehensive Summary: Data Mining and Machine Learning Techniques
Course: Data Mining (880662M6)
Institution: Tilburg University (UVT)
Lecture 1: Introduction and preliminaries
Pattern classification
- Figure 1.1 is our dataset; we have to organise it through features and instances.
o Features: the variables describing the problem. In figure 1.1, X1, X2 and X3 are the features describing the problem. Y is a special feature that we call the decision feature or the target.
o Instances (examples): placed by row. For each instance we have a set of features or variables describing the instance, together with the value of the target.
- This is why we call it a supervised classification problem.
o It is supervised because we have knowledge about the target: every instance is related to a target value.
o We need to build a model such that, if we provide X1 = 0.5, X2 = 0.9 and X3 = 0.5, the model is able to predict / produce c1 (the value of Y).
Figure 1.1: Table Pattern Classification
- Multi-class classification: multi-class because we have a target and this target has three possible values = multiple decision classes.
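To make the supervised setup concrete, here is a minimal sketch in plain Python. The training values and class labels below are made up for illustration, and the 1-nearest-neighbour model is just one simple choice, not the course's prescribed method: given the features of a new instance, it predicts the target of the closest stored instance.

```python
import math

# Toy dataset in the spirit of figure 1.1 (all values are made up):
# each instance is (features (x1, x2, x3), target y).
train = [
    ((0.5, 0.9, 0.5), "c1"),
    ((0.1, 0.2, 0.9), "c2"),
    ((0.8, 0.7, 0.1), "c3"),
    ((0.4, 0.8, 0.6), "c1"),
]

def predict(x, data):
    """1-nearest-neighbour: return the target of the closest training instance."""
    nearest = min(data, key=lambda inst: math.dist(inst[0], x))
    return nearest[1]

print(predict((0.5, 0.9, 0.5), train))  # → c1
```

Providing the feature values from the example above, the model produces c1, because that instance (or one very close to it) was seen during training.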
Missing values
- In figure 1.2 we have missing values; we denote these with "?".
- The reasons can differ:
o An error when measuring the data.
o The information is not applicable to this particular case.
o Something else went wrong during collection.
- Different strategies to cope with missing data:
o Removing the feature containing the missing values. If we do this in figure 1.2, only X1 will remain.
o Removing the instances which have missing values.
o Neither removal is advised, because you can end up with no features, a limited number of features, or you can miss relevant information.
Figure 1.2: Missing values dataset
o Replacing the missing values with the most popular value per column / per feature. If the feature is numerical, it can just be the average (mean); if the feature is categorical, it can be the mode (the value that appears most often). This is the most popular of the replacement strategies.
o These strategies, however, can induce noise: we are completing the data with information that we don't actually know.
o There are fancier strategies to fill missing data.
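The mean/mode replacement strategy can be sketched in plain Python (the column values below are made up; in practice a library such as pandas would do this per column):

```python
from statistics import mean, mode

# Toy columns with missing entries marked None (values are made up).
numeric_col = [0.5, None, 0.9, 0.2, None]
categorical_col = ["a", "b", None, "a", "a"]

def impute_numeric(col):
    """Replace missing numerical values with the column mean (average)."""
    observed = [v for v in col if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in col]

def impute_categorical(col):
    """Replace missing categorical values with the mode (most frequent value)."""
    observed = [v for v in col if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in col]

print(impute_numeric(numeric_col))      # gaps filled with the mean of 0.5, 0.9, 0.2
print(impute_categorical(categorical_col))  # gap filled with the mode "a"
```

Note how the filled values are guesses derived from the observed data — this is exactly the noise the notes warn about.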
o Using neural networks (a type of machine learning model inspired by the way the human brain works):
An autoencoder is a neural network in which we have two blocks, the encoder and the decoder.
In figure 1.3 we see the general architecture, starting with the input layer of neurons. Neurons are the neural processing entities; this input layer captures the information.
Figure 1.3: The general architecture
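The encoder/decoder idea can be sketched with a tiny linear autoencoder in NumPy. Everything here — the made-up data, the layer sizes, the learning rate, the lack of biases and activation functions — is an illustrative simplification, not the course's implementation: the encoder compresses each instance into a smaller code, the decoder tries to reconstruct the original, and training reduces the reconstruction error. A network trained this way on complete instances can then suggest values for missing ones via its reconstructions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 3))  # 20 toy instances with 3 features (made-up data)

# Encoder maps 3 features to a 2-dimensional code; decoder maps back to 3.
W_enc = rng.normal(scale=0.1, size=(2, 3))
W_dec = rng.normal(scale=0.1, size=(3, 2))

def reconstruction_error(W_enc, W_dec, X):
    """Mean squared error between the inputs and their reconstructions."""
    X_hat = (W_dec @ (W_enc @ X.T)).T
    return float(np.mean((X_hat - X) ** 2))

before = reconstruction_error(W_enc, W_dec, X)
lr = 0.05
for _ in range(500):
    for x in X:
        h = W_enc @ x        # encode: the bottleneck representation
        x_hat = W_dec @ h    # decode: attempt to reconstruct the input
        err = x_hat - x
        # gradient steps on the squared reconstruction error
        g_dec = np.outer(err, h)
        g_enc = np.outer(W_dec.T @ err, x)
        W_dec -= lr * g_dec
        W_enc -= lr * g_enc

after = reconstruction_error(W_enc, W_dec, X)
print(before, after)  # the error shrinks as the autoencoder learns
```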
Feature scaling
- Why do we need feature scaling?
o Feature scaling is only applicable to numerical data. In a dataset with a number of numerical features, it is very unlikely that all those features are expressed in the same interval, in the same domain.
o For example, suppose feature X1 can take values between 1 and 5, while X2 takes values between 1 and 1000. They are expressed on different scales, so methods that compare raw values (distance-based ones, for instance) would be dominated by X2 and give wrong answers.
o This is the reason why we first need to standardise or normalise, to ensure that all features are expressed on the same scale.
- Two strategies for feature scaling:
o Normalisation
o Standardisation
- Normalisation (1.4)
o Formula: x' = (x - min) / (max - min), applied feature by feature.
o We take every possible value in a column and subtract the minimum value observed in that column. Later on, we divide by the denominator: the maximum value observed in the column minus the minimum one.
Figure 1.4: Normalisation formula
- Standardisation (1.5)
o Similar to normalisation, but we use the mean and the standard deviation computed per column: z = (x - mean) / standard deviation.
Figure 1.5: Standardisation formula
Figure 1.6: Normalisation versus standardisation
- In the original data, the first feature (the x axis) has a domain between 0 and 3; the y axis has a domain between 0 and 2.
- In the case of normalisation, everything is enclosed in the unit interval: the values produced by the normalisation strategy will always be between 0 and 1.
- In the case of standardisation we can get negative values; this is one of the main differences between the two strategies.
- But we are not changing the properties of the data, just the scale in which the data is represented.
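Both strategies side by side, as a small sketch in plain Python (the toy column is made up):

```python
from statistics import mean, pstdev

col = [0.0, 1.0, 2.0, 3.0]  # a toy feature (values are made up)

# Min-max normalisation: (x - min) / (max - min) -> values in [0, 1]
lo, hi = min(col), max(col)
normalised = [(x - lo) / (hi - lo) for x in col]

# Standardisation: (x - mean) / std -> mean 0, so negative values appear
mu, sigma = mean(col), pstdev(col)
standardised = [(x - mu) / sigma for x in col]

print(normalised)    # every value lands between 0 and 1
print(standardised)  # centred on 0; values below the mean become negative
```

The ordering of the values is preserved in both cases — only the scale changes, as the notes say.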
Feature interaction
- An important task is discovering the interactions between features.
- Pearson's correlation can be used to describe the relationship between two numerical features. They must both be numerical though!
- An example: the correlation between gender (encoded as a numerical feature) and income in Sweden.
Figure 1.7: Three ways of correlation between two numerical variables
- The closer the dots are to the line, the stronger the correlation.
- Pearson's correlation takes values between -1 and 1.
Pearson's correlation
- Formula: r = Σ(xi - x̄)(yi - ȳ) / √( Σ(xi - x̄)² · Σ(yi - ȳ)² )
- xi = each of the values associated to the first feature / column.
- x̄ (the x with the line above it) = the mean, so the average value of x.
- yi = the second feature we have, so the different values of the second column.
- ȳ = the mean, so the average value of y.
- The terms in the denominator (the formula under the line) are the familiar sums of squared deviations; the numerator (above the line) is the sum of the products of the deviations.
Figure 1.8: Pearson's correlation
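The formula translates directly into code; a small sketch with made-up values:

```python
from math import sqrt

# Two toy numerical features (values are made up).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]  # perfectly linearly related to xs

def pearson(x, y):
    """Pearson's r: sum of deviation products over the product of spreads."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - x_bar) ** 2 for xi in x)
               * sum((yi - y_bar) ** 2 for yi in y))
    return num / den

print(pearson(xs, ys))  # → 1.0, a perfect positive linear relationship
```

Negating one feature flips the sign of r to -1, illustrating the -1 to 1 range mentioned above.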