Summary - Data Mining (880022-M-6): Comprehensive Summary: Data Mining and Machine Learning Techniques
Course: Data Mining (880662M6)
Institution: Tilburg University (UVT)
Lecture 1: Introduction and preliminaries
Pattern classification
- Figure 1.1 is our dataset; we have to organise it through features and instances.
o Features: the variables describing the problem. In figure 1.1, X1, X2 and X3 are the features describing the problem. Y is a special feature that we call the decision feature or the target.
o Instances (examples): placed by row. For each instance we have a set of features or variables describing the instance, together with the value of the target.
- This is why we call it a supervised classification problem.
o It is supervised because we have knowledge about the target: every instance is related to a target value.
o We need to build a model such that, if we provide X1 = 0.5, X2 = 0.9 and X3 = 0.5, the model is able to predict / produce c1 (the value of Y).
Figure 1.1: Table Pattern Classification
- Multi-class classification: multi-class because we have a target and this target has three possible values = multiple decision classes.
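To make the supervised setup concrete, here is a minimal sketch in plain Python. The training values and class labels below are made up for illustration, and the 1-nearest-neighbour model is just one simple choice, not the course's prescribed method: given the features of a new instance, it predicts the target of the closest stored instance.

```python
import math

# Toy dataset in the spirit of figure 1.1 (all values are made up):
# each instance is (features (x1, x2, x3), target y).
train = [
    ((0.5, 0.9, 0.5), "c1"),
    ((0.1, 0.2, 0.9), "c2"),
    ((0.8, 0.7, 0.1), "c3"),
    ((0.4, 0.8, 0.6), "c1"),
]

def predict(x, data):
    """1-nearest-neighbour: return the target of the closest training instance."""
    nearest = min(data, key=lambda inst: math.dist(inst[0], x))
    return nearest[1]

print(predict((0.5, 0.9, 0.5), train))  # → c1
```

Providing the feature values from the example above, the model produces c1, because that instance (or one very close to it) was seen during training.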
Missing values
- In figure 1.2 we have missing values; we denote these with "?".
- The reasons can differ:
o An error when measuring the data.
o The information is not applicable to this particular case.
o Something else went wrong during collection.
- Different strategies to cope with missing data:
o Removing the feature containing the missing values. If we do this in figure 1.2, only X1 will remain.
o Removing the instances which have missing values.
o Neither removal is advised, because you can end up with no features, a limited number of features, or you can miss relevant information.
Figure 1.2: Missing values dataset
o Replacing the missing values with the most popular value per column / per feature. If the feature is numerical, it can just be the average (mean); if the feature is categorical, it can be the mode (the value that appears most often). This is the most popular of the replacement strategies.
o These strategies, however, can induce noise: we are completing the data with information that we don't actually know.
o There are fancier strategies to fill missing data.
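The mean/mode replacement strategy can be sketched in plain Python (the column values below are made up; in practice a library such as pandas would do this per column):

```python
from statistics import mean, mode

# Toy columns with missing entries marked None (values are made up).
numeric_col = [0.5, None, 0.9, 0.2, None]
categorical_col = ["a", "b", None, "a", "a"]

def impute_numeric(col):
    """Replace missing numerical values with the column mean (average)."""
    observed = [v for v in col if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in col]

def impute_categorical(col):
    """Replace missing categorical values with the mode (most frequent value)."""
    observed = [v for v in col if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in col]

print(impute_numeric(numeric_col))      # gaps filled with the mean of 0.5, 0.9, 0.2
print(impute_categorical(categorical_col))  # gap filled with the mode "a"
```

Note how the filled values are guesses derived from the observed data — this is exactly the noise the notes warn about.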
o Using neural networks (a type of machine learning model inspired by the way the human brain works):
An autoencoder is a neural network in which we have two blocks, the encoder and the decoder.
In figure 1.3 we see the general architecture, starting with the input layer of neurons. Neurons are the neural processing entities; this input layer captures the information.
Figure 1.3: The general architecture
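The encoder/decoder idea can be sketched with a tiny linear autoencoder in NumPy. Everything here — the made-up data, the layer sizes, the learning rate, the lack of biases and activation functions — is an illustrative simplification, not the course's implementation: the encoder compresses each instance into a smaller code, the decoder tries to reconstruct the original, and training reduces the reconstruction error. A network trained this way on complete instances can then suggest values for missing ones via its reconstructions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 3))  # 20 toy instances with 3 features (made-up data)

# Encoder maps 3 features to a 2-dimensional code; decoder maps back to 3.
W_enc = rng.normal(scale=0.1, size=(2, 3))
W_dec = rng.normal(scale=0.1, size=(3, 2))

def reconstruction_error(W_enc, W_dec, X):
    """Mean squared error between the inputs and their reconstructions."""
    X_hat = (W_dec @ (W_enc @ X.T)).T
    return float(np.mean((X_hat - X) ** 2))

before = reconstruction_error(W_enc, W_dec, X)
lr = 0.05
for _ in range(500):
    for x in X:
        h = W_enc @ x        # encode: the bottleneck representation
        x_hat = W_dec @ h    # decode: attempt to reconstruct the input
        err = x_hat - x
        # gradient steps on the squared reconstruction error
        g_dec = np.outer(err, h)
        g_enc = np.outer(W_dec.T @ err, x)
        W_dec -= lr * g_dec
        W_enc -= lr * g_enc

after = reconstruction_error(W_enc, W_dec, X)
print(before, after)  # the error shrinks as the autoencoder learns
```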
Feature scaling
- Why do we need feature scaling?
o Feature scaling is only applicable to numerical data. In a dataset with a number of numerical features, it is very unlikely that all those features are expressed in the same interval, in the same domain.
o For example, suppose feature X1 can take values between 1 and 5, while X2 takes values between 1 and 1000. They are expressed on different scales, so methods that compare raw values (distance-based ones, for instance) would be dominated by X2 and give wrong answers.
o This is the reason why we first need to standardise or normalise, to ensure that all features are expressed on the same scale.
- Two strategies for feature scaling:
o Normalisation
o Standardisation
- Normalisation (1.4)
o Formula: x' = (x - min) / (max - min), applied feature by feature.
o We take every possible value in a column and subtract the minimum value observed in that column. Later on, we divide by the denominator: the maximum value observed in the column minus the minimum one.
Figure 1.4: Normalisation formula
- Standardisation (1.5)
o Similar to normalisation, but we use the mean and the standard deviation computed per column: z = (x - mean) / standard deviation.
Figure 1.5: Standardisation formula
Figure 1.6: Normalisation versus standardisation
- In the original data, the first feature (the x axis) has a domain between 0 and 3; the y axis has a domain between 0 and 2.
- In the case of normalisation, everything is enclosed in the unit interval: the values produced by the normalisation strategy will always be between 0 and 1.
- In the case of standardisation we can get negative values; this is one of the main differences between the two strategies.
- But we are not changing the properties of the data, just the scale in which the data is represented.
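Both strategies side by side, as a small sketch in plain Python (the toy column is made up):

```python
from statistics import mean, pstdev

col = [0.0, 1.0, 2.0, 3.0]  # a toy feature (values are made up)

# Min-max normalisation: (x - min) / (max - min) -> values in [0, 1]
lo, hi = min(col), max(col)
normalised = [(x - lo) / (hi - lo) for x in col]

# Standardisation: (x - mean) / std -> mean 0, so negative values appear
mu, sigma = mean(col), pstdev(col)
standardised = [(x - mu) / sigma for x in col]

print(normalised)    # every value lands between 0 and 1
print(standardised)  # centred on 0; values below the mean become negative
```

The ordering of the values is preserved in both cases — only the scale changes, as the notes say.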
Feature interaction
- An important task is discovering the interactions between features.
- Pearson's correlation can be used to describe the relationship between two numerical features. They must both be numerical though!
- An example: the correlation between gender (encoded as a numerical feature) and income in Sweden.
Figure 1.7: Three ways of correlation between two numerical variables
- The closer the dots are to the line, the stronger the correlation.
- Pearson's correlation takes values between -1 and 1.
Pearson's correlation
- Formula: r = Σ(xi - x̄)(yi - ȳ) / √( Σ(xi - x̄)² · Σ(yi - ȳ)² )
- xi = each of the values associated to the first feature / column.
- x̄ (the x with the line above it) = the mean, so the average value of x.
- yi = the second feature we have, so the different values of the second column.
- ȳ = the mean, so the average value of y.
- The terms in the denominator (the formula under the line) are the familiar sums of squared deviations; the numerator (above the line) is the sum of the products of the deviations.
Figure 1.8: Pearson's correlation
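The formula translates directly into code; a small sketch with made-up values:

```python
from math import sqrt

# Two toy numerical features (values are made up).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]  # perfectly linearly related to xs

def pearson(x, y):
    """Pearson's r: sum of deviation products over the product of spreads."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - x_bar) ** 2 for xi in x)
               * sum((yi - y_bar) ** 2 for yi in y))
    return num / den

print(pearson(xs, ys))  # → 1.0, a perfect positive linear relationship
```

Negating one feature flips the sign of r to -1, illustrating the -1 to 1 range mentioned above.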