10.31 Lec6 Clustering & Association rules
Significant points of this lecture (SP):
• Revisit: Supervised vs. unsupervised learning
• kNN
• Clustering
• Apriori & association rules
• Recommender system
Highlight:
1 Revisit: Supervised and unsupervised model
A supervised model (= predictive data mining) means you discover patterns in a training
set to predict the value of a target variable for items in a test set (discrete target variable:
classification; continuous target variable: regression), whereas an unsupervised
model (= descriptive data mining) means you discover regularities in data without any
notion of a target variable.
Classification, regression, and causal modeling generally are solved with supervised
methods. Similarity matching, link prediction, and data reduction could be either.
Clustering, co-occurrence grouping, and profiling generally are unsupervised. The
fundamental principles of data mining that we will present underlie all these types of
technique.
2 kNN
• GOAL = find the k instances that are most similar to the data point
• Attention: [the importance of standardization] Numeric attributes may have
vastly different ranges, and unless they are scaled appropriately the effect of
one attribute with a wide range can swamp the effect of another with a much
smaller range.
• Choice of k and weighted voting:
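The standardization point above can be sketched in plain Python. This is an illustrative z-score scaler (the function name and sample data are ours, not from the lecture); without it, the wide-range attribute would dominate any distance computation:

```python
def standardize(column):
    """Scale a list of numbers to mean 0 and (population) standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    return [(x - mean) / std for x in column]

incomes = [30000, 45000, 60000, 75000, 90000]  # wide range
ages = [25, 32, 40, 48, 55]                    # narrow range

# After standardization both attributes live on comparable scales,
# so neither swamps the other in a distance calculation.
print(standardize(incomes))
print(standardize(ages))
```

After this transformation each attribute contributes roughly equally to Euclidean or cosine distance.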
2.1 Similarity measures and an example of cosine distance:
Another example: for two data points (2, 2) and (8, 8),
d = 1 − (2·8 + 2·8) / (√(2² + 2²) · √(8² + 8²))
d = 1 − 32/32
d = 0
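The worked example can be checked with a short Python sketch (the function name is ours, not from the lecture):

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# (2, 2) and (8, 8) point in the same direction, so the cosine
# distance is ~0.0 (up to floating-point rounding).
print(cosine_distance((2, 2), (8, 8)))
```

Note that cosine distance ignores magnitude: it only measures the angle between the two vectors, which is why (2, 2) and (8, 8) are at distance 0.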
2.2 Issues, advantages, and disadvantages of kNN:
• It’s comprehensible: a justification can be given in terms of the model and the data instances.
• Computational efficiency: training time = 0. As a “lazy learner”, it waits until a
prediction is asked for.
• Curse of dimensionality: kNN always takes all features into account to calculate
the similarity. Therefore [selection of features]: having too many attributes, or
many that are irrelevant to the similarity judgment, hurts performance, which
demands a data scientist’s domain knowledge.
• Nature of attributes: 1) scaling of attributes; 2) dummy encoding of categorical attributes
The advantages and disadvantages of kNN:
Advantages
1. Simplicity and Intuitiveness: kNN is incredibly straightforward and easy to
understand, making it a good starting point for algorithm learning and
application.
2. No Training Phase: kNN is a lazy learner, meaning it doesn't learn a
discriminative function from the training data but memorizes the training
dataset instead.
3. Versatility: It can be used for both classification and regression problems.
Disadvantages
1. Scalability: kNN can be computationally expensive, especially with large
datasets, as the distance needs to be calculated between each test sample and
all training samples.
2. Curse of Dimensionality: kNN suffers significantly as the dimensionality of the
data increases because it becomes difficult to compute distances in high-
dimensional space.
3. Optimal k Value: Selecting the optimal value of k is crucial for the
performance of the algorithm, and it can be computationally intensive to
find this value.
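Putting the pieces above together, here is a minimal kNN classifier sketch in plain Python (the toy data and function name are ours; a real application would standardize the features first, as noted earlier):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (features, label) pairs; distance is Euclidean.
    No training phase: all work happens at prediction time (lazy learning)."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (2, 2)))   # "A"
print(knn_predict(train, (8, 7)))   # "B"
```

The cost structure matches the disadvantages above: every prediction scans the full training set, so the method gets expensive as the data grows.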
3 Clustering
• Goal: dividing data into clusters such that there is maximal similarity between
items within a cluster and maximal dissimilarity between items of
different clusters.
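One standard way to pursue this goal is k-means. The sketch below implements Lloyd's algorithm in plain Python (illustrative code and names, not from the lecture): repeatedly assign each point to its nearest centroid, then move each centroid to the mean of its assigned points.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's k-means: alternate between assigning points to their
    nearest centroid and recomputing each centroid as the cluster mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (empty clusters keep their old centroid).
        centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c
                     else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
```

On this well-separated toy data the two clusters recover the two natural groups; in general k-means only finds a local optimum and is sensitive to the initial centroids.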