Cluster analysis

Lecture notes
Cluster analysis is an unsupervised learning technique; these notes cover the concepts of cluster analysis thoroughly.


  • May 12, 2023
  • 4
  • 2022/2023
  • Lecture notes
  • Alexandar
  • All classes
Seller: hasithakutala
W4&W5_Cluster Analysis
31 January 2023 15:59



Clustering :
The process of partitioning a data set into subsets (clusters) on the basis of similarity or proximity under some defined distance measure.

Compact cluster :
A group of data points that are tightly packed together and have a distinct boundary from other clusters.
Algorithms that aim to create compact clusters :
k-means, hierarchical clustering, and DBSCAN.
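As a toy illustration of how such an algorithm pulls points into compact clusters, here is a minimal k-means sketch in pure Python. The data, initial centers, and iteration count are illustrative assumptions, not from the notes:

```python
from math import dist

def kmeans(points, centers, iters=10):
    """Lloyd's algorithm: alternate nearest-center assignment and centroid update."""
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # update step: each center moves to the centroid of its cluster
        centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# two visibly separated groups of points
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (9, 9)])
```

Each iteration shrinks the within-cluster spread, which is exactly the "compactness" objective k-means optimises.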

Center vs centroid :
The center of a cluster is the point at the geometric center of all the data points in the cluster.

The centroid of a cluster is the average of all the data points in the cluster. (The centroid is preferred as it is less sensitive to outliers.)
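The centroid is just the coordinate-wise mean of the cluster's points; a quick Python sketch on toy values:

```python
# toy 2-D cluster; the points are illustrative
cluster = [(1, 2), (3, 4), (5, 6)]

# centroid = coordinate-wise mean over all points in the cluster
centroid = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
# centroid == (3.0, 4.0)
```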

Chained cluster :
A type of hierarchical clustering where the clusters are built incrementally in a sequence or chain.
The algorithm starts with a single data point as a cluster and then sequentially adds more data points to the cluster based on some similarity measure until a stopping criterion is met.
- Useful when the data is not easily partitionable into discrete clusters, or where there are natural groupings that are not immediately obvious.

- Drawback : computationally expensive.
- Because it adds data points sequentially, it is sensitive to the order in which the data points are added.

Types of algorithm :
1. Hierarchical clustering :
   a. Agglomerative
   b. Divisive
2. Partitional clustering

Hierarchical Clustering :
- Builds a hierarchy of clusters by iteratively grouping together the closest data points or clusters until a stopping criterion is met.
- Types of hierarchical clustering :
1. Agglomerative clustering :
starts with each data point as its own cluster and then repeatedly merges the two closest clusters into a larger one until all data points are in a single cluster.
2. Divisive clustering :
starts with all data points in a single cluster and then recursively splits the clusters into smaller clusters until each data point is in its own cluster.

- Both methods produce a dendrogram, a tree-like diagram that shows the hierarchical relationship between clusters. Each leaf node represents a single data point and the internal nodes represent clusters of data points.
- The distance metric defines the similarity measure between pairs of data points, and the linkage method determines how the distance between clusters is calculated.
Distance metrics : Euclidean distance, Manhattan distance and cosine distance.
Linkage methods : single linkage, complete linkage and average linkage.

- Hierarchical clustering can be used for exploratory data analysis, visualization and feature engineering.
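The agglomerative variant can be sketched in a few lines of Python. Single linkage, the toy 1-D data, and stopping at two clusters are all illustrative choices, not prescribed by the notes:

```python
def single_link(c1, c2):
    """Single linkage: minimum pairwise distance between two clusters."""
    return min(abs(a - b) for a in c1 for b in c2)

# start with each data point as its own cluster
clusters = [[p] for p in [1.0, 2.0, 9.0, 10.0, 30.0]]

while len(clusters) > 2:  # stopping criterion: two clusters remain
    # find the closest pair of clusters under single linkage
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
    )
    clusters[i] += clusters.pop(j)  # merge the closest pair
```

Recording each merge (instead of discarding it) would yield the dendrogram described above.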

Agglomerative Clustering :
- The process of agglomerative clustering can be visualised as a binary tree called a dendrogram, where each leaf node represents a single data point and internal nodes represent merged clusters. At the top of the tree, there is a single root node that represents the entire data set.

- In agglomerative clustering, each data point starts as a single-point cluster, and the algorithm computes a distance or similarity matrix that contains the pairwise distances or similarities between all data points. The algorithm then iteratively merges the two closest clusters until all data points belong to a single cluster.

- Different ways to define the distance between clusters :
Single linkage : the distance between two clusters is defined as the minimum distance between any two points from the two clusters.

Complete linkage : the distance between two clusters is defined as the maximum distance between any two points from the two clusters.

Average linkage : the distance between two clusters is defined as the average distance between all pairs of points from the two clusters.
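The three linkage rules can be compared directly on two small 1-D clusters (the values are illustrative):

```python
from itertools import product

A, B = [1.0, 2.0], [5.0, 8.0]                    # two toy 1-D clusters
pair_d = [abs(a - b) for a, b in product(A, B)]  # all pairwise distances: 4, 7, 3, 6

single = min(pair_d)                 # closest pair:  |2 - 5| = 3.0
complete = max(pair_d)               # farthest pair: |1 - 8| = 7.0
average = sum(pair_d) / len(pair_d)  # mean of 4, 7, 3, 6 = 5.0
```

Single linkage tends to produce the chained clusters described earlier, while complete linkage favours compact ones; average linkage sits in between.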

Divisive Clustering :
- A hierarchical clustering algorithm that starts with all the data points in a single cluster and then recursively divides the cluster into smaller sub-clusters based on some similarity measure.
i.e. it takes a top-down approach : the entire dataset is considered as one cluster, which is then divided into smaller clusters until a stopping criterion is met.

- The algorithm works by repeatedly selecting a cluster and dividing it into two smaller sub-clusters based on some dissimilarity measure. The process continues until a stopping criterion is reached.
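A minimal top-down sketch in Python; splitting each cluster at its mean and stopping at clusters of size 2 are simplifying assumptions for illustration, not part of the notes:

```python
def divisive(points, max_size=2):
    """Recursively split a 1-D cluster at its mean until clusters are small."""
    if len(points) <= max_size:   # stopping criterion
        return [points]
    m = sum(points) / len(points)
    left = [p for p in points if p < m]
    right = [p for p in points if p >= m]
    return divisive(left, max_size) + divisive(right, max_size)

# three visibly separated groups in 1-D
clusters = divisive([1, 2, 9, 10, 20, 21])
```

Real divisive algorithms choose the split with a dissimilarity measure (e.g. running 2-means inside each cluster) rather than a simple mean cut, but the recursion structure is the same.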




Similarity and Dissimilarity between objects :
1. Similarity coefficient : a measure of the degree of similarity between two objects or data points in a dataset.
It is used to calculate the distance between data points or clusters, which can then be used to group similar points together.
Examples of similarity coefficients :
1. Euclidean distance : calculates the straight-line distance between two data points in n-dimensional space.
2. Cosine similarity : measures the cosine of the angle between two vectors, where higher values indicate greater similarity.
3. Jaccard index : measures the similarity between sets of binary data, where a value of 1 indicates identical sets and a value of 0 indicates completely dissimilar sets.
4. Pearson correlation coefficient : measures the linear correlation between two variables, where values range from -1 (perfect negative correlation) to 1 (perfect positive correlation).
2. Dissimilarity coefficient : a measure that quantifies the degree of dissimilarity between two objects or data points in a dataset.
It is used to calculate the distance between data points or clusters, which can then be used to separate dissimilar points.
Examples of dissimilarity coefficients :
1. Manhattan distance : calculates the sum of the absolute differences between two data points in n-dimensional space.
2. Minkowski distance : a generalization of the Manhattan distance that includes a parameter p, where p=1 corresponds to the Manhattan distance and p=2 corresponds to the Euclidean distance.
3. Hamming distance : measures the number of positions at which two strings of equal length differ; commonly used for binary data.
4. Mahalanobis distance : takes into account the covariance of the data and is often used for high-dimensional data with correlated variables.
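A few of the coefficients above, computed by hand in Python on toy vectors (values are illustrative):

```python
from math import sqrt

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # toy vectors; note y = 2 * x

# Euclidean distance: straight-line distance in n-dimensional space
euclidean = sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Manhattan distance: sum of absolute coordinate differences
manhattan = sum(abs(a - b) for a, b in zip(x, y))

# Cosine similarity: cosine of the angle between the two vectors
dot = sum(a * b for a, b in zip(x, y))
cosine = dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))
# y points in exactly the same direction as x, so cosine similarity is 1.0
```

Note the contrast: the two vectors are far apart by Euclidean and Manhattan distance yet maximally similar by cosine, which ignores magnitude.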

Proximity Matrix : (or distance matrix)
- A matrix that stores the pairwise distances or dissimilarities between a set of objects or data points in a dataset.
- Used as input to clustering algorithms that require pairwise distance information between data points.
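Building one is a double loop over the data; a sketch on toy 1-D points:

```python
points = [0.0, 1.0, 5.0]  # toy 1-D data

# proximity (distance) matrix: D[i][j] = distance between point i and point j
D = [[abs(a - b) for b in points] for a in points]
# symmetric, with zeros on the diagonal
```

This matrix is exactly the input that agglomerative clustering consumes when it looks up the closest pair of clusters.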



