Cluster analysis

Lecture notes
Cluster analysis is an unsupervised learning technique; these notes cover the concepts of cluster analysis thoroughly.


  • May 12, 2023
  • 4
  • 2022/2023
  • Lecture notes
  • Alexandar
  • All classes
Seller: hasithakutala
W4&W5_Cluster Analysis
31 January 2023 15:59



Clustering :
The process of partitioning a data set into subsets (clusters) on the basis of similarity or proximity under some defined distance measure.

Compact cluster :
A group of data points that are tightly packed together and have a distinct boundary from other clusters.
Algorithms that aim to create compact clusters :
k-means, hierarchical clustering, and DBSCAN.
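As a toy illustration of how such an algorithm pulls points into compact clusters, here is a minimal k-means sketch in pure Python. The data, initial centers, and iteration count are illustrative assumptions, not from the notes:

```python
from math import dist

def kmeans(points, centers, iters=10):
    """Lloyd's algorithm: alternate nearest-center assignment and centroid update."""
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # update step: each center moves to the centroid of its cluster
        centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# two visibly separated groups of points
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (9, 9)])
```

Each iteration shrinks the within-cluster spread, which is exactly the "compactness" objective k-means optimises.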

Center vs centroid :
The center of a cluster is the point at the geometric center of all the data points in the cluster.

The centroid of a cluster is the average of all the data points in the cluster. (The centroid is preferred as it is less sensitive to outliers.)
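The centroid is just the coordinate-wise mean of the cluster's points; a quick Python sketch on toy values:

```python
# toy 2-D cluster; the points are illustrative
cluster = [(1, 2), (3, 4), (5, 6)]

# centroid = coordinate-wise mean over all points in the cluster
centroid = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
# centroid == (3.0, 4.0)
```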

Chained cluster :
A type of hierarchical clustering where the clusters are built incrementally in a sequence or chain.
The algorithm starts with a single data point as a cluster and then sequentially adds more data points to the cluster based on some similarity measure until a stopping criterion is met.
- Useful when the data is not easily partitionable into discrete clusters, or where there are natural groupings that are not immediately obvious.

- Drawback : computationally expensive.
- Because it adds data points sequentially, it is sensitive to the order in which the data points are added.

Types of algorithm :
1. Hierarchical clustering :
   a. Agglomerative
   b. Divisive
2. Partitional clustering

Hierarchical Clustering :
- Builds a hierarchy of clusters by iteratively grouping together the closest data points or clusters until a stopping criterion is met.
- Types of hierarchical clustering :
1. Agglomerative clustering :
starts with each data point as its own cluster and then repeatedly merges the two closest clusters into a larger one until all data points are in a single cluster.
2. Divisive clustering :
starts with all data points in a single cluster and then recursively splits the clusters into smaller clusters until each data point is in its own cluster.

- Both methods produce a dendrogram, a tree-like diagram that shows the hierarchical relationship between clusters. Each leaf node represents a single data point and the internal nodes represent clusters of data points.
- The distance metric defines the similarity measure between pairs of data points, and the linkage method determines how the distance between clusters is calculated.
Distance metrics : Euclidean distance, Manhattan distance and cosine distance.
Linkage methods : single linkage, complete linkage and average linkage.

- Hierarchical clustering can be used for exploratory data analysis, visualization and feature engineering.
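The agglomerative variant can be sketched in a few lines of Python. Single linkage, the toy 1-D data, and stopping at two clusters are all illustrative choices, not prescribed by the notes:

```python
def single_link(c1, c2):
    """Single linkage: minimum pairwise distance between two clusters."""
    return min(abs(a - b) for a in c1 for b in c2)

# start with each data point as its own cluster
clusters = [[p] for p in [1.0, 2.0, 9.0, 10.0, 30.0]]

while len(clusters) > 2:  # stopping criterion: two clusters remain
    # find the closest pair of clusters under single linkage
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
    )
    clusters[i] += clusters.pop(j)  # merge the closest pair
```

Recording each merge (instead of discarding it) would yield the dendrogram described above.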

Agglomerative Clustering :
- The process of agglomerative clustering can be visualised as a binary tree called a dendrogram, where each leaf node represents a single data point and internal nodes represent merged clusters. At the top of the tree, there is a single root node that represents the entire data set.

- In agglomerative clustering, each data point starts as a single-point cluster, and the algorithm computes a distance or similarity matrix that contains the pairwise distances or similarities between all data points. The algorithm then iteratively merges the two closest clusters until all data points belong to a single cluster.

- Different ways to define the distance between clusters :
Single linkage : the distance between two clusters is defined as the minimum distance between any two points from the two clusters.

Complete linkage : the distance between two clusters is defined as the maximum distance between any two points from the two clusters.

Average linkage : the distance between two clusters is defined as the average distance between all pairs of points from the two clusters.
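The three linkage rules can be compared directly on two small 1-D clusters (the values are illustrative):

```python
from itertools import product

A, B = [1.0, 2.0], [5.0, 8.0]                    # two toy 1-D clusters
pair_d = [abs(a - b) for a, b in product(A, B)]  # all pairwise distances: 4, 7, 3, 6

single = min(pair_d)                 # closest pair:  |2 - 5| = 3.0
complete = max(pair_d)               # farthest pair: |1 - 8| = 7.0
average = sum(pair_d) / len(pair_d)  # mean of 4, 7, 3, 6 = 5.0
```

Single linkage tends to produce the chained clusters described earlier, while complete linkage favours compact ones; average linkage sits in between.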

Divisive Clustering :
- A hierarchical clustering algorithm that starts with all the data points in a single cluster and then recursively divides the cluster into smaller sub-clusters based on some similarity measure.
i.e. it takes a top-down approach : the entire dataset is considered as one cluster, which is then divided into smaller clusters until a stopping criterion is met.

- The algorithm works by repeatedly selecting a cluster and dividing it into two smaller sub-clusters based on some dissimilarity measure. The process continues until a stopping criterion is reached.
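A minimal top-down sketch in Python; splitting each cluster at its mean and stopping at clusters of size 2 are simplifying assumptions for illustration, not part of the notes:

```python
def divisive(points, max_size=2):
    """Recursively split a 1-D cluster at its mean until clusters are small."""
    if len(points) <= max_size:   # stopping criterion
        return [points]
    m = sum(points) / len(points)
    left = [p for p in points if p < m]
    right = [p for p in points if p >= m]
    return divisive(left, max_size) + divisive(right, max_size)

# three visibly separated groups in 1-D
clusters = divisive([1, 2, 9, 10, 20, 21])
```

Real divisive algorithms choose the split with a dissimilarity measure (e.g. running 2-means inside each cluster) rather than a simple mean cut, but the recursion structure is the same.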




Similarity and Dissimilarity between objects :
1. Similarity coefficient : a measure of the degree of similarity between two objects or data points in a dataset.
It is used to calculate the distance between data points or clusters, which can then be used to group similar points together.
Examples of similarity coefficients :
1. Euclidean distance : calculates the straight-line distance between two data points in n-dimensional space.
2. Cosine similarity : measures the cosine of the angle between two vectors, where higher values indicate greater similarity.
3. Jaccard index : measures the similarity between sets of binary data, where a value of 1 indicates identical sets and a value of 0 indicates completely dissimilar sets.
4. Pearson correlation coefficient : measures the linear correlation between two variables, where values range from -1 (perfect negative correlation) to 1 (perfect positive correlation).
2. Dissimilarity coefficient : a measure that quantifies the degree of dissimilarity between two objects or data points in a dataset.
It is used to calculate the distance between data points or clusters, which can then be used to separate dissimilar points.
Examples of dissimilarity coefficients :
1. Manhattan distance : calculates the sum of the absolute differences between two data points in n-dimensional space.
2. Minkowski distance : a generalization of the Manhattan distance that includes a parameter p, where p=1 corresponds to the Manhattan distance and p=2 corresponds to the Euclidean distance.
3. Hamming distance : measures the number of positions at which two strings of equal length differ; commonly used for binary data.
4. Mahalanobis distance : takes into account the covariance of the data and is often used for high-dimensional data with correlated variables.
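A few of the coefficients above, computed by hand in Python on toy vectors (values are illustrative):

```python
from math import sqrt

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # toy vectors; note y = 2 * x

# Euclidean distance: straight-line distance in n-dimensional space
euclidean = sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Manhattan distance: sum of absolute coordinate differences
manhattan = sum(abs(a - b) for a, b in zip(x, y))

# Cosine similarity: cosine of the angle between the two vectors
dot = sum(a * b for a, b in zip(x, y))
cosine = dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))
# y points in exactly the same direction as x, so cosine similarity is 1.0
```

Note the contrast: the two vectors are far apart by Euclidean and Manhattan distance yet maximally similar by cosine, which ignores magnitude.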

Proximity Matrix : (or distance matrix)
- A matrix that stores the pairwise distances or dissimilarities between a set of objects or data points in a dataset.
- Used as input to clustering algorithms that require pairwise distance information between data points.
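Building one is a double loop over the data; a sketch on toy 1-D points:

```python
points = [0.0, 1.0, 5.0]  # toy 1-D data

# proximity (distance) matrix: D[i][j] = distance between point i and point j
D = [[abs(a - b) for b in points] for a in points]
# symmetric, with zeros on the diagonal
```

This matrix is exactly the input that agglomerative clustering consumes when it looks up the closest pair of clusters.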



