WEEK-13 ( 19/04/2021 - 23/04/2021 )
Week 13 started with a continuation of the Unsupervised Learning topic.
What Is the Problem with the K-Means Method?
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.
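A tiny NumPy sketch (with made-up 1-D values) of why the medoid is more robust than the mean when an outlier is present:

```python
import numpy as np

# Toy 1-D cluster with one extreme outlier (hypothetical values)
points = np.array([1.0, 2.0, 2.5, 3.0, 100.0])

# k-means reference point: the mean is dragged towards the outlier
mean = points.mean()                       # 21.7

# k-medoids reference point: the medoid is an actual data point that
# minimises the total distance to all the other points in the cluster
total_dists = np.abs(points[:, None] - points[None, :]).sum(axis=1)
medoid = points[np.argmin(total_dists)]    # 2.5

print(f"mean = {mean}, medoid = {medoid}")
```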

Drawbacks of Partitional Clustering
- Natural clusters may be split if
- clusters are not well separated (small inter-cluster distances)
- clusters have very different sizes
- cluster shapes are not convex (illustrated in the sketch below)
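As a quick illustration of the non-convex case, this hedged scikit-learn sketch runs k-means on the classic two-moons dataset, where the natural clusters are not convex and end up being split:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: the natural clusters are not convex (illustrative data)
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means assumes roughly spherical clusters, so it tends to cut each moon in half
y_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true moons is poor (ARI well below 1.0)
print("ARI:", adjusted_rand_score(y_true, y_pred))
```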

Hierarchical Clustering
- Create a nested series of partitions
- represented in the form of a dendrogram
- Shows how objects are grouped together step by step
- Typical stopping criteria
- Number of clusters
- The minimum distance between clusters exceeding a user-defined threshold

Types:
Agglomerative:
- Starts by assuming that each object is a separate cluster
- Successively merges clusters until some stopping criterion is met (sketched with SciPy after this list)
Divisive:
- Starts by assuming the existence of only one cluster
- Successively splits a cluster into two or more
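Below is a minimal sketch using SciPy (the dataset and threshold values are made up) that performs agglomerative clustering, applies the two stopping criteria mentioned above, and draws the dendrogram:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# Hypothetical 2-D data: three loose groups
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2)) for loc in ((0, 0), (3, 3), (6, 0))])

# Agglomerative clustering: start with every point as its own cluster,
# then merge the closest clusters step by step (average-link here)
Z = linkage(X, method="average")

# Stopping criterion 1: a fixed number of clusters
labels_k = fcluster(Z, t=3, criterion="maxclust")

# Stopping criterion 2: stop merging once the cluster distance exceeds a threshold
labels_d = fcluster(Z, t=1.5, criterion="distance")

# The dendrogram shows how objects are grouped together step by step
dendrogram(Z)
plt.show()
```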

Hierarchical Clustering Criteria
Single-link
- Distance between two clusters is defined as the minimum distance between pairs of objects, one from each cluster
- Can result in a chaining effect
- Builds a minimum spanning tree
- Edges only exist between “closest” nodes
Complete-link
- Distance between two clusters is the maximum distance between pairs of objects, one from each cluster
- Merging the clusters gives the minimum increase in the diameter of the merged cluster
- For the same clusters the two criteria can disagree: in the lecture example C2 is closer to C1 under single-link but closer to C3 under complete-link
- Both criteria are compared in the sketch below
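A hedged sketch with SciPy on made-up data (the strip shapes are just illustrative): single-link merges via nearest pairs, which is what produces the chaining effect, while complete-link keeps cluster diameters small.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Illustrative data: two long, thin, nearly touching strips of points
X = np.vstack([
    np.column_stack([np.linspace(0, 5, 30), rng.normal(0.0, 0.05, 30)]),
    np.column_stack([np.linspace(0, 5, 30), rng.normal(0.6, 0.05, 30)]),
])

D = pdist(X)  # condensed pairwise-distance matrix

# Single-link: cluster distance = minimum pairwise distance (can chain)
single = fcluster(linkage(D, method="single"), t=2, criterion="maxclust")

# Complete-link: cluster distance = maximum pairwise distance (compact clusters)
complete = fcluster(linkage(D, method="complete"), t=2, criterion="maxclust")

print("single-link cluster sizes:  ", np.bincount(single)[1:])
print("complete-link cluster sizes:", np.bincount(complete)[1:])
```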

Cure:
CURE (Clustering Using REpresentatives) is an efficient data clustering algorithm for large databases. Compared with K-means clustering it is more robust to outliers and able to identify clusters having non-spherical shapes and size variances.
Algorithm Steps –
- Starts with each object in a separate cluster
- At each successive step it merges the closest clusters
- Similarity between clusters is computed using the Single-link criterion on the sets of representative points
- Selects c well scattered points for the merged cluster
- The first point is the furthest point from the mean of the cluster
- Each successive representative object is chosen to be the furthest from any of the existing representative objects
- Shrink each representative point, p, towards the mean
- p = p + α (mean − p)
- where mean is the mean of the new merged cluster
- Recompute similarity of the merged cluster with each of the other clusters (a NumPy sketch of the representative-point step follows)
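A rough NumPy sketch of CURE's representative-point step for a single merged cluster: pick c well-scattered points farthest-first, then shrink them towards the mean by a factor α. Function and parameter names are illustrative, not from any library:

```python
import numpy as np

def cure_representatives(cluster, c=4, alpha=0.3):
    """Pick c well-scattered points of `cluster` and shrink them towards the mean."""
    mean = cluster.mean(axis=0)

    # First representative: the point farthest from the cluster mean
    reps = [cluster[np.argmax(np.linalg.norm(cluster - mean, axis=1))]]

    # Each further representative: the point farthest from all chosen ones
    while len(reps) < min(c, len(cluster)):
        dists = np.min([np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0)
        reps.append(cluster[np.argmax(dists)])

    # Shrink towards the mean: p = p + alpha * (mean - p)
    return np.array([p + alpha * (mean - p) for p in reps])

# Illustrative usage on random points
points = np.random.default_rng(2).normal(size=(50, 2))
print(cure_representatives(points, c=4, alpha=0.3))
```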
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):
Objective:
- Construct the best quality clustering with the available resources (memory and time constraints)
- Find a good clustering with a single scan of the data
- Improve quality with additional scans
- Explicit process for handling noise
- Not all data points contribute equally to the clustering process
- Points in sparse areas treated as noise
- Points in dense areas treated as one (micro-) cluster
- Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
Implementation Phases:
- Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
- Not the reverse process of the divisive first phase; corrects errors made in Phase 1 due to “local” decisions
Weaknesses: handles only numeric data, and is sensitive to the order of the data records (a scikit-learn sketch follows below)
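scikit-learn includes a Birch implementation that follows this two-phase idea; here is a hedged sketch (parameter values are illustrative, not tuned). threshold and branching_factor control the CF tree built in Phase 1, and n_clusters passes the leaf micro-clusters to a global clustering step (Phase 2).

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Illustrative data: three blobs
X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)

# Phase 1: build the in-memory CF tree in a single scan
# (threshold = max micro-cluster radius, branching_factor = max children per CF node)
# Phase 2: cluster the CF-tree leaves into n_clusters global clusters
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)

print("micro-clusters (CF leaves):", len(birch.subcluster_centers_))
print("global cluster sizes:", np.bincount(labels))
```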
Density-Based Clustering (DBSCAN):
Density-Based Clustering refers to unsupervised learning methods that identify distinctive groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a base algorithm for density-based clustering. It can discover clusters of different shapes and sizes from large amounts of data containing noise and outliers.
The DBSCAN algorithm uses two parameters:
minPts: The minimum number of points (a threshold) clustered together for a region to be considered dense.
eps (ε): A distance measure that will be used to locate the points in the neighborhood of any point.
The neighbourhood, within a radius ε, of a point p (its ε-neighbourhood) is defined as
Nε(p) = {q ∈ D | dist(p, q) ≤ ε}
If |Nε(p)| ≥ MinPts, the object p is called a core object
A point p is directly density-reachable from a point q w.r.t. ε, MinPts if
q is a core object
p ∈ Nε(q)
The relationship is symmetric between pairs of core objects,
but p itself need not be a core object (these definitions are sketched in NumPy below)
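A small NumPy sketch of the ε-neighbourhood and the core-object test (the dataset and the values of eps and minPts are made up):

```python
import numpy as np

def eps_neighbourhood(D, p_idx, eps):
    """Indices q with dist(p, q) <= eps, i.e. N_eps(p)."""
    dists = np.linalg.norm(D - D[p_idx], axis=1)
    return np.flatnonzero(dists <= eps)

def is_core(D, p_idx, eps, min_pts):
    """p is a core object if its eps-neighbourhood contains at least minPts points."""
    return len(eps_neighbourhood(D, p_idx, eps)) >= min_pts

# Illustrative usage on random 2-D points
D = np.random.default_rng(3).uniform(size=(100, 2))
print(is_core(D, p_idx=0, eps=0.1, min_pts=4))
```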
Density-Reachable and Density-Connected
- A point p is density-reachable from a point q w.r.t. ε, MinPts if
- there is a chain of points p1, …, pn with p1 = q, pn = p such that pi+1 is directly density-reachable from pi
- A point p is density-connected to a point q w.r.t. ε, MinPts if
- there is a point o such that both, p and q are density-reachable from o w.r.t. ε and MinPts
- A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability
- Every object not contained in a cluster is considered noise

Algorithmic steps for DBSCAN clustering:
- The algorithm proceeds by arbitrarily picking a point in the dataset (until all points have been visited).
- If there are at least ‘minPts’ points within a radius of ‘ε’ of that point, then we consider all of these points to be part of the same cluster.
- The clusters are then expanded by recursively repeating the neighbourhood calculation for each neighbouring point (see the scikit-learn example below).
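These steps correspond to scikit-learn's DBSCAN. A hedged example on the two-moons data used earlier (eps and min_samples map to ε and minPts; the values shown are illustrative and normally need tuning):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Non-convex clusters with a little noise (illustrative data)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps plays the role of ε, min_samples the role of minPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Points that are not density-reachable from any core object get label -1 (noise)
labels = db.labels_
print("clusters found:", len(set(labels) - {-1}))
print("noise points:", np.sum(labels == -1))
```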

At the end of week 13, we were given a group coursework on predicting the prices of football players.