WEEK-13 ( 19/04/2021 - 23/04/2021 )
Week 13 started with a continuation of the Unsupervised Learning topic.
What Is the Problem with the K-Means Method?
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.
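A tiny NumPy sketch (with made-up 1-D values) of why the medoid is more robust than the mean when an outlier is present:

```python
import numpy as np

# Toy 1-D cluster with one extreme outlier (hypothetical values)
points = np.array([1.0, 2.0, 2.5, 3.0, 100.0])

# k-means reference point: the mean is dragged towards the outlier
mean = points.mean()                       # 21.7

# k-medoids reference point: the medoid is an actual data point that
# minimises the total distance to all the other points in the cluster
total_dists = np.abs(points[:, None] - points[None, :]).sum(axis=1)
medoid = points[np.argmin(total_dists)]    # 2.5

print(f"mean = {mean}, medoid = {medoid}")
```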

Drawbacks of Partitional Clustering
- Natural clusters may be split if
- clusters are not well separated (small inter-cluster distances)
- clusters have very different sizes
- cluster shapes are not convex (illustrated in the sketch below)
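As a quick illustration of the non-convex case, this hedged scikit-learn sketch runs k-means on the classic two-moons dataset, where the natural clusters are not convex and end up being split:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: the natural clusters are not convex (illustrative data)
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means assumes roughly spherical clusters, so it tends to cut each moon in half
y_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true moons is poor (ARI well below 1.0)
print("ARI:", adjusted_rand_score(y_true, y_pred))
```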

Hierarchical Clustering
- Create a nested series of partitions
- represented in the form of a dendrogram
- Shows how objects are grouped together step by step
- Typical stopping criteria
- Number of clusters
- The minimum distance between clusters exceeding a user-defined threshold

Types:
Agglomerative:
- Starts by assuming that each object is a separate cluster
- Successively merges clusters until some stopping criterion is met (sketched with SciPy after this list)
Divisive:
- Starts by assuming the existence of only one cluster
- Successively splits a cluster into two or more
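Below is a minimal sketch using SciPy (the dataset and threshold values are made up) that performs agglomerative clustering, applies the two stopping criteria mentioned above, and draws the dendrogram:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# Hypothetical 2-D data: three loose groups
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2)) for loc in ((0, 0), (3, 3), (6, 0))])

# Agglomerative clustering: start with every point as its own cluster,
# then merge the closest clusters step by step (average-link here)
Z = linkage(X, method="average")

# Stopping criterion 1: a fixed number of clusters
labels_k = fcluster(Z, t=3, criterion="maxclust")

# Stopping criterion 2: stop merging once the cluster distance exceeds a threshold
labels_d = fcluster(Z, t=1.5, criterion="distance")

# The dendrogram shows how objects are grouped together step by step
dendrogram(Z)
plt.show()
```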

Hierarchical Clustering Criteria
Single-link
- Distance between two clusters is defined as the minimum distance between pairs of objects, one from each cluster
- Can result in a chaining effect
- Builds a minimum spanning tree
- Edges only exist between “closest” nodes
Complete-link
- Distance between two clusters is the maximum distance between pairs of objects, one from each cluster
- Merging the clusters gives the minimum increase in the diameter of the merged cluster
- For the same clusters the two criteria can disagree: in the lecture example C2 is closer to C1 under single-link but closer to C3 under complete-link
- Both criteria are compared in the sketch below
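A hedged sketch with SciPy on made-up data (the strip shapes are just illustrative): single-link merges via nearest pairs, which is what produces the chaining effect, while complete-link keeps cluster diameters small.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Illustrative data: two long, thin, nearly touching strips of points
X = np.vstack([
    np.column_stack([np.linspace(0, 5, 30), rng.normal(0.0, 0.05, 30)]),
    np.column_stack([np.linspace(0, 5, 30), rng.normal(0.6, 0.05, 30)]),
])

D = pdist(X)  # condensed pairwise-distance matrix

# Single-link: cluster distance = minimum pairwise distance (can chain)
single = fcluster(linkage(D, method="single"), t=2, criterion="maxclust")

# Complete-link: cluster distance = maximum pairwise distance (compact clusters)
complete = fcluster(linkage(D, method="complete"), t=2, criterion="maxclust")

print("single-link cluster sizes:  ", np.bincount(single)[1:])
print("complete-link cluster sizes:", np.bincount(complete)[1:])
```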

Cure:
CURE (Clustering Using REpresentatives) is an efficient data clustering algorithm for large databases. Compared with K-means clustering it is more robust to outliers and able to identify clusters having non-spherical shapes and size variances.
Algorithm Steps –
- Starts with each object in a separate cluster
- At each successive step it merges the closest clusters
- Similarity between clusters is computed using the Single-link criterion on the sets of representative points
- Selects c well scattered points for the merged cluster
- The first point is the furthest point from the mean of the cluster
- Each successive representative object is chosen to be the furthest from any of the existing representative objects
- Shrink each representative point, p, towards the mean
- p = p + α (mean − p)
- where mean is the mean of the new merged cluster
- Recompute similarity of the merged cluster with each of the other clusters (a NumPy sketch of the representative-point step follows)
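A rough NumPy sketch of CURE's representative-point step for a single merged cluster: pick c well-scattered points farthest-first, then shrink them towards the mean by a factor α. Function and parameter names are illustrative, not from any library:

```python
import numpy as np

def cure_representatives(cluster, c=4, alpha=0.3):
    """Pick c well-scattered points of `cluster` and shrink them towards the mean."""
    mean = cluster.mean(axis=0)

    # First representative: the point farthest from the cluster mean
    reps = [cluster[np.argmax(np.linalg.norm(cluster - mean, axis=1))]]

    # Each further representative: the point farthest from all chosen ones
    while len(reps) < min(c, len(cluster)):
        dists = np.min([np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0)
        reps.append(cluster[np.argmax(dists)])

    # Shrink towards the mean: p = p + alpha * (mean - p)
    return np.array([p + alpha * (mean - p) for p in reps])

# Illustrative usage on random points
points = np.random.default_rng(2).normal(size=(50, 2))
print(cure_representatives(points, c=4, alpha=0.3))
```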
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):
Objective:
- Construct the best quality clustering with the available resources (memory and time constraints)
- Find a good clustering with a single scan of the data
- Improve quality with additional scans
- Explicit process for handling noise
- Not all data points contribute equally to the clustering process
- Points in sparse areas treated as noise
- Points in dense areas treated as one (micro-) cluster
- Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
Implementation Phases:
- Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
- Not the reverse process of the divisive first phase; corrects errors made in Phase 1 due to “local” decisions
Weaknesses: handles only numeric data, and is sensitive to the order of the data records (a scikit-learn sketch follows below)
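scikit-learn includes a Birch implementation that follows this two-phase idea; here is a hedged sketch (parameter values are illustrative, not tuned). threshold and branching_factor control the CF tree built in Phase 1, and n_clusters passes the leaf micro-clusters to a global clustering step (Phase 2).

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Illustrative data: three blobs
X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)

# Phase 1: build the in-memory CF tree in a single scan
# (threshold = max micro-cluster radius, branching_factor = max children per CF node)
# Phase 2: cluster the CF-tree leaves into n_clusters global clusters
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)

print("micro-clusters (CF leaves):", len(birch.subcluster_centers_))
print("global cluster sizes:", np.bincount(labels))
```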
Density-Based Clustering (DBSCAN):
Density-Based Clustering refers to unsupervised learning methods that identify distinctive groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a base algorithm for density-based clustering. It can discover clusters of different shapes and sizes from large amounts of data containing noise and outliers.
The DBSCAN algorithm uses two parameters:
minPts: The minimum number of points (a threshold) clustered together for a region to be considered dense.
eps (ε): A distance measure that will be used to locate the points in the neighborhood of any point.
The neighbourhood, within a radius ε, of a point p (its ε-neighbourhood) is defined as
Nε(p) = {q ∈ D | dist(p, q) ≤ ε}
If |Nε(p)| ≥ MinPts, the object p is called a core object
A point p is directly density-reachable from a point q w.r.t. ε, MinPts if
q is a core object
p ∈ Nε(q)
The relationship is symmetric between pairs of core objects,
but p itself need not be a core object (these definitions are sketched in NumPy below)
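A small NumPy sketch of the ε-neighbourhood and the core-object test (the dataset and the values of eps and minPts are made up):

```python
import numpy as np

def eps_neighbourhood(D, p_idx, eps):
    """Indices q with dist(p, q) <= eps, i.e. N_eps(p)."""
    dists = np.linalg.norm(D - D[p_idx], axis=1)
    return np.flatnonzero(dists <= eps)

def is_core(D, p_idx, eps, min_pts):
    """p is a core object if its eps-neighbourhood contains at least minPts points."""
    return len(eps_neighbourhood(D, p_idx, eps)) >= min_pts

# Illustrative usage on random 2-D points
D = np.random.default_rng(3).uniform(size=(100, 2))
print(is_core(D, p_idx=0, eps=0.1, min_pts=4))
```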
Density-Reachable and Density-Connected
- A point p is density-reachable from a point q w.r.t. ε, MinPts if
- there is a chain of points p1, …, pn with p1 = q, pn = p such that pi+1 is directly density-reachable from pi
- A point p is density-connected to a point q w.r.t. ε, MinPts if
- there is a point o such that both, p and q are density-reachable from o w.r.t. ε and MinPts
- A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability
- Every object not contained in a cluster is considered noise

Algorithmic steps for DBSCAN clustering:
- The algorithm proceeds by arbitrarily picking a point in the dataset (until all points have been visited).
- If there are at least ‘minPts’ points within a radius of ‘ε’ of that point, then we consider all of these points to be part of the same cluster.
- The clusters are then expanded by recursively repeating the neighbourhood calculation for each neighbouring point (see the scikit-learn example below).
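These steps correspond to scikit-learn's DBSCAN. A hedged example on the two-moons data used earlier (eps and min_samples map to ε and minPts; the values shown are illustrative and normally need tuning):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Non-convex clusters with a little noise (illustrative data)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps plays the role of ε, min_samples the role of minPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Points that are not density-reachable from any core object get label -1 (noise)
labels = db.labels_
print("clusters found:", len(set(labels) - {-1}))
print("noise points:", np.sum(labels == -1))
```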

At the end of week 13, we were given a group coursework on predicting the prices of football players.