Diving more into DBSCAN mathematics
DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is a clustering algorithm that is based on the concept of density of data points in a feature space. In this section, we will describe the mathematics behind DBSCAN, including the key concepts and formulas that are used in the algorithm.
Key Concepts
Epsilon (ε): This is the maximum distance between two points to be considered in the same cluster. Points within ε distance of each other are considered "neighbors".
MinPts: This is the minimum number of points required to form a dense region. A dense region is defined as a group of points where each point has at least MinPts neighbors.
Core Points: These are points that have at least MinPts neighbors within ε distance of them.
Border Points: These are points that are within ε distance of a core point but have less than MinPts neighbors within ε distance of them.
Noise Points: These are points that are not core points or border points.
Formulas
Euclidean Distance: The Euclidean distance between two points (x1, y1) and (x2, y2) in a two-dimensional space is calculated using the following formula:
d = √((x2 - x1)^2 + (y2 - y1)^2)
where d is the distance between the two points.
Density: The density of a point is calculated by counting the number of points within ε distance of that point.
Reachability Distance: The reachability distance between two points p and q is the maximum distance between p and any core point that q is directly reachable from. It is calculated using the following formula:
reachability_distance(p, q) = max(d(p, q), core_dist(q))
where d(p, q) is the Euclidean distance between points p and q, and core_dist(q) is the distance between point q and its closest core point.
Core Distance: The core distance of a point p is the distance to its kth nearest neighbor, where k is the MinPts parameter. It is calculated by sorting the distances of all points within ε distance of p and selecting the kth distance.
Cluster: A cluster is a set of points that are directly or indirectly reachable from a core point.
By using these concepts and formulas, the DBSCAN algorithm is able to group together points that are closely packed together while leaving out points that are further away or isolated.