下面是用C++实现DBSCAN聚类算法的完整攻略:
一、DBSCAN聚类算法简介
DBSCAN(Density-Based Spatial Clustering of Applications with Noise) 是一种基于密度的聚类算法。该算法将数据点划分为三类:核心点、边界点和噪声点。主要优点有:
- 能够发现任意形状的聚类。
- 能够在一定程度上对噪声数据进行过滤。
- 可以在簇的数量未知的情况下进行聚类。
该算法的主要缺点是在数据点分布高度不规律时,聚类效果并不那么好。下面是该算法的步骤:
-
随机选择一个数据点 $P$。
-
以点 $P$ 为中心,以给定半径 $r$ 构建一个半径为 $r$ 的超球体,记为 $S(P,r)$。
-
如果在 $S(P,r)$ 中的数据点的数量大于等于给定的阈值 $MinPts$,则认为 $P$ 是一个核心点,否则将其归为噪声点。
-
将所有在 $S(P,r)$ 中的核心点归为同一簇。
-
重复执行步骤 1-4,直到所有数据点都被归为某一簇。
下面是完整代码实现:
二、C++实现DBSCAN聚类算法
1. 定义点类
class Point{
public:
std::vector<double> data;
int cluster;
bool visited;
Point(){
this->cluster = -1;
this->visited = false;
}
Point(std::vector<double> data);
};
Point::Point(std::vector<double> data) {
this->data = data;
this->cluster = -1;
this->visited = false;
}
2. 计算两点间的欧几里得距离
double getDistance(Point p1, Point p2){
double distance = 0;
for(int i=0;i<p1.data.size();i++){
distance += pow(p1.data[i] - p2.data[i], 2);
}
return sqrt(distance);
}
3. 寻找 $Eps$ 直径
double getEpsDiameter(std::vector<Point> dataset, int MinPts){
std::vector<double> neighbor_distances;
for(int i=0;i<dataset.size();i++){
for(int j=i+1;j<dataset.size();j++){
neighbor_distances.push_back(getDistance(dataset[i], dataset[j]));
}
}
std::sort(neighbor_distances.begin(), neighbor_distances.end());
return neighbor_distances[MinPts - 1];
}
4. 寻找相邻点集合
std::vector<Point> getNeighborPoints(Point p, std::vector<Point> dataset, double Eps){
std::vector<Point> neighbors;
for(int i=0;i<dataset.size();i++){
if(getDistance(p, dataset[i]) <= Eps){
neighbors.push_back(dataset[i]);
}
}
return neighbors;
}
5. DBSCAN核心算法
std::vector<Point> DBSCAN(std::vector<Point> dataset, double Eps, int MinPts){
std::vector<Point> clustering;
int cluster_label = 0;
for(int i=0;i<dataset.size();i++){
Point p = dataset[i];
if(p.visited) continue;
p.visited = true;
std::vector<Point> NeighborPts = getNeighborPoints(p, dataset, Eps);
if(NeighborPts.size() < MinPts){
p.cluster = -1;
} else{
cluster_label++;
clustering.push_back(p);
p.cluster = cluster_label;
int idx = 0;
while(idx < NeighborPts.size()){
Point p_neighbor = NeighborPts[idx];
if(!p_neighbor.visited){
p_neighbor.visited = true;
std::vector<Point> NeighborPts_neighbor = getNeighborPoints(p_neighbor, dataset, Eps);
if(NeighborPts_neighbor.size() >= MinPts){
for(int j=0;j<NeighborPts_neighbor.size();j++){
NeighborPts.push_back(NeighborPts_neighbor[j]);
}
}
}
if(p_neighbor.cluster == -1){
p_neighbor.cluster = cluster_label;
clustering.push_back(p_neighbor);
}
idx++;
}
}
}
return clustering;
}
6. 示例一
假设有以下数据,每个元素为一个由两个坐标组成的二维点:
std::vector<Point> dataset = {
Point({1, 2}), Point({2, 1}), Point({2, 2}),
Point({5, 1}), Point({5, 2}),
Point({9, 10}), Point({9, 11}), Point({10, 10}),
Point({13, 8}), Point({13, 9}), Point({13, 10}),
Point({14, 7}), Point({14, 8})
};
设置半径 $r = 4$,最小点数 $MinPts = 3$,得到以下结果:
std::vector<Point> clustering = DBSCAN(dataset, 4, 3);
聚类结果:
Cluster 1:
(1, 2)
(2, 1)
(2, 2)
(5, 1)
(5, 2)
Cluster 2:
(9, 10)
(9, 11)
(10, 10)
Cluster 3:
(13, 8)
(13, 9)
(13, 10)
(14, 7)
(14, 8)
7. 示例二
假设有以下数据:
std::vector<Point> dataset = {
Point({1, 2}), Point({2, 1}), Point({2, 2}),
Point({5, 1}), Point({5, 2}),
Point({9, 10}), Point({9, 11}), Point({10, 10}),
Point({13, 8}), Point({13, 9}), Point({13, 10}),
Point({14, 7}), Point({14, 8}),
Point({15, 11}), Point({15, 12}), Point({15, 13}),
Point({16, 12})
};
设置半径 $r = 3$,最小点数 $MinPts = 4$,得到以下结果:
std::vector<Point> clustering = DBSCAN(dataset, 3, 4);
聚类结果:
Cluster 1:
(9, 10)
(9, 11)
(10, 10)
Cluster 2:
(13, 8)
(13, 9)
(13, 10)
(14, 7)
(14, 8)
Noise:
(1, 2)
(2, 1)
(2, 2)
(5, 1)
(5, 2)
(15, 11)
(15, 12)
(15, 13)
(16, 12)
至此,我们已经基于 C++ 实现了 DBSCAN 算法的示例代码。
本站文章如无特殊说明,均为本站原创,如若转载,请注明出处:用C++实现DBSCAN聚类算法 - Python技术站