Writing my own KMeans algorithm in R

Views: 709 Published: 2022-10-16 Tags: algorithm r machine-learning data-mining k-means

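Question

I am trying to write my first own KMeans algorithm in R. I am new to this field, so please don't judge me if I fail to see something obvious.

In its current state, the algorithm takes two vectors x and y, computes the distance from each data point to every cluster center, and assigns each point to the cluster whose center is closest. The algorithm stops when the assignments no longer change and the cluster centers therefore stay the same.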

# Sample data    
set.seed(100)
xval <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
yval <- rnorm(12, mean = rep(c(1,2,1), each = 4), sd = 0.2)

# Kmeans function
kclus <- function(x, y, nclus) {

    # start with random cluster centers
    xcen <- runif(n = nclus, min = min(x), max = max(x))   
    ycen <- runif(n = nclus, min = min(y), max = max(y))

    # data points and cluster assignment in "data"
    # cluster coordinates in "clus"
    data <- data.frame(xval = x, yval = y, clus = NA)
    clus <- data.frame(name = 1:nclus, xcen = xcen, ycen = ycen)

    finish <- FALSE

    while(finish == FALSE) {

        # assign cluster with minimum distance to each data point
        for(i in 1:length(x)) {
            dist <- sqrt((x[i]-clus$xcen)^2 + (y[i]-clus$ycen)^2)
            data$clus[i] <- which.min(dist)
        }

        xcen_old <- clus$xcen
        ycen_old <- clus$ycen

        # calculate new cluster centers
        for(i in 1:nclus) {
            clus[i,2] <- mean(subset(data$xval, data$clus == i))
            clus[i,3] <- mean(subset(data$yval, data$clus == i))
        }

        # stop the loop if there is no change in cluster coordinates
        if(identical(xcen_old, clus$xcen) & identical(ycen_old, clus$ycen)) finish <- TRUE
    }
    data
}

# apply kmeans function to sample data
cluster <- kclus(xval, yval, 4)

# plot the result (requires the ggplot2 package)
library(ggplot2)
ggplot(cluster, aes(xval, yval, color = as.factor(clus))) + geom_point()

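So far this works reasonably well. But I don't know how to force the algorithm to produce a specific number of clusters. The number of clusters is already available as the argument nclus of my kclus() function, but I don't know how to make use of it.

For the given sample data, the algorithm returns only three clusters. I would like to force it to give me four clusters.

Can anybody here give me a hint?

Many thanks, Markus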

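Recommended answer

The algorithm you implemented does not always give you 3 clusters; you probably just have not run it enough times. Below is a slight modification of your code that shows how the number of clusters in the output depends on the initialization of the cluster centroids (which are chosen at random and can be controlled with a random seed):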

# Sample data    
set.seed(100)
xval <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
yval <- rnorm(12, mean = rep(c(1,2,1), each = 4), sd = 0.2)

# Kmeans function with random.seed for initialization
kclus <- function(x, y, nclus, random.seed=123) {

  set.seed(random.seed)
  # start with random cluster centers
  xcen <- runif(n = nclus, min = min(x), max = max(x))   
  ycen <- runif(n = nclus, min = min(y), max = max(y))

  # data points and cluster assignment in "data"
  # cluster coordinates in "clus"
  data <- data.frame(xval = x, yval = y, clus = NA)
  clus <- data.frame(name = 1:nclus, xcen = xcen, ycen = ycen)

  finish <- FALSE

  while(finish == FALSE) {

    # assign cluster with minimum distance to each data point
    for(i in 1:length(x)) {
      dist <- sqrt((x[i]-clus$xcen)^2 + (y[i]-clus$ycen)^2)
      data$clus[i] <- which.min(dist)
    }

    xcen_old <- clus$xcen
    ycen_old <- clus$ycen

    # calculate new cluster centers
    for(i in 1:nclus) {
      clus[i,2] <- mean(subset(data$xval, data$clus == i))
      clus[i,3] <- mean(subset(data$yval, data$clus == i))
    }

    # stop the loop if there is no change in cluster coordinates
    if(identical(xcen_old, clus$xcen) & identical(ycen_old, clus$ycen)) finish <- TRUE
  }
  data
}

# with default random seed 123, you should be able to reproduce the result
# as you can see, in this case, no data points were assigned to the 4th cluster
library(ggplot2)
cluster <- kclus(xval, yval, 4)
cluster.centers <- aggregate(.~clus, cluster, mean)
ggplot(cluster, aes(xval, yval, color = as.factor(clus))) + 
  geom_point(size=5) + 
  geom_point(data=cluster.centers, aes(xval, yval, col=as.factor(clus)), pch=8, size=5)

# run with a different random seed = 12
# as you can see, in this case, the algorithm outputs 4 clusters, with the 2nd cluster having a single data point assigned to it
cluster <- kclus(xval, yval, 4, 12)
cluster.centers <- aggregate(.~clus, cluster, mean)
ggplot(cluster, aes(xval, yval, color = as.factor(clus))) + 
  geom_point(size=5) + 
  geom_point(data=cluster.centers, aes(xval, yval, col=as.factor(clus)), pch=8, size=5)

# run with a different random seed = 12345
# as you can see, in this case, the algorithm outputs 2 clusters, with all the data points assigned to the 1st and 2nd clusters
cluster <- kclus(xval, yval, 4, 12345)
cluster.centers <- aggregate(.~clus, cluster, mean)
ggplot(cluster, aes(xval, yval, color = as.factor(clus))) + 
  geom_point(size=5) + 
  geom_point(data=cluster.centers, aes(xval, yval, col=as.factor(clus)), pch=8, size=5)

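As the examples above show, whether a cluster ends up with no points assigned at convergence depends on the initial center positions and on the distribution of the data. In general, if k-means ends up with an empty cluster, forcing a point into that cluster will most likely produce a lower-quality clustering, which is not something you want to do.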
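Rather than reading the plots, the empty clusters can also be counted directly. The following is a minimal sketch, not part of the original code: `cluster_sizes()` is a hypothetical helper, and the assignment vector is made-up illustration data standing in for the `clus` column that `kclus()` returns.

```r
# Hypothetical helper: number of points per cluster label 1..nclus,
# including labels that received no points at all
cluster_sizes <- function(assignments, nclus) {
  tabulate(assignments, nbins = nclus)
}

# Made-up assignment vector for illustration (not output of kclus())
example_assignments <- c(1, 1, 3, 2, 2, 3, 1, 2)

sizes <- cluster_sizes(example_assignments, nclus = 4)
sizes               # 3 3 2 0
which(sizes == 0)   # 4, i.e. cluster 4 is empty
```

With the function from this answer, the same check would be `cluster_sizes(cluster$clus, 4)`.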

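There are a few things you can try at this point.

First, you can run your algorithm several times, each time with different randomly initialized centers, and keep the result with the best clustering quality (for example, the lowest SSE). Second, you can use a smarter initialization such as KMeans++. A less attractive option is to modify the algorithm so that, when points are reassigned, each of the K (= 4) clusters is guaranteed to keep at least one point (a reassignment that would empty a cluster is simply not made). Finally, you can try a different algorithm, such as hierarchical clustering, whose dendrogram gives you the flexibility to choose as many clusters as you want.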
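The first and last of these suggestions can be sketched roughly as follows. This is an illustration under stated assumptions rather than a drop-in replacement: `simple_kmeans()` and `sse()` are hypothetical helpers written for this example, `simple_kmeans()` is a compact variant of `kclus()` that leaves the center of an empty cluster where it was instead of setting it to NaN, and the choice of 50 seeds is arbitrary. The hierarchical part uses base R's `hclust()` and `cutree()` with their default settings.

```r
# Sample data from the question
set.seed(100)
xval <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
yval <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)

# Hypothetical compact k-means: returns one cluster label per point.
# Assumption: an empty cluster keeps its previous center.
simple_kmeans <- function(x, y, nclus, seed, max_iter = 100) {
  set.seed(seed)
  xcen <- runif(nclus, min = min(x), max = max(x))
  ycen <- runif(nclus, min = min(y), max = max(y))
  clus <- rep(NA_integer_, length(x))
  for (iter in seq_len(max_iter)) {
    # squared distances: one row per point, one column per center
    d <- outer(x, xcen, "-")^2 + outer(y, ycen, "-")^2
    new_clus <- apply(d, 1, which.min)       # index of the nearest center
    if (identical(new_clus, clus)) break     # assignments are stable
    clus <- new_clus
    for (k in unique(clus)) {                # update non-empty clusters only
      xcen[k] <- mean(x[clus == k])
      ycen[k] <- mean(y[clus == k])
    }
  }
  clus
}

# Sum of squared distances from each point to the mean of its cluster
sse <- function(x, y, clus) {
  sq_dev <- function(v) sum((v - mean(v))^2)
  sum(tapply(x, clus, sq_dev)) + sum(tapply(y, clus, sq_dev))
}

# Restarts: try several seeds and keep the run with the lowest SSE
seeds <- 1:50
runs <- lapply(seeds, function(s) simple_kmeans(xval, yval, nclus = 4, seed = s))
scores <- vapply(runs, function(cl) sse(xval, yval, cl), numeric(1))
best <- runs[[which.min(scores)]]
length(unique(best))   # number of clusters the best run actually uses

# Hierarchical clustering: cutting the tree at k = 4 yields four groups
hc <- hclust(dist(cbind(xval, yval)))
hc_clus <- cutree(hc, k = 4)
table(hc_clus)
```

Note that picking the lowest SSE does not by itself guarantee that all four clusters are used, which is why the sketch prints the number of distinct labels in the best run. `cutree(hc, k = 4)`, by contrast, always returns exactly four non-empty groups for this data.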
