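Python is a scripting language that is easy to learn, productive, and extensible, which makes it a convenient choice for implementing k-means, a clustering algorithm widely used in data mining. This article walks through the algorithm step by step.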
First, import the required libraries:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
Next, read the dataset with pandas:
data = pd.read_csv('file_path')
Then convert the dataset to an array so that it can be processed with NumPy (this assumes every column is numeric):
X = np.array(data)
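The conversion above only yields a usable numeric matrix when the file contains nothing but numeric columns and no missing values. A minimal sketch of one way to prepare the data first, using a small in-memory DataFrame in place of the CSV file; the helper name `to_feature_matrix` and the column names are hypothetical, chosen only for illustration:

```python
import numpy as np
import pandas as pd


def to_feature_matrix(frame):
    """Return the numeric columns of `frame` as a float array, dropping rows with NaN."""
    numeric = frame.select_dtypes(include="number")  # keep numeric columns only
    numeric = numeric.dropna()                       # drop rows with missing values
    return numeric.to_numpy(dtype=float)


# Illustrative data standing in for pd.read_csv(...); not from the article
demo = pd.DataFrame({
    "height": [1.0, 2.0, np.nan, 4.0],
    "weight": [10.0, 20.0, 30.0, 40.0],
    "name": ["a", "b", "c", "d"],
})
X_demo = to_feature_matrix(demo)
print(X_demo.shape)
```

Dropping incomplete rows is only one option; whether it is appropriate depends on the dataset.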
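Next, choose a value for k and initialize the k centroids with k distinct, randomly chosen data points:
k = 3
centroids = np.zeros((k, X.shape[1]))
# Draw k distinct row indices so that no two centroids start at the same point
indices = np.random.choice(X.shape[0], size=k, replace=False)
for i in range(k):
    centroids[i] = X[indices[i]]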
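Then run the clustering loop until the centroids stop changing or the iteration limit is reached:
iteration = 100
for i in range(iteration):
    # Assignment step: put each point into the cluster of its nearest centroid
    clusters = {}
    for j in range(k):
        clusters[j] = []
    for x in X:
        distances = [np.linalg.norm(x - centroids[c]) for c in range(k)]
        classification = distances.index(min(distances))
        clusters[classification].append(x)
    # Update step: move each centroid to the mean of its cluster
    prev_centroids = centroids.copy()
    for c in clusters:
        if clusters[c]:  # an empty cluster keeps its previous centroid
            centroids[c] = np.mean(clusters[c], axis=0)
    # Stop once no centroid has moved
    if np.array_equal(prev_centroids, centroids):
        break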
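The same assignment and update steps can also be written with NumPy broadcasting instead of a Python loop over every point. The following is a sketch under stated assumptions, not a replacement for the code above: the function name `kmeans_numpy`, its parameters, and the synthetic two-blob data are all illustrative.

```python
import numpy as np


def assign_labels(X, centroids):
    """Index of the nearest centroid for every row of X."""
    # distances[i, c] = Euclidean distance from point i to centroid c
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans_numpy(X, k, max_iter=100, seed=0):
    """Plain k-means with vectorized distance computation (illustrative helper)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows of X
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        labels = assign_labels(X, centroids)
        new_centroids = centroids.copy()
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:  # an empty cluster keeps its previous centroid
                new_centroids[c] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Recompute labels so they match the returned centroids
    return centroids, assign_labels(X, centroids)


# Synthetic data: two well-separated blobs around (0, 0) and (10, 10)
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
X_demo = np.vstack([blob_a, blob_b])
centers_demo, labels_demo = kmeans_numpy(X_demo, k=2)
print(np.sort(centers_demo[:, 0]))
```

The broadcasted distance matrix holds one entry per (point, centroid) pair, so memory use grows with the number of points times k; for very large datasets the explicit loop or a library implementation may be the better fit.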
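Finally, we can visualize the result. Note that the code below fits scikit-learn's KMeans to the same data and plots its labels and cluster centers, not the centroids computed by the loop above; only the first two columns of X are drawn:
import matplotlib.pyplot as plt
kmeans = KMeans(n_clusters=k).fit(X)
labels = kmeans.labels_
centers = kmeans.cluster_centers_
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='rainbow')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.show()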
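The value k = 3 used throughout is simply fixed by hand. One common heuristic for checking such a choice, not described in the walkthrough itself, is the elbow method: fit the model for several values of k and look for the point where the within-cluster sum of squares (`inertia_` in scikit-learn) stops dropping sharply. A minimal sketch on synthetic data, where the blob positions and the range of k values are assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic, well-separated blobs (illustrative data only)
X_demo = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(40, 2))
    for center in ([0, 0], [5, 5], [0, 5])
])

inertias = {}
for k_try in range(1, 7):
    model = KMeans(n_clusters=k_try, n_init=10, random_state=0).fit(X_demo)
    inertias[k_try] = model.inertia_  # within-cluster sum of squared distances

for k_try, value in inertias.items():
    print(k_try, round(value, 2))
```

On data like this the inertia falls steeply up to k = 3 and only slightly afterwards; on real datasets the elbow is often less clear-cut, so the result is a hint rather than a definitive answer.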
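That is the whole process of implementing k-means clustering in Python. The algorithm is simple and easy to understand. It works best on numeric data whose clusters are roughly spherical and of similar size, and its results depend on the chosen k, the initial centroids, and the scale of the features.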