PCA output looks weird for a kmeans scatter plot

Asked: 2015-07-01 01:05:33

Tags: python matplotlib scipy scikit-learn pca

After running PCA on my data and then plotting the kmeans clusters, my plot looks very strange. The cluster centers and the scatter of points make no sense to me. Here is my code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# clicks, conversion, bounce and search are lists of values.
clicks=[2,0,0,8,7,...]
conversion = [1,0,0,6,0...]
bounce = [2,4,5,0,1....]

X = np.array([clicks,conversion, bounce]).T
y = np.array(search)

num_clusters = 5

pca=PCA(n_components=2, whiten=True)
data2D = pca.fit_transform(X)

print data2D
    >>> [[-0.07187948 -0.17784291]
     [-0.07173769 -0.26868727]
     [-0.07173789 -0.26867958]
     ..., 
     [-0.06942414 -0.25040886]
     [-0.06950897 -0.19591147]
     [-0.07172973 -0.2687937 ]]

km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit_transform(X)

labels=km.labels_
centers2D = pca.fit_transform(km.cluster_centers_)

colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]

plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1],  marker='x', c='r')
plt.show()

The red crosses are the cluster centers. Any help would be great. [image: the resulting scatter plot]

2 Answers:

Answer 0 (score: 3)

The order of your PCA and KMeans steps is mixed up...

Here is what you need to do:

  1. Normalize your data.
  2. Perform PCA on X to reduce the dimensionality from 5 to 2 and produce Data2D.
  3. Normalize again.
  4. Cluster Data2D with KMeans.
  5. Plot the centroids on top of Data2D.

    Whereas this is what you did above:

    1. Performed PCA on X to reduce the dimensionality from 5 to 2, producing Data2D.
    2. Clustered the original data X in all 5 dimensions.
    3. Ran a separate PCA on the cluster centroids, which produces a completely different 2D subspace for the centroids.
    4. Plotted the PCA-reduced Data2D with the PCA-reduced centroids on top, even though those centroids are no longer correctly coupled to the points.

      Normalization:

      Take a look at the code below and you will see that it puts the centroids right where they need to be. Normalization is key here, and it is fully reversible. Always normalize your data when you cluster, since the distance metric needs to treat all dimensions equally. Clustering is one of the most important times to normalize your data, but in general... always normalize :-)
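
      (Side note: a comparable rescaling can also be done with scikit-learn's MinMaxScaler, which maps each column onto [0, 1]. A minimal sketch, where df stands for the feature DataFrame used in the code below:)

      from sklearn.preprocessing import MinMaxScaler
      import pandas as pd

      # scale every column of df to the [0, 1] range
      scaler = MinMaxScaler()
      df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)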

      A heuristic discussion that goes beyond the original question:

      The whole point of dimensionality reduction is to make KMeans clustering easier and to project out dimensions that do not add to the variance of the data. So you should pass the reduced data to your clustering algorithm. I would add that very few 5D datasets can be projected down to 2D without losing a lot of variance, i.e. check the PCA diagnostics to see whether 90% of the original variance has been preserved. If not, you may not want to be so aggressive with your PCA.
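
      (As a quick sanity check of how much variance the 2D projection keeps, something like the sketch below could be run first; here X stands for whatever feature matrix is being reduced, e.g. df_norm further down:)

      import numpy as np
      from sklearn.decomposition import PCA

      # fit a full PCA purely as a diagnostic, then look at the cumulative explained variance
      pca_full = PCA().fit(X)
      cum_var = np.cumsum(pca_full.explained_variance_ratio_)
      print(cum_var)
      # smallest number of components that keeps at least 90% of the variance
      n_keep = int(np.argmax(cum_var >= 0.90)) + 1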

      New code:

      import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt
      from sklearn.decomposition import PCA
      from sklearn.cluster import KMeans
      import seaborn as sns
      %matplotlib inline
      
      # read your data, replace 'stackoverflow.csv' with your file path
      df = pd.read_csv('/Users/angus/Desktop/Downloads/stackoverflow.csv', usecols=[0, 2, 4], names=['freq', 'visit_length', 'conversion_cnt'], header=0).dropna()
      
      df.describe()
      
      #Normalize the data
      df_norm = (df - df.mean()) / (df.max() - df.min())
      
      num_clusters = 5
      
      pca=PCA(n_components=2)
      UnNormdata2D = pca.fit_transform(df_norm)
      
      # Check the resulting variance
      var = pca.explained_variance_ratio_
      print "Variance after PCA: ", var
      
      #Normalize again following PCA: data2D
      data2D = (UnNormdata2D - UnNormdata2D.mean()) / (UnNormdata2D.max()-UnNormdata2D.min())
      
      print "Data2D: "
      print data2D
      
      km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
      km.fit_transform(data2D)
      
      labels=km.labels_
      centers2D = km.cluster_centers_
      
      colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
      col_map=dict(zip(set(labels),colors))
      label_color = [col_map[l] for l in labels]
      
      plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
      plt.hold(True)
      plt.scatter(centers2D[:,0], centers2D[:,1],marker='x',s=150.0,color='purple')
      plt.show()
      

      Plot:

      [image: plot from the code above]

      输出:

      Variance after PCA:  [ 0.65725709  0.29875307]
      Data2D: 
      [[-0.00338421 -0.0009403 ]
      [-0.00512081 -0.00095038]
      [-0.00512081 -0.00095038]
      ..., 
      [-0.00477349 -0.00094836]
      [-0.00373153 -0.00094232]
      [-0.00512081 -0.00095038]]
      Initialization complete
      Iteration  0, inertia 51.225
      Iteration  1, inertia 38.597
      Iteration  2, inertia 36.837
      ...
      ...
      Converged at iteration 31
      

      Hope this helps!

Answer 1 (score: 1)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# read your data, replace 'stackoverflow.csv' with your file path
df = pd.read_csv('stackoverflow.csv', usecols=[0, 2, 4], names=['freq', 'visit_length', 'conversion_cnt'], header=0).dropna()
df.describe()

Out[3]: 
              freq  visit_length  conversion_cnt
count  289705.0000   289705.0000     289705.0000
mean        0.2624       20.7598          0.0748
std         0.4399       55.0571          0.2631
min         0.0000        1.0000          0.0000
25%         0.0000        6.0000          0.0000
50%         0.0000       10.0000          0.0000
75%         1.0000       21.0000          0.0000
max         1.0000     2500.0000          1.0000

# binarize freq and conversion_cnt
df.freq = np.where(df.freq > 1.0, 1, 0)
df.conversion_cnt = np.where(df.conversion_cnt > 0.0, 1, 0)

feature_names = df.columns
X_raw = df.values

transformer = PCA(n_components=2)
X_2d = transformer.fit_transform(X_raw)
# over 99.9% variance captured by 2d data
transformer.explained_variance_ratio_

Out[4]: array([  9.9991e-01,   6.6411e-05])

# do clustering
estimator = KMeans(n_clusters=5, init='k-means++', n_init=10, verbose=1)
estimator.fit(X_2d)

labels = estimator.labels_
colors = ['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]

fig, ax = plt.subplots()
ax.scatter(X_2d[:,0], X_2d[:,1], c=label_color)
ax.scatter(estimator.cluster_centers_[:,0], estimator.cluster_centers_[:,1], marker='x', s=50, c='r')

[image: resulting cluster scatter plot]

KMeans tries to minimize within-cluster Euclidean distance, which may or may not be appropriate for your data. Just based on the plot, I would consider a Gaussian Mixture Model for the unsupervised clustering.
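
A minimal sketch of what that could look like with scikit-learn's GaussianMixture (the current mixture API; X_2d is the PCA-reduced matrix from the code above):

from sklearn.mixture import GaussianMixture

# fit a 5-component GMM on the PCA-reduced data; fit_predict gives the most likely component per point
gmm = GaussianMixture(n_components=5, covariance_type='full', random_state=0)
gmm_labels = gmm.fit_predict(X_2d)
probs = gmm.predict_proba(X_2d)  # soft assignments, which plain KMeans does not provide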

Also, if you have a rough idea of which class/label some observations are likely to belong to, you could do semi-supervised learning.
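
For instance, scikit-learn's LabelSpreading can propagate a handful of known labels to the remaining points; a rough sketch, where known_labels is a hypothetical array aligned with X_2d and -1 marks unlabeled rows:

import numpy as np
from sklearn.semi_supervised import LabelSpreading

known_labels = np.full(len(X_2d), -1)   # -1 means "unlabeled"
known_labels[:50] = 0                   # illustration only: pretend a few rows have known labels
known_labels[50:100] = 1

model = LabelSpreading(kernel='knn', n_neighbors=7)
model.fit(X_2d, known_labels)
predicted = model.transduction_         # inferred labels for every point, labeled and unlabeled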