Single-linkage clustering of an edit distance matrix with a distance threshold stopping criterion

Time: 2019-05-02 10:37:00

Tags: python python-3.x scipy cluster-analysis bioinformatics

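Given a square distance matrix, I am trying to assign flat, single-linkage clusters based on edit distance. scipy.cluster.hierarchy.fclusterdata() with criterion='distance' looks like one way to do this, but it does not quite return the clusters I expect for this toy example.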

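Specifically, in the 4x4 distance matrix example below, I expect clusters_50 (which uses t=50) to create 2 clusters, but it actually finds 3. I think the problem is that fclusterdata() does not expect a distance matrix, but fcluster() does not seem to do what I want either.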

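I also looked at sklearn.cluster.AgglomerativeClustering, but that requires n_clusters to be specified, and I want to create as many clusters as needed until the distance threshold I specify is met.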

I see there is a currently unmerged scikit-learn pull request for this exact feature: https://github.com/scikit-learn/scikit-learn/pull/9069

Can anyone point me in the right direction? Clustering with an absolute distance threshold criterion seems like a common use case.

import pandas as pd
from scipy.cluster.hierarchy import fclusterdata

cols = ['a', 'b', 'c', 'd']

df = pd.DataFrame([{'a': 0, 'b': 29467, 'c': 35, 'd': 13},
                   {'a': 29467, 'b': 0, 'c': 29468, 'd': 29470},
                   {'a': 35, 'b': 29468, 'c': 0, 'd': 38},
                   {'a': 13, 'b': 29470, 'c': 38, 'd': 0}],
                  index=cols)

clusters_20 = fclusterdata(df.values, t=20, criterion='distance')
clusters_50 = fclusterdata(df.values, t=50, criterion='distance')
clusters_100 = fclusterdata(df.values, t=100, criterion='distance')

names_clusters_20 = {n: c for n, c in zip(cols, clusters_20)}
names_clusters_50 = {n: c for n, c in zip(cols, clusters_50)}
names_clusters_100 = {n: c for n, c in zip(cols, clusters_100)}

2 Answers:

Answer 0 (score: 0)

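I solved this by passing the output of linkage() to fcluster(). Unlike fclusterdata(), linkage() accepts a precomputed distance matrix in condensed form. For condensed 1-D input, the SciPy docs describe the metric argument as ignored, so metric='precomputed' below is harmless rather than required:

fcluster(linkage(condensed_dm, metric='precomputed'), criterion='distance', t=20)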

import pandas as pd
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

cols = ['a', 'b', 'c', 'd']

df = pd.DataFrame([{'a': 0, 'b': 29467, 'c': 35, 'd': 13},
                   {'a': 29467, 'b': 0, 'c': 29468, 'd': 29470},
                   {'a': 35, 'b': 29468, 'c': 0, 'd': 38},
                   {'a': 13, 'b': 29470, 'c': 38, 'd': 0}],
                  index=cols)

dm_cnd = squareform(df.values)

clusters_20 = fcluster(linkage(dm_cnd, metric='precomputed'), criterion='distance', t=20)
clusters_50 = fcluster(linkage(dm_cnd, metric='precomputed'), criterion='distance', t=50)
clusters_100 = fcluster(linkage(dm_cnd, metric='precomputed'), criterion='distance', t=100)

names_clusters_20 = {n: c for n, c in zip(cols, clusters_20)}
names_clusters_50 = {n: c for n, c in zip(cols, clusters_50)}
names_clusters_100 = {n: c for n, c in zip(cols, clusters_100)}

Solution:

names_clusters_20
>>> {'a': 1, 'b': 3, 'c': 2, 'd': 1}

names_clusters_50
>>> {'a': 1, 'b': 2, 'c': 1, 'd': 1}

names_clusters_100
>>> {'a': 1, 'b': 2, 'c': 1, 'd': 1}

As a function:

import pandas as pd
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_df(df, method='single', threshold=100):
    '''
    Accepts a square distance matrix as an indexed DataFrame and returns a dict of index-keyed flat clusters.
    Performs single-linkage clustering by default; see the scipy.cluster.hierarchy.linkage docs for other methods.
    '''
    # squareform() converts the square matrix into the condensed form that linkage() expects.
    dm_cnd = squareform(df.values)
    clusters = fcluster(linkage(dm_cnd,
                                method=method,
                                metric='precomputed'),
                        criterion='distance',
                        t=threshold)
    names_clusters = {s: c for s, c in zip(df.columns, clusters)}
    return names_clusters
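As a cross-check on what the 'distance' criterion produces under single linkage: the flat clusters at threshold t should coincide with the connected components of the graph that links every pair whose distance is at most t. The sketch below verifies this on the toy matrix with a small union-find and no SciPy at all. The helper name threshold_components is hypothetical and not part of any library.

```python
def threshold_components(names, dist, t):
    """Group names into connected components of the graph with edges dist <= t."""
    parent = {n: n for n in names}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union every pair whose distance is within the threshold.
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if dist[u][v] <= t:
                parent[find(u)] = find(v)

    # Collect members by root; members keep the order of `names`.
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


names = ['a', 'b', 'c', 'd']
dist = {
    'a': {'a': 0, 'b': 29467, 'c': 35, 'd': 13},
    'b': {'a': 29467, 'b': 0, 'c': 29468, 'd': 29470},
    'c': {'a': 35, 'b': 29468, 'c': 0, 'd': 38},
    'd': {'a': 13, 'b': 29470, 'c': 38, 'd': 0},
}

for t in (20, 50, 100):
    print(t, threshold_components(names, dist, t))
```

For this matrix the groupings are {a, d}, {b}, {c} at t=20 and {a, c, d}, {b} at t=50 and t=100, which agrees with the fcluster() labels shown in this answer. This equivalence holds for single linkage only; other linkage methods need the full hierarchy.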


Answer 1 (score: 0)

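You did not set the metric argument.

So fclusterdata() falls back to its default, metric='euclidean', and treats each row of the matrix as an observation vector rather than as precomputed distances.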
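To see why this produces 3 clusters at t=50, it may help to look at the distances fclusterdata() actually works with. The sketch below assumes NumPy and SciPy are installed and recomputes Euclidean distances between the rows of the question's matrix, which is what the default metric does when each row is treated as a 4-dimensional point.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy edit-distance matrix from the question (rows/columns: a, b, c, d).
dm = np.array([[0, 29467, 35, 13],
               [29467, 0, 29468, 29470],
               [35, 29468, 0, 38],
               [13, 29470, 38, 0]], dtype=float)

# Euclidean distances between the ROWS of dm, i.e. the matrix is read as
# four observation vectors instead of as pairwise distances.
row_euclid = squareform(pdist(dm, metric='euclidean'))

labels = ['a', 'b', 'c', 'd']
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f"{labels[i]}-{labels[j]}: edit distance {dm[i, j]:.0f}, "
              f"row-wise Euclidean {row_euclid[i, j]:.2f}")
```

The a-c pair has an edit distance of 35, but its row-wise Euclidean distance is sqrt(3076), roughly 55.5, which is above t=50. That is why c stays in its own cluster in the question's clusters_50, while a and d (about 18.9 apart) still merge.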