Question

I have two CSV data files. The first:

word,centroid
she,1
great,0
good,3
mother,2
father,2
After,4
before,4
.....

The second:

sentences,label
good mother,1
great father,1

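I want to check each sentence against the clustering result. For the sentence good mother, the word good belongs to centroid 3, so the array becomes [0,0,0,1,0]; the word mother belongs to centroid 2, so the array then becomes [0,0,1,1,0] ...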

My code is wrong and overly complicated... can anyone help me?

Here is my code:

centroid

Answer 1

You can use apply() on the sentences column of your DataFrame:

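import numpy as np

# Number of centroids; must be larger than the highest centroid id in df1
MAX_CENTROIDS = 5

def get_centroids(row):
    # One counter per centroid, all starting at zero
    centroids = np.zeros(MAX_CENTROIDS, dtype=int)
    for word in row.split(' '):
        # Skip words that are not listed in df1
        if word in df1['word'].values:
            # Increment the slot of the centroid this word belongs to
            centroids[df1[df1['word'] == word]['centroid'].values] += 1
    return centroids

# Apply to every sentence; each cell of the new column holds one array
df2['centroid'] = df2['sentences'].apply(get_centroids)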

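The result is stored in the new centroid column of df2, one array per sentence.

Here df1 is the DataFrame containing your words and centroids, and df2 is the DataFrame containing your sentences. Set MAX_CENTROIDS to the number of centroids, which is also the length of each result array. Because the centroid ids are used as array indices, it must be at least the largest id plus one (5 for ids 0 to 4).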
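
As a quick check, the same idea can be run on the sample rows from the question. This is only a sketch under a few assumptions: df1 and df2 are built inline instead of being read from your files, each word is assumed to appear only once in df1, and the dictionary word_to_centroid is a name introduced here to replace the repeated DataFrame filtering. Like the function above, it counts how many words fall into each centroid.

```python
import numpy as np
import pandas as pd

MAX_CENTROIDS = 5  # centroid ids in the sample run from 0 to 4

# Sample rows copied from the question (built inline, not read from CSV)
df1 = pd.DataFrame({
    'word': ['she', 'great', 'good', 'mother', 'father', 'After', 'before'],
    'centroid': [1, 0, 3, 2, 2, 4, 4],
})
df2 = pd.DataFrame({
    'sentences': ['good mother', 'great father'],
    'label': [1, 1],
})

# Build the word -> centroid mapping once (assumes unique words in df1)
word_to_centroid = dict(zip(df1['word'], df1['centroid']))

def get_centroids(sentence):
    # One counter per centroid, all starting at zero
    centroids = np.zeros(MAX_CENTROIDS, dtype=int)
    for word in sentence.split():
        # Words missing from df1 are ignored
        if word in word_to_centroid:
            centroids[word_to_centroid[word]] += 1
    return centroids

df2['centroid'] = df2['sentences'].apply(get_centroids)
print(df2['centroid'].tolist())
```

With this sample, good mother maps to [0, 0, 1, 1, 0], as asked for in the question, and great father maps to [1, 0, 1, 0, 0]. The lookup is case-sensitive, so after would not match the entry After.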

Edit

To read the data sample you provided:

import pandas as pd

# The encoding argument may be unnecessary on your system
df1 = pd.read_csv('hasil_cluster.csv', sep=',', encoding='iso-8859-1')

# Drop rows without a centroid
df1.dropna(inplace=True)

# Remove ; from every centroid value and convert the column to integers
df1['centroid'] = df1['centroid;'].apply(lambda x: str(x).replace(';', '')).astype(int)

# Remove the unused column
df1.drop('centroid;', inplace=True, axis=1)
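
The cleaning steps can be tried without the real file. The sketch below is an illustration only: the layout of hasil_cluster.csv is not shown here, so sample_csv is a hypothetical stand-in whose header and values end in a semicolon and whose third row has no centroid.

```python
import io

import pandas as pd

# Hypothetical stand-in for hasil_cluster.csv (the real file is not shown)
sample_csv = """word,centroid;
she,1;
great,0;
unknown,
good,3;
"""

df1 = pd.read_csv(io.StringIO(sample_csv), sep=',')

# Drop rows without a centroid (the 'unknown' row here)
df1 = df1.dropna()

# Strip the trailing ';' and convert the values to integers
df1['centroid'] = (
    df1['centroid;'].astype(str).str.replace(';', '', regex=False).astype(int)
)

# Remove the original, uncleaned column
df1 = df1.drop(columns='centroid;')

print(df1)
```

After these steps only the columns word and centroid remain, and centroid holds plain integers that can be used as array indices.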

Checking data against another CSV file using pandas
