Question

I have a pandas DataFrame with pairs of names in the 'name_x' and 'name_y' columns and an associated id:

    id  name_x  name_y
0   104 molly   james
1   104 sarah   adam
2   236 molly   adam
3   388 adam    sarah
4   388 johnny  pete
5   236 adam    james
6   236 pete    johnny
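
For anyone who wants to reproduce the table above, here is a minimal sketch of how it could be built. The values are copied from the table; the variable name `df` is just an assumption for illustration.

```python
import pandas as pd

# Sample data copied from the table above; the default RangeIndex gives rows 0..6.
df = pd.DataFrame({
    "id": [104, 104, 236, 388, 388, 236, 236],
    "name_x": ["molly", "sarah", "molly", "adam", "johnny", "adam", "pete"],
    "name_y": ["james", "adam", "adam", "sarah", "pete", "james", "johnny"],
})
print(df)
```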

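I want to remove 'duplicate' rows where the id number is the same and the two names appear together in either name column. For example, the row with index 1 is removed because the pair of names 'molly' and 'james' has already appeared with id 104. Similarly, the row with index 6 is removed because the pair of names 'adam' and 'sarah' has already appeared with id 104.

(The ordering of the names doesn't matter.)

Then I'd like to be able to create another DataFrame that shows the count of name pairs, based on how many different ids they appear with, together with those ids, e.g.:

    count   ids        name_x   name_y
0   1       104        molly    james
1   2       [104, 388] sarah    adam
2   1       236        molly    adam
3   2       [388, 236] johnny   pete
4   1       236        adam     james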


I'm new to programming/python/pandas and haven't been able to find an answer yet. Thanks!

Answer 1

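You can first sort the names within each row with np.sort, then groupby the two name columns, converting each group's ids to a set and then to a list, and get the length of each list with str.len.

If necessary, use mask together with indexing with str to turn the one-element lists into scalars.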

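import numpy as np

# sort the two names within each row so (a, b) and (b, a) become the same pair
df[['name_x','name_y']] = np.sort(df[['name_x','name_y']], axis=1)

# collect the distinct ids seen for each pair of names
df = df.groupby(['name_x','name_y'])['id'].apply(lambda x: list(set(x))).reset_index(name='ids')
# number of distinct ids per pair
df['count'] = df['ids'].str.len()
print (df)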
   name_x name_y         ids  count
0    adam  james       [236]      1
1    adam  molly       [236]      1
2    adam  sarah  [104, 388]      2
3   james  molly       [104]      1
4  johnny   pete  [388, 236]      2

# where a pair has a single id, replace the one-element list with that scalar
df['ids'] = df['ids'].mask(df['count'] == 1, df['ids'].str[0])
print (df)
   name_x name_y         ids  count
0    adam  james         236      1
1    adam  molly         236      1
2    adam  sarah  [104, 388]      2
3   james  molly         104      1
4  johnny   pete  [388, 236]      2
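
The code above goes straight to the per-pair summary. For the first part of the question (dropping rows that repeat the same id and the same unordered pair of names), a minimal self-contained sketch might look like this. It rebuilds the sample data from the question and appends one extra row that is purely hypothetical (it is not in the original data), because in the sample as posted no two rows share both an id and a pair. The names `normalized`, `deduped` and `summary` are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
base = pd.DataFrame({
    "id": [104, 104, 236, 388, 388, 236, 236],
    "name_x": ["molly", "sarah", "molly", "adam", "johnny", "adam", "pete"],
    "name_y": ["james", "adam", "adam", "sarah", "pete", "james", "johnny"],
})

# Hypothetical extra row (NOT in the question): same id and same pair as row 0,
# with the names swapped, so the de-duplication has something to remove.
extra = pd.DataFrame({"id": [104], "name_x": ["james"], "name_y": ["molly"]})
df = pd.concat([base, extra], ignore_index=True)

# Sort the two names within each row so (a, b) and (b, a) compare equal.
pairs = np.sort(df[["name_x", "name_y"]].to_numpy(), axis=1)
normalized = df.assign(name_x=pairs[:, 0], name_y=pairs[:, 1])

# Keep only the first row for each (id, unordered pair) combination.
deduped = normalized.drop_duplicates(subset=["id", "name_x", "name_y"])

# One row per pair: the distinct ids (sorted for a stable order) and their count.
summary = (
    deduped.groupby(["name_x", "name_y"])["id"]
    .apply(lambda s: sorted(set(s)))
    .reset_index(name="ids")
)
summary["count"] = summary["ids"].str.len()

print(deduped)
print(summary)
```

Sorting the ids inside the lambda is a small design choice: list(set(x)) leaves the order of the ids unspecified, while sorted(set(s)) makes the output reproducible.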

Remove rows from pandas DataFrame if multiple columns contain the same data but interchanged
