Matching subsets of two columns from two different data frames

Time: 2019-05-13 08:41:12

Tags: python-3.x

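I want to compare a specific column from two different data frames and count how many of its entries (each a comma-separated subset of genes) match between the two frames.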

Condition: if any element of small['genes of cluster'] matches an element of big['genes of cluster'], the output should be: match: 1

In the example below, only OR4F16 appears in both data frames, so the output is: match: 1; unmatch: 3.

    file1: big <tab separated>
    cl    nP    genes of cluster
     1    11    DDX11L1, MIR6859-3, WASH7P, MIR1302-2, FAM138C, FAM138F, FAM138A, OR4F5, LOC729737, LOC102725121, FAM138D
     2     4    OR4F16, OR4F3, OR4F29, LOC100132287
     3    64    LOC100133331, LOC100288069, FAM87B, LINC00115, LINC01128, FAM41C, LINC02593, SAMD11
     4     7    GNB1, CALML6, TMEM52, CFAP74, GABRD, LOC105378591, PRKCZ


    file2: small <tab separated>
    cl    nP    genes of cluster
     1    11    A, B, C, D
     2     4    OR4F16, X, Y, Z

My code (Python 3):

def genes_coordinates(big, small):
    b = pd.read_csv(big, header=0, sep="\t")
    s = pd.read_csv(small, header=0, sep="\t")

    match = 0
    unmatch = 0

    for index, row in b.iterrows():
        if row[row['genes of cluster'].isin(s['genes of cluster'])]:
            match+1
        else:
            unmatch+1
    print("match: ", match, "\nunmatch: ", unmatch)

genes_coordinates('big','small')

1 answer:

Answer 0 (score: 0)

I would go with pandas.merge() and then do the counting with a list comprehension.

import pandas as pd

df1 = pd.DataFrame({'cl':[1,2], 'nP':[11,4], 'gene of cluster':[['A', 'B', 'C', 'D'], ['OR4F16', 'X', 'Y', 'Z']]})
df2 = pd.DataFrame({'cl':[1,2,3,4], 'nP':[11,4,64,7], 'gene of cluster':[['DDX11L1', 'MIR6859-3', 'WASH7P', 'MIR1302-2', 'FAM138C', 'FAM138F', 'FAM138A', 'OR4F5', 'LOC729737', 'LOC102725121', 'FAM138D'], ['OR4F16', 'OR4F3', 'OR4F29', 'LOC100132287'], ['LOC100133331', 'LOC100288069', 'FAM87B', 'LINC00115', 'LINC01128', 'FAM41C', 'LINC02593', 'SAMD11'], ['GNB1', 'CALML6', 'TMEM52', 'CFAP74', 'GABRD', 'LOC105378591', 'PRKCZ']]})

df_m = df1.merge(df2, on=['cl', 'nP'], how='outer')
>>> df_m

   cl  nP  gene of cluster_x                                  gene of cluster_y
0   1  11       [A, B, C, D]  [DDX11L1, MIR6859-3, WASH7P, MIR1302-2, FAM138...
1   2   4  [OR4F16, X, Y, Z]              [OR4F16, OR4F3, OR4F29, LOC100132287]
2   3  64                NaN  [LOC100133331, LOC100288069, FAM87B, LINC00115...
3   4   7                NaN  [GNB1, CALML6, TMEM52, CFAP74, GABRD, LOC10537...

# An np.nan value (a float) on either side means the cluster is present
# in only one frame, so it is an outright 'unmatch'
found = []
for _, row in df_m.iterrows():
    genes_x = row['gene of cluster_x']
    genes_y = row['gene of cluster_y']
    if isinstance(genes_x, float) or isinstance(genes_y, float):
        found.append(0)
    elif any(gene in genes_y for gene in genes_x):
        # At least one gene is shared between the two lists
        found.append(1)
    else:
        found.append(0)
# The counts
match = sum(found)
unmatch = len(found) - match
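
As an alternative, here is a minimal sketch that starts from the tab-separated files in the question rather than from hand-built frames. It rests on two assumptions that the question does not confirm: each cell of 'genes of cluster' is a comma-separated string, and a row of big counts as a match when it shares at least one gene with any row of small (not only the row with the same cl). The names count_matches, to_gene_set, BIG_TSV and SMALL_TSV are hypothetical and exist only for this illustration.

```python
import io

import pandas as pd

# Sample data copied from the question, inlined so the sketch is self-contained.
BIG_TSV = (
    "cl\tnP\tgenes of cluster\n"
    "1\t11\tDDX11L1, MIR6859-3, WASH7P, MIR1302-2, FAM138C, FAM138F, "
    "FAM138A, OR4F5, LOC729737, LOC102725121, FAM138D\n"
    "2\t4\tOR4F16, OR4F3, OR4F29, LOC100132287\n"
    "3\t64\tLOC100133331, LOC100288069, FAM87B, LINC00115, LINC01128, "
    "FAM41C, LINC02593, SAMD11\n"
    "4\t7\tGNB1, CALML6, TMEM52, CFAP74, GABRD, LOC105378591, PRKCZ\n"
)
SMALL_TSV = (
    "cl\tnP\tgenes of cluster\n"
    "1\t11\tA, B, C, D\n"
    "2\t4\tOR4F16, X, Y, Z\n"
)


def to_gene_set(cell):
    """Turn a cell such as 'A, B, C' into {'A', 'B', 'C'}; missing cells give an empty set."""
    if pd.isna(cell):
        return set()
    return {gene.strip() for gene in str(cell).split(",") if gene.strip()}


def count_matches(big, small, column="genes of cluster"):
    """Return (match, unmatch) counted over the rows of `big`."""
    b = pd.read_csv(big, sep="\t")
    s = pd.read_csv(small, sep="\t")

    # Pool every gene that occurs anywhere in `small` (assumption: the cluster id is ignored).
    small_genes = set().union(*s[column].map(to_gene_set))

    # A row of `big` matches when its gene set overlaps the pooled set.
    hits = b[column].map(lambda cell: bool(to_gene_set(cell) & small_genes))
    match = int(hits.sum())
    return match, len(b) - match


match, unmatch = count_matches(io.StringIO(BIG_TSV), io.StringIO(SMALL_TSV))
print("match:", match)      # match: 1
print("unmatch:", unmatch)  # unmatch: 3
```

With real files, pass the two paths in place of the io.StringIO objects. If genes should only be compared between rows that have the same cl, the merge-based approach above is the better fit.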