根据条件在熊猫数据框中创建列

时间:2018-10-23 01:37:37

标签: python pandas dataframe tuples

我有一个数据框,想根据条件创建第三列,例如col3 如果col1中存在col2值,则为“是”,否则为“否”

data = [[[('330420', 0.9322496056556702), ('76546', 0.9322003126144409)],76546],[[('330420', 0.9322496056556702), ('500826', 0.9322003126144409)],876546]]
test = pd.DataFrame(data, columns=['col1','col2'])

                                                col1    col2
0  [(330420, 0.9322496056556702), (76546, 0.93220...   76546
1  [(330420, 0.9322496056556702), (500826, 0.9322...  876546

所需结果:

data = [[[('330420', 0.9322496056556702), ('76546', 0.9322003126

    144409)],76546, 'Yes'],[[('330420', 0.9322496056556702), ('500826', 0.9322003126144409)],876546,'No']]
    test = pd.DataFrame(data, columns=['col1','col2', 'col3'])

                                                    col1    col2 col3
    0  [(330420, 0.9322496056556702), (76546, 0.93220...   76546  Yes
    1  [(330420, 0.9322496056556702), (500826, 0.9322...  876546   No

我的解决方案:

test['col3'] = [entry for tag in test['col2'] for entry in test['col1'] if tag in entry]

获取错误:ValueError: Length of values does not match length of index

4 个答案:

答案 0 :(得分:4)

anyzip一起使用

[any([int(z[0])==y for z in x]) for x, y in zip (test.col1,test.col2)]
Out[227]: [True, False]

答案 1 :(得分:1)

您应避免使用序列表。让我们尝试一个矢量化的解决方案:

# extract array of values and reshape
arr = np.array(df.pop('col1').values.tolist()).reshape(-1, 4)

# join to dataframe and replace list of tuples
df = df.join(pd.DataFrame(arr, dtype=float))

# apply test via isin
df['test'] = df.drop('col2', 1).isin(df['col2']).any(1)

print(df)

     col2         0        1         2       3   test
0   76546  330420.0  0.93225   76546.0  0.9322   True
1  876546  330420.0  0.93225  500826.0  0.9322  False

答案 2 :(得分:0)

使用numpy where

test['col3'] = test.apply(lambda x: np.where(str(x.col2) in [i[0] for i in x.col1],"yes", "no"), axis =1)
test['col3']
0    yes
1     no

答案 3 :(得分:0)

您可以使用.apply()

def sublist_checker(row):
    check_both = ['Yes' if str(row['col2']) in sublist else 'No' for sublist in row['col1']]
    check_any = 'Yes' if 'Yes' in check_both else 'No'
    return check_any

test['col3'] = test.apply(sublist_checker, axis=1)
print(test)

                                                   col1    col2 col3
0   [(330420, 0.932249605656), (76546, 0.932200312614)]   76546  Yes
1  [(330420, 0.932249605656), (500826, 0.932200312614)]  876546   No

函数sublist_checkertest['col2']中找到的每个子列表对test['col1']中的每个元素进行行检查,并返回YesNo基于任何子列表中该元素的存在或不存在。