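I have a dataset in which I compare each value in column1 against all values in column2. I was able to create a binary variable for each row that indicates whether the column1 value is found anywhere in column2.
I now want to create a column that lists all the index positions in column2 where the column1 value is found. I am using Python 3.6.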
import pandas as pd
import numpy as np
data = [{'column1': 'ibm', 'column2': 'apple'},
{'column1': 'microsoft', 'column2': 'ibm'},
{'column1': 'apple', 'column2': 'ibm'},
{'column1': 'apple', 'column2': 'microsoft'},
{'column1': 'yahoo', 'column2': 'microsoft'}]
data_df = pd.DataFrame(data)
data_df['match'] = np.where((data_df.column1.isin(data_df['column2'])), 1, 0)
The result is correct for this part:
     column1    column2  match
0        ibm      apple      1
1  microsoft        ibm      1
2      apple        ibm      1
3      apple  microsoft      1
4      yahoo  microsoft      0
To create a list of index positions for each column1 value found in column2, I tried:
data_df['indices'] = [i for i, x in enumerate(data_df['column2']) if x == np.where((data_df.column1.isin(data_df['column2'])))]
However, I get the following error:
data_df['indices'] = [i for i, x in enumerate(data_df['split2']) if x == np.where((data_df.split1.isin(data_df['split2'])))]
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/home/carterrees/PycharmProjects/data_services_predictopotamus/venv_predictopotamus36/lib64/python3.6/site-packages/pandas/core/frame.py", line 3119, in __setitem__
self._set_item(key, value)
File "/home/carterrees/PycharmProjects/data_services_predictopotamus/venv_predictopotamus36/lib64/python3.6/site-packages/pandas/core/frame.py", line 3194, in _set_item
value = self._sanitize_column(key, value)
File "/home/carterrees/PycharmProjects/data_services_predictopotamus/venv_predictopotamus36/lib64/python3.6/site-packages/pandas/core/frame.py", line 3391, in _sanitize_column
value = _sanitize_index(value, self.index, copy=False)
File "/home/carterrees/PycharmProjects/data_services_predictopotamus/venv_predictopotamus36/lib64/python3.6/site-packages/pandas/core/series.py", line 4001, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index
What I would like to see is this:
     column1    column2  match indices
0        ibm      apple      1     1,2
1  microsoft        ibm      1     3,4
2      apple        ibm      1       0
3      apple  microsoft      1       0
4      yahoo  microsoft      0     NaN
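Answer 0 (score: 1)
You can build the "indices" column efficiently by first creating a dictionary that maps each company to its index positions in "column2", and then looking up each value of "column1" in that dictionary in a single linear pass.
After that, you can derive the "match" column from "indices".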
from collections import defaultdict
d = defaultdict(list)
for i, company in enumerate(df['column2']):
d[company].append(str(i))
d
# defaultdict(list, {'apple': ['0'], 'ibm': ['1', '2'], 'microsoft': ['3', '4']})
# Now comes the fun part.
idx_mapping = {k: ','.join(v) for k, v in d.items()}
df['indices'] = [idx_mapping.get(x, np.nan) for x in df['column1']]
df['match'] = df['indices'].notna()
df
column1 column2 match indices
0 ibm apple True 1,2
1 microsoft ibm True 3,4
2 apple ibm True 0
3 apple microsoft True 0
4 yahoo microsoft False NaN
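The ValueError in the question is consistent with how a filtering list comprehension behaves: it yields a single flat list whose length need not equal the number of rows, whereas a new column needs exactly one entry per row. As a minimal, self-contained sketch of the dictionary approach above, here is a variant that rebuilds the question's data_df and keeps the positions as lists of integers rather than comma-joined strings. The name positions and the choice of an empty list for "no match" are assumptions made for illustration, not part of the original answer.

```python
from collections import defaultdict

import pandas as pd

# Rebuild the sample data from the question.
data_df = pd.DataFrame({
    'column1': ['ibm', 'microsoft', 'apple', 'apple', 'yahoo'],
    'column2': ['apple', 'ibm', 'ibm', 'microsoft', 'microsoft'],
})

# Map each column2 value to the row positions where it occurs.
positions = defaultdict(list)  # hypothetical name, for illustration only
for i, company in enumerate(data_df['column2']):
    positions[company].append(i)

# Produce exactly one entry per row of column1, so the result always has
# the same length as the DataFrame index. Missing companies get an empty list.
indices = [list(positions.get(company, [])) for company in data_df['column1']]

data_df['indices'] = pd.Series(indices, index=data_df.index)
data_df['match'] = [len(ix) > 0 for ix in indices]
```

Using positions.get(...) instead of positions[...] avoids silently adding keys to the defaultdict for companies that never appear in column2.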
Answer 1 (score: 1)
Using factorize + stack + np.flatnonzero:
f, l = pd.factorize(df.stack())
r = f.reshape(df.shape)
m = r[:, 0, None] == r[:, 1]
df.assign(
indices=[np.flatnonzero(c) for c in m],
match=m.sum(1).astype(bool)
)
column1 column2 indices match
0 ibm apple [1, 2] True
1 microsoft ibm [3, 4] True
2 apple ibm [0] True
3 apple microsoft [0] True
4 yahoo microsoft [] False
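To make the broadcasting step in this answer easier to follow, here is a small sketch that spells out the intermediate arrays. It assumes a DataFrame with only the two string columns and, as a swap made purely for illustration, uses to_numpy().ravel() in place of stack(); the names col1 and col2 are likewise not from the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'column1': ['ibm', 'microsoft', 'apple', 'apple', 'yahoo'],
    'column2': ['apple', 'ibm', 'ibm', 'microsoft', 'microsoft'],
})

# Encode every string as an integer code; equal strings share the same code.
# Flattening row by row keeps each row's pair of values adjacent.
codes, uniques = pd.factorize(df[['column1', 'column2']].to_numpy().ravel())
r = codes.reshape(-1, 2)

col1 = r[:, 0]  # codes for column1, shape (n,)
col2 = r[:, 1]  # codes for column2, shape (n,)

# col1[:, None] has shape (n, 1), so comparing it with col2 broadcasts to an
# (n, n) boolean matrix where m[i, j] is True when column1[i] == column2[j].
m = col1[:, None] == col2

# The positions of the True values in each row are the matching indices.
indices = [np.flatnonzero(row).tolist() for row in m]
match = m.any(axis=1)
```

Note that the comparison matrix has n × n entries, so memory use grows quadratically with the number of rows; for large frames the dictionary-based approach may be the more economical choice.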