Python / Pandas: Identifying duplicates across columns

Time: 2018-07-05 03:27:36

Tags: python pandas duplicates

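In the code below, I want to identify and report the values in Col1 that appear in Col2, the values in Col2 that appear in Col1, and, overall, the values that appear more than once.

In the example below, the values AAPL and GOOG appear in both Col1 and Col2. These should be identified and reported in the next two columns, and a further column should flag whether either the Col1 or the Col2 value in that row is a duplicate.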

{{1}}

This is how the result will appear in Excel

4 answers:

Answer 0: (score: 1)

Here is a solution that works with your code. It just uses a few loops with iterrows(). Nothing fancy.

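# Start with every flag set to False
df['Col3'] = False
df['Col4'] = False
df['Col5'] = False

# Col3: the row's Col1 value also appears somewhere in Col2
for i, row in df.iterrows():
    if df.loc[i, 'Col1'] in (df.Col2.values):
        df.loc[i, 'Col3'] = True

# Col4: the row's Col2 value also appears somewhere in Col1
for i, row in df.iterrows():
    if df.loc[i, 'Col2'] in (df.Col1.values):
        df.loc[i, 'Col4'] = True

# Col5: either of the two flags above is set
for i, row in df.iterrows():
    if df.loc[i, 'Col3'] | df.loc[i, 'Col4'] == True:
        df.loc[i, 'Col5'] = True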
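
The three loops above rescan a whole column for every row. A minimal sketch of the same idea with the lookups built once as sets is shown here; it is not part of the original answer. It assumes the sample data from the scripts later in this thread, and the names `col1_values` and `col2_values` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the scripts later in this thread.
df = pd.DataFrame({
    'Col1': ['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],
    'Col2': ['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ'],
})

# Build each lookup set once; dropna() keeps NaN out of the sets,
# so a missing value can never be reported as a match.
col1_values = set(df['Col1'].dropna())
col2_values = set(df['Col2'].dropna())

# Same three flags as the loops above, one membership test per row.
df['Col3'] = [value in col2_values for value in df['Col1']]
df['Col4'] = [value in col1_values for value in df['Col2']]
df['Col5'] = df['Col3'] | df['Col4']

print(df)
```

On a frame this small the difference does not matter; the set version is mainly worth considering when the columns are long.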


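Answer 1: (score: 1)

Use numpy where to check whether a value in one column occurs in the other column, then take the boolean OR of those columns to check whether it is a duplicate.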

import numpy as np

# The isnull() check keeps NaN rows from being flagged as matches
df['Col1inCol2']=np.where(df.Col1.isin(df.Col2) & ~df.Col1.isnull(), True, False)
df['Col2inCol1']=np.where(df.Col2.isin(df.Col1) & ~df.Col2.isnull(), True, False)
df['Dupe']= df.Col1inCol2 | df.Col2inCol1



   Col1  Col2  Col1inCol2  Col2inCol1   Dupe
0  AAPL  GOOG        True        True   True
1   NaN   IBM       False       False  False
2  GOOG  MSFT        True       False   True
3   MMM   NaN       False       False  False
4   NaN  GOOG       False        True   True
5  INTC  AAPL       False        True   True
6    FB    VZ       False       False  False
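
The null check in this answer matters because, as far as I know, `isin` treats a NaN in one column as matching a NaN in the other. The sketch below is not from the answer: it assumes the sample data from the scripts later in this thread and shows that the boolean mask can be assigned directly, without wrapping it in `np.where(..., True, False)`. The variable `naive` is only there to let you inspect what `isin` returns on its own.

```python
import numpy as np
import pandas as pd

# Sample data copied from the scripts later in this thread.
df = pd.DataFrame({
    'Col1': ['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],
    'Col2': ['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ'],
})

# isin() on its own, kept only for comparison with the guarded version.
naive = df['Col1'].isin(df['Col2'])

# The mask is already boolean, so it can be stored as is;
# notna() plays the role of ~isnull() in the answer.
df['Col1inCol2'] = df['Col1'].isin(df['Col2']) & df['Col1'].notna()
df['Col2inCol1'] = df['Col2'].isin(df['Col1']) & df['Col2'].notna()
df['Dupe'] = df['Col1inCol2'] | df['Col2inCol1']

print(naive)
print(df)
```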

Answer 2: (score: 0)

Here is the final script:

##############################################################################
# Code to identify and report duplicates across columns
# np.nan values are handled
# Date: 04-JUL-2018
# Posted by: Salil V Gangal
# Forum: Stack Overflow
##############################################################################

import pandas as pd
import numpy as np
data={'Col1':['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],'Col2':['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']}
df=pd.DataFrame(data,columns=['Col1','Col2'])
print ("Initial DataFrame\n")
print (df)

pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)


df['Col1_val_exists_in_Col2'] = False
df['Col2_val_exists_in_Col1'] = False
df['Dup_in_Frame'] = False

for i, row in df.iterrows():
    if df.loc[i, 'Col1'] in (df.Col2.values):
        df.loc[i, 'Col1_val_exists_in_Col2'] = True

for i, row in df.iterrows():
    if df.loc[i, 'Col2'] in (df.Col1.values):
        df.loc[i, 'Col2_val_exists_in_Col1'] = True

for i, row in df.iterrows():
    if df.loc[i, 'Col1_val_exists_in_Col2'] | df.loc[i, 'Col2_val_exists_in_Col1'] == True:
        df.loc[i, 'Dup_in_Frame'] = True

print ("Final DataFrame\n")
print (df)

Answer 3: (score: 0)

Another way to accomplish the task is given below, with thanks to "skrubber":

##############################################################################
# Code to identify and report duplicates across columns
# np.nan values are handled
# Date: 05-JUL-2018
# Posted by: Salil V Gangal
# Forum: Stack Overflow
##############################################################################

import pandas as pd
import numpy as np
data={ 
       'Col1':
              ['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],
       'Col2':
              ['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']
     }
df=pd.DataFrame(data,columns=['Col1','Col2'])
print ("\n\nInitial DataFrame\n")
print (df)

pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)

df['Col1_val_exists_in_Col2'] = np.where(df.Col1.isin(df.Col2) & ~df.Col1.isnull(), True, False)
df['Col2_val_exists_in_Col1'] = np.where(df.Col2.isin(df.Col1) & ~df.Col2.isnull(), True, False)
df['Dupe'] = df.Col1_val_exists_in_Col2 | df.Col2_val_exists_in_Col1


print ("\n\nFinal DataFrame\n")
print (df)


Initial DataFrame

   Col1  Col2
0  AAPL  GOOG
1   NaN   IBM
2  GOOG  MSFT
3   MMM   NaN
4   NaN  GOOG
5  INTC  AAPL
6    FB    VZ


Final DataFrame

   Col1  Col2  Col1_val_exists_in_Col2  Col2_val_exists_in_Col1   Dupe
0  AAPL  GOOG                     True                     True   True
1   NaN   IBM                    False                    False  False
2  GOOG  MSFT                     True                    False   True
3   MMM   NaN                    False                    False  False
4   NaN  GOOG                    False                     True   True
5  INTC  AAPL                    False                     True   True
6    FB    VZ                    False                    False  False
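
The question also asks about values that occur more than once overall. One possible reading of that, sketched here as an alternative and not taken from any of the answers, is to count every value across both columns and flag rows holding a value whose count exceeds one. The column name `Dup_overall` and the variable `counts` are made up for illustration. Note that this definition would also flag a value repeated within a single column, which the cross-column checks above do not.

```python
import numpy as np
import pandas as pd

# Same sample data as in the scripts above.
df = pd.DataFrame({
    'Col1': ['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],
    'Col2': ['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ'],
})

# Count each value across both columns; value_counts() skips NaN by default.
counts = pd.concat([df['Col1'], df['Col2']]).value_counts()

# map() looks up each cell's count; NaN cells get NaN, and NaN > 1 is False.
col1_repeated = df['Col1'].map(counts).gt(1)
col2_repeated = df['Col2'].map(counts).gt(1)

df['Dup_overall'] = col1_repeated | col2_repeated

print(counts)
print(df)
```

For this sample the new column agrees with the Dupe column above, because every repeated value happens to appear in both columns.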