Question

df1

|Invoice #  |Date        |Amount       
|12         |12/15/2015  |$10 
|13         |12/16/2015  |$11 
|14         |12/17/2015  |$12

df2

|Invoice #  |Date        |Amount
|12         |1/16/2016   |$10 
|14         |1/17/2016   |$12

Merged = df1.merge（df2，how = left，on = Invoice＃）

|Invoice #  |Date         |Amount
|12         |12/15/2015   |$10
|NaN        |NaN          |NaN
|14         |1/17/2016    |$12

我想做的是使用Invoice 13在合并中返回NaN值并将其放入列表中。有什么想法吗？

Answer 1

Your merged result is not showing what actually happens with a left merge?

Here's what I get when I try to reproduce what I think you're trying to do (I'm using pandas version 0.19.0):

merged = df1.merge(df2, how='left', on='Invoice #')

Then you can mask by the missing values and get a dataframe containing those rows:

merged[merged['Amount_y'].isnull()]

Or just create a column with the boolean flag:

merged['missing_from_df2'] = merged['Amount_y'].isnull()

To select things from the masked dataframe, treat it like any other dataframe, and index into one or more columns by listing them (note that if you want more than one, you have to do double brackets).

You can save it to a new variable to make the syntax simpler if you want to do other things with it.

Answer 2

method 1
pd.concat + drop_duplicates

pd.concat([df1, df2]).drop_duplicates(subset=['Invoice #'])

method 2
combine_first

df1.set_index('Invoice #').combine_first(df2.set_index('Invoice #')).reset_index()

method 3
merge

df1.merge(df2, on='Invoice #', suffixes=['', '_'], how='left')[df1.columns]

method 4
join

df1.join(df2.set_index('Invoice #'), on='Invoice #', rsuffix='_')[df1.columns]

all produce

timing
pd.concat + drop_duplicates is the fastest

替换由pandas合并产生的缺失值

2 个答案: