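Comparing DataFrames and extracting their differences

Time: 2021-01-05 07:09:21

Tags: python pandas dataframe

I have a complicated problem, but I'll try to explain it in as much detail as I can. I have the following 2 DataFrames, and I need to run some comparisons and put the differences into another DataFrame. The comparison criteria are listed below.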

initial = pd.DataFrame({'ProductID': ['123', '456', '789', '000','231'],
                     'ProductName': ['Apple','Pear','Mango','Banana','Jackfruit'],
                     'DiscountedPrice': ['0.99', '1.00', '1.50', '2.10','2.35'],
                      'DiscountStartDate': ['30/01/2020', '21/06/2020', '01/01/2020', '10/11/2020','05/05/2020'],
                      'DiscountEndDate': ['25/03/2020', '30/07/2020', '30/01/2020', '12/12/2020','06/06/2020']}) 

updated = pd.DataFrame({'ProductID': ['123', '456', '789', '000','231'],
                     'ProductName': ['Apple','Pear','Mango','Banana','Jackfruit'],
                     'DiscountedPrice': ['0.53', '1.00', '0.99', '2.00','2.35'],
                      'DiscountStartDate': ['30/01/2020', '21/06/2020', '15/01/2020', '30/11/2020','09/10/2020'],
                      'DiscountEndDate': ['25/03/2020', '30/07/2020', '30/01/2020', '12/12/2020','31/10/2020']}) 
 

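The comparison criteria are:

(1) If the discounted price and the start/end dates are the same in both DataFrames, ignore the row.

(2) If the discounted price is the same but the start/end dates differ, I need both entries in my 'changes' DataFrame.

(3) If the discounted price differs but the start and end dates are the same, I need the DiscountedPrice and start/end dates from the 'updated' DataFrame in my 'changes' DataFrame.

(4) If the discounted price differs and the start/end dates overlap in some way, I need to adjust the initial end date to the updated start date minus 1 day and put both entries in my 'changes' DataFrame.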
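
The date arithmetic in criterion (4) can be sketched as follows. This is only an illustration using the Mango row's updated start date; the variable names are made up for the example, and the explicit `format` argument is an assumption that all dates are day-first strings.

```python
import pandas as pd

# Parse the day-first string explicitly so 15/01/2020 is read as 15 January.
updated_start = pd.to_datetime("15/01/2020", format="%d/%m/%Y")

# Criterion (4): the initial entry should end the day before the updated entry starts.
adjusted_initial_end = updated_start - pd.Timedelta(days=1)

print(adjusted_initial_end.strftime("%d/%m/%Y"))  # 14/01/2020
```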

Basically, the 'changes' DataFrame output must look like the table below.

ProductID  ProductName  DiscountedPrice  DiscountStartDate  DiscountEndDate
123        Apple        0.53             30/01/2020         25/03/2020
789        Mango        1.50             01/01/2020         14/01/2020
789        Mango        0.99             15/01/2020         30/01/2020
000        Banana       2.10             10/11/2020         29/11/2020
000        Banana       2.00             30/11/2020         12/12/2020
231        Jackfruit    2.35             05/05/2020         06/06/2020
231        Jackfruit    2.35             09/10/2020         31/10/2020

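Can anyone help me with this?

1 Answer:

Answer 0: (score: 1)

Merge the two DataFrames so that logic can be applied to identify all four cases. Once each row's case is known, the dates can be modified and the results concatenated. For transparency, the case is added as a column of the changes DataFrame.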
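
As a quick illustration of the merge step, here is a minimal sketch with two hypothetical one-row frames (`left` and `right` are not the data from the question). It only shows how `suffixes` puts the initial and updated values side by side so they can be compared column against column.

```python
import pandas as pd

# Hypothetical one-row frames, used only to show the column layout after merging.
left = pd.DataFrame({"ProductID": ["123"], "DiscountedPrice": ["0.99"]})
right = pd.DataFrame({"ProductID": ["123"], "DiscountedPrice": ["0.53"]})

# Columns that exist in both frames (other than the key) get the given suffixes.
merged = left.merge(right, on="ProductID", suffixes=("_i", "_u"))
print(list(merged.columns))
# ['ProductID', 'DiscountedPrice_i', 'DiscountedPrice_u']

# Row-wise comparison of the initial and updated price.
price_changed = merged["DiscountedPrice_i"].ne(merged["DiscountedPrice_u"])
```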

import numpy as np
import pandas as pd

initial = pd.DataFrame({'ProductID': ['123', '456', '789', '000','231'],
                     'ProductName': ['Apple','Pear','Mango','Banana','Jackfruit'],
                     'DiscountedPrice': ['0.99', '1.00', '1.50', '2.10','2.35'],
                      'DiscountStartDate': ['30/01/2020', '21/06/2020', '01/01/2020', '10/11/2020','05/05/2020'],
                      'DiscountEndDate': ['25/03/2020', '30/07/2020', '30/01/2020', '12/12/2020','06/06/2020']}) 

updated = pd.DataFrame({'ProductID': ['123', '456', '789', '000','231'],
                     'ProductName': ['Apple','Pear','Mango','Banana','Jackfruit'],
                     'DiscountedPrice': ['0.53', '1.00', '0.99', '2.00','2.35'],
                      'DiscountStartDate': ['30/01/2020', '21/06/2020', '15/01/2020', '30/11/2020','09/10/2020'],
                      'DiscountEndDate': ['25/03/2020', '30/07/2020', '30/01/2020', '12/12/2020','31/10/2020']}) 
 
# the dates are day-first strings, so state the format explicitly; otherwise
# ambiguous values such as 10/11/2020 can be read as month-first
for col in ["DiscountStartDate", "DiscountEndDate"]:
    initial[col] = pd.to_datetime(initial[col], format="%d/%m/%Y")
    updated[col] = pd.to_datetime(updated[col], format="%d/%m/%Y")


# merge the two dataframes so the initial (_i) and updated (_u) values sit side by side
dfcat = (initial
 .merge(updated, on=["ProductID"], suffixes=("_i","_u"))
 # cascading logic to mark which of the 4 cases each row falls into
 .assign(cat=lambda dfa: np.where(dfa["DiscountStartDate_i"].eq(dfa["DiscountStartDate_u"])
                                  &dfa["DiscountEndDate_i"].eq(dfa["DiscountEndDate_u"])
                                  &dfa["DiscountedPrice_i"].eq(dfa["DiscountedPrice_u"])
                                  ,"case1",
                                  # no need to check that the dates differ - done in case1
                                  np.where(dfa["DiscountedPrice_i"].eq(dfa["DiscountedPrice_u"])
                                           ,"case2",
                                  # prices differ from here on; identical dates mean case3
                                  np.where(dfa["DiscountEndDate_i"].eq(dfa["DiscountEndDate_u"])
                                           &dfa["DiscountStartDate_i"].eq(dfa["DiscountStartDate_u"])
                                           # everything left is treated as case4 (overlap itself is not tested)
                                           ,"case3", "case4")))
         # case 4: end the initial discount the day before the updated one starts
         ,DiscountEndDate_i=lambda dfa: np.where(dfa["cat"].eq("case4"),
                                                 dfa["DiscountStartDate_u"] - pd.to_timedelta(1,unit="d"),
                                                 dfa["DiscountEndDate_i"])
 ))

# utility to filter the rows of one case and keep/rename the columns of one side
def chngrows(df, case, ind):
    # use the frame passed in (not the global dfcat) and match the suffix at the end only
    cols = [c for c in df.columns if c.endswith(ind)]
    return (df
            .loc[df["cat"].eq(case), ["ProductID"] + cols]
            .rename(columns={c: c[:-len(ind)] for c in cols})
            .assign(cat=f"{case}{ind}")
           )


changes = pd.concat([
    chngrows(dfcat, "case2", "_i"),
    chngrows(dfcat, "case2", "_u"),
    chngrows(dfcat, "case3", "_u"),
    chngrows(dfcat, "case4", "_i"),
    chngrows(dfcat, "case4", "_u"),
]).sort_values(["ProductID","cat"])

Output (with the dates parsed day-first)

ProductID ProductName DiscountedPrice DiscountStartDate DiscountEndDate      cat
      000      Banana            2.10        2020-11-10      2020-11-29  case4_i
      000      Banana            2.00        2020-11-30      2020-12-12  case4_u
      123       Apple            0.53        2020-01-30      2020-03-25  case3_u
      231   Jackfruit            2.35        2020-05-05      2020-06-06  case2_i
      231   Jackfruit            2.35        2020-10-09      2020-10-31  case2_u
      789       Mango            1.50        2020-01-01      2020-01-14  case4_i
      789       Mango            0.99        2020-01-15      2020-01-30  case4_u
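
Criterion (4) in the question only applies when the date ranges overlap, whereas the cascade above labels every remaining row as case 4. Below is a minimal sketch of one way to make that test explicit, using `np.select` instead of nested `np.where`. The `classify` function, the `d` helper, the `demo` frame and the `"no_overlap"` label are all assumptions made for illustration; the question does not say what should happen to rows whose prices differ and whose date ranges do not overlap.

```python
import numpy as np
import pandas as pd

def classify(merged: pd.DataFrame) -> pd.Series:
    """Label each merged row; expects the _i/_u columns with datetime dates."""
    same_price = merged["DiscountedPrice_i"].eq(merged["DiscountedPrice_u"])
    same_dates = (merged["DiscountStartDate_i"].eq(merged["DiscountStartDate_u"])
                  & merged["DiscountEndDate_i"].eq(merged["DiscountEndDate_u"]))
    # Two closed date ranges overlap when each one starts on or before the other ends.
    overlap = (merged["DiscountStartDate_i"].le(merged["DiscountEndDate_u"])
               & merged["DiscountStartDate_u"].le(merged["DiscountEndDate_i"]))
    # np.select picks the label of the first condition that is True for each row.
    conditions = [same_price & same_dates, same_price, same_dates, overlap]
    labels = ["case1", "case2", "case3", "case4"]
    # "no_overlap" is a hypothetical label for rows not covered by the four criteria.
    return pd.Series(np.select(conditions, labels, default="no_overlap"),
                     index=merged.index)

def d(values):
    # Helper for the demo data: parse day-first date strings.
    return pd.to_datetime(values, format="%d/%m/%Y")

# Made-up rows: the first overlaps, the second does not.
demo = pd.DataFrame({
    "DiscountedPrice_i": ["1.50", "3.00"],
    "DiscountedPrice_u": ["0.99", "2.50"],
    "DiscountStartDate_i": d(["01/01/2020", "01/02/2020"]),
    "DiscountEndDate_i": d(["30/01/2020", "10/02/2020"]),
    "DiscountStartDate_u": d(["15/01/2020", "01/03/2020"]),
    "DiscountEndDate_u": d(["30/01/2020", "10/03/2020"]),
})

print(classify(demo).tolist())  # ['case4', 'no_overlap']
```

With the data from the question every differing-price row does overlap, so this gives the same labels as the answer's cascade; the extra branch only matters for data where the updated discount starts after the initial one has ended (or ends before it begins).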