Why am I getting different results between PySpark and SQL?

Time: 2018-06-05 14:00:44

Tags: apache-spark pyspark apache-spark-sql

I am trying to translate the SQL query below into PySpark using two different syntaxes, but the two pieces of code give different outputs, and neither of them matches the SQL output. I cannot figure out where the actual difference between these pieces of code lies.

select count(*) from (
select afpo.charg as Batch_Number,
mara1.matkl as Material_Group,
mara1.zzmanu_stg as Mfg_Stage_Code,
mkpf.budat as WCB_261_Posting_Date,
mch1.hsdat as Manufacturing_Date
from 
opssup_dev_wrk_sap.src_sap_afpo afpo 
inner join opssup_dev_wrk_sap.src_sap_mara mara1 on afpo.matnr=mara1.matnr
inner join opssup_dev_wrk_sap.src_sap_mseg mseg on afpo.aufnr=mseg.aufnr
inner join opssup_dev_wrk_sap.src_sap_mkpf mkpf on mseg.mblnr=mkpf.mblnr
inner join opssup_dev_wrk_sap.src_sap_mara mara on mseg.matnr=mara.matnr
inner join opssup_dev_wrk_sap.src_sap_mch1 mch1 on afpo.charg=mch1.charg
where mara.zzmanu_stg='WCB'
and mseg.bwart='261')
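
For reference, the 2505-row baseline below comes from running this statement directly. A minimal sketch of how that count could be reproduced through the same SQLContext used in the DataFrame code further down (the subquery alias q is added here only because some engines require one):

baseline_count = sqlContext.sql("""
    select count(*) from (
        select afpo.charg as Batch_Number,
               mara1.matkl as Material_Group,
               mara1.zzmanu_stg as Mfg_Stage_Code,
               mkpf.budat as WCB_261_Posting_Date,
               mch1.hsdat as Manufacturing_Date
        from opssup_dev_wrk_sap.src_sap_afpo afpo
        inner join opssup_dev_wrk_sap.src_sap_mara mara1 on afpo.matnr = mara1.matnr
        inner join opssup_dev_wrk_sap.src_sap_mseg mseg  on afpo.aufnr = mseg.aufnr
        inner join opssup_dev_wrk_sap.src_sap_mkpf mkpf  on mseg.mblnr = mkpf.mblnr
        inner join opssup_dev_wrk_sap.src_sap_mara mara  on mseg.matnr = mara.matnr
        inner join opssup_dev_wrk_sap.src_sap_mch1 mch1  on afpo.charg = mch1.charg
        where mara.zzmanu_stg = 'WCB'
          and mseg.bwart = '261'
    ) q
""").collect()[0][0]
print(baseline_count)  # 2505 per the figure quoted below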
  

It returns 2505 rows. The execution plan of the SQL query above:

*(15) Project [charg#72 AS Batch_Number#327407, matkl#126 AS Material_Group#327408, zzmanu_stg#275 AS Mfg_Stage_Code#327409, budat#511 AS WCB_261_Posting_Date#327410, hsdat#571 AS Manufacturing_Date#327411]
+- *(15) SortMergeJoin [charg#72], [charg#543], Inner
   :- *(12) Sort [charg#72 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(charg#72, 200)
   :     +- *(11) Project [charg#72, matkl#126, zzmanu_stg#275, budat#511]
   :        +- *(11) BroadcastHashJoin [matnr#321], [matnr#327416], Inner, BuildRight, false
   :           :- *(11) Project [charg#72, matkl#126, zzmanu_stg#275, matnr#321, budat#511]
   :           :  +- *(11) SortMergeJoin [mblnr#313], [mblnr#505], Inner
   :           :     :- *(7) Sort [mblnr#313 ASC NULLS FIRST], false, 0
   :           :     :  +- Exchange hashpartitioning(mblnr#313, 200)
   :           :     :     +- *(6) Project [charg#72, matkl#126, zzmanu_stg#275, mblnr#313, matnr#321]
   :           :     :        +- *(6) ...

I translated this SQL into PySpark as follows:

afpo_df = sqlContext.table(sap_source_schema + ".src_sap_afpo").alias('afpo_df')
mara1_df = sqlContext.table(sap_source_schema + ".src_sap_mara").alias('mara1_df')
mseg_df = sqlContext.table(sap_source_schema + ".src_sap_mseg").alias('mseg_df')
mkpf_df = sqlContext.table(sap_source_schema + ".src_sap_mkpf").alias('mkpf_df')
mara_df = sqlContext.table(sap_source_schema + ".src_sap_mara").alias('mara_df')
mch1_df = sqlContext.table(sap_source_schema + ".src_sap_mch1").alias('mch1_df')

temp12_df = afpo_df \
    .join(mara1_df,(afpo_df.matnr==mara1_df.matnr)) \
    .join(mseg_df,(afpo_df.aufnr==mseg_df.aufnr)) \
    .join(mkpf_df,(mseg_df.mblnr==mkpf_df.mblnr)) \
    .join(mara_df,(mseg_df.matnr==mara_df.matnr)) \
    .join(mch1_df,(afpo_df.charg==mch1_df.charg)) \
    .filter("mseg_df.bwart=='261' AND mara_df.zzmanu_stg=='WCB'") \
    .select(afpo_df.charg.alias('Batch_Number'),mara1_df.matkl.alias('Material_Group'),mara1_df.zzmanu_stg.alias('Mfg_Stage_Code'), \
            mkpf_df.budat.alias('WCB_261_Posting_Date'),mch1_df.hsdat.alias('Manufacturing_Date'))

target_df = temp12_df
print(target_df.count())
  

It returns about 13 lakh (roughly 1.3 million) rows.
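
The physical plans quoted in this post can be printed with DataFrame.explain(); for example:

temp12_df.explain()        # prints the physical plan shown below
# temp12_df.explain(True)  # additionally shows the parsed, analyzed and optimized logical plans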

The corresponding physical plan for this code:

== Physical Plan ==
*(15) Project [charg#72 AS Batch_Number#322732, matkl#126 AS Material_Group#322733, zzmanu_stg#275 AS Mfg_Stage_Code#322734, budat#511 AS WCB_261_Posting_Date#322735, hsdat#571 AS Manufacturing_Date#322736]
+- *(15) BroadcastNestedLoopJoin BuildRight, Inner
   :- *(15) Project [charg#72, matkl#126, zzmanu_stg#275, budat#511, hsdat#571]
   :  +- *(15) SortMergeJoin [charg#72], [charg#543], Inner
   :     :- *(11) Sort [charg#72 ASC NULLS FIRST], false, 0
   :     :  +- Exchange hashpartitioning(charg#72, 200)
   :     :     +- *(10) Project [charg#72, matkl#126, zzmanu_stg#275, budat#511]
   :     :        +- *(10) SortMergeJoin [mblnr#313], [mblnr#505], Inner
   :     :           :- *(7) Sort [mblnr#313 ASC NULLS FIRST], false, 0
   :     :           :  +- Exchange hashpartitioning(mblnr#313, 200)
   :     :           :     +- *(6) Project [charg#72, matkl#126, zzmanu_stg#275, mblnr#313]
   :     :           :        +- *(6) SortMergeJoin [aufnr#14, matnr#116], [aufnr#368, matnr#321], Inner
   :     :           :           :- *(3) Sort [aufnr#14 ASC NULLS FIRST, matnr#116 ASC NULLS FIRST], false, 0
   :     :           :           :  +- Exchange hashpartitioning(aufnr#14, matnr#116, 200)
   :     :           :           :     +- *(2) Project [aufnr#14, charg#72, matnr#116, matkl#126, zzmanu_stg#275]
   :     :           :           :        +- *(2) BroadcastHashJoin [matnr#33], [matnr#116], Inner, BuildRight, false
   :     :           :           :           :- *(2) Project [aufnr#14, matnr#33, charg#72]
   :     :           :           :           :  +- *(2) Filter ((isnotnull(matnr#33) && isnotnull(aufnr#14)) && isnotnull(charg#72))
   :     :           :           :           :     +- *(2) FileScan parquet opssup_dev_wrk_sap.src_sap_afpo[aufnr#14,matnr#33,charg#72] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_afpo], PartitionFilters: [], PushedFilters: [IsNotNull(matnr), IsNotNull(aufnr), IsNotNull(charg)], ReadSchema: struct<aufnr:string,matnr:string,charg:string>
   :     :           :           :           +- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(input[0, string, true]))
   :     :           :           :              +- *(1) Project [matnr#116, matkl#126, zzmanu_stg#275]
   :     :           :           :                 +- *(1) Filter isnotnull(matnr#116)
   :     :           :           :                    +- *(1) FileScan parquet opssup_dev_wrk_sap.src_sap_mara[matnr#116,matkl#126,zzmanu_stg#275] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mara], PartitionFilters: [], PushedFilters: [IsNotNull(matnr)], ReadSchema: struct<matnr:string,matkl:string,zzmanu_stg:string>
   :     :           :           +- *(5) Sort [aufnr#368 ASC NULLS FIRST, matnr#321 ASC NULLS FIRST], false, 0
   :     :           :              +- Exchange hashpartitioning(aufnr#368, matnr#321, 200)
   :     :           :                 +- *(4) Project [mblnr#313, matnr#321, aufnr#368]
   :     :           :                    +- *(4) Filter ((((isnotnull(bwart#319) && (bwart#319 = 261)) && isnotnull(matnr#321)) && isnotnull(aufnr#368)) && isnotnull(mblnr#313))
   :     :           :                       +- *(4) FileScan parquet opssup_dev_wrk_sap.src_sap_mseg[mblnr#313,bwart#319,matnr#321,aufnr#368] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mseg], PartitionFilters: [], PushedFilters: [IsNotNull(bwart), EqualTo(bwart,261), IsNotNull(matnr), IsNotNull(aufnr), IsNotNull(mblnr)], ReadSchema: struct<mblnr:string,bwart:string,matnr:string,aufnr:string>
   :     :           +- *(9) Sort [mblnr#505 ASC NULLS FIRST], false, 0
   :     :              +- Exchange hashpartitioning(mblnr#505, 200)
   :     :                 +- *(8) Project [mblnr#505, budat#511]
   :     :                    +- *(8) Filter isnotnull(mblnr#505)
   :     :                       +- *(8) FileScan parquet opssup_dev_wrk_sap.src_sap_mkpf[mblnr#505,budat#511] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mkpf], PartitionFilters: [], PushedFilters: [IsNotNull(mblnr)], ReadSchema: struct<mblnr:string,budat:string>
   :     +- *(13) Sort [charg#543 ASC NULLS FIRST], false, 0
   :        +- Exchange hashpartitioning(charg#543, 200)
   :           +- *(12) Project [charg#543, hsdat#571]
   :              +- *(12) Filter isnotnull(charg#543)
   :                 +- *(12) FileScan parquet opssup_dev_wrk_sap.src_sap_mch1[charg#543,hsdat#571] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mch1], PartitionFilters: [], PushedFilters: [IsNotNull(charg)], ReadSchema: struct<charg:string,hsdat:string>
   +- BroadcastExchange IdentityBroadcastMode
      +- *(14) Project
         +- *(14) Filter (isnotnull(zzmanu_stg#318210) && (zzmanu_stg#318210 = WCB))
            +- *(14) FileScan parquet opssup_dev_wrk_sap.src_sap_mara[zzmanu_stg#318210] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mara], PartitionFilters: [], PushedFilters: [IsNotNull(zzmanu_stg), EqualTo(zzmanu_stg,WCB)], ReadSchema: struct<zzmanu_stg:string>

Then I tried it again like this:

afpo_df = sqlContext.table(sap_source_schema + ".src_sap_afpo").alias('afpo_df')
mara1_df = sqlContext.table(sap_source_schema + ".src_sap_mara").alias('mara1_df')
mseg_df = sqlContext.table(sap_source_schema + ".src_sap_mseg").alias('mseg_df')
mkpf_df = sqlContext.table(sap_source_schema + ".src_sap_mkpf").alias('mkpf_df')
mara_df = sqlContext.table(sap_source_schema + ".src_sap_mara").alias('mara_df')
mch1_df = sqlContext.table(sap_source_schema + ".src_sap_mch1").alias('mch1_df')

temp12_df = afpo_df \
    .join(mara1_df,"matnr") \
    .join(mseg_df,"aufnr") \
    .join(mkpf_df,"mblnr") \
    .join(mara_df,"matnr") \
    .join(mch1_df,"charg") \
    .filter("mseg_df.bwart=='261' AND mara_df.zzmanu_stg=='WCB'") \
    .select(afpo_df.charg.alias('Batch_Number'),mara1_df.matkl.alias('Material_Group'),mara1_df.zzmanu_stg.alias('Mfg_Stage_Code'), \
            mkpf_df.budat.alias('WCB_261_Posting_Date'),mch1_df.hsdat.alias('Manufacturing_Date'))

target_df = temp12_df
print(target_df.count())
  

It returns 1804 rows.

The execution plan for this code:

== Physical Plan ==
*(15) Project [charg#72 AS Batch_Number#301751, matkl#126 AS Material_Group#301752, zzmanu_stg#275 AS Mfg_Stage_Code#301753, budat#511 AS WCB_261_Posting_Date#301754, hsdat#571 AS Manufacturing_Date#301755]
+- *(15) SortMergeJoin [charg#72], [charg#543], Inner
   :- *(12) Sort [charg#72 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(charg#72, 200)
   :     +- *(11) Project [charg#72, matkl#126, zzmanu_stg#275, budat#511]
   :        +- *(11) BroadcastHashJoin [matnr#33], [matnr#300069], Inner, BuildRight, false
   :           :- *(11) Project [matnr#33, charg#72, matkl#126, zzmanu_stg#275, budat#511]
   :           :  +- *(11) SortMergeJoin [mblnr#313], [mblnr#505], Inner
   :           :     :- *(7) Sort [mblnr#313 ASC NULLS FIRST], false, 0
   :           :     :  +- Exchange hashpartitioning(mblnr#313, 200)
   :           :     :     +- *(6) Project [matnr#33, charg#72, matkl#126, zzmanu_stg#275, mblnr#313]
   :           :     :        +- *(6) SortMergeJoin [aufnr#14], [aufnr#368], Inner
   :           :     :           :- *(3) Sort [aufnr#14 ASC NULLS FIRST], false, 0
   :           :     :           :  +- Exchange hashpartitioning(aufnr#14, 200)
   :           :     :           :     +- *(2) Project [matnr#33, aufnr#14, charg#72, matkl#126, zzmanu_stg#275]
   :           :     :           :        +- *(2) BroadcastHashJoin [matnr#33], [matnr#116], Inner, BuildRight, false
   :           :     :           :           :- *(2) Project [aufnr#14, matnr#33, charg#72]
   :           :     :           :           :  +- *(2) Filter ((isnotnull(matnr#33) && isnotnull(aufnr#14)) && isnotnull(charg#72))
   :           :     :           :           :     +- *(2) FileScan parquet opssup_dev_wrk_sap.src_sap_afpo[aufnr#14,matnr#33,charg#72] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_afpo], PartitionFilters: [], PushedFilters: [IsNotNull(matnr), IsNotNull(aufnr), IsNotNull(charg)], ReadSchema: struct<aufnr:string,matnr:string,charg:string>
   :           :     :           :           +- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(input[0, string, true]))
   :           :     :           :              +- *(1) Project [matnr#116, matkl#126, zzmanu_stg#275]
   :           :     :           :                 +- *(1) Filter isnotnull(matnr#116)
   :           :     :           :                    +- *(1) FileScan parquet opssup_dev_wrk_sap.src_sap_mara[matnr#116,matkl#126,zzmanu_stg#275] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mara], PartitionFilters: [], PushedFilters: [IsNotNull(matnr)], ReadSchema: struct<matnr:string,matkl:string,zzmanu_stg:string>
   :           :     :           +- *(5) Sort [aufnr#368 ASC NULLS FIRST], false, 0
   :           :     :              +- Exchange hashpartitioning(aufnr#368, 200)
   :           :     :                 +- *(4) Project [mblnr#313, aufnr#368]
   :           :     :                    +- *(4) Filter (((isnotnull(bwart#319) && (bwart#319 = 261)) && isnotnull(aufnr#368)) && isnotnull(mblnr#313))
   :           :     :                       +- *(4) FileScan parquet opssup_dev_wrk_sap.src_sap_mseg[mblnr#313,bwart#319,aufnr#368] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mseg], PartitionFilters: [], PushedFilters: [IsNotNull(bwart), EqualTo(bwart,261), IsNotNull(aufnr), IsNotNull(mblnr)], ReadSchema: struct<mblnr:string,bwart:string,aufnr:string>
   :           :     +- *(9) Sort [mblnr#505 ASC NULLS FIRST], false, 0
   :           :        +- Exchange hashpartitioning(mblnr#505, 200)
   :           :           +- *(8) Project [mblnr#505, budat#511]
   :           :              +- *(8) Filter isnotnull(mblnr#505)
   :           :                 +- *(8) FileScan parquet opssup_dev_wrk_sap.src_sap_mkpf[mblnr#505,budat#511] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mkpf], PartitionFilters: [], PushedFilters: [IsNotNull(mblnr)], ReadSchema: struct<mblnr:string,budat:string>
   :           +- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(input[0, string, true]))
   :              +- *(10) Project [matnr#300069]
   :                 +- *(10) Filter ((isnotnull(zzmanu_stg#300228) && (zzmanu_stg#300228 = WCB)) && isnotnull(matnr#300069))
   :                    +- *(10) FileScan parquet opssup_dev_wrk_sap.src_sap_mara[matnr#300069,zzmanu_stg#300228] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mara], PartitionFilters: [], PushedFilters: [IsNotNull(zzmanu_stg), EqualTo(zzmanu_stg,WCB), IsNotNull(matnr)], ReadSchema: struct<matnr:string,zzmanu_stg:string>
   +- *(14) Sort [charg#543 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(charg#543, 200)
         +- *(13) Project [charg#543, hsdat#571]
            +- *(13) Filter isnotnull(charg#543)
               +- *(13) FileScan parquet opssup_dev_wrk_sap.src_sap_mch1[charg#543,hsdat#571] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mch1], PartitionFilters: [], PushedFilters: [IsNotNull(charg)], ReadSchema: struct<charg:string,hsdat:string>
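
To make the syntactic difference between the two DataFrame attempts concrete: the first one joins on explicit column conditions (df1.col == df2.col), which keeps both copies of each join key, while the second one joins on a column-name string, which behaves like SQL USING and keeps a single merged key column. A toy sketch with made-up data (not the real SAP tables), just to show the two forms:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

a = spark.createDataFrame([(1, "x"), (2, "y")], ["k", "a_val"])
b = spark.createDataFrame([(1, "p"), (1, "q")], ["k", "b_val"])

# Join on an explicit column condition: the result keeps a.k and b.k
# as two separate columns.
a.join(b, a.k == b.k).printSchema()   # k, a_val, k, b_val

# Join on the column name: behaves like SQL USING(k), so the result
# has a single k column.
a.join(b, "k").printSchema()          # k, a_val, b_val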

Why is this happening, and what is the best way to translate the above SQL query into PySpark?

0 Answers

There are no answers yet.