Spark查询调整-当前查询花费的时间太长

时间:2018-07-05 06:29:35

标签: scala apache-spark rdbms

我有一个如下的火花查询

select  a.unique_id as unique_id,
      first(a.fdp_record_type) as fdp_record_type , 
      first(a.transaction_id) as transaction_id ,
      first(a.primary_account_number) as primary_account_number, 
  ....(58 more similar fields from table a)...,

    first(b.cc_outstanding_by_credit) as cc_outstanding_by_credit,
    first(b.cc_mailing_by_transaction) as cc_mailing_bytransaction, 
    first(b.cc_payment_by_transaction) as cc_payment_bytransaction, 
    first(b.cc_profile_change_by_transaction) as cc_profile_change_by_transaction,

    first(c.probability) as probability

FROM fdp_app_final.fdp_app_output_trans a,
        df1 b, 
        df0 c df1 
 ON a.unique_id=b.unique_id JOIN df0 ON a.unique_id=c.loan_id 
GROUP BY a.unique_id

如果有20000条记录,此查询将花费很长时间才能将三个表连接起来,我无法执行它,但是如果只有100条记录,则该查询效果很好。我怀疑我正在执行的Join语句是问题。调整此效果的最佳方法是什么。抱歉,如果这是一个菜鸟疑问,我是火花和数据库查询的新手。

 10-07-2018 14:05:42 IST 1GetAggregateData INFO - == Physical Plan ==
 10-07-2018 14:05:42 IST 1GetAggregateData INFO - SortAggregate(key=[unique_id#916], functions=[first(niafdp_record_type#917, false), first(transaction_id#918, false), first(primary_account_number#919, false), first(card_sequence_number#920, false), first(issuer_name#921, false), first(brand#922, false), first(ac_open_date#923, false), first(valid_from#924, false), first(valid_to#925, false), first(card_holder_name#926, false), first(date_of_birth#927, false), first(ch_email_id#928, false), first(ch_home_phone_country_code#929L, false), first(ch_home_phone_number#930L, false), first(ch_business_phone_country_code#931L, false), first(ch_business_phone_number#932L, false), first(card_mailing_date#933, false), first(billing_currency_code#934, false), first(billing_address_line1#935, false), first(billing_address_line2#936, false), first(billing_zipcode#937, false), first(billing_city#938, false), first(billing_state#939, false), first(billing_country#940, false), ... 59 more fields])
10-07-2018 14:05:42 IST 1GetAggregateData INFO - +- Sort [unique_id#916 ASC NULLS FIRST], false, 0
10-07-2018 14:05:42 IST 1GetAggregateData INFO -    +- Exchange hashpartitioning(unique_id#916, 200)
10-07-2018 14:05:42 IST 1GetAggregateData INFO -       +- SortAggregate(key=[unique_id#916], functions=[partial_first(niafdp_record_type#917, false), partial_first(transaction_id#918, false), partial_first(primary_account_number#919, false), partial_first(card_sequence_number#920, false), partial_first(issuer_name#921, false), partial_first(brand#922, false), partial_first(ac_open_date#923, false), partial_first(valid_from#924, false), partial_first(valid_to#925, false), partial_first(card_holder_name#926, false), partial_first(date_of_birth#927, false), partial_first(ch_email_id#928, false), partial_first(ch_home_phone_country_code#929L, false), partial_first(ch_home_phone_number#930L, false), partial_first(ch_business_phone_country_code#931L, false), partial_first(ch_business_phone_number#932L, false), partial_first(card_mailing_date#933, false), partial_first(billing_currency_code#934, false), partial_first(billing_address_line1#935, false), partial_first(billing_address_line2#936, false), partial_first(billing_zipcode#937, false), partial_first(billing_city#938, false), partial_first(billing_state#939, false), partial_first(billing_country#940, false), ... 59 more fields])
10-07-2018 14:05:42 IST 1GetAggregateData INFO -          +- *Sort [unique_id#916 ASC NULLS FIRST], false, 0
10-07-2018 14:05:42 IST 1GetAggregateData INFO -             +- BroadcastNestedLoopJoin BuildRight, Inner
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :- BroadcastNestedLoopJoin BuildRight, Inner
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :  :- *Project [unique_id#916, niafdp_record_type#917, transaction_id#918, primary_account_number#919, card_sequence_number#920, issuer_name#921, brand#922, ac_open_date#923, valid_from#924, valid_to#925, card_holder_name#926, date_of_birth#927, ch_email_id#928, ch_home_phone_country_code#929L, ch_home_phone_number#930L, ch_business_phone_country_code#931L, ch_business_phone_number#932L, card_mailing_date#933, billing_currency_code#934, billing_address_line1#935, billing_address_line2#936, billing_zipcode#937, billing_city#938, billing_state#939, ... 60 more fields]
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :  :  +- *BroadcastHashJoin [unique_id#916], [loan_id#0], Inner, BuildRight
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :  :     :- *Project [unique_id#916, niafdp_record_type#917, transaction_id#918, primary_account_number#919, card_sequence_number#920, issuer_name#921, brand#922, ac_open_date#923, valid_from#924, valid_to#925, card_holder_name#926, date_of_birth#927, ch_email_id#928, ch_home_phone_country_code#929L, ch_home_phone_number#930L, ch_business_phone_country_code#931L, ch_business_phone_number#932L, card_mailing_date#933, billing_currency_code#934, billing_address_line1#935, billing_address_line2#936, billing_zipcode#937, billing_city#938, billing_state#939, ... 59 more fields]
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :  :     :  +- *BroadcastHashJoin [unique_id#916], [unique_id#17], Inner, BuildRight
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :  :     :     :- *Filter isnotnull(unique_id#916)
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :  :     :     :  +- HiveTableScan [unique_id#916, niafdp_record_type#917, transaction_id#918, primary_account_number#919, card_sequence_number#920, issuer_name#921, brand#922, ac_open_date#923, valid_from#924, valid_to#925, card_holder_name#926, date_of_birth#927, ch_email_id#928, ch_home_phone_country_code#929L, ch_home_phone_number#930L, ch_business_phone_country_code#931L, ch_business_phone_number#932L, card_mailing_date#933, billing_currency_code#934, billing_address_line1#935, billing_address_line2#936, billing_zipcode#937, billing_city#938, billing_state#939, ... 55 more fields], MetastoreRelation fdp_app_final, fdp_app_output_trans
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :  :     :     +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :  :     :        +- *Project [unique_id#17, cc_outstanding_by_credit#43, cc_mailing_by_transaction#44, cc_payment_by_transaction#45, cc_profile_change_by_transaction#46]
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :  :     :           +- *Filter isnotnull(unique_id#17)
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :  :     :              +- *FileScan csv [unique_id#17,cc_outstanding_by_credit#43,cc_mailing_by_transaction#44,cc_payment_by_transaction#45,cc_profile_change_by_transaction#46] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://10.73.122.194:9000/ccfMitigation_Perf/outputFiles/SkytreeInputPrediction..., PartitionFilters: [], PushedFilters: [IsNotNull(unique_id)], ReadSchema: struct<unique_id:string,cc_outstanding_by_credit:string,cc_mailing_by_transaction:string,cc_payme...
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :  :     +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :  :        +- *Project [loan_id#0, probability#2]
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :  :           +- *Filter ((isnotnull(createddate#3) && (createddate#3 >= 17720)) && isnotnull(loan_id#0))
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :  :              +- HiveTableScan [loan_id#0, probability#2, createddate#3], MetastoreRelation fdp_app_final, lending_ott_lms_ml
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :  +- BroadcastExchange IdentityBroadcastMode
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                :     +- *FileScan csv [] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://10.73.122.194:9000/ccfMitigation_Perf/outputFiles/SkytreeInputPrediction..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                +- BroadcastExchange IdentityBroadcastMode
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                   +- *Project
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                      +- *Filter (isnotnull(createddate#1052) && (createddate#1052 >= 17720))
10-07-2018 14:05:42 IST 1GetAggregateData INFO -                         +- HiveTableScan [createddate#1052], MetastoreRelation fdp_app_final, lending_ott_lms_ml

1 个答案:

答案 0 :(得分:0)

火花查询的框架方式不正确。重新构图如下所示。

 val df = sparkSession.sql(s"""SELECT * FROM (select a.unique_id as unique_id,......,
       b.cc_profile_change, c.probability as probability,
       ROW_NUMBER() OVER (PARTITION BY a.unique_id ORDER BY transaction_datetime DESC)
        as row_num
       FROM $skytreeHiveSchema.$skytreeTableName a
       INNER JOIN df1 b ON (a.unique_id=b.unique_id) 
       INNER JOIN df0 c ON (a.unique_id=c.loan_id)
       ) as result WHERE row_num = 1 
         """)