I am using Spark 1.6.0 and I want to join two DataFrames, which appear in the YARN logs as shown below.
One is df_train_raw:
+------------+-----------+-----+
|subscriberid| objectid|label|
+------------+-----------+-----+
| 80755258|11030733889| 0|
| 81405858|11030733889| 0|
| 83486458|11030733889| 0|
| 81867258|11030733889| 0|
| 83077858|11030733889| 0|
| 80278458|11030733889| 0|
| 80044458|11030733889| 0|
| 81079858|11030733889| 0|
| 83418658|11030733889| 0|
| 83105658|11030733889| 0|
| 83105658| 2157122| 0|
| 83077858|11030780536| 0|
| 83105658|11030797977| 0|
| 83418658|11030714577| 0|
| 83077858|11030714577| 0|
| 79752658|11030714577| 0|
| 83105658|11028639583| 0|
| 79752658|11030549822| 0|
| 83105658|11028975426| 0|
| 83105658|11030686035| 0|
| 81079858|11030686035| 0|
| 79752658|11030504648| 0|
| 83486458|11030696858| 0|
| 81867258|11030696858| 0|
| 83105658|11030696858| 0|
| 81079858|11030696858| 0|
| 83418658|11030696858| 0|
| 80044458|11030696858| 0|
| 80278458|11030696858| 0|
| 81405858|11030696858| 0|
| 80755258|11030696858| 0|
| 83486458|11030434056| 0|
| 80278458|11030434056| 0|
| 81405858|11030434056| 0|
| 80044458|11030434056| 0|
| 80755258|11030434056| 0|
| 81867258|11030434056| 0|
| 108920274|11022029789| 1|
+------------+-----------+-----+
The other is df_user_clicks_info:
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
|subscriberid|user_clicks_avg_everyday_a_week|user_clicks_sum_time_1_9_a_week|user_clicks_sum_time_9_14_a_week|user_clicks_sum_time_14_17_a_week|user_clicks_sum_time_17_19_a_week|user_clicks_sum_time_19_23_a_week|user_clicks_sum_time_23_1_a_week|user_clicks_avg_everyday_weekday|user_clicks_sum_time_1_9_weekday|user_clicks_sum_time_9_14_weekday|user_clicks_sum_time_14_17_weekday|user_clicks_sum_time_17_19_weekday|user_clicks_sum_time_19_23_weekday|user_clicks_sum_time_23_1_weekday|user_clicks_avg_everyday_weekdend|user_clicks_sum_time_1_9_weekdend|user_clicks_sum_time_9_14_weekdend|user_clicks_sum_time_14_17_weekdend|user_clicks_sum_time_17_19_weekdend|user_clicks_sum_time_19_23_weekdend|user_clicks_sum_time_23_1_weekdend|
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
| 83486458| 1.71| 3| 3| 3| 0| 3| 0| 2.4| 3| 3| 3| 0| 3| 0| 0.0| 0| 0| 0| 0| 0| 0|
| 81867258| 0.43| 0| 3| 0| 0| 0| 0| 0.0| 0| 0| 0| 0| 0| 0| 1.5| 0| 3| 0| 0| 0| 0|
| 80044458| 0.43| 3| 0| 0| 0| 0| 0| 0.6| 3| 0| 0| 0| 0| 0| 0.0| 0| 0| 0| 0| 0| 0|
| 79752658| 2.57| 0| 9| 0| 3| 6| 0| 3.0| 0| 9| 0| 3| 3| 0| 1.5| 0| 0| 0| 0| 3| 0|
| 83105658| 59.0| 63| 105| 49| 70| 126| 0| 61.6| 56| 98| 49| 42| 63| 0| 52.5| 7| 7| 0| 28| 63| 0|
| 80278458| 3.43| 0| 9| 3| 0| 12| 0| 3.6| 0| 9| 3| 0| 6| 0| 3.0| 0| 0| 0| 0| 6| 0|
| 81405858| 1.29| 3| 0| 3| 0| 3| 0| 1.2| 0| 0| 3| 0| 3| 0| 1.5| 3| 0| 0| 0| 0| 0|
| 108920274| 2.0| 1| 5| 4| 4| 0| 0| 2.2| 0| 5| 3| 3| 0| 0| 1.5| 1| 0| 1| 1| 0| 0|
| 80755258| 2.14| 3| 0| 0| 6| 6| 0| 0.0| 0| 0| 0| 0| 0| 0| 7.5| 3| 0| 0| 6| 6| 0|
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
I tried to inner-join them with this code:
val df_tmp_tmp_0 = df_train_raw.join(df_user_clicks_info, Seq("subscriberid"))
df_tmp_tmp_0.show()
The result I got was completely empty! Oh my god!
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
|subscriberid|objectid|label|subscriberid|user_clicks_avg_everyday_a_week|user_clicks_sum_time_1_9_a_week|user_clicks_sum_time_9_14_a_week|user_clicks_sum_time_14_17_a_week|user_clicks_sum_time_17_19_a_week|user_clicks_sum_time_19_23_a_week|user_clicks_sum_time_23_1_a_week|user_clicks_avg_everyday_weekday|user_clicks_sum_time_1_9_weekday|user_clicks_sum_time_9_14_weekday|user_clicks_sum_time_14_17_weekday|user_clicks_sum_time_17_19_weekday|user_clicks_sum_time_19_23_weekday|user_clicks_sum_time_23_1_weekday|user_clicks_avg_everyday_weekdend|user_clicks_sum_time_1_9_weekdend|user_clicks_sum_time_9_14_weekdend|user_clicks_sum_time_14_17_weekdend|user_clicks_sum_time_17_19_weekdend|user_clicks_sum_time_19_23_weekdend|user_clicks_sum_time_23_1_weekdend|
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
I don't know why. Nothing here looks wrong, does it? Hope someone can help. Thanks!
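One thing worth checking before blaming the join itself is whether the key column has the same type on both sides, and whether it carries stray whitespace (a common cause of joins silently matching nothing). This is a hypothetical diagnostic sketch, not the asker's code; the `left`/`right` names are mine:

```scala
import org.apache.spark.sql.functions.trim

// Compare the type of subscriberid on both sides: a string on one side and
// a numeric type on the other can make every comparison fail.
df_train_raw.printSchema()
df_user_clicks_info.printSchema()

// If subscriberid is a string, trim both sides before joining so that
// values like " 83105658" and "83105658" compare equal.
val left  = df_train_raw.withColumn("subscriberid", trim(df_train_raw("subscriberid")))
val right = df_user_clicks_info.withColumn("subscriberid", trim(df_user_clicks_info("subscriberid")))
left.join(right, Seq("subscriberid")).show()
```

Note also that a `join(..., Seq("subscriberid"))` join is supposed to emit a single merged `subscriberid` column; the duplicated `subscriberid` in the empty output above suggests the key column was not being resolved as the same attribute on both sides.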
Answer 0 (score: 0)
Thanks to a friend who helped me with this. I believe it is a bug in Spark 1.6.0, and I worked around it by restructuring my data-processing steps rather than upgrading Spark. That is, originally I wanted to derive df_3 directly from df_1 and df_2, but because of the bug described in the question the join did not produce the result I wanted. So I took a different route: I first built two intermediate DataFrames, df_tmp_1 and df_tmp_2, and then joined those to get the result. I still don't know why this works, but if you are on Spark 1.6.0 and hit a join bug like mine, it seems worth trying.
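The answer does not show the intermediate step, but a pattern sometimes used to sidestep planner/resolution problems on Spark 1.6 is to materialize each side before joining. This is a sketch of that idea under my own assumptions, not the answerer's exact code:

```scala
// Persist and force-materialize each side so the join operates on
// freshly computed DataFrames rather than on a lineage the 1.6 analyzer
// may resolve incorrectly.
val df_tmp_1 = df_train_raw.persist()
val df_tmp_2 = df_user_clicks_info.persist()
df_tmp_1.count()  // triggers computation of the left side
df_tmp_2.count()  // triggers computation of the right side

val df_result = df_tmp_1.join(df_tmp_2, Seq("subscriberid"))
df_result.show()
```

Whether this helps depends on what the underlying bug actually was; if the real cause was a type or whitespace mismatch in `subscriberid`, fixing the key column directly is the more reliable repair.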