PySpark: how to concatenate 2 columns?

Asked: 2019-04-18 21:39:46

Tags: pyspark

I have two DataFrames, each with a single column (300 rows each).


I would like to make a single DataFrame with both columns. The two inputs look like this:

    df_realite.take(1)
    [Row(realite=1.0)]

    df_proba_classe_1.take(1)
    [Row(probabilite=0.6196931600570679)]

I tried:

    _ = spark.createDataFrame([df_realite.rdd, df_proba_classe_1.rdd],
                              schema=StructType([StructField('realite',     FloatType()),
                                                 StructField('probabilite', FloatType())]))

but it gives me null values:

    _.take(10)

2 answers:

Answer 0 (score: 0)

There may be a cleaner way (or one that avoids a join entirely), but you can always give each DataFrame an id column and join the two on it.

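A minimal sketch of that idea, assuming the df_realite and df_proba_classe_1 DataFrames from the question; the id column name and the row_number()-over-monotonically_increasing_id() trick are just one way to generate matching ids:

    from pyspark.sql import Window
    from pyspark.sql.functions import monotonically_increasing_id, row_number

    # Give each DataFrame the same sequential row id, then join on it
    w = Window.orderBy(monotonically_increasing_id())
    df1 = df_realite.withColumn("id", row_number().over(w))
    df2 = df_proba_classe_1.withColumn("id", row_number().over(w))

    df = df1.join(df2, on="id").drop("id")
    df.show()

Ordering by monotonically_increasing_id() keeps the rows in their original order, and row_number() turns that into the same dense 1..N id on both sides, so the join pairs the rows up.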

Answer 1 (score: 0)

I think this is what you are looking for. It is only recommended when your data is very small, as in your case (300 rows), because collect() is not a good way to handle large amounts of data. Otherwise, go with the dummy-id join route and use a broadcast join so that no shuffle happens (see the sketch after the output below).

from pyspark.sql.functions import *
from pyspark.sql.types import *

# Two single-column DataFrames standing in for df_realite and df_proba_classe_1
df1 = spark.range(10).select(col("id").cast("float"))
df2 = spark.range(10).select(col("id").cast("float"))

# Pull both columns back to the driver and pair them up row by row
l1 = df1.rdd.flatMap(lambda x: x).collect()
l2 = df2.rdd.flatMap(lambda x: x).collect()
list_df = zip(l1, l2)

schema = StructType([StructField('realite', FloatType()),
                     StructField('probabilite', FloatType())])

# Rebuild a single two-column DataFrame from the zipped pairs
df = spark.createDataFrame(list_df, schema=schema)
df.show()

+-------+-----------+
|realite|probabilite|
+-------+-----------+
|    0.0|        0.0|
|    1.0|        1.0|
|    2.0|        2.0|
|    3.0|        3.0|
|    4.0|        4.0|
|    5.0|        5.0|
|    6.0|        6.0|
|    7.0|        7.0|
|    8.0|        8.0|
|    9.0|        9.0|
+-------+-----------+
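For the broadcast-join route mentioned above, a rough sketch; df1_with_id and df2_with_id are placeholder names for the two DataFrames once they carry a matching id column (generated, for example, as in the other answer):

    from pyspark.sql.functions import broadcast

    # broadcast() hints Spark to ship the (small) right side to every executor,
    # so the join runs as a broadcast hash join and no shuffle is needed
    df = df1_with_id.join(broadcast(df2_with_id), on="id").drop("id")
    df.show()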