Creating a PySpark schema for a tuple made up of a list and an array

Time: 2019-07-10 12:14:22

Tags: pyspark sql-types

Could someone please tell me what the correct PySpark schema is for the following tuple:

([['__label__positif', '__label__négatif', '__label__neutre']], array([[0.60312474, 0.24436191, 0.15254335]]))

Thanks in advance.

1 answer:

Answer 0: (score: 0)

Have a look at the commented code below:

import numpy as np
from pyspark.sql import SparkSession

# Reuse the active session, or create one when running outside a shell or notebook
spark = SparkSession.builder.getOrCreate()

# This is the object you got from the fastText model
pred = ([['__label__positif', '__label__négatif', '__label__neutre']], np.array([[0.60312474, 0.24436191, 0.15254335]]))
print(pred)

# First, flatten this object into a single list with 6 elements
pred = [item for part in pred for row in part for item in row]
print(pred)

# PySpark does not work well with NumPy scalars, so cast them to plain Python values
pred = [x.item() if isinstance(x, np.generic) else x for x in pred]
print(pred)

# One tuple becomes one row of the DataFrame
rows = [tuple(pred)]

columns = ['one', 'two', 'three', 'four', 'five', 'six']

df = spark.createDataFrame(rows, columns)
df.show()

Output:

([['__label__positif', '__label__négatif', '__label__neutre']], array([[0.60312474, 0.24436191, 0.15254335]])) 
['__label__positif', '__label__négatif', '__label__neutre', 0.60312474, 0.24436191, 0.15254335] 
['__label__positif', '__label__négatif', '__label__neutre', 0.60312474, 0.24436191, 0.15254335] 
+----------------+----------------+---------------+----------+----------+----------+ 
|             one|             two|          three|      four|      five|       six| 
+----------------+----------------+---------------+----------+----------+----------+ 
|__label__positif|__label__négatif|__label__neutre|0.60312474|0.24436191|0.15254335| 
+----------------+----------------+---------------+----------+----------+----------+
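
If you would rather keep the two parts of the tuple as two array columns instead of six scalar columns, something along these lines might work. This is only a sketch: the helper name to_row, the column names labels and probabilities, and the schema string are my own choices, not anything from the question. The probabilities are written as a plain nested list here so the snippet runs without NumPy; an np.array should behave the same way, because tolist() converts it to nested Python lists.

```python
def to_row(pred):
    """Turn a (labels, probabilities) prediction into one row of plain Python lists."""
    labels, probs = pred
    # NumPy arrays expose tolist(), which also turns np.float64 into Python float
    if hasattr(probs, "tolist"):
        probs = probs.tolist()
    # Both parts hold one inner list per input text; take the first (only) one
    return (list(labels[0]), [float(p) for p in probs[0]])


# Hypothetical data type string: one array column per part of the tuple
schema = "labels: array<string>, probabilities: array<double>"

pred = ([['__label__positif', '__label__négatif', '__label__neutre']],
        [[0.60312474, 0.24436191, 0.15254335]])
row = to_row(pred)
print(row)

# With an active SparkSession named `spark`, as in the code above:
# df = spark.createDataFrame([row], schema)
```

The createDataFrame call is left commented out so the snippet does not need a running Spark session. Passing the schema as a string is one option; building it from StructType and ArrayType in pyspark.sql.types should describe the same two columns.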