Decoding decimal into binary information in PySpark

Date: 2019-07-08 14:44:56

Tags: python pyspark

I have a problem decoding decimal values into binary in PySpark. This is how I do it in plain Python:

a = 28
b = format(a, "09b")
print(b)

-> 000011100
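
Here the "09b" format spec means: binary representation, left-padded with zeros to a width of 9 characters, e.g.:

print(format(5, "09b"))

-> 000000101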

This is the example DataFrame I want to convert:

from pyspark.sql import Row, SparkSession

# create (or reuse) a SparkSession so that spark.createDataFrame is available
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([Row(a=1, b='28', c='11', d='foo'),
                            Row(a=2, b='28', c='44', d='bar'),
                            Row(a=3, b='28', c='22', d='foo')])

+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  1| 28| 11|foo|
|  2| 28| 44|bar|
|  3| 28| 22|foo|
+---+---+---+---+

I would like column "b" to be decoded in the same way, i.e. 28 -> 000011100.

Thanks for your help!

1 Answer:

Answer 0 (score: 1)

You can use the bin and lpad functions to get the same output:

import pyspark.sql.functions as f
from pyspark.sql import Row
from pyspark.shell import spark

df = spark.createDataFrame([Row(a=1, b='28', c='11', d='foo'),
                            Row(a=2, b='28', c='44', d='bar'),
                            Row(a=3, b='28', c='22', d='foo')])

# bin() renders the decimal value as a binary string,
# lpad() left-pads it with zeros to a width of 9
df = df.withColumn('b', f.lpad(f.bin(df['b']), 9, '0'))
df.show()
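
Since bin and lpad are built-in column functions, this version avoids the serialization overhead of a Python UDF. If you later need to go the other way, from the padded binary string back to decimal, conv should do the inverse; a minimal sketch (the b_decimal column name is just for illustration):

# interpret column 'b' as base-2 and render it as base-10 (returns a string column)
df = df.withColumn('b_decimal', f.conv(df['b'], 2, 10))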

Using a UDF:

import pyspark.sql.functions as f
from pyspark.sql import Row
from pyspark.shell import spark

df = spark.createDataFrame([Row(a=1, b='28', c='11', d='foo'),
                            Row(a=2, b='28', c='44', d='bar'),
                            Row(a=3, b='28', c='22', d='foo')])


# string-returning UDF that reuses the plain-Python format() call
@f.udf()
def to_binary(value):
    return format(int(value), "09b")


df = df.withColumn('b', to_binary(df['b']))
df.show()

Output:

+---+---------+---+---+
|  a|        b|  c|  d|
+---+---------+---+---+
|  1|000011100| 11|foo|
|  2|000011100| 44|bar|
|  3|000011100| 22|foo|
+---+---------+---+---+
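
A side note on the UDF variant: @f.udf() defaults to a string return type, but you can declare it explicitly to make the output schema obvious; a minimal sketch reusing the same to_binary logic:

from pyspark.sql.types import StringType

# explicitly typed UDF, otherwise identical to the one above
@f.udf(returnType=StringType())
def to_binary(value):
    return format(int(value), "09b")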