How to convert a string column back to ArrayType in pyspark

Asked: 2019-09-10 09:43:47

Tags: cassandra pyspark pyspark-sql

I have a requirement to mask data stored in a Cassandra table using pyspark. I have a frozen dataset in Cassandra that I fetch as an array in pyspark. I converted it to a string to mask it. Now I want to convert it back to an array type.

I am using Spark 2.3.2 to mask the data from the Cassandra table. I copy the data into a dataframe, then convert it to a string to perform the masking. I tried to convert it back to an array, but I am unable to retain the original structure.

from pyspark.sql.functions import array, regexp_replace
from pyspark.sql.types import StringType

table_df.createOrReplaceTempView("tmp")
networkinfos_df = sqlContext.sql('SELECT networkinfos, pid, eid, sid FROM tmp')


# cast the frozen array column to a string, then mask IPv4, MAC and IPv6 addresses in turn
dfn1 = networkinfos_df \
    .withColumn('networkinfos_ntdf', regexp_replace(networkinfos_df.networkinfos.cast(StringType()), r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b', faker.ipv4_private(network=False, address_class=None))) \
    .withColumn('networkinfos_ntdf', regexp_replace('networkinfos_ntdf', r'([a-fA-F0-9]{2}[:|\-]?){6}', faker.mac_address())) \
    .withColumn('networkinfos_ntdf', regexp_replace('networkinfos_ntdf', r'(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))', faker.ipv6(network=False))) \
    .drop('networkinfos')

dfn2 = dfn1.withColumn("networkinfos_ntdf", array(dfn1["networkinfos_ntdf"]))

dfn2.show(30,False)

The original structure looks like this:

 |-- networkinfos: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- vendor: string (nullable = true)
 |    |    |-- product: string (nullable = true)
 |    |    |-- dhcp_enabled: boolean (nullable = true)
 |    |    |-- dhcp_server: string (nullable = true)
 |    |    |-- dns_servers: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- ipv4: string (nullable = true)
 |    |    |-- ipv6: string (nullable = true)
 |    |    |-- subnet_mask_obsolete: string (nullable = true)
 |    |    |-- default_ip_gateway: string (nullable = true)
 |    |    |-- mac_address: string (nullable = true)
 |    |    |-- logical_name: string (nullable = true)
 |    |    |-- dhcp_lease_obtained: timestamp (nullable = true)
 |    |    |-- dhcp_lease_expires: timestamp (nullable = true)
 |    |    |-- ip_enabled: boolean (nullable = true)
 |    |    |-- ipv4_list: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- ipv6_list: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- subnet_masks_obsolete: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- default_ip_gateways: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- wins_primary_server: string (nullable = true)
 |    |    |-- wins_secondary_server: string (nullable = true)
 |    |    |-- subnet_mask: string (nullable = true)
 |    |    |-- subnet_masks: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- interface_index: integer (nullable = true)
 |    |    |-- speed: long (nullable = true)
 |    |    |-- dhcp_servers: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)

What I get instead is:

root
 |-- pid: string (nullable = true)
 |-- eid: string (nullable = true)
 |-- sid: string (nullable = true)
 |-- networkinfos_ntdf: array (nullable = false)
 |    |-- element: string (containsNull = true)
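
For context: wrapping the masked string with array() can only ever yield an array of strings, so the struct schema is lost at that point. A minimal illustration of this (toy column, not from the original data):

from pyspark.sql.functions import array, lit

# wrapping a plain string in array() produces array<string>, not array<struct>
spark.range(1).withColumn("wrapped", array(lit("masked text"))).printSchema()
# root
#  |-- id: long (nullable = false)
#  |-- wrapped: array (nullable = false)
#  |    |-- element: string (containsNull = false)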

How can I convert it back to the original structure?

1 Answer:

Answer 0 (score: 1)

You can try using pyspark.sql.functions.to_json() and pyspark.sql.functions.from_json() to handle the task, provided your regexp_replace operations do not break the JSON data:

First, find the schema of the field networkinfos:

from pyspark.sql.types import ArrayType
from pyspark.sql.functions import regexp_replace, from_json, to_json

# get the schema of the array field `networkinfos` as JSON
schema_data = networkinfos_df.select('networkinfos').schema.jsonValue()['fields'][0]['type']

# convert it into a pyspark.sql.types.ArrayType
field_schema = ArrayType.fromJson(schema_data)
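
As a side note (not part of the original answer), the same ArrayType can also be read directly off the DataFrame schema, avoiding the JSON detour for the schema itself:

# equivalent shortcut: index the schema by field name and take its dataType
field_schema = networkinfos_df.schema['networkinfos'].dataType  # pyspark.sql.types.ArrayType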

Once you have field_schema, you can use from_json to convert the modified JSON string back to its original schema:

dfn1 = networkinfos_df \
        .withColumn('networkinfos', to_json('networkinfos')) \
        .withColumn('networkinfos', regexp_replace('networkinfos',...)) \
        .....\
        .withColumn('networkinfos', from_json('networkinfos', field_schema))
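
Putting it together, here is a minimal, self-contained sketch of the round trip on a toy schema; the struct fields, the regex, and the literal replacement are hypothetical stand-ins for the faker-based masking above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, from_json, regexp_replace

spark = SparkSession.builder.getOrCreate()

# toy frame: one array<struct> column standing in for `networkinfos`
df = spark.createDataFrame(
    [(1, [{"ipv4": "192.168.1.10", "vendor": "acme"}])],
    "pid int, networkinfos array<struct<ipv4:string,vendor:string>>")

# capture the original ArrayType before masking
field_schema = df.schema['networkinfos'].dataType

masked = df \
    .withColumn('networkinfos', to_json('networkinfos')) \
    .withColumn('networkinfos', regexp_replace('networkinfos', r'\b(?:\d{1,3}\.){3}\d{1,3}\b', '10.0.0.1')) \
    .withColumn('networkinfos', from_json('networkinfos', field_schema))

masked.printSchema()  # networkinfos is back to array<struct<...>>
masked.show(truncate=False)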