Question

After running df.printSchema(), I get the following schema:

root
 |-- key:col1: string (nullable = true)
 |-- key:col2: string (nullable = true)
 |-- col3: string (nullable = true)
 |-- col4: string (nullable = true)
 |-- col5: string (nullable = true)

I need to access key:col2 by its column name, but because of the : in the name, the following line raises an error:

df.map(lambda row:row.key:col2)

I have tried:

df.map(lambda row:row["key:col2"])

I can easily get the values from col3, col4, and col5 using:

df.map(lambda row:row.col4).take(10)

Answer 1

I think you can use getattr:

df.map(lambda row: getattr(row, 'key:col2'))

I'm not a PySpark expert, so I don't know whether this is the best way :-).
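Why the dotted form fails while getattr works can be checked without Spark. The sketch below is only an illustration: types.SimpleNamespace is a hypothetical stand-in for a Row (it is not the PySpark class), the field names come from the schema in the question, and the values are made up.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a Spark Row; the values are made up.
row = SimpleNamespace(**{"key:col1": "a", "key:col2": "b", "col4": "d"})

# `row.key:col2` cannot work as attribute access: ':' is not allowed in an
# identifier, so Python reads the attribute name as just `key`.
# getattr takes the name as a plain string, so any characters are accepted.
value = getattr(row, "key:col2")

# Ordinary attribute access still works for names that are valid identifiers.
plain = row.col4

print(value, plain)
```

Note that in PySpark 2.0 and later, DataFrame no longer has a map method, so the same lambda would be passed to df.rdd.map(...) instead.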

You could probably also use operator.attrgetter:

from operator import attrgetter
df.map(attrgetter('key:col2'))

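IIRC, it performs slightly better than a lambda in some circumstances. Here the difference is probably more noticeable than usual because it avoids the global getattr name lookup, and in this case I think it also looks nicer.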
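As a rough check that the two forms are interchangeable here, the following sketch compares them on hypothetical stand-in rows (SimpleNamespace objects with made-up values, not real Spark rows). It also shows one caveat worth knowing: attrgetter treats a dot in the name as nested attribute access, so it would not suit a column name that contains a dot.

```python
from operator import attrgetter
from types import SimpleNamespace

# Hypothetical stand-in rows; the values are made up.
rows = [SimpleNamespace(**{"key:col2": v}) for v in ("x", "y", "z")]

# attrgetter builds the accessor once; no lambda or getattr lookup per call.
get_col2 = attrgetter("key:col2")
via_attrgetter = list(map(get_col2, rows))
via_lambda = list(map(lambda r: getattr(r, "key:col2"), rows))

# Caveat: attrgetter("a.b") means obj.a.b, so a dotted name is split.
dotted = SimpleNamespace(**{"key.col2": "flat"})
dotted_ok = True
try:
    attrgetter("key.col2")(dotted)  # looks for dotted.key first
except AttributeError:
    dotted_ok = False

# Plain getattr does no splitting and finds the attribute.
flat_value = getattr(dotted, "key.col2")
```

For a name containing a colon, as in the question, the two approaches return the same values; the choice is mostly about style and the small lookup saving mentioned above.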

Map Spark DataFrame columns with special characters
