How to test whether a column exists and is not null in a DataFrame

Date: 2018-08-06 12:29:44

Tags: python apache-spark dataframe rdd

I have a Python RDD:

rddstats = rddstats.filter(lambda x : len(x) == NB_LINE or len(x) == NB2_LINE)

I created a DataFrame from this RDD:

logsDF = sqlContext.createDataFrame(rddstats,schema=["column1","column2","column3","column4","column5","column6","column7"])

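I want to run a test on two columns, column 6 and column 7:
if column 6 exists in the DataFrame and is not null, I should return a DataFrame containing the values of column 6; otherwise, I should return a DataFrame containing the values of column 7. Here is my short code: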

logsDF = sqlContext.createDataFrame(rddstats,schema=["column1","column2","column3","column4","column5","column6","column7"])
if (logsDF['column6'] in rddstats and logsDF['column6'].isNotNull):
    logsDF.select("column1","column2","column3","column4","column5","column6")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")

Is the syntax correct, and am I allowed to write it like this in Python?

3 Answers:

Answer 0 (score: 2)

if (logsDF['column6'] in rddstats and logsDF['column6'].isNotNull)

I'm pretty sure this will throw a KeyError if column6 doesn't exist.

You can do something like this instead:

if 'column6' in logsDF.columns:
    if logsDF['column6'].notnull().any():
        logsDF.select("column1","column2","column3","column4","column5","column6")
    else:
        logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")

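This first checks whether column6 exists among the columns of logsDF. If it does, it then checks whether any() of its values is not null.

If column6 does not exist, or if column6 exists but all of its values are null, column7 is used.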
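
Note that `.notnull().any()` is pandas syntax, and as far as I know a Spark `Column` does not offer it; indexing a missing column in PySpark also tends to raise an `AnalysisException` rather than a `KeyError`. For the Spark DataFrame in the question, the same decision could be sketched roughly as follows. The helper name `select_with_fallback` and its parameters are made up for illustration, and it reads "not null" as "at least one non-null value", which is an assumption about what the asker wants.

```python
def select_with_fallback(df, base_cols, preferred="column6", fallback="column7"):
    """Return df restricted to base_cols plus `preferred` when that column
    exists and holds at least one non-null value, otherwise plus `fallback`.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame.
    """
    use_preferred = (
        preferred in df.columns
        # Evaluated only when the column exists (short-circuit `and`).
        # limit(1) keeps the check from counting every matching row.
        and df.where(df[preferred].isNotNull()).limit(1).count() > 0
    )
    extra = preferred if use_preferred else fallback
    return df.select(*base_cols, extra)
```

With the question's names, this would be called as `select_with_fallback(logsDF, ["column1", "column2", "column3", "column4", "column5"])`. Unlike the snippets in this thread, it returns the selected DataFrame instead of discarding it.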


Edit to my own answer: since Python does not evaluate the second condition when the first one is False, you can simply write:

if 'column6' in logsDF.columns and logsDF['column6'].notnull().any():
    logsDF.select("column1","column2","column3","column4","column5","column6")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")

As long as 'column6' in logsDF.columns comes first, logsDF['column6'] is never evaluated when column6 does not exist, so it never gets the chance to throw a KeyError.
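
The short-circuit behaviour can be seen without any DataFrame at all. The sketch below uses a plain dict as a hypothetical stand-in for a table (column name mapped to a list of values); the function names are invented for the demonstration.

```python
# Hypothetical stand-in for a table: column name -> list of values.
table = {"column7": ["x", None]}  # note: there is no "column6" key


def column_has_value(table, name):
    # `name in table` is checked first. When it is False, Python skips
    # the right-hand operand, so table[name] is never looked up.
    return name in table and any(v is not None for v in table[name])


def column_has_value_unsafe(table, name):
    # Operands swapped: the lookup runs first and raises KeyError
    # for a missing column before the membership test is reached.
    return any(v is not None for v in table[name]) and name in table


print(column_has_value(table, "column6"))  # False, no exception raised
```

Calling `column_has_value_unsafe(table, "column6")` raises `KeyError`, which is why the order of the two conditions matters.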

Answer 1 (score: 1)

if set(['A','C']).issubset(df.columns):
   df['sum'] = df['A'] + df['C']

set([...]) can also be written with curly braces:

if {'A', 'C'}.issubset(df.columns):

For a discussion of the curly-brace syntax, see this question.

Alternatively, you can use a list comprehension, like this:

if all([item in df.columns for item in ['A','C']]):
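
The three checks above should agree with each other. As a quick sanity check, here is a small sketch that runs them on a plain list standing in for `df.columns`; the function name `has_columns` and the sample column names are made up for illustration.

```python
def has_columns(columns, required):
    """Return the result of the three equivalent membership checks."""
    via_set_call = set(required).issubset(columns)
    via_set_literal = {*required}.issubset(columns)  # same as {'A', 'C'}
    # A generator expression avoids building the intermediate list.
    via_all = all(name in columns for name in required)
    return via_set_call, via_set_literal, via_all


columns = ["A", "B", "C"]  # hypothetical stand-in for df.columns
print(has_columns(columns, ["A", "C"]))  # (True, True, True)
print(has_columns(columns, ["A", "D"]))  # (False, False, False)
```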

Answer 2 (score: 0)

I think this might be faster:

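import pandas as pd

if 'column_name' not in df.columns:
    do_something()  # the column is missing
# elif, so the lookup below never runs for a missing column
elif len([x for x in df['column_name'].unique() if pd.isna(x)]) > 0:
    do_something_else()  # the column exists but contains a null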