Question

My pandas DataFrame has several columns that contain values mixed with unwanted characters.

columnA        columnB    columnC        ColumnD
\x00A\X00B     NULL       \x00C\x00D        123
\x00E\X00F     NULL       NULL              456

What I want is for the DataFrame to look like this:

columnA  columnB  columnC   ColumnD
AB        NULL       CD        123
EF        NULL       NULL      456

With the code below I can remove '\x00' from columnA, but columnC is tricky because it is mixed with NULL in some rows.

col_names = cols_to_clean
fixer = dict.fromkeys([0x00], u'')
for i in col_names:
    if df[i].isnull().any() == False:
        if df[i].dtype != np.int64:
            df[i] = df[i].map(lambda x: x.translate(fixer))

Is there an efficient way to remove the unwanted characters from columnC?

Answer 1

Setup

df = pd.DataFrame({
     'columnA' : ['\x00A\x00B', '\x00E\x00F'], 
     'columnB' : ['NULL', 'NULL'],
     'columnC' : ['\x00C\x00D', 'NULL'],
     'columnD' : [123, 456]
})

df

  columnA columnB columnC  columnD
0    AB    NULL    CD      123
1    EF    NULL    NULL      456

Use apply + str.replace on the string columns:

c = df.columns[df.dtypes == object]
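# regex=True is required on pandas 2.0+, where str.replace treats the pattern literally by default
df[c] = df[c].apply(lambda x: x.str.replace(r'\W+', '', regex=True))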

df

  columnA columnB columnC columnD
0      AB    NULL      CD     123
1      EF    NULL    NULL     456

If you need a more comprehensive regex that keeps only ASCII characters (not just letters or digits), you can adapt the regex from this answer.
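As a rough sketch of what such a broader pattern might look like: the example below assumes that "unwanted" means anything outside the printable ASCII range (0x20 to 0x7E). That definition, the `non_printable` and `cleaned` names, and the explicit `cols_to_clean` list are assumptions made for illustration, not part of the original answer. Note that `\x00` is itself an ASCII character, so a plain "non-ASCII" class such as `[^\x00-\x7F]` would leave it in place.

```python
import pandas as pd

df = pd.DataFrame({
    'columnA': ['\x00A\x00B', '\x00E\x00F'],
    'columnB': ['NULL', 'NULL'],
    'columnC': ['\x00C\x00D', 'NULL'],
    'columnD': [123, 456],
})

# Assumption: keep only printable ASCII (space through tilde).
non_printable = r'[^\x20-\x7E]+'

# Name the string columns explicitly so the numeric column is never touched.
cols_to_clean = ['columnA', 'columnB', 'columnC']

cleaned = df.copy()
cleaned[cols_to_clean] = cleaned[cols_to_clean].apply(
    lambda s: s.str.replace(non_printable, '', regex=True)
)

print(cleaned)
```

Unlike `\W+`, this pattern keeps spaces and punctuation, which may or may not be what you want for your data.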

Answer 2

What is the issue with NULL? If you want to replace the string 'NULL' with a real NaN, use replace:

df.replace('NULL', np.nan, inplace=True)  # np.NaN was removed in NumPy 2.0; np.nan works everywhere
print(df.isnull())

Output:

   columnA  columnB  columnC  columnD
0    False     True    False    False
1    False     True     True    False
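Once the missing entries are real NaN values, the `isnull()` guard from the question should no longer be needed, because the `.str` methods pass missing values through unchanged. A minimal sketch, assuming columnC holds a genuine NaN rather than the string 'NULL' (the two-column frame here is a cut-down stand-in for the one in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'columnA': ['\x00A\x00B', '\x00E\x00F'],
    'columnC': ['\x00C\x00D', np.nan],  # a real missing value, not the string 'NULL'
})

for col in ['columnA', 'columnC']:
    # regex=False does a literal replacement; NaN entries are returned as NaN
    df[col] = df[col].str.replace('\x00', '', regex=False)

print(df)
```

This avoids calling `x.translate(...)` on a float NaN, which is what makes the `map` approach fail on columns with missing values.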

Or, if you need to replace 'NULL' with an empty string, use a regex in str.replace:

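# Alternation (\x00 or the whole word NULL), not a character class;
# regex=True is needed on pandas 2.0+, where patterns are literal by default
df = df.apply(lambda col: col.str.replace(
               r"\x00|NULL", "", regex=True) if col.dtype == object else col)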

print (df.isnull())
print (df.values)

Output:


   columnA  columnB  columnC  columnD
0    False    False    False    False
1    False    False    False    False

[['AB' '' 'CD' 123]
 ['EF' '' '' 456]]
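One thing to watch when writing a pattern like this: square brackets define a character class, so `[\x00|NULL]` matches any single one of the characters `\x00`, `|`, `N`, `U`, `L`, not the word NULL. On the sample data both forms give the same result, but on other values they can differ. A small sketch with the standard `re` module (the `sample` string is made up for illustration):

```python
import re

char_class = re.compile(r"[\x00|NULL]")  # any ONE of: \x00, |, N, U, L
alternation = re.compile(r"\x00|NULL")   # \x00, or the whole word NULL

sample = "\x00LUNA|NULL"

print(repr(char_class.sub("", sample)))   # 'A'
print(repr(alternation.sub("", sample)))  # 'LUNA|'
```

The character class strips every N, U, L and | it finds, so a value such as "LUNA" would be reduced to "A"; the alternation only removes the null byte and the exact token NULL.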

Remove non-ASCII characters from string columns in pandas