Can I use the regex re.sub() with a numpy array or a list of strings?

Asked: 2015-10-13 02:53:45

Tags: python regex numpy whitespace removing-whitespace

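I have a large array of entries with dtype=string_. I would like to use the regex re module to replace all excess whitespace, \t tabs, and \n newlines.

With a single string, I would use re.sub() as follows: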

import re

proust = 'If a little     dreaming is dangerous, \t the cure for it is not to dream less but to dream more. \t\t'

newstring = re.sub(r"\s+", " ", proust)

which returns

'If a little dreaming is dangerous, the cure for it is not to dream less but to dream more. '

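To do this for every entry of a numpy array, I presumably need some kind of for loop.

Something like for i in numpy_arr:, but I am not sure what should follow it in order to apply re.sub() to each element of the numpy array.

What is the most sensible way to approach this?

EDIT:

My original numpy array or list is a LONG list/array of entries, each of which is a sentence like the one above. Here is an example with five entries: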

original_list = [ 'to be or     \n\n not to be     that is the question', 
'  to   be or  not to be          that is the question\t ', 
'to   be     or not to       be that is the question', 
'to be or not to be that     is    the question\t ', 
'to be or not to be        that is    \t the question']

1 Answer:

Answer 0 (score: 3)

This is not exactly your re.sub, but the effect is the same, if not better:

In [109]: oarray
Out[109]: 
array(['to be or     \n\n not to be     that is the question',
       '  to   be or  not to be          that is the question\t ',
       'to   be     or not to       be that is the question',
       'to be or not to be that     is    the question\t ',
       'to be or not to be        that is    \t the question'], 
      dtype='<U55')
In [110]: np.char.join(' ', np.char.split(oarray))
Out[110]: 
array(['to be or not to be that is the question',
       'to be or not to be that is the question',
       'to be or not to be that is the question',
       'to be or not to be that is the question',
       'to be or not to be that is the question'], 
      dtype='<U39')

This works in this case because split() recognizes the same set of whitespace characters as '\s+'.
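For comparison, the re.sub() route from the question can be applied per element with a list comprehension. The following is only a minimal sketch on a plain Python list (the names whitespace_run, with_re, with_re_stripped and with_split are illustrative, not from the original post); the same comprehension should also work when iterating over a numpy string array. One difference worth noting: re.sub() turns leading or trailing whitespace into a single space, whereas split/join drops it entirely.

```python
import re

original_list = [
    'to be or     \n\n not to be     that is the question',
    '  to   be or  not to be          that is the question\t ',
    'to   be     or not to       be that is the question',
    'to be or not to be that     is    the question\t ',
    'to be or not to be        that is    \t the question',
]

# Compile once, since the same pattern is reused for every entry.
whitespace_run = re.compile(r"\s+")

# Collapse each run of whitespace to one space; a run at either end of the
# string becomes a single leading or trailing space.
with_re = [whitespace_run.sub(" ", s) for s in original_list]

# Strip the ends to match what split/join produces.
with_re_stripped = [s.strip() for s in with_re]

# Pure-Python equivalent of the np.char.split / np.char.join approach.
with_split = [" ".join(s.split()) for s in original_list]
```

If the input is very long, compiling the pattern once as above avoids looking it up again for each entry.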

np.char.replace will replace selected characters, but it would have to be applied several times to remove '\n', then '\t', and so on. There is also translate.