
Time: 2017-03-31 14:05:59

Tags: python regex string text mining


string = "This is a 1example of the text. But, it only is 2.5 percent of all data"


"This is a  1 example of the text But it only is  2.5  percent of all data"

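Remove the punctuation (it could be ., , or anything else in string.punctuation) and put a space between digits and words wherever they are concatenated. But keep floats such as 2.5 in my example.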


import re

item = "This is a 1example of the text. But, it only is 2.5 percent of all data"
# Put a space before every capital letter, then normalize whitespace
item = ' '.join(re.sub(r"([A-Z])", r" \1", item).split())
# This is a start, but not there yet!
#item = ' '.join([x.strip(string.punctuation) for x in item.split() if x not in string.digits])
# Split on every run of digits (this also breaks floats such as 2.5 apart)
item = ' '.join(re.split(r'(\d+)', item))
print(item)


 >> "This is a  1 example of the text. But, it only is  2 . 5  percent of all data"


6 answers:

Answer 0 (score: 3)



Working demo


import re

# Match . , ; : only when neither neighbor is a digit, so 2.5 survives
regex = r"(?<!\d)[.,;:](?!\d)"

test_str = "This is a 1example of the text. But, it only is 2.5 percent of all data"

result = re.sub(regex, "", test_str)
print(result)


This is a 1example of the text But it only is 2.5 percent of all data
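
Note that the output above still has "1example" joined. A minimal sketch of one way to add the missing space, assuming a second substitution on the digit/letter boundary is acceptable (the names PUNCT, BOUNDARY and clean are illustrative and not part of the answer):

```python
import re

# Pattern from the answer: punctuation with no digit on either side.
PUNCT = re.compile(r"(?<!\d)[.,;:](?!\d)")
# Assumed extra step: a zero-width position between a digit and a letter.
BOUNDARY = re.compile(r"(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)")

def clean(text):
    """Remove punctuation first, then insert a space at digit/letter boundaries."""
    without_punct = PUNCT.sub("", text)
    return BOUNDARY.sub(" ", without_punct)

print(clean("This is a 1example of the text. But, it only is 2.5 percent of all data"))
```

Because the boundary pattern only uses lookarounds, it consumes no characters, so the float 2.5 is left untouched.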

Answer 1 (score: 1)


import re

item = "This is a 1example 2Ex of the text.But, it only is 2.5 percent of all data?"
# If two strings are concatenated and the second starts with a capital letter
item = ' '.join(re.sub(r"([A-Z])", r" \1", item).split())
# If a word starts with a digit, like "1example"
item = ' '.join(re.split(r'(\d+)([A-Za-z]+)', item))
# Strip leading and trailing punctuation from each token; floats such as 2.5 stay intact
item = re.sub(r'\S+', lambda m: re.match(r'^\W*(.*\w)\W*$', m.group()).group(1), item)
item = item.replace("  ", " ")
print(item)

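Answer 2 (score: 0)

I'm out of touch with Python, but I know a bit about regexps. I'd suggest using an alternation (or). I'd use this regex: "(\d+)([a-zA-Z])|([a-zA-Z])(\d+)", with "\1 \2" as the replacement string.
If some corner cases bother you, you can pass the back-references to a procedure and handle them one by one, perhaps by checking whether your "\1 \2" can be converted to a float. TCL has this kind of feature built in; Python should too.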
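
In Python, the "pass the back-references to a procedure" idea maps onto giving re.sub a function instead of a replacement string. The sketch below is only an illustration under that assumption (PAIR, add_space and split_digits_from_letters are made-up names). A function is also needed here because a plain "\1 \2" template only covers the first alternative; in the second one the text lands in groups 3 and 4.

```python
import re

# The alternation suggested in the answer.
PAIR = re.compile(r"(\d+)([a-zA-Z])|([a-zA-Z])(\d+)")

def add_space(match):
    # Only one alternative matches, so exactly two groups are not None.
    left, right = (g for g in match.groups() if g is not None)
    return left + " " + right

def split_digits_from_letters(text):
    """Insert a space wherever a digit run touches a letter."""
    return PAIR.sub(add_space, text)

print(split_digits_from_letters("a 1example and abc2 with 2.5"))
```

Since each match consumes the characters on both sides, a token like "a1b" would only be split once; a lookaround-based pattern avoids that if it matters.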

Answer 3 (score: 0)


a = "This is a 1example of the text. But, it only is 2.5 percent of all data"
a = a.replace(". ", " ").replace(", ", " ")


Answer 4 (score: 0)


from itertools import groupby

s1 = "This is a 1example of the text. But, it only is 2.5 percent of all data"
s2 = [''.join(g) for _, g in groupby(s1, str.isalpha)]
s3 = ' '.join(s2).replace("   ", "  ").replace("  ", " ")

# You can keep adding a replace for each punctuation mark
s4 = s3.replace(". ", " ").replace(", "," ").replace("; "," ").replace(", "," ").replace("- "," ").replace("? "," ").replace("! "," ").replace(" ("," ").replace(") "," ").replace('" '," ").replace(' "'," ").replace('... '," ").replace('/ '," ").replace(' “'," ").replace('” '," ").replace('] '," ").replace(' ['," ")

s5 = s4.replace("  ", " ")


'This is a 1 example of the text But it only is 2.5 percent of all data'

P.S.: You can look at Punctuation Marks and keep adding them to the .replace() calls.
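
Instead of extending the chain by hand, one could loop over string.punctuation. This is a rough sketch of that idea, not the answer's own code; strip_spaced_punctuation is a hypothetical name, and it only handles marks that are followed by whitespace or sit at the end of the string.

```python
import string

def strip_spaced_punctuation(text):
    """Drop any ASCII punctuation mark that is directly followed by a space."""
    # A trailing space lets a mark at the very end of the string qualify too.
    text = text + " "
    for mark in string.punctuation:
        text = text.replace(mark + " ", " ")
    # Collapse the leftover runs of spaces.
    return " ".join(text.split())

print(strip_spaced_punctuation("This is a 1example of the text. But, it only is 2.5 percent of all data"))
```

The dot in 2.5 is followed by a digit rather than a space, so it is never touched. Marks preceded by a space (such as an opening parenthesis) or stacked marks like "!?" would need extra handling.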

Answer 5 (score: 0)


([^ ]?)(?:[^\P{punct}.]|(?<!\d)\.(?!\d))([^ ]?)


If the length of $1 > 0 and the length of $2 > 0,
replace with $1 + space + $2;
otherwise, replace with $1$2.


 ( [^ ]? )                     # (1)
 (?:
      [^\P{punct}.]
   |
      (?<! \d )
      \.
      (?! \d )
 )
 ( [^ ]? )                     # (2)

If you don't want the logic that looks at the characters on either side, use (?:[^\P{punct}.]|(?<!\d)\.(?!\d)) and replace with nothing.
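
Python's built-in re module does not support \p{...} property classes, so the regex above cannot be used there as written. A possible adaptation, assuming ASCII punctuation from string.punctuation is a good enough stand-in for the punct class (OTHER_PUNCT, PATTERN and remove_punctuation are illustrative names):

```python
import re
import string

# Assumed stand-in for [^\P{punct}.]: ASCII punctuation other than the dot.
OTHER_PUNCT = "[" + re.escape(string.punctuation.replace(".", "")) + "]"
PATTERN = re.compile(
    r"([^ ]?)(?:" + OTHER_PUNCT + r"|(?<!\d)\.(?!\d))([^ ]?)"
)

def _replace(match):
    before, after = match.group(1), match.group(2)
    # Non-space characters on both sides: keep them apart with a space.
    if before and after:
        return before + " " + after
    return before + after

def remove_punctuation(text):
    """Remove punctuation, except a dot that has a digit next to it."""
    return PATTERN.sub(_replace, text)

print(remove_punctuation("This is a 1example of the text.But, it only is 2.5 percent"))
```

This covers the punctuation part only; "1example" stays joined, so the digit/word spacing still needs a separate step.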