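How do I get rid of punctuation while keeping URLs?

Time: 2017-05-17 04:01:52

Tags: python regex text

I'm working with Twitter data and, as part of cleaning it up a bit, I want to get rid of all punctuation. I can do that easily enough, but my problem is that I also want to keep URLs, which contain some punctuation.

For example, let's say the content of Tweet A is: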

tweet = "check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!"

I can strip the punctuation with the code below. However, this removes all punctuation, including the punctuation inside the URLs.

import re
cleaned = re.sub(r'[^a-zA-Z0-9\s]','',tweet)

This produces:

cleaned = "check out my httpgooglecom324fasdcsdasdf32    links httpsgooglecomersf8vaddasddd2 hooray"

However, I'd like the final output to keep the punctuation inside the URLs, like this:

cleaned = "check out my http://google.com/324fasdcsd?asdf=32&    links https://google.com/ersf8vad?dasd=d&d=2 hooray"

How can I do this in Python? Thanks in advance for your help!

4 answers:

Answer 0: (score: 2)

Use John Gruber's regex to find the URLs:

import re
gruber = re.compile(r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|
zw)\b/?(?!@)))""", re.VERBOSE)  # re.VERBOSE makes the regex ignore the line break inside the pattern

Split the tweet on the URLs:

tweet = "This is my site http://www.example.com/, and this site http://stackoverflow.com rules!"
split_tweet = gruber.split(tweet)

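You get a list of strings. Because the pattern has a single capturing group, the non-URL pieces are always at the even indices of the list and the URLs at the odd ones. So we can iterate over the list and strip punctuation from the even-numbered entries only. (A rare legitimate use case for iterating with range()!)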
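To see the even/odd layout concretely, here is a minimal sketch. It assumes a deliberately simple, hypothetical URL pattern (simple_url, not Gruber's regex) and a made-up tweet, just to show what re.split returns when the pattern contains one capturing group:

```python
import re

# Hypothetical, deliberately simple URL pattern (NOT Gruber's regex).
# The parentheses form a capturing group, so re.split keeps the matches.
simple_url = re.compile(r"(https?://\S+)")

tweet = "see http://a.example/x?y=1 and https://b.example/z now!"
parts = simple_url.split(tweet)

# Even indices hold the surrounding text, odd indices hold the URLs.
for index, part in enumerate(parts):
    kind = "URL " if index % 2 else "text"
    print(index, kind, repr(part))
```

Without the capturing group the URLs would be dropped from the list entirely, so the parentheses matter here.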

from string import punctuation
punc_table = {ord(c): None for c in punctuation}

for i in range(0, len(split_tweet), 2):
    split_tweet[i] = split_tweet[i].translate(punc_table)

Now we join it all back together:

final_tweet = "".join(split_tweet)

This being Python, most of this can be done in one line with a generator expression, so the final code is:

import re
from string import punctuation
punc_table = {ord(c): None for c in punctuation}

gruber = re.compile(r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|
zw)\b/?(?!@)))""", re.VERBOSE)  # re.VERBOSE makes the regex ignore the line break inside the pattern

tweet = "This is my site http://www.example.com/, and this site http://stackoverflow.com rules!"
final_tweet = "".join(t if i % 2 else t.translate(punc_table) for (i, t) in enumerate(gruber.split(tweet)))

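Note that I've used the Python 3 style of str.translate. For Python 2, you don't need to create punc_table; just use text.translate(None, punctuation) as seen in Wevenman's answer. You'll also probably want to use xrange instead of range.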
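As a side note for Python 3, the same deletion table can also be built with str.maketrans. This is only an alternative sketch, not what the answer above uses; the sample text is made up for illustration:

```python
from string import punctuation

# With three arguments, str.maketrans maps every character of the third
# argument to None, which makes translate() delete it.
punc_table = str.maketrans("", "", punctuation)

text = "check out, my links... hooray!"  # made-up sample text
cleaned = text.translate(punc_table)
print(cleaned)
```

The resulting table is the same dictionary as the comprehension-based punc_table, so either form can be passed to translate().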

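Answer 1: (score: 1)

Here is one way to do it. First find the URLs, then find all the punctuation, then remove any punctuation that is not inside a URL.

It's probably not the most efficient approach, but at least it's easier to understand than a crazy regex!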

import re
def remove_punc_except_urls(s, punctuationRegex=r'[^a-zA-Z0-9\s]'):
  # arrays to keep track of indices
  urlInds = []
  puncInds = []
  # find all the urls
  for m in re.finditer(r'(https?|ftp)://[^\s/$.?#].[^\s]*', s):
    urlInds.append((m.start(0), m.end(0)))
  # find all the punctuation
  for m in re.finditer(punctuationRegex, s):
    puncInds.append((m.start(0), m.end(0)))
  # start removing punctuation from end so that indices do not change
  puncInds.reverse()
  # go through each of the punctuation indices and remove the character if it is not inside a url
  for puncRange in puncInds:
    inUrl = False
    # check each url to see if this character is in it
    for urlRange in urlInds:
      # m.end() is exclusive, so the upper bound must be strict
      if urlRange[0] <= puncRange[0] < urlRange[1]:
        inUrl = True
        break
    if not inUrl:
      # remove the punctuation from the string
      s = s[:puncRange[0]] + s[puncRange[1]:]
  return s

Here it is on your example:

samp = 'check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!'
print(samp)
print(remove_punc_except_urls(samp))

Output:

check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!
check out my http://google.com/324fasdcsd?asdf=32&    links https://google.com/ersf8vad?dasd=d&d=2 hooray
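For comparison, the same idea can probably be squeezed into a single pass with re.sub and a replacement function. This is a sketch, not part of the answer above; it assumes a URL is simply http:// or https:// followed by non-whitespace, and the function name strip_punc_keep_urls is made up:

```python
import re

# Assumption: a URL is "http://" or "https://" followed by non-whitespace.
# The URL alternative comes first, so at a position where a URL starts it
# is tried before the single-character punctuation alternative.
pattern = re.compile(r"(https?://\S+)|[^a-zA-Z0-9\s]")

def strip_punc_keep_urls(text):
    # group(1) is the URL when the first alternative matched; otherwise it
    # is None and the matched punctuation character is replaced by "".
    return pattern.sub(lambda m: m.group(1) or "", text)

samp = ('check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links '
        'https://google.com/ersf8vad?dasd=d&d=2 hooray!')
print(strip_punc_keep_urls(samp))
```

This avoids tracking indices by hand, at the cost of a slightly denser regex.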

Answer 2: (score: 0)

Assuming the content of your tweet is stored as a string named tweet:

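Answer 3: (score: 0)

One way you could do it is to find the URLs and save them; strip the punctuation; find the now-broken URLs; and replace the broken ones with the saved ones: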

import re

tweet = "check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!"

urls_real = []
urls_busted = []
p = re.compile(r"http\S*")  # raw string so that \S is not treated as a string escape
for m in p.finditer(tweet):
    urls_real.append(m.group())

tweet = re.sub(r'[^a-zA-Z0-9\s]','',tweet)

for m in p.finditer(tweet):
    urls_busted.append(m.group())

for i in range(len(urls_real)):
    tweet = tweet.replace(urls_busted[i], urls_real[i])

print(tweet)

Result:

check out my http://google.com/324fasdcsd?asdf=32&    links https://google.com/ersf8vad?dasd=d&d=2 hooray

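This code requires that both the original and the broken URLs start with http and end at a whitespace character. The regex Erics used in his answer would also work (and is more robust).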
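Under the same assumption (URLs start with http and are delimited by whitespace), the second search and the replace step can be avoided by cleaning the tweet one whitespace-delimited piece at a time. A rough sketch, with the hypothetical helper name clean_tokens:

```python
import re

def clean_tokens(text):
    # Splitting on a captured whitespace run keeps the separators in the
    # list, so the original spacing survives the round trip.
    pieces = re.split(r"(\s+)", text)
    cleaned = []
    for piece in pieces:
        if piece.startswith("http"):
            # Assumed to be a URL: leave it untouched.
            cleaned.append(piece)
        else:
            cleaned.append(re.sub(r"[^a-zA-Z0-9\s]", "", piece))
    return "".join(cleaned)

tweet = ("check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links "
         "https://google.com/ersf8vad?dasd=d&d=2 hooray!")
print(clean_tokens(tweet))
```

Note that any ordinary word that happens to start with http would also be left untouched, which is the same limitation the original code has.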