Keeping tweet text clean of URLs

Time: 2018-08-23 12:10:52

Tags: python

As part of an Information Retrieval project in Python (building a small search engine), I want to keep only the plain text of some downloaded tweets (a .csv dataset of exactly 27,000 tweets). The tweets look like this:

"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL

"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX

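I want to use a regular expression to remove the unnecessary parts of each tweet, such as URLs, punctuation, and so on.

So the result would be: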

"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"

"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"

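I tried this: pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]'), but it does not work perfectly; for example, parts of the URL still remain in the result.

Please help me find a regular expression pattern that does what I need.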

1 answer:

Answer 0 (score: 1)

This might help.

Demo:
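One way to get the output shown in the question, sketched with only the standard-library re module: remove URLs first, then collapse every run of non-letter characters into a single space. The names clean_tweet, URL_RE and NON_LETTER_RE are hypothetical, chosen for illustration; this is a minimal sketch under those assumptions, not necessarily the code the answer originally showed.

```python
import re

# Assumption: URLs in the tweets start with http:// or https:// and run
# until the next whitespace character.
URL_RE = re.compile(r"https?://\S+")

# Any run of characters that are not ASCII letters (digits, punctuation,
# quotes, dashes, '@', whitespace) is collapsed into a single space.
NON_LETTER_RE = re.compile(r"[^A-Za-z]+")


def clean_tweet(text):
    """Return only the words of a tweet, separated by single spaces."""
    without_urls = URL_RE.sub(" ", text)
    letters_only = NON_LETTER_RE.sub(" ", without_urls)
    return letters_only.strip()


tweets = [
    '"The basic longing to live with dignity...these yearnings are universal. '
    'They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL',
    '"Democracy...allows us to peacefully work through our differences, '
    'and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX',
]

for tweet in tweets:
    print(clean_tweet(tweet))
# The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS
# Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece
```

The order of the two substitutions matters: a letters-only pattern such as [A-Za-z]+ applied directly to the raw tweet would still pick up URL fragments like https, twitter and com, which is the problem described in the question.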
