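I have the following code, which extracts all the patterns with regular expressions and stores them.
How can I get the expected output? I am facing two problems. My text contains two email IDs, but only one is shown: why does this happen, and how can I correct it? Also, 21 is part of a date, yet it is reported as NUMSTR instead of 123456: how do I fix this? I think the code only picks up the first occurrence; how can I get all occurrences in the text?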
import re
def replace_entities(example):
    res = ''
    # dd mm yyyy
    m = re.search("(\d{1,31}(:? |\-|\/)\d{1,12}(:? |\-|\/)\d{4})", example) # dd/mm/yyyy
    if m:
        res = res + "\n{} : DATESTR".format(m.group())
    # email id
    m = re.search("[\w\.-]+@[\w\.-]+", example)
    if m:
        res = res +"\n{} : EMAILIDSTR".format(m.group())
    # URL
    m = re.search('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', example)
    if m:
        res= res +"\n{} : URLSTR".format(m.group())
    # NUMBERS
    m = re.search(r'\d+', example)
    if m:
        res = res + "\n{} : NUMSTR".format(m.group())
    return res.strip()
print(replace_entities('My name is ali, Date is 21/08/2018 Total amount is chandanpatil@yahoo.com euros 10,2018/13/09 saylijawale@gmail.com. https://imarticus.com Account number is 123456'))
This is the output I get:
21/08/2018 : DATESTR
chandanpatil@yahoo.com : EMAILIDSTR
https://imarticus.com : URLSTR
21 : NUMSTR # this is not correct
Expected output:
21/08/2018 : DATESTR
chandanpatil@yahoo.com : EMAILIDSTR
saylijawale@gmail.com : EMAILIDSTR
https://imarticus.com : URLSTR
123456 :NUMSTR
Answer 0 (score: 0)
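Use findall to get all the email IDs and iterate over each of them.
For NUMSTR, your code finds the first number in example. If your input always has the same format, take the last number in the string instead.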
import re
def replace_entities(example):
    res = ''
    # dd mm yyyy: one or two digits for day and month, four for the year
    m = re.search(r"(\d{1,2}(?: |-|/)\d{1,2}(?: |-|/)\d{4})", example) # dd/mm/yyyy
    if m:
        res = res + "\n{} : DATESTR".format(m.group())
    # email id: findall returns every match, not just the first one
    m = re.findall(r"[\w\.-]+@[\w\.-]+", example)
    if m:
        for email in m:
            # drop a sentence-ending period that the pattern picks up
            res = res + "\n{} : EMAILIDSTR".format(email.rstrip('.'))
    # URL
    m = re.search(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', example)
    if m:
        res = res + "\n{} : URLSTR".format(m.group())
    # NUMBERS: assumes the number of interest is the last thing in the input
    m = re.match(r'.*?([0-9]+)$', example)
    if m:
        res = res + "\n{} : NUMSTR".format(m.group(1))
    return res.strip()
print(replace_entities('My name is ali, Date is 21/08/2018 Total amount is chandanpatil@yahoo.com euros 10,2018/13/09 saylijawale@gmail.com. https://imarticus.com Account number is 123456'))
'''
21/08/2018 : DATESTR
chandanpatil@yahoo.com : EMAILIDSTR
saylijawale@gmail.com : EMAILIDSTR
https://imarticus.com : URLSTR
123456 : NUMSTR
'''
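Taking the last number only works when the number really is at the end of the input. As a rough sketch of a more general idea (not part of the answer above), each kind of entity can be matched in turn and blanked out of the text, so that the digits of a date are no longer there when the number pattern runs. The name extract_entities, the PATTERNS list and the patterns themselves are assumptions made for illustration:

```python
import re

# Illustrative patterns (assumptions), tried in this order so that dates,
# e-mail addresses and URLs are consumed before plain numbers.
PATTERNS = [
    ("DATESTR", r"\b\d{1,2}[ /-]\d{1,2}[ /-]\d{4}\b"),
    ("EMAILIDSTR", r"[\w.-]+@[\w-]+(?:\.[\w-]+)+"),
    ("URLSTR", r"https?://[^\s,]+"),
    ("NUMSTR", r"\b\d+\b"),
]

def extract_entities(text):
    """Return (value, label) pairs for every match, in order of appearance."""
    found = []
    for label, pattern in PATTERNS:
        def record(match, label=label):
            found.append((match.start(), match.group(), label))
            # Replace the match with spaces of the same length, so later
            # patterns cannot match it again and offsets stay unchanged.
            return " " * len(match.group())
        text = re.sub(pattern, record, text)
    # Sort by start offset to restore the order of appearance.
    return [(value, label) for _, value, label in sorted(found)]

text = ('My name is ali, Date is 21/08/2018 Total amount is '
        'chandanpatil@yahoo.com euros 10,2018/13/09 saylijawale@gmail.com. '
        'https://imarticus.com Account number is 123456')
for value, label in extract_entities(text):
    print("{} : {}".format(value, label))
```

With this sketch, 21 is no longer reported as a NUMSTR and both email IDs are found. Note that 10, 2018, 13 and 09 are still reported as numbers, because 2018/13/09 is not in the dd/mm/yyyy form that the date pattern looks for.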
Answer 1 (score: 0)
You can write a small generator function that uses alternation in the regular expression:
import re
data = """My name is ali, Date is 21/08/2018 Total amount is chandanpatil@yahoo.com euros 10,2018/13/09 saylijawale@gmail.com. https://imarticus.com Account number is 123456"""
def finder(string=None):
    # define the tokens
    tokens = {
        'DATESTR': r'\d{2}/\d{2}/\d{4}',
        'EMAILIDSTR': r'\S+@\S+',
        'URLSTR': r'https?://\S+',
        'NUMSTR': r'\d+'}
    # build the expression
    # using join and a listcomp
    rx = re.compile("|".join(
        ['(?P<{}>{})'.format(key, value)
         for key, value in tokens.items()])
    )
    # loop over the found matches
    for match in rx.finditer(string):
        for token in tokens:
            value = match.group(token)
            if value:
                if token in ['DATESTR', 'EMAILIDSTR']:
                    value = value.rstrip('.')
                yield (value, token)
                break
# iterate over the found tokens
for value, token in finder(data):
    print("Value: {}, Token: {}".format(value, token))
This yields:
Value: 21/08/2018, Token: DATESTR
Value: chandanpatil@yahoo.com, Token: EMAILIDSTR
Value: 10, Token: NUMSTR
Value: 2018, Token: NUMSTR
Value: 13, Token: NUMSTR
Value: 09, Token: NUMSTR
Value: saylijawale@gmail.com, Token: EMAILIDSTR
Value: https://imarticus.com, Token: URLSTR
Value: 123456, Token: NUMSTR
See a demo for the expression on regex101.com.
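As a possible simplification of the generator above, the inner loop over the tokens can be dropped, because a match object records the name of the group that matched in its lastgroup attribute. The sketch below assumes the same token patterns as in the answer; the names TOKENS, RX and find_tokens are made up for illustration. It also relies on dictionaries keeping insertion order (Python 3.7+), since the order of the alternatives decides that a date is tried before a plain number:

```python
import re

# Same token patterns as in the answer; the order matters because
# alternatives are tried from left to right at each position.
TOKENS = {
    "DATESTR": r"\d{2}/\d{2}/\d{4}",
    "EMAILIDSTR": r"\S+@\S+",
    "URLSTR": r"https?://\S+",
    "NUMSTR": r"\d+",
}

# One named group per token, joined with alternation.
RX = re.compile("|".join(
    "(?P<{}>{})".format(name, pattern) for name, pattern in TOKENS.items()
))

def find_tokens(text):
    """Yield (value, token) for every match, in order of appearance."""
    for match in RX.finditer(text):
        token = match.lastgroup          # name of the group that matched
        value = match.group(token)
        if token == "EMAILIDSTR":
            value = value.rstrip(".")    # drop a sentence-ending period
        yield value, token

data = ("My name is ali, Date is 21/08/2018 Total amount is "
        "chandanpatil@yahoo.com euros 10,2018/13/09 saylijawale@gmail.com. "
        "https://imarticus.com Account number is 123456")
for value, token in find_tokens(data):
    print("Value: {}, Token: {}".format(value, token))
```

On the sample text this should report the same values and tokens as the generator in the answer.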