Question

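I have the following code, which extracts all the patterns matched by a set of regular expressions and stores them.

How do I get the expected output? The problem I am facing is that my text contains two email IDs, but only one is shown. Why does this happen, and how can I correct it? Also, 21 is part of the date, yet it is reported as NUMSTR instead of 123456. How can I correct this? I think only the first occurrence is being returned; how can I get every occurrence in the text?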

import re
def replace_entities(example):
    res = ''
    # dd mm yyyy
    m = re.search("(\d{1,31}(:? |\-|\/)\d{1,12}(:? |\-|\/)\d{4})", example)  # dd/mm/yyyy
    if m:
        res = res + "\n{} : DATESTR".format(m.group())
    # email id
    m = re.search("[\w\.-]+@[\w\.-]+", example)
    if m:
        res = res +"\n{} : EMAILIDSTR".format(m.group())
    # URL
    m = re.search('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', example)
    if m:

        res= res +"\n{} : URLSTR".format(m.group())
    # NUMBERS
    m = re.search(r'\d+', example)
    if m:
        res = res + "\n{} : NUMSTR".format(m.group())
    return res.strip()

print(replace_entities('My name is ali, Date is 21/08/2018 Total amount is chandanpatil@yahoo.com euros 10,2018/13/09  saylijawale@gmail.com. https://imarticus.com   Account number is 123456'))

Here is the output I get:

21/08/2018 : DATESTR
chandanpatil@yahoo.com : EMAILIDSTR
https://imarticus.com : URLSTR
21 : NUMSTR   # this is not correct

Expected output:

21/08/2018 : DATESTR
chandanpatil@yahoo.com : EMAILIDSTR
saylijawale@gmail.com : EMAILIDSTR
https://imarticus.com : URLSTR
123456 :NUMSTR

Answer 1

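Use findall to get all the email IDs and iterate over each one.

For NUMSTR, your code finds the first number in example. If your input always has the same format, take the last number in the string instead.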

import re

def replace_entities(example):
    res = ''

    # dd mm yyyy
    # one or two digits for day and month; "(:?" in the question was a typo for "(?:"
    m = re.search(r"(\d{1,2}[ /-]\d{1,2}[ /-]\d{4})", example)  # dd/mm/yyyy
    if m:
        res = res + "\n{} : DATESTR".format(m.group())

    # email id
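    # findall returns every match, not just the first one
    emails = re.findall(r"[\w.-]+@[\w.-]+", example)
    for email in emails:
        # the character class allows '.', so drop a trailing sentence period
        res = res + "\n{} : EMAILIDSTR".format(email.rstrip('.'))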

    # URL
    m = re.search(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', example)
    if m:
        res= res +"\n{} : URLSTR".format(m.group())

    # NUMBERS: take the run of digits at the very end of the string
    m = re.match(r'.*?([0-9]+)$', example)
    if m:
        res = res + "\n{} : NUMSTR".format(m.group(1))
    return res.strip()

print(replace_entities('My name is ali, Date is 21/08/2018 Total amount is chandanpatil@yahoo.com euros 10,2018/13/09  saylijawale@gmail.com. https://imarticus.com   Account number is 123456'))

'''
21/08/2018 : DATESTR
chandanpatil@yahoo.com : EMAILIDSTR
saylijawale@gmail.com : EMAILIDSTR
https://imarticus.com : URLSTR
123456 : NUMSTR
'''
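Taking the last number only works while the digits sit at the very end of the string. As a rough sketch of an alternative (not part of the answer above), one could treat a number as a run of digits with whitespace or the string boundary on both sides. That definition is an assumption made for illustration, as is the trailing "(verified)" added to the sample text to show that the result no longer depends on position.

```python
import re

text = ("Date is 21/08/2018 Total amount is chandanpatil@yahoo.com euros "
        "10,2018/13/09  saylijawale@gmail.com. https://imarticus.com   "
        "Account number is 123456 (verified)")

# re.search returns only the first match; re.findall returns all of them.
first_email = re.search(r"[\w.-]+@[\w.-]+", text).group()
all_emails = [e.rstrip(".") for e in re.findall(r"[\w.-]+@[\w.-]+", text)]

# Assumption: a "number" is a digit run delimited by whitespace (or the
# string boundary) on both sides, so digits inside 21/08/2018 or
# 10,2018/13/09 are skipped.
standalone_numbers = re.findall(r"(?<!\S)\d+(?!\S)", text)

print(first_email)
print(all_emails)
print(standalone_numbers)
```

The lookarounds `(?<!\S)` and `(?!\S)` do not consume any characters; they only reject a match that touches a non-whitespace neighbour such as `/` or `,`.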

Answer 2

You can write a small generator function that uses alternation in the regular expression:

import re

data = """My name is ali, Date is 21/08/2018 Total amount is chandanpatil@yahoo.com euros 10,2018/13/09  saylijawale@gmail.com. https://imarticus.com   Account number is 123456"""

def finder(string=None):
    # define the tokens
    tokens = {
        'DATESTR': r'\d{2}/\d{2}/\d{4}', 
        'EMAILIDSTR': r'\S+@\S+',
        'URLSTR': r'https?://\S+',
        'NUMSTR': r'\d+'}

    # build the expression
    # using join and a listcomp
    rx = re.compile("|".join(
        ['(?P<{}>{})'.format(key, value) 
        for key, value in tokens.items()])
    )

    # loop over the found matches
    for match in rx.finditer(string):
        for token in tokens:
            value = match.group(token)
            if value:
                if token in ['DATESTR', 'EMAILIDSTR']:
                    value = value.rstrip('.')
                yield (value, token)
                break

# iterate over the found tokens
for value, token in finder(data):
    print("Value: {}, Token: {}".format(value, token))

This yields:

Value: 21/08/2018, Token: DATESTR
Value: chandanpatil@yahoo.com, Token: EMAILIDSTR
Value: 10, Token: NUMSTR
Value: 2018, Token: NUMSTR
Value: 13, Token: NUMSTR
Value: 09, Token: NUMSTR
Value: saylijawale@gmail.com, Token: EMAILIDSTR
Value: https://imarticus.com, Token: URLSTR
Value: 123456, Token: NUMSTR

See a demo for the expression on regex101.com.
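A possible simplification of the generator, sketched here under a few assumptions: `Match.lastgroup` reports the name of the named group that matched, so the inner loop over the tokens is not strictly needed. The name `find_tokens` and the stricter email pattern (which stops before a trailing period, so no `rstrip` is required) are illustrative choices, not part of the answer above.

```python
import re

data = ("My name is ali, Date is 21/08/2018 Total amount is "
        "chandanpatil@yahoo.com euros 10,2018/13/09  saylijawale@gmail.com. "
        "https://imarticus.com   Account number is 123456")

# Order matters: at any position the first alternative that matches wins,
# so DATESTR has to come before NUMSTR. Dicts keep insertion order on
# Python 3.7 and later.
TOKENS = {
    "DATESTR": r"\d{2}/\d{2}/\d{4}",
    "EMAILIDSTR": r"[\w.-]+@[\w-]+(?:\.[\w-]+)+",  # assumed pattern
    "URLSTR": r"https?://\S+",
    "NUMSTR": r"\d+",
}
RX = re.compile("|".join(
    "(?P<{}>{})".format(name, pattern) for name, pattern in TOKENS.items()))


def find_tokens(text):
    """Yield (value, token_name) for every match, in order of appearance."""
    for match in RX.finditer(text):
        # lastgroup is the name of the named group that matched
        yield match.group(), match.lastgroup


for value, token in find_tokens(data):
    print("Value: {}, Token: {}".format(value, token))
```

On the sample text this should produce the same nine pairs as the listing above, including the digit runs inside `10,2018/13/09`.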
