Extract the year from a string in Python

Time: 2016-10-19 03:54:29

Tags: python regex

How do I parse the following string in Python to extract the year:

'years since 1250-01-01 0:0:0'

The answer should be 1250

3 answers:

Answer 0: (score: 11)

There are a number of ways to do this; here are a few options:

  • the dateutil parser in "fuzzy" mode:

    In [1]: s = 'years since 1250-01-01 0:0:0'
    
    In [2]: from dateutil.parser import parse
    
    In [3]: parse(s, fuzzy=True).year  # resulting year would be an integer
    Out[3]: 1250
    
  • a regular expression with a capturing group:

    In [2]: import re
    
    In [3]: re.search(r"years since (\d{4})", s).group(1)
    Out[3]: '1250'
    
  • splitting on "since" and then on the dash:

    In [2]: s.split("since", 1)[1].split("-", 1)[0].strip()
    Out[2]: '1250'
    
  • or even splitting on the first dash and slicing the end of the first substring:

    In [2]: s.split("-", 1)[0][-4:]
    Out[2]: '1250'
    

The last two involve more "moving parts" and may not be applicable, depending on the possible variations of the input string.
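To illustrate the point about "moving parts", here is a minimal sketch comparing the dash-slicing approach with the regex one on a slightly different input. The helper names `year_by_slice` and `year_by_regex` and the `variant` string are made up for illustration; they are not part of the original question.

```python
import re

def year_by_slice(s):
    # Split on the first dash and keep the last four characters before it.
    return s.split("-", 1)[0][-4:]

def year_by_regex(s):
    # Capture four digits right after the literal prefix "years since ".
    match = re.search(r"years since (\d{4})", s)
    return match.group(1) if match else None

original = 'years since 1250-01-01 0:0:0'
# Hypothetical variation: a dash appears before the date.
variant = 'time-units: years since 1250-01-01 0:0:0'

print(year_by_slice(original), year_by_regex(original))
print(year_by_slice(variant), year_by_regex(variant))
```

On the variant, the slicing version silently returns `'time'` because the first dash is no longer the one inside the date, while the regex still finds `'1250'`.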

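Answer 1: (score: 5)

You can use a regular expression with a capture group around the four digits, while also making sure a specific pattern surrounds it. I would probably look for:

  • 4 digits, captured: (\d{4})

  • a hyphen: -

  • two digits: \d{2}

  • a hyphen: -

  • two digits: \d{2}

Giving: (\d{4})-\d{2}-\d{2}

Demo:

>>> import re
>>> d = re.findall(r'(\d{4})-\d{2}-\d{2}', 'years since 1250-01-01 0:0:0')
>>> d
['1250']
>>> d[0]
'1250'

If you need it as an int, just convert it:

>>> int(d[0])
1250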
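If the input might not contain a date at all, it can help to wrap the lookup in a small function. The following is only a sketch: `extract_year` and `DATE_PATTERN` are hypothetical names, and returning `None` on a miss is an assumption about what the caller wants.

```python
import re

# Four captured digits followed by -DD-DD (hypothetical module-level constant).
DATE_PATTERN = re.compile(r'(\d{4})-\d{2}-\d{2}')

def extract_year(text):
    """Return the year as an int, or None if no YYYY-MM-DD-like date is found."""
    match = DATE_PATTERN.search(text)
    if match is None:
        # Assumption: a missing date is reported as None rather than an exception.
        return None
    return int(match.group(1))

print(extract_year('years since 1250-01-01 0:0:0'))
print(extract_year('no date in this string'))
```

Using `search` instead of `findall` stops at the first match, which is enough when only one date is expected.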

Answer 2: (score: 2)

The following regular expression should capture the four-digit year as the first capture group:

^.*(\d{4})-\d{2}-\d{2}.*$
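A quick way to try this pattern in Python, assuming the capture group is written as `(\d{4})`. The helper name `first_group` is hypothetical.

```python
import re

# Anchored pattern: anything, then YYYY-MM-DD with the year captured, then anything.
pattern = re.compile(r'^.*(\d{4})-\d{2}-\d{2}.*$')

def first_group(s):
    # re.match anchors at the start of the string, matching the leading ^.
    m = pattern.match(s)
    return m.group(1) if m else None

print(first_group('years since 1250-01-01 0:0:0'))
print(first_group('no date here'))
```

Because `.` does not match newlines by default, this anchored form only works when the date sits on the first line of the string; the unanchored `re.search` variant is more forgiving for multi-line input.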