Regular expression to detect company names in Python

Time: 2021-07-04 20:59:54

Tags: python regex

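I want to detect company names with a regular expression in Python.

Here is my idea:

  1. The company name should contain 1 to 3 words.
  2. The first word of the company name should be capitalized.
  3. One of the words in the name can end in .com or .co (Amazon.com Inc).
  4. The last word of the company name (the fourth word) should be Inc., Ltd, GmbH, AG, Group, Holding, etc.
  5. Between the last word of the name and Inc., Ltd, GmbH, AG there can sometimes be ',' or ', '.

I tried something like this, but it does not work: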

import re

address_1 = 'I work in Amazon.com Inc.'
address_2 = 'Company named Swiss Medic Holding invested in vaccine'
address_3 = 'what do you think about Abercrombie & Fitch Co. ?'
address_4 = 'do you work in Delta Group?'
address_5 = 'I have worked in CocaCola Gmbh'

regex_company = r'([A-Z][\w]+[ -]+){1,3}(Ltd|ltd|LTD|llc|LLC|Inc|inc|INC|plc|Corp|Group)'

# Try the pattern on every sample sentence
for address in [address_1, address_2, address_3, address_4, address_5]:
    found = re.search(regex_company, address)
    print(found)
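A quick way to see where the attempt falls short is to run the pattern against all five sentences and record what comes back. The sketch below does that; the names `samples` and `results` are introduced here only for illustration.

```python
import re

# The asker's pattern, unchanged apart from the raw-string prefix
regex_company = r'([A-Z][\w]+[ -]+){1,3}(Ltd|ltd|LTD|llc|LLC|Inc|inc|INC|plc|Corp|Group)'

# Illustrative list holding the same sentences as address_1..address_5
samples = [
    'I work in Amazon.com Inc.',
    'Company named Swiss Medic Holding invested in vaccine',
    'what do you think about Abercrombie & Fitch Co. ?',
    'do you work in Delta Group?',
    'I have worked in CocaCola Gmbh',
]

results = []
for text in samples:
    found = re.search(regex_company, text)
    # Keep the matched text, or None when the pattern finds nothing
    results.append(found.group(0) if found else None)
    print(repr(text), '->', results[-1])
```

Only the 'Delta Group' sentence produces a match. In the others the pattern appears to fail for one of these reasons: every name word must be followed by a space or hyphen, so the '.' in Amazon.com breaks it; '&' is not allowed between words; and Holding, Co and Gmbh are missing from the suffix list.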

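I want to print the detected company names. I used the same regex logic to find street addresses and it works well, but it does not work for company names. This is the regex I used:

regex_street = r"(\d{0,6})(?:\w)\s([A-Z][\w]+[ -]+){1,3}(Street|St|Road|Rd)"

Regex logic: number + 1-3 words + Street/St/Road/Rd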
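To make that logic concrete, here is a small check of the street pattern on a made-up sentence. The sentence and the variable names are assumptions for illustration, not taken from the original post. It also shows a detail worth knowing: because `(?:\w)` follows `(\d{0,6})`, the last digit of the house number is consumed by `\w` and does not end up in the first capture group.

```python
import re

regex_street = r"(\d{0,6})(?:\w)\s([A-Z][\w]+[ -]+){1,3}(Street|St|Road|Rd)"

sentence = "I live at 221 Baker Street"  # hypothetical example text
found = re.search(regex_street, sentence)

whole_match = found.group(0)   # full matched text
number_group = found.group(1)  # digits captured before the (?:\w)
suffix_group = found.group(3)  # Street / St / Road / Rd

print(whole_match)
print(number_group)
print(suffix_group)
```

If the full house number is needed in a group, dropping `(?:\w)` or moving it inside the capture group would be one option to try.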

2 Answers:

Answer 0: (score: 2)

Use

\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b


Explanation

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  [A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    co                       'co'
--------------------------------------------------------------------------------
    m?                       'm' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )?                       end of grouping
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (between 0 and 2
                           times (matching the most amount
                           possible)):
--------------------------------------------------------------------------------
    [ -]+                    any character of: ' ', '-' (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      &                        '&'
--------------------------------------------------------------------------------
      [ -]+                    any character of: ' ', '-' (1 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
    [A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      \.                       '.'
--------------------------------------------------------------------------------
      co                       'co'
--------------------------------------------------------------------------------
      m?                       'm' (optional (matching the most
                               amount possible))
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
  ){0,2}                   end of grouping
--------------------------------------------------------------------------------
  [,\s]+                   any character of: ',', whitespace (\n, \r,
                           \t, \f, and " ") (1 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  (?i:                     group, but do not capture (case-
                           insensitive) (with ^ and $ matching
                           normally) (with . not matching \n)
                           (matching whitespace and # normally):
--------------------------------------------------------------------------------
    ltd                      'ltd'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    llc                      'llc'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    inc                      'inc'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    plc                      'plc'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    co                       'co'
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      rp                       'rp'
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    group                    'group'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    holding                  'holding'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    gmbh                     'gmbh'
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
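The breakdown above maps naturally onto Python's `re.VERBOSE` mode, which lets the pattern carry its own comments. The following is a sketch of that idea, not part of the original answer; the names `company_re` and `verbose_hits` are introduced here for illustration. Whitespace outside character classes is ignored in this mode, so the space inside `[ -]` still counts.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so each piece is commented
company_re = re.compile(r"""
    \b[A-Z]\w+ (?:\.com?)?          # first capitalized word, optional .co/.com
    (?:                             # up to two more words:
        [ -]+ (?:&[ -]+)?           #   space or hyphen, optionally around '&'
        [A-Z]\w+ (?:\.com?)?        #   capitalized word, optional .co/.com
    ){0,2}
    [,\s]+                          # comma and/or whitespace before the suffix
    (?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)  # suffix, any letter case
    \b
""", re.VERBOSE)

verbose_hits = company_re.findall("I work in Amazon.com Inc.")
print(verbose_hits)
```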

Python code

import re

regex = r"\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b"

test_str = ("I work in Amazon.com Inc.\n"
    "Company named Swiss Medic Holding invested in vaccine\n"
    "what do you think about Abercrombie & Fitch Co. ?\n"
    "do you work in Delta Group?\n"
    "I have worked in CocaCola Gmbh")

print(re.findall(regex, test_str))

Results: ['Amazon.com Inc', 'Swiss Medic Holding', 'Abercrombie & Fitch Co', 'Delta Group', 'CocaCola Gmbh']
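Two points from the question are not exercised by these test strings: the comma separator (rule 5) and the AG suffix (rule 4), which is not in this pattern's suffix list. The sketch below checks both. The sample sentence and the variant `regex_with_ag` are assumptions made for illustration; adding `ag` to the alternation is one possible extension, not something the answer proposes.

```python
import re

regex = r"\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b"

# Hypothetical variant: the same pattern with 'ag' appended to the suffix list
regex_with_ag = regex.replace("gmbh)", "gmbh|ag)")

sample = "We signed with Acme, Inc. and Siemens AG last year"  # made-up text

original_hits = re.findall(regex, sample)
extended_hits = re.findall(regex_with_ag, sample)

print(original_hits)  # comma before the suffix is handled by [,\s]+
print(extended_hits)  # the AG suffix is found only by the extended variant
```

Since the suffix group is case-insensitive, any short suffix added this way also matches in lower case, so it may be worth testing against real text for false positives.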

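Answer 1: (score: 0)

https://regex101.com is great for testing regular expressions. For your specific examples, here is a regex that meets your requirements. In this example, I did not think it was necessary to test for the optional .com.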

import re

# address_1 ... address_5 are the sample strings defined in the question
regex_company = r'[A-Z]([^ ]*[ &]*){0,2}(Inc\.|Ltd|GmbH|AG|Gmbh|Group|Holding|Co\.)'

for address in [address_1, address_2, address_3, address_4, address_5]:
    found = re.search(regex_company, address)
    if found:
        print(found)

# prints:
# <re.Match object; span=(10, 25), match='Amazon.com Inc.'>
# <re.Match object; span=(14, 33), match='Swiss Medic Holding'>
# <re.Match object; span=(24, 47), match='Abercrombie & Fitch Co.'>
# <re.Match object; span=(15, 26), match='Delta Group'>
# <re.Match object; span=(17, 30), match='CocaCola Gmbh'>
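If only the company names are wanted rather than the match objects, `group(0)` returns the matched text. The sketch below collects the names into a list; `addresses` and `names` are illustrative names, and the sentences are the ones from the question.

```python
import re

regex_company = r'[A-Z]([^ ]*[ &]*){0,2}(Inc\.|Ltd|GmbH|AG|Gmbh|Group|Holding|Co\.)'

addresses = [
    'I work in Amazon.com Inc.',
    'Company named Swiss Medic Holding invested in vaccine',
    'what do you think about Abercrombie & Fitch Co. ?',
    'do you work in Delta Group?',
    'I have worked in CocaCola Gmbh',
]

names = []
for address in addresses:
    found = re.search(regex_company, address)
    if found:
        names.append(found.group(0))  # group(0) is the whole matched text

print(names)
```

Unlike the pattern in Answer 0, this one keeps the trailing period of Inc. and Co. in the match, because the dot is part of the suffix alternation. Its suffixes are case-sensitive and there is no trailing `\b`, so a word such as Groups would also satisfy the Group alternative.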
