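Python: comparing parts of the strings in a list with each other

Date: 2020-02-02 14:18:32

Tags: python regex string compare

I am trying to write code that compares every string in a list with the others and then generates a regular expression describing what they have in common.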

list = ["LONDON-UK-L16-N1",
        "LONDON-UK-L17-N1",
        "LONDON-UK-L16-N2",
        "LONDON-UK-L17-N2",
        "PARIS-France-L16-N2"]

I am trying to get output like this:

LONDON-UK-L(16|17)-N(1|2)

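Is this possible? Thanks.

Update: to be clear, this is what I am trying to do.

Input: a list of strings.
Action: compare the list items with each other and check for similarity (keeping the first group of each string fixed), then use a regex for any parts of the items that differ, so that we end up with a single output (using a regex) instead of several items.
Output: a regex matching the parts that differ.

Input: tez15-3-s1-y2 tez15-3-s2-y2 bro40-55-s1-y2

Output: tez15-3-s(1|2)-y2, bro40-55-s1-y2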

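4 answers:

Answer 0 (score: 3)

It is not clear from your question what exactly the problem is. Since the data you provided is consistent and well ordered, you can solve this simply by splitting the items in the list and grouping the pieces.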

loc_list = ["LONDON-UK-L16-N1", "LONDON-UK-L17-N1", "LONDON-UK-L16-N2", 
            "LONDON-UK-L16-N2", "PARIS-France-L16-N2"]

split_loc_list = [location.split("-")  for location in loc_list]

locs = {}

for loc in split_loc_list:
    locs.setdefault("-".join(loc[0:2]), {}).\
                        setdefault("L", set()).add(loc[2].strip("L"))

    locs.setdefault("-".join(loc[0:2]), {}).\
                        setdefault("N", set()).add(loc[3].strip("N"))

for loc, vals in locs.items():
    L_vals_sorted = sorted(list(map(int,vals["L"])))
    L_vals_joined = "|".join(map(str,L_vals_sorted))

    N_vals_sorted = sorted(list(map(int,vals["N"])))
    N_vals_joined = "|".join(map(str,N_vals_sorted))

    print(f"{loc}-L({L_vals_joined})-N({N_vals_joined})")

This outputs:

LONDON-UK-L(16|17)-N(1|2)
PARIS-France-L(16)-N(2)

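Because there are only two tags here ("L" and "N"), I simply hard-coded them. If there can be many tags, you can use the following inside the first loop: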

import re
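# Split e.g. "L16" into its tag ("L") and its number ("16"); raw string avoids invalid-escape warnings
split = re.findall(r'\d+|\D+', loc[2])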
key, val = split[0], split[1]
locs.setdefault("-".join(loc[0:2]), {}).\
                        setdefault(key, set()).add(val)

Then, in the second loop, iterate over all the tags instead of only fetching "L" and "N".
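
The generalisation described above can be sketched as a single function. This is only an illustration under stated assumptions: the name summarize is hypothetical, every item is assumed to look like PLACE-REGION followed by fields of the form tag plus number, and, like the code above, values are collected per tag independently, so the resulting pattern may also match combinations that never occurred in the input.

```python
import re
from collections import defaultdict

def summarize(items):
    """Group items by their first two fields and collect the numbers seen per tag."""
    groups = defaultdict(lambda: defaultdict(set))
    for item in items:
        fields = item.split("-")
        prefix = "-".join(fields[:2])
        for field in fields[2:]:
            # Split e.g. "L16" into the tag "L" and the number "16".
            match = re.fullmatch(r"(\D+)(\d+)", field)
            if match is None:
                raise ValueError(f"unexpected field {field!r} in {item!r}")
            tag, number = match.groups()
            groups[prefix][tag].add(int(number))

    patterns = []
    for prefix, tags in groups.items():
        # Tags keep the order in which they were first seen.
        parts = [
            tag + "(" + "|".join(map(str, sorted(numbers))) + ")"
            for tag, numbers in tags.items()
        ]
        patterns.append("-".join([prefix] + parts))
    return patterns

loc_list = ["LONDON-UK-L16-N1", "LONDON-UK-L17-N1", "LONDON-UK-L16-N2",
            "LONDON-UK-L16-N2", "PARIS-France-L16-N2"]

print(summarize(loc_list))
# ['LONDON-UK-L(16|17)-N(1|2)', 'PARIS-France-L(16)-N(2)']
```

Raising on an unexpected field is a design choice for the sketch; silently skipping such fields would also be reasonable if the data is known to be noisy.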

Answer 1 (score: 1)

I have implemented the following solution:

import re 

data = [
  'LONDON-UK-L16-N1',
  'LONDON-UK-L17-N1',
  'LONDON-UK-L16-N2',
  'LONDON-UK-L16-N2',
  'PARIS-France-L16-N2'
]

def deconstruct(data):
  data = [x.split('-') for x in data]
  result = dict()

  for x in data:
    pointer = result

    for y in x:
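      # Non-digit part of the field (e.g. "L" in "L16") selects the next tree level
      substr = re.findall(r'(\D+)', y)
      if substr:
        substr = substr[0]
        if substr not in pointer:
          pointer[substr] = {0: set()}
        pointer = pointer[substr]

      # Digit part of the field (e.g. "16" in "L16") is stored under key 0 of that level
      substr = re.findall(r'(\d+)', y)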
      if substr:
        substr = substr[0]
        pointer[0].add(substr)

  return result

def construct(data, level=0):
  result = []

  for key in data.keys():
    if key != 0:
      if len(data[key][0]) == 1:
        nums = list(data[key][0])[0]
      elif len(data[key][0]) > 1:
        nums = '(' + '|'.join(sorted(list(data[key][0]))) + ')'
      else:
        nums = ''

      deeper_result = construct(data[key], level + 1)
      if not deeper_result:
        result.append([key + nums])
      else:
        for d in deeper_result:
          result.append([key + nums] + d)

  return result if level > 0 else ['-'.join(x) for x in result]

print(construct(deconstruct(data)))
# ['LONDON-UK-L(16|17)-N(1|2)', 'PARIS-France-L16-N2']

Answer 2 (score: 1)

I am posting a new (second) implementation for this problem, which I think is more accurate. I hope it helps:

import re 

data = [
  'LONDON-UK-L16-N1',
  'LONDON-UK-L17-N1',
  'LONDON-UK-L16-N2',
  'LONDON-UK-L17-N2',
  'LONDON-UK-L18-N2',
  'PARIS-France-L16-N2',
]

def merge(data):
  data = sorted(data)  # sort a copy so the caller's list is not mutated
  data = [x.split('-') for x in data]

  for col in range(len(data[0]) - 1, -1, -1):
    result = []

    def add_result():
      result.append([])
      if headstr:
        result[-1] += headstr.split('-')
      if len(list(findnum)) > 1:
        result[-1] += [f'{findstr}({"|".join(sorted(findnum))})']
      elif len(list(findnum)) == 1:
        result[-1] += [f'{findstr}{findnum[0]}']
      if tailstr:
        result[-1] += tailstr.split('-')

    _headstr = lambda x, y: '-'.join(x[:y])
    _tailstr = lambda x, y: '-'.join(x[y + 1:])
    _findstr = lambda x: re.findall(r'(\D+)', x)[0] if re.findall(r'(\D+)', x) else ''
    _findnum = lambda x: re.findall(r'(\d+)', x)[0] if re.findall(r'(\d+)', x) else ''

    headstr = _headstr(data[0], col)
    tailstr = _tailstr(data[0], col)
    findstr = _findstr(data[0][col])
    findnum = []

    for row in data:
      if headstr + findstr + tailstr != _headstr(row, col) + _findstr(row[col]) + _tailstr(row, col):
        add_result()
        headstr = _headstr(row, col)
        tailstr = _tailstr(row, col)
        findstr = _findstr(row[col])
        findnum = []
      if _findnum(row[col]) not in findnum:
        findnum.append(_findnum(row[col]))

    else:
      # for/else: flush the last group once the loop has finished
      add_result()

    data = result[:]

  return ['-'.join(x) for x in result]

print(merge(data))  # ['LONDON-UK-L(16|17)-N(1|2)', 'LONDON-UK-L18-N2', 'PARIS-France-L16-N2']
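
One way to sanity-check generated strings like these is to compile each one as a regex and confirm that every input is fully matched by at least one of them. The helper name check_patterns below is hypothetical, and the sketch assumes the inputs contain no regex metacharacters (otherwise the literal parts would need re.escape when the patterns are built).

```python
import re

def check_patterns(patterns, items):
    """Return the items that none of the patterns matches in full."""
    compiled = [re.compile(p) for p in patterns]
    return [item for item in items
            if not any(c.fullmatch(item) for c in compiled)]

patterns = ['LONDON-UK-L(16|17)-N(1|2)', 'LONDON-UK-L18-N2', 'PARIS-France-L16-N2']
data = [
    'LONDON-UK-L16-N1',
    'LONDON-UK-L17-N1',
    'LONDON-UK-L16-N2',
    'LONDON-UK-L17-N2',
    'LONDON-UK-L18-N2',
    'PARIS-France-L16-N2',
]

print(check_patterns(patterns, data))  # []
```

An empty list means every input is covered. The check does not detect patterns that are too broad, that is, patterns matching strings that were never in the input.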

Answer 3 (score: 0)

Don't use "list" as a variable name. It is not a reserved word, but it shadows Python's built-in list type.

import re

lst = ['LONDON-UK-L16-N1', 'LONDON-UK-L17-N1', 'LONDON-UK-L16-N2', 'LONDON-UK-L16-N2', 'PARIS-France-L16-N2']

def check_it(string):
    # (\d+) captures the whole number; (\d)* would keep only its last digit
    return re.search(r'[a-zA-Z\-]*L(\d+)-N(\d+)', string)

[check_it(x).group(0) for x in lst]

This outputs:

['LONDON-UK-L16-N1',
 'LONDON-UK-L17-N1',
 'LONDON-UK-L16-N2',
 'LONDON-UK-L16-N2',
 'PARIS-France-L16-N2']

From there, look at the groups and define a group that covers the part you want to use for the similarity check.
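
As a rough illustration of that last step, the sketch below uses named groups: one group for the part that must be identical and one for each part that may vary. The pattern, the group names (prefix, l, n) and the function name group_by_prefix are all assumptions made for this example, not part of the answer above.

```python
import re
from collections import defaultdict

# Hypothetical pattern: "prefix" must be identical for two items to count
# as similar; "l" and "n" are the numeric parts that are allowed to differ.
PATTERN = re.compile(r"(?P<prefix>[a-zA-Z-]+)-L(?P<l>\d+)-N(?P<n>\d+)")

def group_by_prefix(items):
    """Collect the L and N numbers seen for each shared prefix."""
    groups = defaultdict(lambda: {"l": set(), "n": set()})
    for item in items:
        m = PATTERN.fullmatch(item)
        if m is None:
            continue  # skip items that do not follow the assumed layout
        groups[m["prefix"]]["l"].add(m["l"])
        groups[m["prefix"]]["n"].add(m["n"])
    # Sort numerically so that e.g. "9" comes before "16".
    return {prefix: {name: sorted(values, key=int) for name, values in parts.items()}
            for prefix, parts in groups.items()}

lst = ['LONDON-UK-L16-N1', 'LONDON-UK-L17-N1', 'LONDON-UK-L16-N2',
       'LONDON-UK-L16-N2', 'PARIS-France-L16-N2']

print(group_by_prefix(lst))
# {'LONDON-UK': {'l': ['16', '17'], 'n': ['1', '2']},
#  'PARIS-France': {'l': ['16'], 'n': ['2']}}
```

From a mapping like this, a string such as LONDON-UK-L(16|17)-N(1|2) can be assembled by joining each list with "|".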