Question

Solution provided, thanks @ekhumoro! I have a Python dictionary whose values are lists of terms:

myDict = {
    'ID_1': [r'(dog|cat[a-z+]|horse)', r'(car[a-z]+|house|apple\w)', r'(bird|tree|panda)'],
    'ID_2': [r'(horse|building|computer)', r'(panda\w|lion)'],
    'ID_3': [r'(wagon|tiger|cat\w*)'],
    'ID_4': [r'(dog)']
    }

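I want to read each list item in the values as a separate regular expression and, if it matches anything in a text, return the matched text as a key in a new dictionary, with the original keys (the IDs) as its values. So if these terms were read as regexes and used to search this string: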

"dog panda cat cats pandas car carts"

The general approach I have in mind is:

# pseudocode
for key, value in myDict.items():
    for item in value:
        if re.compile(item) matches something in the text:
            newDict[match] = [list of keys]

The expected output is:

newDict = {
    'car': ['ID_1'],
    'carts': ['ID_1'],
    'dog': ['ID_1', 'ID_4'],
    'panda': ['ID_1', 'ID_2'],
    'pandas': ['ID_1', 'ID_2'],
    'cat': ['ID_1', 'ID_3'],
    'cats': ['ID_1', 'ID_3']
    }

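Matched text should be returned as a key in newDict only if it actually matched something in the body of text. So in the output, 'carts' is listed because the regex for ID_1 matched it, and that ID is therefore listed in the output dictionary.

Solution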

import re
from collections import defaultdict

text = """
the eye of the tiger
a doggies in the manger
the cat in the hat
a kingdom for my horse
a bird in the hand
the cationic cataclysm
the pandamonious panda pandas
      """

myDict = {
    'ID_1': [r'(dog\w+|cat\w+|horse)', r'(car|house|apples)',
             r'(bird|tree|panda\w+)'],
    'ID_2': [r'(horse|building|computer)', r'(panda\w+|lion)'],
    'ID_3': [r'(wagon|tiger|cat)'],
    'ID_4': [r'(dog)'],
    }

newDict = defaultdict(list)

for key, values in myDict.items():
    for pattern in values:
        for match in re.finditer(pattern, text):
            newDict[match.group(0)].append(key)

for item in newDict.items():
    print(item)
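
One thing to watch for with a text like the one above: a pattern that hits several times (for example 'cat' inside 'cat', 'cationic' and 'cataclysm') appends the same ID once per hit. A minimal sketch, assuming repeated IDs are unwanted, collects them in sets instead. `collect_matches` is a hypothetical helper name and the shortened `sample` dictionary is only for illustration; neither is part of the original solution.

```python
import re
from collections import defaultdict


def collect_matches(patterns_by_id, text):
    """Map each matched string to the sorted list of IDs that matched it."""
    found = defaultdict(set)  # a set ignores repeated hits from the same ID
    for key, patterns in patterns_by_id.items():
        for pattern in patterns:
            for match in re.finditer(pattern, text):
                found[match.group(0)].add(key)
    return {word: sorted(ids) for word, ids in found.items()}


# Shortened, illustrative input (not the full dictionary from the question)
sample = {
    'ID_1': [r'(dog\w+|cat\w+|horse)'],
    'ID_3': [r'(wagon|tiger|cat)'],
}
result = collect_matches(sample, "the cat in the hat\nthe cationic cataclysm")
print(result)
```

With this input, 'cat' is found three times by ID_3 but that ID is recorded only once.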

Answer 1

Here is a simple script that seems to do what you want:

import re
from collections import defaultdict

text = """
the eye of the tiger
a dog in the manger
the cat in the hat
a kingdom for my horse
a bird in the hand
"""

myDict = {
    'ID_1': ['(dog|cat|horse)', '(car|house|apples)', '(bird|tree|panda)'],
    'ID_2': ['(horse|building|computer)', '(panda|lion)'],
    'ID_3': ['(wagon|tiger|cat)'],
    'ID_4': ['(dog)'],
    }

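# Maps each matched string to the list of IDs whose patterns matched it
newDict = defaultdict(list)

for key, values in myDict.items():
    for pattern in values:
        # finditer yields one match object per non-overlapping match
        for match in re.finditer(pattern, text):
            newDict[match.group(0)].append(key)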

for item in newDict.items():
    print(item)

Output:

('dog', ['ID_1', 'ID_4'])
('cat', ['ID_1', 'ID_3'])
('horse', ['ID_1', 'ID_2'])
('bird', ['ID_1'])
('tiger', ['ID_3'])
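
Note that patterns like these also match inside longer words, so 'cat' would be found in 'cationic'. If only whole words should count (an assumption on my part; the question does not say so explicitly), one option is to wrap each pattern in `\b` anchors. A rough sketch, where `whole_word` is a made-up helper name and the text and patterns are illustrative only:

```python
import re
from collections import defaultdict


def whole_word(pattern):
    """Wrap a pattern so it only matches complete words."""
    # (?:...) keeps the whole alternation inside the \b anchors
    return r'\b(?:' + pattern + r')\b'


# Illustrative data, not taken from the question
text = "the cat sat on the cationic catalog"
patterns = {'ID_1': [r'(cat\w*)'], 'ID_3': [r'(cat)']}

newDict = defaultdict(list)
for key, values in patterns.items():
    for pattern in values:
        for match in re.finditer(whole_word(pattern), text):
            newDict[match.group(0)].append(key)

print(dict(newDict))
```

Here ID_3 is credited only for the standalone word 'cat', while ID_1 also picks up 'cationic' and 'catalog' because its pattern allows trailing word characters.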

Answer 2

One approach is to convert the regexes into plain lists, e.g. with string manipulation:

In [11]: {id_: "|".join(ls).replace("(", "").replace(")", "").split("|") for id_, ls in myDict.items()}
Out[11]:
{'ID_1': ['dog',
  'cat',
  'horse',
  'car',
  'house',
  'apples',
  'bird',
  'tree',
  'panda'],
 'ID_2': ['horse', 'building', 'computer', 'panda', 'lion'],
 'ID_3': ['wagon', 'tiger', 'cat'],
 'ID_4': ['dog']}
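
The one-liner above can be unpacked into a small function. This is only a sketch, and it assumes every pattern is a plain alternation of literal words in parentheses; anything like `\w+` or a character class is not interpreted and survives as literal text in the resulting list. `flatten_patterns` is a hypothetical name.

```python
def flatten_patterns(patterns):
    """Turn ['(a|b)', '(c)'] into ['a', 'b', 'c'] by plain string handling."""
    joined = "|".join(patterns)
    stripped = joined.replace("(", "").replace(")", "")
    return stripped.split("|")


myDict = {
    'ID_1': ['(dog|cat|horse)', '(car|house|apples)', '(bird|tree|panda)'],
    'ID_4': ['(dog)'],
}
flat = {id_: flatten_patterns(ls) for id_, ls in myDict.items()}
print(flat['ID_4'])  # ['dog']

# Regex syntax is not interpreted, so it leaks through as literal text:
print(flatten_patterns([r'(panda\w+|lion)']))  # ['panda\\w+', 'lion']
```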

You can then convert this into a DataFrame:

In [12]: import pandas as pd
    ...: from collections import Counter

In [13]: pd.DataFrame({id_:Counter( "|".join(ls).replace("(", "").replace(")", "").split("|") ) for id_, ls in myDict.items()}).fillna(0).astype(int)
Out[13]:
          ID_1  ID_2  ID_3  ID_4
apples       1     0     0     0
bird         1     0     0     0
building     0     1     0     0
car          1     0     0     0
cat          1     0     1     0
computer     0     1     0     0
dog          1     0     0     1
horse        1     1     0     0
house        1     0     0     0
lion         0     1     0     0
panda        1     1     0     0
tiger        0     0     1     0
tree         1     0     0     0
wagon        0     0     1     0
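
The table is a term-by-ID membership matrix built from the pattern definitions alone; it does not look at any text. For comparison, a pandas-free sketch that holds the same information as a mapping from term to IDs (the name `terms_to_ids` is mine, and `flat` is assumed to be the flattened dictionary shown earlier in this answer):

```python
from collections import defaultdict

# Assumed input: the flattened term lists from the first step of this answer
flat = {
    'ID_1': ['dog', 'cat', 'horse', 'car', 'house', 'apples',
             'bird', 'tree', 'panda'],
    'ID_2': ['horse', 'building', 'computer', 'panda', 'lion'],
    'ID_3': ['wagon', 'tiger', 'cat'],
    'ID_4': ['dog'],
}

# Invert the mapping: each term points at every ID that lists it
terms_to_ids = defaultdict(list)
for id_, terms in flat.items():
    for term in terms:
        terms_to_ids[term].append(id_)

for term in sorted(terms_to_ids):
    print(term, terms_to_ids[term])
```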

Read dict values as regexes, return matches