Question

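I need help with re.findall. I have a list of lowercase US state names called state_names, and a string called art_str that contains 1375 laws from different states. For each law, I need to build a list of the state names it contains.

state_names = ['district of columbia', 'georgia', 'new jersey', 'illinois']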

art_str = "2.  45 d.c. reg. 9252, issue: volume 45, number 52, issue date: december 25,
1998, subject: boards, commissions, and agencies, agency: chief financial
officer office of grants management & development, district of columbia
register
... agencies, and community-based and faith-based organizations; * public
and private  ...

3.  46 d.c. reg. 408, issue: volume 46, number 3, issue date: january 15, 1999,
subject: ceremonial resolution, district of columbia register
... years of community, educational and faith-based service to the district
of columbia. ..."

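Right now I am using the following code, which gives me the correct answer for law no. 3, but not for law no. 2.

I split the laws apart so that I look at one at a time: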

import re

state_list = []
results = re.finditer(r"\n\n[0-9]+\. +", art_str)

for r in results:
    st = r.span()[1]
    sub_str = art_str[st:st+200]  # takes the first 200 characters of the law (the title, which contains the state name)
    state = re.findall(r"(?=(" + '|'.join(state_names) + r"))", sub_str)
    state_list.append(state)

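In most cases this gives me the correct answer, except when the state name is at the end of the line. Any idea how I can modify my re.findall so that I get all the state names, regardless of where they appear in the text?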

Thanks in advance, Helen

Answer 1

It looks like the problem is that your arbitrary cutoff length (200 characters) is too short.

In your failing example, law no. 2, "district of columbia" doesn't start until character 202 and doesn't end until character 221.

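However, if you try to fix the problem by choosing a cutoff length that is too large, you may end up reading state names that belong to the law after the one you want.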
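Both failure modes can be reproduced on a small example. This is only a sketch: the text, the two-name list, and the helper states_in_window are made up for illustration and are not taken from the question's data.

```python
# Hypothetical data: law 1 has a long title, law 2 a short one.
state_names = ["georgia", "illinois"]

long_law = "1.  " + "x" * 60 + " georgia"
short_law = "2.  brief title, illinois"
text = long_law + "\n\n" + short_law + "\n\n" + "3.  other, georgia"


def states_in_window(text, start, cutoff):
    """Return the state names found in a fixed-size window of the text."""
    window = text[start:start + cutoff]
    return [name for name in state_names if name in window]


# Cutoff too short: the window ends before "georgia" appears in law 1.
too_short = states_in_window(text, 0, 20)

# Cutoff too long: the window for law 2 runs on into law 3.
start_of_law_2 = len(long_law) + 2  # skip the blank-line separator
too_long = states_in_window(text, start_of_law_2, 200)

print(too_short)  # []
print(too_long)   # ['georgia', 'illinois']
```

The same cutoff cannot be right for both laws, because their lengths differ.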

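The bottom line is that you can never know whether any given fixed size limit is too short or too long. You need to devise a way to determine exactly where each law begins and ends, and split each law into its own separate string before you run the search.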

So, something like this:

import re
laws = re.split(r'\n\n(?=\d+\.\s{2})', art_str)  # one string per law
state_list = []
for law in laws:
    state_list.append([state for state in state_names if state in law])
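For reference, here is a self-contained sketch of the split-then-search approach. The art_str below is a shortened stand-in modeled on the sample in the question, with line breaks chosen for illustration, and states_per_law is a hypothetical helper name. Collapsing whitespace before searching is an extra step that is not part of the snippet above; it is assumed here so that a name wrapped across two lines still matches.

```python
import re

state_names = ["district of columbia", "georgia", "new jersey", "illinois"]

# Shortened stand-in for the real art_str (assumption: laws are separated
# by a blank line and start with "<number>." followed by two spaces).
art_str = (
    "2.  45 d.c. reg. 9252, issue date: december 25, 1998, district of columbia\n"
    "register\n"
    "\n"
    "3.  46 d.c. reg. 408, issue date: january 15, 1999, ... service to the district\n"
    "of columbia. ..."
)


def states_per_law(text, names):
    """Split the text into laws and list the state names found in each."""
    laws = re.split(r"\n\n(?=\d+\.\s{2})", text)
    result = []
    for law in laws:
        # Collapse newlines and repeated spaces so wrapped names still match.
        flat = " ".join(law.split())
        result.append([name for name in names if name in flat])
    return result


state_list = states_per_law(art_str, state_names)
print(state_list)
# [['district of columbia'], ['district of columbia']]
```

Plain substring matching can over-match when one name is contained in another (for example "virginia" inside "west virginia"), so a longer list of names may need word-boundary or longest-match handling.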
