Python: split and enumerate a string, check whether two words are within a certain distance of each other

Time: 2019-06-17 14:57:11

Tags: python string parsing text enumeration

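I'm working on a program that checks research titles for certain patterns to decide whether a title is relevant. Typically, a title is relevant if the words "access" and "care" appear within four words of each other. There might be phrases such as "access to care", "patient access", or "access to diabetes care".

Right now I have enumerated and split each string, and filtered out the rows that contain "access" and "care" along with a number, but I've been struggling to create a binary yes/no variable that says whether the two words are within 4 words of each other. For example:

string = "It is important to ensure access to care."
relevant = 'yes'

string = "It is important to ensure access to baseball tickets, but honestly I don't really care."
relevant = 'no'

Any ideas on how to approach this would be much appreciated. Here is what I have so far: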

  sentence = 'A priority area for this company is access to medical care and how we address it.'
  sentence = sentence.lower()
  sentence = sentence.split()
  for i, j in enumerate(sentence):

      if 'access' in j:
          x = 'yes'
      else:
          x = 'no'

      if 'care' in j:
          y = 'yes'
      else:
          y = 'no'   

      if x == 'yes' or y == 'yes':
          print(i, j, x, y)

4 Answers:

Answer 0 (score: 2)

You can easily avoid all of those loops:

sentence = 'A priority area for this company is access to medical care and how we address it.'
sentence = sentence.lower().split()

### if both in list
if 'access' in sentence and 'care' in sentence :
    ### take indexes
    access_position = sentence.index('access')
    care_position = sentence.index('care')
    ### check the distance between indexes
    if abs( access_position - care_position ) < 4  :
        print("found access and care in less than 4 words")

### result:
found access and care in less than 4 words 

Answer 1 (score: 1)

You can find the indexes of the two words and then compare them. Modify your code to:

sentence = 'A priority area for this company is access to medical care and how we address it.'

sentence = sentence.lower()
sentence = sentence.split()
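# None means "not found yet"; 0 would be mistaken for a real position
access_index = None
care_index = None
for i, j in enumerate(sentence):
    if 'access' in j:
        access_index = i
    if 'care' in j:
        care_index = i

# Compare only when both words were found; abs() makes the word order irrelevant
if access_index is not None and care_index is not None and abs(access_index - care_index) < 4:
    print("Less than 4 words")
else:
    print("Not within 4 words")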
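
One thing to keep in mind with the `'care' in j` test in the loop above: it is a substring check, so it matches "care." with trailing punctuation, but also tokens such as "healthcare" or "careful". Whether that is desirable depends on the titles. A small sketch of the difference follows; the sample tokens are made up for illustration:

```python
import string

# Hypothetical tokens, chosen only to show how the two tests differ
tokens = ["care", "care.", "healthcare", "careful", "scared"]

# Substring test: True for every token that contains the letters "care"
substring_hits = [t for t in tokens if "care" in t]

# Exact test after stripping leading/trailing punctuation from each token
exact_hits = [t for t in tokens if t.strip(string.punctuation) == "care"]

print(substring_hits)
print(exact_hits)
```

Here the substring test accepts all five tokens, while the exact test accepts only "care" and "care.".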

Answer 2 (score: 1)

You can do it like this:

access = sentence.index("access")
care = sentence.index("care")

if abs(care - access) <= 4:
    print("Less than or equal to 4")
else:
    print("More than 4")

Of course, adapt the code above to fit your particular case.
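
Note that `list.index` raises `ValueError` when the word is not in the list, so the snippet above stops with an error on any title that lacks either word. Below is a minimal sketch of one way to guard against that. The helper name `within_distance` and the `max_gap` parameter are made up for illustration, and like the snippet above it only looks at the first occurrence of each word:

```python
def within_distance(words, first, second, max_gap=4):
    """Return True if both words occur in `words` and their first
    occurrences are at most `max_gap` positions apart."""
    try:
        first_pos = words.index(first)
        second_pos = words.index(second)
    except ValueError:
        # list.index raises ValueError when the word is not in the list
        return False
    return abs(first_pos - second_pos) <= max_gap


words = "a priority area for this company is access to medical care".split()
print(within_distance(words, "access", "care"))
print(within_distance(words, "access", "tickets"))
```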

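Answer 3 (score: 1)

If "care" or "access" occurs more than once in the sentence, all of the answers so far consider only one of the occurrences and may sometimes fail to detect a match. Instead, you need to consider every occurrence of each word: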

sentence = "Access to tickets and access to care"
sentence = sentence.lower().split()

access_positions = [i for (i, word) in enumerate(sentence) if word == 'access']
care_positions = [i for (i, word) in enumerate(sentence) if word == 'care']

sentence_is_relevant = any(
    abs(access_i - care_i) <= 4
    for access_i in access_positions
    for care_i in care_positions
)
print("sentence_is_relevant =", sentence_is_relevant)
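
To get the yes/no variable the question asks for, the same idea could be wrapped in a function and applied to each title. This is only a sketch under a few assumptions: the function name `is_relevant` and its parameters are hypothetical, the words are extracted with a simple regular expression so that "care." counts as "care", and the two sample titles are modeled on the examples in the question.

```python
import re


def is_relevant(title, first="access", second="care", max_gap=4):
    # Lowercase, then keep runs of letters and apostrophes as words,
    # which drops punctuation such as the trailing "." in "care."
    words = re.findall(r"[a-z']+", title.lower())
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    # Relevant if any pair of occurrences is close enough
    close = any(
        abs(a - b) <= max_gap
        for a in first_positions
        for b in second_positions
    )
    return "yes" if close else "no"


titles = [
    "It is important to ensure access to care.",
    "It is important to ensure access to baseball tickets, "
    "but honestly I don't really care.",
]
for title in titles:
    print(is_relevant(title), "-", title)
```

The threshold is kept as a parameter because the answers above disagree on whether "within four words" means `< 4` or `<= 4`; pick whichever matches your definition.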