
Date: 2017-11-19 19:46:50

Tags: python string split iterator



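  • line is the string I am reading
  • punctuation_list is a predefined list (not important)
  • sentence_boundary is the boolean I am trying to use to know when to split a sentence
  • I use i, prev, and c to hold the current character, the next one, and the one after that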

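Because I am working backwards, the code looks for all the conditions under which a period is NOT a sentence boundary. It checks several cases and uses an iterator to look at the following characters. Since I am using an iterator, I decided to use recursion and pass in a smaller string each time, so that I can search through the whole string. This part works.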

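However, the goal is to split the string at the points where the punctuation IS actually a sentence boundary (that is, when none of the other cases apply). Because of my recursive function, I have run into a problem: I cannot keep track of the index I am at in the original string, so I do not know where to split the sentence. I would like to use a helper function somehow, but I do not know how to keep track of the index.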
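A minimal sketch of the index problem described above, using plain Python only. first_period is a hypothetical helper written for illustration and is not part of the code in this question. It recurses on ever smaller slices, so any position it reports is relative to the current slice rather than to the original string.

```python
def first_period(text):
    """Recurse on smaller and smaller slices until a period is found."""
    if not text:
        return None
    if text[0] == ".":
        # The period is always at position 0 of the current slice,
        # because the slice no longer knows how much was cut off before it.
        return 0
    return first_period(text[1:])


line = "Hi. Bye."
print(first_period(line))   # 0, although the period sits at index 2 of line
print(line.index("."))      # 2, but index() only ever reports the first match
```

Both calls show why the split position is lost: slicing discards the offset, and str.index always returns the first occurrence of the character it is given.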


def parse(line): #function

    sentence_boundary = True

    if (len(line) == 3):

        t = iter(line)
        i = next(t)
        prev = next(t)
        c = next(t)

        # periods followed by a digit with no intervening whitespace are not sentence boundaries
        if i == "." and (prev.isdigit()):
            print("This is a digit")
            sentence_boundary = False

        # periods followed by certain kinds of punctuation are probably not sentence boundaries
        for j in punctuation_list:
            if i == "." and (prev == j):
                print("Found a punctuation")
                sentence_boundary = False

        # periods followed by a whitespace followed by a lower case letter are not sentence boundaries
        if (i == "." and prev == " " and c.islower()):
            print("This is a lower letter")
            sentence_boundary = False

        # periods internal to a sequence of letters with no adjacent whitespace are not sentence boundaries
        if i == '.' and prev.islower() and c.islower():
            print("This is a period within a sentence")
            sentence_boundary = False

        # periods followed by a whitespace and then an uppercase letter, but preceded by any of a short list of titles are not sentence boundaries
        if c == '.' and prev.islower() and i.isupper():
            print("This is a title")
            sentence_boundary = False

        index = line.index(i)


if __name__ == "__main__":

1 Answer:

Answer 0 (score: 0)

I find your code hard to follow. prev is usually short for "previous", so using it to mean "next" makes no sense to me.


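def parse(line, index=0): # carry the current position along as a parameter
    if index >= len(line): # base case: stop at the end of the string
        return
    parse(line, index+1) # same string, one position further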
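As a rough sketch of where this idea leads, the example below keeps the whole string and looks at the neighbours of each period by index, so the split position is always known. It uses a plain loop with enumerate instead of recursion. PUNCTUATION_LIST and TITLES are hypothetical stand-ins, since the question does not show its lists, and is_sentence_boundary and split_sentences are names made up for illustration. The rules are taken from the comments in the question's code.

```python
# Hypothetical stand-ins: the question does not show the real lists.
PUNCTUATION_LIST = [",", ";", ":", ")", '"', "'"]
TITLES = ("Mr", "Mrs", "Ms", "Dr", "Prof")


def is_sentence_boundary(line, index):
    """Return True if the period at line[index] looks like a sentence end."""
    prev = line[index - 1] if index > 0 else ""
    nxt = line[index + 1] if index + 1 < len(line) else ""
    after_next = line[index + 2] if index + 2 < len(line) else ""
    words_before = line[:index].split()
    last_word = words_before[-1] if words_before else ""

    if nxt.isdigit():                          # "3.50"
        return False
    if nxt in PUNCTUATION_LIST:                # period followed by punctuation
        return False
    if nxt == " " and after_next.islower():    # ". the"
        return False
    if prev.isalpha() and nxt.isalpha():       # "e.g"
        return False
    if nxt == " " and after_next.isupper() and last_word in TITLES:
        return False                           # "Dr. Smith"
    return True


def split_sentences(line):
    """Split line after every period that is judged to be a boundary."""
    sentences = []
    start = 0
    for index, char in enumerate(line):
        if char == "." and is_sentence_boundary(line, index):
            sentences.append(line[start:index + 1].strip())
            start = index + 1
    rest = line[start:].strip()
    if rest:  # keep trailing text that has no closing period
        sentences.append(rest)
    return sentences


print(split_sentences("Dr. Smith paid 3.50 for it. He left."))
# ['Dr. Smith paid 3.50 for it.', 'He left.']
```

Because index always refers to the original string, the slice line[start:index + 1] can be taken directly, and the names prev and nxt match what they actually hold.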