Question

我正在尝试将文本分成不同的部分，但我不知道如何将其分成相等的部分，即每个部分都包含整个单词而不是单词的部分。例如：

0 Division: Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's sta
1 Division: ndard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a typ
2 Division: e specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining
3 Division:  essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum
4 Division: passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum

我希望 0 部门包含 "standard" 而不是 "sta"。

def main():
    text = "Lorem Ipsum is simply dummy text of the printing and " \
           "typesetting industry. Lorem Ipsum has been the industry's " \
           "standard dummy text ever since the 1500s, when an unknown " \
           "printer took a galley of type and scrambled it to make a type " \
           "specimen book. It has survived not only five centuries, but also " \
           "the leap into electronic typesetting, remaining essentially unchanged. " \
           "It was popularised in the 1960s with the release of Letraset sheets " \
           "containing Lorem Ipsum passages, and more recently with desktop publishing " \
           "software like Aldus PageMaker including versions of Lorem Ipsum"

    n_divisions = 5
    for i in range(n_divisions):
        print(i, "Division:", text[int((i / n_divisions) * len(text)): int(((i + 1) / n_divisions) * len(text ))])



if __name__ == '__main__':
    main()

我不想使用split()，因为我只想要整个字符串而不将其分成单词，因为我想将文本行发送到不同的进程，每个进程都会拆分收到的字符串< /p>

Answer 1

通过沿空格分割然后使用标记的分割（但将标记合并在一起）来标记化：

...
n_divisions = 5
tokens = text.split()
n_tokens = len(tokens)
for i in range(n_divisions):
    print(i, "Division:", ' '.join(tokens[i*n_tokens//n_divisions : min((i+1)*n_tokens//n_divisions,n_tokens)]))
...

输出：

0 Division: Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's
1 Division: standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled
2 Division: it to make a type specimen book. It has survived not only five centuries, but also the leap
3 Division: into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets
4 Division: containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum

将字符串分成切片但包含整个单词

1 个答案: