Split a string into slices that contain only whole words

Date: 2021-03-30 15:00:29

Tags: python

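I'm trying to divide a text into several parts, but I don't know how to make the parts roughly equal in length while keeping every word whole, so that no word is cut in the middle. For example, my current code prints: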

0 Division: Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's sta
1 Division: ndard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a typ
2 Division: e specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining
3 Division:  essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum
4 Division: passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum

I want Division 0 to contain "standard" rather than "sta".

def main():
    text = "Lorem Ipsum is simply dummy text of the printing and " \
           "typesetting industry. Lorem Ipsum has been the industry's " \
           "standard dummy text ever since the 1500s, when an unknown " \
           "printer took a galley of type and scrambled it to make a type " \
           "specimen book. It has survived not only five centuries, but also " \
           "the leap into electronic typesetting, remaining essentially unchanged. " \
           "It was popularised in the 1960s with the release of Letraset sheets " \
           "containing Lorem Ipsum passages, and more recently with desktop publishing " \
           "software like Aldus PageMaker including versions of Lorem Ipsum"

    n_divisions = 5
    for i in range(n_divisions):
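        # Slice boundaries are computed purely from character positions,
        # which is why a word can be cut in the middle.
        start = int((i / n_divisions) * len(text))
        end = int(((i + 1) / n_divisions) * len(text))
        print(i, "Division:", text[start:end])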



if __name__ == '__main__':
    main()

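I don't want to use split(), because I want to keep each part as a single string rather than a list of words: I want to send the pieces of text to different processes, and each process will split the string it receives.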

1 answer:

Answer 0 (score: -1)

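Tokenize the text by splitting on whitespace, then slice the list of tokens into equal parts and join each slice back into a single string: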

...
n_divisions = 5
tokens = text.split()
n_tokens = len(tokens)
for i in range(n_divisions):
    print(i, "Division:", ' '.join(tokens[i*n_tokens//n_divisions : min((i+1)*n_tokens//n_divisions,n_tokens)]))
...

Output:

0 Division: Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's
1 Division: standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled
2 Division: it to make a type specimen book. It has survived not only five centuries, but also the leap
3 Division: into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets
4 Division: containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum
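
Since the question asks to avoid split(), here is a minimal sketch of an alternative that works on character positions instead of tokens. It is not part of the original answer: the function name split_on_word_boundaries is hypothetical, and it assumes words are separated by single space characters. Each cut starts at the nominal equal-length position and is moved forward to the next space, so the word in progress stays in the earlier slice.

```python
def split_on_word_boundaries(text, n_divisions):
    """Cut text into n_divisions slices of roughly equal length,
    never cutting inside a word (hypothetical helper, spaces only)."""
    cuts = [0]
    for i in range(1, n_divisions):
        # Nominal cut position for equal-length slices.
        pos = i * len(text) // n_divisions
        # Never place a cut before the previous one.
        pos = max(pos, cuts[-1])
        # Move forward to the next space so the current word stays whole.
        next_space = text.find(" ", pos)
        cuts.append(len(text) if next_space == -1 else next_space)
    cuts.append(len(text))
    # Each slice after the first starts with the space it was cut at;
    # strip() removes it.
    return [text[a:b].strip() for a, b in zip(cuts, cuts[1:])]


if __name__ == "__main__":
    sample = ("Lorem Ipsum has been the industry's standard dummy text "
              "ever since the 1500s")
    for i, part in enumerate(split_on_word_boundaries(sample, 3)):
        print(i, "Division:", part)
```

The slices are only approximately equal in length, because every cut is pushed to the end of a word. If the text has fewer words than n_divisions, the trailing slices come back as empty strings.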