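I am trying to divide a text into several parts, but I don't know how to split it into roughly equal parts so that each part contains only whole words rather than fragments of words. For example: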
0 Division: Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's sta
1 Division: ndard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a typ
2 Division: e specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining
3 Division: essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum
4 Division: passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum
I want division 0 to end with "standard" rather than "sta".
def main():
    text = "Lorem Ipsum is simply dummy text of the printing and " \
           "typesetting industry. Lorem Ipsum has been the industry's " \
           "standard dummy text ever since the 1500s, when an unknown " \
           "printer took a galley of type and scrambled it to make a type " \
           "specimen book. It has survived not only five centuries, but also " \
           "the leap into electronic typesetting, remaining essentially unchanged. " \
           "It was popularised in the 1960s with the release of Letraset sheets " \
           "containing Lorem Ipsum passages, and more recently with desktop publishing " \
           "software like Aldus PageMaker including versions of Lorem Ipsum"
    n_divisions = 5
    for i in range(n_divisions):
        # Slice by character offset only, which is why words get cut in half.
        start = int((i / n_divisions) * len(text))
        end = int(((i + 1) / n_divisions) * len(text))
        print(i, "Division:", text[start:end])


if __name__ == '__main__':
    main()
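I don't want to use split(), because I want each part to stay a single string rather than a list of words: I intend to send these pieces of text to different processes, and each process will split the string it receives.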
Answer 0: (score: -1)
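Tokenize the text by splitting on whitespace, divide the tokens into groups, and join each group's tokens back into a single string: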
...
n_divisions = 5
tokens = text.split()
n_tokens = len(tokens)
for i in range(n_divisions):
    print(i, "Division:", ' '.join(tokens[i*n_tokens//n_divisions : min((i+1)*n_tokens//n_divisions, n_tokens)]))
...
Output:
0 Division: Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's
1 Division: standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled
2 Division: it to make a type specimen book. It has survived not only five centuries, but also the leap
3 Division: into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets
4 Division: containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum
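
Note that the divisions above are balanced by word count, not by character count. If the goal is to keep them close to equal in length and to avoid calling split() before handing the strings to other processes, one option is to push each character-based cut forward to the next space. The following is only a minimal sketch, assuming words are separated by single spaces; split_on_word_boundaries is a hypothetical helper name that does not appear in the question or the answer.

```python
def split_on_word_boundaries(text, n_divisions):
    """Cut text into n_divisions strings of roughly equal character length.

    Each cut is moved forward to the next space, so no word is broken.
    Hypothetical helper; assumes words are separated by single spaces.
    """
    chunks = []
    start = 0
    for i in range(1, n_divisions + 1):
        if i == n_divisions:
            # The last division takes whatever is left.
            end = len(text)
        else:
            # Ideal character offset for this cut, never before `start`.
            target = max(start, i * len(text) // n_divisions)
            # Move the cut forward to the next space (or to the end of the text).
            end = text.find(" ", target)
            if end == -1:
                end = len(text)
        chunks.append(text[start:end])
        # Skip the space that the cut landed on.
        start = min(end + 1, len(text))
    return chunks


if __name__ == "__main__":
    sample = "Lorem Ipsum is simply dummy text"
    for i, chunk in enumerate(split_on_word_boundaries(sample, 3)):
        print(i, "Division:", chunk)
    # 0 Division: Lorem Ipsum
    # 1 Division: is simply
    # 2 Division: dummy text
```

Because every cut lands on a space, each returned string can be passed to a separate process and split there. If a single word is longer than a whole division, later divisions may come back as empty strings.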