Sentence segmentation with trailing whitespace in Stanza (stanford corenlp)

Time: 2021-02-17 08:53:43

Tags: nlp stanford-nlp

I am using the Stanza library for sentence segmentation:

import stanza

stanza.download('en')  # fetch the English models (only needed once)
snlp = stanza.Pipeline(lang="en", processors='tokenize')
doc = snlp(text)  # text: the full input string
doc_sents = [sentence.text for sentence in doc.sentences]

Output:

["Arthur's Magazine (1844–1846) was an American literary periodical published in Philadelphia in the 19th century.",
 'Edited by T.S. Arthur, it featured work by Edgar A. Poe, J.H. Ingraham, Sarah Josepha Hale, Thomas G. Spear, and others.',
 'In May 1846 it was merged into "Godey\'s Lady\'s Book".',
 "First for Women is a woman's magazine published by Bauer Media Group in the USA.",
 'The magazine was started in 1989.',
 'It is based in Englewood Cliffs, New Jersey.',
 'In 2011 the circulation of the magazine was 1,310,696 copies.']

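However, this way I lose the trailing whitespace after each sentence. Is there a way to "preserve" it, similar to the behavior in spaCy with [sent.text_with_ws for sent in doc.sents]? Any workaround would also help. I need to keep the original whitespace so that indices originally computed on the full text remain valid. Desired output: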

["Arthur's Magazine (1844–1846) was an American literary periodical published in Philadelphia in the 19th century. ",
 'Edited by T.S. Arthur, it featured work by Edgar A. Poe, J.H. Ingraham, Sarah Josepha Hale, Thomas G. Spear, and others. ',
 'In May 1846 it was merged into "Godey\'s Lady\'s Book". ',
 "First for Women is a woman's magazine published by Bauer Media Group in the USA. ",
 'The magazine was started in 1989. ',
 'It is based in Englewood Cliffs, New Jersey. ',
 'In 2011 the circulation of the magazine was 1,310,696 copies.']
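One possible workaround, offered as a sketch and not as Stanza's own API for this: slice the original text by character offsets so that each sentence runs up to the start of the next one. The helper name sentences_with_ws is hypothetical. It assumes each sentence's character span is available, for example from the start_char and end_char attributes of the first and last token of each Stanza sentence; check those attributes against the Stanza version in use.

```python
from typing import List, Sequence, Tuple


def sentences_with_ws(text: str, spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Return one slice of `text` per span, extended to the start of the next span.

    `spans` holds (start, end) character offsets of each sentence, in order.
    Whitespace between two sentences stays attached to the earlier one, and the
    last slice runs to the end of `text`.
    """
    out = []
    for i, (start, _end) in enumerate(spans):
        # The next sentence's start (or the end of the text) closes this slice.
        stop = spans[i + 1][0] if i + 1 < len(spans) else len(text)
        out.append(text[start:stop])
    return out


# Assumed wiring to Stanza (not run here; relies on token offsets being present):
# spans = [(s.tokens[0].start_char, s.tokens[-1].end_char) for s in doc.sentences]
# doc_sents_ws = sentences_with_ws(text, spans)

if __name__ == "__main__":
    s1 = "The magazine was started in 1989."
    s2 = "It is based in Englewood Cliffs, New Jersey."
    demo_text = s1 + " " + s2
    demo_spans = [(demo_text.index(s), demo_text.index(s) + len(s)) for s in (s1, s2)]
    print(sentences_with_ws(demo_text, demo_spans))
```

Because every slice is taken from the original string, joining the slices reproduces the text from the first sentence's start onward, so indices computed on the full text can be mapped back by adding that start offset. Whitespace that precedes the first sentence is not included in any slice.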

0 answers:

No answers yet