Cutting a long string into paragraphs of complete sentences

Date: 2018-03-04 20:28:38

Tags: python-3.x google-translate

I have a task that involves translating very long texts (more than 50k characters) with online translation APIs (Google, Yandex, etc.). All of them impose a limit on request length, so I want to cut my text into a list of strings, each shorter than that limit, while keeping sentences intact.

For example, if I want to process this text with a limit of 300 characters:

The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government. This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java. Current versions of our software from October 2014 forward require Java 8+. (Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.) Distribution packages include components for command-line invocation, jar files, a Java API, and source code. You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages. As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.

I should get this output:

['The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.', 
'These packages are widely used in industry, academia, and government. This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java.', 
'Current versions of our software from October 2014 forward require Java 8+. (Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.)', 
'Distribution packages include components for command-line invocation, jar files, a Java API, and source code. You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages.', 
'As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.']  

What is the slickest way to do this? Is there a regexp that achieves it?

1 Answer:

Answer 0 (score: 3)

Regular expressions are not the right tool for parsing sentences out of a paragraph. You should take a look at nltk:

import nltk

# this line only needs to be run once per environment:
nltk.download('punkt') 

text = """The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government. This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java. Current versions of our software from October 2014 forward require Java 8+. (Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.) Distribution packages include components for command-line invocation, jar files, a Java API, and source code. You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages. As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages."""

sents = nltk.sent_tokenize(text)

sents
# outputs:
['The Stanford NLP Group makes some of our Natural Language Processing software available to everyone!',
 'We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government.',
 'This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis.',
 'All our supported software distributions are written in Java.',
 'Current versions of our software from October 2014 forward require Java 8+.',
 '(Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+.',
 'The Stanford Parser was first written in Java 1.1.)',
 'Distribution packages include components for command-line invocation, jar files, a Java API, and source code.',
 'You can also find us on GitHub and Maven.',
 'A number of helpful people have extended our work, with bindings or translations for other languages.',
 'As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.']
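Note that sent_tokenize keeps "...needs.These packages..." together as one sentence because the source text is missing a space after that period. If that matters, a small regex pre-pass can restore the space before tokenizing. A hedged sketch (this heuristic is not part of the original answer and can misfire on unusual abbreviations):

import re

# Heuristic: a period squeezed between a lowercase and an uppercase letter is
# almost certainly a sentence boundary missing its space; the lookarounds leave
# tokens like '.NET' and 'Java 1.1.' untouched.
fixed_text = re.sub(r'(?<=[a-z])\.(?=[A-Z])', '. ', text)
fixed_sents = nltk.sent_tokenize(fixed_text)

The examples below keep the original sents, so the outputs shown above still match.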

One way to aggregate the sentences based on cumulative length is a generator function:

Here, the function g yields a joined string whenever adding the next sentence would push the total past 300 characters (the count includes the single spaces that ' '.join inserts between sentences), and once more when the end of the iterable is reached. The function assumes that no single sentence exceeds the 300-character limit.

def g(sents):
    idx = 0          # index of the first sentence of the current chunk
    text_length = 0  # length of the current chunk once joined
    for i, s in enumerate(sents):
        # the +1 accounts for the space ' '.join inserts between sentences
        extra = len(s) + (1 if i > idx else 0)
        if text_length + extra > 300:
            yield ' '.join(sents[idx:i])
            text_length = len(s)
            idx = i
        else:
            text_length += extra
    yield ' '.join(sents[idx:])
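The helper below is a hedged addition for the case the answer explicitly excludes: a single sentence longer than the limit, which g would otherwise emit as an oversized chunk. It hard-wraps such sentences on whitespace using the standard-library textwrap module (split_long is an illustrative name, not part of the original answer):

import textwrap

def split_long(sents, limit=300):
    # Hard-wrap any sentence longer than `limit` so that g() never yields
    # an oversized chunk; textwrap.wrap breaks on whitespace, keeping
    # words intact.
    for s in sents:
        if len(s) <= limit:
            yield s
        else:
            yield from textwrap.wrap(s, width=limit)

Because g slices its argument, pass a list: g(list(split_long(sents))).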

The sentence aggregator can be called like this:

for s in g(sents):
    print(s)
# outputs:
The Stanford NLP Group makes some of our Natural Language Processing software available to everyone!
We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government.
This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java. Current versions of our software from October 2014 forward require Java 8+.
(Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.) Distribution packages include components for command-line invocation, jar files, a Java API, and source code.
You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages. As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.

Checking the length of each text segment confirms that all segments are under 300 characters:

[len(s) for s in g(sents)]
# outputs:
[100, 268, 244, 276, 289]
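Finally, to close the loop on the original task, each chunk can be sent to the translation API. A minimal sketch of the overall flow; translate_chunk is a hypothetical placeholder, not a real Google or Yandex client, so substitute whatever API call you actually use:

def translate_chunk(chunk):
    # Hypothetical placeholder: call your translation API (Google, Yandex, ...)
    # with `chunk` here and return the translated string.
    raise NotImplementedError

def translate_text(text):
    # Reuses nltk.sent_tokenize and g() from above: split into sentences,
    # group them into chunks under the length limit, translate chunk by chunk.
    sents = nltk.sent_tokenize(text)
    return ' '.join(translate_chunk(c) for c in g(sents))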