Split a paragraph into sentences

Time: 2015-08-23 15:55:19

Tags: swift

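I have a lot of text. For example:

I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this.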

Output:

I want to split a paragraph into sentences.

But, there is a problem.

My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2.

How do i split this.

This is the output I want. Can anyone guide me on how to do this in Swift?

Thanks.

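5 answers:

Answer 0 (score: 6)

Use NSLinguisticTagger. It gives you the right sentences for your input because it analyzes the text in actual linguistic terms.

Here's a rough draft (Swift 1.2; this won't compile in Swift 2.0):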

let s = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this."
var r = [Range<String.Index>]()
let t = s.linguisticTagsInRange(
    indices(s), scheme: NSLinguisticTagSchemeLexicalClass,
    options: nil, tokenRanges: &r)
var result = [String]()
let ixs = Array(enumerate(t)).filter {
    $0.1 == "SentenceTerminator"
    }.map {r[$0.0].startIndex}
var prev = s.startIndex
for ix in ixs {
    let r = prev...ix
    result.append(
        s[r].stringByTrimmingCharactersInSet(
             NSCharacterSet.whitespaceCharacterSet()))
    prev = advance(ix,1)
}

Here's a Swift 2.0 version (updated for Xcode 7 beta 6):

let s = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this."
var r = [Range<String.Index>]()
let t = s.linguisticTagsInRange(
    s.characters.indices, scheme: NSLinguisticTagSchemeLexicalClass,
    tokenRanges: &r)
var result = [String]()
let ixs = t.enumerate().filter {
    $0.1 == "SentenceTerminator"
}.map {r[$0.0].startIndex}
var prev = s.startIndex
for ix in ixs {
    let r = prev...ix
    result.append(
        s[r].stringByTrimmingCharactersInSet(
            NSCharacterSet.whitespaceCharacterSet()))
    prev = ix.advancedBy(1)
}

Here it is updated for Swift 3:

let s = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this."
var r = [Range<String.Index>]()
let t = s.linguisticTags(
    in: s.startIndex..<s.endIndex,
    scheme: NSLinguisticTagSchemeLexicalClass,
    tokenRanges: &r)
var result = [String]()
let ixs = t.enumerated().filter {
    $0.1 == "SentenceTerminator"
    }.map {r[$0.0].lowerBound}
var prev = s.startIndex
for ix in ixs {
    let r = prev...ix
    result.append(
        s[r].trimmingCharacters(
            in: NSCharacterSet.whitespaces))
    prev = s.index(after: ix)
}

result is an array of four strings, one sentence per string:

["I want to split a paragraph into sentences.", 
 "But, there is a problem.", 
 "My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2.", 
 "How do i split this."]

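For comparison, on current Swift toolchains a similar split can be sketched without tracking token ranges by hand, using Foundation's enumerateSubstrings(in:options:) with the .bySentences option. This is a different technique from the tagger-based code above, not a port of it. The helper name sentencesBySubstringEnumeration is made up for illustration, and whether tokens such as U.A.E or Jan.13 stay intact depends on Foundation's sentence detection, so check the result against your own text.

```swift
import Foundation

// Hypothetical helper: collects the sentences Foundation detects in `text`.
func sentencesBySubstringEnumeration(_ text: String) -> [String] {
    var result = [String]()
    text.enumerateSubstrings(in: text.startIndex..<text.endIndex,
                             options: .bySentences) { substring, _, _, _ in
        guard let substring = substring else { return }
        // A detected sentence may carry trailing whitespace; trim it off.
        let trimmed = substring.trimmingCharacters(in: .whitespacesAndNewlines)
        if !trimmed.isEmpty {
            result.append(trimmed)
        }
    }
    return result
}

let paragraph = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this."
for sentence in sentencesBySubstringEnumeration(paragraph) {
    print(sentence)
}
```
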
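Answer 1 (score: 0)

Here's a rough version of what I believe you're looking for: I loop through the characters looking for the combination ". " (a period followed by a space).

As the loop runs, each character is added to currentSentence (a String?). When the combination is found, currentSentence is added to sentences[sentenceNumber].

Also, two exceptions have to be caught. The first is iteration 2 of the loop, where period == index-1 holds even though no period has been seen yet. The second is the last sentence, since there is no space after its period.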

var paragraph = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do I split this."

var sentences = [String]()
var sentenceNumber = 0
var currentSentence: String? = ""

var charArray = paragraph.characters
var period = 0

for (index, char) in charArray.enumerate() {
    currentSentence! += "\(char)"
    if (char == ".") {
        period = index

        if (period == charArray.count-1) {
            sentences.append(currentSentence!)
        }
    } else if ((char == " " && period == index-1 && index != 1) || period == (charArray.count-1)) {

        sentences.append(currentSentence!)
        print(period)
        currentSentence = ""
        sentenceNumber++
    }
}
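
The loop above targets Swift 2 (enumerate(), the ++ operator). As a minimal sketch of the same idea for current Swift, assuming a sentence ends exactly where a period is followed by a space or by the end of the text, the scan can be written as below. The function name splitOnPeriodSpace is hypothetical, and the heuristic would still split wrongly after an abbreviation that ends in a period and is followed by a space.

```swift
// Hypothetical helper: splits wherever a "." is immediately followed by a space.
func splitOnPeriodSpace(_ paragraph: String) -> [String] {
    var sentences = [String]()
    var current = ""
    var previousWasPeriod = false

    for ch in paragraph {
        if ch == " " && previousWasPeriod {
            // ". " found: close the current sentence and drop the space.
            sentences.append(current)
            current = ""
        } else {
            current.append(ch)
        }
        previousWasPeriod = (ch == ".")
    }
    // The last sentence has no space after its period, so flush it here.
    if !current.isEmpty {
        sentences.append(current)
    }
    return sentences
}

let paragraph = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do I split this."
for sentence in splitOnPeriodSpace(paragraph) {
    print(sentence)
}
```

For this particular paragraph the sketch yields four sentences, because the periods inside Jan.13, U.A.E and 2.2 are followed by a letter or digit rather than a space.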

Answer 2 (score: 0)

Here is the same answer in Swift 4:

import Foundation

func splitsentance(string: String) -> [String] {
    let s = string
    var r = [Range<String.Index>]()
    // Tag every token by lexical class; r receives the range of each token.
    let t = s.linguisticTags(
        in: s.startIndex..<s.endIndex,
        scheme: NSLinguisticTagScheme.lexicalClass.rawValue,
        options: [], tokenRanges: &r)
    var result = [String]()

    // Start index of every token tagged as a sentence terminator.
    let ixs = t.enumerated().filter {
        $0.1 == "SentenceTerminator"
    }.map { r[$0.0].lowerBound }
    var prev = s.startIndex
    for ix in ixs {
        let r = prev...ix
        result.append(
            s[r].trimmingCharacters(in: CharacterSet.whitespacesAndNewlines))
        // Resume after the terminator so it is not repeated in the next sentence.
        prev = s.index(after: ix)
    }
    return result
}

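Answer 3 (score: 0)

Enumerating by linguistic tags feels like an efficient way to handle this task. We can eliminate the overhead of storing the redundant st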


Answer 4 (score: 0)

NSLinguisticTagger is deprecated. Use NLTagger instead. (iOS 12.0+, macOS 10.14+)

import NaturalLanguage

var str = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this."

func splitSentenceFrom(text: String) -> [String] {
    var result: [String] = []
    let tagger = NLTagger(tagSchemes: [.lexicalClass])
    tagger.string = text
    tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .sentence, scheme: .lexicalClass) { (tag, tokenRange) -> Bool in
        result.append(String(text[tokenRange]))
        return true
    }
    return result
}

let sentences = splitSentenceFrom(text: str)

sentences.forEach {
    print($0)
}

Output:

I want to split a paragraph into sentences. 
But, there is a problem. 
My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. 
How do i split this.

Want to exclude empty sentences and trim the whitespace? Replace the result.append line inside the closure with:

let sentence = String(text[tokenRange]).trimmingCharacters(in: .whitespacesAndNewlines)
if sentence.count > 0 {
    result.append(sentence)
}
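
If only the sentence boundaries are needed and no tags, the same NaturalLanguage framework also provides NLTokenizer, which can be asked for sentence units directly. The following is a minimal sketch under that assumption, with the trimming and empty-sentence check folded in. The function name sentencesUsingTokenizer is made up for illustration, and the exact boundaries depend on the framework's language model.

```swift
import Foundation
import NaturalLanguage

// Hypothetical helper: splits `text` into trimmed, non-empty sentences.
func sentencesUsingTokenizer(_ text: String) -> [String] {
    let tokenizer = NLTokenizer(unit: .sentence)
    tokenizer.string = text
    var result: [String] = []
    tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { tokenRange, _ in
        let sentence = text[tokenRange].trimmingCharacters(in: .whitespacesAndNewlines)
        if !sentence.isEmpty {
            result.append(sentence)
        }
        return true // keep enumerating until the end of the text
    }
    return result
}

let text = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this."
sentencesUsingTokenizer(text).forEach { print($0) }
```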