Split a string into sentences - ignoring abbreviations when splitting

Asked: 2016-01-14 08:21:20

Tags: javascript regex string

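I'm trying to split this string into sentences, but I need to handle abbreviations (which always have the fixed form x.y. and should be treated as a word):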

content = "This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool."

I tried this regex:

content.replace(/([.?!])\s+(?=[A-Za-z])/g, "$1|").split("|");

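But as you can see, the abbreviations cause a problem. Since all abbreviations have the form x.y., it should be possible to treat them as words so that the string is not split at that point.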

"This is a long string with some numbers 123.456,78 or 100.000 and e.g.", 
"some abbreviations in it, which shouldn't split the sentence."
"Sometimes there are problems, i.e.", 
"in this one.", 
"here and abbr at the end x.y..",
"cool."

The result should be:

"This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence."
"Sometimes there are problems, i.e. in this one.", 
"here and abbr at the end x.y..",
"cool."

2 Answers:

Answer 0: (score: 5)

The solution is to match and capture the abbreviations and build the replacement with a callback:

// Group 1: an x.y. abbreviation; group 2: sentence-ending punctuation
var re = /\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/g;
var str = 'This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn\'t split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.';
var result = str.replace(re, function(m, g1, g2){
  // Put abbreviations back unchanged; mark real sentence ends with "\r"
  return g1 ? g1 : g2+"\r";
});
var arr = result.split("\r");
document.body.innerHTML = "<pre>" + JSON.stringify(arr, null, 4) + "</pre>";

Regex explanation:

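  • \b(\w\.\w\.) - matches and captures into Group 1 an abbreviation (a word character, then ., then another word character and another .), starting at a word boundary
  • | - or...
  • ([.?!])\s+(?=[A-Za-z])
    • ([.?!]) - matches and captures into Group 2 one of ., ? or !
    • \s+ - matches 1 or more whitespace characters...
    • (?=[A-Za-z]) - that are followed by an ASCII letter.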
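
For reuse outside the browser, the same idea can be wrapped in a small function. This is only a sketch: the name splitSentences and the NUL separator are assumptions of this example, not part of the answer, and the separator is assumed never to occur in the input.

```javascript
// Hypothetical helper; the name and the separator are illustrative choices.
function splitSentences(text) {
  // Group 1: x.y.-style abbreviation; group 2: sentence-ending punctuation
  // followed by whitespace and an ASCII letter (checked by the lookahead).
  const re = /\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/g;
  const SEP = "\u0000"; // assumption: never appears in the input text
  return text
    .replace(re, (match, abbr, punct) => (abbr ? abbr : punct + SEP))
    .split(SEP);
}

const sample =
  "Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.";
console.log(splitSentences(sample));
// [ "Sometimes there are problems, i.e. in this one.",
//   "here and abbr at the end x.y..",
//   "cool." ]
```

Because the abbreviation alternative comes first, the period that closes e.g. or i.e. is consumed before the punctuation alternative gets a chance to match it.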

Answer 1: (score: 0)

Based on your example, I managed to achieve your goal with this expression: (?<!\..)[.?!]\s+ (example here).

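This expression looks for a period, question mark or exclamation mark, followed by whitespace, that is not preceded by a period plus one more character (which is how an abbreviation such as e.g. ends).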

You then need to replace the matches with a | character and, finally, replace each | with .\n
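
A minimal JavaScript sketch of this approach follows. It makes several assumptions: the engine supports lookbehind (added to JavaScript in ES2018, so not available everywhere when this was written), the punctuation is captured so it can be kept in the output, | never occurs in the text, and the function name is made up for illustration. It returns an array instead of inserting newlines.

```javascript
// Hypothetical helper; requires a regex engine with lookbehind support.
function splitWithLookbehind(text) {
  // Skip punctuation whose two preceding characters are "." plus any
  // character, i.e. the closing period of an x.y. abbreviation.
  const re = /(?<!\..)([.?!])\s+/g;
  // Assumption: "|" does not appear in the input text.
  return text.replace(re, "$1|").split("|");
}

const content =
  "This is a long string with some numbers 123.456,78 or 100.000 and e.g. " +
  "some abbreviations in it, which shouldn't split the sentence. " +
  "Sometimes there are problems, i.e. in this one. " +
  "here and abbr at the end x.y.. cool.";

const parts = splitWithLookbehind(content);
console.log(parts.length); // 4
console.log(parts.join("\n"));
```

Unlike the regex in the question, this version has no lookahead for a following letter, so it also splits before sentences that start with a digit or a quote.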
