Question

I have a user-entered string that can come in two different formats, each with some minor variations:

Some AB, Author C, Names DEF,(2018) The title string. T journal name, 10, 560–564
Some AB, Author C, Names DEF (2018) The title string? T journal name 10:560-564
Some AB, Author C, Names DEF et al (2018) The title string? T journal name 10:560-564
Some AB, Author C, Names DEF. The title string. T journal name 2018; 10: 560-564
Some AB, Author C, Names DEF. The title string. T journal name 2018;10:560-564

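What I need to extract is:

Authors string: Some AB, Author C, Names DEF or Some AB, Author C, Names DEF et al
Article title string: The title string or The title string?
Journal name string: T journal name
Year value: 2018
Issue value: 10
Pages: 560-564

So I have to split the string on the delimiters ., (1234), ; and :.

I don't have a working regex for this, and I don't know how to handle the two formats, which have the year value in different positions.

I started with something like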

string.split(/^\(\d+\)\s*/)

But that gives me an array, and I don't know what to do with it from there.

Answer 1

I would also suggest using a match pattern:

^([^.(]+)(?:\((\d{4})\)|\.)\s*([^?!.]*.)\s*([^0-9,]+)(\d{4})?[,; ]*([^,: ]*)[,;: ]*(\d+(?:[–-]\d+)?)

Or, as a more readable version using named capture groups ^*:

^(?<author>[^.(]+)(?:\((?<yearf1>\d{4})\)|\.)\s*(?<title>[^?!.]*.)\s*(?<journal>[^0-9,]+)(?<yearf2>\d{4})?[,; ]*(?<issue>[^,: ]*)[,;: ]*(?<pages>\d+(?:[–-]\d+)?)

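I have upvoted Schifini's answer and use the same approach of negated character classes to find the required pieces.
To distinguish the two formats, I added two optional groups for the year, one for its position in format 1 and one for its position in format 2, and wrapped the remaining pieces in capture groups of their own. The only thing left is to check whether group 2 or group 5 holds the year.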
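The final check on the two year groups can be sketched with the named-group version of the pattern. This is a minimal sketch, assuming an engine that implements ES2018 named capture groups (the footnote to this answer notes that support was not universal); `parseCitation` and the cleanup of the author capture are illustrative additions, not part of the original answer.

```javascript
// Named-group version of the pattern above (needs ES2018 named capture groups).
const citationPattern = /^(?<author>[^.(]+)(?:\((?<yearf1>\d{4})\)|\.)\s*(?<title>[^?!.]*.)\s*(?<journal>[^0-9,]+)(?<yearf2>\d{4})?[,; ]*(?<issue>[^,: ]*)[,;: ]*(?<pages>\d+(?:[–-]\d+)?)/;

// parseCitation is a hypothetical helper name used only for this sketch.
function parseCitation(line) {
  const m = citationPattern.exec(line);
  if (m === null) return null;
  const g = m.groups;
  return {
    // In format 1 the author capture can end with a comma or a space; strip it.
    author: g.author.replace(/[\s,]+$/, ""),
    // Only one of the two optional year groups is expected to be set.
    year: g.yearf1 || g.yearf2 || null,
    // The title keeps its closing punctuation, as captured by the pattern.
    title: g.title.trim(),
    journal: g.journal.trim(),
    issue: g.issue,
    pages: g.pages,
  };
}

console.log(parseCitation("Some AB, Author C, Names DEF. The title string. T journal name 2018; 10: 560-564"));
```

Because the year lands in `yearf1` for the first format and in `yearf2` for the second, a single `||` is enough to merge them into one field.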

Demo

Code sample:

const regex = /^([^.(]+)(?:\((\d{4})\)|\.)\s*([^?!.]*.)\s*([^0-9,]+)(\d{4})?[,; ]*([^,: ]*)[,;: ]*(\d+(?:[–-]\d+)?)/gm;
const str = `Some AB, Author C, Names DEF,(2018) The title string. T journal name, 10, 560–564
Some AB, Author C, Names DEF (2018) The title string? T journal name 10:560-564
Some AB, Author C, Names DEF et al (2018) The title string? T journal name 10:560-564
Some AB, Author C, Names DEF. The title string. T journal name 2018; 10: 560-564
Some AB, Author C, Names DEF. The title string. T journal name 2018;10:560-564`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    const array = {}; // declared locally to avoid creating an implicit global
    m.forEach((match, groupIndex) => {
        switch(groupIndex) {
        case 0:
            console.log(`Full match: ${match}`);
            break;
        case 1:
            array['author'] = match.trim();
            break;
        case 2:
            if(match)
                array['year'] = match;
            break;
        case 3:
            array['title'] = match.trim();
            break;
        case 4:
            array['journal'] = match.trim();
            break;
        case 5:
            if(match)
                array['year'] = match.trim();
            break;
        case 6:
            array['issue'] = match.trim();
            break;
        case 7:
            array['pages'] = match.trim();
            break;        
        default:
            console.log(`Unknown match, group ${groupIndex}: ${match}`);
        }
    });
    console.log(JSON.stringify(array));
}

^* Named capture groups are not supported in JavaScript in all major browsers. Simply remove them, or use Steve Levithan's XRegExp library, to work around this.

Answer 2

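Since there is no single specific delimiter, you will have to extract the parts you need one at a time.

For these samples, you can get the authors, the article title and the journal with the following: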

str.match(/^([^.(]*)[^ ]*([^?.]*.)([^0-9,]*)/)

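^([^.(]*) captures everything from the start until it finds a ( or a .
[^ ]* skips the year (2018) that precedes the article title.
([^?.]*.) captures the article title
and ([^0-9,]*) captures the journal name

match returns an array with four elements. The three captures are at indexes 1 to 3.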
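As a rough illustration of reading those three captures, the sketch below destructures the match result and trims the surrounding whitespace that the captures pick up. `extractTextParts` is a hypothetical name, and the trimming step is an addition that the pattern itself does not perform.

```javascript
// The pattern from this answer, unchanged.
const partsPattern = /^([^.(]*)[^ ]*([^?.]*.)([^0-9,]*)/;

// extractTextParts is a hypothetical helper name used only for this sketch.
function extractTextParts(line) {
  const m = line.match(partsPattern);
  if (m === null) return null;
  // Index 0 is the full match; indexes 1 to 3 are the captures.
  const [, author, title, journal] = m.map((s) => s.trim());
  return { author, title, journal };
}

console.log(extractTextParts("Some AB, Author C, Names DEF (2018) The title string? T journal name 10:560-564"));
```

Note that for the first sample line the author capture still ends with the comma that precedes the year, so some further cleanup may be needed depending on the input.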

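See Regex101.

Matching the numbers is doable. Try capturing them with another, separate regex. The year can be tricky, since a four-digit number could also be a page number.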
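One possible way to do this, sketched under the assumption that the input looks like the five samples in the question: treat a four-digit number as the year only when it is wrapped in parentheses or followed by a semicolon, and read the issue and pages from the end of the line. The name `extractNumbers` and both patterns are illustrative and were not part of the original answer.

```javascript
// extractNumbers is a hypothetical helper name used only for this sketch.
function extractNumbers(line) {
  // Issue and pages: "<issue> , or : <pages>" anchored at the end of the line.
  const tail = /(\d+)\s*[,:]\s*(\d+(?:[–-]\d+)?)\s*$/.exec(line);
  // Year: four digits in parentheses (format 1) or followed by ";" (format 2).
  // Requiring that context keeps four-digit page numbers from being picked up.
  const year = /\((\d{4})\)|(\d{4})\s*;/.exec(line);
  return {
    year: year ? (year[1] || year[2]) : null,
    issue: tail ? tail[1] : null,
    pages: tail ? tail[2] : null,
  };
}

console.log(extractNumbers("Some AB, Author C, Names DEF. The title string. T journal name 2018; 10: 560-564"));
```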

Answer 3

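Instead of trying to work out a complicated regex (which, IMHO, is not possible in this case), you can write a function that parses the string. Based on your sample data, it could look like this: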

var str = [
  "Some AB, Author C, Names DEF,(2018) The title string. T journal name, 10, 560–564",
  "Some AB, Author C, Names DEF (2018) The title string? T journal name 10:560-564",
  "Some AB, Author C, Names DEF et al (2018) The title string? T journal name 10:560-564",
  "Some AB, Author C, Names DEF. The title string. T journal name 2018; 10: 560-564",
  "Some AB, Author C, Names DEF. The title string. T journal name 2018;10:560-564"
];

function parse(str) {
  var result = [];
  var tmp = "";
  for (var i = 0; i < str.length; i++) {
    var c = str.charAt(i);
  	
    if(c === ",") {
      if(str.charAt(i + 1) === "(") {
          result.push(tmp.trim());
          i++;
          tmp = "";
          continue;
      }
      
      if((str.charAt(i + 1) === " ") && !isNaN(str.charAt(i + 2))) {
        result.push(tmp.trim());
        i++;
        tmp = "";
        continue;
      }
    }
    
    if((c === ".") || (c === "?") || (c === ":")) {
      if(str.charAt(i + 1) === " ") {
        result.push(tmp.trim());
        i++;
        tmp = "";
        continue;
      }
    }

    if((c === "(") || (c === ")") || (c === ";")  || (c === ":")) {
      result.push(tmp.trim());
      tmp = "";
      if(str.charAt(i + 1) === " ") {
        i++;
      }
      continue;
    }
    
    if((c === " ") && !isNaN(str.charAt(i + 1))){
      result.push(tmp.trim());
      tmp = "";
      continue;
    }
    
    tmp += c;
  }
  result.push(tmp.trim());
  
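  // In the second format the year ends up at index 3; move it to index 1
  // so that both formats return the fields in the same order.
  if(!isNaN(result[3])) {
    result = [result[0], result[3], result[1], result[2], result[4], result[5]];
  }

  return result;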
}

for(var j = 0; j < str.length; j++) {
  console.info(parse(str[j]));
}
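For the five sample lines, the function above returns the fields in the order author, year, title, journal, issue, pages. A small follow-up sketch could turn such an array into a labeled object; `toRecord` is a hypothetical helper, and the conversion of year and issue to numbers is an assumption about what the caller wants.

```javascript
// toRecord is a hypothetical helper; it assumes a six-element array in the
// order [author, year, title, journal, issue, pages].
function toRecord(fields) {
  const [author, year, title, journal, issue, pages] = fields;
  return { author, year: Number(year), title, journal, issue: Number(issue), pages };
}

// Example input written out by hand in the assumed order.
console.log(toRecord(["Some AB, Author C, Names DEF", "2018", "The title string", "T journal name", "10", "560-564"]));
```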

Split a string in two different formats into parts using different delimiters
