Question

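I want to parse a MARC record with a regular expression, returning the field as the first capture group and the value as the second capture group. Here is the regex I have so far: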

([^\n]*)

520    Anne, an eleven-year-old orphan, is sent by mistake to 
       live with a lonely, middle-aged brother and sister on a 
       Prince Edward Island farm and proceeds to make an 
       indelible impression on everyone around her. 
650  0 Shirley, Anne (Fictitious character)|vJuvenile fiction.

However, when a value wraps across several lines, as the 520 field above does, the regex no longer works.

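The next stopping point should be the 650 above. So the regex should capture everything up to a line break followed by 3 digits.

I did try ([^\n0-9]*), but that is interpreted as "match anything except a digit or a line break, in any order". I need it to stop at the exact sequence of a line break followed by 3 digits.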
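To illustrate why the character-class attempt [^\n0-9]* falls short, here is a minimal Python sketch. Python is only chosen for illustration, and the second test string (with a year in it) is made up rather than taken from the record.

```python
import re

# A negated character class is tested one character at a time, so it
# cannot express "a newline followed by three digits" as a sequence.
char_class = re.compile(r"[^\n0-9]*")

first_line = "Anne, an eleven-year-old orphan, is sent by mistake to "
wrapped = first_line + "\n       live with a lonely, middle-aged brother"
with_digit = "Published in 1908 by L. C. Page"  # hypothetical value

# Stops at the first newline, even though the value continues.
stops_at_newline = char_class.match(wrapped).group()
# Stops at the first digit, even though no new field starts there.
stops_at_digit = char_class.match(with_digit).group()

print(repr(stops_at_newline))
print(repr(stops_at_digit))
```

Either a newline or a digit on its own ends the match, which is why the wrapped 520 value gets cut off.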

Answer 1

Try this regex, as shown on regex101:

(\n[0-9]{3})[ 0-9]{4}([^\n]+(?:\n\s+[^\n]+)*)

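The second capture group, ([^\n]+(?:\n\s+[^\n]+)*), matches:

one or more non-newline characters: [^\n]+
then any number of additional lines that begin with whitespace: (?:\n\s+[^\n]+)*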
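
As a rough illustration, here is how that pattern might be applied in Python. Python itself, the parse_fields helper, and the whitespace clean-up are assumptions made for this sketch and are not part of the answer. Because the pattern requires a newline before every tag, the sample text below starts with one.

```python
import re

# Pattern from the answer: group 1 is the tag (with its leading newline),
# group 2 is the value, including continuation lines that start with whitespace.
FIELD_RE = re.compile(r"(\n[0-9]{3})[ 0-9]{4}([^\n]+(?:\n\s+[^\n]+)*)")

# Sample record from the question, with a leading newline so that the
# first field can match.
RECORD = (
    "\n520    Anne, an eleven-year-old orphan, is sent by mistake to \n"
    "       live with a lonely, middle-aged brother and sister on a \n"
    "       Prince Edward Island farm and proceeds to make an \n"
    "       indelible impression on everyone around her. \n"
    "650  0 Shirley, Anne (Fictitious character)|vJuvenile fiction.\n"
)


def parse_fields(text):
    """Hypothetical helper: return (tag, value) pairs found in text."""
    fields = []
    for match in FIELD_RE.finditer(text):
        tag = match.group(1).strip()  # drop the captured newline
        value = " ".join(match.group(2).split())  # rejoin wrapped lines
        fields.append((tag, value))
    return fields


fields = parse_fields(RECORD)
for tag, value in fields:
    print(tag, value)
```

Collapsing the whitespace with split and join is a design choice for readability; keep match.group(2) untouched if the original line breaks matter.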

Answer 2

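Add a negative lookahead so that the value stops right before a newline followed by 3 digits. There are also a few ways to shorten the regex.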

(\n\d{3})[ \d]{4}((?:(?!\n\d{3}).)*)
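
A minimal Python sketch of this pattern follows. Note that "." has to be allowed to match newlines for the value to span wrapped lines; in Python that is the re.DOTALL flag. The extract helper and the variable names are made up for this example, and the whitespace clean-up is an added convenience.

```python
import re

PATTERN = r"(\n\d{3})[ \d]{4}((?:(?!\n\d{3}).)*)"

# With re.DOTALL, "." also matches newlines, so the value runs until the
# lookahead sees a newline followed by three digits.
lookahead_re = re.compile(PATTERN, re.DOTALL)
# Without the flag, "." stops at the first newline.
single_line_re = re.compile(PATTERN)

# Sample record from the question, with a leading newline so that the
# first field can match.
sample = (
    "\n520    Anne, an eleven-year-old orphan, is sent by mistake to \n"
    "       live with a lonely, middle-aged brother and sister on a \n"
    "       Prince Edward Island farm and proceeds to make an \n"
    "       indelible impression on everyone around her. \n"
    "650  0 Shirley, Anne (Fictitious character)|vJuvenile fiction.\n"
)


def extract(regex, text):
    """Hypothetical helper: list of (tag, value) with whitespace collapsed."""
    return [
        (m.group(1).strip(), " ".join(m.group(2).split()))
        for m in regex.finditer(text)
    ]


with_dotall = extract(lookahead_re, sample)
without_dotall = extract(single_line_re, sample)

print(with_dotall[0])     # full 520 value
print(without_dotall[0])  # only the first line of the 520 value
```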

How do I parse MARC records with regex?
