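RegEx pattern to extract authorization number

Time: 2012-05-29 12:48:37

Tags: regex vba

I'm using GSkinner's RegExr tool to help come up with a pattern that can find authorization numbers in a field that contains a lot of other junk. An authorization number is a string containing letters (sometimes), digits (always), and hyphens (sometimes). In other words, it always contains a digit somewhere, but not always hyphens or letters. In addition, the authorization number can be located anywhere in the field I'm searching.

Examples of proper authorization numbers include: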

 5555834384734  ' All digits
 12110-AANM     ' Alpha plus digits, plus hyphens
 R-455545-AB-9  ' Alpha plus digits, plus multiple hyphens
 R-45-54A-AB-9  ' Alpha plus digits, plus multiple hyphens
 W892160        ' Alpha plus digits without hyphens

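Here is some sample data with extra junk, which is sometimes attached to the actual authorization number with a hyphen or with no space at all, making it look like part of the number. The junk appears in predictable forms/words: REF, CHEST, IP, AMB, OBV, and HOLD, none of which are part of the authorization number.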

 5557653700 IP
 R025257413-001
 REF 120407175
 SNK601M71016
 U0504124 AMB
 W892160
 019870270000000
 00Q926K2
 A025229563
 01615217 AMB
 12042-0148
 SNK601M71016
 12096NHP174
 12100-ACDE
 12110-AANM
 12114AD5QIP
 REF-34555
 3681869/OBV ONL

Here is the pattern I'm using:

 "\b[a-zA-Z]*[\d]+[-]*[\d]*[A-Za-z0-9]*[\b]*"

I'm still learning RegExp, so it can no doubt be improved, but it works for the cases above and not for the following:

 REFA5-208-4990IP  'Extract the string 'A5-208-4990' without REF or IP
 OBV1213110379     'Extract the string '1213110379' without the OBV
 5520849900AMB     'Extract the string '5520849900' without AMB
 5520849900CHEST   'Extract the string '5520849900' without CHEST
 5520849900-IP     'Extract the string '5520849900' without -IP
 1205310691-OBV    'Extract the string without the -OBV
 R-025257413-001   'Numbers of this form should also be allowed.
 NO PCT 93660      'If string contains the word NO anywhere, it is not a match
 HOLDA5-208-4990   'If string contains the word HOLD anywhere, it is not a match

Can anyone help?

For testing purposes, here is a Sub that creates a table containing the sample input data:

 Sub CreateTestAuth()

 Dim dbs As Database
 Set dbs = CurrentDb

 With dbs
     .Execute "CREATE TABLE tbl_test_auth " _
         & "(AUTHSTR CHAR);"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('5557653700 IP');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "(' R025257413-001');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('REF 120407175');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('SNK601M71016');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('U0504124 AMB');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('3681869/OBV ONL');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('REFA5-208-4990IP');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('5520849900AMB');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('5520849900CHEST');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('5520849900-IP');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('1205310691-OBV');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('HOLDA5-208-4990');"
     .Close
 End With
 End Sub

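5 answers:

Answer 0 (score: 1)

OK, at first I thought the extra requirements would make the regex a lot longer.
But with a positive lookahead it is actually almost the same size. This time just the regex (it needs case-insensitive matching, since the classes are written as [a-z0-9]):
\b(?=.*\d)([a-z0-9]+(?:-[a-z0-9]+)*)\b

Or broken down with comments (whitespace is ignored):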

\b                     # Word start
  (?=.*\d)             # A number has to follow somewhere after this point
  (                    # Start capture group
    [a-z0-9]+          # At least one alphanum
    (?:-[a-z0-9]+)*    # Possibly more attached with hyphen
  )                    # End capture group
\b                     # Word end

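Note, however, that not all regex engines support variable-width lookahead. I don't know about the one in VBA.

Second note: the (?=) condition is also satisfied if the digit appears after the end of the word. So in "DONT-RECOGNIZE-ME but 1-5ay-yes" the DONT-RECOGNIZE-ME part would be captured.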
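
The second caveat can be reproduced with a short sketch. It is written in Python's `re` module rather than VBA, purely for illustration. `LOOSE` is the pattern from this answer; `SCOPED` is a hypothetical variant (not from the answer) whose lookahead scans only letters, digits and hyphens, so the digit has to sit inside the token itself.

```python
import re

# Pattern from the answer: the lookahead can be satisfied by a digit
# that belongs to a later word on the same line.
LOOSE = re.compile(r"\b(?=.*\d)([a-z0-9]+(?:-[a-z0-9]+)*)\b", re.IGNORECASE)

# Hypothetical variant (an assumption, not part of the answer): the
# lookahead may only cross letters, digits and hyphens, so the digit
# must be part of the current token.
SCOPED = re.compile(
    r"\b(?=[a-z0-9-]*\d)([a-z0-9]+(?:-[a-z0-9]+)*)\b", re.IGNORECASE
)

text = "DONT-RECOGNIZE-ME but 1-5ay-yes"

loose = LOOSE.findall(text)    # every word before the digit also matches
scoped = SCOPED.findall(text)  # only the token that contains a digit

print(loose)
print(scoped)
```

With the scoped lookahead, words that merely precede a digit elsewhere in the field are no longer returned. Whether that trade-off suits the real data is for the asker to judge.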

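Answer 1 (score: 0)

The \b at the start is a problem. You also need to allow for some spaces and some dashes. Try this: "[a-zA-Z|\s|-]*[\d]+[-]*[\d]*[A-Za-z0-9]*[\b]*". Run it only against the authorization number.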
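
Two details of character classes are worth checking before relying on the pattern above. The sketch below uses Python's `re` module, not VBA. It assumes the VBA engine treats these constructs the same way, which is worth verifying there. Inside `[...]` a `|` is a literal pipe rather than alternation, and `\b` means the backspace character rather than a word boundary.

```python
import re

# Inside a character class, "|" is just another character to match.
pipe_matches = re.findall(r"[a|b]", "a|b")

# Inside a character class, "\b" is the backspace character (0x08),
# so "[\b]" never acts as a word boundary.
matches_backspace = re.fullmatch(r"[\b]", "\x08") is not None
boundary_hit = re.search(r"[\b]", "01615217 AMB")

print(pipe_matches)
print(matches_backspace, boundary_hit)
```

In this flavor the trailing `[\b]*` therefore matches zero characters on ordinary text and adds no boundary check.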

Answer 2 (score: 0)

Because of the extra filtering, I would use a two-step approach.

using System;
using System.Linq;
using System.Text.RegularExpressions;

var splitter = new Regex(@"[\t\n\r]+", RegexOptions.Multiline);
const string INPUT = @"REFA5-208-4990IP
       OBV1213110379
       5520849900AMB
       5520849900CHEST
       5520849900-IP
       1205310691-OBV
       R-025257413-001
       NO PCT 93660
       HOLDA5-208-4990";
string[] lines = splitter.Split(INPUT);

var blacklist = new[] { "NO", "HOLD" };
var ignores = new[] { "REF", "IP", "CHEST", "AMB", "OBV" };

var filtered = from line in lines
         where blacklist.All(black => line.IndexOf(black) < 0)
         select ignores.Aggregate(line, (acc, remove) => acc.Replace(remove, ""));

var authorization = new Regex(@"\b([a-z0-9]+(?:-[a-z0-9]+)*)\b", RegexOptions.IgnoreCase);
foreach (string s in filtered)
{
  Console.Write("'{0}' ==> ", s);
  var match = authorization.Match(s);
  if (match.Success)
  {
    Console.Write(match.Value);
  }
  Console.WriteLine();
}

Prints:

'A5-208-4990' ==> A5-208-4990
' 1213110379' ==> 1213110379
' 5520849900' ==> 5520849900
' 5520849900' ==> 5520849900
' 5520849900-' ==> 5520849900
' 1205310691-' ==> 1205310691
' R-025257413-001' ==> R-025257413-001

Answer 3 (score: 0)

Sometimes it's easier to keep things loose rather than sticking strictly to one way or another. :)

Try this:

1 - Add this function

Public Function RemoveJunk(ByVal inputValue As String, ParamArray junkWords() As Variant) As String
    Dim junkWord
    For Each junkWord In junkWords
        inputValue = Replace(inputValue, junkWord, "", , , vbBinaryCompare)
    Next
    RemoveJunk = inputValue
End Function

2 - Now your task is simple. See the sample below for how to use it:

Sub Sample()
    Dim theText As String
    theText = " REFA5-208-4990IP blah blah "
    theText = RemoveJunk(theText, "-REF", "REF", "-IP", "IP", "-OBV", "OBV") '<-- complete this in a similar way

    Debug.Print theText

    '' -- now apply the regexp here --


End Sub

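Completing the RemoveJunk function call is a little tricky. Put the longer junk words before the shorter ones. For example, "-OBV" should come before "OBV".

Give it a try and see if it solves your problem.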
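
The ordering rule can be enforced automatically by sorting the junk words longest first. The sketch below shows the idea in Python rather than VBA, for illustration only. The names `JUNK_WORDS`, `remove_junk` and `extract_auth`, and the `AUTH` pattern, are hypothetical and not part of this answer.

```python
import re

# Assumed junk list; the order here does not matter because it is sorted below.
JUNK_WORDS = ["REF", "IP", "OBV", "AMB", "CHEST", "-IP", "-OBV", "-REF"]

# Assumed token shape: upper-case letters/digits, optionally joined by hyphens.
AUTH = re.compile(r"[A-Z0-9]+(?:-[A-Z0-9]+)*")


def remove_junk(value, junk_words):
    """Strip junk words, longest first, using case-sensitive replacement."""
    for word in sorted(junk_words, key=len, reverse=True):
        value = value.replace(word, "")
    return value


def extract_auth(value):
    """Return the first remaining token that contains a digit, or None."""
    cleaned = remove_junk(value, JUNK_WORDS)
    for match in AUTH.finditer(cleaned):
        if any(ch.isdigit() for ch in match.group()):
            return match.group()
    return None


print(extract_auth(" REFA5-208-4990IP blah blah "))
print(extract_auth("1205310691-OBV"))
```

A plain substring replace also strips these letters from the middle of a genuine number, so this only suits data where the junk words cannot occur inside an authorization number.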

Answer 4 (score: 0)

Your sample input file (the path to this file should be passed to function<GetMatches> as inputFilePath):

5557653700 IP
R025257413-001
REF 120407175
SNK601M71016
U0504124 AMB
W892160
019870270000000
00Q926K2
A025229563
01615217 AMB
12042-0148
SNK601M71016
12096NHP174
12100-ACDE
12110-AANM
12114AD5QIP
REF-34555
3681869/OBV ONL

Here are the junk patterns, saved in a file (the path to this file should be passed to function<GetMatches> as replaceDBPath):

^REF
IP$
^OBV
AMB$
CHEST$
-OBV$
^.*(NO|HOLD).*$
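
To see what these patterns do before reading the module, here is a rough Python sketch of the replacement step alone. It is not the VBA code. It assumes multi-line, case-insensitive matching, as the module below sets up, and `strip_junk` is a hypothetical helper name.

```python
import re

# The junk patterns listed above, applied in the same order.
JUNK_PATTERNS = [
    r"^REF",
    r"IP$",
    r"^OBV",
    r"AMB$",
    r"CHEST$",
    r"-OBV$",
    r"^.*(NO|HOLD).*$",  # blanks out any line that contains NO or HOLD
]


def strip_junk(text):
    """Apply every junk pattern per line, ignoring case."""
    flags = re.MULTILINE | re.IGNORECASE
    for pattern in JUNK_PATTERNS:
        text = re.sub(pattern, "", text, flags=flags)
    return text


sample = "\n".join([
    "REFA5-208-4990IP",
    "OBV1213110379",
    "5520849900CHEST",
    "1205310691-OBV",
    "NO PCT 93660",
    "HOLDA5-208-4990",
])

cleaned = strip_junk(sample).split("\n")
print(cleaned)
```

Because NO is matched as a case-insensitive substring, any line that happens to contain those two letters together is blanked as well, so the last pattern may need tightening for real data.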

And here is the .bas module:

Option Explicit
'This example uses the following references:
'Microsoft VBScript Regular Expressions 5.5 and Microsoft Scripting Runtime

Private fso As New Scripting.FileSystemObject
Private re As New VBScript_RegExp_55.RegExp

Private Function GetJunkList(fpath$) As String()
0     On Error GoTo errHandler
1     If fso.FileExists(fpath) Then
2         Dim junkList() As String, mts As MatchCollection, mt As Match, pos&, tmp$
3         tmp = fso.OpenTextFile(fpath).ReadAll()
4         With re
5             .Global = True
6             .MultiLine = True
7             .Pattern = "[^\r\n]+"
8             Set mts = .Execute(tmp)
9             ReDim junkList(mts.Count - 1)
10            For Each mt In mts
11                junkList(pos) = mt.Value
12                pos = pos + 1
13            Next mt
14        End With
15        GetJunkList = junkList
16    Else
17        MsgBox "File not found at:" & vbCr & fpath
18    End If
19    Exit Function
errHandler:
     Dim Msg$
     With Err
         Msg = "Error '" & .Number & " " & _
        .Description & "' occurred in " & _
        "Function<GetJunkList>" & IIf(Erl <> 0, " at line " & CStr(Erl) & ".", ".")
     End With
     MsgBox Msg, vbCritical
End Function

Public Function GetMatches(replaceDBPath$, inputFilePath$) As String()
0     On Error GoTo errHandler
1     Dim junks() As String, junkPat$, tmp$, results() As String, pos&, mts As MatchCollection, mt As Match
2     junks = GetJunkList(replaceDBPath)
3     tmp = fso.OpenTextFile(inputFilePath).ReadAll
4
5     With re
6        .Global = True
7        .MultiLine = True
8        .IgnoreCase = True
9        For pos = LBound(junks) To UBound(junks)
10           junkPat = junks(pos)
11           .Pattern = junkPat
12           'replace junk with an empty string
13           tmp = .Replace(tmp, "")
14       Next pos
15
16       'trim lines [if all input data in one line]
17       .Pattern = "^[ \t]*|[ \t]*$"
18       tmp = .Replace(tmp, "")
19
20       'create array using provided pattern
21       pos = 0
22       .Pattern = "\b[a-z]*[\d]+\-*\d*[a-z0-9]*\b"
23       Set mts = .Execute(tmp)
24       ReDim results(mts.Count - 1)
25       For Each mt In mts
26           results(pos) = mt.Value
27           pos = pos + 1
28       Next mt
29    End With
30
31    GetMatches = results
32    Exit Function
errHandler:
     Dim Msg$
     With Err
         Msg = "Error '" & .Number & " " & _
        .Description & "' occurred in " & _
        "Function<GetMatches>" & IIf(Erl <> 0, " at line " & CStr(Erl) & ".", ".")
     End With
     MsgBox Msg, vbCritical
End Function

And a sample tester:

Public Sub tester()
    Dim samples() As String, s
    samples = GetMatches("C:\Documents and Settings\Cylian\Desktop\junks.lst", "C:\Documents and Settings\Cylian\Desktop\sample.txt")
    For Each s In samples
        MsgBox s
    Next
End Sub
It can be called from the Immediate window:

tester

Hope this helps.
