Python regex finds no match

Time: 2018-07-12 13:39:48

Tags: python regex

I am looking for matches in the following text string:

'<html xmlns:msdt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:mso="urn:schemas-microsoft-com:office:office">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   SN G2250-010\n  </title>\n  <!--[if gte mso 9]><xml>\n<mso:CustomDocumentProperties>\r\n<mso:Service_x0020_Note msdt:dt="string">SN</mso:Service_x0020_Note>\r\n<mso:Order msdt:dt="string">1493700.00000000</mso:Order>\r\n<mso:ContentType msdt:dt="string">Document</mso:ContentType>\r\n</mso:CustomDocumentProperties>\n</xml><![endif]-->\n </head>\n <link href="..\\..\\_format.css" rel="stylesheet" type="text/css"/>\n <body>\n  <table>\n   <tr>\n    <td>\n     <img border="0" src="SN_G2250_010//r1_logo1.gif"/>\n    </td>\n    <td align="left" width="178">\n     <img border="0" src="SN_G2250_010//r1_logo2.gif"/>\n    </td>\n    <td>\n     <div class="subtitle2">\n      <b>\n       <font color="red">\n        Life Sciences and Chemical Analysis Service Note\n       </font>\n      </b>\n     </div>\n    </td>\n   </tr>\n  </table>\n  <h2>\n   SERVICE NOTE G2250-010\n  </h2>\n  <pre>Supersedes: None\r\n \r\nINB22000 compatibility with Windows 2000 and ChemStation A.9.01\r\n\r\nSerial Numbers:\r\nUS00000000 - US99999999\r\n\r\nThe CCMode software is in general compatible with Windows 2000 and \r\nChemStation Revision A.9.01. Please see required settings!\r\n\r\nTo Be Performed By:\r\nAgilent-Qualified Personnel\r\n\r\nParts Required:\r\n\r\nNone\r\n\r\nSituation:\r\nChanges of operating software to Windows 2000 and implementation\r\nof ChemStation Rev. A.9.01 required some testing of the CCMode \r\n\r\nsoftware INB22000 / INB22002 / INB22003 and INB22004 Rev. A.03.02.\r\n\r\nSolution/Action:\r\nBefore using the Micro-plate Sampling Software INB22000 / INB22002 \r\n/ INB22003 or INB22004 Rev. A.03.02 (CCMode)  on a PC with \r\nWindows 2000 a minor change in the "Control panel" must be made. \r\nIf this change is not made some icons in the user interface will \r\nnot be represented correctly. 
The functionality itself is not \r\ninfluenced:\r\n\r\nOpen "Settings", "Control Panel", "Display", "Appearance".\r\n\r\nGo to the "Scheme" and select the choice "Windows Classic". \r\nPress "OK" and close the "Control Panel" window.Required "Regional \r\nSettings" for both WIN NT and WIN2000\r\n\r\nIn order to run and edit parameters within CC-Mode your \r\nPC must be setup in this way:\r\n\r\n- Regional settings: English (United States)\r\n- Number format (default for English (United States)) \r\n  Decimal symbol  \'.\'\r\n- Number format (default for English (United States)) \r\n  Digit grouping symbol  \',\'\r\n\r\nNotes about using WIN2000:\r\n\r\n1. The installation and operation of CCMode (A.03.0x) and \r\nPurify SW (A.01.01) on the same PC is not recommended and \r\nnot supported.\r\n\r\n2. CCMode A.03.01 has not been tested. Customers owning \r\nthis version must upgrade to A.03.02 even if the additional \r\nfeatures for preparative analysis are not needed.\r\n\r\n3. The combination CCmode A.03.0x, ChemStation A.08.0x and \r\nWindows 2000 has not been tested and is not supported.\r\n\r\n\r\n\r\nDate:\r\n3/11/02\r\n******************************************************************************\r\n\r\n*                              Information Only                             
*\r\n******************************************************************************\r\n*             Author/Entity: AG/B404                                         *\r\n*  Additional Information: None                                          
*\r\n******************************************************************************\r\n</pre>\n </body>\n</html>\n'

I defined the pattern as a raw string in Python 3.6.4:

r = r'Supersedes:?[\\r\\n ]+[\w\-\s]+[\\r\\n ]+(.*)[\\r\\n ]+Serial Numbers?:?[ \\r\\n]+.*?[ \\n\\r]\*+[\\n\\r ]+\*([A-Za-z ]+)[ \\n\\r]\*+[\\n\\r]+.*?\*+[ \\n\\r]+.*?\*\s+(?:Author[:\w\/]+ ([\.\w\/\s�]+))'

and then used it for the search:

a = re.search(r, raw_string, re.M|re.S)

This does not return any match:

a[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object is not subscriptable

even though exactly the same string and regex do match on regex101:

https://regex101.com/r/qgJMbO/1

Can anyone tell me where the problem is?

Edit:

The expected result is:

a[1] 'INB22000 compatibility with Windows 2000 and ChemStation A.9.01\r\n\r\'

a[2] 'Information Only'

a[3] 'AG/B404'

1 answer:

Answer 0 (score: 4)

Here is a solution that uses both BeautifulSoup and re:

from bs4 import BeautifulSoup as bs4
import re

docstring = '<html xmlns:msdt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:mso="urn:schemas-microsoft-com:office:office">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   SN G2250-010\n  </title>\n  <!--[if gte mso 9]><xml>\n<mso:CustomDocumentProperties>\r\n<mso:Service_x0020_Note msdt:dt="string">SN</mso:Service_x0020_Note>\r\n<mso:Order msdt:dt="string">1493700.00000000</mso:Order>\r\n<mso:ContentType msdt:dt="string">Document</mso:ContentType>\r\n</mso:CustomDocumentProperties>\n</xml><![endif]-->\n </head>\n <link href="..\\..\\_format.css" rel="stylesheet" type="text/css"/>\n <body>\n  <table>\n   <tr>\n    <td>\n     <img border="0" src="SN_G2250_010//r1_logo1.gif"/>\n    </td>\n    <td align="left" width="178">\n     <img border="0" src="SN_G2250_010//r1_logo2.gif"/>\n    </td>\n    <td>\n     <div class="subtitle2">\n      <b>\n       <font color="red">\n        Life Sciences and Chemical Analysis Service Note\n       </font>\n      </b>\n     </div>\n    </td>\n   </tr>\n  </table>\n  <h2>\n   SERVICE NOTE G2250-010\n  </h2>\n  <pre>Supersedes: None\r\n \r\nINB22000 compatibility with Windows 2000 and ChemStation A.9.01\r\n\r\nSerial Numbers:\r\nUS00000000 - US99999999\r\n\r\nThe CCMode software is in general compatible with Windows 2000 and \r\nChemStation Revision A.9.01. Please see required settings!\r\n\r\nTo Be Performed By:\r\nAgilent-Qualified Personnel\r\n\r\nParts Required:\r\n\r\nNone\r\n\r\nSituation:\r\nChanges of operating software to Windows 2000 and implementation\r\nof ChemStation Rev. A.9.01 required some testing of the CCMode \r\n\r\nsoftware INB22000 / INB22002 / INB22003 and INB22004 Rev. A.03.02.\r\n\r\nSolution/Action:\r\nBefore using the Micro-plate Sampling Software INB22000 / INB22002 \r\n/ INB22003 or INB22004 Rev. A.03.02 (CCMode)  on a PC with \r\nWindows 2000 a minor change in the "Control panel" must be made. \r\nIf this change is not made some icons in the user interface will \r\nnot be represented correctly. The functionality itself is not \r\ninfluenced:\r\n\r\nOpen "Settings", "Control Panel", "Display", "Appearance".\r\n\r\nGo to the "Scheme" and select the choice "Windows Classic". \r\nPress "OK" and close the "Control Panel" window.Required "Regional \r\nSettings" for both WIN NT and WIN2000\r\n\r\nIn order to run and edit parameters within CC-Mode your \r\nPC must be setup in this way:\r\n\r\n- Regional settings: English (United States)\r\n- Number format (default for English (United States)) \r\n  Decimal symbol  \'.\'\r\n- Number format (default for English (United States)) \r\n  Digit grouping symbol  \',\'\r\n\r\nNotes about using WIN2000:\r\n\r\n1. The installation and operation of CCMode (A.03.0x) and \r\nPurify SW (A.01.01) on the same PC is not recommended and \r\nnot supported.\r\n\r\n2. CCMode A.03.01 has not been tested. Customers owning \r\nthis version must upgrade to A.03.02 even if the additional \r\nfeatures for preparative analysis are not needed.\r\n\r\n3. The combination CCmode A.03.0x, ChemStation A.08.0x and \r\nWindows 2000 has not been tested and is not supported.\r\n\r\n\r\n\r\nDate:\r\n3/11/02\r\n******************************************************************************\r\n\r\n*                              Information Only   *\r\n******************************************************************************\r\n*             Author/Entity: AG/B404                                         *\r\n*  Additional Information: None                                          *\r\n******************************************************************************\r\n</pre>\n </body>\n</html>\n'


soup = bs4(docstring, 'lxml')

description_source = soup.find('pre')

s = description_source.text

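# Unlike the pattern in the question, this is a normal (non-raw) string literal,
# so Python turns each '\\r' and '\\n' into '\r' and '\n' before the regex
# engine sees them; the class [\r\n ] then matches real CR, LF and space.
r = 'Supersedes:?[\\r\\n ]+[\w\-\s]+[\\r\\n ]+(.*)[\\r\\n ]+Serial Numbers?:?[ \\r\\n]+.*?[ \\n\\r]\*+[\\n\\r ]+\*([A-Za-z ]+)[ \\n\\r]\*+[\\n\\r]+.*?\*+[ \\n\\r]+.*?\*\s+(?:Author[:\w\/]+ ([\.\w\/\s�]+))'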

a = re.search(r, s, re.M|re.S)

s = s.split('\r\n')

print(s[2])
print(a[2])
print(a[3])

Output:

INB22000 compatibility with Windows 2000 and ChemStation A.9.01
                          Information Only  
AG/B404
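
The difference between the question's pattern and the one above comes down to the `r` prefix. The following is a minimal sketch of that behavior on a shortened stand-in for the `<pre>` text. The names `sample`, `raw_pattern`, `plain_pattern` and `fixed_pattern` are made up for illustration and are not from the original post. It also suggests why regex101 reported a match: the text pasted there appears to be the repr of the string, which contains a literal backslash followed by `r` or `n` instead of real line breaks.

```python
import re

# Shortened stand-in for the <pre> text; it contains real CR/LF characters.
sample = "Supersedes: None\r\n \r\nSerial Numbers:\r\nUS00000000"

# Raw string: the regex engine receives [\\r\\n ], a class containing a
# literal backslash, the letters "r" and "n", and a space. No CR or LF.
raw_pattern = r"None[\\r\\n ]+Serial"

# Normal string: Python collapses "\\r" to "\r" first, so the regex engine
# receives [\r\n ], a class containing CR, LF and space.
plain_pattern = "None[\\r\\n ]+Serial"

# The same class written directly in a raw string with single backslashes.
fixed_pattern = r"None[\r\n ]+Serial"

print(re.search(raw_pattern, sample))          # no match against real CR/LF
print(re.search(plain_pattern, sample))        # matches
print(re.search(fixed_pattern, sample))        # matches

# The repr of the string holds literal backslash sequences, which is what
# the raw pattern was actually written to match.
print(re.search(raw_pattern, repr(sample)))    # matches
```

In other words, keeping the `r` prefix and writing `[\r\n ]` with single backslashes should behave like the non-raw pattern in the answer, while also avoiding the invalid-escape warnings that newer Python versions emit for sequences such as `\w` in normal string literals.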