Extracting a section of a 10-K annual report using regular expressions

Asked: 2015-07-19 17:04:09

Tags: python regex python-textprocessing

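I have been struggling with this for a few weeks. I am trying to process a company's 10-K annual report. I downloaded the file from the SEC's FTP server, and this is what the 10-K looks like. It is an HTML file, so I wrote the following code to convert it to text: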

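import re
from bs4 import BeautifulSoup

# Read the raw filing; the SEC header precedes the first <DOCUMENT> tag
with open("C:\\Users\\Downloads\\10Ks\\AbraxasPetroleum10K.txt", 'r') as actvtxt:
    txt = actvtxt.readlines()
# Keep everything from the first <DOCUMENT> line onward
ind = txt.index('<DOCUMENT>\n')
txt = txt[ind:]
x = ''.join(txt)
soup = BeautifulSoup(x, 'html.parser')
# Drop script and style elements so only visible text remains
for script in soup(["script", "style"]):
    script.extract()
text = soup.get_text()
# Strip each line, split on double spaces, and drop empty chunks
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)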

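After getting the text, I need to extract the text of "Item 7: Management's Discussion and Analysis of Financial Condition and Results of Operations".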

This link shows what I am talking about: actual 10K

I tried the following code:

substr=re.search(r'The following is a discussion of our consolidated financial condition(\s+|\w+|[#!\"#$%&\'()*+,-./:;<=>?@^_`{|}]){1,}',text)
substr.group(0)

But this only gives me the beginning of the paragraph:

     u'The following is a discussion of our consolidated financial condition, results of operations, liquidity and capital resources. This discussion excludes the operations of Blue Eagle, except our equity share of Blue Eagle'
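A possible explanation, offered as a guess rather than a confirmed diagnosis: the repeated group can only consume whitespace, word characters, and the punctuation listed in the character class, so the match ends at the first character outside that set. The class has no `[`, `]`, `\` or `~`, and a typographic apostrophe is not covered either. The sketch below first reports where the attempt stops, then shows an alternative that slices the text between section headings instead of listing allowed characters. The names `first_unmatched_char`, `extract_section`, `ITEM7_START` and `ITEM7_END` are hypothetical, and the heading patterns assume that "Item 7" and the following "Item 7A" or "Item 8" headings each start a line in the extracted text.

```python
import re

# The repeated group from the attempt above, reused to see where it stops.
ATTEMPT_GROUP = re.compile(r'(\s+|\w+|[#!\"#$%&\'()*+,-./:;<=>?@^_`{|}])+')

def first_unmatched_char(text, start=0):
    """Return (index, char) of the first character the group cannot consume, or None."""
    m = ATTEMPT_GROUP.match(text, start)
    end = m.end() if m else start
    return (end, text[end]) if end < len(text) else None

# Hypothetical heading patterns; real filings vary in spacing and punctuation.
ITEM7_START = r'^\s*Item\s+7\s*[.:]'
ITEM7_END = r'^\s*Item\s+(?:7A|8)\s*[.:]'

def extract_section(text, start_pat=ITEM7_START, end_pat=ITEM7_END):
    """Return the longest slice running from a start heading to the next end heading.

    Assumption: table-of-contents entries produce short slices, so the longest
    candidate is taken to be the section body. If no end heading follows a
    start heading, that candidate runs to the end of the text.
    """
    flags = re.IGNORECASE | re.MULTILINE
    end_re = re.compile(end_pat, flags)
    best = None
    for start in re.finditer(start_pat, text, flags):
        end = end_re.search(text, start.end())
        stop = end.start() if end else len(text)
        candidate = text[start.start():stop]
        if best is None or len(candidate) > len(best):
            best = candidate
    return best.strip() if best is not None else None
```

Calling `first_unmatched_char(text, substr.start())` on the real filing should show which character ends the match, and `extract_section(text)` returns `None` when no heading matches, which is a cue to adjust the patterns to the filing's actual wording.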

Any help is greatly appreciated.

0 answers:

No answers