Question

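I download and scrape a web page to get some data in TSV format. The TSV data is surrounded by HTML that I don't want.

I download the page's HTML and use BeautifulSoup to pull out the data I want. That leaves me with the TSV data in memory.

How can I use this in-memory TSV data with pandas? Every method I can find seems to want to read from a file or a URI rather than from data I have already scraped.

I don't want to download the text, write it to a file, and then read it back in.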

#!/usr/bin/env python2

import pandas as p
from BeautifulSoup import BeautifulSoup
import urllib2

def main():
    url = "URL"
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html)
    # pre is the tag that the data is within
    tab_sepd_vals = soup.pre.string

    # placeholder: pandas has no LOAD_CSV; this is the call I'm looking for
    data = p.LOAD_CSV(tab_sepd_vals)
    process(data)

Answer 1

If you feed the text/string version of the data into a StringIO.StringIO (or io.StringIO in Python 3.X), you can pass that object to the pandas parser. So your code becomes:

#!/usr/bin/env python2

import pandas as p
from BeautifulSoup import BeautifulSoup
import urllib2
import StringIO

def main():
    url = "URL"
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html)
    # pre is the tag that the data is within
    tab_sepd_vals = soup.pre.string

    # make the StringIO object
    tsv = StringIO.StringIO(tab_sepd_vals)

    # something like this
    data = p.read_csv(tsv, sep='\t') 

    # then what you had
    process(data)
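
For Python 3, the same idea might look like the sketch below. The sample string and its column names are made up for illustration; they stand in for whatever text comes out of the pre tag.

```python
import io

import pandas as pd

# Hypothetical stand-in for the text scraped out of the <pre> tag.
tab_sepd_vals = "name\tcount\nalpha\t1\nbeta\t2\n"

# Wrap the in-memory string in a file-like object, then parse it as TSV.
tsv = io.StringIO(tab_sepd_vals)
data = pd.read_csv(tsv, sep="\t")

print(data)
```

Nothing is written to disk here: read_csv accepts any file-like object with a read() method, so the StringIO wrapper is enough.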

Answer 2

Methods like read_csv do two things: they parse the CSV and they construct a DataFrame object. So in your case you may want to construct the DataFrame directly:

>>> import pandas as pd
>>> df = pd.DataFrame([['a', 1], ['b', 2], ['c', 3]])
>>> print(df)
   0  1
0  a  1
1  b  2
2  c  3

The constructor accepts a variety of data structures.
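
Applied to the TSV case, that could look roughly like the following sketch. It assumes the first line is a header and that no field contains a quoted tab or newline; the sample string is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the text scraped out of the <pre> tag.
tab_sepd_vals = "name\tcount\nalpha\t1\nbeta\t2\n"

# Split the text into rows, and each row into fields, by hand.
rows = [line.split("\t") for line in tab_sepd_vals.splitlines()]
header, records = rows[0], rows[1:]

# Build the DataFrame directly from the list of rows.
df = pd.DataFrame(records, columns=header)

# Unlike read_csv, no type inference happens: every value is still a string,
# so numeric columns need an explicit conversion.
df["count"] = df["count"].astype(int)

print(df)
```

The trade-off is that the splitting and type conversion are now your job, which is why handing the string to read_csv is usually less work for anything beyond simple data.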

How can I use pandas to parse CSV that has already been loaded from elsewhere?
