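Speeding up conversion of JSON to a DataFrame with heavy data manipulation

Time: 2017-05-01 22:40:36

Tags: python json pandas dataframe

I have a large amount of JSON data in the following format: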

[
    [{
        "created_at": "2017-04-28T16:52:36Z",
        "as_of": "2017-04-28T17:00:05Z",
        "trends": [{
            "url": "http://twitter.com/search?q=%23ChavezSigueCandanga",
            "query": "%23ChavezSigueCandanga",
            "tweet_volume": 44587,
            "name": "#ChavezSigueCandanga",
            "promoted_content": null
        }, {
            "url": "http://twitter.com/search?q=%2327Abr",
            "query": "%2327Abr",
            "tweet_volume": 79781,
            "name": "#27Abr",
            "promoted_content": null
        }],
        "locations": [{
            "woeid": 395277,
            "name": "Turmero"
        }]
    }],
    [{
        "created_at": "2017-04-28T16:57:35Z",
        "as_of": "2017-04-28T17:00:03Z",
        "trends": [{
            "url": "http://twitter.com/search?q=%23fyrefestival",
            "query": "%23fyrefestival",
            "tweet_volume": 141385,
            "name": "#fyrefestival",
            "promoted_content": null
        }, {
            "url": "http://twitter.com/search?q=%23HotDocs17",
            "query": "%23HotDocs17",
            "tweet_volume": null,
            "name": "#HotDocs17",
            "promoted_content": null
        }],
        "locations": [{
            "woeid": 9807,
            "name": "Vancouver"
        }]
    }]
]...

I wrote a function that reshapes it into a pandas DataFrame of the following form:

+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
|    |              name              | promoted_content |              query               | tweet_volume |                             url                              |        as_of         |      created_at      | location_name | location_woeid |
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
| 47 | #BatesMotel                    |                  | %23BatesMotel                    | 59748        | http://twitter.com/search?q=%23BatesMotel                    | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg      | 2972           |
| 48 | #AdviceForPeopleJoiningTwitter |                  | %23AdviceForPeopleJoiningTwitter | 51222        | http://twitter.com/search?q=%23AdviceForPeopleJoiningTwitter | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg      | 2972           |
| 49 | #CADTHSymp                     |                  | %23CADTHSymp                     |              | http://twitter.com/search?q=%23CADTHSymp                     | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg      | 2972           |
| 0  | #WorldPenguinDay               |                  | %23WorldPenguinDay               | 79006        | http://twitter.com/search?q=%23WorldPenguinDay               | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto       | 4118           |
| 1  | #TravelTuesday                 |                  | %23TravelTuesday                 |              | http://twitter.com/search?q=%23TravelTuesday                 | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto       | 4118           |
| 2  | #DigitalLeap                   |                  | %23DigitalLeap                   |              | http://twitter.com/search?q=%23DigitalLeap                   | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto       | 4118           |
| …  | …                              | …                | …                                | …            | …                                                            | …                    | …                    | …             | …              |
| 0  | #nusnc17                       |                  | %23nusnc17                       |              | http://twitter.com/search?q=%23nusnc17                       | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham    | 12723          |
| 1  | #WorldPenguinDay               |                  | %23WorldPenguinDay               | 79006        | http://twitter.com/search?q=%23WorldPenguinDay               | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham    | 12723          |
| 2  | #littleboyblue                 |                  | %23littleboyblue                 | 20772        | http://twitter.com/search?q=%23littleboyblue                 | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham    | 12723          |
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+

Here is the function that writes the JSON into a DataFrame:

import pandas as pd

def trends_to_dataframe(data):
    df = pd.DataFrame()

    for location in data:
        temp_df = pd.DataFrame()

        for trend in location[0]['trends']:
            temp_df = temp_df.append(pd.Series(trend), ignore_index=True)

        temp_df['as_of'] = location[0]['as_of']
        temp_df['created_at'] = location[0]['created_at']
        temp_df['location_name'] = location[0]['locations'][0]['name']
        temp_df['location_woeid'] = location[0]['locations'][0]['woeid']

        df = df.append(temp_df)

    return df

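Unfortunately, given the amount of data I have (and judging by some simple timers I ran), this will take about 4 hours to finish. Any ideas on how to speed it up?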

1 answer:

Answer 0 (score: 3):

You can speed this up by flattening the data asynchronously with concurrent.futures, then loading it all into a DataFrame in one go with from_records.

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def get_trends(location):
    trends = []
    for trend in location[0]['trends']:
        trend['as_of'] = location[0]['as_of']
        trend['created_at'] = location[0]['created_at']
        trend['location_name'] = location[0]['locations'][0]['name']
        trend['location_woeid'] = location[0]['locations'][0]['woeid']
        trends.append(trend)
    return trends

flat_data = []
with ThreadPoolExecutor() as executor:
    # Hand each location to the pool; map() yields results in input order.
    for trends in executor.map(get_trends, data):
        flat_data += trends

df = pd.DataFrame.from_records(flat_data)
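
For comparison, here is a minimal single-threaded sketch of the same flattening step. It assumes, as in the sample JSON, that every outer item wraps exactly one dict with exactly one entry under "locations". The names flatten_trends and sample are hypothetical and not part of the original answer. Much of the gain probably comes from building a plain list of dicts and constructing the DataFrame once instead of appending inside the loop, so it may be worth timing this version against the threaded one on your own data.

```python
import pandas as pd


def flatten_trends(data):
    """Return one dict per trend, tagged with its location's metadata."""
    rows = []
    for location in data:
        entry = location[0]            # each outer item wraps a single dict
        place = entry['locations'][0]  # assumption: one location per entry
        meta = {
            'as_of': entry['as_of'],
            'created_at': entry['created_at'],
            'location_name': place['name'],
            'location_woeid': place['woeid'],
        }
        for trend in entry['trends']:
            # Merge into a new dict so the input data is not mutated.
            rows.append({**trend, **meta})
    return rows


# Tiny sample in the same shape as the JSON in the question (fields trimmed).
sample = [
    [{
        'created_at': '2017-04-28T16:57:35Z',
        'as_of': '2017-04-28T17:00:03Z',
        'trends': [
            {'name': '#fyrefestival', 'tweet_volume': 141385},
            {'name': '#HotDocs17', 'tweet_volume': None},
        ],
        'locations': [{'woeid': 9807, 'name': 'Vancouver'}],
    }],
]

# Build the DataFrame once from the flat list of records.
df = pd.DataFrame.from_records(flatten_trends(sample))
```

Unlike get_trends above, this version copies each trend before adding the location fields, so the parsed JSON is left unchanged and can be reused.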