Is there an efficient way to fill date gaps in Python?

Asked: 2015-08-10 05:37:23

Tags: python-2.7 numpy pandas

I get documents from MongoDB like this:

    {
      "amount": 1200,
      "date_closed": "2012-07-02 17:00:00"
    },
    {
      "amount": 0,
      "date_closed": "2012-08-03 16:00:00"
    },
    {
      "amount": 0,
      "date_closed": "2012-08-04 20:00:00"
    },
    {
      "amount": 0,
      "date_closed": "2012-08-04 22:00:00"
    }

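I receive a timestamp from the user (1343287040, in a parameter named user_time), which corresponds to datetime.datetime(2012, 7, 26, 11, 47, 20).

This is my solution for filling the gaps.
First I create a date in the format YYYY-mm-dd 00:00:00 with the following code: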

hourly_date = str(datetime.datetime.fromtimestamp(user_time).year) + '-' + str(datetime.datetime.fromtimestamp(user_time).month) + '-' + str(datetime.datetime.fromtimestamp(user_time).day) + ' 00:00:00'
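The same midnight value can be built without string concatenation, which also keeps the month and day zero-padded. This is only a sketch, assuming Python 3; `midnight_of` is a hypothetical helper name, and the example passes UTC explicitly so the result does not depend on the machine's time zone (the question's 11:47:20 value reflects the asker's local time zone).

```python
from datetime import datetime, timezone

def midnight_of(timestamp, tz=None):
    """Return 00:00:00 on the day containing `timestamp` (local time if tz is None)."""
    moment = datetime.fromtimestamp(timestamp, tz)
    return moment.replace(hour=0, minute=0, second=0, microsecond=0)

user_time = 1343287040
start = midnight_of(user_time, timezone.utc)
# Format with zero-padded fields, matching YYYY-mm-dd 00:00:00
hourly_date = start.strftime('%Y-%m-%d %H:%M:%S')
```

Calling `midnight_of(user_time)` without a time zone mirrors the original `fromtimestamp(user_time)` behaviour.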

user_time is the start date. Next I generate hourly records from user_time until today. The following code generates the hourly date range in the format I want:

date_range = pandas.date_range(start=hourly_date, end=datetime.datetime.today(), freq='H')
date_range = date_range.values.astype('<M8[h]').astype(str)
hourly_date = []
for i_hourly in date_range:
    tmp_date = pandas.to_datetime(str(i_hourly)).strftime('%Y-%m-%d %H:00:00')
    hourly_date.append(tmp_date)
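The per-element loop above can likely be replaced by formatting the whole index at once with `DatetimeIndex.strftime`. A minimal sketch, where `hourly_strings` is a hypothetical helper name and `pd.offsets.Hour()` is used instead of a frequency string only to avoid differences between pandas versions:

```python
import pandas as pd

def hourly_strings(start, end):
    """Return every full hour from start to end as 'YYYY-mm-dd HH:00:00' strings."""
    hours = pd.date_range(start=start, end=end, freq=pd.offsets.Hour())
    # strftime is applied to the whole index in one call
    return hours.strftime('%Y-%m-%d %H:00:00').tolist()

sample = hourly_strings('2012-07-26 00:00:00', '2012-07-26 03:30:00')
```

In the question's setting, `end` would be `datetime.datetime.today()`.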

After creating the hourly template date range from user_time until today, I compare it with the date_closed field returned from MongoDB:

records_len = len(records)
for i_hourly in hourly_date:
    i = 0
    for record in records:
        i += 1
        if i_hourly in record['date_closed']:
            break  # break from innermost loop

        elif records_len == i and i_hourly not in record['date_closed']:
            records.append({"amount": 0, "date_closed": i_hourly})

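records contains many entries, from 2012 until today. What I am trying to find out is which dates and hours are missing from the returned documents. If an hour is missing, it needs to be added to records to fill the gap; otherwise I break out of the innermost loop.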

This code takes about 57 seconds! That is a huge amount of time. Is there a better way to fill hourly date gaps?
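A likely reason for the slowness is that every hour is compared against every record, and the records list itself keeps growing as gaps are appended. One way to avoid the nested loop in plain Python is a set lookup. This is a sketch, assuming every date_closed string already has the exact '%Y-%m-%d %H:00:00' format so that equality can replace the substring test; `fill_missing_hours` is a hypothetical name.

```python
def fill_missing_hours(records, hourly_dates):
    """Return records plus a zero-amount entry for each hour that has no record."""
    # Build the set once so each membership test is a constant-time lookup
    seen = {record["date_closed"] for record in records}
    missing = [
        {"amount": 0, "date_closed": hour}
        for hour in hourly_dates
        if hour not in seen
    ]
    return records + missing

example = fill_missing_hours(
    [{"amount": 1200, "date_closed": "2012-07-02 17:00:00"}],
    ["2012-07-02 16:00:00", "2012-07-02 17:00:00", "2012-07-02 18:00:00"],
)
```

The result is not sorted by date; sorting on date_closed afterwards would be a separate step.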

Edit:

     amount    date_closed
0      21800 2015-07-21 10:00:00
1       5450 2015-07-05 04:00:00
2     571160 2015-06-22 12:00:00
3      65400 2015-06-15 12:00:00
4      10900 2015-06-15 09:00:00
5     109000 2015-06-14 07:00:00
6     109000 2015-06-14 04:00:00
7    1193550 2015-06-11 06:00:00
8      10900 2015-06-11 05:00:00
9      21800 2015-06-09 10:00:00
10     10900 2015-05-31 05:00:00
11         0 2015-05-30 09:00:00
12    114450 2015-05-19 13:00:00
13    261600 2015-05-19 08:00:00
14    108000 2015-05-11 08:00:00
15      2180 2015-05-11 07:00:00
16    344870 2015-05-05 13:00:00
17     70850 2015-05-05 12:00:00
18      5450 2015-05-05 05:00:00
19    109000 2015-05-03 12:00:00
20    327000 2015-05-03 11:00:00
21    310650 2015-04-30 05:00:00
22     38150 2015-04-28 13:00:00
23     26160 2015-04-27 07:00:00
24    109000 2015-04-22 12:00:00
25     97200 2015-03-09 08:00:00
26     21800 2015-07-11 05:00:00
27     26160 2015-05-20 05:00:00
28     37800 2015-03-03 07:00:00
29    130800 2015-06-29 06:00:00
..       ...                 ...
161     2180 2015-05-25 09:00:00
162    26160 2015-05-09 11:00:00
163   108000 2015-03-03 11:00:00
164  3337200 2014-09-13 05:00:00
165  5249880 2014-09-10 05:00:00
166   712800 2014-08-10 09:00:00
167   151200 2015-02-23 06:00:00
168    48600 2014-08-10 11:00:00
169     6540 2015-04-19 10:00:00
170   172800 2014-09-01 09:00:00
171  1370520 2014-10-15 09:00:00
172   421200 2014-07-26 09:00:00
173    86400 2015-03-01 12:00:00
174   118800 2015-02-21 12:00:00
175    97200 2014-09-17 07:00:00
176    54500 2015-04-23 07:00:00
177  1185840 2014-09-09 06:00:00
178   119016 2015-02-18 09:00:00
179    32400 2014-11-05 08:00:00
180   345600 2014-08-09 10:00:00
181   151200 2015-02-18 12:00:00
182   168480 2014-10-09 06:00:00
183  5668920 2014-10-04 21:00:00
184   669600 2014-08-06 12:00:00
185   194400 2014-08-02 07:00:00
186   313920 2015-06-23 08:00:00
187     6540 2015-05-04 09:00:00
188   669600 2014-07-23 10:00:00
189    64800 2015-01-22 06:00:00
190   669600 2014-08-25 04:00:00
[191 rows x 2 columns]

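It shows that I only have 191 records, the ones returned from Mongo! I expected to see an hourly generated list of about 121,000 records, 191 of which would be filled in from the data above.

I think the problem is that the two lists are not being merged.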

1 Answer:

Answer 0: (score: 1)

You can first set the date_closed column as the index, then call .reindex with hourly_date_rng to fill in the missing records.

Here is an example.

import json
import pandas as pd

json_data = [
    {
      "amount": 0,
      "date_closed": "2012-08-04 16:00:00"
    },
    {
      "amount": 0,
      "date_closed": "2012-08-04 20:00:00"
    },
    {
      "amount": 0,
      "date_closed": "2012-08-04 22:00:00"
    }
]

df = pd.read_json(json.dumps(json_data), orient='records')
df

   amount          date_closed
0       0  2012-08-04 16:00:00
1       0  2012-08-04 20:00:00
2       0  2012-08-04 22:00:00

hourly_date_rng looks like this:

hourly_date_rng = pd.date_range(start='2012-08-04 12:00:00', end='2012-08-04 23:00:00', freq='H')
hourly_date_rng.name = 'date_closed'

hourly_date_rng

DatetimeIndex(['2012-08-04 12:00:00', '2012-08-04 13:00:00',
               '2012-08-04 14:00:00', '2012-08-04 15:00:00',
               '2012-08-04 16:00:00', '2012-08-04 17:00:00',
               '2012-08-04 18:00:00', '2012-08-04 19:00:00',
               '2012-08-04 20:00:00', '2012-08-04 21:00:00',
               '2012-08-04 22:00:00', '2012-08-04 23:00:00'],
              dtype='datetime64[ns]', name='date_closed', freq='H', tz=None)

Align the index and fill the gaps:

# make the column datetime object instead of string
df['date_closed'] = pd.to_datetime(df['date_closed'])
# align the index using .reindex
df.set_index('date_closed').reindex(hourly_date_rng).fillna(0).reset_index()

           date_closed  amount
0  2012-08-04 12:00:00       0
1  2012-08-04 13:00:00       0
2  2012-08-04 14:00:00       0
3  2012-08-04 15:00:00       0
4  2012-08-04 16:00:00       0
5  2012-08-04 17:00:00       0
6  2012-08-04 18:00:00       0
7  2012-08-04 19:00:00       0
8  2012-08-04 20:00:00       0
9  2012-08-04 21:00:00       0
10 2012-08-04 22:00:00       0
11 2012-08-04 23:00:00       0
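Two caveats apply to the reindex step: it raises an error if the index contains duplicate timestamps, and it drops rows whose timestamp is not exactly on the hourly grid. If several documents can fall within the same hour, one possible variant is to floor to the hour and sum before reindexing. This is a sketch under that assumption; `fill_hourly` is a hypothetical name, and summing is only one plausible way to combine amounts.

```python
import pandas as pd

def fill_hourly(records, start, end):
    """Return one row per hour in [start, end], summing the amounts within each hour."""
    hour = pd.offsets.Hour()
    df = pd.DataFrame(records, columns=["amount", "date_closed"])
    # Parse the strings and round each timestamp down to the start of its hour
    df["date_closed"] = pd.to_datetime(df["date_closed"]).dt.floor(hour)
    per_hour = df.groupby("date_closed")["amount"].sum()
    grid = pd.date_range(start=start, end=end, freq=hour, name="date_closed")
    # fill_value=0 fills the gaps directly and keeps the integer dtype
    return per_hour.reindex(grid, fill_value=0).reset_index()

sample = fill_hourly(
    [
        {"amount": 5, "date_closed": "2012-08-04 20:00:00"},
        {"amount": 7, "date_closed": "2012-08-04 20:30:00"},
        {"amount": 3, "date_closed": "2012-08-04 22:00:00"},
    ],
    "2012-08-04 19:00:00",
    "2012-08-04 23:00:00",
)
```

Passing `fill_value=0` to reindex avoids the intermediate NaN values, so amount is not converted to float.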

Edit:

Convert the result back to JSON.

result = df.set_index('date_closed').reindex(hourly_date_rng).fillna(0).reset_index()

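# convert the date_closed column to strings first
# (DatetimeIndex.to_native_types() is no longer available in recent pandas releases; .dt.strftime gives the same strings)
result['date_closed'] = result['date_closed'].dt.strftime('%Y-%m-%d %H:%M:%S')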
# to json function
json_result = result.to_json(orient='records')

# print out the data with pretty print
from pprint import pprint
pprint(json.loads(json_result))


[{'amount': 0.0, 'date_closed': '2012-08-04 12:00:00'},
 {'amount': 0.0, 'date_closed': '2012-08-04 13:00:00'},
 {'amount': 0.0, 'date_closed': '2012-08-04 14:00:00'},
 {'amount': 0.0, 'date_closed': '2012-08-04 15:00:00'},
 {'amount': 0.0, 'date_closed': '2012-08-04 16:00:00'},
 {'amount': 0.0, 'date_closed': '2012-08-04 17:00:00'},
 {'amount': 0.0, 'date_closed': '2012-08-04 18:00:00'},
 {'amount': 0.0, 'date_closed': '2012-08-04 19:00:00'},
 {'amount': 0.0, 'date_closed': '2012-08-04 20:00:00'},
 {'amount': 0.0, 'date_closed': '2012-08-04 21:00:00'},
 {'amount': 0.0, 'date_closed': '2012-08-04 22:00:00'},
 {'amount': 0.0, 'date_closed': '2012-08-04 23:00:00'}]
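If the goal is a Python list of dicts like the original MongoDB documents, rather than a JSON string, the round trip through to_json and json.loads can probably be skipped. A minimal sketch, assuming date_closed is still a datetime column and amount has no missing values left; `to_records` is a hypothetical name.

```python
import pandas as pd

def to_records(df):
    """Return rows as dicts with string dates and integer amounts."""
    out = df.copy()
    out["date_closed"] = out["date_closed"].dt.strftime("%Y-%m-%d %H:%M:%S")
    # fillna(0) after reindex leaves floats; cast back to integers
    out["amount"] = out["amount"].astype(int)
    return out.to_dict(orient="records")

frame = pd.DataFrame({
    "date_closed": pd.to_datetime(["2012-08-04 12:00:00", "2012-08-04 13:00:00"]),
    "amount": [0.0, 5.0],
})
rows = to_records(frame)
```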