How to get the last date in a custom interval? - pandas

Time: 2017-10-21 04:04:27

Tags: python pandas time-series

My example has grown larger; here is my code:

import io
import pandas as pd

t = """name date
a 2005-08-31
a 2005-09-20
a 2005-11-12
a 2005-12-31
a 2006-03-31
a 2006-06-25
a 2006-07-23
a 2006-09-28
a 2006-12-21
a 2006-12-27
a 2007-07-23
a 2007-09-21
a 2007-03-15
a 2008-04-12
a 2008-06-21
a 2008-06-11
b 2005-08-31
b 2005-09-23
b 2005-11-12
b 2005-12-31
b 2006-03-31
b 2006-06-25
b 2006-07-23
b 2006-09-28
b 2006-12-21
b 2006-12-27
b 2007-07-23
b 2007-09-21
b 2007-03-15
b 2008-04-12
b 2008-06-21
b 2008-06-11
"""
# The columns are separated by runs of spaces, so split on whitespace.
data = pd.read_csv(io.StringIO(t), sep=r"\s+")
data

What I want to do is find the last date in each year-long period, where the first period begins on 2005-7-1 and ends on 2006-06-30, the next begins on 2006-7-1 and ends on 2007-6-30, and so on. My expected output is:

name date
a 2006-06-25  # the last date in 2005/07/01 - 2006/06/30
a 2007-03-15  # the last date in 2006/07/01 - 2007/06/30
a 2008-06-21  # the last date in 2007/07/01 - 2008/06/30
b 2006-06-25  # the last date in 2005/07/01 - 2006/06/30
b 2007-03-15  # the last date in 2006/07/01 - 2007/06/30
b 2008-06-21  # the last date in 2007/07/01 - 2008/06/30

How can I solve this problem? I think I should use ...

4 answers:

Answer 0 (score: 5)

You can do this with a single groupby, with no need for rollback:

In [11]: data.date = pd.to_datetime(data.date, format="%Y-%m-%d")

In [12]: data.groupby(["name", pd.Grouper(key="date", freq="AS-JUL")])["date"].max()
Out[12]:
name  date
a     2005-07-01   2006-06-25
      2006-07-01   2007-03-15
      2007-07-01   2008-06-21
b     2005-07-01   2006-06-25
      2006-07-01   2007-03-15
      2007-07-01   2008-06-21
Name: date, dtype: datetime64[ns]

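Answer 1 (score: 4)

Well, this seems like a bit of a magic way! The frequency is "AS-JUL" (an annual-start frequency, with the year starting in July).

First we'll snap each date to the beginning of its month (you have some bad dates in there, so let's ignore them), but the key point is that we need the column as datetimes rather than strings: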

In [11]: pd.to_datetime(data.date.str[:7], format="%Y-%m")  # to beginning of month
Out[11]:
0    2005-08-01
1    2005-09-01
2    2005-11-01
3    2005-12-01
...

In [12]: df = data.assign(date=pd.to_datetime(data.date.str[:7], format="%Y-%m"))

Now comes the magic:

In [13]: from pandas.tseries.frequencies import to_offset

In [14]: df.date.map(to_offset("AS-JUL").rollback)
Out[14]:
0    2005-07-01
1    2005-07-01
2    2005-07-01
3    2005-07-01
4    2005-07-01
5    2005-07-01
6    2006-07-01
7    2006-07-01
8    2006-07-01
9    2006-07-01
10   2007-07-01
11   2007-07-01
12   2006-07-01
13   2007-07-01
14   2007-07-01
15   2007-07-01
16   2005-07-01
17   2005-07-01
18   2005-07-01
19   2005-07-01
20   2005-07-01
21   2005-07-01
22   2006-07-01
23   2006-07-01
24   2006-07-01
25   2006-07-01
26   2007-07-01
27   2007-07-01
28   2006-07-01
29   2007-07-01
30   2007-07-01
31   2007-07-01
Name: date, dtype: datetime64[ns]

We create an offset from "AS-JUL" and roll each date back with it (that is, floor it). Note: for whatever reason, we can't use dt.floor here...

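OK, I misread that part: you want the latest recorded date for each name in each period. With the dates fixed up, the last step is just a groupby: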

In [21]: data.date = pd.to_datetime(data.date, format="%Y-%m-%d")

In [22]: data["period_start"] = data.date.map(to_offset("AS-JUL").rollback).dt.normalize()

In [23]: data.groupby(["name", "period_start"])["date"].max()
Out[23]:
name  period_start
a     2005-07-01     2006-06-25
      2006-07-01     2007-03-15
      2007-07-01     2008-06-21
b     2005-07-01     2006-06-25
      2006-07-01     2007-03-15
      2007-07-01     2008-06-21
Name: date, dtype: datetime64[ns]
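
As a cross-check that does not rely on frequency aliases, the period start can also be computed with plain year and month arithmetic. This is a swapped-in alternative, not the answerer's method. The `period_start` column name follows the answer, `start_year` is a hypothetical helper, and the rows are a subset of the question's data.

```python
import pandas as pd

data = pd.DataFrame({
    "name": ["a", "a", "a", "a", "b", "b"],
    "date": pd.to_datetime([
        "2005-08-31", "2006-06-25", "2006-07-23",
        "2007-03-15", "2005-09-23", "2006-06-25",
    ]),
})

# Dates in January-June belong to the period that began the previous July,
# so subtract one from the calendar year for those months.
start_year = data["date"].dt.year - (data["date"].dt.month < 7).astype(int)
data["period_start"] = pd.to_datetime(start_year.astype(str) + "-07-01")

last_dates = data.groupby(["name", "period_start"])["date"].max()
print(last_dates)
```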

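Answer 2 (score: 3)

Building on the beautiful function to_offset that @Andy suggested, we can do:

from pandas.tseries.frequencies import to_offset
offset = to_offset("AS-JUL")
# d + offset moves each date forward to the next 1 July, which labels its period
# (the date column must already be datetime)
new = data.groupby('name').apply(lambda x: x.groupby(x['date'].map(lambda d: d + offset)).max())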
             name       date
name date                      
a    2006-07-01    a 2006-06-25
     2007-07-01    a 2007-03-15
     2008-07-01    a 2008-06-21
b    2006-07-01    b 2006-06-25
     2007-07-01    b 2007-03-15
     2008-07-01    b 2008-06-21

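Answer 3 (score: 3)

Using IntervalIndex (DF is your DataFrame):

# Yearly intervals 2005-07-01 to 2006-06-30, 2006-07-01 to 2007-06-30, ... (both ends included)
idx = pd.IntervalIndex.from_arrays(pd.date_range(start='2005-07-01', freq='12MS', periods=12), pd.date_range(start='2006-06-30', freq='12M', periods=12), closed='both')
# G numbers the intervals 0, 1, 2, ...
df = pd.DataFrame({'G': list(range(len(idx)))}, index=idx)
DF.date = pd.to_datetime(DF.date)
# Look up the interval number each date falls into
DF['G'] = df.loc[DF.date].values
# Keep the last date for each name and interval
DF.sort_values(['name', 'date']).drop_duplicates(['name', 'G'], keep='last')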

Out[19]: 
   name       date  G
5     a 2006-06-25  0
12    a 2007-03-15  1
14    a 2008-06-21  2
21    b 2006-06-25  0
28    b 2007-03-15  1
30    b 2008-06-21  2