Question

Below is a sample of my df:

date                   value

0006-03-01 00:00:00    1   
0006-03-15 00:00:00    2   
0006-05-15 00:00:00    1   
0006-07-01 00:00:00    3   
0006-11-01 00:00:00    1   
2009-05-20 00:00:00    2   
2009-05-25 00:00:00    8   
2020-06-24 00:00:00    1   
2020-06-30 00:00:00    2   
2020-07-01 00:00:00    13  
2020-07-15 00:00:00    2   
2020-08-01 00:00:00    4   
2020-10-01 00:00:00    2   
2020-11-01 00:00:00    4    
2023-04-01 00:00:00    1   
2218-11-12 10:00:27    1   
4000-01-01 00:00:00    6 
5492-04-15 00:00:00    1    
5496-03-15 00:00:00    1    
5589-12-01 00:00:00    1    
7199-05-15 00:00:00    1    
9186-12-30 00:00:00    1

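As you can see, the data contains some mistyped dates.

Questions:

How can I convert this column to dd.mm.yyyy format?
How can I replace the date with 01.01.2100 in rows where the year is greater than 2022?
How can I delete all rows where the year is less than 2005?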

The final output should look like this:

date                   value


20.05.2009    2   
25.05.2009     8   
24.06.2020     1   
30.06.2020     2   
01.07.2020     13  
15.07.2020     2   
01.08.2020    4   
01.10.2020    2   
01.11.2020    4    
01.01.2100    1   
01.01.2100    1      
01.01.2100    1   
01.01.2100    1   
01.01.2100    1   
01.01.2100    1      
01.01.2100    1   
01.01.2100    1

I tried converting the column with to_datetime, but it failed:

df[col] = pd.to_datetime(df[col], infer_datetime_format=True)

Out of bounds nanosecond timestamp: 5-03-01 00:00:00

Any help is appreciated!

Answer 1

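You can split the datetime string on '-', take the first element (the year), and clean or replace the row based on its integer value. For years that are too small, such as '0006', call pd.to_datetime with errors='coerce'; it leaves NaT for dates it cannot represent, and you can then remove those rows with dropna(). Example: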

import pandas as pd

df = pd.DataFrame({'date': ['0006-03-01 00:00:00',
                            '0006-03-15 00:00:00',
                            '0006-05-15 00:00:00',
                            '0006-07-01 00:00:00',
                            '0006-11-01 00:00:00',
                            'nan',
                            '2009-05-25 00:00:00',
                            '2020-06-24 00:00:00',
                            '2020-06-30 00:00:00',
                            '2020-07-01 00:00:00',
                            '2020-07-15 00:00:00',
                            '2020-08-01 00:00:00',
                            '2020-10-01 00:00:00',
                            '2020-11-01 00:00:00',
                            '2023-04-01 00:00:00',
                            '2218-11-12 10:00:27',
                            '4000-01-01 00:00:00',
                            'NaN',
                            '5496-03-15 00:00:00',
                            '5589-12-01 00:00:00',
                            '7199-05-15 00:00:00',
                            '9186-12-30 00:00:00']})

# first, drop rows where 'date' contains 'nan' (case-insensitive);
# .copy() avoids a SettingWithCopyWarning on the assignment below:
df = df.loc[~df['date'].str.contains('nan', case=False)].copy()

# now replace strings where the year is above a threshold:
df.loc[df['date'].str.split('-').str[0].astype(int) > 2022, 'date'] = '2100-01-01 00:00:00'

# convert to datetime, if year is too low, will result in NaT:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# df['date']
# 0           NaT
# 1           NaT
# 2           NaT
# 3           NaT
# 4           NaT
# 6    2009-05-25
# ...

df = df.dropna()
# df
#          date
# 6  2009-05-25
# 7  2020-06-24
# 8  2020-06-30
# 9  2020-07-01
# 10 2020-07-15
# 11 2020-08-01
# 12 2020-10-01
# 13 2020-11-01
# 14 2100-01-01
# 15 2100-01-01
# ...
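
The example above stops at a datetime column and relies on NaT to remove early years, so an in-range year such as 2001 would survive. A minimal sketch of how the same idea might be extended to cover all three points of the question follows. It assumes every entry is a well-formed 'YYYY-MM-DD HH:MM:SS' string with no missing values, and the sample rows are made up for illustration:

```python
import pandas as pd

# Illustrative data only; assumes no missing or malformed strings.
df = pd.DataFrame({
    'date': ['0006-03-01 00:00:00', '2001-05-20 00:00:00',
             '2009-05-20 00:00:00', '2020-06-24 00:00:00',
             '2023-04-01 00:00:00', '9186-12-30 00:00:00'],
    'value': [1, 2, 2, 1, 1, 1],
})

# Year as an integer, taken from the text before the first '-'.
year = df['date'].str.split('-').str[0].astype(int)

# Drop rows before 2005 explicitly (this also catches years like 2001).
keep = year >= 2005
df, year = df.loc[keep].copy(), year.loc[keep]

# Replace anything after 2022 with the placeholder date.
df.loc[year > 2022, 'date'] = '2100-01-01 00:00:00'

# Every remaining date is now within pandas' range, so parsing is safe.
df['date'] = (pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')
                .dt.strftime('%d.%m.%Y'))
print(df)
```

Note that strftime turns the column back into strings, so this step is best done last, after any date arithmetic.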

Answer 2

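The out-of-bounds error is raised because of a limitation in pandas: nanosecond-resolution timestamps only cover roughly the years 1677 to 2262 (https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html). This code removes the values that would cause the error before the dataframe is created.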

import datetime as dt

import pandas as pd

data = [[dt.datetime(year=2022, month=3, day=1), 1],
        [dt.datetime(year=2009, month=5, day=20), 2],
        [dt.datetime(year=2001, month=5, day=20), 2],
        [dt.datetime(year=2023, month=12, day=30), 3],
        [dt.datetime(year=6, month=12, day=30), 3]]
dataCleaned = [elements for elements in data if pd.Timestamp.max > elements[0] > pd.Timestamp.min]

df = pd.DataFrame(dataCleaned, columns=['date', 'Value'])
print(df)
# OUTPUT
#         date  Value
# 0 2022-03-01      1
# 1 2009-05-20      2
# 2 2001-05-20      2
# 3 2023-12-30      3

df.loc[df.date.dt.year > 2022, 'date'] = dt.datetime(year=2100, month=1, day=1)
df.drop(df.loc[df.date.dt.year < 2005, 'date'].index, inplace=True)
print(df)
# OUTPUT
#         date  Value
# 0 2022-03-01      1
# 1 2009-05-20      2
# 3 2100-01-01      3

If you still want to keep the dates that raise the out-of-bounds error, see How to work around Python Pandas DataFrame's "Out of bounds nanosecond timestamp" error?
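
One possible way to keep such dates, sketched here as an illustration rather than as the approach from the linked question, is to parse them with the standard library and store them in an object-dtype column. The sample strings are made up, and the sketch assumes they all follow the 'YYYY-MM-DD HH:MM:SS' layout:

```python
import datetime as dt

import pandas as pd

raw = ['0006-03-01 00:00:00', '2009-05-20 00:00:00', '9186-12-30 00:00:00']

# datetime.datetime supports years 1 through 9999, so nothing is lost here.
parsed = [dt.datetime.strptime(s, '%Y-%m-%d %H:%M:%S') for s in raw]

# dtype=object asks pandas to store the Python objects as they are
# instead of attempting a datetime64 conversion.
df = pd.DataFrame({'date': pd.Series(parsed, dtype=object),
                   'value': [1, 2, 1]})

# The .dt accessor is not available on object columns, but plain
# attribute access still works element by element.
years = df['date'].map(lambda d: d.year)
print(years.tolist())
```

The trade-off is that an object column loses the vectorized datetime operations, so this is mainly useful for inspecting or filtering the bad rows before converting the rest.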

Answer 3

I suggest the following:

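import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'date': ['0003-03-01 00:00:00',
                                      '7199-05-15 00:00:00',
                                      '2020-10-21 00:00:00'],
                             'value': [1, 2, 3]})

# The year is compared as a string, which works only because every year
# is zero-padded to four digits. Years before 2005 become NaN.
df['date'] = [d[8:10] + '.' + d[5:7] + '.' + d[:4] if '2004' < d[:4] < '2023'
              else '01.01.2100' if d[:4] > '2022' else np.nan for d in df['date']]

df.dropna(inplace=True)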

This produces the desired output:

date        value
01.01.2100  2
21.10.2020  3
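
For larger frames, the same string-slicing idea could also be written with vectorized pandas string methods instead of a list comprehension. This is only a sketch under the same assumption as above, namely that every date string has a zero-padded four-digit year in a fixed 'YYYY-MM-DD ...' layout:

```python
import pandas as pd

df = pd.DataFrame({'date': ['0003-03-01 00:00:00',
                            '7199-05-15 00:00:00',
                            '2020-10-21 00:00:00'],
                   'value': [1, 2, 3]})

# Integer year from the first four characters.
year = df['date'].str[:4].astype(int)

# Rearrange the fixed-width pieces into dd.mm.yyyy.
formatted = (df['date'].str[8:10] + '.' + df['date'].str[5:7]
             + '.' + df['date'].str[:4])

# Overwrite years after 2022 with the placeholder, then drop years before 2005.
df['date'] = formatted.mask(year > 2022, '01.01.2100')
df = df.loc[year >= 2005]
print(df)
```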

Converting datetimes in pandas