Question

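I have a program that ideally measures the temperature every second. In reality, this does not happen. Sometimes it skips a second, or it breaks down for 400 seconds and then decides to start recording again. This leaves gaps in my 2-by-n dataframe, where ideally n = 86400 (the number of seconds in a day). I want to apply some sort of moving/rolling average to it to get a nicer plot, but if I do that on the "raw" data files, the number of data points goes down. This is shown here, note the x-axis. I know the "nice data" doesn't look nice yet; I'm just playing with some values.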

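So I want to implement a data cleaning method that adds the missing data to the dataframe. I have thought about it, but don't know how to implement it. What I had in mind is this:

If the index is not equal to the time, then we need to add a number at time = index. If the gap is only 1 value, the average of the previous number and the next number will do for me. But if it is bigger, say 100 seconds are missing, then a linear function needs to be built that steadily increases or decreases the value.

So I guess a sample dataset could look like this: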

index   time   temp 
0       0      20.10
1       1      20.20
2       2      20.20
3       4      20.10
4       100    22.30

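Here, I want to get a value for index 3, time 3, as well as the values missing between time = 4 and time = 100. Sorry about my formatting skills, I hope it is clear.

How would I go about programming this?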

Answer 1

Merge with a complete time column, then use interpolate:

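# Create your table
import numpy as np
import pandas as pd

# Keep roughly 40% of the time steps 0..19 at random to simulate gaps
time = np.array([e for e in np.arange(20) if np.random.uniform() > 0.6])
temp = np.random.uniform(20, 25, size=len(time))
temps = pd.DataFrame([time, temp]).T
temps.columns = ['time', 'temperature']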

>>> temps

   time  temperature
0   4.0    21.662352
1  10.0    20.904659
2  15.0    20.345858
3  18.0    24.787389
4  19.0    20.719487

Above is a random table in which some time steps are missing.

# modify it
filled = pd.Series(np.arange(temps.iloc[0,0], temps.iloc[-1, 0]+1))
filled = filled.to_frame()
filled.columns = ['time'] # Create a fully filled time column
merged = pd.merge(filled, temps, on='time', how='left') # merge it with original, time without temperature will be null
merged.temperature = merged.temperature.interpolate() # fill nulls linearly.

# Alternatively, use reindex, this does the same thing.
final = temps.set_index('time').reindex(np.arange(temps.time.min(),temps.time.max()+1)).reset_index()
final.temperature = final.temperature.interpolate()

>>> merged # or final

    time  temperature
0    4.0    21.662352
1    5.0    21.536070
2    6.0    21.409788
3    7.0    21.283505
4    8.0    21.157223
5    9.0    21.030941
6   10.0    20.904659
7   11.0    20.792898
8   12.0    20.681138
9   13.0    20.569378
10  14.0    20.457618
11  15.0    20.345858
12  16.0    21.826368
13  17.0    23.306879
14  18.0    24.787389
15  19.0    20.719487
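
As a rough check, the same reindex-and-interpolate idea can be applied to the sample table from the question. This is only a sketch: the column names `time` and `temp` are taken from the question's table, and the variable names (`sample`, `full_range`, `filled`) are made up for illustration. It shows that a one-value gap ends up as the average of its two neighbours, while a long gap is filled with a straight line.

```python
import numpy as np
import pandas as pd

# The sample table from the question: time 3 and times 5..99 are missing.
sample = pd.DataFrame({
    "time": [0, 1, 2, 4, 100],
    "temp": [20.10, 20.20, 20.20, 20.10, 22.30],
})

# Reindex onto every second between the first and last timestamp;
# seconds that were never recorded become NaN rows.
full_range = np.arange(sample["time"].min(), sample["time"].max() + 1)
filled = sample.set_index("time").reindex(full_range)

# Linear interpolation: equivalent to a straight line in time here,
# because the index now contains every second.
filled["temp"] = filled["temp"].interpolate()

print(filled.loc[3, "temp"])   # midpoint of the values at time 2 and time 4
print(filled.loc[52, "temp"])  # halfway along the line from time 4 to time 100
```

Because the single missing second is just the shortest possible linear segment, no special case for one-value gaps is needed.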

Answer 2

First, you can turn your seconds values into actual time values and use them as the index, like so:

df.index = pd.to_datetime(df['time'], unit='s')

After that, you can use pandas' built-in time series operations to resample and fill in the missing values:

df = df.resample('s').interpolate('time')
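
Put together, the two steps above might look like the following on a small made-up table. This is a sketch under assumptions: the data values and the name `resampled` are invented for illustration, and only the `temp` column is resampled so that the raw seconds column is not interpolated along with it.

```python
import pandas as pd

# Hypothetical readings with gaps at seconds 3 and 5..9.
df = pd.DataFrame({
    "time": [0, 1, 2, 4, 10],
    "temp": [20.10, 20.20, 20.20, 20.10, 22.30],
})

# Interpret the seconds as offsets from the Unix epoch and use them as the index.
df.index = pd.to_datetime(df["time"], unit="s")

# Upsample to one row per second, then fill the new rows by
# interpolating linearly with respect to the timestamps.
resampled = df[["temp"]].resample("s").interpolate("time")

print(resampled)
```

Since no date is supplied, the index starts at 1970-01-01 00:00:00; that is the date part the answer later strips off.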

Or, if you still want to do some smoothing, you can use the following operation:

df.rolling(5, center=True, win_type='hann').mean()

This smooths the data with a 5-element-wide Hanning window. Note: any window-based smoothing will cost you data points at the edges.
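
To make the edge effect concrete, here is a small sketch. It deliberately swaps in a plain unweighted window instead of the Hanning one, because `win_type` needs SciPy installed; the edge behaviour of a centered window is the same either way. The series values and variable names are made up, and `min_periods=1` is shown only as one possible way to keep the edge points, at the cost of averaging over fewer samples there.

```python
import numpy as np
import pandas as pd

# Ten evenly spaced made-up readings.
temps = pd.Series(np.linspace(20.0, 21.0, 10))

# A centered 5-wide window needs 2 neighbours on each side,
# so the first 2 and last 2 results are NaN.
smoothed = temps.rolling(5, center=True).mean()
n_lost = int(smoothed.isna().sum())

# Assumption: shorter windows at the edges are acceptable.
# min_periods=1 averages whatever samples are available instead.
smoothed_full = temps.rolling(5, center=True, min_periods=1).mean()

print(n_lost)
print(smoothed_full.isna().sum())
```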

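Your dataframe will now have datetimes (including the date) as its index. This is required for the resample method. If you want to drop the date, simply use: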

df.index = df.index.time

Inserting missing numbers in a dataframe
