I need to find the 'participant_id' values that have blocks of identical elements (NaN) of a certain length. For example, consider the following df:
summary participant_id
13865 3.0 28
13995 NaN 28
14050 3.0 28
14219 5.0 28
14346 NaN 28
14364 4.0 28
14456 4.0 28
14680 NaN 28
14733 3.0 28
14913 2.0 28
15007 4.0 28
15107 4.0 28
15280 NaN 28
15287 3.0 28
15420 2.0 28
15521 2.0 28
15756 NaN 28
15758 3.0 28
15973 NaN 28
16038 4.0 28
16079 6.0 28
16215 4.0 28
16412 NaN 28
16506 6.0 28
16543 6.0 28
16649 2.0 28
16811 NaN 28
16911 NaN 28
16928 3.0 28
17028 2.0 28
11582 NaN 27
11718 2.0 27
11843 NaN 27
11941 2.0 27
12053 NaN 27
12142 NaN 27
12269 NaN 27
12367 4.0 27
12510 NaN 27
12632 NaN 27
12732 NaN 27
12796 2.0 27
12946 NaN 27
13059 NaN 27
13126 2.0 27
13312 NaN 27
13394 3.0 27
13427 2.0 27
13618 NaN 27
13707 NaN 27
13832 NaN 27
13945 NaN 27
14087 NaN 27
14199 NaN 27
14299 NaN 27
14398 NaN 27
14520 NaN 27
14639 NaN 27
14759 NaN 27
14897 NaN 27
15013 NaN 27
15116 NaN 27
15182 3.0 27
15319 NaN 27
15437 NaN 27
15518 3.0 27
15695 NaN 27
15812 NaN 27
15821 2.0 27
15933 2.0 27
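If I am interested in blocks of more than 4 consecutive NaNs, then the only option is participant_id = 27; if I want blocks_length = 2, then the answer would be participant_id = [27, 28].
I tried to follow a similar solution, but it did not work.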
Answer 0 (score: 1)
You can use groupby with a custom function that counts consecutive NaNs:
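N = 4

def f(x):
    # a is True where summary is NaN
    a = x.isnull()
    # running NaN count minus its value at the last non-NaN row = current run length
    return a.cumsum() - a.cumsum().where(~a).ffill().fillna(0) == N

# group_keys=False keeps the original index, so the mask aligns with df
mask = df.groupby('participant_id', sort=False, group_keys=False)['summary'].apply(f)
L = df.loc[mask, 'participant_id'].unique().tolist()
print(L)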
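To see what the expression inside f computes, here is a minimal sketch on a small hand-made Series. The values are made up for illustration and are not taken from the question's data; the names running, baseline and run_len do not appear in the answer and are introduced only to show each step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one isolated NaN, then a run of three NaNs.
s = pd.Series([1.0, np.nan, 2.0, np.nan, np.nan, np.nan, 3.0])

a = s.isnull()                # True where the value is NaN
running = a.cumsum()          # total number of NaNs seen so far
# Running count as of the most recent non-NaN row (0 before any such row).
baseline = running.where(~a).ffill().fillna(0)
run_len = (running - baseline).astype(int)  # length of the current NaN run

print(run_len.tolist())  # [0, 1, 0, 1, 2, 3, 0]
```

Comparing run_len with N is then True exactly at the row where a run reaches length N, so any run of length N or more is flagged once.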
Alternative solution:
from functools import reduce

N = 4
nulls = df['summary'].isnull()
# for each expanding window, reduce recomputes the length of the trailing NaN run
df1 = (nulls.groupby(df['participant_id'])
            .expanding()
            .apply(lambda i: reduce(lambda x, y: x + 1 if y == 1 else 0, i, 0)))
L = df1[df1 == N].index.get_level_values(0).unique().tolist()
print(L)
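Answer 1 (score: 1)
groupby helps you get each participant's data separately. After that you can count the values in any way you like. A clear and simple approach that does not use the power of pandas might look like this: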
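import pandas as pd

block_size = 4
# data is the DataFrame from the question
for name, gr_data in data.groupby("participant_id"):
    counter = 0
    for value in gr_data["summary"]:
        # NaN is not None, so test with pd.isna rather than "is None"
        if pd.isna(value):
            counter += 1
            if counter >= block_size:
                print("%s has block of NaN of length >= %d" % (str(name), block_size))
                break
        else:
            counter = 0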
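The same loop can be wrapped in a function that returns the matching ids instead of printing them. This is only a sketch and not part of the answer: the function name ids_with_nan_block and the toy frame are hypothetical, and pd.isna is assumed as the missing-value test.

```python
import numpy as np
import pandas as pd

def ids_with_nan_block(data, block_size):
    """Return ids with at least block_size consecutive NaNs in 'summary'."""
    found = []
    for name, gr_data in data.groupby("participant_id", sort=False):
        counter = 0
        for value in gr_data["summary"]:
            if pd.isna(value):
                counter += 1
                if counter >= block_size:
                    found.append(name)
                    break  # one qualifying block is enough for this id
            else:
                counter = 0  # a non-NaN value ends the current run
    return found

# Hypothetical toy frame: id 1 has a run of 2 NaNs, id 2 has a run of 3.
toy = pd.DataFrame({
    "summary": [1.0, np.nan, np.nan, 2.0, np.nan, np.nan, np.nan, 4.0],
    "participant_id": [1, 1, 1, 1, 2, 2, 2, 2],
})
print(ids_with_nan_block(toy, 2))
print(ids_with_nan_block(toy, 3))
```

With this toy frame, both ids qualify for a block size of 2, and only id 2 qualifies for a block size of 3.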
Answer 2 (score: 1)
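import numpy as np

def null_blocks(x, n):
    isnull = np.isnan(x.values)
    nextnot = np.append(~isnull[1:], True)
    csum = isnull.cumsum()
    ends = csum[isnull & nextnot]
    if ends.size == 0:
        return False  # no NaN in this group
    # prepend 0 so that the first block's size is included
    return np.diff(np.append(0, ends)).max() >= n

def which_ids(k):
    return [n for n, g in df.groupby('participant_id').summary if null_blocks(g, k)]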
Demo
which_ids(2)
[27, 28]
which_ids(4)
[27]
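How it works
null_blocks
Find where the NaN values are with np.isnan.
Since bool is a subclass of int, we can add them up with cumsum.
Take isnull and shift it by one position to determine where a block ends. When nextnot and isnull are both True, that is the end of a block.
Slice csum at the block-end positions and take the differences; this gives the sizes of the blocks.
Return True if the largest block is at least n long.
which_ids
Iterate over the groupby object and return the names of the groups whose own block size meets or exceeds our threshold.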
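As a rough illustration of the steps described above, the following sketch prints the intermediate arrays for a small made-up input. It prepends a 0 before differencing so that the first block is measured as well; that detail, the toy array, and the names ends and sizes are assumptions of this sketch.

```python
import numpy as np

# Hypothetical input: NaN blocks of sizes 1, 3 and 2.
x = np.array([1.0, np.nan, 2.0, np.nan, np.nan, np.nan, 3.0, np.nan, np.nan])

isnull = np.isnan(x)
# True where the next element is not NaN; the last element always counts.
nextnot = np.append(~isnull[1:], True)
csum = isnull.cumsum()               # running count of NaNs
ends = csum[isnull & nextnot]        # running count at the end of each block
sizes = np.diff(np.append(0, ends))  # size of each block

print(ends.tolist())   # [1, 4, 6]
print(sizes.tolist())  # [1, 3, 2]
```

Taking sizes.max() and comparing it with the threshold then answers whether the group contains a long enough block.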