Replacing short sequences of 1s with 0s

Asked: 2018-03-06 20:53:30

Tags: python numpy

I have a huge list of 1s and 0s:

x = [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]

The full list is here.

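I want to create a new list y in which 1s are kept only if they occur in a run of 10 or more consecutive 1s; otherwise those 1s should be replaced by zeros. For example, based on x above, y should become: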

y = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]

So far I have the following, which lets me:

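  1. find the positions where the value changes, and
  2. find how long each run is:

import numpy as np
import itertools

nx = np.array(x)
# Indices after which the value changes
print(np.argwhere(np.diff(nx)).squeeze())

# Run-length encoding: (value, run length) pairs
answer = []
for key, group in itertools.groupby(nx):
    answer.append((key, len(list(group))))
print(answer)

which gives me:

[0 3 8 14]  # A
[(1, 1), (0, 3), (1, 5), (0, 6), (1, 10)] # B

#A means that the changes happen after the 0th, 3rd, etc. positions.

#B means there is one 1, followed by three 0s, then five 1s, then six 0s, then ten 1s.

How do I proceed to the final step of creating y, where the 1s are replaced with 0s depending on the run length?

PS: I am humbled by the brilliant solutions from all the great people here.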

6 answers:

Answer 0 (score: 6)

Do the check while iterating over the groups. Something like:

>>> from itertools import groupby
>>> x = [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]
>>> result = []
>>> for k, g in groupby(x):
...     if k:
...         g = list(g)
...         if len(g) < 10:
...             g = len(g)*[0]
...     result.extend(g)
...
>>> result
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
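
The same loop can be wrapped in a function so the threshold is not hard-coded. This is only a sketch: the name keep_long_runs and the min_len parameter are illustrative and not part of the original answer.

```python
from itertools import groupby

def keep_long_runs(seq, min_len=10):
    """Return a new list where runs of 1s shorter than min_len become 0s."""
    result = []
    for value, group in groupby(seq):
        run = list(group)
        # Only runs of 1s are candidates for replacement; runs of 0s pass through.
        if value and len(run) < min_len:
            run = [0] * len(run)
        result.extend(run)
    return result

x = [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]
y = keep_long_runs(x)
print(y)
```

Because groupby only groups consecutive equal values, each run is visited exactly once and the whole pass is linear in the length of the list.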

Note that for a dataset of this size, this is faster than the corresponding pandas solution:

In [11]: from itertools import groupby

In [12]: %%timeit
    ...: result = []
    ...: for k, g in groupby(x):
    ...:     if k:
    ...:         g = list(g)
    ...:         if len(g) < 10:
    ...:             g = len(g)*[0]
    ...:     result.extend(g)
    ...:
181 µs ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [13]: %%timeit s = pd.Series(x)
    ...: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
    ...:
4.03 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

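Note that this is generous to the pandas solution, since it does not count any of the time spent converting from a list to a pd.Series or back. Including those: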

In [14]: %%timeit
    ...: s = pd.Series(x)
    ...: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
    ...: s = s.tolist()
    ...:
4.92 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Answer 1 (score: 4)

Here is another NumPy approach. Please note the benchmarks at the bottom of this post:

import numpy as np
import pandas as pd
from itertools import groupby
import re
from timeit import timeit

def f_pp(data):
    switches = np.empty((data.size + 1,), bool)
    switches[0] = data[0]
    switches[-1] = data[-1]
    switches[1:-1] = data[:-1]^data[1:]
    switches = np.where(switches)[0].reshape(-1, 2)
    switches = switches[switches[:, 1]-switches[:, 0] >= 10].ravel()
    reps = np.empty((switches.size + 1,), int)
    reps[1:-1] = np.diff(switches)
    reps[0] = switches[0]
    reps[-1] = data.size - switches[-1]
    return np.repeat(np.arange(reps.size) & 1, reps)

def f_ja(data):
    result = []
    for k, g in groupby(data):
        if k:
            g = list(g)
            if len(g) < 10:
                g = len(g)*[0]
        result.extend(g)
    return result

def f_mu(s):
    s = s.copy()
    s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
    return s

def vrange(starts, stops):
     stops = np.asarray(stops)
     l = stops - starts # Lengths of each range.
     return np.repeat(stops - l.cumsum(), l) + np.arange(l.sum())

def f_ka(data):
    x = data.copy()
    d = np.where(np.diff(x) != 0)[0]
    d2 = np.diff(np.concatenate(([0], d, [x.size])))
    ind = np.where(d2 >= 10)[0] - 1
    x[vrange(d[ind] + 1, d[ind + 1] + 2)] = 0
    return x

def f_ol(data):
    return list(re.sub(b'(?<!\x01)\x01{,9}(?!\x01)', lambda m: len(m.group()) * b'\x00', bytes(data)))

n = 10_000
data = np.repeat((np.arange(n) + np.random.randint(2))&1, np.random.randint(1, 20, (n,)))
datal = data.tolist()
datap = pd.Series(data)

kwds = dict(globals=globals(), number=100)

print(np.where(f_ja(datal) != f_pp(data))[0])
print(np.where(f_ol(datal) != f_pp(data))[0])
#print(np.where(f_ka(data) != f_pp(data))[0])
print(np.where(f_mu(datap).values != f_pp(data))[0])

print('itertools.groupby: {:6.3f} ms'.format(10 * timeit('f_ja(datal)', **kwds)))
print('re:                {:6.3f} ms'.format(10 * timeit('f_ol(datal)', **kwds)))
#print('numpy Kasramvd:    {:6.3f} ms'.format(10 * timeit('f_ka(data)', **kwds)))
print('pandas:            {:6.3f} ms'.format(10 * timeit('f_mu(datap)', **kwds)))
print('numpy pp:          {:6.3f} ms'.format(10 * timeit('f_pp(data)', **kwds)))

Sample output:

[]                                        # Delta ja, pp
[]                                        # Delta ol, pp
[  749   750   751 ... 98786 98787 98788] # Delta mu, pp
itertools.groupby:  5.415 ms
re:                28.197 ms
pandas:            14.972 ms
numpy pp:           0.788 ms

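Only solutions that start from scratch are considered. @Olivier's, @juanpa.arrivillaga's and my approach give the same answer; @MaxU's does not. I could not get @Kasramvd's to run reliably. (This may be my fault: I don't know pandas and did not fully understand @Kasramvd's solution.)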

Note that this is only one example; other conditions (shorter lists, more switches, etc.) may change the ranking.
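
The edge-detection idea behind f_pp can be sketched in a slightly more defensive form. As far as I can tell, f_pp indexes switches[0] after filtering, so it assumes at least one run of 1s is long enough. The variant below (the name keep_long_runs_np and the min_len parameter are my own, not from the answer) pads the input so every run has a start and a stop, and also handles the case where no run qualifies.

```python
import numpy as np

def keep_long_runs_np(data, min_len=10):
    """Zero every run of 1s shorter than min_len; returns a new int array."""
    data = np.asarray(data).astype(bool)
    # Pad with False so runs touching either end still get a start and a stop.
    padded = np.concatenate(([False], data, [False]))
    # Position i is an edge when data[i] differs from data[i - 1].
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, stops = edges[::2], edges[1::2]   # half-open [start, stop) runs of 1s
    keep = (stops - starts) >= min_len
    # Mark +1 where a kept run begins and -1 where it ends, then integrate.
    marks = np.zeros(data.size + 1, dtype=int)
    marks[starts[keep]] += 1
    marks[stops[keep]] -= 1
    return np.cumsum(marks[:-1])

x = [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]
print(keep_long_runs_np(x))
```

I have not benchmarked this variant, so it says nothing about the ranking above.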

Answer 2 (score: 2)

Using a list comprehension

From the encoded list B, you can use a list comprehension to generate the new list.

b = [(1, 1), (0, 3), (1, 5), (0, 6), (1, 10)] # B

y = sum(([num and int(rep >= 10)] * rep for num, rep in b), [])
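
For completeness, here is a sketch that builds the encoded list with itertools.groupby and decodes it with itertools.chain instead of sum. The swap is my own suggestion, not part of the answer: sum(..., []) creates a new list at every step, which can get slow for long inputs.

```python
from itertools import chain, groupby

x = [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]

# Run-length encode x into (value, run length) pairs, i.e. the list called B.
b = [(value, len(list(group))) for value, group in groupby(x)]

# Emit 1 only for runs of 1s that are at least 10 long; everything else is 0.
y = list(chain.from_iterable(
    [1 if value == 1 and length >= 10 else 0] * length
    for value, length in b
))
print(b)
print(y)
```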

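Starting from scratch with re

Alternatively, starting from the original list, this looks like something re can do, since it can work with bytes.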

import re

x = [1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

y = list(re.sub(b'(?<!\x01)\x01{,9}(?!\x01)', lambda m: len(m.group()) * b'\x00', bytes(x)))

Both solutions output:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
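
The pattern is dense, so here is the same idea spelled out with re.VERBOSE. This is a sketch: the names SHORT_RUN and zero_short_runs_re are illustrative, and I use {1,9} rather than {,9} to skip empty matches, which should not change the result.

```python
import re

SHORT_RUN = re.compile(rb"""
    (?<!\x01)    # not preceded by a 1 byte, so the match starts at a run boundary
    \x01{1,9}    # one to nine consecutive 1 bytes
    (?!\x01)     # not followed by a 1 byte, so the match covers the whole run
""", re.VERBOSE)

def zero_short_runs_re(seq):
    """seq must contain small non-negative ints (here only 0 and 1)."""
    # bytes(n) is n zero bytes, so each short run is replaced by zeros of equal length.
    return list(SHORT_RUN.sub(lambda m: bytes(len(m.group())), bytes(seq)))

x = [1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(zero_short_runs_re(x))
```

The two lookarounds are what prevent a run of ten or more 1s from being matched piecewise: any candidate match inside such a run is either preceded or followed by another 1.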

Answer 3 (score: 2)

If you want to use NumPy, here is a vectorized approach (vrange is defined in the demo below):

d = np.where(np.diff(x) != 0)[0]  # indices after which the value changes
ind = np.where(np.diff(np.concatenate(([0], d, [x.size]))) >= 10)[0] - 1
x[vrange(d[ind] + 1, d[ind + 1] + 2)] = 0

If you want to use pure Python, here is an approach that uses itertools.chain, itertools.repeat and itertools.groupby inside a comprehension:

chain.from_iterable(repeat(0, len(i)) if len(i) >= 10 else i for i in [list(g) for _, g in groupby(x)])

Demo:

# Python

In [28]: list(chain.from_iterable(repeat(0, len(i)) if len(i) >= 10 else i for i in [list(g) for _, g in groupby(x)]))
Out[28]: [1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Numpy

In [161]: x = np.array([1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1, 0, 0, 1, 1, 1, 1, 1, 1 ,1, 1, 1, 1, 0, 0])

In [162]: d = np.where(np.diff(x) != 0)[0]

In [163]: d2 = np.diff(np.concatenate(([0], d, [x.size])))

In [164]: ind = np.where(d2 >= 10)[0] - 1

In [165]: def vrange(starts, stops):
     ...:     stops = np.asarray(stops)
     ...:     l = stops - starts # Lengths of each range.
     ...:     return np.repeat(stops - l.cumsum(), l) + np.arange(l.sum())
     ...: 

In [166]: x[vrange(d[ind] + 1, d[ind + 1] + 2)] = 0

In [167]: x
Out[167]: 
array([1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

For vrange I used this answer, but I think there may be more optimized approaches.
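
Note that in the demo output above the run of ten 1s is zeroed while the shorter runs are kept, which is the reverse of what the question asks for. A sketch of a variant of the pure-Python expression that matches the question's requirement follows; the function name zero_short_runs and the min_len parameter are my own assumptions.

```python
from itertools import chain, groupby, repeat

def zero_short_runs(seq, min_len=10):
    """Replace runs of 1s shorter than min_len with 0s; leave everything else."""
    runs = ((key, list(group)) for key, group in groupby(seq))
    return list(chain.from_iterable(
        # Only runs of 1s (truthy key) that are too short are replaced.
        repeat(0, len(run)) if key and len(run) < min_len else run
        for key, run in runs
    ))

x = [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]
print(zero_short_runs(x))
```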

Answer 4 (score: 1)

Try this:

y = []
for pair in b: ## b is the list which you called #B
    add = 0
    if pair[0] == 1 and pair[1] > 9:
        add = 1
    y.extend([add] * pair[1])

Answer 5 (score: 1)

Using pandas:

import pandas as pd

In [130]: s = pd.Series(x)

In [131]: s
Out[131]:
0     1
1     0
2     0
3     0
4     1
     ..
20    1
21    1
22    1
23    1
24    1
Length: 25, dtype: int64

In [132]: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0

In [133]: s.tolist()
Out[133]: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [134]: s
Out[134]:
0     0
1     0
2     0
3     0
4     0
     ..
20    1
21    1
22    1
23    1
24    1
Length: 25, dtype: int64

For your "huge" list it takes approx. 7 ms on my old notebook:

In [141]: len(x)
Out[141]: 5124

In [142]: %%timeit
     ...: s = pd.Series(x)
     ...: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
     ...: res = s.tolist()
     ...:
6.56 ms ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
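
To see what the grouping key s.ne(1).cumsum() does, here is a pure-Python analogue on a small hypothetical input (not the question's data). The group id increases at every value that is not 1, so a run of 1s shares its group with the single non-1 value just before it, and the group count is then one larger than the run length. With the lt(10) threshold, a run of nine 1s preceded by a 0 would therefore be kept; this may be related to the differences reported in the benchmark of another answer, though I have not verified that.

```python
from collections import Counter
from itertools import accumulate

x = [0, 1, 1, 1, 0, 0, 1, 1]

# Analogue of s.ne(1).cumsum(): the id goes up by one at every non-1 value.
group_ids = list(accumulate(int(v != 1) for v in x))

# Analogue of transform('count'): the size of the group each element belongs to.
sizes = Counter(group_ids)
counts = [sizes[g] for g in group_ids]

print(group_ids)  # [1, 1, 1, 1, 2, 3, 3, 3]
print(counts)     # [4, 4, 4, 4, 1, 3, 3, 3]
```

Here the run of three 1s at positions 1 to 3 gets a count of 4, because the leading 0 is in the same group.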