Finding the length and positions of sub-series within a series of numbers

Time: 2017-05-16 09:14:30

Tags: r numpy

I have a vector made of 0 and non-zero numbers. I would like to know the length and starting-position of each of the non-zero number series:

a = c(0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 2.6301334, 1.8372030, 0.0000000, 0.0000000, 0.0000000, 1.5632647, 1.1433757, 0.0000000, 1.5412216, 0.8762267, 0.0000000, 1.3087967, 0.0000000, 0.0000000, 0.0000000)

Based on a previous post, it is easy to find the starting positions of the non-zero regions: Finding the index of first changes in the elements of a vector in R

c(1,1+which(diff(a)!=0))

However, I cannot figure out a way to find the length of these regions.

I have tried the following:

dif=diff(which(a==0))
dif_corrected=dif-1 # to correct for the added lengths
row=rbind(position=seq(length(a)), length=c(1, dif_corrected))

position    1    2    3    4    5    6    7    8    9    10    11    12    13    14    15
length      1    0    0    0    0    2    0    0    2     2     1     0     0     1     0

NOTE: not all columns are displayed (there are actually 20)

Then I subset this to take away 0 values:

> row[,-which(row[2,]==0)]
         [,1] [,2] [,3] [,4] [,5] [,6] [,7]
position    1    6    9   10   11   14   19
length      1    2    2    2    1    1    2

This seems like a decent way of coming up with the positions and lengths of each non-zero series in the series, but it is incorrect:

Position 9 (identified as the start of a non-zero series) is a 0; positions 11 and 12 are non-zero, so I would expect position 11 with a length of 2 to appear here. The only correct result is position 6, the start of the first non-zero series, which is correctly identified as having a length of 2. All other positions are incorrect.

Can anyone tell me how to index correctly to identify the starting-position of each of the non-zero series and the corresponding lengths?

NOTE: I only did this in R because of the usefulness of the which command, but it would also be good to know how to do this in numpy and create a dictionary of positions and length values.

3 answers:

Answer 0 (score: 1)

It seems rle would be useful here.

# a slightly simpler vector
a <- c(0, 0, 1, 2, 0, 2, 1, 2, 0, 0, 0, 1)

# runs of zero and non-zero elements
r <- rle(a != 0)

# lengths of non-zero elements
r$lengths[r$values] 
# [1] 2 3 1

# start of non-zero runs
cumsum(r$lengths)[r$values] - r$lengths[r$values] + 1
# [1]  3  6 12 

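This also works for vectors that contain only 0s or only non-0s, and it does not depend on whether the vector starts or ends with 0 or non-0. E.g.:

a <- c(1, 1)
a <- c(0, 0)
a <- c(1, 1, 0, 1, 1)
a <- c(0, 0, 1, 1, 0, 0)

A possible data.table alternative: use rleid to create the groups, and .I to get the start index and calculate the length.

If desired, the runs can easily be sliced by the "non-zero" column.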
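For readers who want the same run-length idea outside R, here is a minimal Python sketch using `itertools.groupby` from the standard library. The function name `nonzero_runs` is hypothetical, and the 1-based start positions are an assumption chosen to match R's indexing. It returns a dictionary of start position to length, as requested in the question.

```python
from itertools import groupby

def nonzero_runs(values):
    """Return {start: length} for each run of non-zero values (1-based starts)."""
    runs = {}
    position = 1  # 1-based index of the first element of the current run
    for is_nonzero, group in groupby(values, key=lambda v: v != 0):
        length = sum(1 for _ in group)  # number of elements in this run
        if is_nonzero:
            runs[position] = length
        position += length  # advance to the start of the next run
    return runs

a = [0, 0, 1, 2, 0, 2, 1, 2, 0, 0, 0, 1]
print(nonzero_runs(a))  # {3: 2, 6: 3, 12: 1}
```

The result agrees with the rle output for the same vector: starts 3, 6, 12 and lengths 2, 3, 1.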

Answer 1 (score: 1)

For numpy, here is a parallel approach to @Maple's (with a fix for arrays that end in a non-zero value):

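import numpy as np

def subSeries(a):
    # 1 where a is non-zero (within floating-point tolerance), else 0
    d = np.logical_not(np.isclose(a, np.zeros_like(a))).astype(int)
    # pad with zeros so runs touching either end of the array are detected
    edges = np.diff(np.r_[0, d, 0])
    starts = np.flatnonzero(edges == 1)   # 0-based start of each run
    ends = np.flatnonzero(edges == -1)    # 0-based index just past each run
    return np.c_[starts, ends - starts]   # columns: start, length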
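The question also asks for a dictionary of positions and lengths. One possible sketch, applying the same pad-and-diff idea to the vector from the question, is shown below. The function name `run_dict` is hypothetical, and the keys are 0-based indices, following numpy's convention rather than R's.

```python
import numpy as np

def run_dict(a):
    """Map the 0-based start index of each non-zero run to its length."""
    nonzero = (~np.isclose(a, 0)).astype(int)   # 1 where a is non-zero
    edges = np.diff(np.r_[0, nonzero, 0])       # +1 at run starts, -1 just past run ends
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return {int(s): int(e - s) for s, e in zip(starts, ends)}

a = np.array([0, 0, 0, 0, 0, 2.6301334, 1.8372030, 0, 0, 0,
              1.5632647, 1.1433757, 0, 1.5412216, 0.8762267, 0,
              1.3087967, 0, 0, 0])
print(run_dict(a))  # {5: 2, 10: 2, 13: 2, 16: 1}
```

Adding 1 to each key gives the corresponding R positions 6, 11, 14 and 17.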

Answer 2 (score: 0)

Definition

sublistLen = function(list) {
    z_list <- c(0, list, 0)
    ids_start <- which(diff(z_list != 0) == 1)
    ids_end <- which(diff(z_list != 0) == - 1)
    lengths <- ids_end - ids_start

    return(
        list(
        'ids_start' = ids_start,
        'ids_end' = ids_end - 1,
        'lengths' = lengths)
        )
}

Example

> a <- c(-2,0,0,12,5,0,124,0,0,0,0,4,48,24,12,2,0,9,1)
> sublistLen(a)
$ids_start
[1]  1  4  7 12 18

$ids_end
[1]  1  5  7 16 19

$lengths
[1] 1 2 1 5 2