Finding repeated values in R

Time: 2014-07-17 10:59:13

Tags: r

So, in a string containing multiple 1s,

Now, the digit might be

'1'

which, I would say, occurs at multiple positions

. What I want is

(3)

3 answers:

Answer 0 (score: 2)

This is not a complete answer, but a few ideas (partly based on the comments):

z <- "1101101101"
zz <- as.numeric(strsplit(z,"")[[1]])

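Compute the autocorrelation function and plot it. In this case I take the periodicity (3) to be, roughly, the first lag at which the autocorrelation rises and then falls, i.e. the first local maximum: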

a1 <- acf(zz)
first.peak <- which(diff(sign(diff(a1$acf[,,1])))==-2)[1]
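The peak-finding line is compact, so here is a minimal sketch that unpacks it. The names `ac` and `is.peak` are illustrative and not part of the answer; `plot = FALSE` is used only to get the values without drawing anything.

```r
z <- "1101101101"
zz <- as.numeric(strsplit(z, "")[[1]])

# plot = FALSE returns the autocorrelation values without drawing the plot
a1 <- acf(zz, plot = FALSE)
ac <- a1$acf[, , 1]          # element 1 is lag 0, element 2 is lag 1, ...

# sign(diff(ac)) is +1 where the series rises and -1 where it falls;
# a change from +1 to -1 (a difference of -2) marks a local maximum.
is.peak <- diff(sign(diff(ac))) == -2

# Index i in is.peak refers to element i + 1 of ac, which is lag i,
# so the index can be read directly as a lag.
first.peak <- which(is.peak)[1]
first.peak                   # 3 for this string
```

Note that this heuristic only looks for the first local maximum, so on noisier strings it may pick up a spurious lag.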

Now we know the period is 3; use embed() to create runs of length 3 and analyze how similar they are:

ee <- embed(zz,first.peak)
pp <- apply(ee,1,paste,collapse="")
mm <- outer(pp,pp,"==")
aa <- apply(mm[!duplicated(mm),],1,which)
sapply(aa,length)  ## 3 3 2   ## number of repeats
sapply(aa,function(x) unique(diff(x)))  ## 3 3 3
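For readers unfamiliar with `embed()`: each row of its result is a sliding window over the input with the most recent value first, so the pasted strings above read right to left. The sketch below is one way to make the windows easier to inspect; the names `windows` and `starts` are mine, and `split()` is swapped in for the `outer()` comparison purely for readability.

```r
zz <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1)
period <- 3

ee <- embed(zz, period)      # row i holds zz[i + 2], zz[i + 1], zz[i]
# Reverse the columns so each window reads left to right
windows <- apply(ee[, period:1], 1, paste, collapse = "")

# For each distinct window, list the positions where it starts
starts <- split(seq_along(windows), windows)
starts

# Number of occurrences and spacing between occurrences of each window
sapply(starts, length)
lapply(starts, function(x) unique(diff(x)))
```

Each distinct window then comes with its start positions, and the spacing between those positions should equal the period found earlier.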

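Answer 1 (score: 1)

The following code does exactly what you asked for. Try str_groups('1101101101'). It returns a list of 3-element vectors, each holding the start position, the pitch (spacing) and the number of repeats. Note that the first triplet is (1, 3, 4), because the character at position 10 is also 1.

Final version, optimized and bug-free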

str_groups <- function (s) {
    digits <- as.numeric(strsplit(s, '')[[1]])
    index1 <- which(digits == 1)
    len <- length(digits)
    back <- length(index1)
    if (back == 0) return(list())
    maxpitch <- (len - 1) %/% 2
    patterns <- matrix(0, len, maxpitch)
    result <- list()

    for (pitch in 1:maxpitch) {
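        # Proper divisors of pitch; seq_len() keeps this empty for pitch 1,
        # where 1:(pitch %/% 2) would expand to c(1, 0)
        divisors <- which(pitch %% seq_len(pitch %/% 2) == 0)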
        while (index1[back] > len - 2 * pitch) {
            back <- back - 1
            if (back == 0) return(result)
        }
        for (startpos in index1[1:back]) {
            if (patterns[startpos, pitch] != 0) next
            pos <- seq(startpos, len, pitch)
            if (digits[pos[2]] != 1 || digits[pos[3]] != 1) next
            repeats <- length(pos)
            if (repeats > 3) for (i in 4:repeats) {
                if (digits[pos[i]] != 1) {
                    repeats <- i - 1
                    break
                }
            }
            continue <- FALSE  # flag used to skip startpos from inside the inner loop
            for (subpitch in divisors) {
                sublen <- patterns[startpos, subpitch]
                if (sublen > pitch / subpitch * (repeats - 1)) {
                    continue <- TRUE
                    break
                }
            }
            if (continue) next
            for (i in 1:repeats) patterns[pos[i], pitch] <- repeats - i + 1
            result <- append(result, list(c(startpos, pitch, repeats)))
        }
    }

    return(result)
}

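Note: this algorithm has roughly quadratic runtime complexity, so if you make the string twice as long, finding all patterns takes about four times as long on average.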

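Pseudocode version

This is meant to help in understanding the code. For details on R functions such as which, see the R online documentation, e.g. by running ?which at the R command line.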

PROCEDURE str_groups WITH INPUT $s (a string of the form /(0|1)*/):
    digits := array containing the digits in $s
    index1 := positions of the digits in $s that are equal to 1
    len := pointer to last item in $digits
    back := pointer to last item in $index1
    IF there are no items in $index1, EXIT WITH empty list
    maxpitch := the greatest possible interval between 1-digits, given $len
    patterns := array with $len rows and $maxpitch columns, initially all zero
    result := array of triplets, initially empty

    FOR EACH possible $pitch FROM 1 TO $maxpitch:
        divisors := array of divisors of $pitch (including 1, excluding $pitch)
        UPDATE $back TO the last position at which a pattern could start;
            IF no such position remains, EXIT WITH result
        FOR EACH possible $startpos IN $index1 up to $back:
            IF $startpos is marked as part of a pattern, SKIP TO NEXT $startpos
            pos := possible positions of pattern members given $startpos, $pitch
            IF either the 2nd or 3rd $pos is not 1, SKIP TO NEXT $startpos
            repeats := the number of positions in $pos
            IF there are more than 3 positions in $pos THEN
                count how long the pattern continues
                UPDATE $repeats TO the length of the pattern
            END IF (more than 3 positions)
            FOR EACH possible $subpitch IN $divisors:
                check $patterns for pattern with interval $subpitch at $startpos
                IF such a pattern is found AND it envelops the current pattern,
                    SKIP TO NEXT $startpos
                    (using helper variable $continue to cross two loop levels)
                END IF (pattern found)
            END FOR (subpitch)
            FOR EACH consecutive position IN the pattern:
                UPDATE $patterns at row of position and column of $pitch TO ...
                    ... the remaining length of the pattern at that position
            END FOR (position)
            APPEND the triplet ($startpos, $pitch, $repeats) TO $result
        END FOR (startpos)
    END FOR (pitch)

    EXIT WITH $result
END PROCEDURE (str_groups)
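When experimenting with the procedure above, it can help to have a slower but simpler routine to compare against. The following is a naive sketch, not the answer's algorithm: `naive_groups` and its `min.repeats` parameter are hypothetical names, and it deliberately omits the divisor check, so it also reports coarser patterns that are contained in a finer one.

```r
# Naive cross-check (illustrative only): report every maximal run of at
# least `min.repeats` ones spaced `pitch` apart as c(start, pitch, repeats).
naive_groups <- function(s, min.repeats = 3) {
    digits <- as.numeric(strsplit(s, "")[[1]])
    len <- length(digits)
    result <- list()
    if (len < 3) return(result)
    for (pitch in seq_len((len - 1) %/% 2)) {
        for (startpos in which(digits == 1)) {
            # Skip positions that merely continue a run started earlier
            prev <- startpos - pitch
            if (prev >= 1 && digits[prev] == 1) next
            pos <- seq(startpos, len, by = pitch)
            # Count the leading ones along the candidate positions
            zero <- which(digits[pos] != 1)
            repeats <- if (length(zero) > 0) zero[1] - 1 else length(pos)
            if (repeats >= min.repeats)
                result <- append(result, list(c(startpos, pitch, repeats)))
        }
    }
    result
}

res <- naive_groups("1101101101")
res   # triplets (1, 3, 4) and (2, 3, 3)
```

For the example string the two triplets agree with the first triplet quoted in the answer. On strings such as long runs of consecutive 1s the naive version will return extra triplets, because it does not suppress patterns already covered by a smaller pitch.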

Answer 2 (score: 0)

Perhaps the following route will help:

  1. Convert the string to a vector of integer digits

    v <- as.integer(strsplit(s, "")[[1]])
    
  2. Repeatedly reshape this vector into matrices with different numbers of rows...

    m <- matrix(v, nrow=...)
    
  3. ...and use rle to find relevant patterns in the rows of matrix m:

    rle(m[1, ]); rle(m[2, ]); ...
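The three steps above can be sketched for a single candidate spacing as follows. This is only one possible reading of the answer: the helper `rows_by_period` and the choice of 3 rows are assumptions made for illustration, and the NA padding is added so that `matrix()` does not recycle values when the length is not a multiple of the row count.

```r
s <- "1101101101"
v <- as.integer(strsplit(s, "")[[1]])

# Hypothetical helper: lay v out column by column in a matrix with p rows,
# so that row r holds v[r], v[r + p], v[r + 2p], ...
rows_by_period <- function(v, p) {
    n.col <- ceiling(length(v) / p)
    # Pad with NA so the length is a multiple of p (avoids recycling)
    padded <- c(v, rep(NA_integer_, n.col * p - length(v)))
    matrix(padded, nrow = p)
}

m <- rows_by_period(v, 3)
m

# rle() on a row gives the lengths of runs of equal values in that row,
# i.e. how often a digit repeats at a spacing of p
r1 <- rle(m[1, ])
r2 <- rle(m[2, ])
r1$lengths[r1$values %in% 1]   # run of 1s starting at position 1
r2$lengths[r2$values %in% 1]   # run of 1s starting at position 2
```

Repeating this for each candidate row count and keeping runs of 1s of length three or more would give the start position (row index plus offset), the spacing (row count) and the repeat count.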