How to match DNA sequence patterns

Time: 2013-06-01 09:16:15

Tags: algorithm sequence matching dna-sequence

I can't find a way to solve this problem.

The input and output sequences are as follows:

 **input1 :** aaagctgctagag 
 **output1 :** a3gct2ag2

 **input2 :** aaaaaaagctaagctaag 
 **output2 :** a6agcta2ag

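The input sequence can be up to 10^6 characters long, and the largest contiguous pattern is to be considered.

For example, for the part "agctaagcta" of input2, the output is not "agcta2gcta" but "agcta2".

Any help is appreciated.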
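
To make the target format concrete, here is a minimal sketch of the reverse direction: expanding a compressed string back into the original sequence. The names `dnaExpand` and `nextSegment` are hypothetical (they do not appear anywhere in this thread). The sketch assumes that the alphabet contains no digits and that a segment without a trailing count is repeated once, as with the final "ag" in output2.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one segment (letters followed by an optional
   decimal count) starting at p. Stores the pattern length and the repeat
   count, and returns a pointer just past the segment. A missing count is
   treated as 1 (an assumption based on output2 in the question). */
static const char *nextSegment(const char *p, size_t *patLen, size_t *count) {
    size_t n = 0;
    while (p[n] != '\0' && !isdigit((unsigned char)p[n])) n++;
    const char *q = p + n;
    size_t c = 0;
    int sawDigit = 0;
    while (isdigit((unsigned char)*q)) {
        c = c * 10 + (size_t)(*q - '0');
        sawDigit = 1;
        q++;
    }
    *patLen = n;
    *count = sawDigit ? c : 1;
    return q;
}

/* Expands a compressed string such as "a3gct2ag2".
   The caller must free the result; returns NULL if allocation fails. */
char *dnaExpand(const char *in) {
    assert(in != NULL);
    size_t patLen, count, total = 0;

    /* First pass: compute the length of the expanded sequence. */
    for (const char *p = in; *p != '\0'; ) {
        p = nextSegment(p, &patLen, &count);
        total += patLen * count;
    }

    char *out = malloc(total + 1);
    if (out == NULL) return NULL;

    /* Second pass: copy each pattern `count` times. */
    char *w = out;
    for (const char *p = in; *p != '\0'; ) {
        const char *start = p;
        p = nextSegment(p, &patLen, &count);
        for (size_t k = 0; k < count; k++) {
            memcpy(w, start, patLen);
            w += patLen;
        }
    }
    *w = '\0';
    return out;
}
```

With these assumptions, `dnaExpand("a3gct2ag2")` gives back input1 and `dnaExpand("a6agcta2ag")` gives back input2, which is a convenient way to check any compressor against the examples.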

3 answers:

Answer 0: (score: 10)

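Algorithm explanation:

  • Let the sequence S have the symbols s(1), s(2), ..., s(N).
  • Let B(i) be the best compressed sequence for the elements s(1), s(2), ..., s(i).
  • So, for example, B(3) is the best compressed sequence for s(1), s(2), s(3).
  • What we want to know is B(N).

To find it, we proceed by induction. We want to compute B(i+1) knowing B(i), B(i-1), B(i-2), ..., B(1), B(0), where B(0) is the empty sequence and B(1) = s(1). At the same time, this constitutes a proof that the solution is optimal. ;)

To compute B(i+1), we pick the best sequence among the following candidates: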

  1. Candidate sequences where the last block has one element:

    B(i) s(i+1) 1
    B(i-1) s(i+1) 2 ; only if s(i) = s(i+1)
    B(i-2) s(i+1) 3 ; only if s(i-1) = s(i) and s(i) = s(i+1)
    ...
    B(1) s(i+1) i ; only if s(2) = s(3) and s(3) = s(4) and ... and s(i) = s(i+1)
    B(0) s(i+1) (i+1) = s(i+1) (i+1) ; only if s(1) = s(2) and s(2) = s(3) and ... and s(i) = s(i+1)

  2. Candidate sequences where the last block has 2 elements:

    B(i-1) s(i)s(i+1) 1
    B(i-3) s(i)s(i+1) 2 ; only if s(i-2)s(i-1) = s(i)s(i+1)
    B(i-5) s(i)s(i+1) 3 ; only if s(i-4)s(i-3) = s(i-2)s(i-1) and s(i-2)s(i-1) = s(i)s(i+1)
    ...

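  3. Candidate sequences where the last block has 3 elements:

    ...

  4. Candidate sequences where the last block has 4 elements:

    ...

    ...

  5. Candidate sequences where the last block has i+1 elements:

    s(1)s(2)s(3)...s(i+1)

  6. For each possibility, the algorithm stops when the sequence block no longer repeats. And that's it.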
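
The candidate enumeration above can also be read as a recurrence on the compressed length. As a sketch, write \ell(i) for the length of B(i) and d(r) for the number of decimal digits of r. Both symbols are introduced here for illustration only and are not part of the original answer.

```latex
\ell(0) = 0, \qquad
\ell(i) = \min_{\substack{j \ge 1,\; r \ge 1,\; jr \le i \\
          s(i-jr+1)\dots s(i) \,=\, \bigl(s(i-j+1)\dots s(i)\bigr)^{r}}}
          \bigl[\, \ell(i-jr) + j + d(r) \,\bigr],
\qquad d(r) = \lfloor \log_{10} r \rfloor + 1 .
```

Here j is the number of elements in the last block and r is how many times that block repeats at the end of s(1)...s(i). The case r = 1 is always admissible, so the minimum is taken over a non-empty set.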

    In pseudo-C, the algorithm looks something like this:

    B(0) = ""
    for (i=1; i<=N; i++) {
        // Calculate all the candidates for B(i)
        BestCandidate = null
        for (j=1; j<=i; j++) {
            // Candidates whose last block has j elements
    
            r = 1;
            do {
                // Last block s(i-j+1)…s(i), repeated r times
                Candidate = B(i-j*r) s(i-j+1)…s(i-1)s(i) r
                if (   (BestCandidate == null)
                    || (Candidate is shorter than BestCandidate))
                {
                    BestCandidate = Candidate
                }
                r++;
            } while (   (j*r <= i)
                     && (s(i-j*r+1) s(i-j*r+2)…s(i-j*r+j) == s(i-j+1) s(i-j+2)…s(i)))
    
        }
        B(i) = BestCandidate
    }
    

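    Hope this helps a bit more.

    The complete C program that performs the required task is given below. It runs in O(n^2). The core part is only 30 lines of code.

    EDIT: I have restructured the code a bit, renamed the variables, and added some comments to make it more readable.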

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <limits.h>
    
    
    // This struct represents a compressed segment like atg4, g3,  agc1
        struct Segment {
            char *elements;
            int nElements;
            int count;
        };
             // As an example, for the segment agagt3  elements would be:
             // {
             //      elements: "agagt",
             //      nElements: 5,
             //      count: 3
             // }
    
        struct Sequence {
            struct Segment lastSegment;
            struct Sequence *prev;      // Points to a sequence without the last segment or NULL if it is the first segment
            int totalLen;  // Total length of the compressed sequence.
        };
           // as an example, for the sequence agt32ta5, the representation will be:
           // {
           //     lastSegment:{"ta" , 2 , 5},
           //     prev: @A,
           //     totalLen: 8
           // }
           // and A will be
           // {
           //     lastSegment{ "agt", 3, 32},
           //     prev: NULL,
           //     totalLen: 5
           // }
    
    
    // This function converts a sequence to a string.
    // You have to free the string after using it.
    // The strategy is to construct the string from right to left.
    
    char *sequence2string(struct Sequence *S) {
        char *Res=malloc(S->totalLen + 1);
        char *digits="0123456789";
    
        int p= S->totalLen;
        Res[p]=0;
    
        while (S!=NULL) {
                // first we insert the count of the last element.
                // We do digit by digit starting with the units.
                int C = S->lastSegment.count;
                while (C) {
                    p--;
                    Res[p] = digits[ C % 10 ];
                    C /= 10;
                }
    
                p -= S->lastSegment.nElements;
                strncpy(Res + p , S->lastSegment.elements, S->lastSegment.nElements);
    
                S = S ->prev;
        }
    
    
        return Res;
    }
    
    
    // Compresses a dna sequence.
    // Returns a string with the in sequence compressed.
    // The returned string must be freed after using it.
    char *dnaCompress(char *in) {
        int i,j;
    
        int N = strlen(in);             // Number of elements of the input sequence.
    
    
    
        // B is an array of N+1 sequences where B[i] is the best compressed sequence of the first i characters.
        // What we want to return is B[N];
        struct Sequence *B;
        B = malloc((N+1) * sizeof (struct Sequence));
    
        // We first do an initialization for i=0
    
        B[0].lastSegment.elements="";
        B[0].lastSegment.nElements=0;
        B[0].lastSegment.count=0;
        B[0].prev = NULL;
        B[0].totalLen=0;
    
        // and set totalLen of all the other sequences to a very high value (INT_MAX). We will try different candidates and keep the shortest one.
        for (i=1; i<=N; i++) B[i].totalLen = INT_MAX;   // A very high value
    
        for (i=1; i<=N; i++) {
    
            // at this point we want to calculate B[i] and we know B[i-1], B[i-2], .... ,B[0]
            for (j=1; j<=i; j++) {
    
                // Here we will check all the candidates where the last segment has j elements
    
                int r=1;                  // number of times the last segment is repeated
                int rNDigits=1;           // Number of digits of r
                int rNDigitsBound=10;     // We will increment r, so this value is when r will have an extra digit.
                                          // when r = 0,1,...,9 => rNDigitsBound = 10
                                          // when r = 10,11,...,99 => rNDigitsBound = 100
                                          // when r = 100,101,.,999 => rNDigitsBound = 1000 and so on.
    
                do {
    
                    // Here we analyze a candidate for B[i]
                    // where the last segment has j elements repeated r times.
    
                    int CandidateLen = B[i-j*r].totalLen + j + rNDigits;
                    if (CandidateLen < B[i].totalLen) {
                        B[i].lastSegment.elements = in + i - j*r;
                        B[i].lastSegment.nElements = j;
                        B[i].lastSegment.count = r;
                        B[i].prev = &(B[i-j*r]);
                        B[i].totalLen = CandidateLen;
                    }
    
                    r++;
                    if (r == rNDigitsBound ) {
                        rNDigits++;
                        rNDigitsBound *= 10;
                    }
                } while (   (i - j*r >= 0)
                         && (strncmp(in + i -j, in + i - j*r, j)==0));
    
            }
        }
    
        char *Res=sequence2string(&(B[N]));
        free(B);
    
        return Res;
    }
    
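    int main(int argc, char** argv) {
        // The sequence to compress is expected as the first command-line argument.
        if (argc < 2) {
            fprintf(stderr, "usage: %s <dna-sequence>\n", argv[0]);
            return 1;
        }
        char *compressedDNA=dnaCompress(argv[1]);
        puts(compressedDNA);
        free(compressedDNA);
        return 0;
    }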
    

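Answer 1: (score: 2)

Forget Ukkonen. This is dynamic programming. Use a three-dimensional table:

  1. Sequence position
  2. Subsequence size
  3. Number of segments
  4. Terminology: for example, with a = "aaagctgctagag", the sequence position coordinate runs from 1 to 13. At sequence position 4 (letter 'g') with subsequence size 4, the subsequence is "gctg". Got it? As for the number of segments, writing a as "aaagctgctagag1" consists of 1 segment (the sequence itself). Writing it as "a3gct2ag2" consists of 3 segments. "aaagctgct1ag2" consists of 2 segments. "a2a1ctg2ag2" would consist of 4 segments. Got it? Now, with this, you start filling a three-dimensional 13 x 13 x 13 array, so your time and memory complexity seems to be around n ** 3. Are you sure you can handle that for million-bp sequences? I think a greedy approach would be better, because large DNA sequences are unlikely to repeat exactly. Also, I would suggest that you widen your assignment to approximate matches, and then you could publish it straight in a journal.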

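    Anyway, you start filling the table of compressed subsequences that start at some position (dimension 1), have a length equal to the dimension 2 coordinate, and have at most as many segments as the dimension 3 coordinate. So you first fill the first row, which represents compressions of subsequences of length 1 consisting of at most 1 segment: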

    a        a        a        g        c        t        g        c        t        a        g        a        g
    1(a1)    1(a1)    1(a1)    1(g1)    1(c1)    1(t1)    1(g1)    1(c1)    1(t1)    1(a1)    1(g1)    1(a1)    1(g1)
    

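    The number is the character cost (always 1 for these trivial 1-char sequences; the digit 1 does not count toward the character cost), and in parentheses you have the compression (also trivial for this simple case). The second row is still simple: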

    2(a2)    2(a2)    2(ag1)   2(gc1)   2(ct1)   2(tg1)   2(gc1)   2(ct1)   2(ta1)   2(ag1)   2(ga1)    2(ag1)
    

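    There is only one way to decompose a 2-character sequence into 2 subsequences: 1 character + 1 character. If they are identical, the result is like a + a = a2. If they are different, such as a + g, then, because only 1-segment sequences are admissible, the result cannot be a1g1 but must be ag1. The third row is finally more interesting: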

    2(a3)    3(aag1)  3(agc1)  3(gct1)  3(ctg1)  3(tgc1)  3(gct1)  3(cta1)  3(tag1)  3(aga1)  3(gag1)
    

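    Here, you can always choose between two ways of composing the compressed string. For example, aag can be composed either as aa + g or as a + ag. But again, we cannot have 2 segments, as in aa1g1 or a1ag1, so we must be satisfied with aag1, unless both components consist of the same character, as in aa + a => a3, with character cost 2. We can continue to the 4th row: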

    4(aaag1) 4(aagc1) 4(agct1) 4(gctg1) 4(ctgc1) 4(tgct1) 4(gcta1) 4(ctag1) 4(taga1) 3(ag2)
    

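    Here, at the first position, we cannot use a3g1, because only 1 segment is allowed at this layer. But at the last position, compression to character cost 3 is achieved by ag1 + ag1 = ag2. In this way, the whole first-level table can be filled all the way up to the single subsequence of 13 characters, and each subsequence will have its optimal character cost and its compression, under the first-level constraint of at most 1 segment, associated with it.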
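
As a rough illustration of this first level, the sketch below computes the character cost of the best 1-segment compression of a single subsequence. The function names `digitCount` and `oneSegmentCost` are hypothetical. The cost convention is an assumption inferred from the rows above: the cost is the pattern length plus the digits of the repeat count, and a count of 1 is free.

```c
#include <assert.h>
#include <string.h>

/* Number of decimal digits of r, for r >= 1. */
static int digitCount(int r) {
    int d = 1;
    while (r >= 10) {
        r /= 10;
        d++;
    }
    return d;
}

/* Hypothetical helper: character cost of the best single-segment
   compression of s[0..len-1]. A pattern of length p repeated len/p times
   costs p plus the digits of the count; a count of 1 costs nothing extra
   (assumed convention, e.g. ag1 costs 2 and ag2 costs 3). */
int oneSegmentCost(const char *s, int len) {
    assert(s != NULL && len > 0);
    int best = len;                      /* whole subsequence, count 1 */
    for (int p = 1; p < len; p++) {
        if (len % p != 0) continue;      /* pattern must tile the subsequence */
        /* s is p-periodic exactly when s[k] == s[k+p] for all valid k. */
        if (memcmp(s, s + p, (size_t)(len - p)) == 0) {
            int cost = p + digitCount(len / p);
            if (cost < best) best = cost;
        }
    }
    return best;
}
```

For instance, `oneSegmentCost("agag", 4)` returns 3, matching the entry 3(ag2), and `oneSegmentCost("aaag", 4)` returns 4, matching 4(aaag1). Filling the whole first level this way means one call per (position, size) pair, which is already expensive for long inputs.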

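    Then you go to the second level, where 2 segments are allowed... And again, from the bottom up, you identify the optimal cost and compression for each table coordinate under the given level's segment-count constraint by comparing all the possible ways to compose the subsequence from already computed positions, until you fill the table completely and thus compute the global optimum. There are some details to work out but, sorry, I'm not going to do the coding for you.

Answer 2: (score: 2)

After trying my own way for a while, my kudos to jbaylina for his beautiful algorithm and C implementation. Here is my attempted version of jbaylina's algorithm in Haskell, and further below it my attempt at a linear-time algorithm that tries to compress segments containing repeated patterns one at a time: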

import Data.Map (fromList, insert, size, (!))

compress s = (foldl f (fromList [(0,([],0)),(1,([s!!0],1))]) [1..n - 1]) ! n  
 where
  n = length s
  f b i = insert (size b) bestCandidate b where
    add (sequence, sLength) (sequence', sLength') = 
      (sequence ++ sequence', sLength + sLength')
    j' = [1..min 100 i]
    bestCandidate = foldr combCandidates (b!i `add` ([s!!i,'1'],2)) j'
    combCandidates j candidate' = 
      let nextCandidate' = comb 2 (b!(i - j + 1) 
                       `add` ((take j . drop (i - j + 1) $ s) ++ "1", j + 1))
      in if snd nextCandidate' <= snd candidate' 
            then nextCandidate' 
            else candidate' where
        comb r candidate
          | r > uBound                         = candidate
          | not (strcmp r True)                = candidate
          | snd nextCandidate <= snd candidate = comb (r + 1) nextCandidate
          | otherwise                          = comb (r + 1) candidate
         where 
           uBound = div (i + 1) j
           prev = b!(i - r * j + 1)
           nextCandidate = prev `add` 
             ((take j . drop (i - j + 1) $ s) ++ show r, j + length (show r))
           strcmp 1   _    = True
           strcmp num bool 
             | (take j . drop (i - num * j + 1) $ s) 
                == (take j . drop (i - (num - 1) * j + 1) $ s) = 
                  strcmp (num - 1) True
             | otherwise = False

Output:

*Main> compress "aaagctgctagag"
("a3gct2ag2",9)

*Main> compress "aaabbbaaabbbaaabbbaaabbb"
("aaabbb4",7)


Linear-time attempt:

import Data.List (sortBy)

group' xxs sAccum (chr, count)
  | null xxs = if null chr 
                  then singles
                  else if count <= 2 
                          then reverse sAccum ++ multiples ++ "1"
                  else singles ++ if null chr then [] else chr ++ show count
  | [x] == chr = group' xs sAccum (chr,count + 1)
  | otherwise = if null chr 
                   then group' xs (sAccum) ([x],1) 
                   else if count <= 2 
                           then group' xs (multiples ++ sAccum) ([x],1)
                   else singles 
                        ++ chr ++ show count ++ group' xs [] ([x],1)
 where x:xs = xxs
       singles = reverse sAccum ++ (if null sAccum then [] else "1")
       multiples = concat (replicate count chr)

sequences ws strIndex maxSeqLen = repeated' where
  half = if null . drop (2 * maxSeqLen - 1) $ ws 
            then div (length ws) 2 else maxSeqLen
  repeated' = let (sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) = repeated
              in (sequence,(sequenceStart, sequenceEnd'))
  repeated = foldr divide ([],(strIndex,strIndex),False) [1..half]
  equalChunksOf t a = takeWhile(==t) . map (take a) . iterate (drop a)
  divide chunkSize b@(sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) = 
    let t = take (2*chunkSize) ws
        t' = take chunkSize t
    in if t' == drop chunkSize t
          then let ts = equalChunksOf t' chunkSize ws
                   lenTs = length ts
                   sequenceEnd = strIndex + lenTs * chunkSize
                   newEnd = if sequenceEnd > sequenceEnd' 
                            then sequenceEnd else sequenceEnd'
               in if chunkSize > 1 
                     then if length (group' (concat (replicate lenTs t')) [] ([],0)) > length (t' ++ show lenTs)
                             then (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),True)
                             else b
                     else if notSinglesFlag
                             then b
                             else (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),False)
          else b

addOne a b
  | null (fst b) = a
  | null (fst a) = b
  | otherwise = 
      let (((start,end,patLen,lenS),sequence):rest,(sStart,sEnd)) = a 
          (((start',end',patLen',lenS'),sequence'):rest',(sStart',sEnd')) = b
      in if sStart' < sEnd && sEnd < sEnd'
            then let c = ((start,end,patLen,lenS),sequence):rest
                     d = ((start',end',patLen',lenS'),sequence'):rest'
                 in (c ++ d, (sStart, sEnd'))
            else a

segment xs baseIndex maxSeqLen = segment' xs baseIndex baseIndex where
  segment' zzs@(z:zs) strIndex farthest
    | null zs                              = initial
    | strIndex >= farthest && strIndex > 0 = ([],(0,0))
    | otherwise                            = addOne initial next
   where
     next@(s',(start',end')) = segment' zs (strIndex + 1) farthest'
     farthest' | null s = farthest
               | otherwise = if start /= end && end > farthest then end else farthest
     initial@(s,(start,end)) = sequences zzs strIndex maxSeqLen

areExclusive ((a,b,_,_),_) ((a',b',_,_),_) = (a' >= b) || (b' <= a)

combs []     r = [r]
combs (x:xs) r
  | null r    = combs xs (x:r) ++ if null xs then [] else combs xs r
  | otherwise = if areExclusive (head r) x
                   then combs xs (x:r) ++ combs xs r
                        else if l' > lowerBound
                                then combs xs (x: reduced : drop 1 r) ++ combs xs r
                                else combs xs r
 where lowerBound = l + 2 * patLen
       ((l,u,patLen,lenS),s) = head r
       ((l',u',patLen',lenS'),s') = x
       reduce = takeWhile (>=l') . iterate (\x -> x - patLen) $ u
       lenReduced = length reduce
       reduced = ((l,u - lenReduced * patLen,patLen,lenS - lenReduced),s)

buildString origStr sequences = buildString' origStr sequences 0 (0,"",0)
   where
    buildString' origStr sequences index accum@(lenC,cStr,lenOrig)
      | null sequences = accum
      | l /= index     = 
          buildString' (drop l' origStr) sequences l (lenC + l' + 1, cStr ++ take l' origStr ++ "1", lenOrig + l')
      | otherwise      = 
          buildString' (drop u' origStr) rest u (lenC + length s', cStr ++ s', lenOrig + u')
     where
       l' = l - index
       u' = u - l  
       s' = s ++ show lenS       
       (((l,u,patLen,lenS),s):rest) = sequences

compress []         _         accum = reverse accum ++ (if null accum then [] else "1")
compress zzs@(z:zs) maxSeqLen accum
  | null (fst segment')                      = compress zs maxSeqLen (z:accum)
  | (start,end) == (0,2) && not (null accum) = compress zs maxSeqLen (z:accum)
  | otherwise                                =
      reverse accum ++ (if null accum || takeWhile' compressedStr 0 /= 0 then [] else "1")
      ++ compressedStr
      ++ compress (drop lengthOriginal zzs) maxSeqLen []
 where segment'@(s,(start,end)) = segment zzs 0 maxSeqLen
       combinations = combs (fst $ segment') []
       takeWhile' xxs count
         | null xxs                                             = 0
         | x == '1' && null (reads (take 1 xs)::[(Int,String)]) = count 
         | not (null (reads [x]::[(Int,String)]))               = 0
         | otherwise                                            = takeWhile' xs (count + 1) 
        where x:xs = xxs
       f (lenC,cStr,lenOrig) (lenC',cStr',lenOrig') = 
         let g = compare ((fromIntegral lenC + if not (null accum) && takeWhile' cStr 0 == 0 then 1 else 0) / fromIntegral lenOrig) 
                         ((fromIntegral lenC' + if not (null accum) && takeWhile' cStr' 0 == 0 then 1 else 0) / fromIntegral lenOrig')
         in if g == EQ 
               then compare (takeWhile' cStr' 0) (takeWhile' cStr 0)
               else g
       (lenCompressed,compressedStr,lengthOriginal) = 
         head $ sortBy f (map (buildString (take end zzs)) (map reverse combinations))

Output:

*Main> compress "aaaaaaaaabbbbbbbbbaaaaaaaaabbbbbbbbb" 100 []
"a9b9a9b9"

*Main> compress "aaabbbaaabbbaaabbbaaabbb" 100 []
"aaabbb4"