Extract the part of a string that lies between two matching substrings

Time: 2014-09-27 18:04:51

Tags: python r perl pattern-matching substring

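I have three files, each containing a set of strings. File1 and File2 contain substrings of the strings in File3. For each string in File3, I want to extract the part that lies between the File1 substring and the File2 substring. See the example below:

File1 (substring 1):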

 head(fivep$V2)
[1] UGAGGUAGUAGUUUGUACAGUU  UGAGGUAGUAGUUUGUGCUGUU  ACAUACUUCUUUAUAUGCCCAUA UAGCAGCACAUCAUGGUUUACA 
[5] GGGUUCCUGGCAUGCUGAUUU   AGAGCUUAGCUGAUUGGUGAAC 

File2 (substring 2):

 head(threep$V2)
[1] ACUGUACAGGCCACUGCCUUGC CUGCGCAAGCUACUGCCUUGCU UGGAAUGUAAAGAAGUAUGUAU CGAAUCAUUAUUUGCUGCUCUA
[5] AUCACAUUGCCAGGGAUUACC  UUCACAGUGGCUAAGUUCUGC 

File3:

head(hairpin$V2)
[1] UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAACUAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA
[2] AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCUGGGCUCUCCGGGUACCAGGACGGUUUGAGCAGAU     
[3] AAAGUGACCGUACCGAGCUGCAUACUUCCUUACAUGCCCAUACUAUAUCAUAAAUGGAUAUGGAAUGUAAAGAAGUAUGUAGAACGGGGUGGUAGU   
[4] UAAACAGUAUACAGAAAGCCAUCAAAGCGGUGGUUGAUGUGUUGCAAAUUAUGACUUUCAUAUCACAGCCAGCUUUGAUGUGCUGCCUGUUGCACUGU 
[5] CGGACAAUGCUCGAGAGGCAGUGUGGUUAGCUGGUUGCAUAUUUCCUUGACAACGGCUACCUUCACUGCCACCCCGAACAUGUCGUCCAUCUUUGAA  
[6] UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCC  

Example:

                                 String in File1                       String in  File2
                              AGGGCUUAGCUGCUUGUGAGCA                   UUCACAGUGGCUAAGUUCCGC
String in File3      CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG

Output for this example:

GGGUCCACACCAAGUCGUG

7 answers:

Answer 0 (score: 4)

In Perl, you could try the following code:

use strict;
use warnings;

my $file1 = "AGGGCUUAGCUGCUUGUGAGCA";
my $file2 = "UUCACAGUGGCUAAGUUCCGC";
my $file3 = "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG";

# \Q...\E quotes the markers so they are matched literally
my ($result) = $file3 =~ /\Q$file1\E(.*?)\Q$file2\E/;

print "$result\n";

Output:

GGGUCCACACCAAGUCGUG

Answer 1 (score: 2)

Here is a solution in R:

file1 <- "AGGGCUUAGCUGCUUGUGAGCA"
file2 <- "UUCACAGUGGCUAAGUUCCGC"
file3 <- "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"

# create a regular expression
pattern <- paste0(".*", file1, "(.*)", file2, ".*")

# extract the substring
sub(pattern, "\\1", file3)
# [1] "GGGUCCACACCAAGUCGUG"

Answer 2 (score: 1)

In Python:

>>> import re
>>> a='AGGGCUUAGCUGCUUGUGAGCA'
>>> b='UUCACAGUGGCUAAGUUCCGC'
>>> c='CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG'
>>> regex = a + '(.*?)' + b
>>> regex
'AGGGCUUAGCUGCUUGUGAGCA(.*?)UUCACAGUGGCUAAGUUCCGC'
>>> re.findall(regex,c)
['GGGUCCACACCAAGUCGUG']
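
The question asks about three files rather than three single strings, so here is a minimal sketch of how the same regex could be applied row by row. It assumes the three files have already been read into lists whose entries line up by position; `extract_between` and the second (non-matching) row are hypothetical and only there for illustration. `re.escape` is used so that the markers are always treated as literal text.

```python
import re

def extract_between(left, right, text):
    """Return the text between `left` and `right`, or None if there is no match."""
    # re.escape makes the markers literal, so regex metacharacters cannot interfere.
    pattern = re.escape(left) + "(.*?)" + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Hypothetical stand-ins for the three files, aligned row by row.
file1 = ["AGGGCUUAGCUGCUUGUGAGCA", "AAAA"]
file2 = ["UUCACAGUGGCUAAGUUCCGC", "CCCC"]
file3 = [
    "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG",
    "GGGGUUUU",
]

results = [extract_between(a, b, c) for a, b, c in zip(file1, file2, file3)]
print(results)  # ['GGGUCCACACCAAGUCGUG', None]
```

Returning None for rows without a match keeps the output the same length as the input, which makes it easy to see which rows failed.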

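Answer 3 (score: 1)

Try this using strapplyc from the gsubfn package. We assume there is only one instance of s1 and of s2 or, if there are multiple instances, that you want the string between the first instance of s1 and the last instance of s2. If there may be multiple instances and you want something different, please add that to the question.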
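
The "first instance of s1 to last instance of s2" behaviour comes from a greedy capture group. The sketch below illustrates that difference in Python; it is not the strapplyc call, and the short markers and test string are made up for the demonstration.

```python
import re

s1, s2 = "AB", "YZ"          # hypothetical short markers
s3 = "xxAB111YZ--AB222YZxx"  # both markers occur twice

# Greedy (.*): runs from the first s1 to the last s2.
greedy = re.search(re.escape(s1) + "(.*)" + re.escape(s2), s3).group(1)
# Lazy (.*?): stops at the first s2 that follows the first s1.
lazy = re.search(re.escape(s1) + "(.*?)" + re.escape(s2), s3).group(1)

print(greedy)  # 111YZ--AB222
print(lazy)    # 111
```

When each marker occurs exactly once, as in the example from the question, both forms give the same result.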

Answer 4 (score: 1)

In Python:

    string1 = "AGGGCUUAGCUGCUUGUGAGCA"
    string2 = "UUCACAGUGGCUAAGUUCCGC"
    string_main = "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"
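    # start right after string1, then look for string2 only from that point on
    start = string_main.find(string1) + len(string1)
    end = string_main.find(string2, start)
    print(string_main[start:end])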
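
The index-based approach gives a wrong slice when either marker is missing, because `find` returns -1 in that case. A minimal sketch of the same idea with that case handled, using `str.partition` instead of a regex; `between` is a hypothetical helper name, not something from the answer.

```python
def between(text, left, right):
    """Return the part of `text` after `left` and before the next `right`, else None."""
    _, found_left, rest = text.partition(left)
    if not found_left:
        return None  # left marker does not occur in text
    middle, found_right, _ = rest.partition(right)
    return middle if found_right else None  # None when right marker is missing

string1 = "AGGGCUUAGCUGCUUGUGAGCA"
string2 = "UUCACAGUGGCUAAGUUCCGC"
string_main = "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"

print(between(string_main, string1, string2))  # GGGUCCACACCAAGUCGUG
print(between(string_main, "NNNN", string2))   # None
```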

Answer 5 (score: 1)

For the input you have given, the following works.

f1 <- "AGGGCUUAGCUGCUUGUGAGCA"
f2 <- "UUCACAGUGGCUAAGUUCCGC"
f3 <- "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"
strsplit(f3, paste(f1, f2, sep='|'))[[1]][2]
# [1] "GGGUCCACACCAAGUCGUG"

Answer 6 (score: 1)

Using qdapRegex in R:

f1 <- "AGGGCUUAGCUGCUUGUGAGCA"
f2 <- "UUCACAGUGGCUAAGUUCCGC"
f3 <- "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"

library(qdapRegex)
rm_between(f3, f1, f2, extract=TRUE)

## [[1]]
## [1] "GGGUCCACACCAAGUCGUG"

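As the name suggests, rm_between removes or extracts the items between a left and a right boundary. Use extract = TRUE to return the strings between the boundaries. The returned value is a list because each string may yield more than one extraction. If that is undesirable, unlist the result: unlist(rm_between(f3, f1, f2, extract=TRUE))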
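
For comparison, the "one list per string, possibly with several extractions" shape can be reproduced in Python with `re.findall`. This is only a rough analogue under made-up inputs, not a description of how rm_between is implemented; the markers and strings below are hypothetical.

```python
import re
from itertools import chain

left, right = "AB", "YZ"  # hypothetical markers
texts = ["xxAB111YZ--AB222YZxx", "AB333YZ", "no markers here"]

pattern = re.escape(left) + "(.*?)" + re.escape(right)

# One list of extractions per input string (empty when nothing matches).
per_string = [re.findall(pattern, t) for t in texts]
print(per_string)  # [['111', '222'], ['333'], []]

# Flattened into a single list, similar in spirit to unlist().
flat = list(chain.from_iterable(per_string))
print(flat)  # ['111', '222', '333']
```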