Question

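I am working on a MATLAB function that compares two gene sequences and determines their similarity. To do this, I split both sequences into smaller substrings by sliding along each one in a for loop, advancing one nucleotide at a time and appending each substring to a cell array.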

So, for example, with a substring length of 4, the string ATGCAAAT would not be split into

ATGC, AAAT

but rather into

ATGC, TGCA, GCAA, CAAA, AAAT

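I am trying to make the function run faster, and since these two for loops account for almost 90% of the execution time, I would like to know whether there is a faster way to do this in MATLAB.

Here is the code I am currently using: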

 SubstrSequence1 = {};                                                
 SubstrSequence2 = {};
 for i = 1:length(Sequence1)-(SubstringLength-1)                
     SubstrSequence1 = [SubstrSequence1, Sequence1(i:i+SubstringLength-1)];
 end

 for i = 1:length(Sequence2)-(SubstringLength-1)                
     SubstrSequence2 = [SubstrSequence2, Sequence2(i:i+SubstringLength-1)]; 
 end

Answer 1

How about this?

str = 'ATGCAAAT';
n = 4;
strs = str(bsxfun(@plus, 1:n, (0:numel(str)-n).'));

The result is a 2D char array:

strs =
ATGC
TGCA
GCAA
CAAA
AAAT

So the substrings are strs(1,:), strs(2,:), and so on.

If you want the result as a cell array of strings, add this at the end:

strs = cellstr(strs);

which produces

strs = 
    'ATGC'
    'TGCA'
    'GCAA'
    'CAAA'
    'AAAT'

The substrings are then strs{1}, strs{2}, and so on.
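
On MATLAB R2016b or later, the same index matrix can be built without bsxfun, because arithmetic operators expand a row vector and a column vector against each other automatically (implicit expansion). A minimal sketch of that variant, assuming such a release is available; the variable name idx is only illustrative:

```matlab
% Assumes MATLAB R2016b or later (implicit expansion).
str = 'ATGCAAAT';
n = 4;
% Row k of idx holds the positions k, k+1, ..., k+n-1.
idx = (1:n) + (0:numel(str)-n).';
% Indexing a char row vector with a matrix returns a char matrix of the same shape.
strs = str(idx);
```

On older releases the bsxfun form above is the one to use.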

Answer 2

Here is one approach using hankel to get SubstrSequence1 -

A = 1:numel(Sequence1);
out = cellstr(Sequence1(hankel(A(1:SubstringLength),A(SubstringLength:end)).'))

You can follow the same steps to get SubstrSequence2.

Sample run -

>> Sequence1 = 'ATGCAAAT';
>> SubstringLength = 4;
>> A = 1:numel(Sequence1);
>> cellstr(Sequence1(hankel(A(1:SubstringLength),A(SubstringLength:end)).'))
ans = 
    'ATGC'
    'TGCA'
    'GCAA'
    'CAAA'
    'AAAT'
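
To avoid repeating the expression for each sequence, the hankel indexing can be wrapped in an anonymous function. This is only a sketch: the helper name substrs and the second sequence are made up for illustration and do not come from the question.

```matlab
% Hypothetical helper: returns all length-k substrings of the char row vector s.
% hankel(1:k, k:numel(s)) has first column 1:k and last row k:numel(s);
% transposing it puts one substring's indices in each row.
substrs = @(s, k) cellstr(s(hankel(1:k, k:numel(s)).'));

Sequence1 = 'ATGCAAAT';
Sequence2 = 'GGATGCAA';   % made-up second sequence, for illustration only
SubstringLength = 4;

SubstrSequence1 = substrs(Sequence1, SubstringLength);
SubstrSequence2 = substrs(Sequence2, SubstringLength);
```

Note that cellstr returns a column cell array, whereas the loop in the question builds a row; append .' to the result if the orientation matters.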

Answer 3

One approach is to generate an index matrix that extracts the required substrings:
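
The code that accompanied this answer is not included in this copy. As a stand-in, here is a minimal sketch of the general idea, not the answerer's own code: build the index matrix explicitly with repmat and index the sequence with it. The variable names nSub, starts, offsets, and idx are assumptions made for this example.

```matlab
Sequence1 = 'ATGCAAAT';
SubstringLength = 4;

% Number of substrings produced by sliding one nucleotide at a time.
nSub = numel(Sequence1) - SubstringLength + 1;

% starts(k,:) is the zero-based offset of the k-th substring.
starts = repmat((0:nSub-1).', 1, SubstringLength);
% offsets(k,:) is the position within a substring: 1, 2, ..., SubstringLength.
offsets = repmat(1:SubstringLength, nSub, 1);

% nSub-by-SubstringLength index matrix; row k selects the k-th substring.
idx = starts + offsets;

% Convert the char matrix to a 1-by-nSub cell row, like the question's loop.
SubstrSequence1 = cellstr(Sequence1(idx)).';
```

Because the whole index matrix is computed up front, the cell array is created in one step instead of being grown inside a loop.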

Title

The most efficient way to split a long string into substrings in MATLAB