Calculating probabilities in a cell array

Date: 2014-03-26 07:34:17

Tags: matlab matrix probability markov

Hey, I have a cell array whose second column holds the number of times each 'XX->XX' transition occurs, for example:

'AA->AA'    [21]    [4.2084]
'AA->AC'    [15]    [3.0060]
'AA->AG'    [ 9]    [1.8036]
'AA->AT'    [12]    [2.4048]
'AC->CA'    [14]    [2.8056]
'AC->CC'    [16]    [3.2064]
'AC->CG'    [ 5]    [1.0020]
'AC->CT'    [ 3]    [0.6012]
'AG->GA'    [11]    [2.2044]
'AG->GC'    [ 5]    [1.0020]
'AG->GG'    [ 8]    [1.6032]
'AG->GT'    [13]    [2.6052]
'AT->TA'    [10]    [2.0040]
'AT->TC'    [ 8]    [1.6032]
'AT->TG'    [ 2]    [0.4008]
'AT->TT'    [11]    [2.2044]
'CA->AA'    [17]    [3.4068]
'CA->AC'    [ 7]    [1.4028]
'CA->AG'    [ 9]    [1.8036]
'CA->AT'    [11]    [2.2044]
'CC->CA'    [15]    [3.0060]
'CC->CC'    [ 5]    [1.0020]
'CC->CG'    [ 4]    [0.8016]
'CC->CT'    [17]    [3.4068]
'CG->GA'    [ 1]    [0.2004]
'CG->GC'    [ 2]    [0.4008]
'CG->GG'    [ 9]    [1.8036]
'CG->GT'    [ 3]    [0.6012]
'CT->TA'    [ 7]    [1.4028]
'CT->TC'    [ 9]    [1.8036]
'CT->TG'    [ 9]    [1.8036]
'CT->TT'    [ 2]    [0.4008]
'GA->AA'    [10]    [2.0040]
'GA->AC'    [ 4]    [0.8016]
'GA->AG'    [10]    [2.0040]
'GA->AT'    [ 2]    [0.4008]
'GC->CA'    [ 2]    [0.4008]
'GC->CC'    [ 7]    [1.4028]
'GC->CG'    [ 6]    [1.2024]
'GC->CT'    [ 3]    [0.6012]
'GG->GA'    [ 6]    [1.2024]
'GG->GC'    [ 6]    [1.2024]
'GG->GG'    [ 4]    [0.8016]
'GG->GT'    [ 8]    [1.6032]
'GT->TA'    [ 6]    [1.2024]
'GT->TC'    [11]    [2.2044]
'GT->TG'    [ 8]    [1.6032]
'GT->TT'    [ 5]    [1.0020]
'TA->AA'    [ 8]    [1.6032]
'TA->AC'    [13]    [2.6052]
'TA->AG'    [ 9]    [1.8036]
'TA->AT'    [ 6]    [1.2024]
'TC->CA'    [13]    [2.6052]
'TC->CC'    [13]    [2.6052]
'TC->CT'    [ 4]    [0.8016]
'TG->GA'    [ 8]    [1.6032]
'TG->GC'    [ 5]    [1.0020]
'TG->GG'    [ 3]    [0.6012]
'TG->GT'    [ 6]    [1.2024]
'TT->TA'    [13]    [2.6052]
'TT->TC'    [ 2]    [0.4008]
'TT->TG'    [ 3]    [0.6012]
'TT->TT'    [ 5]    [1.0020]

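Now I am trying to calculate the probabilities: P('AA->AA') = TIMES('AA->AA') / SUM(TIMES('AA->AA'), TIMES('AA->AC'), TIMES('AA->AG'), TIMES('AA->AT')), in other words P('AA->AA') = TIMES('AA->AA') / SUM(TIMES('AA->any')), and likewise for every other transition. I was thinking of doing this with a loop, but there is an edge case in the data:
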
'TC->CA'    [13]    [2.6052]
'TC->CC'    [13]    [2.6052]
'TC->CT'    [ 4]    [0.8016]
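Here the count for 'TC->CG' is evidently 0, and that row needs to be included too, even though we already know its probability should be 0. This kind of gap can of course show up anywhere: sometimes 'TT->TT' may be missing, sometimes 'TC->CT'. Does anyone know how to do this? Thanks.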
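
As a worked instance of the formula above, using only the counts listed in the question and treating the unlisted 'TC->CG' as a count of 0:

```latex
\begin{align*}
P(\mathrm{AA}\to\mathrm{AA}) &= \frac{21}{21+15+9+12} = \frac{21}{57} \approx 0.368,\\
P(\mathrm{TC}\to\mathrm{CG}) &= \frac{0}{13+13+0+4} = \frac{0}{30} = 0.
\end{align*}
```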

1 Answer:

Answer 0 (score: 1)

Try this:

%%// Get the cell data into data1
data1 = INPUT_DATA;

%%// Get the data from columns separately
col1 = data1(:,1);
tag_data = vertcat(col1{:});

col2 = data1(:,2);
times_data = vertcat(col2{:});

col3 = data1(:,3);
col3_data = vertcat(col3{:});

%%// Get full data for tag, times and column3
char_array = ['A' 'C' 'G' 'T'];
full_tag_data = char_array(combinator(4,3,'p','r'));
full_tag_data = [full_tag_data(:,1:2) repmat('->',[size(full_tag_data,1) 1]) full_tag_data(:,2:3)];

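%%// Mark which of the 64 possible tags occur in the input. The assignments
%%// below assume the rows of data1 are in the same order as full_tag_data.
present_rows = ismember(full_tag_data,tag_data,'rows');
full_times_data = double(present_rows);
full_times_data(present_rows) = times_data;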

full_col3_data = double(present_rows);
full_col3_data(present_rows) = col3_data;

%%// Sum the times within each group of four rows sharing the same source pair
%%// (the denominator must be the summed times, not the summed third column)
full_times_data_summed = sum(reshape(full_times_data,4,[]),1);
full_times_data_summed = reshape(repmat(full_times_data_summed,[4 1]),[],1);

%%// Store the required values into a cell array out_cell1
out_cell1 = cell(size(present_rows,1),4);
out_cell1(:,1) = cellstr(full_tag_data);
out_cell1(:,2) = num2cell(full_times_data);
out_cell1(:,3) = num2cell(full_col3_data);

%%// The probabilities are added into the cell array as the fourth column
out_cell1(:,4) = num2cell(full_times_data./full_times_data_summed);

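Note: the code above uses the function combinator, which is not a MATLAB built-in and has to be downloaded separately (it is available on the MATLAB File Exchange).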
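
If combinator is not at hand, the same idea can be sketched with built-in functions only. This is an illustrative variant rather than the answerer's code: the names letters, full_tags, counts, denom and probs are made up for this sketch, the sample data1 is just the three 'TC' rows from the question, and it assumes every tag has the form 'XY->YZ' over the letters A, C, G, T. Matching tags by name with ismember also means the input rows do not have to be sorted.

```matlab
% Sketch: transition probabilities using built-in functions only.
% Assumption: data1 is an N-by-3 cell array laid out as in the question.
data1 = {'TC->CA', 13, 2.6052; ...
         'TC->CC', 13, 2.6052; ...
         'TC->CT',  4, 0.8016};          % sample rows; 'TC->CG' is missing

letters = 'ACGT';

% Enumerate all 64 triples X,Y,Z with Z varying fastest, so that each
% block of 4 consecutive rows shares the same source pair XY.
[z, y, x] = ndgrid(1:4, 1:4, 1:4);
triples   = letters([x(:) y(:) z(:)]);   % 64-by-3 char array
full_tags = cellstr([triples(:,1:2) repmat('->', 64, 1) triples(:,2:3)]);

% Place each observed count at the row of its tag; missing tags stay 0.
[found, idx] = ismember(data1(:,1), full_tags);
counts = zeros(64, 1);
counts(idx(found)) = cell2mat(data1(found, 2));

% Denominator: total count of each source pair, repeated for its 4 rows.
group_sums = sum(reshape(counts, 4, []), 1);
denom      = reshape(repmat(group_sums, 4, 1), [], 1);

% Divide only where the source pair was observed, to avoid 0/0 = NaN.
probs = zeros(64, 1);
seen  = denom > 0;
probs(seen) = counts(seen) ./ denom(seen);

out_cell = [full_tags, num2cell(counts), num2cell(probs)];
```

With the three sample rows, the row for 'TC->CG' appears in out_cell with a count of 0 and a probability of 0, and the row for 'TC->CA' gets 13/30. Source pairs that never occur at all are given probability 0 here instead of NaN, which is a design choice of this sketch and may not be what every use case wants.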