I am working on document clustering code in MATLAB. My document is:
'The first step in analyzing the requirements is to construct an object model.
It describes real world object classes and their relationships to each other.
Information for the object model comes from the problem statement, expert knowledge of the application domain, and general knowledge of the real world.
Britvic plc is one of the leading soft drinks manufacturers of soft drinks in the Beverages Sector functioning in Europe with its distribution branches in Great Britain, Ireland and France. '
As you can see, these paragraphs contain data from different categories. Here is my main program:
% read the whole document into one string
% (fileread opens and closes the file itself, so no fopen is needed)
text=fileread('doc1.txt');
% text now holds the content of doc1 as a string. Next, split it into
% sentences and words. For that we call the clustering function
[C1,C2]=clustering(text)
Here is the code for 'clustering':
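function [C1,C2]=clustering(text)
% splits text into sentences and returns them grouped into two clusters, C1 and C2
% (C1 is an output argument, so it does not need to be declared global)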
text1=strsplit(text,'.');
[rt1,ct1]=size(text1);
for i=1:(ct1-1)
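var=text1{i};
% lower-case the sentence and keep only word tokens, so that newlines,
% commas and capital letters do not hide stop words or matching words
vv=regexp(lower(var),'\w+','match');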
text2=setdiff(vv,{'this','you','is','an','with','as','well','like','and','to','it','on','off','of','in','mine','your','yours','these','this','will','would','shall','should','or','a','about','all','also','am','are','but','of','for','by','my','did','do','her','his','the','him','she','he','they','that','when','we','us','not','them','if','in','just','may','not'},'stable');
[rt2,ct2]=size(text2);
for r=1:ct2
tmar=porterStemmer(text2{r});
mapr{i,r}=tmar;
end
end
[mr,mc]=size(mapr);
mapr
A=zeros(mr,mr);
for i=1:mr
for j=1:mc
for m=i+1:mr
for k=1:mc
if ~isempty(mapr{i,j})
%if(~(mapr{i,j}=='[]'))
%mapr(i,j)
if strcmp(mapr{i,j},mapr{m,k})
p=mapr{i,j};
fprintf('Sentences %d and %d match\n',i,m);
fprintf('And the word is : %s\n',p);
A(i,m)=1;
A(m,i)=1;
end
end
end
end
end
end
sprintf('Adjacency matrix is:')
A
sprintf('The corresponding diagonal matrix is:')
[ar,ac]=size(A);
for i=1:ar
B(i)=0;
for j=1:ac
B(i)=B(i)+A(i,j);
end
end
[br,bc]=size(B);
D=zeros(bc,bc);
for i=1:bc
D(i,i)=B(i);
end
D
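sprintf('The similarity matrix is:')
C=D-A
% note: from here on D holds the eigenvalues of C on its diagonal
% (it overwrites the degree matrix built above)
[V,D]=eig(C,'nobalance')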
F=inv(V);
V*D*F
%mvar =no of edges/total degree of vertices
no_of_edges=0;
for i=1:ar
for j=1:ac
if(i<=j)
no_of_edges=no_of_edges+A(i,j);
end
end
end
no_of_edges;
tdv=0;
for i=1:bc
tdv=tdv+B(i);
end
tdv;
mvar=no_of_edges/tdv
[dr,dc]=size(D);
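% find the eigenvalue closest to mvar; start from the first one so that
% q is defined even when no later eigenvalue is closer
temp=abs(D(1,1)-mvar);
x=D(1,1);
q=1;
for i=2:dc
temp2=abs(D(i,i)-mvar);
if temp>temp2
temp=temp2;
x=D(i,i);
q=i;
end
end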
x
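[vr,vc]=size(V);
% copy the eigenvector that belongs to the selected eigenvalue
Track=zeros(1,vr);
for i=1:vr
Track(i)=V(i,q);
end
sprintf('Eigenvector corresponding to the closest eigenvalue:')
Track
% sentences with a negative component go to C1, all others to C2
C1=' ';
C2=' ';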
for i=1:vr
if(Track(i)<0)
C1=strcat(C1,text1{1,i},'.');
else
C2=strcat(C2,text1{1,i},'.');
end
end
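The degree matrix, the D-A matrix and the edge ratio that the function above builds with loops can also be obtained with built-in functions. The snippet below is only a sketch on a small hand-made adjacency matrix; A_demo and the other *_demo names are made up for illustration and are not taken from the document data:

```matlab
% Hypothetical adjacency matrix for 4 sentences:
% sentences 1 and 2 share a word, sentences 3 and 4 share a word
A_demo = [0 1 0 0;
          1 0 0 0;
          0 0 0 1;
          0 0 1 0];

deg     = sum(A_demo, 2);          % degree of each sentence (row sums), like B
D_demo  = diag(deg);               % diagonal degree matrix, like D
L_demo  = D_demo - A_demo;         % same construction as C = D - A
[V_demo, E_demo] = eig(L_demo);    % columns of V_demo are the eigenvectors

n_edges = sum(sum(triu(A_demo)));  % every undirected edge counted once
ratio   = n_edges / sum(deg);      % the quantity called mvar in the function
```

Because every edge contributes to the degree of two sentences, the total degree is always twice the number of edges, so this ratio is 0.5 for any adjacency matrix that contains at least one edge.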
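I am able to generate the initial two clusters from the document. However, I would like the clustering process to continue on the generated clusters, producing more and more sub-clusters of each one, until the resulting set of clusters no longer changes. Could someone help me implement this, so that I can not only generate the clusters but also keep track of them for further processing? Thanks in advance.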
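One possible shape for such a repeated split is sketched below. It is not a tested solution, only an illustration under assumptions. The function name splitRecursively, the tolerance tol and the idea of passing sentence numbers instead of sentence text are all hypothetical. The function reuses the same rule as the code above (sign of the eigenvector whose eigenvalue is closest to edges / total degree), applies it again to each half, and stops when a split would leave one side empty, that is, when nothing changes any more:

```matlab
function clusters = splitRecursively(A, idx)
% Hypothetical helper, not part of the original code.
% A   : full adjacency matrix of all sentences (0/1, symmetric)
% idx : sentence numbers that form the cluster to be split
% Returns a 1-by-k cell array; each cell holds the sentence numbers
% of one final cluster, so the clusters can be tracked afterwards.
idx = idx(:).';                          % force a row vector
tol = 1e-10;                             % assumed tolerance for "negative"
if numel(idx) < 2
    clusters = {idx};                    % a single sentence cannot be split
    return
end
sub = A(idx, idx);                       % adjacency restricted to this cluster
deg = sum(sub, 2);                       % degrees inside the cluster
if sum(deg) == 0
    clusters = {idx};                    % no shared words at all: stop
    return
end
mvar = sum(sum(triu(sub))) / sum(deg);   % edges / total degree, as above
[V, E] = eig(diag(deg) - sub);           % eigen-decomposition of D - A
[~, q] = min(abs(diag(E) - mvar));       % eigenvalue closest to mvar
f = V(:, q).';                           % its eigenvector as a row
neg = idx(f < -tol);                     % sentences with a negative component
pos = idx(f >= -tol);                    % all remaining sentences
if isempty(neg) || isempty(pos)
    clusters = {idx};                    % the split changes nothing: stop
else
    % split both halves again and collect all resulting clusters
    clusters = [splitRecursively(A, neg), splitRecursively(A, pos)];
end
end
```

Assuming clustering were changed to also return A and text1, a call such as clusters = splitRecursively(A, 1:size(A,1)) would give the sentence numbers of every final cluster, and strjoin(text1(clusters{k}), '.') would rebuild the text of cluster k. Since the sign of an eigenvector returned by eig is arbitrary, which half ends up as "negative" can differ between runs or MATLAB versions.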