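How do I use the Mallet API to create instances from a file that describes feature-value pairs?

Asked: 2015-03-10 16:20:16

Tags: java lda topic-modeling mallet

I want to run LDA to generate some topics from a txt file like the following: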

Document1 label1 forest = 3.4 tree = 5 wood = 2.85 hammer = 1 color = 1 leaf = 1.5

Document2 label2 forest = 10 tree = 5 wood = 2.75 hammer = 1 color = 4 leaf = 1

Document3 label3 forest = 19 tree = 0.90 wood = 2 hammer = 2 color = 9 leaf = 4.3

Document4 label4 forest = 4 tree = 5 wood = 10 hammer = 1 color = 6 leaf = 3

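Each numeric value in the file is the number of occurrences of the feature (e.g., forest, tree) multiplied by a given penalty.

To generate instances from such a file, I use the following Java code: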



String lineRegex = "^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$";

String dataRegex = "[\\p{L}([0-9]*\\.[0-9]+|[0-9]+)_\\=]+";

InstanceList generateInstances(String dataPath) throws UnsupportedEncodingException, FileNotFoundException {

    ArrayList<Pipe> pipeList = new ArrayList<Pipe>();

    pipeList.add(new Target2Label());
    pipeList.add(new CharSequenceLowercase());
    pipeList.add(new Input2CharSequence());
    pipeList.add(new CharSequence2TokenSequence(Pattern.compile(dataRegex)));
    /*pipeList.add(new TokenSequenceRemoveStopwords(new File(stopwordListPath), "UTF-8",
                                                   false, false, false));*/
    pipeList.add(new TokenSequenceParseFeatureString(true, true, "="));
    pipeList.add(new PrintInputAndTarget());

    InstanceList instances = new InstanceList(new SerialPipes(pipeList));

    Reader fileReader = new InputStreamReader(new FileInputStream(new File(dataPath)),
                                              "UTF-8");

    instances.addThruPipe(new CsvIterator(fileReader, Pattern.compile(lineRegex),
                                          3, 2, 1));

    return instances;
}

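I then add the instances generated this way to my model with the statement model.addInstances(generatedInstances). The resulting output is shown below. It includes an error caused by that statement. Debugging my code showed me that the alphabet associated with the model is empty. Am I using the wrong iterator? Can anyone help me fix my code?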

name: document1
target: label1
input: TokenSequence [forest=3.4 feature(forest)=3.4  span[0..10], tree=5 feature(tree)=5.0  span[11..17], wood=2.85 feature(wood)=2.85  span[18..27], hammer=1 feature(hammer)=1.0  span[28..36], colour=1 feature(colour)=1.0  span[37..45], leaf=1.5 feature(leaf)=1.5  span[46..54]]
Token#0:forest=3.4 feature(forest)=3.4  span[0..10]
Token#1:tree=5 feature(tree)=5.0  span[11..17]
Token#2:wood=2.85 feature(wood)=2.85  span[18..27]
Token#3:hammer=1 feature(hammer)=1.0  span[28..36]
Token#4:colour=1 feature(colour)=1.0  span[37..45]
Token#5:leaf=1.5 feature(leaf)=1.5  span[46..54]

name: document2
target: label2
input: TokenSequence [forest=10 feature(forest)=10.0  span[0..9], tree=5 feature(tree)=5.0  span[10..16], wood=2.75 feature(wood)=2.75  span[17..26], hammer=1 feature(hammer)=1.0  span[27..35], colour=4 feature(colour)=4.0  span[36..44], leaf=1 feature(leaf)=1.0  span[45..51]]
Token#0:forest=10 feature(forest)=10.0  span[0..9]
Token#1:tree=5 feature(tree)=5.0  span[10..16]
Token#2:wood=2.75 feature(wood)=2.75  span[17..26]
Token#3:hammer=1 feature(hammer)=1.0  span[27..35]
Token#4:colour=4 feature(colour)=4.0  span[36..44]
Token#5:leaf=1 feature(leaf)=1.0  span[45..51]

name: document3
target: label3
input: TokenSequence [forest=19 feature(forest)=19.0  span[0..9], tree=0.90 feature(tree)=0.9  span[10..19], wood=2 feature(wood)=2.0  span[20..26], hammer=2 feature(hammer)=2.0  span[27..35], colour=9 feature(colour)=9.0  span[36..44], leaf=4.3 feature(leaf)=4.3  span[45..53]]
Token#0:forest=19 feature(forest)=19.0  span[0..9]
Token#1:tree=0.90 feature(tree)=0.9  span[10..19]
Token#2:wood=2 feature(wood)=2.0  span[20..26]
Token#3:hammer=2 feature(hammer)=2.0  span[27..35]
Token#4:colour=9 feature(colour)=9.0  span[36..44]
Token#5:leaf=4.3 feature(leaf)=4.3  span[45..53]

name: document4
target: label4
input: TokenSequence [forest=4 feature(forest)=4.0  span[0..8], tree=5 feature(tree)=5.0  span[9..15], wood=10 feature(wood)=10.0  span[16..23], hammer=1 feature(hammer)=1.0  span[24..32], colour=6 feature(colour)=6.0  span[33..41], leaf=3 feature(leaf)=3.0  span[42..48]]
Token#0:forest=4 feature(forest)=4.0  span[0..8]
Token#1:tree=5 feature(tree)=5.0  span[9..15]
Token#2:wood=10 feature(wood)=10.0  span[16..23]
Token#3:hammer=1 feature(hammer)=1.0  span[24..32]
Token#4:colour=6 feature(colour)=6.0  span[33..41]
Token#5:leaf=3 feature(leaf)=3.0  span[42..48]

Coded LDA: 5 topics, 3 topic bits, 111 topic mask
Exception in thread "main" java.lang.NullPointerException
at cc.mallet.topics.ParallelTopicModel.addInstances(ParallelTopicModel.java:217)
at mallet.examples.TopicModel3.runLDA(MyTopicModel.java:106)
at mallet.examples.TopicModel3.main(MyTopicModel.java:57)

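Thanks in advance.

1 Answer:

Answer 0: (score: 0)

Here are the input formats that Mallet uses: http://mallet.cs.umass.edu/import.php

Your data is loosely in SVMLight format, which looks like this: "target feature:value feature:value ..."

Unfortunately, you cannot use this format for topic modeling (LDA). LDA uses a FeatureSequence, not a FeatureVector. What you can do is convert your input into a bag of words. For example, if you have Document2 label2 forest = 3 tree = 2 ..., convert it to: Document2 label2 forest forest forest tree tree ...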
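The conversion described above could be sketched roughly as follows. This is only an illustration under some assumptions: BagOfWordsConverter and convertLine are hypothetical names (not part of MALLET), each line is assumed to start with a name and a label, and fractional weights are rounded to the nearest whole count, which is a choice the question does not specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of MALLET): expands "feature = weight" pairs into repeated words. */
class BagOfWordsConverter {

    // Matches pairs such as "forest = 3.4" or "tree=5"; whitespace around '=' is optional.
    private static final Pattern PAIR =
            Pattern.compile("([\\p{L}_]+)\\s*=\\s*([0-9]*\\.[0-9]+|[0-9]+)");

    /**
     * Converts "name label f1 = w1 f2 = w2 ..." into "name label f1 f1 ... f2 ...".
     * Assumption: each weight is rounded to the nearest integer, because a
     * word can only be repeated a whole number of times.
     */
    static String convertLine(String line) {
        String[] parts = line.trim().split("\\s+", 3);
        if (parts.length < 3) {
            return line.trim(); // no feature section, nothing to expand
        }
        List<String> words = new ArrayList<>();
        words.add(parts[0]); // document name
        words.add(parts[1]); // label
        Matcher m = PAIR.matcher(parts[2]);
        while (m.find()) {
            long count = Math.round(Double.parseDouble(m.group(2)));
            for (long i = 0; i < count; i++) {
                words.add(m.group(1)); // repeat the feature name "count" times
            }
        }
        return String.join(" ", words);
    }
}
```

Calling BagOfWordsConverter.convertLine("Document2 label2 forest = 3 tree = 2") returns "Document2 label2 forest forest forest tree tree". The converted lines can then be fed through a pipeline that tokenizes plain words and ends with TokenSequence2FeatureSequence (instead of TokenSequenceParseFeatureString), which is the kind of pipeline MALLET's topic modeling developer guide uses to produce the FeatureSequence that LDA expects.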