stanford-nlp naive Bayes classifier training

Time: 2017-10-09 12:10:24

Tags: stanford-nlp naivebayes

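As part of learning how to use the Stanford NLP API for classification, I am training a naive Bayes classifier on a very simple training set (3 labels => ['happy', 'sad', 'neutral']). The training data set is: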

happy   happy
happy   glad
sad gloomy
neutral fine

This is part of the output from training the classifier (before the error):
numDatumsPerLabel: {happy=2.0, sad=1.0, neutral=1.0}
numLabels: 3 [happy, sad, neutral]
numFeatures (Phi(X) types): 4 [1-SW-happy, 1-SW-glad, 1-SW-gloomy, 1-SW-fine]

I get an array index out of bounds error. The stack trace is attached below. I have not been able to find the problem.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainWeightsJL(NaiveBayesClassifierFactory.java:171)
    at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainWeights(NaiveBayesClassifierFactory.java:146)
    at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainClassifier(NaiveBayesClassifierFactory.java:84)
    at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainClassifier(NaiveBayesClassifierFactory.java:352)
    at edu.stanford.nlp.classify.ColumnDataClassifier.makeClassifier(ColumnDataClassifier.java:1458)
    at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(ColumnDataClassifier.java:2091)
    at edu.stanford.nlp.classify.demo.ClassifierDemo.main(ClassifierDemo.java:35)

It happens while the weights are being computed, in this method:

 private NBWeights trainWeightsJL(int[][] data, int[] labels, int numFeatures, int numClasses) {
    int[] numValues = numberValues(data, numFeatures);
    double[] priors = new double[numClasses];
    double[][][] weights = new double[numClasses][numFeatures][];
    //init weights array
    for (int cl = 0; cl < numClasses; cl++) {
      for (int fno = 0; fno < numFeatures; fno++) {
        weights[cl][fno] = new double[numValues[fno]];
//        weights[cl][fno] = new double[numFeatures];
      }
    }
    for (int i = 0; i < data.length; i++) {
      priors[labels[i]]++;
      for (int fno = 0; fno < numFeatures; fno++) {
        weights[labels[i]][fno][data[i][fno]]++;
      }
    }
    for (int cl = 0; cl < numClasses; cl++) {
      for (int fno = 0; fno < numFeatures; fno++) {
        for (int val = 0; val < numValues[fno]; val++) {
          weights[cl][fno][val] = Math.log((weights[cl][fno][val] + alphaFeature) / (priors[cl] + alphaFeature * numValues[fno]));
        }
      }
      priors[cl] = Math.log((priors[cl] + alphaClass) / (data.length + alphaClass * numClasses));
    }
    return new NBWeights(priors, weights);
  }

I can't understand what

int[] numValues = numberValues(data, numFeatures);

does. The error comes from

weights[labels[i]][fno][data[i][fno]]++;

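I originally assumed that weights would be a two-dimensional array tracking how often each feature (fno) occurs for each class (label). I am not sure why a third dimension is needed.

Any help is greatly appreciated.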

1 Answer:

Answer 0 (score: 0)

I had no problems with these properties:

#
# Features
#
useClassFeature=true
1.useNGrams=true
1.usePrefixSuffixNGrams=true
1.maxNGramLeng=4
1.minNGramLeng=1
1.binnedLengths=10,20,30
#
# Printing
#
# printClassifier=HighWeight
printClassifierParam=200
#
# Mapping
#
goldAnswerColumn=0
displayedColumn=1
#
# Optimization
#
intern=true
sigma=3
useQN=true
QNsize=15
tolerance=1e-4
useNB=true
useClass=true
#
# Training input
#
trainFile=simple-classifier-training-set.txt
serializeTo=model.txt

Running this command:

java -Xmx8g edu.stanford.nlp.classify.ColumnDataClassifier -prop example.prop
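
On the question of the third dimension: in the quoted trainWeightsJL, the array is indexed as weights[class][feature][value], so the last index is the value that a feature takes in a given example, and numValues[fno] sizes that last dimension. The sketch below reproduces only the counting part of that method on toy data. The class name WeightsSketch, the countWeights method, the dense 0/1 encoding of the four words, and the body of numberValues (assumed here to return the largest value seen plus one for each feature) are all illustrative assumptions, not the library's actual source.

```java
import java.util.Arrays;

public class WeightsSketch {

    // Hypothetical stand-in for numberValues: for each feature column,
    // the number of distinct values, assumed to be (largest value seen + 1).
    public static int[] numberValues(int[][] data, int numFeatures) {
        int[] numValues = new int[numFeatures];
        for (int[] row : data) {
            for (int fno = 0; fno < numFeatures; fno++) {
                numValues[fno] = Math.max(numValues[fno], row[fno] + 1);
            }
        }
        return numValues;
    }

    // Counts how often each (class, feature, value) triple occurs,
    // mirroring the first two loops of the quoted trainWeightsJL.
    public static double[][][] countWeights(int[][] data, int[] labels,
                                            int numFeatures, int numClasses) {
        int[] numValues = numberValues(data, numFeatures);
        double[][][] weights = new double[numClasses][numFeatures][];
        for (int cl = 0; cl < numClasses; cl++) {
            for (int fno = 0; fno < numFeatures; fno++) {
                weights[cl][fno] = new double[numValues[fno]];
            }
        }
        for (int i = 0; i < data.length; i++) {
            for (int fno = 0; fno < numFeatures; fno++) {
                // data[i][fno] is the VALUE of feature fno in example i
                weights[labels[i]][fno][data[i][fno]]++;
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Assumed dense encoding: one column per word, 1 = present, 0 = absent.
        // Columns: happy, glad, gloomy, fine.
        int[][] data = {
            {1, 0, 0, 0},   // "happy"  -> label happy
            {0, 1, 0, 0},   // "glad"   -> label happy
            {0, 0, 1, 0},   // "gloomy" -> label sad
            {0, 0, 0, 1}    // "fine"   -> label neutral
        };
        int[] labels = {0, 0, 1, 2}; // 0 = happy, 1 = sad, 2 = neutral

        double[][][] weights = countWeights(data, labels, 4, 3);
        System.out.println(Arrays.toString(numberValues(data, 4)));
        // Class "happy", feature "happy": one example with value 0, one with value 1.
        System.out.println(Arrays.toString(weights[0][0]));
    }
}
```

With this encoding every feature has two values, so each weights[cl][fno] has length 2 and holds the counts for "absent" and "present" separately, which a two-dimensional array could not do. Note also that the loop reads data[i][fno] for every fno below numFeatures, so each row must have numFeatures entries. If the rows were instead short lists of active feature indices (an unverified guess about what happens in the failing run), data[i][1] on a one-element row would throw ArrayIndexOutOfBoundsException: 1.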