Question

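I have a course project to complete. I am using Weka 3.8 and I need to classify text. The results need to be as accurate as possible. We were given train and test .arff files. We of course need to train on the train file and then have the model classify the test file. The professor uploaded a 100% accurate classification of the test file. We upload our own results, and the system then compares the two files. So far I have been using a FilteredClassifier made up of SMO and StringToWordVector with the Snowball stemmer, but for some reason I can't get better than 65.9% accuracy (this is not the accuracy on a split, but the accuracy I get when the system compares my results with the 100% accurate results). I can't figure out why.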
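To make clear what the 65.9% figure means, here is a minimal Java sketch of the comparison, assuming the system simply counts the rows where the submitted label equals the reference label. How the grading system really works is not known; `LabelAccuracy` and `accuracy` are hypothetical names used only for illustration.

```java
import java.util.List;

// Hypothetical helper: an assumed model of how the two result files are compared.
class LabelAccuracy {

    // Returns the fraction of positions where the predicted label
    // equals the reference label (assumes one label per test row, same order).
    static double accuracy(List<String> predicted, List<String> reference) {
        if (predicted.isEmpty() || predicted.size() != reference.size()) {
            throw new IllegalArgumentException("label lists must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < predicted.size(); i++) {
            if (predicted.get(i).equals(reference.get(i))) {
                correct++;
            }
        }
        return (double) correct / predicted.size();
    }

    public static void main(String[] args) {
        // Made-up labels, not taken from the real test file.
        List<String> mine = List.of("greek", "indian", "thai", "thai");
        List<String> truth = List.of("greek", "indian", "thai", "korean");
        System.out.println(accuracy(mine, truth)); // 3 of 4 labels match
    }
}
```

Under this assumption, the score is measured on the full test file, so it is not directly comparable to a cross-validation or percentage-split score computed on the train file.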

The train.arff file:

@relation train

@attribute index numeric
@attribute ingredients string
@attribute cuisine {greek,southern_us,filipino,indian,jamaican,spanish,italian,mexican,chinese,british,thai,vietnamese,cajun_creole,brazilian,french,japanese,irish,korean,moroccan,russian}

@data
0,'romaine lettuce;black olives;grape tomatoes;garlic;pepper;purple onion;seasoning;garbanzo beans;feta cheese crumbles',greek
1,'plain flour;ground pepper;salt;tomatoes;ground black pepper;thyme;eggs;green tomatoes;yellow corn meal;milk;vegetable oil',southern_us
2,'eggs;pepper;salt;mayonaise;cooking oil;green chilies;grilled chicken breasts;garlic powder;yellow onion;soy sauce;butter;chicken livers',filipino
3,'water;vegetable oil;wheat;salt',indian

... plus 4995 more rows like these.

test.arff is similar:

@relation test

@attribute index numeric
@attribute ingredients string
@attribute cuisine {greek,southern_us,filipino,indian,jamaican,spanish,italian,mexican,chinese,british,thai,vietnamese,cajun_creole,brazilian,french,japanese,irish,korean,moroccan,russian}

@data
0,'white vinegar;sesame seeds;english cucumber;sugar;extract;Korean chile flakes;shallots;garlic cloves;pepper;salt',?
1,'eggplant;fresh parsley;white vinegar;salt;extra-virgin olive oil;onions;tomatoes;feta cheese crumbles',?

... plus 4337 more rows like these. This is my Weka configuration:

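The professor told us that in some cases, some of the ingredients in the @data section of the .arff files are separated by ',' by accident, and that words which occur very often probably don't help much. I don't know whether that matters. Is there any way to improve the classification accuracy? Am I even using the right classifier for the job? Thanks in advance!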
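The separator issue can be sketched in plain Java, with no Weka involved. This is only an illustration under assumptions: it treats both ';' and ',' as ingredient separators (which would be wrong for any ingredient whose name legitimately contains a comma), and `IngredientTokens` and `split` are hypothetical names, not Weka API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Weka: splits one ingredients string
// into whole-ingredient tokens.
class IngredientTokens {

    // Splits on ';' and, as an assumption, also on ',' because some rows
    // are said to use ',' as the separator by accident.
    static List<String> split(String ingredients) {
        List<String> tokens = new ArrayList<>();
        for (String part : ingredients.split("[;,]")) {
            // Trim, lower-case and collapse runs of whitespace so that
            // differently written copies of one ingredient become equal.
            String token = part.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Row 3 of train.arff with one ';' deliberately replaced by ','.
        System.out.println(split("water;vegetable oil,wheat;salt"));
        // prints [water, vegetable oil, wheat, salt]
    }
}
```

Whether whole-ingredient tokens such as "vegetable oil" classify better than single words is untested here. Within Weka, something similar might be achievable by changing the delimiters of the tokenizer used by StringToWordVector, but that is not verified against this data.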

Which classifier should I use?

0 answers: