Question

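My first post here! I'm having trouble with the nltk NaiveBayesClassifier. I have a training set of 7000 items. Each training item has a description of 2 or 3 words and a code. I want to use the code as the class label and each word of the description as a feature. An example:

"My name is Obama", 001 ...

Training set = {[feature['My'] = True, feature['name'] = True, feature['is'] = True, feature['Obama'] = True], 001}

Unfortunately, with this approach, the training procedure NaiveBayesClassifier.train uses up to 3 GB of RAM. What is wrong with my approach? Thanks!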

from nltk import NaiveBayesClassifier, classify

def document_features(document):  # feature extractor
    document = set(document)
    return dict((w, True) for w in document)

...
words = set()
entries = []
train_set = []
train_length = 2000
readfile = open("atcname.pl", 'r')
t = readfile.readline()
while (t != ""):
    t = t.split("'")
    code = t[0]  # class
    desc = t[1]  # description
    s = desc.split()  # assumed tokenization: split the description into words
    words = words.union(s)  # update dictionary with the new words in the description
    entries.append((s, code))
    t = readfile.readline()
train_set = classify.util.apply_features(document_features, entries[:train_length])
classifier = NaiveBayesClassifier.train(train_set)  # Training

Answer 1

Use nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory.

from nltk.classify import apply_features

More information and examples here.
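To make "acts like a list but does not store all the feature sets" concrete, here is a minimal plain-Python sketch of the idea. The class name LazyFeatureList and the sample entries are hypothetical and only for illustration; this is not NLTK's actual implementation, just the general pattern of computing each feature set on access.

```python
class LazyFeatureList:
    """List-like view that computes each feature set on access.

    Hypothetical illustration only; it is not NLTK's implementation.
    """

    def __init__(self, feature_func, labeled_tokens):
        self._func = feature_func
        self._items = labeled_tokens  # only the raw (tokens, label) pairs are kept

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        # Integer indices only in this sketch (no slice support).
        tokens, label = self._items[index]
        return self._func(tokens), label  # built on demand, not stored

    def __iter__(self):
        for tokens, label in self._items:
            yield self._func(tokens), label


def document_features(tokens):
    # Same idea as the question's extractor: each distinct word maps to True.
    return {word: True for word in set(tokens)}


# Made-up sample data in the (words, code) shape used in the question.
entries = [(["My", "name", "is", "Obama"], "001"), (["Hello", "world"], "002")]
train_set = LazyFeatureList(document_features, entries)
```

Only one feature dictionary exists at a time while iterating over train_set, so memory use is dominated by the raw entries rather than by thousands of dictionaries.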

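You are loading the whole file into memory anyway; you will need some form of lazy loading, which loads the data only as it is needed. Consider looking into this.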
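One way to sketch the lazy loading suggested here is a generator that reads one line at a time and yields a (featureset, label) pair. The function name iter_labeled_features and the sample lines are hypothetical, and the line layout (code, then the description in single quotes) is only inferred from the split("'") call in the question. Whether a given trainer accepts a one-pass iterator instead of a list is an assumption to verify before relying on it.

```python
def document_features(tokens):
    """Map each distinct word to True, as in the question's extractor."""
    return {word: True for word in set(tokens)}


def iter_labeled_features(lines):
    """Yield (featureset, label) pairs one line at a time.

    Assumes each line looks like  code'description'  (the layout implied
    by the question's split("'") call); other lines are skipped.
    """
    for line in lines:
        parts = line.split("'")
        if len(parts) < 2:
            continue  # blank or malformed line
        code, desc = parts[0], parts[1]
        yield document_features(desc.split()), code


# Hypothetical sample standing in for the contents of atcname.pl.
sample_lines = ["001'My name is Obama'\n", "\n", "002'Hello world'\n"]
pairs = list(iter_labeled_features(sample_lines))
```

With a real file, passing the open file object (for example inside a with open("atcname.pl") block) instead of sample_lines keeps only the current line in memory, since file objects iterate line by line.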

NLTK Naive Bayes classifier memory issue
