Question

I want to build an RNN model with Keras to classify sentences.

I tried the following code:

from keras.preprocessing.text import Tokenizer

docs = []
with open('all_dga.txt', 'r') as f:
    for line in f.readlines():
        dga_domain, _ = line.split(' ')
        docs.append(dga_domain)

t = Tokenizer()
t.fit_on_texts(docs)
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)

But I got a MemoryError. It seems I cannot load all of the data into memory. Here is the output:

Traceback (most recent call last):
  File "test.py", line 11, in <module>
    encoded_docs = t.texts_to_matrix(docs, mode='count')
  File "/home/yurzho/anaconda3/envs/deepdga/lib/python3.6/site-packages/keras/preprocessing/text.py", line 273, in texts_to_matrix
    return self.sequences_to_matrix(sequences, mode=mode)
  File "/home/yurzho/anaconda3/envs/deepdga/lib/python3.6/site-packages/keras/preprocessing/text.py", line 303, in sequences_to_matrix
    x = np.zeros((len(sequences), num_words))
MemoryError

If anyone is familiar with Keras, please tell me how to preprocess this dataset.

Thanks in advance!

Answer 1

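The traceback shows that the error occurs in t.texts_to_matrix(docs, mode='count'), so t.fit_on_texts(docs), which builds the vocabulary, does not seem to be the problem.

So you could convert the file in batches: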
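To see why that call fails, it helps to estimate the size of the array it allocates. The traceback shows `np.zeros((len(sequences), num_words))`, which by default holds 8-byte float64 values. The sketch below does the arithmetic; the document and vocabulary counts are hypothetical, since the question does not state them.

```python
def dense_matrix_bytes(num_docs, num_words, itemsize=8):
    """Bytes needed for a dense (num_docs, num_words) array of float64 values."""
    return num_docs * num_words * itemsize


# Hypothetical sizes for illustration only: 1 million lines, 50,000 distinct tokens.
size = dense_matrix_bytes(1_000_000, 50_000)
print(f"{size / 1e9:.0f} GB")  # prints "400 GB"
```

Even a modest vocabulary makes the full matrix far larger than typical RAM, which is why the conversion has to happen in pieces.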

from keras.preprocessing.text import Tokenizer

t = Tokenizer()

with open('/Users/liling.tan/test.txt') as fin:
    for line in fin:
        t.fit_on_texts([line])  # Fit the tokenizer line by line, one line per document.

M = []

with open('/Users/liling.tan/test.txt') as fin:
    for line in fin:
        # Converting the lines into matrix, line-by-line.
        m = t.texts_to_matrix([line], mode='count')[0]
        M.append(m)

But you will still run into a MemoryError later on if your computer cannot hold that amount of data in memory.
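One way around that, sketched here as an assumption rather than as part of the answer above, is to never keep the full list M: read a fixed number of lines, convert them, use the result, and let it go before reading the next batch. The helper name `batch_matrices` and the `to_matrix` parameter are hypothetical; `to_matrix` stands for any callable that turns a list of texts into a matrix.

```python
from itertools import islice


def batch_matrices(path, to_matrix, batch_size=1024):
    """Yield one matrix per batch of lines instead of building one giant matrix."""
    with open(path) as fin:
        while True:
            # islice pulls at most batch_size lines from wherever the file left off.
            lines = [line.strip() for line in islice(fin, batch_size)]
            if not lines:
                break
            yield to_matrix(lines)
```

With the fitted tokenizer t from above, `to_matrix` could be `lambda texts: t.texts_to_matrix(texts, mode='count')`, so only one batch-sized matrix exists at a time.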

Answer 2

I realize this is an older question, but I just ran into this problem myself. I combined alvas's answer above with a generator-based Keras training method.

Using a data generator together with the batching approach alvas mentioned solved the memory usage problem.
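A minimal sketch of such a data generator follows. It is an illustration under assumptions, not the code this answer actually used: the names `training_batches`, `to_matrix`, and `to_labels` are hypothetical, and the file is assumed to be non-empty. Keras can train from a generator (`fit_generator` in older versions, `fit` in newer ones) and expects it to keep yielding (features, labels) pairs across epochs, so the generator loops over the file indefinitely.

```python
def training_batches(path, to_matrix, to_labels, batch_size=1024):
    """Loop over the file forever, yielding (features, labels) one batch at a time."""
    while True:  # restart from the top of the file for each epoch
        batch = []
        with open(path) as fin:
            for line in fin:
                batch.append(line.rstrip("\n"))
                if len(batch) == batch_size:
                    yield to_matrix(batch), to_labels(batch)
                    batch = []
        if batch:  # final, smaller batch at the end of the file
            yield to_matrix(batch), to_labels(batch)
```

Here `to_matrix` could wrap the fitted tokenizer's `texts_to_matrix`, and `to_labels` could parse the label out of each line. Because the generator never ends, the training call has to be told how many batches make up one epoch (the `steps_per_epoch` argument).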

Running out of memory when using Tokenizer in keras.preprocessing.text
