Question

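I have a dataset of text documents (emails), each with a label of "0" or "1" (spam/not spam). The whole dataset is already stored as .tfrecord files. I am trying to turn the emails into a bag-of-words representation. I have all the helper methods to do that, but I am still not familiar with tfrecords. Here is what I have so far for reading the tfrecord file: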

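import tensorflow as tf  # TF 1.x queue-based input API

def read_from_tfrecord(filenames):
    # string_input_producer expects a list of paths; the argument is wrapped
    # in a list here, so pass a single path string
    tfrecord_file_queue = tf.train.string_input_producer([filenames], name='queue')
    reader = tf.TFRecordReader()

    # read() returns a (key, value) pair; the value is the serialized example
    _, tfrecord_serialized = reader.read(tfrecord_file_queue)

    tfrecord_features = tf.parse_single_example(tfrecord_serialized,
                        features={
                            'label': tf.FixedLenFeature([], tf.int64),
                            'text': tf.FixedLenFeature([], tf.string),
                        }, name='features')

    text = tfrecord_features['text']
    label = tfrecord_features['label']

    return label, text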

If I want to modify 'text' using my helper methods, how should I go about it?

Answer 1

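tf.parse_single_example returns a dictionary mapping keys to tensors, which means text is a tensor. You can therefore use tensor operations to convert it into a bag of words.

For example: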

text = tf.unique(tf.string_split([text]).values).y

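This returns all the unique tokens in the email (split on spaces). You will probably need to add more operations to handle punctuation and other cases.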
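To make the punctuation point concrete, here is a minimal plain-Python sketch of the kind of normalization a bag-of-words helper might do. The function names `tokenize`, `bag_of_words`, and `unique_tokens` are hypothetical and not from the question; the choice to lowercase and to treat all ASCII punctuation as a separator is an assumption, not something the asker specified.

```python
import re
import string

from collections import Counter

# Hypothetical helpers for illustration only (not from the original post).
# Assumption: every ASCII punctuation character acts as a token separator.
_PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return _PUNCT_RE.sub(" ", text.lower()).split()


def bag_of_words(text):
    """Map each token to the number of times it occurs in the text."""
    return Counter(tokenize(text))


def unique_tokens(text):
    """Unique tokens in first-seen order, similar in spirit to tf.unique(...).y."""
    return list(dict.fromkeys(tokenize(text)))
```

Note that this splits contractions such as "don't" into two tokens. If you want to run a Python helper like this inside the TF 1.x graph rather than rewrite it with tensor ops, tf.py_func is one option; as far as I know it hands the string to your function as bytes, so the helper would need to decode it first.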
