Question

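I have a bit of a problem with my university project.

I have to implement document classification using a genetic algorithm.

I've had a look at this example and (let's say) understood the principles of genetic algorithms, but I'm not sure how they can be applied to document classification. I can't figure out the fitness function.

Here is what I've managed to come up with so far (it may be totally wrong...)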

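Assume I have the categories, and each category is described by some keywords.
Split the file into words.
Create the first population from arrays (e.g. 100 arrays, but that depends on the size of the file) filled with random words from the file.
1:
Choose the best category for each child in the population (by counting the keywords in it).
Cross over each 2 children in the population (a new array containing half of each child) - "crossover"
Fill the rest of each crossed-over child with random unused words from the file - "evolution??"
Replace a random word in a random child of the new population with a random word from the file (used or not) - "mutation"
Copy the best results into the new population.
Go to 1 until some population limit is reached or some category has been found enough times.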
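
For illustration only, the keyword-counting step could be sketched in Python roughly like this. `best_category`, the sample categories and their keywords are hypothetical names and data, not part of the project.

```python
from collections import Counter

def best_category(words, category_keywords):
    """Return (category, hits) for the category whose keywords occur most often in words."""
    counts = Counter(w.lower() for w in words)
    # For every category, add up how often each of its keywords appears.
    scores = {cat: sum(counts[k] for k in kws) for cat, kws in category_keywords.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical sample data, purely for the example.
categories = {"sports": {"goal", "match", "team"}, "finance": {"bank", "stock", "market"}}
child = ["the", "team", "scored", "a", "goal", "before", "the", "market", "closed"]
print(best_category(child, categories))  # ('sports', 2)
```

Note that this only scores one array of words. On its own it does not say how good a whole candidate solution is, which is the part I'm stuck on.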

I'm not sure whether this is correct and would be happy to get some advice, guys. Much appreciated!

Answer 1

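Ivane, in order to properly apply GAs to document classification:

You have to reduce the problem to a system of components that can be evolved.
You can't train a GA for document classification on a single document.

So the steps you've described are on the right track, but I'll give you some improvements:

Have a sufficient amount of training data: you need a set of documents that are already classified and diverse enough to cover the range of documents you are likely to encounter.
Train your GA to correctly classify a subset of those documents, i.e. the training data set.
At each generation, test your best specimen against a validation data set, and stop training when the validation accuracy starts to decrease.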
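
To make the training/validation idea a bit more concrete, here is a minimal Python sketch. It assumes each document is a `(text, label)` pair and uses plain accuracy as the fitness function. `split_data` and `accuracy` are hypothetical helper names, and the 25% validation share is an arbitrary choice.

```python
import random

def split_data(labeled_docs, validation_fraction=0.25, seed=0):
    """Shuffle the labeled documents and split them into (training, validation) lists."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)
    n_val = max(1, int(len(docs) * validation_fraction))
    return docs[n_val:], docs[:n_val]

def accuracy(classify, labeled_docs):
    """Fitness of a classifier: the fraction of documents it labels correctly."""
    correct = sum(1 for text, label in labeled_docs if classify(text) == label)
    return correct / len(labeled_docs)

# Hypothetical toy data and a hand-written stand-in for an evolved classifier.
labeled = [("goal team", "sports"), ("bank stock", "finance"),
           ("match goal", "sports"), ("market bank", "finance")]
train, validation = split_data(labeled)
toy_classifier = lambda text: "sports" if "goal" in text else "finance"
print(len(train), len(validation), accuracy(toy_classifier, labeled))  # 3 1 1.0
```

The point is that fitness is measured over many already-classified documents, not over the words of one file.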

So what you want to do is:

// "default" is assumed to be worse than any real fitness value
prevValidationFitness = default;
currentValidationFitness = default;
bestGA = default;

// Randomly generate the initial population of GAs once, before the loop
population[] = randomlyGenerateGAs();

do
{
    prevValidationFitness = currentValidationFitness;

    // Train your population on the training data set
    candidateGA = Train(population);

    // Get the validation fitness of the best GA of this generation
    currentValidationFitness = Validate(candidateGA);

    // Keep the candidate only if it validates better than the previous one
    if(currentValidationFitness.IsBetterThan( prevValidationFitness ))
        bestGA = candidateGA;

    // Make your selection (e.g. half of the population, roulette wheel selection, or random selection)
    selection[] = makeSelection(population);

    // Mate the specimens in the selection (each mating involves a crossover and possibly a mutation)
    population = mate(selection);
}
while(currentValidationFitness.IsBetterThan( prevValidationFitness ));
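
If it helps, here is one way the loop above might look as runnable Python. This is only a sketch under several assumptions of mine: each specimen is a list of weights, one per (category, vocabulary word) pair; selection is simple truncation with elitism; and training stops after `patience` generations without a validation improvement, which is a slightly more forgiving variant of "stop when validation accuracy starts to decrease". The toy data and all function names are hypothetical.

```python
import random

CATEGORIES = ["sports", "finance"]
# Hypothetical toy data; real training needs far more documents.
TRAIN = [("team wins match", "sports"), ("bank raises rates", "finance"),
         ("goal in final match", "sports"), ("stock market falls", "finance")]
VALIDATION = [("team scores goal", "sports"), ("bank stock rises", "finance")]
VOCAB = sorted({w for text, _ in TRAIN for w in text.split()})

def classify(genome, text):
    """Sum the weights of the known words per category; the highest total wins."""
    words = set(text.split())
    def score(c):
        offset = c * len(VOCAB)
        return sum(genome[offset + i] for i, w in enumerate(VOCAB) if w in words)
    return CATEGORIES[max(range(len(CATEGORIES)), key=score)]

def fitness(genome, labeled_docs):
    """Fraction of documents the genome classifies correctly."""
    return sum(classify(genome, t) == y for t, y in labeled_docs) / len(labeled_docs)

def mate(a, b, rng, mutation_rate=0.05):
    """One-point crossover followed by occasional Gaussian mutation."""
    cut = rng.randrange(1, len(a))
    child = a[:cut] + b[cut:]
    return [w + rng.gauss(0, 0.5) if rng.random() < mutation_rate else w for w in child]

def evolve(pop_size=30, max_generations=50, patience=5, seed=0):
    rng = random.Random(seed)
    size = len(CATEGORIES) * len(VOCAB)
    population = [[rng.uniform(-1, 1) for _ in range(size)] for _ in range(pop_size)]
    best, best_val, stale = None, -1.0, 0
    for _ in range(max_generations):
        # Rank on training data only; validation data never drives selection.
        population.sort(key=lambda g: fitness(g, TRAIN), reverse=True)
        val = fitness(population[0], VALIDATION)
        if val > best_val:
            best, best_val, stale = population[0], val, 0
        else:
            stale += 1
            if stale >= patience:
                break  # validation accuracy has stopped improving
        parents = population[: pop_size // 2]  # truncation selection
        # Elitism: carry the top specimen over unchanged, breed the rest.
        population = [population[0]] + [
            mate(rng.choice(parents), rng.choice(parents), rng)
            for _ in range(pop_size - 1)
        ]
    return best, best_val

best_genome, validation_accuracy = evolve()
print(validation_accuracy, classify(best_genome, "bank stock falls"))
```

The weight-vector encoding is just one option. What matters is that the thing being evolved is a classifier, and that its fitness comes from labeled documents.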

Whenever you receive a new document (one that has not been classified before), you can now classify it with your best GA:

category = bestGA.Classify(document);

So this is not the be-all and end-all solution, but it should give you a decent start. Pozdravi, Kiril

Answer 2

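You might find Learning Classifier Systems useful/interesting. An LCS is a type of evolutionary algorithm intended for classification problems. There is a chapter about them in Eiben & Smith's Introduction to Evolutionary Computing.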

Document classification using a genetic algorithm
