Question

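I want to add a "clean text" to an ArrayList: the text without prepositions and certain other kinds of words.

All the forbidden words are in a string, PH, separated like "word1,word2,etc...", and textEnArray holds ordinary text from a file, a paragraph of a book.

I am trying to check whether each forbidden word differs from the value in textEnArray. If they do not match, I add the value to an ArrayList named totEnArray.

I am having trouble because the foreach does not compare properly: when the two values are the same, nothing gets filtered and all of the text ends up in totEnArray.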


Answer 1

I'm deliberately not giving you a complete answer, just showing what your code could look like. Try this:

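using System;
using System.Collections.Generic;
using System.Linq; // Where, Contains

public static List<string> topFive()
{
    // Sample text and blacklist, hard-coded for demonstration
    string totElText = "this is, or is not, the source text and should, mostly, be ok";
    string PH = "the,is,not";
    char[] delimiterCharsText = { ' ', ',', '.', ':', '\t' };
    string[] arrayPH = PH.Split(',');
    // RemoveEmptyEntries drops the empty strings produced by adjacent delimiters
    string[] textEnArray = totElText.Split(delimiterCharsText, StringSplitOptions.RemoveEmptyEntries);

    // Keep only the words that are not in the blacklist
    return new List<string>(textEnArray.Where(text => !arrayPH.Contains(text)));
}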

In this case it gives:

this 
or 
source 
text 
and 
should 
mostly 
be 
ok

Answer 2

As @Enigmativity pointed out in the comments, you should drop the outer foreach and search for each word in the whole array. Like this:

public static ArrayList topFive(string nomFitxer){
    ArrayList totEnArray = new ArrayList();

    string totElText = File.ReadAllText(nomFitxer);
    string PH = File.ReadAllText(GetValues.obtenirRutaFitxerBlackList());
    char[] delimiterCharsText = { ' ', ',', '.', ':', '\t' };
    string[] arrayPH = PH.Split(',');
    string[] textEnArray = totElText.Split(delimiterCharsText);

    // Single loop: check each word against the whole blacklist
    foreach (string text in textEnArray){
        if (!(arrayPH.Contains(text))){
            totEnArray.Add(text);
        }
    }

    return totEnArray;
}

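You can also add && !String.IsNullOrEmpty(text) to the if statement so that empty strings are not added to the result array.

The reason you always ended up with the whole text in the result array is that a forbidden word was only filtered out in the one iteration of the outer foreach loop where it matched; in the second, third, ... iterations it was compared with a different forbidden word and was added anyway.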
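The explanation above can be sketched in code. The question's original loop is not shown on this page, so FilterNested is only an assumed reconstruction of it; both method names and the sample data are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FilterDemo
{
    // Assumed shape of the original bug: each word is compared with one
    // forbidden word at a time, so it is added once per non-matching pair.
    public static List<string> FilterNested(string[] words, string[] forbidden)
    {
        var result = new List<string>();
        foreach (string bad in forbidden)
            foreach (string word in words)
                if (word != bad)
                    result.Add(word);
        return result;
    }

    // One pass over the words; each is checked against the whole blacklist.
    public static List<string> FilterOnce(string[] words, string[] forbidden)
    {
        var result = new List<string>();
        foreach (string word in words)
            if (!string.IsNullOrEmpty(word) && !forbidden.Contains(word))
                result.Add(word);
        return result;
    }

    public static void Main()
    {
        string[] words = { "the", "cat", "is", "here" };
        string[] forbidden = { "the", "is" };

        // prints: cat is here the cat here
        Console.WriteLine(string.Join(" ", FilterNested(words, forbidden)));
        // prints: cat here
        Console.WriteLine(string.Join(" ", FilterOnce(words, forbidden)));
    }
}
```

With the nested loops, "the" is skipped while comparing against "the" but added while comparing against "is", which is why the forbidden words still show up in the result.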

Answer 3

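As far as I can tell, you want to

load a blacklist (stop-word) collection, paraulaProhibida
filter those words out of the text in the nomFitxer file

You could implement it like this: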

   string blackListFileName = GetValues.obtenirRutaFitxerBlackList();

   // HashSet lookups are O(1) on average, versus O(N) for the legacy ArrayList
   HashSet<String> paraulaProhibida = new HashSet<string>(File
     .ReadLines(blackListFileName)
     .SelectMany(line => line.Split(new char[] { ',', ';' }, StringSplitOptions.None))
     ,StringComparer.OrdinalIgnoreCase);
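To try the blacklist loading without any files, here is a small in-memory sketch. BlackListDemo, Parse and the sample lines are hypothetical names for illustration; unlike the snippet above, this variant also trims whitespace and drops empty entries, which is an assumption about how the blacklist file might be formatted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BlackListDemo
{
    // Builds a case-insensitive set from lines of comma- or semicolon-separated words.
    public static HashSet<string> Parse(IEnumerable<string> lines)
    {
        return new HashSet<string>(
            lines
                .SelectMany(line => line.Split(new[] { ',', ';' }, StringSplitOptions.RemoveEmptyEntries))
                .Select(word => word.Trim()),
            StringComparer.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        HashSet<string> stopWords = Parse(new[] { "the, is", "not;or" });

        Console.WriteLine(stopWords.Count);           // 4
        Console.WriteLine(stopWords.Contains("THE")); // True (comparison ignores case)
        Console.WriteLine(stopWords.Contains("cat")); // False
    }
}
```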

The main difficulty is extracting a word. In natural languages (English, Spanish, etc.) a word can be a surprisingly complex concept:

   I cannot          // 2 words (shall we split "cannot" into "can" and "not"?)
   I may not         // 3 words 
   Forget-me-not     // 1 word
   Do not forget me  // 4 words
   It's an IT; it is // "It" and "it" are the same word, "IT" is a different one (an acronym)
   per cent          // do we have 1 word? 2 words?
   George W. Bush    // is "W" a word?

That is why I suggest using regular expressions to extract words; a simple attempt:

 "[\p{L}'\-]+"
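As a quick check of what this pattern treats as a word, here is a minimal sketch; WordExtractionDemo, ExtractWords and the sample sentence are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordExtractionDemo
{
    // A "word" here is a run of Unicode letters, apostrophes and hyphens.
    public static string[] ExtractWords(string line)
    {
        return Regex.Matches(line, @"[\p{L}'\-]+")
            .OfType<Match>()
            .Select(match => match.Value)
            .ToArray();
    }

    public static void Main()
    {
        // Digits and punctuation are skipped; hyphenated and apostrophized
        // forms stay in one piece.
        foreach (string word in ExtractWords("Forget-me-not: it's 5 per cent."))
            Console.WriteLine(word);
    }
}
```

For this input the sketch yields Forget-me-not, it's, per and cent, so "per cent" still counts as two words; the pattern is a starting point, not a full answer to the cases listed above.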

Enumerate all the words that are not in paraulaProhibida and materialize them as an array:

   string pattern = @"[\p{L}'\-]+";

   string[] textEnArray = File
     .ReadLines(nomFitxer)
     .SelectMany(line => Regex.Matches(line, pattern)
       .OfType<Match>()
       .Select(match => match.Value))
     .Where(word => !paraulaProhibida.Contains(word))
     .ToArray();

Answer 4

If you want to check whether each phrase in textEnArray contains a forbidden word and remove it if so, you can replace your loop with something like this:

totEnArray = new ArrayList(textEnArray.Where(x => !arrayPH.Any(y => x.Contains(y))).ToList());

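This solves your problem without changing the rest of your code, but the code could be improved... for example, you could use an array or a List instead of an ArrayList...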
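One thing to keep in mind is that x.Contains(y) on strings is a substring test, so an entry is dropped whenever a forbidden word appears anywhere inside it. The sketch below contrasts that with whole-word matching and returns a List<string> rather than an ArrayList; ContainsDemo, its method names and the sample data are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ContainsDemo
{
    // Drops an entry if any forbidden word occurs anywhere inside it.
    public static List<string> DropIfSubstring(string[] entries, string[] forbidden)
    {
        return entries.Where(x => !forbidden.Any(y => x.Contains(y))).ToList();
    }

    // Drops an entry only if it is exactly equal to a forbidden word.
    public static List<string> DropIfEqual(string[] entries, string[] forbidden)
    {
        return entries.Where(x => !forbidden.Contains(x)).ToList();
    }

    public static void Main()
    {
        string[] entries = { "this", "is", "fine" };
        string[] forbidden = { "is" };

        // prints: fine  ("this" is dropped because it contains "is")
        Console.WriteLine(string.Join(" ", DropIfSubstring(entries, forbidden)));
        // prints: this fine
        Console.WriteLine(string.Join(" ", DropIfEqual(entries, forbidden)));
    }
}
```

Which behavior is right depends on whether textEnArray holds phrases or single words.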

Compare whether two strings are equal

4 answers: