Question

I have a PHP array:

$excerpts = array(
    'I love cheap red apples',
    'Cheap red apples are what I love',
    'Do you sell cheap red apples?',
    'I want red apples',
    'Give me my red apples',
    'OK now where are my apples?'
);

I want to find all the n-grams in these lines and get a result like this:

cheap red apples: 3
red apples: 5
apples: 6

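I tried imploding the array and then parsing the result, but that does not work well: concatenating strings that have nothing to do with each other creates new n-grams that span two excerpts and do not occur in any single one.

How would you proceed?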

Answer 1

"I want to find groups of words without knowing them beforehand, although with your function I need to provide them before anything."

Try this:

mb_internal_encoding('UTF-8');

// Join with ".\n" so that no word sequence can span two excerpts.
$joinedExcerpts = implode(".\n", $excerpts);
// Split into sentences on runs of characters that are neither whitespace nor letters.
$sentences = preg_split('/[^\s\pL]+/u', $joinedExcerpts, -1, PREG_SPLIT_NO_EMPTY);

$wordsSequencesCount = array();
foreach ($sentences as $sentence) {
    // Lowercased words of the sentence, split on runs of non-letters.
    $words = array_map('mb_strtolower',
                       preg_split('/[^\pL]+/u', $sentence, -1, PREG_SPLIT_NO_EMPTY));
    foreach ($words as $index => $word) {
        $wordsSequence = '';
        // Count every sequence that starts at $index and extends to the right.
        foreach (array_slice($words, $index) as $nextWord) {
            $wordsSequence .= $wordsSequence ? (' ' . $nextWord) : $nextWord;
            if (!isset($wordsSequencesCount[$wordsSequence])) {
                $wordsSequencesCount[$wordsSequence] = 0;
            }
            ++$wordsSequencesCount[$wordsSequence];
        }
    }
}

// Keep only the sequences that occur more than once.
$ngramsCount = array_filter($wordsSequencesCount,
                            function($count) { return $count > 1; });

I am assuming you only want groups of words that are repeated. The output of var_dump($ngramsCount); is:

array (size=11)
  'i' => int 3
  'i love' => int 2
  'love' => int 2
  'cheap' => int 3
  'cheap red' => int 3
  'cheap red apples' => int 3
  'red' => int 5
  'red apples' => int 5
  'apples' => int 6
  'are' => int 2
  'my' => int 2

The code could be optimized, for example to use less memory.
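One possible way to reduce memory, sketched here as an assumption rather than as the answerer's own optimization, is to cap the n-gram length so that each sentence contributes at most a fixed number of sequences per starting word instead of every possible suffix. The function name countNgrams and the $maxN parameter are hypothetical. Unlike the code above, this version treats each excerpt as one unit and does not split it into sentences first.

```php
<?php
// Hypothetical helper (not from the original answer): counts word n-grams
// of at most $maxN words, treating each excerpt as its own unit.
function countNgrams(array $excerpts, int $maxN = 3): array
{
    $counts = [];
    foreach ($excerpts as $excerpt) {
        // Split on runs of non-letters and lowercase each word.
        $words = array_map(
            'mb_strtolower',
            preg_split('/[^\pL]+/u', $excerpt, -1, PREG_SPLIT_NO_EMPTY)
        );
        $total = count($words);
        for ($start = 0; $start < $total; $start++) {
            $sequence = '';
            // Extend the sequence one word at a time, up to $maxN words.
            $end = min($total, $start + $maxN);
            for ($i = $start; $i < $end; $i++) {
                $sequence = $sequence === '' ? $words[$i] : $sequence . ' ' . $words[$i];
                $counts[$sequence] = ($counts[$sequence] ?? 0) + 1;
            }
        }
    }
    // Keep only the sequences that occur more than once.
    return array_filter($counts, function ($count) { return $count > 1; });
}
```

Calling countNgrams($excerpts, 3) on the array from the question should be enough to cover the three phrases asked about, since the longest one has three words. A larger $maxN trades memory for longer phrases.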

Answer 2

The code provided by Pedro Amaral Couto above is very good. Since I use it for French, I modified the regular expression as follows:

$sentences = preg_split('/[^\s\pL\'’-]+/u', $joinedExcerpts, -1, PREG_SPLIT_NO_EMPTY);

This way, we can analyze words that contain hyphens and apostrophes ("est-ce que", "j'ai", etc.).
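For the hyphens and apostrophes to survive all the way into the counted n-grams, the word-level split probably needs the same treatment: a pattern that only allows letters would still break "j'ai" into "j" and "ai". Below is a minimal sketch of such a word splitter. The name tokenizeFrench is hypothetical, and placing the hyphen last in the character class is a precaution so that it is read as a literal character rather than as a range.

```php
<?php
// Hypothetical helper: splits a sentence into lowercase words, keeping
// hyphens and apostrophes (straight and typographic) inside words.
function tokenizeFrench(string $sentence): array
{
    // Separators are runs of anything that is not a letter, ', ’ or -.
    $words = preg_split('/[^\pL\'’-]+/u', $sentence, -1, PREG_SPLIT_NO_EMPTY);
    return array_map('mb_strtolower', $words);
}

$words = tokenizeFrench("Est-ce que j'ai raison ?");
```

The resulting $words could then be fed to the same counting loop as in the first answer.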

Answer 3

Try this (using implode, since you mentioned it as something you had tried):

$ngrams = array(
    'cheap red apples',
    'red apples',
    'apples',
);

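// Join the excerpts with newlines; a newline is not a letter, so it acts as a boundary.
$joinedExcerpts = implode("\n", $excerpts);
$nGramsCount = array_fill_keys($ngrams, 0);
// Debug output of the inputs.
var_dump($ngrams, $joinedExcerpts);
foreach($ngrams as $ngram) {
    // Lookarounds assert that no letter touches the n-gram without consuming the
    // neighbouring character, so back-to-back occurrences are all counted.
    $regex = '/(?<!\pL)' . preg_quote($ngram, '/') . '(?!\pL)/ui';
    $nGramsCount[$ngram] = preg_match_all($regex, $joinedExcerpts);
}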
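
When counting with word-boundary patterns like this, it matters whether the boundary characters are consumed by the match or only asserted. The sketch below compares the two on an illustrative string that is not from the question; the variable names are hypothetical.

```php
<?php
// Illustrative input, not from the question: the same word twice in a row.
$text = 'apples apples';

// Boundary characters are matched (consumed) as part of each hit.
$consuming = '/(?:^|[^\pL])apples(?:$|[^\pL])/ui';
// Boundaries are only asserted, so one space can serve both neighbours.
$lookaround = '/(?<!\pL)apples(?!\pL)/ui';

$consumingCount = preg_match_all($consuming, $text);
$lookaroundCount = preg_match_all($lookaround, $text);
```

Here the consuming pattern finds one occurrence, because the space after the first word is already used up, while the lookaround pattern finds both.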

Answer 4

Assuming you only want to count the number of occurrences of a string:

$cheapRedAppleCount = 0;
$redAppleCount = 0;
$appleCount = 0;
for($i = 0; $i < count($excerpts); $i++)
{
    // Patterns need delimiters; the i flag also matches "Cheap" at the start of a sentence.
    $cheapRedAppleCount += preg_match_all('/cheap red apples/i', $excerpts[$i]);
    $redAppleCount += preg_match_all('/red apples/i', $excerpts[$i]);
    $appleCount += preg_match_all('/apples/i', $excerpts[$i]);
}

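preg_match_all returns the number of matches in the given string, so you only need to add that number to the counter.

See the preg_match_all documentation for more information.

Apologies if I have misunderstood.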

Finding n-grams in an array in PHP
