Split text by words and punctuation marks

Time: 2013-11-04 12:54:27

Tags: php regex

I have text like this:

  

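A man’s jacket is of green color. He – the biggest star in modern history – rides bikes very fast (230 km per hour). How is it possible?! What kind of bike is he using? The semi-automatic gear of his bike, which is quite expensive, significantly helps to reach that speed. Some (or maybe many) claim that he is the fastest in the world! “I saw him ride the bike!” Mr. John Deer speaks. “The speed he sets is 133.78 kilometers per hour,” which sounds incredible; sounds deceiving.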

I would like to get the following resulting array:

words[1] = "A"
words[2] = "man's"
words[3] = "jacket"
...
words[n+1] = "color"
words[n+2] = "."
words[n+3] = "He"
words[n+4] = "-"
words[n+5] = "the"
...

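This array should contain every word and every punctuation mark as a separate element. Can this be done with a regexp? Could anyone help me write it? Thanks!

Edit: showing my work, as requested. I am currently processing the text with the following function, but I would like to do the same thing with a regular expression: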

$text = explode(' ', $this->rawText);
$marks = Array('.', ',', '?', '!', ':', ';', '-', '--', '...');
for ($i = 0, $j = 0; $i < sizeof($text); $i++, $j++) {
    $skip = false;
    //check if the word contains punctuation mark
    foreach ($marks as $value) {
        $markPosition = strpos($text[$i], $value);
        //if it does, separate the punctuation mark from the word
        if ($markPosition !== FALSE) {
            //check the position of the punctuation mark - if it's 0, the token is probably a punctuation mark by itself, for example a dash
            if ($markPosition === 0) {
                //add separate mark to array
                $words[$j] = new Word($j, $text[$i], 2, $this->phpMorphy);
            } else {
                $words[$j] = new Word($j, substr($text[$i], 0, strlen($text[$i]) - 1), 0, $this->phpMorphy);
                //add separate mark to array
                $punctMark = substr($text[$i], -1);
                $j += 1;
                $words[$j] = new Word($j, $punctMark, 1, $this->phpMorphy);
            }
            $skip = true;
            break;
        }
    }
    if (!$skip) {
        $words[$j] = new Word($j, $text[$i], 0, $this->phpMorphy);
    }
}

2 Answers:

Answer 0 (score: 1)

The following will split your specific text.

$words = preg_split('/(?<=\s)|(?<=\w)(?=[.,:;!?()-])|(?<=[.,!()?\x{201C}])(?=[^ ])/u', $text);

See the working demo
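For illustration, here is a minimal sketch of how the pattern above might be used on one sentence from the question. The pattern is copied unchanged; the helper name `tokenize_with_split()` is hypothetical and not part of the original answer. Because the pattern splits after whitespace rather than on it, the pieces keep their trailing spaces, so the sketch trims them and drops pieces that end up empty.

```php
<?php
// Hypothetical helper wrapping the pattern from the answer above.
function tokenize_with_split(string $text): array
{
    $pattern = '/(?<=\s)|(?<=\w)(?=[.,:;!?()-])|(?<=[.,!()?\x{201C}])(?=[^ ])/u';
    $pieces = preg_split($pattern, $text);
    // Pieces that ended at whitespace still carry it, so trim each one.
    $pieces = array_map('trim', $pieces);
    // Drop pieces that were whitespace only (e.g. after a double space).
    return array_values(array_filter($pieces, function ($p) {
        return $p !== '';
    }));
}

print_r(tokenize_with_split('He rides bikes very fast (230 km per hour).'));
```

For this sentence the tokens come out as He, rides, bikes, very, fast, (, 230, km, per, hour, ), and a final period. Note that the same rules also break a decimal such as 133.78 into three tokens, which may or may not be what you want.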

Answer 1 (score: 0)

Try using preg_split. Put the punctuation marks (of your choice) inside the square brackets [ ].

<?php
$str="A man’s jacket is of green color. He – the biggest star in modern history – rides bikes very fast (230 km per hour). How is it possible?! What kind of bike is he using? The semi-automatic gear of his bike, which is quite expensive, significantly helps to reach that speed. Some (or maybe many) claim that he is the fastest in the world! “I saw him ride the bike!” Mr. John Deer speaks. “The speed he sets is 133.78 kilometers per hour,” which sounds incredible; sounds deceiving.";

$keywords=preg_split("/[-,. ]/", $str);

print_r($keywords);

Output:

  

Array
(
    [0] => A
    [1] => man’s
    [2] => jacket
    [3] => is
    [4] => of
    [5] => green
    [6] => color
    [7] =>
    [8] => He
    [9] => –
    [10] => the
    [11] => biggest
    [12] => star
    [13] => in
    [14] => modern
    [15] => history
    [16] => –

Message was truncated to prevent abuse of resources... Shankar ;)
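One thing to keep in mind with a plain `preg_split` on a character class: the characters it splits on are discarded, so the periods, commas and hyphens do not appear in the result, while the question asks for them as separate elements. A possible alternative, sketched here as an assumption and not as part of the original answer, is to collect tokens with `preg_match_all` instead of splitting. The helper name `tokenize_with_match()` and the exact pattern are hypothetical choices for illustration.

```php
<?php
// Hypothetical helper: collect tokens instead of splitting on delimiters.
function tokenize_with_match(string $text): array
{
    // Either a run of letters/digits, optionally joined by an apostrophe
    // (' or U+2019) as in "man’s", or a single character that is neither
    // whitespace nor a letter/digit (i.e. one punctuation mark).
    $pattern = '/[\p{L}\p{N}]+(?:[\'\x{2019}][\p{L}\p{N}]+)*|[^\s\p{L}\p{N}]/u';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

print_r(tokenize_with_match("A man\u{2019}s jacket is of green color. He \u{2013} the biggest star"));
```

With this pattern every punctuation character becomes its own token, so "?!" yields two tokens, "133.78" yields three, and "semi-automatic" is split at the hyphen. If those cases matter, the pattern would need extra alternatives for them.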