Extracting words from text using PHP

Time: 2014-12-17 05:28:07

Tags: php regex

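Hello friends, I have a small problem. I need to extract only the words from any given text.

I tried retrieving the words with strtok() and strstr(), as well as a few regular expressions, but only managed to extract some of the words.

The problem is complicated by the number of characters and symbols attached to the words.

Here is a sample text from which the words must be extracted: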

Main article: our 46,000 required, !but (1947-2011) mail@server.com March 8, 2014 Gutenberg's 34-DE 'a' 3,1415 Us: @unknown n go http://google.com or www.google.com and http://www.google.com (r) The 509th "composite" and; C-54 #dog v4.0 ¿as is done? ¿article... agriculture? x ¿cat? now! Hi!! (87 meters).

Sample text, for testing.

The result of the extraction should be:

Main article our required but March Gutenberg's a go or and The composite and dog as is done article agriculture cat now Hi meters

Sample text for testing

The first function I wrote, which pre-processes the text:

function PreText($text){
  $text = str_replace("\n", ".", $text);
  $text = str_replace("\r", ".", $text);

  $text = str_replace("'", "", $text);
  $text = str_replace("?", "", $text);
  $text = str_replace("¿", "", $text);
  $text = str_replace("(", "", $text);
  $text = str_replace(")", "", $text);
  $text = str_replace('"', "", $text);
  $text = str_replace(';', "", $text);
  $text = str_replace('!', "", $text);
  $text = str_replace('<', "", $text);
  $text = str_replace('>', "", $text);
  $text = str_replace('#', "", $text);

  $text = str_replace(",", "", $text);

  $text = str_replace(".c", "", $text);
  $text = str_replace(".C", "", $text);
  return $text;
}

The splitting function:

function SplitWords($text){
  $words = explode(" ", $text);
  $ContWords = count($words);
  $NewText = ""; // initialize so the .= below does not hit an undefined variable

  for ($i = 0; $i < $ContWords; $i++){
    if (ctype_alpha($words[$i])) {
      $NewText .= $words[$i].", ";
    }
  }
  return $NewText;
}

The program:

<?php
  include_once ('functions.php');

  $text = "Main article: our 46,000 ...";
  $text = PreText($text);
  $text = SplitWords($text);
  echo $text;
?>

This code still has a long way to go. Thanks for your help.

2 answers:

Answer 0 (score: 5)

If I understand correctly, you want to remove all non-letters from the string. I would use preg_replace:

$text = "Main article: our 46,000...";
$text = preg_replace("/[^a-zA-Z' ]/","",$text);

This should remove everything that is not a letter, an apostrophe, or a space.
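
To see what this replacement does in practice, here is a minimal sketch. The helper name `stripNonLetters` and the extra step that collapses leftover spaces are assumptions added for illustration; they are not part of the answer above.

```php
<?php
// Hypothetical helper: wraps the preg_replace call from the answer.
function stripNonLetters(string $text): string
{
    // Remove everything except ASCII letters, apostrophes and spaces.
    $text = preg_replace("/[^a-zA-Z' ]/", "", $text);
    // Assumption: collapse the runs of spaces left behind by removed tokens.
    return trim(preg_replace("/ {2,}/", " ", $text));
}

echo stripNonLetters("Main article: our 46,000 required, !but (1947-2011) March 8, 2014"), "\n";
// prints: Main article our required but March
```

Because the replacement works character by character, tokens that mix letters with digits or symbols leave fragments behind: `34-DE` becomes `DE`, `509th` becomes `th`, and `mail@server.com` becomes `mailservercom`. Those tokens would have to be removed before this step if they should not appear in the result.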

Answer 1 (score: 0)

Try this; it does almost what you asked for:

<?php
$text = <<<HEREDOC
Main article: our 46,000 required, !but (1947-2011) mail@server.com March 8, 2014 Gutenberg's 34-DE 'a' 3,1415 Us: @unknown n go http://google.com or www.google.com and
        http://www.google.com (r) The 509th composite" and; C-54 #dog v4.0 ¿as is done? ¿article... agriculture? x ¿cat? now! Hi!! (87 meters). Sample text, for testing.
HEREDOC;
// Remove all kinds of URLs and email addresses from the text
$url_email = "((https?|ftp)\:\/\/)?"; // SCHEME
$url_email .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
$url_email .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
$url_email .= "(\:[0-9]{2,5})?"; // Port
$url_email .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
$url_email .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
$url_email .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor

$text = preg_replace("/$url_email/","",$text);
// Remove anything like "Us: @unknown"
$text = preg_replace("/Us:.?@\\w+/","",$text);
// Remove all remaining characters that are not letters, apostrophes or spaces
$text = preg_replace("/[^a-zA-Z' ]/","",$text);
echo $text;
?>
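
As an alternative to deleting characters, the text can be filtered token by token, which is closer to the explode() and ctype_alpha() attempt in the question. The following is only a sketch of that different technique, not the method of either answer. The function name `extractWords` and its rules are assumptions: split on whitespace, strip punctuation from both ends of each token, and keep a token only if what remains consists of letters with optional inner apostrophes. It also assumes the input is valid UTF-8, since the patterns use the `u` modifier.

```php
<?php
// Hypothetical helper; the name and the filtering rules are assumptions.
function extractWords(string $text): array
{
    $words = [];
    // Split on any run of whitespace so newlines and tabs also separate tokens.
    foreach (preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) as $token) {
        // Strip punctuation and symbols from both ends of the token.
        $token = preg_replace('/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/u', '', $token);
        // Keep the token only if it is made of letters, optionally joined by apostrophes.
        if (preg_match('/^\p{L}+(?:\'\p{L}+)*$/u', $token)) {
            $words[] = $token;
        }
    }
    return $words;
}

echo implode(" ", extractWords("Main article: our 46,000 required, !but (1947-2011) mail@server.com Gutenberg's 34-DE 'a' #dog now! Hi!!")), "\n";
// prints: Main article our required but Gutenberg's a dog now Hi
```

With these rules, whole tokens such as `46,000`, `34-DE`, `mail@server.com` and `http://google.com` are rejected rather than reduced to fragments. The sketch would still keep `Us`, `unknown` and single letters such as `n` and `x`, which the expected output in the question leaves out, so further rules would be needed to reproduce that output exactly.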