Regular expression to find separate words?

Date: 2011-02-05 11:02:00

Tags: c# .net regex

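Here's a quick one for you RegEx wizards. I need a regular expression that will find groups of words. Any group of words. For example, I'd like it to find the first two words in any sentence.

Example "Hi there, how are you?" - the return would be "Hi there"

Example "How are you doing?" - the return would be "How are"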

3 answers:

Answer 0 (score: 4)

Try this:

^\w+\s+\w+

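Explanation: starting at the beginning of the string (^), one or more word characters, then one or more whitespace characters, then one or more word characters.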
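As a quick check of how this pattern behaves, here is a minimal Python sketch (Python's `re` treats `^`, `\w` and `\s` much like .NET does for this simple case). The function name `first_two_words` and the sample sentences are illustrative only. Note that the pattern requires whitespace directly between the two words, so a sentence such as "Hi, there" does not match.

```python
import re

# Anchored at the start of the string: word, whitespace, word.
FIRST_TWO = re.compile(r"^\w+\s+\w+")

def first_two_words(sentence):
    """Return the text matched by ^\\w+\\s+\\w+, or None if there is no match."""
    m = FIRST_TWO.search(sentence)
    return m.group(0) if m else None

print(first_two_words("Hi there, how are you?"))  # Hi there
print(first_two_words("How are you doing?"))      # How are
print(first_two_words("Hi, there"))               # None (comma breaks the match)
```

If punctuation between the first two words has to be tolerated, a looser separator than `\s+` is needed, which is what the longer answer below works toward.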

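Answer 1 (score: 2)

Regular expressions can be used to parse language, and they are a fairly natural tool for it. After gathering the words, use a dictionary to check whether they are actually words in a particular language.

The premise is to define a regular expression that will split out 99.9% of possible words, where "word" is the key definition.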
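The two-stage idea (split permissively, then validate against a dictionary) could be sketched roughly as follows in Python. Everything here is an assumption for illustration: `KNOWN_WORDS` is a tiny hypothetical stand-in for a real word list, and the splitting pattern is deliberately simpler than the ones given later in this answer.

```python
import re

# Hypothetical stand-in for a real dictionary of the target language.
KNOWN_WORDS = {"hi", "there", "how", "are", "you"}

def split_candidates(text):
    # Deliberately permissive: any run of word characters and apostrophes.
    return re.findall(r"[\w']+", text)

def check_words(text):
    """Pair each candidate with whether it appears in KNOWN_WORDS."""
    return [(w, w.lower() in KNOWN_WORDS) for w in split_candidates(text)]

print(check_words("Hi there, hwo are you?"))
```

The regex only has to over-collect candidates; deciding what counts as a real word is left to the lookup.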

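I assume C# uses a Perl 5.8-style regex flavor. (.NET actually has its own engine rather than PCRE; it does not accept POSIX classes such as [[:punct:]] and requires braces, as in \p{L}, so the patterns below would need adapting there.) Here is my ASCII definition of how to split out words (expanded form):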

regex = '[\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* )'

And Unicode (more has to be added or subtracted to suit specific encodings):

regex = '[\s\pP]* ([\pL\pN_-] (?: [\pL\pN_-] | \pP(?=[\pL\pN\pP_-]) )* )'

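To find ALL of the words, turn the regex string into a regular expression and match it against the text (I don't know C#, so this is Perl):

@matches = $text =~ /$regex/xg

where /x and /g are the extended and global modifiers. Note that the regex string contains only capture group 1, so the intervening text is not captured.

To find just the FIRST TWO:

@matches = $text =~ /(?:$regex)(?:$regex)/x

Below is a Perl sample. Anyway, play around with it. Cheers!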

use strict;
use warnings;
use utf8;   # this source file contains literal UTF-8 characters (é)

binmode (STDOUT,':utf8');

# Unicode
my $regex = qr/ [\s\pP]* ([\pL\pN_-] (?: [\pL\pN_-] | \pP(?=[\pL\pN\pP_-]) )* ) /x;

# Ascii
# my $regex = qr/ [\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* ) /x;


my $text = q(
  I confirm that sufficient information and detail have been
  reported in this technical report, that it's "scientifically" sound,
  and that appropriate conclusion's have been included
);
print "\n**\n$text\n"; 

my @matches = $text =~ /$regex/g;
print "\nTotal ".scalar(@matches)." words\n",'-'x20,"\n";
for (@matches) {
    print "$_\n";
}

# =======================================

my $junk = q(
Hi, there, A écafé and Horse d'oeuvre 
hasn't? 'n? '? a-b? -'a-? 
);
print "\n\n**\n$junk\n"; 

# First 2 words
@matches = $junk =~ /(?:$regex)(?:$regex)/;
print "\nFirst 2 words\n",'-'x20,"\n";
for (@matches) {
    print "$_\n";
}

# All words
@matches = $junk =~ /$regex/g;
print "\nTotal ".scalar(@matches)." words\n",'-'x20,"\n";
for (@matches) {
    print "$_\n";
}

Output:
**

I confirm that sufficient information and detail have been
reported in this technical report, that it's "scientifically" sound,
and that appropriate conclusion's have been included


Total 25 words
--------------------
I
confirm
that
sufficient
information
and
detail
have
been
reported
in
this
technical
report
that
it's
scientifically
sound
and
that
appropriate
conclusion's
have
been
included


**

Hi, there, A écafé and Horse d'oeuvre
hasn't? 'n? '? a-b? -'a-?

First 2 words
--------------------
Hi
there

Total 11 words
--------------------
Hi
there
A
écafé
and
Horse
d'oeuvre
hasn't
n
a-b
a-
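For readers without Perl at hand, the ASCII variant of the pattern above can be approximated in Python. This is a sketch under stated assumptions, not a faithful port: Python's `re` has neither `[[:punct:]]` nor `\pL`, so `string.punctuation` stands in for the punctuation class and Python's Unicode-aware `\w` stands in for the letter and digit classes. The names `PUNCT`, `WORD`, `all_words` and `first_two` are made up for this example.

```python
import re
import string

# ASCII punctuation, escaped so it is safe inside a character class.
PUNCT = re.escape(string.punctuation)

# Skip leading whitespace/punctuation, then capture a word: a word character
# followed by word characters, or by punctuation that is itself followed by
# another word character or punctuation mark.
WORD = rf"[\s{PUNCT}]*(\w(?:\w|[{PUNCT}](?=[\w{PUNCT}]))*)"

def all_words(text):
    # With a single capture group, findall returns only the captured words.
    return re.findall(WORD, text)

def first_two(text):
    # Two copies of the pattern in a row give two capture groups.
    m = re.search(f"(?:{WORD})(?:{WORD})", text)
    return m.groups() if m else None

junk = "Hi, there, A écafé and Horse d'oeuvre \nhasn't? 'n? '? a-b? -'a-? "
print(first_two(junk))
print(all_words(junk))
```

On the second sample text this should reproduce the same word list as the Perl run above, including the trailing `a-`, because a punctuation mark is kept whenever another punctuation mark or word character follows it.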

Answer 2 (score: 0)

@Rubens Farias

Per my comment, here is the code I used:

public int startAt = 0;

private void btnGrabWordPairs_Click(object sender, EventArgs e)
{
    // Word boundary, one or more word chars, one or more whitespace chars,
    // one or more word chars, word boundary.
    Regex regex = new Regex(@"\b\w+\s+\w+\b");

    if (startAt <= txtTest.Text.Length)
    {
        Match match = regex.Match(txtTest.Text, startAt);
        if (match.Success)
        {
            MessageBox.Show(match.Value);
            // Resume after this match; Index accounts for any skipped characters.
            startAt = match.Index + match.Length;
        }
    }
}

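Each time the button is clicked, it grabs a pair of words quite nicely, working through the text in the txtTest TextBox and finding the pairs in order until it reaches the end of the string.

@sln: thanks for the very detailed response!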
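The same "resume where the last match ended" idea can be sketched outside of WinForms. The Python below is only an illustration: `next_pair` is a hypothetical helper, and the sample text is made up. It moves the position to the end of the match (`m.end()`) rather than adding the match length to the old position, which keeps the position correct when punctuation is skipped before a match.

```python
import re

PAIR = re.compile(r"\b\w+\s+\w+\b")

def next_pair(text, start_at):
    """Return (pair, new_start) for the next word pair at or after start_at,
    or (None, start_at) when no pair is left."""
    m = PAIR.search(text, start_at)
    if m is None:
        return None, start_at
    return m.group(0), m.end()

text = "one two, three four five"
pos = 0
pairs = []
while True:
    pair, pos = next_pair(text, pos)
    if pair is None:
        break
    pairs.append(pair)

print(pairs)  # ['one two', 'three four']
```

The trailing "five" is dropped because it has no partner, which matches how the button handler simply stops finding pairs at the end of the text.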