Question

I have a .srt file with a fixed text structure. Example:

1
00:00:01,514 --> 00:00:04,185
I'm investigating
Saturday night's shootings.

2
00:00:04,219 --> 00:00:05,754
What's to investigate?
Innocent people

I want to extract the individual words, such as "I'm", "investigating", "Saturday", and so on.

I have created the pattern

@"[a-zA-Z']"

which splits my text almost correctly. But the .srt file also contains useless markup tags such as

<i>

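which I want to remove.

How can I build a pattern that separates the words and also removes all text between "<" and ">" (including the angle brackets themselves)?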

Answer 1

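It is hard to do this with a single regexp (at least for me), but you can do it in two steps.

First remove the HTML-style tags from the string, then extract the words from what is left.

Have a look below.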

var text = "00:00:01,514 --> 00:00:04,185 I'm investigating Saturday night's shootings.<i>";

// remove every tag; the pattern must not end in ".*", or the text after a tag is deleted too
var noHtml = Regex.Replace(text, @"<[^>]*>", "");

// now extract the words by running @"[a-zA-Z']+" on noHtml.
// You should get "I'm", "investigating", "Saturday", "night's", "shootings".
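
To make the two steps concrete, here is a minimal, self-contained sketch. The class name `SrtWords` and the method `ExtractWords` are hypothetical names chosen for illustration, and the sample line is adapted from the question. Tags are replaced with a space rather than an empty string (an assumption on my part) so that words on either side of a tag are not glued together.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SrtWords
{
    // Hypothetical helper, not part of the original answer.
    public static List<string> ExtractWords(string text)
    {
        // Step 1: drop anything that looks like a tag, e.g. <i> or </i>.
        // A space is used as the replacement so adjacent words stay separate.
        string noTags = Regex.Replace(text, @"<[^>]*>", " ");

        // Step 2: collect runs of letters and apostrophes.
        var words = new List<string>();
        foreach (Match m in Regex.Matches(noTags, @"[a-zA-Z']+"))
        {
            words.Add(m.Value);
        }
        return words;
    }

    public static void Main()
    {
        string text = "00:00:01,514 --> 00:00:04,185 <i>I'm investigating</i> Saturday night's shootings.";
        // prints: I'm|investigating|Saturday|night's|shootings
        Console.WriteLine(string.Join("|", ExtractWords(text)));
    }
}
```

The digits, commas, and the `-->` arrow of the timestamp are skipped simply because the word pattern only accepts letters and apostrophes.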

Answer 2

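You can use negative lookarounds: a lookbehind asserting that the match is not preceded by a "<" followed only by non-">" characters, and a lookahead asserting that it is not followed by a run of non-"<" characters ending in ">". For example: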

using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        string input = @"
<garbage>
Hello world, <rubbish>it's a wonderful day.



<trash>
";
        foreach (Match match in Regex.Matches(input, @"(?<!<[^>]*)[a-zA-Z']+(?![^<]*>)"))
        {
            Console.WriteLine(match.Value);
        }
    }
}
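
One thing worth checking before running this pattern on real .srt data: the timestamp arrow `-->` contains a bare `>`, so the lookahead treats any untagged words that come before a later arrow as if they were inside a tag. The sketch below runs the pattern from this answer, unchanged, on the sample from the question. The names `LookaroundCheck` and `Words` are hypothetical, and replacing the arrows with a space first is just one possible workaround, not something the answer itself proposes.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundCheck
{
    // Pattern copied unchanged from the answer above.
    public const string Pattern = @"(?<!<[^>]*)[a-zA-Z']+(?![^<]*>)";

    // Hypothetical helper: returns every match of Pattern in input.
    public static string[] Words(string input)
    {
        return Regex.Matches(input, Pattern).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        string cues = "1\n00:00:01,514 --> 00:00:04,185\nI'm investigating\nSaturday night's shootings.\n\n"
                    + "2\n00:00:04,219 --> 00:00:05,754\nWhat's to investigate?\nInnocent people\n";

        // The first cue's words are followed by the second cue's "-->",
        // so the lookahead rejects them.
        // prints: What's|to|investigate|Innocent|people
        Console.WriteLine(string.Join("|", Words(cues)));

        // Workaround (an assumption): blank out the arrows before matching.
        // prints: I'm|investigating|Saturday|night's|shootings|What's|to|investigate|Innocent|people
        Console.WriteLine(string.Join("|", Words(cues.Replace("-->", " "))));
    }
}
```

Removing the tags in a separate step, as Answer 1 suggests, sidesteps the problem as well, because a pattern like `<[^>]*>` only fires when there is an opening `<`.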


How do I split text into words using a regular expression?
