Extracting email addresses and names from a text file

Date: 2015-02-02 22:33:37

Tags: c# email text

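I will do my best to explain the problem. I have a text file containing email addresses and names. It looks like this: Barb Beney "de.mariof@vienna.aa", "Beny Beney" bet@catering.at and so on, all on a single line. This is just an example; I have thousands of entries like this in one large text file. I want to extract the emails and names so that I end up with something like this:

Beny Beney bet@catering.at (name and address next to each other, each entry on its own line, without quotes). Finally, all duplicate addresses should be removed from the file.

I wrote code that extracts the email addresses, and it works, but I don't know how to do the rest: how to extract the name, put it on the same line as its address, and eliminate duplicates. I hope I have described this clearly enough for you to see what I am trying to do. Here is my code: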

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Text.RegularExpressions;
using System.IO;

namespace Email
{
class Program
{
    static void Main(string[] args)
    {
        ExtractEmails(@"C:\Users\drake\Desktop\New.txt", @"C:\Users\drake\Desktop\Email.txt");   
    }


    public static void ExtractEmails(string inFilePath, string outFilePath)
    {
        string data = File.ReadAllText(inFilePath);

        Regex emailRegex = new Regex(@"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*",
            RegexOptions.IgnoreCase);


        MatchCollection emailMatches = emailRegex.Matches(data);


        StringBuilder sb = new StringBuilder();

        foreach (Match emailMatch in emailMatches)
        {
            sb.AppendLine(emailMatch.Value);

        }

        File.WriteAllText(outFilePath, sb.ToString());
    }

}
}

3 Answers:

Answer 0 (score: 0)

Welcome. You can use this code; it processes the file and creates a new file containing all the email addresses without duplicates:

    static void Main(string[] args)
    {
        ExtractEmails(@"C:\Users\drake\Desktop\New.txt", @"C:\Users\drake\Desktop\Email.txt");

        // Dispose both files even if an exception is thrown.
        using (TextReader r = File.OpenText(@"C:\Users\drake\Desktop\Email.txt"))
        using (TextWriter w = File.CreateText(@"C:\Users\drake\Desktop\NonDuplicateEmails.txt"))
        {
            RemovingAllDupes(r, w);
        }
    }

    public static void RemovingAllDupes(TextReader reader, TextWriter writer)
    {
        string currentLine;
        HashSet<string> previousLines = new HashSet<string>();

        while ((currentLine = reader.ReadLine()) != null)
        {
            // Add returns true if it was actually added,
            // false if it was already there
            if (previousLines.Add(currentLine))
            {
                writer.WriteLine(currentLine);
            }
        }
        writer.Close();
    }
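
As a variation on the same idea, here is a minimal sketch that also treats addresses differing only in letter case or surrounding whitespace as duplicates. The names `LineDeduper`, `DistinctLines` and `WriteDistinct` are made up for illustration, and it assumes .NET 4 or later for `File.ReadLines`. Note that the documentation for `Distinct` does not promise any particular output order.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LineDeduper
{
    // Trims each line, drops empty ones, and removes duplicates,
    // comparing case-insensitively (a@x.com equals A@X.COM).
    public static IEnumerable<string> DistinctLines(IEnumerable<string> lines)
    {
        return lines
            .Select(line => line.Trim())
            .Where(line => line.Length > 0)
            .Distinct(StringComparer.OrdinalIgnoreCase);
    }

    // Hypothetical helper: the input and output paths must be different files,
    // because the input is still being read while the output is written.
    public static void WriteDistinct(string inFilePath, string outFilePath)
    {
        File.WriteAllLines(outFilePath, DistinctLines(File.ReadLines(inFilePath)));
    }
}
```

It could be called as LineDeduper.WriteDistinct(@"C:\Users\drake\Desktop\Email.txt", @"C:\Users\drake\Desktop\NonDuplicateEmails.txt"). The set of distinct lines is still held in memory, just as with the HashSet above.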

Answer 1 (score: 0)

For the newly requested format, you can do the following:

private string[] parseEmails(string bigStringIn)
{
    // Strip the double quotes, then split the text into entries on commas.
    string bigString = bigStringIn.Replace("\"", "");
    string[] output = bigString.Split(',');
    return output;
}

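It takes the string containing the mail addresses, removes the quotes, and then splits the string into an array of strings in the format: name lastname email@some.com

To remove duplicate entries, a nested loop should do the trick, checking for matching strings (probably after the .Split()).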
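
The split-then-deduplicate idea described above could be sketched roughly as follows. This is only an illustration under a few assumptions: entries are separated by commas, names themselves contain no commas, and two entries count as duplicates when their addresses match ignoring case. `EntryParser` and `ParseEntries` are hypothetical names, the regex is the one from the question, and a `HashSet` stands in for the nested loop.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EntryParser
{
    // Same email pattern as in the question.
    private static readonly Regex EmailRegex = new Regex(
        @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*", RegexOptions.IgnoreCase);

    // Returns "name email" strings, one per distinct email address.
    public static List<string> ParseEntries(string data)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();

        // Strip the quotes and split on commas (assumed entry separator).
        foreach (string raw in data.Replace("\"", "").Split(','))
        {
            Match m = EmailRegex.Match(raw);
            if (!m.Success)
                continue; // no address found in this entry

            // Whatever is left after cutting out the address is treated as the name.
            string name = raw.Remove(m.Index, m.Length).Trim();

            // Add returns false when the address has already been seen.
            if (seen.Add(m.Value))
                result.Add(name.Length > 0 ? name + " " + m.Value : m.Value);
        }
        return result;
    }
}
```

The returned list can then be written out with File.WriteAllLines(outFilePath, EntryParser.ParseEntries(File.ReadAllText(inFilePath))), which puts each name and address pair on its own line.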

Answer 2 (score: 0)

You can also use this code for large files:

    static void Main(string[] args)
    {
        ExtractEmails(@"C:\Users\drake\Desktop\New.txt", @"C:\Users\drake\Desktop\Email.txt");
        var sr = new StreamReader(@"C:\Users\drake\Desktop\Email.txt");
        // Passing the path creates or overwrites the file; File.OpenWrite would leave
        // stale bytes behind if an existing file were longer than the new output.
        var sw = new StreamWriter(@"C:\Users\drake\Desktop\NonDuplicateEmails.txt");
        RemovingAllDupes(sr, sw);
    }

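    public static void RemovingAllDupes(StreamReader str, StreamWriter stw)
    {
        // Only the 32-bit hash code of each line is stored, which saves memory on large files.
        // Caveat: two different lines can share a hash code, in which case the later one
        // is dropped as a false duplicate; use a HashSet<string> if that is unacceptable.
        var lines = new HashSet<int>();
        while (!str.EndOfStream)
        {
            string line = str.ReadLine();
            int hc = line.GetHashCode();
            if (lines.Contains(hc))
                continue;

            lines.Add(hc);
            stw.WriteLine(line);
        }
        stw.Flush();
        stw.Close();
        str.Close();
    }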