How to extract email header fields using awk or grep

Time: 2015-07-22 16:29:09

Tags: awk grep mbox

About: mailbox (mbox format) email

Multi-message mail file: Inbox.mbox

From - Thu Mar 26 16:16:21 2015
From: Mail Delivery System <Mailer-Daemon@200.netwizz.com>
To: edge@notterribe.org
Subject: Mail delivery failed: returning message to sender
Message-Id: <E1Yb3yX-0004CB-QH@200.netwizz.com>
Date: Thu, 26 Mar 2015 02:21:17 -0700
Date: Thu, 26 Mar 2015 02:20:44 -0700
From: edge <edge@notterribe.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.5.0
MIME-Version: 1.0
To: leasing@theedgehenderson.com
CC: etpmgr@movein.net, t.simmonds@movein.ne
Subject: Fwd: Today's Breach Of Our Security.
From - Fri Mar 27 12:00:00 2015  

Desired pattern match order:

Date: Thu, 26 Mar 2015 02:21:17 -0700  
From - Thu Mar 26 16:16:21 2015  
From: Mail Delivery System <Mailer-Daemon@200.netwizz.com>  
To: edge@notterribe.org  
Message-Id: <E1Yb3yX-0004CB-QH@200.netwizz.com>
Subject: Mail delivery failed: returning message to sender 

Desired final result:

Date: Thu, 26 Mar 2015 02:21:17 -0700;From - Thu Mar 26 16:16:21 2015;From: Mail Delivery System <Mailer-Daemon@200.netwizz.com>;To: edge@notterribe.org;Message-Id: <E1Yb3yX-0004CB-QH@200.netwizz.com>;Subject: Mail delivery failed: returning message to sender

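Goal:
* Each mail message in "Inbox.mbox" starts with "From ".
* Match only the first occurrence of "^Date: |^From |^From: |^To: |^Message-ID: |^Subject: " in each message and print that line.
* Format the output as semicolon-separated CSV.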

What I have tried:
grep -a -E -i "^Date: |^From |^From: |^To: |^Message-ID: |^Subject: " Inbox.mbox
awk '/^Date: / || /^From / || /^From: / || /^To: / || /^Message-ID: / || /^Subject: /' Inbox.mbox

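Comment: The commands above give me a good start. I am most familiar with awk and grep, so I would like to stick with them. My difficulty is printing the lines in the order I want and matching only the first occurrence of each field, up to the end of its line. Some messages contain binary data, which is why I use -a with grep.

Any help is much appreciated. Thanks.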

1 Answer:

Answer 0: (score: 0)

OK, so all you have is a Thunderbird mbox.

Here is what I came up with, in a file named mbox2csv:
#!/usr/bin/gawk -f
BEGIN {
    # initialize an empty array and set the "i" counter to 0
    i = split("", row, ":");
}

# awk does not have a "join"; n is the element count,
# k and result are local variables
function join(array, n, sep,    k, result) {
    sep = sep ? sep : ";";
    result = array[0];
    for (k = 1; k < n; ++k) {
        result = result sep array[k];
    }
    return result;
}

# print the collected lines as one row, then reinitialise the array and "i"
function emit() {
    if (i > 0) {
        print join(row, i);
    }
    i = split("", row, ":");
}

# the keys you want to store ("Message-Id" also occurs in the sample data)
/^(From|Date|To|Message-I[Dd]|Subject):/ {
    row[i++] = $0;
}

# every time we match an mbox message separator
/^From / {
    emit();
}

# the last message has no separator after it, so print it here
END {
    emit();
}

Then: mbox2csv INBOX > result.csv
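The question asks for only the first occurrence of each field per message, in a fixed order, with the "From " separator line included in the row. A minimal sketch of that behaviour follows, written as a shell function around plain awk. The function name mbox_first_headers and the internal key "sep" are made up for this illustration, and a field that is missing from a message is assumed to become an empty column.

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption): print one
# semicolon-separated row per mbox message, keeping only the first
# occurrence of each wanted header line.
mbox_first_headers() {
    awk '
    # print the stored lines in the order requested in the question
    function emit(    k, out) {
        if (!seen) return
        for (k = 1; k <= n; k++)
            out = out (k > 1 ? ";" : "") row[order[k]]
        print out
    }
    BEGIN {
        # "sep" stands for the "From " separator line itself
        n = split("date sep from to message-id subject", order, " ")
    }
    # an mbox separator line starts a new message
    /^From / {
        emit()
        split("", row)      # clear the per-message store
        row["sep"] = $0
        seen = 1
        next
    }
    # keep only the first occurrence of each wanted header
    /^(From|Date|To|Message-I[Dd]|Subject): / {
        key = tolower(substr($0, 1, index($0, ":") - 1))
        if (!(key in row)) row[key] = $0
    }
    # the last message has no separator after it
    END { emit() }
    ' "$@"
}
```

Possible usage, after sourcing the function: mbox_first_headers Inbox.mbox > result.csv. Because only the first match is kept, header-like lines further down in a message (such as the quoted headers of a bounced mail) are ignored whenever the real header already supplied that field.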

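Big caveat: this does not handle header line continuations (folded lines), which are common in Internet mail headers, nor escaped lines.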
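One way to deal with folded headers is to unfold them before extracting fields. The sketch below is only an illustration under stated assumptions: the function name unfold_headers is hypothetical, a header block is assumed to run from a "From " separator line to the next blank line, and the leading whitespace of each continuation line is collapsed to a single space. Escaped lines are not handled.

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption): inside a header block,
# join lines that start with a space or tab onto the preceding line.
unfold_headers() {
    awk '
    /^From / { inhdr = 1 }   # separator line: a header block starts
    /^$/     { inhdr = 0 }   # blank line: the header block ends
    # continuation line inside a header: append it to the pending line
    inhdr && have && /^[ \t]/ {
        sub(/^[ \t]+/, "")
        pending = pending " " $0
        next
    }
    # any other line: release the pending line and hold the new one
    {
        if (have) print pending
        pending = $0
        have = 1
    }
    END { if (have) print pending }
    ' "$@"
}
```

The function reads the named files, or standard input when none are given, so its output can be piped into whichever extraction script is used. Body lines are passed through unchanged because they come after the blank line that ends the header.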

Edit: the code is available in a gist.