如何将换行符分隔的半结构化数据转换为结构化数据

时间:2015-07-02 16:35:20

标签: python bash text

我有一个data set,其中每条记录都用一个空行分隔:

P http://codeproject.com/kb/silverlight/convertsilverlightcontrol.aspx
T 2008-08-01 00:00:00
Q how to create property binding in a visual webgui silverlight control
Q videoplayer silverlight controls videoplayer videoplayer silverlight controls version 1 0 0 0 culture neutral publickeytoken null
Q videoplayer controls videoplayer videoplayer controls

P http://wallstreetexaminer.com/?p=2987
T 2008-08-01 00:00:01

P http://news.bbc.co.uk/go/rss/-/1/hi/scotland/highlands_and_islands/7535558.stm
T 2008-08-01 00:00:01
Q our continuing strategic priority is to provide a safe and efficient group of airports while pursuing development opportunities which improve the air transport network serving the region
Q our results for the year demonstrate that we have delivered against these targets and ensured that our airports have continued to play a central role in the economic and social life of the highlands and islands and tayside

P http://news.bbc.co.uk/go/rss/-/1/hi/scotland/south_of_scotland/7535392.stm
T 2008-08-01 00:00:01
Q safeguard our fishing communities birthright for future generations
Q every time i visit a fishing community in scotland i am asked to take steps to protect fishing rights for future generations

P http://news.bbc.co.uk/go/rss/-/1/hi/scotland/north_east/7535090.stm

每一行都被视为一个单独的"字段"。某些字段对同一记录多次出现。这是数据" schema"。你可以看到短语或" Q"可以多次出现在同一记录中。

该行的第一个字母编码:

  • P:文档的网址
  • T:帖子的时间(时间戳)
  • Q:从文档文本中提取的短语
  • L:文档中的超链接(链接指向网络上的其他文档)

如何将此数据集转换为可以更轻松地运行聚合,分组,计数,区分,统计的表单?

理想情况下,将此数据集转换为csv或json格式是最有帮助的,这样我就可以使用已构建的工具(如bash / python / R / mongodb)来进行文本挖掘/处理/自然语言处理。

1 个答案:

答案 0 :(得分:0)

以下代码获取给定格式的文件并生成csv。同一短语的多个外观与管道连接。

#!/usr/bin/perl -wl


my @t = qw(P T Q L);
my %h = ();

while (<>) {


    if (/^$/ .. /^$/) {

        my $r = "";

        foreach my $k (@t) {

            $r .= ',' if length($r) > 0;
            $r .= join('|', @{$h{$k}}) if exists($h{$k});           
        }

        print $r;
        %h = ();

    } else {

        if (/^([PTQL])\s+(.*)$/) {

            $h{$1} = () unless exists($h{$1});  
            push @{$h{$1}}, $2;
        }
    }
}

显然,不是Python解决方案和有点天真的方法,因为在字段中没有使用过的分隔符(逗号和管道)。