Using an XML parser on a non-XML file?

Asked: 2014-08-14 16:10:12

Tags: xml regex perl

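I have a bunch of LTE CDRs that, once decoded, look and feel just like XML but aren't (I'm not sure of the exact differences, but the format is hierarchical, similar to XML). I've copied one of the entries below. Each file contains 50 or 60 entries.

My goal is to search for entries that match an IP address (the HEX below) and a time range, and then associate the IMSI with them. Those fields are shown below.

Fields I'm searching: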

...
<servedIMSI>13 91 03 00 00 00 10 F8</servedIMSI>
...
<servedPDPAddress>
        <iPAddress>
            <iPBinaryAddress>
                <iPBinV4Address>0A 37 00 11</iPBinV4Address>
            </iPBinaryAddress>
        </iPAddress>
    </servedPDPAddress>
...
<timeOfFirstUsage>14 02 04 04 09 40 2D 06 00</timeOfFirstUsage>
<timeOfLastUsage>14 02 04 04 12 44 2D 06 00</timeOfLastUsage>
...

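I tried using XML tools, but since this isn't XML, they didn't work.

I'm wondering whether there is a better way to search for and retrieve the data I want. I could use regular expressions to find the data, but an XML approach seems like the better way (even though this isn't XML). I'm open to any ideas!

Snippet of a CDR: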

<GPRSRecord>
    <egsnPDPRecord>
        <recordType>70</recordType>
        <servedIMSI>13 91 03 00 00 00 10 F8</servedIMSI>
        <ggsnAddress>
            <iPBinaryAddress>
                <iPBinV4Address>AB CD 72 62</iPBinV4Address>
            </iPBinaryAddress>
        </ggsnAddress>
        <chargingID>126400647</chargingID>
        <sgsnAddress>
                <iPBinaryAddress>
                    <iPBinV4Address>AB CD 72 62</iPBinV4Address>
                </iPBinaryAddress>

        </sgsnAddress>
        <accessPointNameNI><bs/>Internet<si/>syringawireless<etx/>com</accessPointNameNI>
        <pdpType>01 21</pdpType>
        <servedPDPAddress>
            <iPAddress>
                <iPBinaryAddress>
                    <iPBinV4Address>0A 37 00 11</iPBinV4Address>
                </iPBinaryAddress>
            </iPAddress>
        </servedPDPAddress>
        <dynamicAddressFlag><true/></dynamicAddressFlag>
        <listOfTrafficVolumes>
            <ChangeOfCharCondition>
                <dataVolumeGPRSUplink>192323</dataVolumeGPRSUplink>
                <dataVolumeGPRSDownlink>320043</dataVolumeGPRSDownlink>
                <changeCondition><recordClosure/></changeCondition>
                <changeTime>14 02 04 04 12 46 2D 06 00</changeTime>
                <userLocationInformation>01 13 01 39 01 86 BD 01</userLocationInformation>
            </ChangeOfCharCondition>
        </listOfTrafficVolumes>
        <recordOpeningTime>14 02 04 04 09 40 2D 06 00</recordOpeningTime>
        <duration>186</duration>
        <causeForRecClosing>16</causeForRecClosing>
        <recordSequenceNumber>26784</recordSequenceNumber>
        <nodeID>1</nodeID>
        <localSequenceNumber>8858562</localSequenceNumber>
        <apnSelectionMode><mSorNetworkProvidedSubscriptionVerified/></apnSelectionMode>
        <servedMSISDN>91 02 98 99 00 81</servedMSISDN>
        <chargingCharacteristics>01 00</chargingCharacteristics>
        <chChSelectionMode><sGSNSupplied/></chChSelectionMode>
        <sgsnPLMNIdentifier>13 01 39</sgsnPLMNIdentifier>
        <servedIMEISV>53 97 04 40 81 57 80 00</servedIMEISV>
        <rATType>6</rATType>
        <userLocationInformation>01 13 01 39 01 86 BD 01</userLocationInformation>
        <listOfServiceData>
            <ChangeOfServiceCondition>
                <ratingGroup>1</ratingGroup>
                <localSequenceNumber>1</localSequenceNumber>
                <timeOfFirstUsage>14 02 04 04 09 40 2D 06 00</timeOfFirstUsage>
                <timeOfLastUsage>14 02 04 04 12 44 2D 06 00</timeOfLastUsage>
                <serviceConditionChange>
                    00000000000000000000000010000000
                </serviceConditionChange>
                <sgsn-Address>
                    <iPBinaryAddress>
                        <iPBinV4Address>AB CD 72 62</iPBinV4Address>
                    </iPBinaryAddress>
                </sgsn-Address>
                <sGSNPLMNIdentifier>13 01 39</sGSNPLMNIdentifier>
                <datavolumeFBCUplink>192323</datavolumeFBCUplink>
                <datavolumeFBCDownlink>320043</datavolumeFBCDownlink>
                <timeOfReport>14 02 04 04 12 46 2D 06 00</timeOfReport>
                <rATType>6</rATType>
                <userLocationInformation>01 13 01 39 01 86 BD 01</userLocationInformation>
            </ChangeOfServiceCondition>
        </listOfServiceData>
    </egsnPDPRecord>
</GPRSRecord>    

3 Answers:

Answer 0: (score: 4)

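XML parsers exist to parse well-formed XML. If your XML is not well-formed, they will usually fail, and typically messily.

But your XML appears to be well-formed, so I'd use XML::Twig, a personal favourite.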

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

sub extractIMSI {
    my ( $twig, $servedIMSI ) = @_;
    print $servedIMSI -> text(),"\n";
    $twig -> purge(); #why I like XML::Twig - it lets you clear memory on the fly
}

my $parser = XML::Twig -> new ( twig_handlers => { 'servedIMSI' => \&extractIMSI } );

$parser -> parsefile ( 'test.xml' );

This works, at any rate, if 'test.xml' contains your sample data.
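One possible reason XML tools reject decoder output like this is that a file holding many top-level GPRSRecord entries has no single root element. That is an assumption about the real files, not something the sample shows. A minimal sketch of wrapping the raw text in a root element before parsing follows; the helper name wrap_in_root and the element name root are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap raw decoder output in one root element so that
# an XML parser sees a single document. The element name 'root' is arbitrary.
sub wrap_in_root {
    my ($raw) = @_;
    $raw =~ s/\A\s*<\?xml[^>]*\?>\s*//;   # drop an XML declaration if one is present
    $raw =~ s/\s+\z//;                    # trim trailing whitespace
    return "<root>\n$raw\n</root>\n";
}

# Two back-to-back records stand in for a real CDR file here.
my $raw = "<GPRSRecord><recordType>70</recordType></GPRSRecord>\n"
        . "<GPRSRecord><recordType>70</recordType></GPRSRecord>\n";

my $wrapped = wrap_in_root($raw);
print $wrapped;
```

The wrapped string can then be handed to the parser's string interface, for example $parser->parse($wrapped) with XML::Twig, instead of parsefile.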

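Answer 1: (score: 3)

This short Perl program processes a file called GPRSRecord.xml that contains the data you show in your question, wrapped in a <root>...</root> element. It extracts the fields you say you are interested in from each egsnPDPRecord element that it finds. Obviously, in this case, there is only one.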

use strict;
use warnings;

use XML::LibXML;

my $xml = XML::LibXML->load_xml(location => 'GPRSRecord.xml');

for my $pdp_rec ($xml->findnodes('/root/GPRSRecord/egsnPDPRecord')) {

  my ($imsi_address) = $pdp_rec->findnodes('servedIMSI');
  printf "%s: %s\n", $imsi_address->nodeName, $imsi_address->textContent;

  my ($ip_v4_address) = $pdp_rec->findnodes('servedPDPAddress/iPAddress/iPBinaryAddress/iPBinV4Address');
  printf "%s: %s\n", $ip_v4_address->nodeName, $ip_v4_address->textContent;

  my ($service_condition) = $pdp_rec->findnodes('listOfServiceData/ChangeOfServiceCondition');
  my ($first_usage)       = $service_condition->findnodes('timeOfFirstUsage');
  my ($last_usage)        = $service_condition->findnodes('timeOfLastUsage');
  printf "%s: %s\n", $first_usage->nodeName, $first_usage->textContent;
  printf "%s: %s\n", $last_usage->nodeName, $last_usage->textContent;

}

Output

servedIMSI: 13 91 03 00 00 00 10 F8
iPBinV4Address: 0A 37 00 11
timeOfFirstUsage: 14 02 04 04 09 40 2D 06 00
timeOfLastUsage: 14 02 04 04 12 44 2D 06 00
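To go from these raw fields to the stated goal (match an IP address and a time range, then report the IMSI), the hex strings have to be decoded first. The sketch below assumes the usual 3GPP encodings: the IMSI is TBCD (digits swapped within each octet, F as filler), and the timestamps are BCD YY MM DD hh mm ss followed by an ASCII sign byte and a UTC offset. It also assumes years in the 2000s. All helper names are hypothetical, and the record hash stands in for values extracted by the program above.

```perl
use strict;
use warnings;

# "0A 37 00 11" -> "10.55.0.17": each hex octet becomes one decimal byte.
sub hex_to_ipv4 {
    my ($hex) = @_;
    return join '.', map { hex($_) } split ' ', $hex;
}

# Assumption: TBCD, so the two digits of every octet are swapped
# and a trailing F is only filler.
sub tbcd_to_digits {
    my ($hex) = @_;
    my $digits = join '', map { scalar reverse($_) } split ' ', $hex;
    $digits =~ s/F+\z//i;
    return $digits;
}

# Assumption: YY MM DD hh mm ss, then an ASCII sign byte (2B '+' or 2D '-')
# and the UTC offset as hh mm. The century "20" is assumed as well.
sub timestamp_to_string {
    my ($hex) = @_;
    my ($yy, $mo, $dd, $hh, $mi, $ss, $sign, $oh, $om) = split ' ', $hex;
    return sprintf '20%s-%s-%s %s:%s:%s %s%s:%s',
        $yy, $mo, $dd, $hh, $mi, $ss, chr(hex($sign)), $oh, $om;
}

# True if the record used $ip at some point within [$from, $to].
# Times are compared as local-time strings, which is only valid while
# all records and both bounds share the same UTC offset.
sub record_matches {
    my ($rec, $ip, $from, $to) = @_;
    return 0 unless hex_to_ipv4($rec->{ip}) eq $ip;
    my $first = substr timestamp_to_string($rec->{first}), 0, 19;
    my $last  = substr timestamp_to_string($rec->{last}),  0, 19;
    return ($first le $to and $last ge $from) ? 1 : 0;
}

# Raw values copied from the sample record in the question.
my %rec = (
    imsi  => '13 91 03 00 00 00 10 F8',
    ip    => '0A 37 00 11',
    first => '14 02 04 04 09 40 2D 06 00',
    last  => '14 02 04 04 12 44 2D 06 00',
);

if (record_matches(\%rec, '10.55.0.17', '2014-02-04 04:00:00', '2014-02-04 05:00:00')) {
    print tbcd_to_digits($rec{imsi}), "\n";   # prints 311930000000018
}
```

Inside the findnodes loop, the same check could be applied to the textContent of each field before printing. If records with different UTC offsets must be compared, the timestamps would need converting to a common zone first.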

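Answer 2: (score: 1)

A stateful loop in Perl can do the job quite easily, with the caveat that much of the work an XML parser does for you, such as handling entries that span multiple lines, would have to be replicated here to cope with any file that does not match the sample text. Something like:
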
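use strict;
use warnings;

open(my $infile, '<', 'MyCDRFile.nxm') or die "Cannot open MyCDRFile.nxm: $!";

# Tag names to look for. A hash is initialised with parentheses, not braces.
my %searches = (
  rec_start => 'egsnPDPRecord',
  imsi      => 'servedIMSI',
  ip        => 'iPBinV4Address',
  firsttime => 'timeOfFirstUsage',
  lasttime  => 'timeOfLastUsage',
);
my %finds;
my $imsi   = '';
my $in_pdp = 0;    # true while inside <servedPDPAddress>...</servedPDPAddress>

while (my $line = <$infile>) {
  chomp($line);

  # The opening and the closing record tag both match here, so a finished
  # record is printed when its closing tag is reached.
  if (index($line, $searches{rec_start}) > -1) {
    print "[$imsi, "
        . join(',', map { $_ // '' } @finds{qw(ip firsttime lasttime)})
        . "]\n" if $imsi ne '';
    $imsi  = '';
    %finds = ();
  }
  if (index($line, $searches{imsi}) > -1) {
    # split takes the pattern first, then the string to split.
    $imsi = (split(/\Q$searches{imsi}\E/, $line))[1];
    $imsi =~ s![<>/]!!g;
  }
  # iPBinV4Address also occurs under ggsnAddress and sgsnAddress, so only
  # accept it while inside servedPDPAddress.
  if (index($line, 'servedPDPAddress') > -1) {
    $in_pdp = index($line, '</') > -1 ? 0 : 1;
  }
  foreach my $search (qw(ip firsttime lasttime)) {
    next if $search eq 'ip' and not $in_pdp;
    if ($imsi ne '' and index($line, $searches{$search}) > -1) {
      $finds{$search} = (split(/\Q$searches{$search}\E/, $line))[1];
      $finds{$search} =~ s![<>/]!!g;
    }
  }
}

close($infile);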

Printing to a separate file, reading from STDIN, and so on can all be added to this fairly easily.