如何在Perl中提取和解析引用的字符串?

时间:2009-12-22 10:40:59

标签: regex perl parsing

美好的一天。

我的文字文件内容如下。 tmp.txt(一个非常大的文件)

constant fixup private AlarmFileName = <A "C:\\TMP\\ALARM.LOG">  /* A Format */

constant fixup ConfigAlarms = <U1 0>         /*  U1 Format  */

constant fixup ConfigEvents = <U2 0>         /*  U2 Format  */

我的解析代码如下。 代码无法在此处理C:\\TMP\\ALARM.LOG(带引号的字符串)。 我不知道如何替换代码“s +([a-zA-Z0-9])+&gt;”处理[a-zA-Z0-9](上面的0)字符串和quated字符串(上面的“C:\ TMP \ ALARM.LOG”)。

$source_file = "tmp.txt";
$dest_xml_file = "my.xml";

#Check existance of root directory
open(SOURCE_FILE, "$source_file") || die "Fail to open file $source_file";
open(DEST_XML_FILE, ">$dest_xml_file") || die "Coult not open output file $dest_xml_file";

$x = 0;

print DEST_XML_FILE  "<!-- from tmp.txt-->\n";
while (<SOURCE_FILE>) 
{
    &ConstantParseAndPrint;

}

sub ConstantParseAndPrint
{
 if ($x == 0)
 {

     if(/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+([a-zA-Z0-9])+>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/)
                {
                    $name1 = $1;
                    $name2 = $2;
                    $name3 = $3;
                    $name4 = $4;
                    $name5 = $5;
                    $name6 = $6;
                    $name7 = $7;
                    printf DEST_XML_FILE "\t\t$name1";
                    printf DEST_XML_FILE "\t\t$name2";
                    printf DEST_XML_FILE "\t\t$name3";
                    printf DEST_XML_FILE "\t\t$name4";
                    printf DEST_XML_FILE "\t\t$name5";
                    printf DEST_XML_FILE "\t\t$name6";
                    printf DEST_XML_FILE "\t\t$name7";
                    $x = 1;
  }
 }
}

感谢您的投入。

** HELLO ALL,

感谢您提供了许多出色的解决方案。我是一个新人,我想根据你的帖子做更多的研究。

感谢很多。**

6 个答案:

答案 0 :(得分:10)

我不打算为你写你的正则表达式,或者给你一些剪切和粘贴到你的代码中的东西。按照你的正则表达式的速度,无论如何它将在下一个特殊案例中打破。我会给你的是一个更好的方法。

将每一行拆分为作业的右侧和左侧。

my($lhs, $rhs) = split m{\s* = \s*}x, $line, 2;

现在单独使用它们要容易得多。您只需在空白处拆分信息即可从左侧提取信息以获取所有标志(常量,修正等...),最后一个字将是分配给的名称。

my @flags = split /\s+/, $lhs;
my $name  = pop @flags;

然后,如果需要,您可以通过@flags过滤行。

可以很容易地得到大概在括号内的值。使用非贪婪的正则表达式可确保它正确处理foo = <bar> /* comment <stuff> */

之类的内容
my($value) = $rhs =~ /<(.*?)>/;

正如您从这种方法中看到的那样,它避免了必须猜测文件中可能出现的特殊关键字(常量,修正,私有)。

我不知道这个文件中还有什么,你没说。

答案 1 :(得分:3)

您的代码中存在一些主要的设计缺陷。我没有解决你的问题,但我已经清理了你的代码。

最重要的是,不要使用全局变量。在相对较短的代码块中,您使用3个全局变量。这是一个无法追查的神秘错误。随着您的项目随着时间的推移变大,这将成为一个更大的问题。

使用Perl::Critic。它将帮助您改进代码。

以下是代码的带注释,已清理的版本:

# Always use strict and warnings.
# It prevents bugs.
use strict;
use warnings;

my $source_file   = "tmp.txt";
my $dest_xml_file = "my.xml";

# You aren't checking the existence of anyting here:
#Check existance of root directory 
# Is this a TODO item?

# Use 3 argument open with a lexical filehandle.
# Adding $! to your error messages makes them more useful.
open my $source_fh, '<', $source_file
    or die "Fail to open file $source_file - $!";

open( my $dest_fh, '>', $dest_xml_file 
    or die "Coult not open output file $dest_xml_file - $!";

my $x = 0;  # What the heck does this do?  Give it a meaningful name or
            # delete it.

print $dest_fh  "<!-- from tmp.txt-->\n";
while (my $line = <$source_fh>)   
{

    # Don't use global variables.
    # Explicitly pass all data your sub needs.
    # Any values that need to be applied to external 
    # data should be applied by the calling function,
    # from data that is returned.

    $x = ConstantParseAndPrint( $line, $x, $dest_fh );

}

sub ConstantParseAndPrint {
    my $line          = shift;
    my $mystery_value = shift;
    my $fh            = shift;

    if($mystery_value == 0) {

        # qr{} is a handy way to build a regex.
        # using {} instead of // to mark the boundaries helps
        # cut down on the escaping required when your pattern
        # contains the '/' character.

        # Use the x regex modifier to allow whitespace and 
        # comments in your regex.
        # This very is important when you can't avoid using a big, complex regex.

        # But really don't do it this way at all.
        # Do what Schwern says.
        my $line_match = qr{
            ^                      \s*  # Skip leading spaces
            (constant)             \s*  # look for the constant keyword
            (fixup|/\*fixup\*/|)   \s*  # look for the fixup keyword
            (private|)             \s*  # look for the prive keyword
            (\w+)                  \s+  # Get parameter name
            =                      \s+  
            <                           # get bracketed values
            ([a-zA-Z0-9]+)         \s+  # First value 
            ([a-zA-Z0-9])+              # Second value
            >                      \s*
            (/\*\s*(.*?)\s*\*/|)        # Find any trailing comment
            (\r|\n|\s)                  # Trailing whitespace
        }x;


        if( $line =~ /$line_match/ ) {

            # Any time you find yourself making variables
            # with names like $foo1, $foo2, etc, use an array.

            my @names = ( $1, $2, $3, $4, $5, $6, $7 );

            # printf is for printing formatted data.  
            # If you aren't using any format codes, use print.

            # Using an array makes it easy to print all the tokens.
            print $fh "\t\t$_" for @names;

            $mystery_value = 1;

        }
    }

    return $mystery_value;
}

关于你的解析问题,请遵循Schwern的建议。大而复杂的正则表达式是您需要简化的标志。将重大问题分解为可管理的任务。

答案 2 :(得分:2)

答案 3 :(得分:1)

#!/usr/bin/perl


$source_file = "tmp.txt";
$dest_xml_file = "my.xml";

#Check existance of root directory
open(SOURCE_FILE, "$source_file") || die "Fail to open file $source_file";
open(DEST_XML_FILE, ">$dest_xml_file") || die "Coult not open output file $dest_xml_file";

$x = 0;

print DEST_CS_FILE  "<!-- from tmp.txt-->\n";
while (<SOURCE_FILE>)   
{
    &ConstantParseAndPrint;

}

sub ConstantParseAndPrint
{
    if ($x == 0)
    {

#        if(/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+([a-zA-Z0-9])+>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/)
        if(/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+(["']?)([a-zA-Z0-9.:\\]+)\6>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/)

                {
                    $name1 = $1;
                    $name2 = $2;
                    $name3 = $3;
                    $name4 = $4;
                    $name5 = $5;
                    $name6 = $7;
                    $name7 = $8;
                    printf DEST_XML_FILE "\t\t$name1";
                    printf DEST_XML_FILE "\t\t$name2";
                    printf DEST_XML_FILE "\t\t$name3";
                    printf DEST_XML_FILE "\t\t$name4";
                    printf DEST_XML_FILE "\t\t$name5";
                    printf DEST_XML_FILE "\t\t$name6";
                    printf DEST_XML_FILE "\t\t$name7\n";
#                    $x = 1;
        }
    }
}



使用以下解析代码:

if(/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+(["']?)([a-zA-Z0-9.:\\]+)\6>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/) 

我添加了单引号和双引号的处理。我使用反向引用来匹配引号。我还更新了路径的字符类。即它现在包括冒号(:),点(。)和反斜杠()以及字母数字字符。

答案 4 :(得分:1)

我故意删除了匹配捕获(如果需要,可以添加它们):

m{^\s*constant\s+fixup\s+(?:private\s+)?\w+\s*=\s*<[^>]+>(?:\s*/\*(?:\s*\w*)+\*/)?$};

答案 5 :(得分:1)

首先统一!

$yourstring =~ s,\\,/,g;  # transform '\' into '/'
$yourstring =~ s,/+,/,g;  # transform multiple '/' into one '/'