Reading and parsing data quickly

Time: 2011-12-13 21:23:11

Tags: c# parsing filesize

So far I have been using this code to open a file, read it into a list, and parse that list into a string[]:

string CP4DataBase =
    "C:\\Program\\Line Balancer\\FUJI DB\\KTS\\KTS - CP4 - Part Data Base.txt";
CP4DataBaseRTB.LoadFile(CP4DataBase, RichTextBoxStreamType.PlainText);
string[] splitCP4DataBaseLines = CP4DataBaseRTB.Text.Split('\n');
List<string> tempCP4List = new List<string>();
string[] line1CP4Components;

foreach (var line in splitCP4DataBaseLines)
    tempCP4List.Add(line + Environment.NewLine);

string concattedUnitPart = "";
foreach (var line in tempCP4List)
{
    concattedUnitPart = concattedUnitPart + line;
    line1CP4PartLines++;
}
line1CP4Components = new Regex("\"UNIT\",\"PARTS\"", RegexOptions.Multiline)
                    .Split(concattedUnitPart)
                    .Where(c => !string.IsNullOrEmpty(c)).ToArray();

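I would like to know whether there is a faster way to do this. This is only one of the files I open, so the same steps are repeated at least 5 times to open the files and load the lists correctly.

The smallest file currently imported is 257 KB. The largest is 1,803 KB. These files will only grow over time, because they are used to simulate a database and users will keep adding to them.

So my question is: is there a faster way to do everything the code above does?

EDIT: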

***CP4***
"UNIT","PARTS"
"BLOCK","HEADER-"
    "NAME","106536"
    "REVISION","0000"
    "DATE","11/09/03"
    "TIME","11:10:11"
    "PMABAR",""
    "COMMENT",""
    "PTPNAME","R160805"
    "CMPNAME","R160805"
"BLOCK","PRTIDDT-"
    "PMAPP",1
    "PMADC",0
    "ComponentQty",180
"BLOCK","PRTFORM-"
    "PTPSZBX",1.60
    "PTPSZBY",0.80
    "PTPMNH",0.25
    "NeedGlue",0
"BLOCK","TOLEINF-"
    "PTPTLBX",0.50
    "PTPTLBY",0.40
    "PTPTLCL",10
    "PTPTLPX",0.30
    "PTPTLPY",0.30
    "PTPTLPQ",30
"BLOCK","ELDT+"     "PGDELSN","PGDELX","PGDELY","PGDELPP","PGDELQ","PGDELP","PGDELW","PGDELL","PGDELWT","PGDELLT","PGDELCT","PGDELR"
    0,0.000,0.000,0,0,0.000,0.000,0.000,0.000,0.000,0.000,0
"BLOCK","VISION-"
    "PTPVIPL",0
    "PTPVILCA",0
    "PTPVILB",0
    "PTPVICVT",10
    "PENVILIT",0
"BLOCK","ENVDT"
    "ELEMENT","CP43ENVDT-"
        "PENNMI",1.0
        "PENNMA",1.0
        "PENNZN",""
        "PENNZT",1.0
        "PENBLM",12
        "PENCRTS",0
        "PENSPD1",100
        "PTPCRDCT",0
        "PENVICT",1
        "PCCCRFT",1
"BLOCK","CARRING-"
    "PTPCRAPO",0
    "PTPCRPCK",0
    "PTPCRPUX",0.00
    "PTPCRPUY",0.00
    "PTPCRRCV",0
"BLOCK","PACKCLS-"
    "FDRTYPE","Emboss"
    "TAPEWIDTH","8mm"
    "FEEDPITCH",4
    "REELDIAMETER",0
    "TAPEDEPTH",0.0
    "DOADVVACUUM",0
    "CHKBEFOREFEED",0
    "TAPEARMLENGTH",0
    "PPCFDPP",0
    "PPCFDEC",4
    "PPCMNPT",30
"UNIT","PARTS"
"BLOCK","HEADER-"
    "NAME","106653"
    "REVISION","0000"
    "DATE","11/09/03"
    "TIME","11:10:42"
    "PMABAR",""
    "COMMENT",""
    "PTPNAME","0603R"
    "CMPNAME","0603R"
"BLOCK","PRTIDDT-"
    "PMAPP",1
    "PMADC",0
    "ComponentQty",18
"BLOCK","PRTFORM-"
    "PTPSZBX",1.60
    "PTPSZBY",0.80
    "PTPMNH",0.23
    "NeedGlue",0
"BLOCK","TOLEINF-"
    "PTPTLBX",0.50
    "PTPTLBY",0.34
    "PTPTLCL",0
    "PTPTLPX",0.60
    "PTPTLPY",0.40
    "PTPTLPQ",30
"BLOCK","ELDT+"     "PGDELSN","PGDELX","PGDELY","PGDELPP","PGDELQ","PGDELP","PGDELW","PGDELL","PGDELWT","PGDELLT","PGDELCT","PGDELR"
    0,0.000,0.000,0,0,0.000,0.000,0.000,0.000,0.000,0.000,0
"BLOCK","VISION-"
    "PTPVIPL",0
    "PTPVILCA",0
    "PTPVILB",0
    "PTPVICVT",10
    "PENVILIT",0
"BLOCK","ENVDT"
    "ELEMENT","CP43ENVDT-"
        "PENNMI",1.0
        "PENNMA",1.0
        "PENNZN",""
        "PENNZT",1.0
        "PENBLM",12
        "PENCRTS",0
        "PENSPD1",80
        "PTPCRDCT",0
        "PENVICT",1
        "PCCCRFT",1
"BLOCK","CARRING-"
    "PTPCRAPO",0
    "PTPCRPCK",0
    "PTPCRPUX",0.00
    "PTPCRPUY",0.00
    "PTPCRRCV",0
"BLOCK","PACKCLS-"
    "FDRTYPE","Emboss"
    "TAPEWIDTH","8mm"
    "FEEDPITCH",4
    "REELDIAMETER",0
    "TAPEDEPTH",0.0
    "DOADVVACUUM",0
    "CHKBEFOREFEED",0
    "TAPEARMLENGTH",0
    "PPCFDPP",0
    "PPCFDEC",4
    "PPCMNPT",30

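...the file goes on like this... and will only get bigger.

The regex splits the text at every "UNIT","PARTS" line, so each element of the string[] holds one record: everything from one "UNIT","PARTS" up to the next. After that, I check each element to see whether its "NAME" value exists in a different list. If it does, I write that "UNIT","PARTS" record to the end of a text file.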
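For illustration only (this is not the asker's code), the NAME check described above could be sketched as follows. The class and method names are hypothetical, and the sketch assumes each record contains a line of the form "NAME","value" as in the sample data. A HashSet<string> is used for the other list so that each lookup does not scan the whole list.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NameFilter
{
    // Matches a line such as:     "NAME","106536"   and captures the value.
    // The opening quote right after the indentation keeps "PTPNAME" and
    // "CMPNAME" lines from matching.
    private static readonly Regex NameLine =
        new Regex("^\\s*\"NAME\",\"([^\"]*)\"", RegexOptions.Multiline);

    // Returns the NAME value of a record, or null when the record has none.
    public static string GetName(string record)
    {
        Match m = NameLine.Match(record);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Keeps only the records whose NAME appears in wantedNames.
    public static List<string> SelectByName(IEnumerable<string> records,
                                            HashSet<string> wantedNames)
    {
        var selected = new List<string>();
        foreach (string record in records)
        {
            string name = GetName(record);
            if (name != null && wantedNames.Contains(name))
                selected.Add(record);
        }
        return selected;
    }

    public static void Main()
    {
        string record = "\"BLOCK\",\"HEADER-\"\n    \"NAME\",\"106536\"\n    \"PTPNAME\",\"R160805\"\n";
        var wanted = new HashSet<string> { "106536" };
        Console.WriteLine(SelectByName(new[] { record }, wanted).Count); // prints 1
    }
}
```

The selected records could then be appended to the output file, for example with File.AppendAllText.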

3 Answers:

Answer 0 (score: 1)

This bit is a potential performance killer:

string concattedUnitPart = "";
foreach (var line in tempCP4List)
{
    concattedUnitPart = concattedUnitPart + line;
    line1CP4PartLines++;
}

(See this article for why.) Use a StringBuilder for repeated concatenation:

// No need to use tempCP4List at all
StringBuilder builder = new StringBuilder();
foreach (var line in splitCP4DataBaseLines)
{
    builder.AppendLine(line);
    line1CP4PartLines++;
}
string concattedUnitPart = builder.ToString();

Or even just:

string concattedUnitPart = string.Join(Environment.NewLine,
                                       splitCP4DataBaseLines);

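Now, the regex part may well be slow too - I'm not sure. It's not obvious what you are trying to achieve, whether you need regular expressions at all, or whether you really need to do the whole thing in one go. Can you definitely not just process it line by line?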
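A minimal sketch of what line-by-line processing might look like, assuming each record starts with a line that is exactly "UNIT","PARTS" as in the sample data. The class and method names are hypothetical. Unlike the original Regex.Split call, this version drops anything before the first marker (such as the ***CP4*** header) and never builds one large string for the whole file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class UnitPartsReader
{
    // Marker line that starts each record in the sample data.
    private const string RecordStart = "\"UNIT\",\"PARTS\"";

    // Groups lines into records. A record is the text between one marker
    // line and the next; the marker line itself is not included.
    public static List<string> ReadRecords(IEnumerable<string> lines)
    {
        var records = new List<string>();
        StringBuilder current = null;

        foreach (string line in lines)
        {
            if (line.Trim() == RecordStart)
            {
                // A new record begins: store the one collected so far.
                if (current != null) records.Add(current.ToString());
                current = new StringBuilder();
            }
            else if (current != null)
            {
                current.AppendLine(line);
            }
        }
        if (current != null) records.Add(current.ToString());
        return records;
    }

    public static void Main()
    {
        // With a real file, File.ReadLines(path) could be passed instead;
        // it yields lines lazily rather than loading the whole file.
        var sample = new[]
        {
            "***CP4***",
            "\"UNIT\",\"PARTS\"",
            "\"BLOCK\",\"HEADER-\"",
            "    \"NAME\",\"106536\"",
            "\"UNIT\",\"PARTS\"",
            "\"BLOCK\",\"HEADER-\"",
            "    \"NAME\",\"106653\""
        };
        Console.WriteLine(ReadRecords(sample).Count); // prints 2
    }
}
```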

Answer 1 (score: 1)

You can get the same output list 'line1CP4Components' using the following:

Regex StripEmptyLines = new Regex(@"^\s*$", RegexOptions.Multiline);
Regex UnitPartsMatch = new Regex(@"(?<=\n)""UNIT"",""PARTS"".*?(?=(?:\n""UNIT"",""PARTS"")|$)", RegexOptions.Singleline);

string CP4DataBase =
"C:\\Program\\Line Balancer\\FUJI DB\\KTS\\KTS - CP4 - Part Data Base.txt";
CP4DataBaseRTB.LoadFile(CP4DataBase, RichTextBoxStreamType.PlainText);

List<string> line1CP4Components = new List<string>(
    UnitPartsMatch.Matches(StripEmptyLines.Replace(CP4DataBaseRTB.Text, ""))
        .OfType<Match>()
        .Select(m => m.Value)
    );

return line1CP4Components.ToArray();

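You can leave out StripEmptyLines, but your original code does this via Where(c => !string.IsNullOrEmpty(c)). Your original code also causes the '\r' part of each "\r\n" newline pair to be duplicated. I assumed that was an accident rather than intentional?

Also, you don't seem to use 'line1CP4PartLines', so I omitted creating that value. It seems inconsistent with the later removal of empty lines, so I guess you don't depend on it. If you do need the value, a simple regex can tell you how many lines are in the string: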

int linecount = new Regex("^", RegexOptions.Multiline).Matches(CP4DataBaseRTB.Text).Count;
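If the regex is not wanted here, a plain loop can produce a line count as well. This is a sketch with a hypothetical helper name; it counts '\n' characters and adds one for the first line, so a trailing newline counts as starting one more (empty) line.

```csharp
using System;

public static class LineCounter
{
    // Counts lines as: number of '\n' characters, plus one for the first line.
    public static int CountLines(string text)
    {
        int count = 1;
        foreach (char c in text)
        {
            if (c == '\n') count++;
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountLines("a\nb\nc")); // prints 3
    }
}
```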

Answer 2 (score: 0)

// An example based on your code

string CP4DataBase = "C:\\Program\\Line Balancer\\FUJI DB\\KTS\\KTS - CP4 - Part Data Base.txt";
List<string> Cp4DataList = new List<string>(File.ReadAllLines(CP4DataBase));
//or create a Dictionary<int,string[]> object

    string strData = string.Empty;//hold the line item data which is read in line by line
    string[] strStockListRecord = null;//string array that holds information from the TFE_Stock.txt file
    Dictionary<int, string[]> dctStockListRecords = null; //dictionary object that will hold the KeyValuePair of text file contents in a DictList
    List<string> lstStockListRecord = null;//Generic list that will store all the lines from the .prnfile being processed
    if (File.Exists(strExtraLoadFileLoc + strFileName))
    {
        try
        {
            lstStockListRecord = new List<string>();
            List<string> lstStrLinesStockRecord = new List<string>(File.ReadAllLines(strExtraLoadFileLoc + strFileName));
            dctStockListRecords = new Dictionary<int, string[]>(lstStrLinesStockRecord.Count);
            int intLineCount = 0;
            foreach (string strLineSplit in lstStrLinesStockRecord)
            {
                lstStockListRecord.Add(strLineSplit);
                dctStockListRecords.Add(intLineCount, lstStockListRecord.ToArray());
                lstStockListRecord.Clear();
                intLineCount++;
            }//foreach (string strlineSplit in lstStrLinesStockRecord)
            lstStrLinesStockRecord.Clear();
            lstStrLinesStockRecord = null;
            lstStockListRecord.Clear();
            lstStockListRecord = null;
        }
        catch (IOException ex)
        {
            // Reading the file failed; report the error (adjust to your application)
            Console.Error.WriteLine(ex.Message);
        }
    }

//Alter the code to fit what you are doing..