How can I efficiently extract meaningful data from a PDF with a set structure?

Time: 2015-03-17 21:55:37

Tags: c# regex winforms pdf itextsharp

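I have been working on an application that helps manage the MIS department of a health management organisation. It is my first commercial software, and I would be thrilled if I could get this solved this week. Here is the problem: the MIS department receives a PDF four times a year. This PDF contains two pieces of information:

a) A list of all hospitals registered under the organisation.

b) A list of the enrollees registered at each hospital.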

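My task is to write a program that retrieves all the hospitals in the PDF and registers them in the application's database, then retrieves all the enrollees and registers them under their respective hospitals (managed with a foreign key relationship in the database).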

I wrote a solution using Regex that registers all the hospitals. Apart from some delay while parsing the PDF (which is 4,000 pages long), it works well.

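The problem is that my solution for registering the enrollees does not work well: my code is inefficient, and about 2 in every 10 enrollees are not registered.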

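When I moved my partially working solution to the client's server, where it will eventually live, I got an error saying "Source could not be found". But when I run it in debug mode to check what the problem might be, it extracts the enrollee details as expected. So I am confused by this.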

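If I could get help with a) the "Source could not be found" error or b) why my code runs on my development machine but not on the server, I would be very grateful.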

I will include my code, and I would also include a snapshot of the PDF, but I doubt Stack will let me attach it.

Thanks.

private void extractEnrolleesFromPDF(string enrolleeExtraction, string hospital)
    {
        int start;
        int end;
        string substring;

        try
        {
            MatchCollection policyNumbers = Regex.Matches(enrolleeExtraction, @"(\*)(\d{8})(\*)");

            foreach (Match policyNumber in policyNumbers)
            {
                //Regex.Escape keeps both asterisks literal when the policy number is searched for again
                Match match = Regex.Match(enrolleeExtraction, Regex.Escape(policyNumber.Value));
                if (match.Success)
                {
                    //Store the first occurrence of the enrollee's policy number
                    start = match.Index;

                    Match match2 = Regex.Match(enrolleeExtraction.Substring(start + 10), @"(\*)");
                    if (match2.Success)
                    {
                        end = match2.Index + 9;

                        substring = enrolleeExtraction.Substring(start, end);

                        enrolleePolicyNumber.Add(substring);
                    }                        
                }
            }
            //The search uses the pattern *--------* at the beginning and end to mark where an
            //enrollee's records begin and end. The last enrollee in a hospital has no closing marker,
            //so take everything from the last policy number to the end of the text.
            //This must run before the loop below: inside the loop the record was appended on
            //every pass and never parsed, because the count had already been taken.
            Match lastPolicyNumberInHospital = Regex.Match(enrolleeExtraction, @"(\*)(\d{8})(\*)", RegexOptions.RightToLeft);
            if (lastPolicyNumberInHospital.Success)
            {
                enrolleePolicyNumber.Add(enrolleeExtraction.Substring(lastPolicyNumberInHospital.Index));
            }

            //Extract enrollee data and insert it into the database

            ArrayList individualEnrolees = new ArrayList();

            int numberOfEnrollees = enrolleePolicyNumber.Count;
            bool principal = false;
            string fName;
            string lName;
            DateTime dob;
            string sex;
            string hospitalCode = hospital.Substring(1, 7);
            for (int i = 0; i < numberOfEnrollees; i++)
            {
                string enrolleePolNumber;
                Match policyNumber = Regex.Match(enrolleePolicyNumber[i].ToString(), @"((\*)(\d{8})(\*))");
                if (policyNumber.Success)
                {
                    enrolleePolNumber = policyNumber.Value;
                }
                MatchCollection enrolleeRecords = Regex.Matches(enrolleePolicyNumber[i].ToString(), @"(\d{1})(\s)(\D*)(\d{2})/(\d{2})/(\d{4})");

                //Empty the array list each time to avoid going over the same records over and over again
                individualEnrolees.Clear();

                foreach (var record in enrolleeRecords)
                {
                    individualEnrolees.Add(record);
                }

                foreach (var record in individualEnrolees)
                {
                    string princ;

                    string[] splitEnrolleeData = record.ToString().Split(' ');

                    //splitSectionCount is the number of space-separated sections in the record
                    int splitSectionCount = splitEnrolleeData.Length;

                    //if we have five sections then we expect the Principal or Spouse label to be
                    //at index 1
                    if (splitSectionCount == 5)
                    {
                        princ = splitEnrolleeData[1].ToString();
                        if (princ == "Principal")
                        {
                            principal = true;
                        }
                        else
                        {
                            principal = false;
                        }
                    }
                    //if we have four sections then the Principal or Spouse label is merged with
                    //the serial number at index 0, so we inspect that section instead:
                    //a section containing "0" is treated as the principal
                    else if (splitSectionCount == 4)
                    {
                        if (splitEnrolleeData[0].ToString().Contains("0"))
                        {
                            principal = true;
                        }
                        else if (!splitEnrolleeData[0].ToString().Contains("0"))
                        {
                            principal = false;
                        }
                    }
                    //TO-DO: Eliminate this comment block if the else-if above works properly
                    //princ = splitEnrolleeData[1].ToString();
                    //if (princ == "Principal")
                    //{
                    //    principal = true;
                    //}
                    //else
                    //{
                    //    principal = false;
                    //}

                    enrolleePolNumber = policyNumber.Value.Substring(1, policyNumber.Value.Length - 2);

                    //if we have 5 sections as expected carry on and register the enrollee as usual
                    //if not, if we have 4 do something else
                    //this is because some enrollees in the NHIS PDF aren't split properly, returning
                    //4 items instead of 5
                    if (splitSectionCount == 5)
                    {
                        lName = splitEnrolleeData[2].ToString();
                        fName = splitEnrolleeData[3].ToString();
                        dob = Convert.ToDateTime(splitEnrolleeData[4].ToString());
                        hosp = getHospitalID(hospitalCode);
                        if (principal == true)
                        {
                            if (checkIfPolicyNumberExists(enrolleePolNumber) == false)
                            {
                                registerEnrollee(enrolleePolNumber, fName, lName, dob, hosp.ToString());
                            }
                        }
                        else if (principal == false)
                        {
                            if (checkIfDependantPolicyNumberExists(enrolleePolNumber, fName) == false)
                            {
                                registerDependant(enrolleePolNumber, fName, lName, dob, hosp.ToString());
                            }
                        }
                    }
                    else if (splitSectionCount == 4)
                    {
                        lName = splitEnrolleeData[1].ToString();
                        fName = splitEnrolleeData[2].ToString();
                        dob = Convert.ToDateTime(splitEnrolleeData[3].ToString());
                        hosp = getHospitalID(hospitalCode);
                        if (principal == true)
                        {
                            if (checkIfPolicyNumberExists(enrolleePolNumber) == false)
                            {
                                registerEnrollee(enrolleePolNumber, fName, lName, dob, hosp.ToString());
                            }
                        }
                        else if (principal == false)
                        {
                            if (checkIfDependantPolicyNumberExists(enrolleePolNumber, fName) == false)
                            {

                                    registerDependant(enrolleePolNumber, fName, lName, dob, hosp.ToString());

                                //else if (!parentExists(enrolleePolNumber))
                                //{

                                //}
                            }
                        }
                    }
                }
            }
        }
        catch (Exception ex)
        {
            MetroFramework.MetroMessageBox.Show(this, "Error retrieving substring: " + ex.Message);
        }

    }
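
The marker-based splitting described in the comments above can be sketched more compactly by walking the policy-number matches once and cutting the text at each match index, so the last enrollee (which has no closing marker) is picked up without special handling. This is only a sketch under assumptions: the class and method names (`EnrolleeBlocks`, `Split`, `ParseRecords`) are hypothetical, the record layout is inferred from the regex in the question, and the `dd/MM/yyyy` date format is a guess that should be checked against the real PDF text.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class EnrolleeBlocks
{
    // Same marker as in the question: an asterisk, eight digits, an asterisk.
    private static readonly Regex PolicyMarker = new Regex(@"\*(\d{8})\*");

    // Assumed record layout: one serial digit, free text (label and names),
    // then a date. Named groups avoid depending on Split(' ') section counts.
    private static readonly Regex Record =
        new Regex(@"(?<serial>\d)\s+(?<text>\D*?)\s*(?<dob>\d{2}/\d{2}/\d{4})");

    // Returns one (policy number, body) pair per marker. Each body runs up to the
    // next marker, or to the end of the input for the last marker.
    public static List<KeyValuePair<string, string>> Split(string text)
    {
        var blocks = new List<KeyValuePair<string, string>>();
        MatchCollection markers = PolicyMarker.Matches(text);
        for (int i = 0; i < markers.Count; i++)
        {
            int bodyStart = markers[i].Index + markers[i].Length;
            int bodyEnd = (i + 1 < markers.Count) ? markers[i + 1].Index : text.Length;
            blocks.Add(new KeyValuePair<string, string>(
                markers[i].Groups[1].Value,
                text.Substring(bodyStart, bodyEnd - bodyStart)));
        }
        return blocks;
    }

    // Returns (serial, text, date of birth) for every record in a block body.
    // Records whose date does not parse as dd/MM/yyyy (an assumed format) are skipped.
    public static List<Tuple<string, string, DateTime>> ParseRecords(string body)
    {
        var records = new List<Tuple<string, string, DateTime>>();
        foreach (Match m in Record.Matches(body))
        {
            DateTime dob;
            if (DateTime.TryParseExact(m.Groups["dob"].Value, "dd/MM/yyyy",
                    CultureInfo.InvariantCulture, DateTimeStyles.None, out dob))
            {
                records.Add(Tuple.Create(
                    m.Groups["serial"].Value, m.Groups["text"].Value.Trim(), dob));
            }
        }
        return records;
    }
}
```

Each pair returned by `Split` could then be passed to `ParseRecords`, with the existing `registerEnrollee` and `registerDependant` calls made per record. Logging the records that fail to parse, rather than skipping them silently, would show which of the missing enrollees come from malformed lines.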

0 answers:

No answers