Extracting data from a Word document and inserting it into a SQL database

Date: 2015-04-27 09:36:12

Tags: c# sql asp.net scripting ms-word

Sample Word document

A 1. Name of House: Aasleagh Lodge
Townland: Srahatloe
Near: Killary Harbour, Leenane
Status/Public Access: maintained, private fishing lodge
Date Built: 1838-1850, burnt 1923, rebuilt 1928
Description: Large Victorian country house. Original house 6-bay, 2-storey, 3-bay section on right is higher; after fire house was reduced in size giving current three parallel- hipped roof bays. 
Associated Families: Lord Sligo; rented - Hon David Plunkett ; Capt W.E. and Constance Mary Phillips; James Leslie Wanklyn M.P. for Bradford; Walter H. Maudslay; Ernest Richard Hartley; Alice Marsh, Lord and Lady Brabourne; Western Fisheries Board; Inland Fisheries Ireland.

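Is there a way to insert the data that follows each heading? For example, wherever "Townland" appears in the Word document, I want to insert the text after it into a database column; in this case that is "Srahatloe". I want to extract all of this data from the Word document. It is for a website I am building: all the information is stored in Word documents, but I need to get the text into the database without copying and pasting, because the document is very large (70,000+ words). Is there a script that can do this?

Source code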

var wordApp = new Microsoft.Office.Interop.Word.Application();
var wordDoc = wordApp.Documents.Open(@"C:\Users\mhoban\Documents\Book.docx");
var txt = wordDoc.Content.Text;
wordDoc.Close(false);   // close without saving
wordApp.Quit();         // otherwise a WINWORD.EXE process keeps running

// Word separates paragraphs with '\r', so the value ends at the first line break
var regex = new Regex(@"(Townland\: )(.+?)[\r\n]");
var allMatches = regex.Matches(txt);

// Open one connection and reuse one command for all rows
using (var con = new SqlConnection(ConfigurationManager.ConnectionStrings["ConnectionString"].ToString()))
using (var com = new SqlCommand("INSERT INTO Houses (Townland) VALUES (@town)", con))
{
    var town = com.Parameters.Add("@town", SqlDbType.NVarChar);
    con.Open();

    foreach (Match match in allMatches)
    {
        // Insert the captured value into the database
        town.Value = match.Groups[2].Value;
        com.ExecuteNonQuery();
    }
}

2 answers:

Answer 0 (score: 0)

This calls out for a regular expression. Something like this should get you started:

var wordApp = new Microsoft.Office.Interop.Word.Application();
var wordDoc = wordApp.Documents.Open(pathToYourDocument);
var txt = wordDoc.Content.Text;
var regex = new Regex(@"(Townland\: )(.+?)[\r\n]");
var allMatches = regex.Matches(txt);
foreach (Match match in allMatches)
{
    var townValue = match.Groups[2].Value;
    // townValue now holds "Srahatloe"
    // do your magic
}
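
The same idea can be stretched to cover every label in the sample rather than only "Townland". The following is a minimal sketch, not a tested solution for the real 70,000-word document: the class name `HouseParser` and the method `Parse` are made up for illustration, the label list is copied from the sample in the question, and it assumes that every record begins with "Name of House" and that no label text appears inside a value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HouseParser
{
    // Labels copied from the sample document in the question.
    private static readonly string[] Labels =
    {
        "Name of House", "Townland", "Near", "Status/Public Access",
        "Date Built", "Description", "Associated Families"
    };

    // Matches "<label>: <value>" up to the end of the paragraph.
    // Word's Content.Text ends paragraphs with '\r', so the value
    // stops at the first '\r' or '\n'.
    private static readonly Regex FieldRegex = new Regex(
        "(?<label>" +
        string.Join("|", Array.ConvertAll(Labels, l => Regex.Escape(l))) +
        @"):[ \t]*(?<value>[^\r\n]*)");

    // Returns one dictionary per house, keyed by label.
    // Assumption: a new record starts at each "Name of House".
    public static List<Dictionary<string, string>> Parse(string text)
    {
        var records = new List<Dictionary<string, string>>();
        Dictionary<string, string> current = null;

        foreach (Match m in FieldRegex.Matches(text))
        {
            string label = m.Groups["label"].Value;
            if (label == "Name of House" || current == null)
            {
                current = new Dictionary<string, string>();
                records.Add(current);
            }
            current[label] = m.Groups["value"].Value.Trim();
        }
        return records;
    }
}
```

Passing `wordDoc.Content.Text` to `HouseParser.Parse` would give one dictionary per house, and each dictionary could then feed a parameterized INSERT with one parameter per column, in the same way the question's code handles `@town`.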

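Answer 1 (score: 0)

Here is the code I used to extract specific text from Word documents.

I eventually switched to regular expressions, which were much faster, but I no longer have that code. In any case, here is how to extract text from Word and write it to a CSV file.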

Please remember to install the Office Primary Interop Assemblies (PIA) on the development PC for Office automation.

To add a reference to Microsoft.Office.Interop.Word, go to Visual Studio -> right-click References -> COM -> Microsoft Word 14.0 Object Library (sorry, I don't have access to my work PC, so I can't attach a screenshot).

using System;
using System.IO;
using Word = Microsoft.Office.Interop.Word;

namespace ConsoleApplication2
{
    class Program
    {
        static void Main(string[] args)
        {
            string month = "July2014";
            string delimiter = ",";
            string csvPath = "C:\\Temp\\" + month + ".csv";
            string[] files = Directory.GetFiles("C:\\temp\\" + month);

            // Write the header row once, replacing any file left over from an earlier run.
            string[] header = { "School Name", "Student Name", "Id", "ReportDate" };
            File.WriteAllText(csvPath, string.Join(delimiter, header) + Environment.NewLine);

            foreach (var file in files)
            {
                // Only .doc files are processed.
                if (!file.ToLower().EndsWith(".doc"))
                    continue;

                var id = string.Empty;
                var studentName = string.Empty;
                var school = string.Empty;
                var reportDate = string.Empty;

                var word = new Word.Application();
                try
                {
                    var doc = word.Documents.Open(new FileInfo(file).FullName);
                    Console.WriteLine("Processing: " + file.ToLower());

                    // Word collections are 1-based.
                    for (int i = 1; i <= doc.Paragraphs.Count; i++)
                    {
                        string text = doc.Paragraphs[i].Range.Text;

                        if (text.StartsWith("School:"))
                            school = Clean(text, "School:");
                        if (text.StartsWith("Student Names:"))
                            studentName = Clean(text, "Student Names:");
                        if (text.StartsWith("xx Id:"))
                            id = Clean(text, "xx Id:");
                        if (text.StartsWith("Date of Report:"))
                            reportDate = Clean(text, "Date of Report:");
                    }
                    doc.Close(false); // close without saving
                }
                catch (Exception ex)
                {
                    Console.WriteLine("Error occurred in " + file.ToLower() + ": " + ex.Message);
                }
                finally
                {
                    word.Quit();
                }

                // Append exactly one row per document. Values are not quoted,
                // so a comma inside a value would shift the columns.
                string[] row = { school, studentName, id, reportDate };
                File.AppendAllText(csvPath, string.Join(delimiter, row) + Environment.NewLine);
            }
            Console.WriteLine("Finished");
            Console.ReadLine();
        }

        // Removes the label, the trailing paragraph/cell marks, and surrounding whitespace.
        static string Clean(string text, string label)
        {
            return text.Replace("\r\a", "").Replace(label, "").Trim();
        }
    }
}
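
One caveat with joining values by a bare comma: values such as "Killary Harbour, Leenane" from the question's sample contain commas themselves and would split into extra columns. Below is a small sketch of the usual quoting convention (the one described in RFC 4180). The `Csv` class and its `Escape` and `Row` methods are hypothetical helpers written for this illustration, not part of the answer's original code or of any library.

```csharp
using System;
using System.Linq;

// Hypothetical helper for writing CSV rows safely.
public static class Csv
{
    // Wraps a field in double quotes when it contains a comma, a quote,
    // or a line break, and doubles any embedded quotes.
    public static string Escape(string field)
    {
        if (field == null) return string.Empty;
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes
            ? "\"" + field.Replace("\"", "\"\"") + "\""
            : field;
    }

    // Builds one comma-separated row from the given fields.
    public static string Row(params string[] fields)
    {
        return string.Join(",", fields.Select(f => Escape(f)));
    }
}
```

With this in place, `Csv.Row("Srahatloe", "Killary Harbour, Leenane")` produces `Srahatloe,"Killary Harbour, Leenane"`, and the result could be appended to the file in place of the plain `string.Join` call.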
