Reading a large text file is very slow

Date: 2018-11-28 13:11:35

Tags: vb.net visual-studio visual-studio-2017 streamreader

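I've been given the task of writing a VB program that reads a large .txt file (anywhere from 500 MB to 2 GB). Each line normally starts with a 13-digit number, followed by other information (e.g. "1578597500548 info info info info etc."). The user enters a 13-digit number, my program searches the large file for that number at the start of each line, and every matching line is written in full to a new .txt file.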

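My current program works, but I noticed that the StreamReader section, which adds every line to a list, takes about 90% of the processing time: roughly 27 seconds per run on average. Any ideas on how to speed it up? This is what I've written.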

Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click
    Dim wtr As IO.StreamWriter
    Dim listy As New List(Of String)
    Dim i = 0

    stpw.Reset()
    stpw.Start()

    'reading in file of large data 700mb and larger
    Using Reader As New StreamReader("G:\USER\FOLDER\tester.txt")
        While Reader.EndOfStream = False
            listy.Add(Reader.ReadLine)
        End While
    End Using

    'have a textbox which finds user query number
    Dim result = From n In listy
                 Where n.StartsWith(TextBox1.Text)
                 Select n

    'writes results found into new file
    wtr = New StreamWriter("G:\USER\searched-number.txt")
    For Each word As String In result
        wtr.WriteLine(word)
    Next
    wtr.Close()

    stpw.Stop()
    Debug.WriteLine(stpw.Elapsed.TotalMilliseconds)

    Application.Exit()
End Sub

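UPDATE: Following some suggestions, I no longer put the lines into a list first and instead check each line as it is read. That is about 5 seconds faster, but it still takes 23 seconds to complete, and it writes out the line above the number I'm searching for. Could you tell me where I'm going wrong? Thanks all!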

wtr = New StreamWriter("G:\Karl\searchednumber.txt")
        Using Reader As New StreamReader("G:\Karl\AC\tester.txt")
            While Reader.EndOfStream = False
                lineIn = Reader.ReadLine
                If Reader.ReadLine.StartsWith(TextBox1.Text) Then
                    wtr.WriteLine(lineIn)

                Else

                    Continue While
                End If
            End While
            wtr.Close()
        End Using

1 answer:

Answer 0 (score: 1)

Index the file when the program loads.

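Create a Dictionary(Of ULong, Long) and read through the file once when the program loads. For each line, add an entry to the dictionary, using the 13-digit value at the start of the line as the ULong key and the line's position in the file stream as the Long value.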

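Then, when the user enters a key, you can check the dictionary, which is nearly instant, and seek to the exact position on disk that you need.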

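Building the index at program start may take some time, but you only have to do it once. Right now you either have to search the whole file every time the user wants to run a search, or keep hundreds of megabytes of text file data in memory. Once you have the index, looking up a value in the dictionary and then seeking directly to that position should be nearly instant.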
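As an aside on the first of those two options (scanning the whole file on every search): the loop in the question's update calls Reader.ReadLine twice per iteration, once to fill lineIn and once more for the StartsWith test. That means it tests one line and writes the previous one, which would explain the "line above" symptom. Below is a minimal sketch of a single-pass search that reads each line exactly once. The module and function names (PrefixSearch, CopyMatchingLines) are made up for illustration, and the ordinal comparison is my own choice: it avoids culture-sensitive matching and is usually cheaper, but I have not measured it on your data.

```vb
Imports System.IO

Module PrefixSearch
    'Hypothetical helper: copies every line of inputPath that starts with
    'prefix to outputPath and returns the number of lines written.
    Public Function CopyMatchingLines(inputPath As String, outputPath As String, prefix As String) As Integer
        Dim count As Integer = 0
        Using writer As New StreamWriter(outputPath)
            'File.ReadLines streams the file lazily, one line at a time,
            'so the whole file is never held in memory.
            For Each line As String In File.ReadLines(inputPath)
                'Each line is read once and the same string is tested and written.
                If line.StartsWith(prefix, StringComparison.Ordinal) Then
                    writer.WriteLine(line)
                    count += 1
                End If
            Next
        End Using
        Return count
    End Function
End Module
```

From the button handler this could be called as CopyMatchingLines("G:\USER\FOLDER\tester.txt", "G:\USER\searched-number.txt", TextBox1.Text). It still reads the entire file on every search, so it fixes the wrong-line bug but not the underlying cost that the index is meant to remove.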


I just saw this comment:

"There can be more than one occurrence of a 13-digit number, so the whole file must be searched."

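Based on that, the index should be a Dictionary(Of ULong, List(Of Long)), where adding a value to an entry first creates the list instance if it doesn't exist yet and then adds the new value to the list.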

This is a basic attempt typed directly into the reply window, without the help of test data or Visual Studio, so it may still contain some bugs:

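Imports System.IO

Public Class MyFileIndexer
    Private initialCapacity As Integer = 1
    Private Property FilePath As String
    Private Index As Dictionary(Of ULong, List(Of Long))

    Public Sub New(filePath As String)
        Me.FilePath = filePath
        RebuildIndex()
    End Sub

    Public Sub RebuildIndex()
        Index = New Dictionary(Of ULong, List(Of Long))()

        'Scan raw bytes instead of using StreamReader here: StreamReader reads
        'ahead into a buffer, so BaseStream.Position does not match the start of
        'the next line. Assumes ASCII/UTF-8 text with no byte-order mark and
        'LF or CRLF line endings.
        Dim buffer(65535) As Byte
        Dim offset As Long = 0       'File offset of buffer(0)
        Dim lineStart As Long = 0    'File offset where the current line begins
        Dim key As ULong = 0         'Leading digits of the current line so far
        Dim digits As Integer = 0    'Count of leading digits; -1 = not a valid key

        Using fs As New FileStream(FilePath, FileMode.Open, FileAccess.Read)
            Dim count As Integer = fs.Read(buffer, 0, buffer.Length)
            While count > 0
                For i As Integer = 0 To count - 1
                    Dim b As Byte = buffer(i)
                    If b = 10 Then
                        'End of line: index it if it started with 13 digits
                        If digits = 13 Then AddPosition(key, lineStart)
                        lineStart = offset + i + 1
                        key = 0
                        digits = 0
                    ElseIf digits >= 0 AndAlso digits < 13 Then
                        If b >= 48 AndAlso b <= 57 Then
                            key = key * 10UL + CULng(b - 48)
                            digits += 1
                        Else
                            digits = -1
                        End If
                    End If
                Next
                offset += count
                count = fs.Read(buffer, 0, buffer.Length)
            End While
        End Using

        'The last line may not end with a newline
        If digits = 13 Then AddPosition(key, lineStart)
    End Sub

    'Adds a position to the key's list, creating the list on first use
    Private Sub AddPosition(key As ULong, position As Long)
        Dim item As List(Of Long) = Nothing
        If Not Index.TryGetValue(key, item) Then
            item = New List(Of Long)(initialCapacity)
            Index.Add(key, item)
        End If
        item.Add(position)
    End Sub

    'Expect key to be a 13-character numeric string
    Public Function Search(key As String) As List(Of String)
        'Will throw an exception if parsing fails. Be prepared for that.
        Dim realKey As ULong = ULong.Parse(key)
        Return Search(realKey)
    End Function

    Public Function Search(key As ULong) As List(Of String)
        Dim lines As List(Of Long) = Nothing
        If Not Index.TryGetValue(key, lines) Then Return Nothing

        Dim result As New List(Of String)()
        Using sr As New StreamReader(FilePath)
            For Each position As Long In lines
                sr.BaseStream.Seek(position, SeekOrigin.Begin)
                'Drop whatever the reader buffered before the seek
                sr.DiscardBufferedData()
                result.Add(sr.ReadLine())
            Next position
        End Using
        Return result
    End Function
End Class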

'Somewhere public, when your application starts up:
Public Index As New MyFileIndexer("G:\USER\FOLDER\tester.txt")

Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click
    Dim lines As List(Of String) = Nothing
    Try
        lines = Index.Search(TextBox1.Text)
    Catch
        'Do something here
    End Try

    If lines IsNot Nothing Then
        Using sw As New StreamWriter($"G:\USER\{TextBox1.Text}.txt")
            For Each line As String in lines
                 sw.WriteLine(line)
            Next 
        End Using
    End If
End Sub

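For fun, here is a generic version of the class that lets you supply your own key-selector function, so it can index any file that stores a key on each line. I think this could be generally useful for, e.g., larger csv data sets.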

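'Requires Imports System.IO at the top of the file
Public Class MyFileIndexer(Of TKey)
    Private initialCapacity As Integer = 1
    Private Property FilePath As String
    Private Index As Dictionary(Of TKey, List(Of Long))
    Private GetKey As Func(Of String, TKey)

    Public Sub New(filePath As String, keySelector As Func(Of String, TKey))
        Me.FilePath = filePath
        Me.GetKey = keySelector
        RebuildIndex()
    End Sub

    Public Sub RebuildIndex()
        Index = New Dictionary(Of TKey, List(Of Long))()

        'Read raw bytes and split lines manually so each recorded position is the
        'true byte offset of a line. (StreamReader reads ahead, so
        'BaseStream.Position does not line up with the lines it returns.)
        'Assumes UTF-8 text with no byte-order mark and LF or CRLF line endings.
        Dim buffer(65535) As Byte
        Dim offset As Long = 0      'File offset of buffer(0)
        Dim lineStart As Long = 0   'File offset where the current line begins

        Using fs As New FileStream(FilePath, FileMode.Open, FileAccess.Read), lineBytes As New MemoryStream()
            Dim count As Integer = fs.Read(buffer, 0, buffer.Length)
            While count > 0
                For i As Integer = 0 To count - 1
                    If buffer(i) = 10 Then
                        AddLine(lineBytes, lineStart)
                        lineStart = offset + i + 1
                    Else
                        lineBytes.WriteByte(buffer(i))
                    End If
                Next
                offset += count
                count = fs.Read(buffer, 0, buffer.Length)
            End While

            'The last line may not end with a newline
            If lineBytes.Length > 0 Then AddLine(lineBytes, lineStart)
        End Using
    End Sub

    'Decodes the buffered line, indexes it under its key, and resets the buffer
    Private Sub AddLine(lineBytes As MemoryStream, position As Long)
        Dim lineText As String = System.Text.Encoding.UTF8.GetString(lineBytes.GetBuffer(), 0, CInt(lineBytes.Length)).TrimEnd(ControlChars.Cr)
        lineBytes.SetLength(0)

        Dim key As TKey = GetKey(lineText)
        Dim item As List(Of Long) = Nothing
        If Not Index.TryGetValue(key, item) Then
            item = New List(Of Long)(initialCapacity)
            Index.Add(key, item)
        End If
        item.Add(position)
    End Sub

    Public Function Search(key As TKey) As List(Of String)
        Dim lines As List(Of Long) = Nothing
        If Not Index.TryGetValue(key, lines) Then Return Nothing

        Dim result As New List(Of String)()
        Using sr As New StreamReader(FilePath)
            For Each position As Long In lines
                sr.BaseStream.Seek(position, SeekOrigin.Begin)
                'Drop whatever the reader buffered before the seek
                sr.DiscardBufferedData()
                result.Add(sr.ReadLine())
            Next position
        End Using
        Return result
    End Function
End Class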