Question

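I'm currently using String#scan to search files for certain words or regular expressions. However, running the script over 1.9GB of data takes about 3 hours. I suspect this is due to the repeated use of .scan. The code looks much like the below; is there any way to improve the speed, even if it means not using scan?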

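The string 'text' is the text of the file in this example (50k words).
item.getCustomMetadata.putText() puts the result into a separate program.
The code is repeated because one array feeds pdCount and the other feeds idrCount.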

idNames = [/UK[0-9]{3,6}/ , /\s*[A-C,E,G-H,J-PR-T,W-Z]{2}(?:\s*\d\s*){6}[a-dfmA-DFM]?\s*/]
idCats = ["IG_EmpID" ,"IG_SSN" ]

idNames.each_with_index do |val, index|
    textScan = text.scan(val).size
    if textScan > idrHighest
        idrHighest = textScan
    end
    volume = volume + textScan
    item.getCustomMetadata.putText(idCats[index], textScan)
    if textScan != 0
        idrCount +=1
    end
end

pdRegNames = [/(\+44\s?7\d{3}|\(?07\d{3}\)?)\s?\d{3}\s?\d{3}/ , /020[0-9]{7}/ , /[a-zA-Z0-9_\.\-+]+@(infogov|gmail|hotmail|yahoo|outlook|aol|msn|verizon)(\.[a-z]{2,3}){1,2}/ , /(0[1-9]|[1-9]|[12][0-9]|3[01])[-\/ .](0[1-9]|[1-9]|1[012])[-\/ .](19|20)\d\d/ , /\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b/, /[A-Z0-9]{5}\d[0156]\d([0][1-9]|[12]\d|3[01])\d[A-Z0-9]{3}[A-Z]{2}/,/([A-PR-UWYZ][A-HK-Y0-9][A-HJKS-UW0-9]?[A-HJKS-UW0-9]?)\s*([0-9][ABD-HJLN-UW-Z]{2})/i]
pdRegCats = ["IG_Phone","IG_Phone2","IG_Email","IG_DOB" , "IG_FIP", "IG_License", "IG_Address"]

pdRegNames.each_with_index do |val, index|
    textScan = text.scan(val).size
    if textScan > pdHighest
        pdHighest = textScan
    end
    volume = volume + textScan
    item.getCustomMetadata.putText(pdRegCats[index], textScan)
    if textScan != 0
        pdCount +=1
    end
end

maritalNames = ["Married" , "Divorced" , "Civil Partnership"]

temp = volume
maritalNames.each do |val|
    textScan = text.scan(/#{val}/i).size
    if textScan > pdHighest
        pdHighest = textScan
    end
    volume = volume + textScan
    item.getCustomMetadata.putText("IG_Marital", (volume - temp).round(0))
    if textScan != 0
        pdCount +=1
    end
end

foundSort = text.scan(/[0-9]{2}-[0-9]{2}-[0-9]{2}/)
textScan = 0
foundSort.each do |sort|
    if sortArray.include? sort
        textScan +=1
    end
end
if textScan > pdHighest
    pdHighest = textScan
end
volume = volume + textScan
item.getCustomMetadata.putText("IG_Sort", textScan)
if textScan != 0
    pdCount +=1
end


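This runs once per file, so you can imagine how it adds up over millions of files. I looked into using threads, creating one per file, but that didn't work for me.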

Thanks for your help.

Answer 1

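OK, a few observations.

Most importantly: every call to scan() scans the entire text, and you are doing this in a loop. That's a bad idea.
Your regular expressions are quite complex, and regular expressions are not fast to begin with.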
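
One way to avoid rescanning the text once per pattern is to fold the patterns into a single alternation and walk the text once. The following is only a minimal sketch under assumptions: the `PATTERNS` table, the `g0`/`g1` group names, and the `count_categories` helper are hypothetical, and only two of the question's patterns are used. Whether this is actually faster depends on the patterns and the regex engine, so it should be benchmarked on real data.

```ruby
# Hypothetical category => pattern table (two patterns taken from the question).
PATTERNS = {
  "IG_EmpID"  => /UK[0-9]{3,6}/,
  "IG_Phone2" => /020[0-9]{7}/
}.freeze

CATEGORIES = PATTERNS.keys

# One alternation with a named group per category: (?<g0>...)|(?<g1>...).
# Interpolating a Regexp uses Regexp#to_s, which keeps each pattern's own flags.
COMBINED = Regexp.new(
  PATTERNS.values.each_with_index.map { |re, i| "(?<g#{i}>#{re})" }.join("|")
)

# Walk the text once and count which named group produced each match.
def count_categories(text)
  counts = Hash.new(0)
  text.scan(COMBINED) do
    m = Regexp.last_match
    idx = CATEGORIES.each_index.find { |i| m["g#{i}"] }
    counts[CATEGORIES[idx]] += 1
  end
  counts
end

p count_categories("UK1234 call 0201234567 or UK999")
```

Note that a single alternation assigns each stretch of text to at most one category, so the counts can differ from separate scans when patterns overlap.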

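Is your data file structured? Is the data delimited by tabs, pipes, or commas? Does the data usually appear in the same position on successive lines?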

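Can you split the file into smaller working sets, with a minimum size of 1 megabyte? That would make searching the string in memory noticeably faster.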
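
As a rough illustration of working on smaller pieces, the sketch below streams the input line by line instead of holding the whole text in one string. It assumes that no match spans a line break, which may not hold for patterns that allow `\s*` between digits. The `count_by_line` helper and the pattern table are hypothetical, and a `StringIO` stands in for a real file handle.

```ruby
require "stringio"

# Count matches per category while streaming the input one line at a time.
# Assumption: no match crosses a newline.
def count_by_line(io, patterns)
  counts = Hash.new(0)
  io.each_line do |line|
    patterns.each do |name, re|
      # Block form of scan avoids building an array of matches.
      line.scan(re) { counts[name] += 1 }
    end
  end
  counts
end

patterns = { "IG_EmpID" => /UK[0-9]{3,6}/, "IG_Phone2" => /020[0-9]{7}/ }
io = StringIO.new("UK1234 here\nnothing\n0201234567 and UK555\n")
p count_by_line(io, patterns)
```

With a real file, `File.open(path) { |f| count_by_line(f, patterns) }` would take the place of the `StringIO`.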

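Honestly, if I were in this situation, I would look for delimiters or some kind of record structure. If the data is at all regular, you can make assumptions and optimise based on them, then handle errors as they occur. If the data is completely free-form, the only option is to split the file into smaller units and process them with multiple threads.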
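
The multi-threaded approach can be sketched with a fixed pool of workers pulling from a `Queue`, rather than one thread per file. This is a sketch under assumptions, not a tuned implementation: `count_in_parallel` and its `workers:` parameter are hypothetical names, and the texts are passed in as strings for brevity. On MRI the global VM lock keeps CPU-bound regex work from running in parallel across threads, so a real speed-up would likely need JRuby or separate processes.

```ruby
# Count matches of one pattern across many texts using a small thread pool.
def count_in_parallel(texts, pattern, workers: 4)
  queue = Queue.new
  texts.each_with_index { |text, i| queue << [i, text] }

  threads = Array.new(workers) do
    Thread.new do
      local = []                       # per-thread results, no shared state
      loop do
        begin
          i, text = queue.pop(true)    # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end
        local << [i, text.scan(pattern).size]
      end
      local                            # becomes Thread#value
    end
  end

  # Thread#value joins each thread; restore the original input order.
  threads.flat_map(&:value).sort_by(&:first).map(&:last)
end

p count_in_parallel(["UK123 UK4567", "none", "UK000111"], /UK[0-9]{3,6}/)
```

Each worker keeps its own result list, so no locking is needed until the results are merged at the end.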

How to improve the efficiency and speed of my Ruby code
