Question

I have a very large text file (18,000,000 lines, 4 GB) and I want to pick some random lines from it. I wrote the code below to do this, but it is slow:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class Main {

    public static void main(String[] args) throws IOException {
        int sampleSize =3000;
        int fileSize = 18000000;
        int[] linesNumber = new int[sampleSize];
        Random r = new Random();
        for (int i = 0; i < linesNumber.length; i++) {
            linesNumber[i] = r.nextInt(fileSize);

        }
        List<Integer> list = Arrays.stream(linesNumber).boxed().collect(Collectors.toList());
        Collections.sort(list);

        BufferedWriter outputWriter = Files.newBufferedWriter(Paths.get("output.txt"));

        for (int i : list) {

            try (Stream<String> lines = Files.lines(Paths.get("huge_text_file"))) {
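                // Re-reads the file from the start for every sampled line number
                String en = lines.skip(i - 1).findFirst().get();

                outputWriter.write(en + "\n");
                // try-with-resources closes the stream, so no explicit close() is needed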

            } catch (Exception e) {
                System.err.println(e);

            }

        }
        outputWriter.close();


    }
}

Is there a more elegant and faster way to do this? Thanks.

Answer 1

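There are a few things I find troublesome about your current code.

You're currently loading the entire file into RAM. I don't know much about your sample file, but the one I used crashed my default JVM.
You're skipping the same lines over and over again, especially the early ones. That is very inefficient: the file is reopened and rescanned for every sampled line, so the total work is roughly O(sampleSize × fileSize) line reads and not a single pass. I'd be surprised if you could process even a 500 MB file with this approach.

Here's what I came up with: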

public static void main(String[] args) throws IOException {
    int sampleSize = 3000;
    int fileSize = 50000;
    int[] linesNumber = new int[sampleSize];
    Random r = new Random();
    for (int i = 0; i < linesNumber.length; i++) {
        linesNumber[i] = r.nextInt(fileSize);

    }
    List<Integer> list = Arrays.stream(linesNumber).boxed().collect(Collectors.toList());
    Collections.sort(list);

    long t1 = System.currentTimeMillis();
    // Needs java.io.BufferedReader and java.io.FileReader in addition to the imports from the question.
    try (BufferedReader reader = new BufferedReader(new FileReader("extremely large file.txt"));
         BufferedWriter outputWriter = Files.newBufferedWriter(Paths.get("localOutput/output.txt")))
    {
        int index = 0; // keep track of what item we're on in the list
        int currentIndex = 0; // keep track of what line we're on in the input file
        while (index < sampleSize) // while we still haven't finished the list
        {
            // readLine is fast. There may be faster ways to skip a line, but this is still plenty fast.
            String line = reader.readLine();
            if (line == null) // the file has fewer lines than fileSize, so stop early
                break;
            if (currentIndex == list.get(index)) // we've reached a sampled line
            {
                outputWriter.write(line);
                outputWriter.write("\n"); // readLine doesn't include the newline characters
                while (index < sampleSize && list.get(index) <= currentIndex) // needed in case of duplicates in the list
                    index++;
            }
            currentIndex++;
        }
    } catch (Exception e) {
        System.err.println(e);
    }
    System.out.println(String.format("Took %d milliseconds", System.currentTimeMillis() - t1));
}

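Running this against a 4.7 GB file with a sample size of 30 and a file size of 50000 took about 87 milliseconds; when I changed the sample size to 3000 it took about 91 milliseconds. When I increased the file size to 10,000 [...]. TL;DR of this paragraph: it scales well, and it keeps scaling well as the sample size grows.

To answer your question directly, "Is there a more elegant and faster way?": yes, there is. The faster way is to skip the lines yourself, not load the entire file into memory, and make sure to keep using buffered readers and writers. Also, I'd avoid trying to roll your own raw data buffers or anything like that. Just don't.

Feel free to step through the methods I've included if you want to learn more about how they work.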

Answer 2

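My first approach would be to look at RandomAccessFile in Java, cf. https://docs.oracle.com/javase/tutorial/essential/io/rafs.html. Random seeks will generally be much faster than reading the whole file, but you will then need to read byte by byte to get to the start of the next line (for example), then read that line byte by byte up to the next newline, and then seek to another random position.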

I'm not sure this approach would be more elegant (partly depends on how you code it, I guess), but I'd expect it to be faster.
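The seek-and-scan idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the answerer's code: the class name `RandomSeekSampler` and its methods are hypothetical, the file is assumed to be UTF-8 with `\n` line endings, and wrapping around to the first line when the seek lands inside the last line is one possible choice. Note that the result is not a uniform sample: a line that follows a long line is hit more often, and duplicates are possible.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper class, named here for illustration only.
class RandomSeekSampler {

    // Reads bytes up to (not including) the next '\n' or end of file.
    // Returns null when the file pointer is already at end of file.
    static String readLineBytes(RandomAccessFile file) throws IOException {
        int b = file.read();
        if (b == -1) {
            return null;
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        while (b != -1 && b != '\n') {
            buffer.write(b);
            b = file.read();
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Picks sampleSize lines by seeking to random byte offsets.
    static List<String> sample(String path, int sampleSize, Random random) throws IOException {
        List<String> result = new ArrayList<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            if (length == 0) {
                return result; // nothing to sample from
            }
            while (result.size() < sampleSize) {
                long offset = (long) (random.nextDouble() * length);
                file.seek(offset);
                if (offset > 0) {
                    readLineBytes(file); // discard the rest of the line we landed in
                }
                String line = readLineBytes(file);
                if (line == null) {
                    // We landed in the last line: wrap around to the first line (an assumption).
                    file.seek(0);
                    line = readLineBytes(file);
                }
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: RandomSeekSampler <file>");
            return;
        }
        for (String line : sample(args[0], 10, new Random())) {
            System.out.println(line);
        }
    }
}
```

Each sampled line costs one seek plus a byte-by-byte scan of about two lines, so the running time depends on the sample size and line length, not on the size of the file. The single-byte `read()` calls are unbuffered, which is acceptable for a few thousand lines but would be slow for very large samples.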

Answer 3

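I can't find an efficient way. The only thing I can think of is to use RandomAccessFile, seek to a random position, and then read the next two (?) characters into an array. Then search for the newlines and form a String.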
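A minimal sketch of that chunk-based variant is shown here, assuming a UTF-8 file with `\n` line endings. The names `ChunkLineReader`, `lineFromChunk`, `chunkSize`, and `maxAttempts` are hypothetical and chosen for illustration; the chunk size has to be picked larger than two typical lines, otherwise most attempts find no complete line. As written, this version can never return the first line of the file, or a last line that lacks a trailing newline.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Random;

// Hypothetical helper class, named here for illustration only.
class ChunkLineReader {

    // Returns the index of the first '\n' in data[from, to), or -1 if there is none.
    static int indexOfNewline(byte[] data, int from, int to) {
        for (int i = from; i < to; i++) {
            if (data[i] == '\n') {
                return i;
            }
        }
        return -1;
    }

    // Reads up to chunkSize bytes at offset and returns the first complete line
    // inside that chunk, or null if the chunk holds fewer than two newlines.
    static String lineFromChunk(RandomAccessFile file, long offset, int chunkSize) throws IOException {
        byte[] chunk = new byte[chunkSize];
        file.seek(offset);
        int filled = 0;
        // read() may return fewer bytes than requested, so loop until full or end of file
        while (filled < chunkSize) {
            int n = file.read(chunk, filled, chunkSize - filled);
            if (n == -1) {
                break;
            }
            filled += n;
        }
        int start = indexOfNewline(chunk, 0, filled);
        if (start == -1) {
            return null;
        }
        int end = indexOfNewline(chunk, start + 1, filled);
        if (end == -1) {
            return null;
        }
        return new String(chunk, start + 1, end - start - 1, StandardCharsets.UTF_8);
    }

    // Tries random offsets until a complete line is found or maxAttempts is reached.
    static String randomLine(String path, int chunkSize, Random random, int maxAttempts) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            for (int attempt = 0; attempt < maxAttempts && length > 0; attempt++) {
                long offset = (long) (random.nextDouble() * length);
                String line = lineFromChunk(file, offset, chunkSize);
                if (line != null) {
                    return line;
                }
            }
        }
        return null; // no complete line found within maxAttempts
    }
}
```

Reading one block per seek avoids the per-byte reads of a byte-by-byte scan, at the cost of having to guess a chunk size and retry when the guess is too small.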

Choosing random lines from a huge text file

3 answers: