Question

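My text file is larger than 50 GB. I want to remove the duplicate words from it, but I have heard that loading every word of the file into a hash set needs a great deal of RAM. Can you suggest a good way to remove every duplicate word from the text file? The words are separated by whitespace, like this: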

word1 word2 word3 ... ...

Answer 1

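The H2 answer is good, but it may be overkill. All the words in the English language will not take up more than a few MB, so just use a set. You could drop this into RAnders00's program.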

public static void read50Gigs(String fileLocation, String newFileLocation) {
    Set<String> words = new HashSet<>();
    try(FileInputStream fileInputStream = new FileInputStream(fileLocation);
        Scanner scanner = new Scanner(fileInputStream);) {

        while (scanner.hasNext()) {
            String nextWord = scanner.next();
            words.add(nextWord);
        }
        System.out.println("words size "+words.size());
        Files.write(Paths.get(newFileLocation), words, 
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);

    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

As a rough estimate of how many distinct words ordinary text contains, I ran this on War and Peace (from Project Gutenberg):

public static void read50Gigs(String fileLocation, String newFileLocation) {
    try {
        Set<String> words = Files.lines(Paths.get("war and peace.txt"))
                .map(s -> s.replaceAll("[^a-zA-Z\\s]", ""))
                .flatMap(Pattern.compile("\\s")::splitAsStream)
                .collect(Collectors.toSet());

        System.out.println("words size " + words.size());//22100
        Files.write(Paths.get("out.txt"), words,
                StandardOpenOption.CREATE, 
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);

    } catch (IOException e) {
        throw new RuntimeException(e); // don't silently swallow I/O failures
    }
}

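It finished in 0 seconds. You can't use Files.lines unless your huge source file contains line breaks. With line breaks, the file is processed line by line, so it won't use much memory.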
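
If you also want to keep the words in their original order and write them separated by spaces (the snippets above write the set in arbitrary order, one word per line), a streaming variant might look like the sketch below. It still assumes that the set of distinct words fits in memory, which is the premise of this answer. The class name StreamingDedup, the method name dedup, and the UTF-8 encoding are my own assumptions, not part of the original code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

final class StreamingDedup {

    // Hypothetical helper: copies each word to the output the first time it is seen
    // and returns the number of distinct words. Only the distinct words are kept in memory.
    static int dedup(Path in, Path out) throws IOException {
        Set<String> seen = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             Scanner scanner = new Scanner(reader);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            boolean first = true;
            while (scanner.hasNext()) {
                String word = scanner.next();
                // Set.add returns true only when the word was not in the set yet
                if (seen.add(word)) {
                    if (!first) {
                        writer.write(' ');
                    }
                    writer.write(word);
                    first = false;
                }
            }
            // Scanner hides read errors; rethrow the last one, if any
            IOException readError = scanner.ioException();
            if (readError != null) {
                throw readError;
            }
        }
        return seen.size();
    }
}
```

You would call it as StreamingDedup.dedup(Paths.get("50gigfile.txt"), Paths.get("cleaned50gigfile.txt")). Because Scanner splits on any whitespace, this works whether or not the input contains line breaks.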

Answer 2

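This approach uses a database to buffer the words it has already found.

It also assumes that words are the same regardless of case.

The H2 documentation states that the maximum size of a database on a non-FAT file system is 4 TB (with the default page size of 2 KB), which is more than enough for this purpose.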

package com.stackoverflow;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.sql.*;
import java.util.Scanner;

public class H2WordReading {

    public static void main(String[] args) {
//        read50Gigs("50gigfile.txt", "cleaned50gigfile.txt");
        read50Gigs("./testSmallFile", "./cleaned");
    }

    public static void read50Gigs(String fileLocation, String newFileLocation) {
        try (Connection connection = DriverManager.getConnection("jdbc:h2:./words");
             FileInputStream fileInputStream = new FileInputStream(fileLocation);
             Scanner scanner = new Scanner(fileInputStream);
             FileOutputStream fileOutputStream = new FileOutputStream(newFileLocation);
             OutputStreamWriter outputStreamWriter = new OutputStreamWriter(fileOutputStream)) {
            connection.createStatement().execute("DROP TABLE IF EXISTS WORDS;");
            connection.createStatement().execute("CREATE TABLE WORDS(WORD VARCHAR NOT NULL);");

            PreparedStatement insertStatement = connection.prepareStatement("INSERT INTO WORDS VALUES (?);");
            PreparedStatement queryStatement = connection.prepareStatement("SELECT * FROM WORDS WHERE UPPER(WORD) = UPPER(?);");

            while (scanner.hasNext()) {
                String nextWord = scanner.next();
                queryStatement.setString(1, nextWord);
                ResultSet resultSet = queryStatement.executeQuery();
                if (!resultSet.next())  // word not found, ok
                {
                    outputStreamWriter.write(scanner.hasNext() ? (nextWord + ' ') : nextWord);
                    insertStatement.setString(1, nextWord);
                    insertStatement.execute();
                } // word found, just don't write anything
            }

        } catch (IOException | SQLException e) {
            throw new RuntimeException(e);
        }
    }
}

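You need to add the H2 driver jar to the classpath.

Note that I have only tested this with a small file of about 10 words. You should try this approach with the 50 GB file and report any errors.

Note that this approach:

normalizes all whitespace and line breaks to a single space character
always keeps the first occurrence of a word and removes all later occurrences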
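
To make those two points (and the case-insensitivity mentioned earlier) concrete, here is a small in-memory sketch of the same behaviour on a short string. It is only an illustration, not the database code: the class and method names are made up, and Java's toUpperCase(Locale.ROOT) and the \s+ pattern only approximate H2's UPPER function and Scanner's notion of whitespace.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class FirstOccurrenceDemo {

    // Hypothetical helper: keeps the first spelling of each word (ignoring case)
    // and joins the survivors with single spaces.
    static String dedupIgnoreCase(String text) {
        // Key: upper-cased word; value: the spelling seen first. Insertion order is kept.
        Map<String, String> firstSeen = new LinkedHashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // only happens for blank input
            }
            firstSeen.putIfAbsent(word.toUpperCase(Locale.ROOT), word);
        }
        return String.join(" ", firstSeen.values());
    }
}
```

For example, dedupIgnoreCase("Foo  bar\nFOO Baz bar") gives "Foo bar Baz": the double space and the line break collapse to single spaces, and FOO and the second bar are dropped because an earlier spelling already exists.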

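The time this approach needs grows faster than linearly with the number of words in the file (roughly quadratically in the worst case rather than truly exponentially), because the WORDS table has no index and every lookup has to scan all the words stored so far.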
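
To get a feel for how the cost grows, here is a toy sketch that counts comparisons, on the assumption that each lookup scans every word stored so far (which is what I would expect from a table without an index). The names are made up, and a real database does more work per row than one string comparison, so treat the numbers as a shape, not a benchmark.

```java
import java.util.ArrayList;
import java.util.List;

final class ScanCostDemo {

    // Hypothetical helper: counts string comparisons when every lookup is a
    // linear scan over the words stored so far (mimicking an unindexed table).
    static long comparisonsWithLinearScan(String[] words) {
        List<String> stored = new ArrayList<>();
        long comparisons = 0;
        for (String word : words) {
            boolean found = false;
            for (String candidate : stored) {
                comparisons++;
                if (candidate.equalsIgnoreCase(word)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                stored.add(word); // first occurrence: remember it
            }
        }
        return comparisons;
    }
}
```

With n distinct words this does 0 + 1 + ... + (n - 1) = n(n - 1)/2 comparisons, so 4 distinct words cost 6 comparisons and 8 distinct words cost 28: doubling the input roughly quadruples the work. An index on the stored words (or a hash set, if the distinct words fit in memory) avoids the scan.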

Answer 3

// Remove duplicate words from a space-separated string (e.g. a file's contents),
// keeping the first occurrence of each word
    public String removeDupsFromFile(String str) {
        String[] words = str.split(" ");
        LinkedHashMap<String, Integer> map = new LinkedHashMap<String, Integer>();

        for (int i = 0 ; i < words.length ; i++) {
            if (map.containsKey(words[i])) {
                int count = map.get(words[i]) + 1;
                map.put(words[i], count);
            } else {
                map.put(words[i], 1);
            }
        }

        StringBuilder result = new StringBuilder("");
        Iterator<String> itr = map.keySet().iterator();
        while (itr.hasNext()) {
            result.append(itr.next() + " ");

        }
        return result.toString();
    }

Remove duplicate words from a large text file - Java

3 answers: