Question

My task is to get the word frequencies of this file:

test_words_file-1.txt:

The quick brown fox
Hopefully245this---is   a quick13947
task&&#%*for you to complete.
But maybe the tASk 098234 will be less
..quicK.
the the the the the the the the the the

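I have been trying to strip the symbols and digits from this file and get the frequency of each word in alphabetical order. The result is:

I can see that even though the numbers have been removed, they are still being counted. Can you explain why, and how I can fix this?

Also, how can I split "Hopefully245this---is" and store the three useful words "hopefully", "this", "is"?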


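import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

public class WordFreq2 {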
    public static void main(String[] args) throws FileNotFoundException {

        File file = new File("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt");
        Scanner scanner = new Scanner(file); 
        int maxWordLen = 0; 
        String maxWord = null;

        HashMap<String, Integer> map = new HashMap<>();
        while(scanner.hasNext()) {
            String word = scanner.next();
            word = word.toLowerCase();
            // text cleaning 
            word = word.replaceAll("[^a-zA-Z]+", "");

            if(map.containsKey(word)) {
                //if the word already exists
                int count = map.get(word)+1;
                map.put(word,count);
            }
            else {
                // The word is new 
                int count = 1;
                map.put(word, count);

                // Find the max length of Word
                if (word.length() > maxWordLen) {
                    maxWordLen = word.length();
                    maxWord = word;
                }
            }   
        }

        scanner.close();

        //HashMap unsorted, sort 
        TreeMap<String, Integer> sorted = new TreeMap<>();
        sorted.putAll(map);


        for (Map.Entry<String, Integer> entry: sorted.entrySet()) {
            System.out.println(entry);
        }

        System.out.println(maxWordLen+" ("+maxWord+")");
    }

}

Answer 1

Code first. The explanation follows the code.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordFreq2 {

    public static void main(String[] args) {
        Path path = Paths.get("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt");
        try {
            String text = Files.readString(path); // throws java.io.IOException
            text = text.toLowerCase();
            Pattern pttrn = Pattern.compile("[a-z]+");
            Matcher mtchr = pttrn.matcher(text);
            TreeMap<String, Integer> freq = new TreeMap<>();
            int longest = 0;
            while (mtchr.find()) {
                String word = mtchr.group();
                int letters = word.length();
                if (letters > longest) {
                    longest = letters;
                }
                if (freq.containsKey(word)) { 
                    freq.computeIfPresent(word, (w, c) -> Integer.valueOf(c.intValue() + 1));
                }
                else {
                    freq.computeIfAbsent(word, (w) -> Integer.valueOf(1));
                }
            }
            String format = "%-" + longest + "s = %2d%n";
            freq.forEach((k, v) -> System.out.printf(format, k, v));
            System.out.println("Longest = " + longest);
        }
        catch (IOException xIo) {
            xIo.printStackTrace();
        }
    }
}

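Because the sample file is small, I load the entire file contents into a String. (Files.readString requires Java 11 or later.)

I then convert the whole String to lowercase, because your definition of a word is a run of consecutive alphabetic characters, regardless of case.

The regular expression [a-z]+ matches one or more consecutive lowercase letters. (Remember that the whole String is now lowercase.)

Each call to find() locates the next word in the String (by the definition above, a run of consecutive lowercase letters).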
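
The question also asks why the numbers are still counted. My reading of the question's code is that Scanner.next() returns whitespace-delimited tokens, so "098234" is a token of its own; replaceAll reduces it to an empty string, and that empty string is still put into the map as a "word". The same cleaning glues "Hopefully245this---is" into one word instead of three. The sketch below contrasts the two approaches; the class and method names (TokenCleaningDemo, stripNonLetters, letterRuns) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenCleaningDemo {

    // The question's approach: take one whitespace-delimited token and delete every non-letter.
    public static String stripNonLetters(String token) {
        return token.toLowerCase().replaceAll("[^a-zA-Z]+", "");
    }

    // The approach used in this answer: every run of letters becomes its own word.
    public static List<String> letterRuns(String token) {
        List<String> words = new ArrayList<>();
        Matcher m = Pattern.compile("[a-z]+").matcher(token.toLowerCase());
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println("[" + stripNonLetters("098234") + "]");      // prints [] (an empty "word")
        System.out.println(stripNonLetters("Hopefully245this---is"));   // prints hopefullythisis
        System.out.println(letterRuns("098234"));                       // prints []
        System.out.println(letterRuns("Hopefully245this---is"));        // prints [hopefully, this, is]
    }
}
```

If you prefer to keep the Scanner loop, skipping the token when the cleaned word is empty would avoid the empty key, but it would not separate the glued words.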

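To count word frequencies, I use a TreeMap whose keys are the words and whose values are the number of times each word occurs in the String. Note that map keys and values cannot be primitives, so the value type is Integer rather than int.

If the word just found is already in the map, I increment its count.

If the word just found is not yet in the map, I add it with a count of 1.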
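
The two cases described above can also be folded into a single call to Map.merge, which is available since Java 8. This is only a compact alternative, not what the code above does; the class name MergeCountDemo and its count method are invented for this sketch.

```java
import java.util.Map;
import java.util.TreeMap;

public class MergeCountDemo {

    // merge() stores 1 for a key that is absent, otherwise adds 1 to the stored count.
    public static Map<String, Integer> count(String... words) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String word : words) {
            freq.merge(word, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(count("the", "quick", "the", "fox")); // prints {fox=1, quick=1, the=2}
    }
}
```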

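While adding words to the map, I also count the letters of each word found in order to determine the length of the longest word.

After the whole String has been processed, I print the map contents, one entry per line, followed by the number of letters in the longest word found. Note that a TreeMap keeps its keys sorted, so the words are listed in alphabetical order.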
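
The code above records only the length of the longest word, whereas the question's code also printed the word itself. A minimal sketch of keeping the word as well, assuming the first word of maximal length should win on ties (LongestWordDemo and longestWord are hypothetical names):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestWordDemo {

    // Returns the first longest run of letters in the text, or "" if the text has no letters.
    public static String longestWord(String text) {
        Matcher m = Pattern.compile("[a-z]+").matcher(text.toLowerCase());
        String longest = "";
        while (m.find()) {
            if (m.group().length() > longest.length()) {
                longest = m.group();
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        String word = longestWord("Hopefully245this---is   a quick13947");
        System.out.println(word.length() + " (" + word + ")"); // prints 9 (hopefully)
    }
}
```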

Here is the output:

a         =  1
be        =  1
brown     =  1
but       =  1
complete  =  1
for       =  1
fox       =  1
hopefully =  1
is        =  1
less      =  1
maybe     =  1
quick     =  3
task      =  2
the       = 12
this      =  1
to        =  1
will      =  1
you       =  1
Longest = 9

Answer 2

How can I split "Hopefully245this---is" and store the three useful words "hopefully", "this", "is"?

Use the regex API for requirements like this.

Demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "Hopefully245this---is";
        Pattern pattern = Pattern.compile("[A-Za-z]+");
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}

Output:

Hopefully
this
is
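
As a variation on the demo above, String.split with the negated character class also separates the words, although it can produce an empty leading piece when the text starts with a non-letter, so the pieces need filtering. This is a sketch of that alternative, not part of the original demo; SplitDemo and words are names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {

    // Splits on runs of non-letters and drops the empty pieces split() may return.
    public static List<String> words(String text) {
        List<String> words = new ArrayList<>();
        for (String piece : text.split("[^A-Za-z]+")) {
            if (!piece.isEmpty()) { // a leading "" appears when the text starts with a non-letter
                words.add(piece);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Hopefully245this---is")); // prints [Hopefully, this, is]
        System.out.println(words("..quicK."));              // prints [quicK]
    }
}
```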

Check the following links to learn more about Java regular expressions:

Answer 3

In Java 9 or later, Matcher#results can be used in a stream-based solution like this:

    Pattern pattern = Pattern.compile("[a-zA-Z]+");
    try (BufferedReader br = Files.newBufferedReader(Paths.get("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt"))) {
        br.lines()
                .map(pattern::matcher)
                .flatMap(Matcher::results)
                .map(matchResult -> matchResult.group(0))
                .collect(Collectors.groupingBy(String::toLowerCase, TreeMap::new, Collectors.counting()))
                .forEach((word, count) -> System.out.printf("%s=%s%n", word, count));
    } catch (IOException e) {
        System.err.format("IOException: %s%n", e);
    }

Output:

a=1
be=1
brown=1
but=1
complete=1
for=1
fox=1
hopefully=1
is=1
less=1
maybe=1
quick=3
task=2
the=12
this=1
to=1
will=1
you=1
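
To try the same pipeline without the file on disk, it can be wrapped in a method that takes the text as a String. The sketch below assumes Java 9 or later for Matcher.results(); the class name StreamWordFreq, the count method and the sample text are illustrative only.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StreamWordFreq {

    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    // Counts every run of letters, case-insensitively, with the keys in alphabetical order.
    public static Map<String, Long> count(String text) {
        return WORD.matcher(text)
                .results()                 // Java 9+: one MatchResult per run of letters
                .map(MatchResult::group)
                .collect(Collectors.groupingBy(String::toLowerCase, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        String sample = "The quick brown fox\n..quicK.\nthe the";
        count(sample).forEach((word, n) -> System.out.printf("%s=%d%n", word, n));
    }
}
```

Note that Collectors.counting() yields Long values, so the map type is Map<String, Long> rather than Map<String, Integer>.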
