How do I find word frequencies in a text file?

Time: 2020-05-24 15:57:57

Tags: java word-frequency

My task is to get the word frequencies of this file:

test_words_file-1.txt

The quick brown fox
Hopefully245this---is   a quick13947
task&&#%*for you to complete.
But maybe the tASk 098234 will be less
..quicK.
the the the the the the the the the the

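I have been trying to remove the symbols and digits from this file and get the frequency of each word in alphabetical order. The result is: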

[Image: word list]

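I can see that even though the digits have been removed, they are still being counted. Can you explain why, and how I can fix this?

Also, how can I split "Hopefully245this---is" and store the 3 useful words "hopefully", "this", "is"?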

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

public class WordFreq2 {
    public static void main(String[] args) throws FileNotFoundException {

        File file = new File("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt");
        Scanner scanner = new Scanner(file); 
        int maxWordLen = 0; 
        String maxWord = null;

        HashMap<String, Integer> map = new HashMap<>();
        while(scanner.hasNext()) {
            String word = scanner.next();
            word = word.toLowerCase();
            // text cleaning 
            word = word.replaceAll("[^a-zA-Z]+", "");

            if(map.containsKey(word)) {
                //if the word already exists
                int count = map.get(word)+1;
                map.put(word,count);
            }
            else {
                // The word is new 
                int count = 1;
                map.put(word, count);

                // Find the max length of Word
                if (word.length() > maxWordLen) {
                    maxWordLen = word.length();
                    maxWord = word;
                }
            }   
        }

        scanner.close();

        //HashMap unsorted, sort 
        TreeMap<String, Integer> sorted = new TreeMap<>();
        sorted.putAll(map);


        for (Map.Entry<String, Integer> entry: sorted.entrySet()) {
            System.out.println(entry);
        }

        System.out.println(maxWordLen+" ("+maxWord+")");
    }

}

3 Answers:

Answer 0 (score: 2)

Code first. The explanation follows the code.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordFreq2 {

    public static void main(String[] args) {
        Path path = Paths.get("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt");
        try {
            String text = Files.readString(path); // Java 11+; throws java.io.IOException
            text = text.toLowerCase();
            Pattern pttrn = Pattern.compile("[a-z]+");
            Matcher mtchr = pttrn.matcher(text);
            TreeMap<String, Integer> freq = new TreeMap<>();
            int longest = 0;
            while (mtchr.find()) {
                String word = mtchr.group();
                int letters = word.length();
                if (letters > longest) {
                    longest = letters;
                }
                if (freq.containsKey(word)) { 
                    freq.computeIfPresent(word, (w, c) -> Integer.valueOf(c.intValue() + 1));
                }
                else {
                    freq.computeIfAbsent(word, (w) -> Integer.valueOf(1));
                }
            }
            String format = "%-" + longest + "s = %2d%n";
            freq.forEach((k, v) -> System.out.printf(format, k, v));
            System.out.println("Longest = " + longest);
        }
        catch (IOException xIo) {
            xIo.printStackTrace();
        }
    }
}

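Since the sample file is small, I load the entire file contents into a String.

Then I convert the entire String to lower case, since your definition of a word is a series of consecutive alphabetic characters, regardless of case.

The regular expression [a-z]+ matches one or more consecutive lowercase alphabetic characters. (Remember that the entire String is now lowercase.)

Each call to method find() finds the next word in the String (according to the above definition of a word, i.e. a consecutive series of lowercase letters).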
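
To illustrate why matching runs of letters avoids the stray count seen in the question, here is a minimal sketch that contrasts the two approaches on a fragment of the sample file. The class name TokenDemo and both helper methods are hypothetical, introduced only for this illustration. A whitespace-delimited token such as 098234 becomes the empty string once its non-letters are deleted, and that empty string is then counted like any other word. For the same reason, Hopefully245this---is collapses into one glued token instead of three words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenDemo {

    // Approach from the question: split on whitespace, then delete non-letters from each token.
    static List<String> stripTokens(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            words.add(token.toLowerCase().replaceAll("[^a-z]+", ""));
        }
        return words;
    }

    // Approach from this answer: collect every run of letters.
    static List<String> findWords(String line) {
        List<String> words = new ArrayList<>();
        Matcher m = Pattern.compile("[a-z]+").matcher(line.toLowerCase());
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "tASk 098234 will";
        System.out.println(stripTokens(line)); // [task, , will]  (note the empty string)
        System.out.println(findWords(line));   // [task, will]
    }
}
```

If the whitespace-token approach is kept, skipping tokens that are empty after cleaning would remove the stray entry, but it would still not split glued tokens.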

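To count the word frequencies, I use a TreeMap where the map key is the word and the map value is the number of times that word appears in the String. Note that map keys and values cannot be primitives, hence the value is Integer and not int.

If the last word found already appears in the map, I increment its count.

If the last word found does not appear in the map, it is added to the map with its count set to 1 (one).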
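
The two branches described above can also be collapsed into a single call. The following is a minimal sketch, assuming Java 8 or later: Map.merge stores 1 when the key is absent and otherwise combines the existing count with 1 using Integer::sum. The class name MergeCount and the method countWords are hypothetical names chosen for this example.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MergeCount {

    // Counts each run of letters, ignoring case; TreeMap keeps the keys sorted.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> freq = new TreeMap<>();
        Matcher m = Pattern.compile("[a-z]+").matcher(text.toLowerCase());
        while (m.find()) {
            // Insert 1 for a new word, otherwise add 1 to the stored count.
            freq.merge(m.group(), 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(countWords("The quick brown fox ..quicK. the"));
        // {brown=1, fox=1, quick=2, the=2}
    }
}
```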

While adding words to the map, I also count the letters of each word found, in order to find the length of the longest word.
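
The code above keeps only the length of the longest word, whereas the code in the question also printed the word itself. A minimal sketch of how the word could be kept as well; the class name LongestWord and the method longestWord are hypothetical, and ties are resolved in favor of the first word found, which is an assumption rather than something the question specifies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestWord {

    // Returns the first longest run of letters in the text, or null if there is none.
    static String longestWord(String text) {
        String longest = null;
        Matcher m = Pattern.compile("[a-z]+").matcher(text.toLowerCase());
        while (m.find()) {
            String word = m.group();
            if (longest == null || word.length() > longest.length()) {
                longest = word;
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        String word = longestWord("Hopefully245this---is   a quick13947");
        System.out.println(word.length() + " (" + word + ")"); // 9 (hopefully)
    }
}
```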

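After the entire String has been processed, I print the contents of the map, one entry per line, and finally print the number of letters in the longest word found. Note that TreeMap sorts its keys, hence the list of words appears in alphabetical order.

Here is the output: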

a         =  1
be        =  1
brown     =  1
but       =  1
complete  =  1
for       =  1
fox       =  1
hopefully =  1
is        =  1
less      =  1
maybe     =  1
quick     =  3
task      =  2
the       = 12
this      =  1
to        =  1
will      =  1
you       =  1
Longest = 9

Answer 1 (score: 1)

"How can I split "Hopefully245this---is" and store the 3 useful words "hopefully", "this", "is"?"

Use the regex API for such requirements.

Demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "Hopefully245this---is";
        Pattern pattern = Pattern.compile("[A-Za-z]+");
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}

Output:

Hopefully
this
is

Check the following links to learn more about Java regex:

  1. https://docs.oracle.com/javase/tutorial/essential/regex/index.html
  2. https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html
  3. https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Matcher.html
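
An alternative to the find() loop in the demo is String.split with the complementary pattern, so that every run of non-letters acts as a delimiter. This is a minimal sketch, not part of the demo above; the class name SplitDemo and the method words are hypothetical. One caveat: when the input starts with a non-letter, split returns a leading empty string, so the sketch skips empty parts.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {

    // Splits on runs of non-letters and drops any empty parts.
    static List<String> words(String str) {
        List<String> result = new ArrayList<>();
        for (String part : str.split("[^A-Za-z]+")) {
            if (!part.isEmpty()) { // a leading delimiter yields an empty first element
                result.add(part);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hopefully245this---is")); // [Hopefully, this, is]
        System.out.println(words("..quicK."));              // [quicK]
    }
}
```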

Answer 2 (score: 0)

With Java 9 or later, Matcher#results can be used in a stream solution like this:

    Pattern pattern = Pattern.compile("[a-zA-Z]+");
    try (BufferedReader br = Files.newBufferedReader(Paths.get("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt"))) {
        br.lines()
                .map(pattern::matcher)
                .flatMap(Matcher::results)
                .map(matchResult -> matchResult.group(0))
                .collect(Collectors.groupingBy(String::toLowerCase, TreeMap::new, Collectors.counting()))
                .forEach((word, count) -> System.out.printf("%s=%s%n", word, count));
    } catch (IOException e) {
        System.err.format("IOException: %s%n", e);
    }

Output:

a=1
be=1
brown=1
but=1
complete=1
for=1
fox=1
hopefully=1
is=1
less=1
maybe=1
quick=3
task=2
the=12
this=1
to=1
will=1
you=1
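
The snippet in this answer omits its imports and reads from a fixed file path. For trying the same pipeline without the file, here is a minimal self-contained sketch that applies Matcher#results to an in-memory String instead of reading line by line. The class name StreamWordFreq, the method countWords, and the sample text in main are assumptions made for this illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StreamWordFreq {

    // Groups every run of letters by its lowercase form and counts the occurrences.
    static Map<String, Long> countWords(String text) {
        return Pattern.compile("[a-zA-Z]+")
                .matcher(text)
                .results() // Stream<MatchResult>, available since Java 9
                .map(MatchResult::group)
                .collect(Collectors.groupingBy(String::toLowerCase, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        String text = "task&&#%*for you to complete.\nBut maybe the tASk 098234 will be less";
        countWords(text).forEach((word, count) -> System.out.printf("%s=%s%n", word, count));
    }
}
```

Note that Collectors.counting() produces Long values, so the resulting map is a Map<String, Long> rather than a Map<String, Integer>.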