Finding common words across multiple text files

Time: 2014-04-28 19:01:02

Tags: java arraylist read-write

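I have 100 text files. 50 of them are called text_H and the other 50 are called text_T. What I want to do is open the two text files text_T_1 and text_H_1, find the number of common words and write it to a text file, then open text_H_2 and text_T_2 and find the number of common words, and so on, until I open text_H_50 and text_T_50 and find the number of common words.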

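I wrote the following code, which opens two text files, finds the words they have in common, and returns the number of common words between the two files. The result is written to a text file.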

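For whatever reason, instead of giving me the number of common words for just the pair of files that is currently open, it gives me the number of common words accumulated over all files so far. For example, if the number of common words between fileA_1 and fileB_1 is 10, and the number between fileA_2 and fileB_2 is 5, then the result I get for the second pair is 10 + 5 = 15. I hope someone here can catch whatever I'm missing, because I have gone through this code many times without success. Thanks in advance for your help!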

Code:

package xml_test;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Scanner;

public class app {

    private static ArrayList<String> load(String f1) throws FileNotFoundException 
    {
        Scanner reader = new Scanner(new File(f1));
        ArrayList<String> out = new ArrayList<String>();
        while (reader.hasNext())
        {
            String temp = reader.nextLine();
            String[] sts = temp.split(" ");
            for (int i = 0;i<sts.length;i++)
            {
                if(sts[i] != "" && sts[i] != " " && sts[i] != "\n")
                    out.add(sts[i]);
            }
        }
        return out;
    }

    private static void write(ArrayList<String> out, String fname) throws IOException
    {
        FileWriter writer = new FileWriter(new File(fname));
        //int count=0;
        int temp1=0;
        for (int ss= 1;ss<=3;ss++)
        {
            int count=0;
            for (int i = 0;i<out.size();i++)
            {
                //writer.write(out.get(i) + "\n");
                //writer.write(new Integer(count).toString());
                count++;
            }
            writer.write("count ="+new Integer(temp1).toString()+"\n");
        }
        writer.close();
    }

    public static void main(String[] args) throws IOException 
    {
        ArrayList<String> file1;
        ArrayList<String> file2;
        ArrayList<String> out = new ArrayList<String>();
        //add for loop to loop through all T's and H's 
        for(int kk = 1;kk<=3;kk++)
        {
            int count=0;
            file1 = load("Training_H_"+kk+".txt");
            file2 = load("Training_T_"+kk+".txt");
            //int count=1;

            for(int i = 0;i<file1.size();i++)
            {
                String word1 = file1.get(i);
                count=0;
                //System.out.println(word1);
                for (int z = 0; z <file2.size(); z++)
                {
                    //if (file1.get(i).equalsIgnoreCase(file2.get(i)))
                    if (word1.equalsIgnoreCase(file2.get(z)))
                    {
                        boolean already = false;
                        for (int q = 0;q<out.size();q++)
                        {
                            if (out.get(q).equalsIgnoreCase(file1.get(i)))
                            {
                                count++;
                                //System.out.println("count is "+count);
                                already = true;
                            }
                        }
                        if (already==false)
                        {
                            out.add(file1.get(i));
                        }
                    }
                }
                //write(out,"output_"+kk+".txt");
            }
            //count=new Integer(count).toString();
            //write(out,"output_"+kk+".txt");
            //write(new Integer(count).toString(),"output_2.txt");
            //System.out.println("count is "+count);
        }//
    }
}

2 Answers:

Answer 0 (score: 2)

Let me tell you what your code is doing, and see if you can spot the problem.

List wordsInFile1 = getWordsFromFile();
List wordsInFile2 = getWordsFromFile();

List foundWords = empty;

//Does below for each compared file
for each word in file 1
    set count to 0
    compare to each word in file 2
        if the word matches see if it's also in foundWords
            if it is in foundWords, add 1 to count
        otherwise, add the word to foundWords

//Write the number of words
prints out the number of words in foundWords

Hint: the problem lies with foundWords and with where you add to count. arunmoezhi's comment is on the right track, as is point 3 of board_reader's answer.

As it stands, your code doesn't do anything meaningful with any of the count variables.
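
One way to act on the hint, sketched under the assumption that the goal is the number of distinct words a pair of files shares (ignoring case): create the found-words list afresh for each pair, and derive the count from that list alone. The class name `PairCount` and the helper names are made up for illustration and do not appear in the question's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PairCount {

    // Counts distinct words (ignoring case) that occur in both lists.
    static int countCommon(List<String> file1, List<String> file2) {
        // Created on every call, so nothing carries over from an earlier pair of files.
        List<String> foundWords = new ArrayList<>();
        for (String word1 : file1) {
            if (containsIgnoreCase(file2, word1) && !containsIgnoreCase(foundWords, word1)) {
                foundWords.add(word1);
            }
        }
        return foundWords.size();
    }

    // Linear search with the same case-insensitive comparison the question uses.
    private static boolean containsIgnoreCase(List<String> words, String target) {
        for (String w : words) {
            if (w.equalsIgnoreCase(target)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> h1 = Arrays.asList("the", "cat", "sat", "the");
        List<String> t1 = Arrays.asList("The", "dog", "sat");
        List<String> h2 = Arrays.asList("a", "b");
        List<String> t2 = Arrays.asList("b", "c");
        System.out.println(countCommon(h1, t1)); // prints 2 ("the" and "sat")
        System.out.println(countCommon(h2, t2)); // prints 1 ("b"), not 3
    }
}
```

Because `foundWords` lives inside the method, the second pair's result no longer includes words found for the first pair.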

Answer 1 (score: 1)

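  1. Use more meaningful variable names in the loops to make the code readable.
  2. Use HashMaps instead of ArrayLists; the code will be smaller, faster, and simpler. It will also use less memory if words are repeated many times within a file.
  3. Shouldn't you increment count in the case where already == false?
  4. I can't figure out why the write method does its counting 3 times. Isn't the count simply equal to out.size()?
  5. There may be more...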
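
Point 2 could look roughly like the following. This is a sketch, not the answerer's code: it uses a `HashSet` rather than a `HashMap`, on the assumption that only membership matters and per-word frequencies are not needed. It also assumes words are separated by whitespace and compared case-insensitively by lower-casing them. The class name `CommonWordsSketch` is hypothetical; the file names follow the pattern in the question's `main`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class CommonWordsSketch {

    // Reads a file into a set of lower-cased words, split on any whitespace.
    static Set<String> loadWords(Path file) throws IOException {
        Set<String> words = new HashSet<>();
        for (String line : Files.readAllLines(file)) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) { // compare content, not references
                    words.add(token.toLowerCase(Locale.ROOT));
                }
            }
        }
        return words;
    }

    // Size of the intersection; the copy leaves both argument sets untouched.
    static int countCommon(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    public static void main(String[] args) throws IOException {
        for (int k = 1; k <= 3; k++) {
            Path h = Paths.get("Training_H_" + k + ".txt");
            Path t = Paths.get("Training_T_" + k + ".txt");
            if (!Files.exists(h) || !Files.exists(t)) {
                continue; // skip pairs that are not present
            }
            // Fresh sets for every pair, so counts never accumulate across pairs.
            int count = countCommon(loadWords(h), loadWords(t));
            Files.write(Paths.get("output_" + k + ".txt"),
                    ("count = " + count + "\n").getBytes());
        }
    }
}
```

Storing each file as a set removes duplicates on load, so the nested loops and the `already` flag are no longer needed.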