// Calculating term frequency
int filename = 11;
String[] fileName = new String[filename];
int a = 0;
int totalCount = 0;
int wordCount = 0;
// Count inverse document frequency
System.out.println("Please enter the required word :");
Scanner scan2 = new Scanner(System.in);
String word2 = scan2.nextLine();
String[] array2 = word2.split(" ");
int numofDoc;
for (int b = 0; b < array2.length; b++) {
numofDoc = 0;
for (int i = 0; i < filename; i++) {
try {
BufferedReader in = new BufferedReader(new FileReader(
"C:\\Users\\user\\fypworkspace\\TextRenderer\\abc"
+ i + ".txt"));
int matchedWord = 0;
Scanner s2 = new Scanner(in);
{
while (s2.hasNext()) {
if (s2.next().equals(array2[b]))
matchedWord++;
}
}
if (matchedWord > 0)
numofDoc++;
} catch (IOException e) {
System.out.println("File not found.");
}
}
System.out.println(array2[b]
+ " --> This number of files that contain the term "
+ numofDoc);
//calculate TF-IDF
for (a = 0; a < filename; a++) {
try {
System.out.println("The word inputted : " + word2);
File file =
new File("C:\\Users\\user\\fypworkspace\\TextRenderer\\abc"
+ a + ".txt");
System.out.println(" _________________");
System.out.print("| File = abc" + a + ".txt | \t\t \n");
for (int i = 0; i < array2.length; i++) {
totalCount = 0;
wordCount = 0;
Scanner s = new Scanner(file);
{
while (s.hasNext()) {
totalCount++;
if (s.next().equals(array2[i]))
wordCount++;
}
System.out.print(array2[i] + " --> Word count = "
+ "\t\t " + "|" + wordCount + "|");
System.out.print(" Total count = " + "\t\t " + "|"
+ totalCount + "|");
System.out.printf(" Term Frequency = | %8.4f |",
(double) wordCount / totalCount);
System.out.println("\t ");
double inverseTF = Math.log10((float) numDoc / numofDoc);
System.out.println(" --> IDF " + inverseTF );
double TFIDF = (((double) wordCount / totalCount) * inverseTF );
System.out.println(" --> TF/IDF " + TFIDF);
}
}
} catch (FileNotFoundException e) {
System.out.println("File is not found");
}
}
}
When I enter a single string, say 'how', the code counts the number of files that contain the string 'how'.
Example output:
The number of files containing 'how' is 5.
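The code then goes on to calculate the term frequency-inverse document frequency (TF-IDF).
When I enter 3 strings, for example 'how are you',
the output shows the file count only for the string 'how'.
Sample output: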
Please enter the required word :
you
you --> This number of files that contain the term 6
The word inputted : you
_________________
| File = abc0.txt |
you --> Word count = |3| Total count = |150| Term Frequency = | 0.0200 |
--> IDF 0.2632414441876607
--> TF/IDF 0.005264828883753215
The word inputted : you
If I enter 3 strings, 'how are you':
Please enter the required word :
how are you
how --> This number of files that contain the term 6
<--- it only processes the first string, 'how'
The word inputted : how are you
_________________
| File = abc0.txt |
how --> Word count = |0| Total count = |150| Term Frequency = | 0.0000 |
--> IDF Infinity
--> TF/IDF NaN
are --> Word count = |0| Total count = |150| Term Frequency = | 0.0000 |
--> IDF Infinity
--> TF/IDF NaN
you --> Word count = |3| Total count = |150| Term Frequency = | 0.0200 |
--> IDF Infinity
--> TF/IDF Infinity
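The remaining strings then just use a file count of 0, even though each string is supposed to have its own file count.
How can I make the code produce 3 separate file counts, one per string?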
Answer 0 (score: 3)
To count the number of documents for each search term, you can use an int array to hold the counts:
String[] array2 = word2.split(" ");
int[] numofDoc = new int[array2.length];
for (int b = 0; b < array2.length; b++) {
numofDoc[b] = 0;
Use the array element when counting:
if (matchedWord > 0) {
numofDoc[b]++;
}
Later, use the array element for the calculation:
double inverseTF = Math.log10((float) numDoc / numofDoc[i]);
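Putting the three fragments together, here is a minimal sketch of the two-pass idea: fill the whole `numofDoc` array first, and only then run the TF-IDF loop, so that every term already has its own count when the IDF is computed. This is an illustration rather than a drop-in replacement. The class name `TfIdfSketch`, the methods `documentFrequencies` and `tfIdf`, and the in-memory sample documents (standing in for the `abc0.txt` ... files) are all hypothetical. Returning 0 when a term occurs in no document is also an assumption, made here to avoid the `Infinity` and `NaN` values that come from dividing by a zero count.

```java
import java.util.Arrays;
import java.util.List;

class TfIdfSketch {

    // Pass 1: for each search term, count the documents that contain it at least once.
    static int[] documentFrequencies(String[] terms, List<String[]> docs) {
        int[] numofDoc = new int[terms.length];
        for (int b = 0; b < terms.length; b++) {
            for (String[] doc : docs) {
                if (Arrays.asList(doc).contains(terms[b])) {
                    numofDoc[b]++;
                }
            }
        }
        return numofDoc;
    }

    // Pass 2 helper: TF-IDF of one term in one document.
    // Assumption: return 0 for an empty document or a term found in no document,
    // instead of dividing by zero.
    static double tfIdf(String term, String[] doc, int totalDocs, int docsWithTerm) {
        if (doc.length == 0 || docsWithTerm == 0) {
            return 0.0;
        }
        int wordCount = 0;
        for (String w : doc) {
            if (w.equals(term)) {
                wordCount++;
            }
        }
        double tf = (double) wordCount / doc.length;
        double idf = Math.log10((double) totalDocs / docsWithTerm);
        return tf * idf;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; the original code reads these words from files.
        List<String[]> docs = Arrays.asList(
                "how are you".split(" "),
                "you are here".split(" "),
                "see you soon".split(" "));
        String[] terms = "how are you".split(" ");

        // All counts are known before any TF-IDF value is computed.
        int[] numofDoc = documentFrequencies(terms, docs);

        for (int a = 0; a < docs.size(); a++) {
            for (int i = 0; i < terms.length; i++) {
                System.out.printf("doc %d, %s: files = %d, TF-IDF = %.4f%n",
                        a, terms[i], numofDoc[i],
                        tfIdf(terms[i], docs.get(a), docs.size(), numofDoc[i]));
            }
        }
    }
}
```

With the sample data, 'how' is found in one document, 'are' in two and 'you' in three, so each term gets its own count. A term that appears in every document has an IDF of log10(1) = 0, and therefore a TF-IDF of 0 everywhere.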