How do I count the number of words in a file?

Time: 2019-07-02 21:34:37

Tags: c++

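I am creating a program that counts how many words are in an input file. I can't figure out how to define a word as something delimited by whitespace, a period, a comma, or the beginning or end of a line.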

Contents of the input file:

世界您好! HELLO WORLD一切都很棒。你好,世界也很棒。

The output should be 15 words, but my output is 14.

I tried adding "or" conditions for periods, commas, and so on, but then it just counted those characters as well as the spaces.

#include <iostream> 
#include <string>
#include <fstream>
using namespace std;

//Function Declarations
void findFrequency(int A[], string &x);
void findWords(int A[], string &x);

//Function Definitions
void findFrequency(int A[], string &x)
{   

    //Counts the number of occurrences of each letter in the string
    for (int i = 0; x[i] != '\0'; i++)
    {

        if (x[i] >= 'A' && x[i] <= 'Z')
            A[toascii(x[i]) - 64]++;
        else if (x[i] >= 'a' && x[i] <= 'z')
            A[toascii(x[i]) - 96]++;
    }

    //Displaying the results
    char ch = 'a';

    for (int count = 1; count < 27; count++)
    {
        if (A[count] > 0)
        {

            cout << A[count] << " : " << ch << endl;
        }
        ch++;
    }
}


void findWords(int A[], string &x)
{

    int wordcount = 0;
    for (int count = 0; x[count] != '\0'; count++)
    {

        if (x[count] == ' ')
        {
            wordcount++;
            A[0] = wordcount;
        }
    }
    cout << A[0] << " Words " << endl;
}



int main()
{
    string x;
    int A[27] = { 0 }; //Array assigned all elements to zero
    ifstream in;    //declaring an input file stream
    in.open("mytext.dat");

    if (in.fail())
    {
        cout << "Input file did not open correctly" << endl;
    }

    getline(in,x);
    findWords(A, x);
    findFrequency(A, x);

    in.close();

    system("pause");
    return 0;
}

The output should be 15, but the result I get is 14.

2 answers:

Answer 0 (score: 1)

Maybe this is what you need?

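#include <cctype>   // std::isalpha
#include <cstddef>  // std::size_t
#include <istream>
#include <string>

std::size_t count_words(std::istream& is) {
    std::size_t co = 0;
    std::string word;
    while(is >> word) {       // read a whitespace separated chunk
        for(char ch : word) { // step through its characters
            // cast to unsigned char: std::isalpha is undefined for negative values
            if(std::isalpha(static_cast<unsigned char>(ch))) {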
                // it contains at least one alphabetic character so
                // count it as a word and move on
                ++co;
                break;
            }
        }
    }
    return co;
}
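
The function above relies on whitespace to split the input, so a chunk such as "great.HELLO" (no space after the period) would count as a single word. If punctuation should also act as a separator, as the question asks, one possible sketch is to count the transitions from a non-letter to a letter. The name count_words_by_transitions and the string overload are hypothetical and are not part of the answer above.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: counts maximal runs of alphabetic characters,
// so any non-letter (space, period, comma, digit, ...) ends a word.
std::size_t count_words_by_transitions(std::istream& is) {
    std::size_t count = 0;
    bool in_word = false;
    char ch;
    while (is.get(ch)) {  // read every character, including whitespace
        // cast to unsigned char: std::isalpha is undefined for negative values
        bool is_letter = std::isalpha(static_cast<unsigned char>(ch)) != 0;
        if (is_letter && !in_word) {
            ++count;      // a new word starts at this character
        }
        in_word = is_letter;
    }
    return count;
}

// Convenience overload for trying the counter on an in-memory string.
std::size_t count_words_by_transitions(const std::string& text) {
    std::istringstream iss(text);
    return count_words_by_transitions(iss);
}
```

Under this definition "great.HELLO" counts as two words, but so does "don't", because the apostrophe is not a letter. Whether that is acceptable depends on how a word is defined for the task.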

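Answer 1 (score: -1)

Here is another approach, along with some test cases.

The test cases are a series of char arrays containing specific strings, used to exercise the findNextWord() method of the RetVal struct/class.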

char line1[] = "this is1    a  line. \t of text  \n ";  // multiple white spaces
char line2[] = "another   line";    // string that ends with zero terminator, no newline
char line3[] = "\n";                // line with newline only
char line4[] = "";                  // empty string with no text

Here is the actual source code.

#include <iostream>
#include <cstring>
#include <cctype>   // isspace(), ispunct()

struct RetVal {
    RetVal(char *p1, char *p2) : pFirst(p1), pLast(p2) {}
    RetVal(char *p2 = nullptr) : pFirst(nullptr), pLast(p2) {}
    char *pFirst;
    char *pLast;

    bool  findNextWord()
    {
        if (pLast && *pLast) {
            pFirst = pLast;
            // scan the input line looking for the first non-space character.
            // the isspace() function indicates true for any of the following
            // characters: space, newline, tab, carriage return, etc.
            while (*pFirst && isspace(*pFirst)) pFirst++;

            if (pFirst && *pFirst) {
                // we have found a non-space character so now we look
                // for a space character or the end of string.
                pLast = pFirst;
                while (*pLast && ! isspace(*pLast)) pLast++;
            }
            else {
                // indicate we are done with this string.
                pFirst = pLast = nullptr;
            }
        }
        else {
            pFirst = nullptr;
        }

        // return value indicates if we are still processing, true, or if we are done, false.
        return pFirst != nullptr;
    }
};

void printWords(RetVal &x)
{
    int    iCount = 0;

    while (x.findNextWord()) {
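        char xWord[128] = { 0 };

        // copy at most 127 characters so xWord always stays null-terminated
        std::size_t len = x.pLast - x.pFirst;
        if (len > sizeof(xWord) - 1) len = sizeof(xWord) - 1;
        strncpy(xWord, x.pFirst, len);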
        iCount++;
        std::cout << "word " << iCount << " is \"" << xWord << "\"" << std::endl;
    }

    std::cout << "total word count is " << iCount << std::endl;
}

int main()
{
    char line1[] = "this is1    a  line. \t of text  \n ";
    char line2[] = "another   line";
    char line3[] = "\n";
    char line4[] = "";

    std::cout << "Process line1[] \"" << line1 << "\""  << std::endl;
    RetVal x (line1);
    printWords(x);

    std::cout << std::endl << "Process line2[] \"" << line2 << "\"" << std::endl;
    RetVal x2 (line2);
    printWords(x2);

    std::cout << std::endl << "Process line3[] \"" << line3 << "\"" << std::endl;
    RetVal x3 (line3);
    printWords(x3);

    std::cout << std::endl << "Process line4[] \"" << line4 << "\"" << std::endl;
    RetVal x4(line4);
    printWords(x4);

    return 0;
}

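Here is the program's output. In some cases the line being processed contains a newline, which causes a line break in the output when it is printed to the console.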

Process line1[] "this is1    a  line.    of text
 "
word 1 is "this"
word 2 is "is1"
word 3 is "a"
word 4 is "line."
word 5 is "of"
word 6 is "text"
total word count is 6

Process line2[] "another   line"
word 1 is "another"
word 2 is "line"
total word count is 2

Process line3[] "
"
total word count is 0

Process line4[] ""
total word count is 0

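If you need punctuation to be treated like whitespace, that is, as something to ignore, you can modify the findNextWord() method to include an ispunct() character test in both loops, as follows: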

bool  findNextWord()
{
    if (pLast && *pLast) {
        pFirst = pLast;
        // scan the input line looking for the first character that is
        // neither whitespace nor punctuation. isspace() is true for space,
        // newline, tab, carriage return, etc.; ispunct() for punctuation.
        while (*pFirst && (isspace(*pFirst) || ispunct(*pFirst))) pFirst++;

        if (pFirst && *pFirst) {
            // we have found the start of a word so now we look for
            // whitespace, punctuation, or the end of string.
            pLast = pFirst;
            while (*pLast && ! (isspace(*pLast) || ispunct (*pLast))) pLast++;
        }
        else {
            // indicate we are done with this string.
            pFirst = pLast = nullptr;
        }
    }
    else {
        pFirst = nullptr;
    }

    // return value indicates if we are still processing, true, or if we are done, false.
    return pFirst != nullptr;
}

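In general, if you need to refine the filter for the start and end of a word, you can modify those two places to call some other function that examines a character and classifies it as valid for a word or not.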
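
As a rough illustration of that last point, here is a minimal sketch in which the classification is pulled out into a separate predicate. The names isWordChar and countWords are hypothetical, and the rule chosen here (letters, digits, and apostrophes belong to a word) is only an assumption; swap in whatever rule fits your input.

```cpp
#include <cctype>
#include <cstddef>

// Hypothetical classifier: letters, digits and apostrophes are treated
// as valid word characters; everything else is a separator.
bool isWordChar(char c) {
    // cast to unsigned char: std::isalnum is undefined for negative values
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '\'';
}

// Hypothetical counter: walks a null-terminated string and counts runs of
// characters accepted by the predicate passed in isValid.
std::size_t countWords(const char* p, bool (*isValid)(char) = isWordChar) {
    std::size_t count = 0;
    if (p == nullptr) return 0;
    while (*p) {
        while (*p && !isValid(*p)) ++p;     // skip separators
        if (*p) {
            ++count;                        // start of a word
            while (*p && isValid(*p)) ++p;  // skip to the end of the word
        }
    }
    return count;
}
```

With this predicate, countWords applied to line1 from the test cases gives 6, the same total as the program output above, because the period after "line" is skipped as a separator. Passing a different function as the second argument changes what counts as a word without touching the scanning loop.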