
Time: 2016-02-06 16:02:33

Tags: text-processing file string c





3 answers:

Answer 0 (score: 2)



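P.S.: If you don't want any limit on line size, you have to process the file character by character with fgetc/fputc (no sweat; C can be as fast as your disk allows).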
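To illustrate the character-by-character approach, here is a minimal sketch that copies a stream while swapping out one line, with no limit on line length. The function name `replace_line_stream` and its parameters are hypothetical, chosen only for this example.

```c
#include <stdio.h>

/* Copy `in` to `out` one character at a time, writing `text` in place of
 * line number `target` (1-based). The newline that ends the target line is
 * kept. Returns 1 if the line was replaced, 0 if the input has fewer lines,
 * -1 on a write error. Name and signature are hypothetical. */
int replace_line_stream(FILE *in, FILE *out, long target, const char *text)
{
    long lineno = 1;
    int c, replaced = 0, at_line_start = 1;

    while ((c = fgetc(in)) != EOF) {
        if (lineno == target) {
            if (at_line_start) {            /* first char of the target line */
                if (fputs(text, out) == EOF) return -1;
                replaced = 1;
            }
            /* drop the old characters, but keep the line terminator */
            if (c == '\n' && fputc(c, out) == EOF) return -1;
        } else if (fputc(c, out) == EOF) {
            return -1;
        }
        at_line_start = (c == '\n');
        if (c == '\n') lineno++;
    }
    return replaced;
}
```

Since stdio buffers both streams internally, the per-character calls do not turn into per-character disk accesses.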

Answer 1 (score: 1)

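I actually ran into this problem last month: a log file had grown to 30 GB and consisted of a single line. Tools like sed and perl want to consume all available memory to do anything with such a file. Technically, neither of your solutions scales well. In practice, though, they are fine, and (b) is preferred. You should use fgets with a buffer size of around 8 kB and iterate until the last character is a newline or you have reached EOF. In my solution, I used perl's sysread function, reading 16 kB chunks at a time.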


#define BUF_SZ 16383
char *buf = alloca(BUF_SZ + 1);
infile = fopen(...);
/* fgets returns NULL at EOF, so no separate feof() test is needed */
while (fgets(buf, BUF_SZ + 1, infile) != NULL) {
    /* the line continues in the next read if no newline was stored */
    size_t len = strlen(buf);
    readmore = (len > 0 && buf[len - 1] != '\n');
    /* other processing */
    if (readmore) {
        /* apply different strategies for dealing with buf */
    }
}

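I think the strategy really depends on what you are trying to do. If you want to delete the line or truncate it, and you only need to match the beginning of the line, then it is very simple (no special code needed). But if you need to match a long pattern that might extend past the first 16 kB, then you have to do something like move the last n bytes (where n is the maximum size of the search pattern) to the beginning of buf and do the next read into &buf[n].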
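The carry-over idea can be sketched as follows. This is only an illustration under my own assumptions: `stream_contains` and its `chunk` parameter are hypothetical names, and it keeps the last `plen - 1` bytes (the most that can belong to an unfinished match) rather than a full n.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if `pat` occurs anywhere in `in`, 0 if not, -1 on bad arguments
 * or allocation failure. Reads `chunk` bytes at a time and keeps the tail of
 * each read at the front of the buffer, so a match that straddles two reads
 * is still found. Name and signature are hypothetical. */
int stream_contains(FILE *in, const char *pat, size_t chunk)
{
    size_t plen = strlen(pat), keep = 0, n;
    int found = 0;
    char *buf;

    if (plen == 0 || chunk == 0) return -1;
    if (!(buf = malloc(chunk + plen - 1))) return -1;

    /* each read lands right after the bytes carried over, i.e. at &buf[keep] */
    while (!found && (n = fread(buf + keep, 1, chunk, in)) > 0) {
        size_t total = keep + n;

        /* test every position where a whole pattern still fits */
        for (size_t i = 0; i + plen <= total; i++)
            if (memcmp(buf + i, pat, plen) == 0) { found = 1; break; }

        /* move the last plen-1 bytes (or fewer) to the start of buf */
        keep = total < plen - 1 ? total : plen - 1;
        memmove(buf, buf + total - keep, keep);
    }
    free(buf);
    return found;
}
```

With a 16 kB chunk this would be called as `stream_contains(fp, "pattern", 16384)`; a tiny chunk size is handy for testing the boundary handling.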


Answer 2 (score: 0)

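The most efficient way to replace a line of text in a file depends on a number of things. [1] The primary concern in an efficient search and replace is to minimize the number of file reads and writes, because file I/O is generally at least an order of magnitude slower than in-memory operations. The simple case occurs when the search and replace strings have exactly the same number of characters. In that case, and only in that case, you can do an in-file replacement of the text without having to write a second (or temporary) file.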
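As a rough sketch of that equal-length case, the file can be opened in update mode and only the matched bytes overwritten. The helper name `replace_in_place` is hypothetical, and for brevity it assumes the file fits in memory while searching; only the first occurrence is replaced.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Overwrite the first occurrence of `srch` in the file at `path` with `rplc`
 * without writing a second file. Only valid when both strings have the same
 * length. Returns 1 if replaced, 0 if not found, -1 on error.
 * Hypothetical helper; assumes the file fits in memory for the search. */
int replace_in_place(const char *path, const char *srch, const char *rplc)
{
    size_t slen = strlen(srch), n, i;
    char *buf = NULL;
    long size;
    int rc = 0;
    FILE *fp;

    if (slen == 0 || slen != strlen(rplc)) return -1;
    if (!(fp = fopen(path, "r+b"))) return -1;      /* read + update mode */

    if (fseek(fp, 0, SEEK_END) != 0 || (size = ftell(fp)) < 0 ||
        fseek(fp, 0, SEEK_SET) != 0 || !(buf = malloc((size_t)size + 1))) {
        fclose(fp);
        return -1;
    }
    n = fread(buf, 1, (size_t)size, fp);

    for (i = 0; i + slen <= n; i++) {
        if (memcmp(buf + i, srch, slen) == 0) {
            /* seek back to the match and overwrite exactly slen bytes */
            if (fseek(fp, (long)i, SEEK_SET) != 0 ||
                fwrite(rplc, 1, slen, fp) != slen)
                rc = -1;
            else
                rc = 1;
            break;
        }
    }
    free(buf);
    if (fclose(fp) != 0) rc = -1;
    return rc;
}
```

The file length never changes here, which is why no temporary file is needed; with strings of different lengths everything after the match would have to be shifted.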

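With file I/O efficiency in mind, the most efficient (fastest) way to perform the search/replace is to mmap the entire file or to use sendfile. Both can take advantage of kernel-space copying of blocks of text, which generally gives a significant improvement over user-space copy operations. Neither is difficult. The next best choice is to use buffered reads to read the entire contents of the file into memory and then search the in-memory buffer to identify the locations (addresses) where content is to be changed. You can then write the buffer incrementally to a separate output file, writing the replacement text at each of the desired locations identified during the search of the original buffer.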
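For the mmap route, a minimal POSIX sketch might look like the following. It maps the input read-only, searches the mapping directly, and writes the unchanged stretches and the replacement text to an output stream. The name `mmap_replace` is hypothetical, and the code assumes a POSIX system and a regular file.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write the file at `path` to `out` with every occurrence of `srch` replaced
 * by `rplc`, searching a read-only memory mapping of the file. Returns the
 * number of replacements, or -1 on error. Hypothetical helper. */
long mmap_replace(const char *path, const char *srch, const char *rplc,
                  FILE *out)
{
    size_t slen = strlen(srch), rlen = strlen(rplc), len, pos = 0, last = 0;
    struct stat st;
    long count = 0;
    char *map;
    int fd;

    if (slen == 0) return -1;
    if ((fd = open(path, O_RDONLY)) == -1) return -1;
    if (fstat(fd, &st) == -1) { close(fd); return -1; }
    len = (size_t)st.st_size;
    if (len == 0) { close(fd); return 0; }  /* zero-length maps are invalid */

    map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (map == MAP_FAILED) return -1;

    while (pos + slen <= len) {
        /* look for the first char only where a full match still fits */
        char *hit = memchr(map + pos, *srch, len - slen + 1 - pos);
        if (!hit) break;
        pos = (size_t)(hit - map);
        if (memcmp(hit, srch, slen) == 0) {
            fwrite(map + last, 1, pos - last, out);  /* text before match */
            fwrite(rplc, 1, rlen, out);              /* replacement text  */
            pos += slen;
            last = pos;
            count++;
        } else {
            pos++;
        }
    }
    fwrite(map + last, 1, len - last, out);          /* remaining text */
    munmap(map, len);
    return count;
}
```

Because the whole file is addressable at once, there is no block boundary to worry about, which is the main simplification over reading in fixed-size blocks.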





#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>

#define BUFSZ (1 << 20)  /* default max block size (1M) */

void *find_rplc_file (char *srch, char *rplc, FILE *ifp, FILE *ofp, long blksz);

int main (int argc, char **argv) {

    if (argc != 4) {
        fprintf (stderr, "error: insufficient input.\n"
                         "usage: %s infile <search> <replace>\n", argv[0]);
        return 1;
    }

    FILE *ifp = fopen (argv[1], "rb");

    if (!ifp) {
        fprintf (stderr, "error: file open failed '%s'.\n", argv[1]);
        return 1;
    }

    if (!find_rplc_file (argv[2], argv[3], ifp, stdout, BUFSZ)) {
        fprintf (stderr, "error: find/replace failure.\n");
        fclose (ifp);
        return 1;
    }
    putchar ('\n');

    fclose (ifp);

    return 0;
}

void *find_rplc_file (char *srch, char *rplc, FILE *ifp, FILE *ofp, long blksz)
{
    /* an empty search term can never be matched meaningfully */
    if (!ifp || !ofp || !srch || !rplc || !*srch || blksz <= 0) return NULL;

    char *fb, *filebuf = NULL;
    size_t offset = 0, nbytes = 0, readsz = 0, readpos = 0, rlen, slen;
    long  bytecnt = 0, size = 0;

    rlen = strlen (rplc);   /* length of search/replace text */
    slen = strlen (srch);

    /* get file length */
    if (fseek (ifp, 0, SEEK_END) != 0 || (size = ftell (ifp)) == -1) {
        fprintf (stderr, "error: unable to determine file length.\n");
        return NULL;
    }
    fseek (ifp, 0, SEEK_SET);

    if (size == 0) return srch;     /* empty file, nothing to replace */

    /* limit blksz to the lesser of INT_MAX or blksz */
    blksz = blksz > INT_MAX ? INT_MAX : blksz;

    /* a block must be able to hold one whole search term */
    if ((size_t)blksz < slen) blksz = (long)slen;

    /* validate blksz does not exceed file size */
    readsz = (size_t)(blksz > size ? size : blksz);

    /* allocate memory for filebuf */
    if (!(filebuf = calloc (readsz, sizeof *filebuf))) {
        fprintf (stderr, "error: virtual memory exhausted.\n");
        return NULL;
    }

    /* read entire file readsz bytes at a time */
    while ((nbytes = fread (filebuf, sizeof *filebuf, readsz, ifp))) {

        /* a short read, or reaching the known size, marks the last block */
        int last = nbytes < readsz || bytecnt + (long)nbytes >= size;

        if (nbytes != readsz) fprintf (stderr, "warning: short read.\n");

        readpos = 0;    /* initialize read position & pointer */
        fb = filebuf;

        /* for each occurrence of 1st char of search term */
        while ((fb = memchr (fb, *srch, nbytes - (size_t)(fb - filebuf)))) {
            /* set current offset in buffer */
            offset = (size_t)(fb - filebuf);
            /* if less than length of search term remains */
            if (offset + slen > nbytes) {
                if (!last) {
                    nbytes = offset; /* set nbytes to current offset */
                    /* reset file pointer to account for nbytes reduction */
                    fseek (ifp, bytecnt + (long)nbytes, SEEK_SET);
                }
                break;  /* read next block from here (or finish if last) */
            }
            /* otherwise compare fb to search term */
            if (memcmp (srch, fb, slen) == 0) {
                /* if term found, write prior buffer to output file */
                fwrite (filebuf + readpos, sizeof *filebuf,
                        offset - readpos, ofp);
                /* write replacement text */
                fwrite (rplc, sizeof *rplc, rlen, ofp);
                /* set next readpos to 1st char following search term */
                readpos = offset + slen;
                fb = filebuf + readpos; /* resume search after the match */
            }
            else
                fb++;   /* advance fb pointer for next memchr search */
        }

        bytecnt += (long)nbytes;  /* increment bytecnt with bytes searched */

        /* write remaining buffer to output file */
        fwrite (filebuf + readpos, sizeof *filebuf,
                nbytes - readpos, ofp);

        /* check file complete */
        if (last) break;

        /* set next read size (either blksz or remaining chars < blksz) */
        readsz = (size_t)(size - bytecnt > blksz ? blksz : size - bytecnt);
    }

    free (filebuf); /* free filebuf */

    /* validate all bytes successfully read */
    if (bytecnt != size) {
        fprintf (stderr, "error: file read failed.\n");
        return NULL;
    }

    return srch;   /* return something other than NULL for success */
}


$ cat dat/damages.txt
Personal injury damage awards are unliquidated
and are not capable of certain measurement; thus, the
jury has broad discretion in assessing the amount of
damages in a personal injury case. Yet, at the same
time, a factual sufficiency review insures that the
evidence supports the jury's award; and, although
difficult, the law requires appellate courts to conduct
factual sufficiency reviews on damage awards in
personal injury cases. Thus, while a jury has latitude in
assessing intangible damages in personal injury cases,
a jury's damage award does not escape the scrutiny of
appellate review.

Because Texas law applies no physical manifestation
rule to restrict wrongful death recoveries, a
trial court in a death case is prudent when it chooses
to submit the issues of mental anguish and loss of
society and companionship. While there is a
presumption of mental anguish for the wrongful death
beneficiary, the Texas Supreme Court has not indicated
that reviewing courts should presume that the mental
anguish is sufficient to support a large award. Testimony
that proves the beneficiary suffered severe mental
anguish or severe grief should be a significant and
sometimes determining factor in a factual sufficiency
analysis of large non-pecuniary damage awards.


$ ./bin/fread_blks_min dat/damages.txt "injury" "hygiene"
Personal hygiene damage awards are unliquidated
and are not capable of certain measurement; thus, the
jury has broad discretion in assessing the amount of
damages in a personal hygiene case. Yet, at the same
time, a factual sufficiency review insures that the
evidence supports the jury's award; and, although
difficult, the law requires appellate courts to conduct
factual sufficiency reviews on damage awards in
personal hygiene cases. Thus, while a jury has latitude in
assessing intangible damages in personal hygiene cases,
a jury's damage award does not escape the scrutiny of
appellate review.

Because Texas law applies no physical manifestation
rule to restrict wrongful death recoveries, a
trial court in a death case is prudent when it chooses
to submit the issues of mental anguish and loss of
society and companionship. While there is a
presumption of mental anguish for the wrongful death
beneficiary, the Texas Supreme Court has not indicated
that reviewing courts should presume that the mental
anguish is sufficient to support a large award. Testimony
that proves the beneficiary suffered severe mental
anguish or severe grief should be a significant and
sometimes determining factor in a factual sufficiency
analysis of large non-pecuniary damage awards.

