Question

I have two text files.

File1 has more than 400K lines. Each line looks like this example:

hstor,table,"8bit string",ABCD,0000,0,4,19000101,today

File2 contains a list of new 8-bit strings that should replace the current strings in file1, while the rest of each file1 line is kept.

So file1 goes from

hstor,table,"OLD 8bit string",ABCD,0000,0,4,19000101,today

to

hstor,table,"NEW 8bit string",ABCD,0000,0,4,19000101,today

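I can't do a find-and-replace 400K times.

How can I write a script that replaces all the OLD 8-bit strings in file1 with the new 8-bit strings listed in file2?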

Answer 1

This might work for you (GNU sed):

sed 's#.*#s/[^,]*/&/3#' file2 | cat -n | sed -f - file1

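This converts file2 into a sed script and then runs that script against file1.

The first sed command takes each line of file2 and turns it into a substitution command that replaces the third field of the target line with the contents of that file2 line.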

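The result is piped to cat -n, which inserts line numbers. sed uses each line number as the address of its substitution command, so the Nth replacement is applied only to line N of file1.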
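To make the intermediate step concrete, here is a small sketch that runs only the first two stages of the pipeline and prints the generated script. The temporary directory and the two sample strings in file2 are made up for illustration; they are not from the question.

```shell
tmp=$(mktemp -d)   # scratch directory (hypothetical sample data below)

# Hypothetical replacement list: two new 8-bit strings, one per line.
printf '%s\n' '"11110000"' '"00001111"' > "$tmp/file2"

# First two stages of the one-liner: wrap each line of file2 in a
# substitute command, then let cat -n prefix it with its line number.
script=$(sed 's#.*#s/[^,]*/&/3#' "$tmp/file2" | cat -n)

# Show the generated sed script.
printf '%s\n' "$script"
```

Each printed line is a line number, a tab, and a command of the form s/[^,]*/NEW/3, so command N only fires on input line N.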

The final sed command reads the generated script from standard input (/dev/stdin, given here as -f -) and runs it against the input file file1.
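A minimal end-to-end sketch of the same idea on two sample records. As a variation on the one-liner, it saves the generated script to a file instead of piping it straight into sed, so you can inspect it before applying it. The file contents, the temporary directory, and the name script.sed are assumptions for illustration, and it is assumed that file2 keeps the surrounding quotes.

```shell
tmp=$(mktemp -d)   # scratch directory for the hypothetical sample files

# Two sample records in the question's format, with old 8-bit strings.
cat > "$tmp/file1" <<'EOF'
hstor,table,"00000000",ABCD,0000,0,4,19000101,today
hstor,table,"00000001",EFGH,0000,0,4,19000101,today
EOF

# One replacement per line, in the same order as the lines of file1.
cat > "$tmp/file2" <<'EOF'
"11110000"
"00001111"
EOF

# Generate the numbered sed script and keep it in a file.
sed 's#.*#s/[^,]*/&/3#' "$tmp/file2" | cat -n > "$tmp/script.sed"

# Apply the script: command N rewrites field 3 of line N.
result=$(sed -f "$tmp/script.sed" "$tmp/file1")
printf '%s\n' "$result"
```

Note that a replacement string containing /, & or a backslash would need escaping before it is embedded in the generated script; the sample data avoids those characters.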

Answer 2

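If you need to do this many times and performance matters, I wrote a C program that does it. It is a modified version of this code. I know you didn't use a C tag, but my impression is that your main concern is getting the job done.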

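Note:

I take no responsibility for this. It is a quick hack, and I did make some assumptions. One assumption is that the strings you want to replace contain no commas. Another is that no line is longer than 100 bytes. A third is that the input files are named file and rep, respectively. If you want to try it, be sure to check the data afterward. It writes to stdout, so you just need to redirect the output to a new file. It does the job in about two seconds.

Here is the code: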

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{

  /* declare a file pointer */
  FILE    *infile;
  FILE    *replace;
  char    *buffer;
  char    *rep_buffer;
  long    numbytes;
  long    rep_numbytes;

  /* open an existing file for reading */
  infile = fopen("file", "r");
  replace = fopen("rep", "r");

  /* quit if the file does not exist */
  if(infile == NULL)
    return 1;
  if(replace == NULL)
    return 1;

  /* Get the number of bytes */
  fseek(infile, 0L, SEEK_END);
  numbytes = ftell(infile);
  fseek(replace, 0L, SEEK_END);
  rep_numbytes = ftell(replace);

  /* reset the file position indicator to
     the beginning of the file */
  fseek(infile, 0L, SEEK_SET);
  fseek(replace, 0L, SEEK_SET);

  /* grab sufficient memory for the
     buffer to hold the text */
  buffer = (char*)calloc(numbytes, sizeof(char));
  rep_buffer = (char*)calloc(rep_numbytes, sizeof(char));

  /* memory error */
  if(buffer == NULL)
    return 1;
  if(rep_buffer == NULL)
    return 1;

  /* copy all the text into the buffer */
  fread(buffer, sizeof(char), numbytes, infile);
  fclose(infile);
  fread(rep_buffer, sizeof(char), rep_numbytes, replace);
  fclose(replace);


  char line[100]={0};
  char *i=buffer;
  char *r=rep_buffer;

  while(i<&buffer[numbytes-1]) {
    int n;

    /* Copy from infile until second comma */
    for(n=0; i[n]!=','; n++);
    n++;
    for(; i[n]!=','; n++);
    n++;
    memcpy(line, i, n);

    /* Copy a line from replacement */
    int m;
    for(m=0; r[m]!='\n'; m++);

    memcpy(&line[n], r, m);

    /* Skip corresponding text from infile */
    int k;
    for(k=n; i[k]!=','; k++);

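    /* Copy the rest of the line, including its newline */
    int l;
    for(l=k; i[l]!='\n'; l++);
    l++;
    memcpy(&line[n+m], &i[k], l-k);

    /* Terminate the string so no stale bytes from a longer
       previous line are printed */
    line[n+m+l-k]='\0';

    /* Next line */
    i+=l;
    r+=m+1;

    /* Print to stdout */
    printf("%s", line);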
  }    


  /* free the memory we used for the buffer */
  free(buffer);
  free(rep_buffer);
}
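Since the note above recommends checking the data afterward, here is one possible sanity check, as a sketch only. It assumes the program's output was redirected to a file called out (a made-up name), that fields never contain commas (the same assumption the C code makes), and it uses tiny hypothetical sample files in place of the real ones.

```shell
tmp=$(mktemp -d)   # scratch directory; all file contents below are hypothetical

# Original input ("file"), replacement list ("rep"), and program output ("out").
printf '%s\n' 'hstor,table,00000000,ABCD,0000,0,4,19000101,today' \
              'hstor,table,00000001,EFGH,0000,0,4,19000101,today' > "$tmp/file"
printf '%s\n' '11110000' '00001111' > "$tmp/rep"
printf '%s\n' 'hstor,table,11110000,ABCD,0000,0,4,19000101,today' \
              'hstor,table,00001111,EFGH,0000,0,4,19000101,today' > "$tmp/out"

# Check 1: everything except field 3 must be unchanged.
cut -d, -f1,2,4- "$tmp/file" > "$tmp/in.rest"
cut -d, -f1,2,4- "$tmp/out"  > "$tmp/out.rest"

# Check 2: field 3 of the output must equal the replacement list, line by line.
cut -d, -f3 "$tmp/out" > "$tmp/out.f3"

if cmp -s "$tmp/in.rest" "$tmp/out.rest" && cmp -s "$tmp/out.f3" "$tmp/rep"; then
  check=ok
else
  check=mismatch
fi
echo "$check"
```

If either comparison fails, running diff on the same pairs of files shows which lines went wrong.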

Multiple string search and replace
