解析制表符分隔的文本文件

时间:2016-10-03 04:13:43

标签: c parsing tab-delimited

我正在尝试编写一个代码,该代码将通过将制表符之间的每个字符串分配给我定义的示例结构的给定元素来解析制表符分隔的文本文件。在输入文件中,第一行将包含所有类标识符(c_name),第二行将包含所有样本标识符(s_name),其余行将包含数据。

我知道它会有点复杂,因为第一列实际上只包含标签,但我想我开始尝试找出一般的解析方案。

我可以收集一下,例如,对于类标识符,我应该在for循环中使用fscanf将每个标识符添加到给定样本的类字段中,但是我在实际实现中迷路了。根据我看到的一篇文章,我认为我可以在fscanf中使用%[^\t]\t来读取数组中所有不是标签的选项卡,但我认为我没有右。

任何建议都将不胜感激。

#define LENGTH 30
#define MAX_OBS 80000

typedef struct
{
    char c_name[LENGTH];
    char s_name[LENGTH];
    double value[MAX_OBS];
}
sample;

// I've already calculated the number of columns in the file
sample sample[total_columns];
for (int i = 0; i < total_columns; i++)
   {
      fscanf(input, "%[^\t]\t", sample[i].s_name);
   }

编辑:我已尝试过以下代码的几种不同变体(“%[^ \ t \ n \ r \ n] \ t \ n \ r \ n”或“%[^ \ t \ n \ r] \ n%* * 1 [\ t \ n \ r]“或”%[^ \ t \ n \ r]“)它们似乎一般都在工作,除了这个,取决于我分配给数据的大小以及我有多长迭代,它在某个时刻给出了分段错误。下面的代码立即给出了分段错误,但如果我随意将两个地方的total_columns更改为3,它将打印Class Case Case。这似乎工作到14,此时整个程序分段出错。我对此问题相当困惑。我还尝试将内存malloc到示例数据数组,以查看它是否是堆栈与堆的问题,但这似乎也没有帮助。非常感谢你的帮助!

sample data[total_columns];
fseek(input, 0, SEEK_SET);
for (int i = 0; i < total_columns; i++)
{
    fscanf(input, "%[^\t\n\r]\t\n\r", data[i].s_name);
    printf("%s\n", data[i].s_name);
}

示例输入文件如下所示:

Class   Case    Case    Case    Case    Case    Case    Case    Case    Case    Case    Case    Case    Case    Case    Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control
Subject G038    G144    G135    G161    G116    G165    G133    G069    G002    G059    G039    G026    G125    G149    G108    G121    G060    G140    G127    G113    G023    G147    G011    G019    G148    G132    G010    G142    G020    G021
Data1   0.000741628 0.00308607  0.000267431 0.001418697 0.001237904 0.000761145 0.0008281   0.002426075 0.000236698 0.004924871 0.000722752 0.003758006 0.000104813 0.000986619 0.000121803 0.000666854 0   0.000171394 0.000877993 0.002717391 0.001336501 0.000812089 0.001448743 5.28E-05    0.001944298 0.000292529 0.000469631 0.001674047 0.000651526 0.000336615
Data2   0.102002396 0.108035127 0.015052531 0.079923731 0.020643362 0.086480609 0.017907667 0.016279315 0.076263965 0.034876124 0.187481931 0.090615572 0.037460171 0.143326961 0.029628502 0.049487575 0.020175439 0.122975405 0.019754837 0.006702899 0.014033264 0.040024363 0.076610375 0.069287599 0.098896479 0.011813681 0.293331246 0.037558052 0.303052867 0.137591517
Data2   0.218495065 0.242891829 0.23747851  0.101306336 0.309040188 0.237477347 0.293837554 0.34351816  0.217572429 0.168651691 0.179387106 0.166516699 0.099970652 0.181003474 0.076126675 0.10244981  0.449561404 0.139257863 0.127579104 0.355797101 0.354544105 0.262855651 0.10167146  0.186068602 0.316763006 0.187466247 0.05701315  0.123825467 0.064780343 0.069847682
Data4   0.141137543 0.090948286 0.102502388 0.013063365 0.162060849 0.166292135 0.070215996 0.063535037 0.333743609 0.131011609 0.140936687 0.150108506 0.07812762  0.230704405 0.069792935 0.120770743 0.164473684 0.448110378 0.42599534  0.074094203 0.096525097 0.157661185 0.036737518 0.213931398 0.091119285 0.438073807 0.224921728 0.187034237 0.06611442  0.086005218
Data5   0.003594044 0.003948354 0.008137536 0.001327901 0.002161974 0.003552012 0.002760334 0.001898667 0.001420186 0.003165988 0.001011853 0.001217382 0.000314439 0.004254794 0.000213155 0.003650147 0   0.002742309 0.002633978 0   0.002524503 0.002146234 0.001751465 0.006543536 0.003941146 0.00049505  0.00435191  0.001944054 0.001303053 0.004207692
Data6   0.000285242 2.27E-05    0   1.13E-05    0.0002964   3.62E-05    0.000138017 0.000210963 0.000662753 0   0   0   0   4.11E-05    0   0   0   0   0.000101307 0   0   0   0   5.28E-05    0.00152391  0   0   0   0   0
Data7   0.002624223 0.001134584 0.00095511  0.000419934 0.000401011 0.001739761 0.00272583  0.002566717 0.000520735 0.002311674 0.006287944 0   6.29E-05    0.000143882 3.05E-05    0.000491366 0   0   3.38E-05    0   0.001782002 0.000957104 0.002594763 0.000527704 0.000105097 0.001192619 3.13E-05    0   0.000744602 0.000252461
Data8   0.392777683 0.383875286 0.451499522 0.684663315 0.387394299 0.357992026 0.488406597 0.423473155 0.27267563  0.47454646  0.331020526 0.484041709 0.735955056 0.338841956 0.781699147 0.625403622 0.313596491 0.270545891 0.379259109 0.498913043 0.372438372 0.446271644 0.606698813 0.305593668 0.360535996 0.29889739  0.328710081 0.521222594 0.419924299 0.584111756

编辑:我似乎已经通过更改MAX_OBS定义来修复它 - 非常确定我对这实际意味着什么有一个根本的误解。我要调查一下。再次感谢您的帮助!

1 个答案:

答案 0 :(得分:2)

试试这个:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define LENGTH 30
#define MAX_OBS 80000

typedef struct{
    char c_name[LENGTH];
    char s_name[LENGTH];
    double value[MAX_OBS];
} Sample;//Duplication of type and variable names should be avoided. pointed out by Jonathan Leffler.

int main(void){
    char line[1024];
    FILE *input = fopen("data.txt", "r");

    fgets(line, sizeof(line), input);

    int total_columns = 0;
    char *p = strtok(line, "\t\n");

    while(p){
        ++total_columns;
        p = strtok(NULL, "\t\n");
    }
    --total_columns;//first column is field name
    rewind(input);
 //*******************************************************************************
    Sample *sample = malloc(total_columns * sizeof(*sample));//To allocate in the stack is large. So allocate by malloc.

    fscanf(input, "%*s\t");//skip first column
    for (int i = 0; i < total_columns; i++){
        fscanf(input, "%[^\t\n]\t", sample[i].c_name);//\n for last column
    }
    fscanf(input, "%*s\t");//skip first column
    for (int i = 0; i < total_columns; i++){
        fscanf(input, "%[^\t\n]\t", sample[i].s_name);
    }
    int r;
    for(r = 0; r < MAX_OBS; ++r){
        if(EOF==fscanf(input, "%*s")) break;
        for (int i = 0; i < total_columns; i++){
            fscanf(input, "%lf", &sample[i].value[r]);
        }
    }
    fclose(input);

    //test print
    printf("%s\n", sample[0].c_name);
    printf("%s\n", sample[0].s_name);
    for(int i = 0; i < r; ++i)
        printf("%f\n", sample[0].value[i]);
    printf("\n%s\n", sample[total_columns-1].c_name);
    printf("%s\n", sample[total_columns-1].s_name);
    for(int i = 0; i < r; ++i)
        printf("%f\n", sample[total_columns-1].value[i]);
    free(sample);
}