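Calculating percentages based on multiple columns in SAS

Time: 2014-05-20 21:19:22

Tags: sas

I have a dataset with a patient ID column and a second column holding the illness, which contains comma-separated categories. I need to find the total number of patients in each illness category and the percentage of patients in each category. I tried the usual approach; it gives the correct frequencies but not the correct percentages.

The data looks like this.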

ID    Type_of_illness  
4   lf13  
5   lf5,lf11    
63      
13  lf12    
85      
80      
15      
20      
131 lf6,lf7,lf12  
22      
24      
55  lf12  
150 lf12  
34  lf12  
49  lf12  
151 lf12  
60      
74      
88      
64      
82  lf13  
5   lf5,lf7  
112     
87  lf17  
78      
79  lf16  
83  lf11    

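Blanks mean no illness. I first split the illnesses into separate columns, then got stuck because I do not know how to go on from there to get the percentages.

The code I wrote is as follows: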

    data new;
    set old;
    array p(3) $ L1 L2 L3;   * character array, one variable per illness;
    do i = 1 to dim(p);      * the loop needs a start value;
    p(i)=scan(type_of_illness,i,',');
    end;
    run;

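Then I created a new column that stacks all the illnesses, thinking it would give me the correct frequencies. It does, but it does not give me the right percentages.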

data new;
set new;
L=L1;output;
L=L2;output;
L=L3;output;
run;
proc freq data=new;
tables L;run;

I have to create a table like this:

*Total number of patients    Percent*  
.......................................   
lf5         
lf7         
lf6         
lf11            
lf12            
lf13            

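Please help.

3 answers:

Answer 0 (score: 1)

You are trying to output percentages for non-mutually exclusive groups (each illness). It is not obvious how to do this in SAS.

The following uses Joe's input code but takes a different approach to working out the percentages, starting from event data (a 'long' dataset, if you like). I prefer this to creating a binary variable per illness at the patient level (a 'wide' dataset) because, for me, that quickly becomes unwieldy. That said, if you go on to do some modelling, a 'wide' dataset will usually be more useful.

The code below produces the following output: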


|                       |  Pats  |  Pats  |        |        |  Mean  |        |        |
|                       | with 0 |with 1+ | % with |  Num   | events |        |        |
|                       |records | record | record | Events |per pat |Std Dev | Median |
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf11                   |      24|       2|       8|       2|     1.0|    0.00|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf12                   |      19|       7|      27|       7|     1.0|    0.00|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf13                   |      24|       2|       8|       2|     1.0|    0.00|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf16                   |      25|       1|       4|       1|     1.0|       .|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf17                   |      25|       1|       4|       1|     1.0|       .|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf5                    |      25|       1|       4|       1|     1.0|       .|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf6                    |      25|       1|       4|       1|     1.0|       .|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf7                    |      24|       2|       8|       2|     1.0|    0.00|       1|
---------------------------------------------------------------------------------------|

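Note that patient 5 appears twice in your data with illness lf5. My code counts this record only once. That is fine for a chronic illness, but not for an acute one. Also, my code includes patients with no events in the denominator.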
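
A quick way to see which patients are repeated is sketched below. It is not part of the original answer and assumes the data have been read into a dataset named `have`, as in the code further down.

```sas
* Sketch only: list IDs that occur on more than one row of HAVE;
proc sql;
  select id, count(*) as n_rows
  from have
  group by id
  having count(*) > 1;
quit;
```

With the sample data this should list only ID 5, which occurs on two rows.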

Finally, you can see another example of this code, using dates and with test data, on the mycodestock.com code-sharing site => https://mycodestock.com/public/snippet/11251

Here is the code for the table above:

options nodate nonumber nocenter pageno=1 obs=max nofmterr ps=52 ls=100 formchar="|----||---|-/\<>*";

data have;
  format type_of_illness $30.;
  infile datalines truncover;
  input ID Type_of_illness $;
  datalines;
  4 lf13
  5 lf5,lf11
  63
  13 lf12
  85
  80
  15
  20
  131 lf6,lf7,lf12
  22
  24
  55 lf12
  150 lf12
  34 lf12
  49 lf12
  151 lf12
  60
  74
  88
  64
  82 lf13
  5 lf5,lf7
  112
  87 lf17
  78
  79 lf16
  83 lf11
  ;;;;
proc sort;
  by id;
run;

**  Create patient level data;
proc sort data = have(keep = id) out = pat_data nodupkey;
  by id;
run;

**  Create event table (1 row per patient*event);
**  NOTE: Patients without events are dropped (as is usual in events data);
data events(drop = i type_of_illness);
  set have;
  attrib grp length = $5 label = 'Illness';

  do i = 1 to countc(type_of_illness, ',') + 1;
    grp = scan(type_of_illness, i, ',');
    if grp ne '' then output;
  end;
run;

**  Count the number of events each patient had for each grp;
**  NOTE: The NODUPKEY in the PROC SORT removes duplicate records (within PAT & GRP);
**  NOTE: The use of CLASSDATA and COMPLETETYPES ensures zero counts for all patients and grps;
proc sort data = events out = perc2_summ_grp_pat nodupkey;
  by grp id;
run;
proc summary data = perc2_summ_grp_pat nway missing classdata = pat_data completetypes;
  by grp;
  class id;
  output out = perc2_summ_grp_pat(rename=(_freq_ = num_events) drop=_type_);
run;

**  Add a denominator variable - value '1' for each row.;
**  Ensure when num_events = 0 the value is set to missing;  
**  Create a flag variable - set to 1 - if a patient has a record (no matter how many);  
data perc2_summ_grp_pat;
  set perc2_summ_grp_pat;
  denom = 1;
  if num_events = 0 then num_events = .;
  flg_scripts = ifn(num_events, 1, .);
run;

proc tabulate data = perc2_summ_grp_pat format=comma8.;
  title1 bold "Table 1: N, % and basic statistics of events within non-mutually exclusive groups";
  title2 "Units: Patients - within each group level";
  title3 "The statistics summarise the number of events (not whether a patient had at least 1 event)";
  title4 "This means, for the statistics, only patients with 1+ record are included in the denominator";

  class grp;
  var denom flg_scripts num_events;
  table grp='', flg_scripts=''*(nmiss='Pats with 0 records' n='Pats with 1+ record' pctsum<denom>='% with record') 
                num_events=''*(sum='Num Events' mean='Mean events per pat'*f=8.1 stddev='Std Dev'*f=8.2 p50='Median');
run; title; footnote;
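
As a cross-check on the 'Pats with 1+ record' and '% with record' columns, the same two figures can be computed directly from the `events` and `pat_data` datasets created above. This is a minimal sketch rather than part of the original answer; the output column names `pats_with_record` and `pct_with_record` are made up for illustration.

```sas
* Sketch only: unique patients per illness, as a share of all unique patients;
proc sql;
  select grp,
         count(distinct id) as pats_with_record,
         count(distinct id) / (select count(*) from pat_data)
           as pct_with_record format=percent8.1
  from events
  group by grp;
quit;
```

`count(distinct id)` counts each patient once per illness, so the duplicate lf5 record for patient 5 is not double-counted, and the denominator comes from `pat_data`, so patients with no illness are still included.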

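Answer 1 (score: 0)

You are on the right track, but you need a different kind of percentage. By default the percentage is taken over the whole dataset, which here means your denominator is repeated three times. You want the percentage to be based on each illness separately. That means you need a 1/0 indicator for each illness.

One downside is that the automatically generated table will include the 0s. To show only the 1s, you would have to output the table to a dataset, drop the 0 rows, and then PROC PRINT/REPORT the resulting dataset, or use PROC SQL to generate the table.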

data have;
format type_of_illness $30.;
infile datalines truncover;
input ID Type_of_illness $;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
;;;;
run;

data want;
set have;
array L[8] lf5-lf7 lf11-lf13 lf16 lf17;
do _t = 1 to dim(L);
  if find(type_of_illness,trim(vname(L[_t]))) then L[_t]=1;
  else L[_t]=0;
end;
run;

proc tabulate data=want;
class lf:;
tables lf:,n pctn;
run;
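
One way to get only the "has illness" figures without post-processing the table is to summarise the 0/1 flags directly. The sketch below is not part of the original answer and swaps PROC TABULATE for PROC MEANS: for a 0/1 variable, the Sum is the number of rows flagged 1 and the Mean is that count divided by the number of rows. It assumes the `want` dataset built above.

```sas
* Sketch only: summarise the 0/1 illness flags created in WANT;
* SUM = number of rows with the illness, MEAN = share of all rows;
proc means data=want n sum mean maxdec=3;
  var lf5-lf7 lf11-lf13 lf16 lf17;
run;
```

Because ID 5 appears on two rows of `have`, these figures are per row rather than per unique patient; de-duplicating by ID first would be needed for strict patient counts.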

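Answer 2 (score: 0)

The multilabel format solution is interesting, so I am presenting it separately.

Using the same input dataset (`have`), we create a format that takes each combination of illnesses and outputs one row for each illness in it. That is, if you have "1,2,3", it outputs the rows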

1,2,3 = 1
1,2,3 = 2
1,2,3 = 3

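Enable multilabel formats and use a proc that supports them on its class variables, such as proc tabulate. Each respondent then counts toward every label value that applies, but is not counted more than once in the total.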

data for_procformat; 
set have;
start=type_of_illness;                     *start is the input to the format;
hlo=' m';                                  *m means multilabel, adding a space 
                                            here to leave room for the o later;
type='c';                                  *character format - n is numeric;
fmtname='$ILLF';                           *whatever name you like;
do _t = 1 to countw(type_of_illness,',');  *for each 'word' do this once;
  label=scan(type_of_illness,_t,',');      *label is the 'result' of the format;
  if not missing(label) then output;       
end;
if _n_=1 then do;                          *this block adds a row to deal with values;
  hlo='om';                                *not defined (in this case, just missings);
  label='No Illness';                      *the o means 'other';
  output;
end;
run;

proc sort data=for_procformat nodupkey;    *remove duplicates (which there will be many);
by start label;
run;

proc format cntlin=for_procformat;         *import the formats;
quit;

proc tabulate data=have;
class type_of_illness/mlf missing ;        *mlf means multilabel formats;
format type_of_illness $ILLF.;             *apply said format;
tables type_of_illness,n pctn;             *and your table;
run;
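
To confirm that the format really maps each combination to one label per illness, the stored ranges can be listed. This is a small sketch, not part of the original answer, and it assumes the format was created in the default WORK library under the name `$ILLF` as above.

```sas
* Sketch only: list the ranges and labels stored in the $ILLF format;
proc format library=work fmtlib;
  select $ILLF;
run;
```

A combination such as `lf6,lf7,lf12` should appear as a start value several times, once for each of its labels.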