我有一个带有patientID的数据和下一个有疾病的专栏,其中有一个以逗号分隔的类别。我需要找出每个疾病类别的患者总数和每个类别的患者百分比。我尝试了正常的方式,它给出的频率是正确的而不是百分比。
数据看起来像这样。
ID Type_of_illness
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
其中空格代表没有疾病。我首先将疾病分成不同的栏目,然后卡在那里不知道如何处理以找出百分比。
我写的代码如下:
Data new;
set old;
array P(3) L1 L2 L3;
do i to dim(p);
p(i)=scan(type_of_illness,i,',');
end;
run;
然后我创建了一个新列来复制所有疾病,所以我认为它会给我正确的频率,但它没有给我正确的百分比。
data new;
set new;
L=L1;output;
L=L2;output;
L=L3;output;
run;
proc freq data=new;
tables L;run;
我必须创建类似
的表格*Total numer of patients Percent*
.......................................
lf5
lf7
lf6
lf11
lf12
lf13
请帮忙。
答案 0 :(得分:1)
您试图输出非互斥群体(每种疾病)的百分比。在SAS中如何做到这一点并不明显。
以下内容采用了Joe的输入代码,但在确定事件数据的百分比时采用了另一种方法(如果愿意,可以使用'长'数据集)。我更喜欢这个在患者层面为一种疾病创建一个二元变量(一个'广泛的'数据集),因为对我来说,这很快变得笨拙。也就是说,如果你继续做一些建模,那么“广泛”的数据集通常会更有用。
以下代码生成输出如下: -
| | Pats | Pats | | | Mean | | |
| | with 0 |with 1+ | % with | Num | events | | |
| |records | record | record | Events |per pat |Std Dev | Median |
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf11 | 24| 2| 8| 2| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf12 | 19| 7| 27| 7| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf13 | 24| 2| 8| 2| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf16 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf17 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf5 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf6 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf7 | 24| 2| 8| 2| 1.0| 0.00| 1|
---------------------------------------------------------------------------------------|
请注意,您的疾病数据lf5中会重复患者5。我的代码只记录一次此记录。如果是慢性病,这是好的,但如果是急性则不是。此外,我的代码包括分母中没有事件的患者。
最后,你可以看到这个代码的另一个例子,使用日期 - 带有测试数据 - 这里是mycodestock.com代码共享网站=> https://mycodestock.com/public/snippet/11251
以下是上表的代码: -
options nodate nonumber nocenter pageno=1 obs=max nofmterr ps=52 ls=100 formchar="|----||---|-/\<>*";
data have;
format type_of_illness $30.;
infile datalines truncover;
input ID Type_of_illness $;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
;;;;
proc sort;
by id;
run;
** Create patient level data;
proc sort data = have(keep = id) out = pat_data nodupkey;
by id;
run;
** Create event table (1 row per patient*event);
** NOTE: Patients without events are dropped (as is usual in events data);
data events(drop = i type_of_illness);
set have;
attrib grp length = $5 label = 'Illness';
do i = 1 to countc(type_of_illness, ',') + 1;
grp = scan(type_of_illness, i, ',');
if grp ne '' then output;
end;
run;
** Count the number of events each patient had for each grp;
** NOTE: The NODUPKEY in the PROC SORT remove duplicate records (within PAT & GRP);
** NOTE: The use of CLASSDATA and COMPLETETYPES ensures zero counts for all patients and grps;
proc sort in = events out = perc2_summ_grp_pat nodupkey;
by grp id;
proc summary data = perc2_summ_grp_pat nway missing classdata = pat_data completetypes;
by grp;
class id;
output out = perc2_summ_grp_pat(rename=(_freq_ = num_events) drop=_type_);
run;
** Add a denominator variable - value '1' for each row.;
** Ensure when num_events = 0 the value is set to missing;
** Create a flag variable - set to 1 - if a patient has a record (no matter how many);
data perc2_summ_grp_pat;
set perc2_summ_grp_pat;
denom = 1;
if num_events = 0 then num_events = .;
flg_scripts = ifn(num_events, 1, .);
run;
proc tabulate data = perc2_summ_grp_pat format=comma8.;
title1 bold "Table 1: N, % and basic statistics of events within non-mutually exclusive groups";
title2 "Units: Patients - within each group level";
title3 "The statistics summarises the number of events (not whether a patient had at least 1 event)";
title4 "This means, for the statistics, only patients with 1+ record are included in the denominator";
class grp;
var denom flg_scripts num_events;
table grp='', flg_scripts=''*(nmiss='Pats with 0 records' n='Pats with 1+ record' pctsum<denom>='% with record')
num_events=''*(sum='Num Events' mean='Mean events per pat'*f=8.1 stddev='Std Dev'*f=8.2 p50='Median');
run; title; footnote;
答案 1 :(得分:0)
你正在做这件事,但你需要选择不同的百分比。通常,百分比是整个数据集的百分比,这意味着它将使您的基数重复三次。你想要百分比基于疾病。这意味着每种疾病需要1/0。
一个缺点是你的自动表中有0;您必须将表输出到数据集并删除它们,然后proc打印/报告结果数据集以仅获取1 - 或使用PROC SQL生成表。
data have;
format type_of_illness $30.;
infile datalines truncover;
input ID Type_of_illness $;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
;;;;
run;
data want;
set have;
array L[8] lf5-lf7 lf11-lf13 lf16 lf17;
do _t = 1 to dim(L);
if find(type_of_illness,trim(vname(L[_t]))) then L[_t]=1;
else L[_t]=0;
end;
run;
proc tabulate data=want;
class lf:;
tables lf:,n pctn;
run;
答案 2 :(得分:0)
多标签格式解决方案很有趣,所以我单独介绍它。
使用相同的,我们创建一种格式,它可以获取疾病的每种组合,并为其中的每种疾病输出一行,即,如果您有“1,2,3”,则输出行
1,2,3 = 1
1,2,3 = 2
1,2,3 = 3
启用多标记格式并使用启用class
的启动操作,例如proc tabulate
,然后您可以使用它来允许每个响应者计入每个标签值,但不能多次计数总数。
data for_procformat;
set have;
start=type_of_illness; *start is the input to the format;
hlo=' m'; *m means multilabel, adding a space
here to leave room for the o later;
type='c'; *character format - n is numeric;
fmtname='$ILLF'; *whatever name you like;
do _t = 1 to countw(type_of_illness,','); *for each 'word' do this once;
label=scan(type_of_illness,_t,','); *label is the 'result' of the format;
if not missing(label) then output;
end;
if _n_=1 then do; *this block adds a row to deal with values;
hlo='om'; *not defined (in this case, just missings);
label='No Illness'; *the o means 'other';
output;
end;
run;
proc sort data=for_procformat nodupkey; *remove duplicates (which there will be many);
by start label;
run;
proc format cntlin=for_procformat; *import the formats;
quit;
proc tabulate data=have;
class type_of_illness/mlf missing ; *mlf means multilabel formats;
format type_of_illness $ILLF.; *apply said format;
tables type_of_illness,n pctn; *and your table;
run;