Comparing observations

Time: 2016-05-28 11:38:05

Tags: compare stata

Suppose my dataset contains the following variables:

set obs 100
generate var1 = rnormal()
generate var2 = rnormal()

input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end

input double(id var5 var6)
1 10000 0.4
2 22000 0.55
3 25000 0.5
4 40000 1
end

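I need to drop the row for any id that has both a higher var5 value and a lower var6 value than at least one other id. In the first example, id 4 (2028 and 17.396) should be dropped, because id 5 has a lower var5 (1810) and a higher var6 (17.402). In the second example, id 3 (25000 and 0.5) should be dropped, because id 2 has a lower var5 (22000) and a higher var6 (0.55). After elimination, the observations for the three variables should look like this: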

1 1052 17.348
2 1288 17.378
3 1536 17.387
5 1810 17.402
6 2034 17.407

1 10000 0.4
2 22000 0.55
4 40000 1

var1 and var2 should remain unchanged.

How can I do this?

2 answers:

Answer 0: (score: 0)

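This is odd, because you seem to be describing a single dataset made up of completely unrelated variables. You have an initial dataset of 100 observations with variables var1 and var2, and then a secondary dataset of 6 observations with variables id, var5 and var6. Your goal appears to be to drop observations, but only for the values held in var5 and var6. This looks like a spreadsheet way of organizing data, whereas Stata holds only one dataset in memory at any given time.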

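Identifying the observations to drop requires comparing each observation's values of var5 and var6 with the values of those variables in every other observation. In Stata, this can be done by forming all pairwise combinations with the cross command.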
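To illustrate what cross produces, here is a minimal sketch run on the question's second example only. The tempfile name pairs and the flag variable worse are made up for this illustration; the _0 suffix simply mirrors the naming used in the solution.

```stata
* minimal sketch: all pairwise combinations for the 4-id example
clear
input double(id var5 var6)
1 10000 0.4
2 22000 0.55
3 25000 0.5
4 40000 1
end
tempfile pairs
save "`pairs'"

* suffix the copy in memory so its names do not clash with the using data
rename (id var5 var6) (id_0 var5_0 var6_0)

* pair every observation in memory with every observation in the using file
cross using "`pairs'"
count

* flag pairs where the _0 observation has a higher var5 and a lower var6
* (worse is a hypothetical name used only in this sketch)
generate byte worse = var5_0 > var5 & var6_0 < var6
list id_0 id if worse
```

With 4 ids, cross yields 4 x 4 = 16 pairs, and the only flagged pair should be id_0 = 3 compared against id = 2, which matches the observation the question wants removed.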

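Here is a solution that first sets up the data as you presented it, then separates the two datasets in order to drop observations based on the values of var5 and var6. Because the two datasets look completely unrelated, an unmatched merge (on observation number) is used to recombine the data.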

clear
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()

input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
tempfile main
save "`main'"

* extract secondary dataset 
keep id var5 var6
keep if !mi(id)
tempfile data2
save "`data2'"

* form all pairwise combinations
rename * =_0
cross using "`data2'"

* identify cases where there's an increase in var5 and decrease in var6
gen todrop = var5_0 > var5  & var6_0 < var6

* drop id if there's at least one case, reduce to original obs and vars
bysort id_0 (todrop): keep if !todrop[_N]
keep if id == id_0
keep id var5 var6
list

* now merge back with original data, use unmatched merge since 
* secondary data is unrelated
sort id
tempfile newdata2
save "`newdata2'"
use "`main'", clear
drop id var5 var6
merge 1:1 _n using "`newdata2'", nogen
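The final merge 1:1 _n step pairs observations purely by position rather than by a key variable. The following is a small sketch of that behavior in isolation, assuming a shortened main dataset of 10 observations and the three rows that survive in the second example; the tempfile name kept is hypothetical.

```stata
* minimal sketch of a merge on observation number
clear
input double(id var5 var6)
1 10000 0.4
2 22000 0.55
4 40000 1
end
tempfile kept
save "`kept'"

* stand-in for the unrelated main data (10 observations instead of 100)
clear
set obs 10
generate var1 = rnormal()

* observation 1 is matched with observation 1, 2 with 2, and so on;
* main observations with no counterpart get missing id, var5 and var6
merge 1:1 _n using "`kept'", nogenerate
list in 1/4
```

The result keeps all 10 observations of var1, with id, var5 and var6 filled in for only 3 of them and missing elsewhere.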

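Answer 1: (score: 0)

Here is one way to do this without separating the datasets. Identifying the observations to drop requires a double loop to make all pairwise comparisons. However, Stata has no command that drops observations for just a few variables. In the example below, I switch to Mata to load the observations to keep, then clear the values and store the kept observations back into the Stata variables: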

clear
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()

input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end

* an observation index
gen obsid = _n if !mi(id)

* identify observations to drop
gen todrop = 0 if !mi(id)
sum obsid, meanonly
local n = r(N)
quietly forvalues i = 1/`n' {
  forvalues j = 1/`n' {
    replace id = . if var5[`i'] > var5[`j'] & var6[`i'] < var6[`j'] & _n == `i'
  }
}

* take a trip to Mata to load the data to keep and store it back from there
mata:
// load data, ignore observations with missing values
X = st_data(., ("id","var5","var6"), 0)

// set all obs to missing
st_store(., ("id","var5","var6") ,J(st_nobs(),3,.))

// store non-missing values back into the variables
st_store((1,rows(X)), ("id","var5","var6") ,X)
end

drop obsid todrop
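The double loop can also be tried in isolation. The sketch below is a variation, not the code above: it runs on the question's second example and records the result in a todrop flag instead of setting id to missing, so the flagged observation can be inspected before anything is changed.

```stata
* minimal sketch: pairwise comparison via a double loop, flagging only
clear
input double(id var5 var6)
1 10000 0.4
2 22000 0.55
3 25000 0.5
4 40000 1
end

generate byte todrop = 0
local n = _N
forvalues i = 1/`n' {
    forvalues j = 1/`n' {
        * flag observation i if it has a higher var5 and a lower var6
        * than observation j
        quietly replace todrop = 1 if var5[`i'] > var5[`j'] & var6[`i'] < var6[`j'] in `i'
    }
}
list
```

Only id 3 should end up with todrop equal to 1, since it is beaten by id 2 on both variables.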

Alternatively, you can move the values up manually with a bit of observation-index gymnastics:

clear
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()

input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end

* an observation index
gen obsid = _n if !mi(id)

* identify observations to drop
gen todrop = 0 if !mi(id)
sum obsid, meanonly
local n = r(N)
quietly forvalues i = 1/`n' {
  forvalues j = 1/`n' {
    replace id = . if var5[`i'] > var5[`j'] & var6[`i'] < var6[`j'] & _n == `i'
  }
}

* move observations up
local j 0
quietly forvalues i = 1/`n' {

    if !mi(id[`i']) {
        local ++j
        replace id = id[`i'] in `j'
        replace var5 = var5[`i'] in `j'
        replace var6 = var6[`i'] in `j'
    }
}

local ++j
replace id = . in `j'/l
replace var5 = . in `j'/l
replace var6 = . in `j'/l

drop obsid todrop