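Now, since I want to do this automatically, it has to happen in a loop. In addition, each period has a different number of sequences. For example, for the third period there are two sequences (those with pay in periods 2 and 3, and those with pay in periods 1, 2 and 3). The number of variables to create is therefore 1 + 2 + 3 + 4 + ... + 17 = 153. This variability has to be reflected in the loop. I propose some code below, but parts of it are wrong, or I am unsure about them, as highlighted in the comments.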

local list b c d e f g h i j k l m n o p q r               // periods over which iterate
foreach var of local list {                 // loop over periods
    local counter = 1                   // counter to update; reflects sequence length 
    while `counter' > 0 {                   // loop over sequence lengths
        gen _`var'_counter_`counter' = 0        // generate variable with counter
            recode _`var'_counter_`counter' (0 = 1) // IM NOT SURE THIS IS HOW TO UPDATE SPECIFIC OBSERVATIONS.
            local counter = `counter' - 1       // update counter to look for a longer sequence in the next iteration
    local counter = `counter' + 1               // HERE IS PROBLEM 2. NEED TO STOP THIS LOOP! Otherwise counter goes to infinity.
    }
}


Obs   a  b  c  d
1     1  1  .  1
2     1  1  .  .
3     .  .  1  1
4     .  1  1  .
5     1  1  1  1

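where 1 means that a value is observed in that period and . means that it is not. The goal of the code is to create 1 + 2 + 3 = 6 new variables, so that the new dataset would be: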

Obs   a  b  c  d  b_count_2  c_count_2  c_count_3  d_count_2  d_count_3  d_count_4
1     1  1  .  1      1          0          0          0          0          0
2     1  1  .  .      1          0          0          0          0          0
3     .  .  1  1      0          0          0          1          0          0
4     .  1  1  .      0          1          0          0          0          0
5     1  1  1  1      1          1          1          1          1          1


local list a b c d e f g h i j k l m n o p q r                  // periods over which iterate
foreach var of local list {                         // loop over periods
    local list `var'_counter_*                      // group of sequence variables for each period
    foreach var2 of local list {                        // loop over each element of the list
        quietly sum `var'_counter_`var2' if `var'_counter_`var2' == 1   // sum the number of individuals with value = 1 with sequence of length var2 in period var
        di as text "Wave `var' has a sequence of length `var2' with " as result r(N) as text " observations." // print result
    }
}


"Wave 'b' has a sequence of length 2 with 3 observations."
"Wave 'c' has a sequence of length 2 with 2 observations."
"Wave 'c' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 2 with 2 observations."
"Wave 'd' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 4 with 1 observations."


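I echo @Dimitriy V. Masterov's point about your using this dataset shape. It can be convenient for some purposes, but for panel or longitudinal data such as yours, working with it in Stata is at best awkward and at worst impracticable.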


if apay[1] != . & bpay[1] != . 
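
The fragment above has the form that Stata gives to an `if` command whose expression names variables without subscripts: the expression is evaluated once, using the first observation only. A minimal sketch of the difference between the `if` command and the `if` qualifier, assuming that is the point at issue; the two-observation dataset and the variable `both` are made up for illustration:

```stata
* hypothetical toy data: apay and bpay are both non-missing only in observation 2
clear
input apay bpay
1 .
1 1
end

* if COMMAND: evaluated once, using observation 1 only,
* i.e. read as: if apay[1] != . & bpay[1] != .
if apay != . & bpay != . {
    display "branch taken"
}
else {
    display "branch not taken, because bpay[1] is missing"
}

* if QUALIFIER: evaluated separately for every observation
gen byte both = 0
replace both = 1 if apay != . & bpay != .
list
```

Here the `if` command takes the `else` branch even though observation 2 satisfies the condition, while the `if` qualifier sets `both` to 1 in observation 2 only.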


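Second, and more generally, I have not tried to understand all the details of your code, because what I see is the creation of a large number of variables even for a tiny dataset like the one in your sketch. For a series T periods long, you would create a triangular number [(T - 1) T] / 2 of new variables; in your example (17 x 18) / 2 = 153. If someone had series 100 periods long, they would need 4950 new variables.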
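
For the record, the count can be checked by noting that period t admits sequences of length 2 up to t, that is, t - 1 of them. A short worked version of the arithmetic, using the same T as above:

```latex
\[
\sum_{t=2}^{T} (t-1) \;=\; \sum_{k=1}^{T-1} k \;=\; \frac{(T-1)\,T}{2},
\qquad
T = 18:\ \frac{17 \times 18}{2} = 153,
\qquad
T = 100:\ \frac{99 \times 100}{2} = 4950 .
\]
```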

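Note that, because of the first point just made, these new variables would apply only, under your strategy, to a variable such as pay within individual panels. Presumably that restriction to individual panels could be fixed, but the main idea seems ill-advised in many ways. In a nutshell, what strategy do you have for working with these hundreds or thousands of new variables, other than writing yet more nested loops?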

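Your main need seems to be to identify spells of non-missing and missing values. Easy machinery for this was developed long ago. The general principles are discussed in this paper, and an implementation, tsspell, can be downloaded from SSC.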

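On Statalist, people are asked to provide workable examples with data as well as code. See this FAQ. That is exactly equivalent to the long-standing request here for an MCVE.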


* create some fake data
clear                                   // start from an empty dataset
set seed 12341
set obs 10
foreach pre in a b c d e f g {
    gen `pre'pay = runiform() if runiform() < .8
}

* build the pattern of missing data:
* one character per wave, a space where the value is missing
gen pattern = ""
foreach pre in a b c d e f g {
    qui replace pattern = pattern + cond(mi(`pre'pay), " ", "`pre'")
}

qui foreach pre in b c d e f g {
    noi dis "{hline 80}" _n as res "Wave `pre'"

    // the longest substring without a space up to the wave
    gen temp = regexs(1) if regexm(pattern, "([^ ]+`pre')")
    noi tab temp

    // loop over the various substring lengths, from 2 to max length
    gen len = length(temp)
    sum len, meanonly
    local n = r(max)
    forvalues i = 2/`n' {
        count if length(temp) >= `i'
        noi dis as txt "length = " as res `i' as txt " obs = " as res r(N)
    }
    drop temp len
}


* requires tsspell from SSC: ssc install tsspell

* create some fake data in wide form
clear                                   // start from an empty dataset
set seed 12341
set obs 10
foreach pre in a b c d e f g {
    gen `pre'pay = runiform() if runiform() < .8
}

* reshape to long form
gen id = _n
reshape long @pay, i(id) j(wave) string

* identify spells of contiguous periods
egen wavegroup = group(wave), label 
tsset id wavegroup  
tsspell, cond(pay < .)
drop if mi(pay)

foreach pre in b c d e f g {
    dis "{hline 80}" _n as res "Wave `pre'"

    // _seq is the position within the current spell, so its maximum
    // in a wave is the longest sequence ending in that wave
    sum _seq if wave == "`pre'", meanonly
    local n = r(max)
    forvalues i = 2/`n' {
        qui count if _seq >= `i' & wave == "`pre'"
        dis as txt "length = " as res `i' as txt " obs = " as res r(N)
    }
}


Obs   a  b  c  d
1     1  1  .  1
2     1  1  .  .
3     .  .  1  1
4     .  1  1  .
5     1  1  1  1

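The purpose of this answer is not to provide what the OP asks for, but to point out how many simple tools are available for looking at patterns of non-missing and missing values, none of which requires creating large numbers of extra variables or writing intricate code based on nested loops for every new question. Most of these tools require a reshape long first.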

. clear  

. input a b c d

             a          b          c          d
  1.  1 1 . 1
  2.  1 1 . .
  3.  . . 1 1
  4.  . 1 1 .
  5.  1 1 1 1
  6. end 

. rename (a b c d) (y1 y2 y3 y4) 

. gen id = _n 

. reshape long y, i(id) j(time) 
(note: j = 1 2 3 4)

Data                               wide   ->   long
Number of obs.                        5   ->      20
Number of variables                   5   ->       3
j variable (4 values)                     ->   time
xij variables:
                           y1 y2 ... y4   ->   y

. xtset id time 
       panel variable:  id (strongly balanced)
        time variable:  time, 1 to 4
                delta:  1 unit

. preserve 

. drop if missing(y) 
(7 observations deleted)

. xtdescribe 

      id:  1, 2, ..., 5                                      n =          5
    time:  1, 2, ..., 4                                      T =          4
           Delta(time) = 1 unit
           Span(time)  = 4 periods
           (id*time uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         2       2       2         2         3       4       4

     Freq.  Percent    Cum. |  Pattern
        1     20.00   20.00 |  ..11
        1     20.00   40.00 |  .11.
        1     20.00   60.00 |  11..
        1     20.00   80.00 |  11.1
        1     20.00  100.00 |  1111
        5    100.00         |  XXXX

* ssc inst xtpatternvar 
. xtpatternvar, gen(pattern) 

* ssc inst groups 
. groups pattern

  | pattern   Freq.   Percent     % <= |
  |    ..11       2     15.38    15.38 |
  |    .11.       2     15.38    30.77 |
  |    11..       2     15.38    46.15 |
  |    11.1       3     23.08    69.23 |
  |    1111       4     30.77   100.00 |

. restore  

. egen nmissing = total(missing(y)), by(time)

. tabdisp time, c(nmissing) 

     time |   nmissing
        1 |          2
        2 |          1
        3 |          2
        4 |          2
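
As a further illustration of how far one extra variable goes in long form, here is a rough sketch (not code from any of the answers above) that counts, for each period, the panels with an unbroken run of non-missing values of each length ending in that period. The variable name `run` and the loop macros `t` and `L` are made up for this sketch; the data are the five-panel example.

```stata
clear
input a b c d
1 1 . 1
1 1 . .
. . 1 1
. 1 1 .
1 1 1 1
end
rename (a b c d) (y1 y2 y3 y4)
gen id = _n
reshape long y, i(id) j(time)

* run = number of consecutive non-missing periods ending at this period
bysort id (time): gen run = cond(missing(y), 0, 1)
by id: replace run = run[_n-1] + 1 if _n > 1 & !missing(y)

* for each period t and each length L from 2 to t,
* count the panels whose current run is at least L long
forvalues t = 2/4 {
    forvalues L = 2/`t' {
        quietly count if time == `t' & run >= `L'
        display as text "time `t', length `L': " as result r(N) as text " panels"
    }
}
```

With the example data this should report 3 panels for time 2; 2 and 1 for time 3; and 2, 1 and 1 for time 4, which are the counts the question asks for.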