Cleaning dirty data

Time: 2017-06-05 10:55:55

Tags: matlab stata

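I have three variables (ID, Name, and City) and need to generate a new variable, flag.

Some observations contain errors. I need to find the bad observations and create flag, which indicates which column holds the bad value (0 = no error, 1 = ID, 2 = Name, 3 = City).

Assume each row contains at most one bad value.

Given this dirty data: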

|ID      |Name             |City 
|1       |IBM              |D    
|1       |IBM              |D    
|2       |IBM              |D    
|3       |Google           |F    
|3       |Microsoft        |F    
|3       |Google           |F    
|8       |Microsoft        |A    
|8       |Microsoft        |B    
|8       |Microsoft        |A    

Expected result:

|ID      |Name             |City |flag
|1       |IBM              |D    |0
|1       |IBM              |D    |0
|2       |IBM              |D    |1
|3       |Google           |F    |0
|3       |Microsoft        |F    |2
|3       |Google           |F    |0
|8       |Microsoft        |A    |0
|8       |Microsoft        |B    |3
|8       |Microsoft        |A    |0

2 answers:

Answer 0 (score: 3)

Here is a Stata answer. It relies on several assumptions that you stated in the comments but not in the original question:

clear all
input float ID str9 Name str1 City
1 "IBM"       "D"
1 "IBM"       "D"
2 "IBM"       "D"
3 "Google"    "F"
3 "Microsoft" "F"
3 "Google"    "F"
8 "Microsoft" "A"
8 "Microsoft" "B"
8 "Microsoft" "A"
end

// tag exact duplicates on all variables: right > 0 if the row has a copy, 0 if it is unique
duplicates tag, gen(right)

gen flag = 0

encode Name, gen(Name_n)
encode City, gen(City_n)

qui sum
forvalues start = 1(3)`r(N)' {
    local end = `start'+2

    // check if ID is all same
    qui sum ID in `start'/`end'
    if `r(sd)' != 0 {
        replace flag = 1 in `start'/`end' if right == 0
        continue
    }

    // check if name is all same
    qui sum Name_n in `start'/`end'
    if `r(sd)' != 0 {
        replace flag = 2 in `start'/`end' if right == 0
        continue
    }

    // check if city is all same
    qui sum City_n in `start'/`end'
    if `r(sd)' != 0 {
        replace flag = 3 in `start'/`end' if right == 0
        continue
    }
}

drop right Name_n City_n    

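The intuition: the observations come in groups of three, two of the three are always correct, and each group contains at most one error. The data are sorted by ID, and a wrong ID is never larger than the next correct ID, so each group stays contiguous. We can therefore check for duplicates first: if an observation has an exact duplicate, it is correct.

Next (inside the forvalues loop), we go through each group of three to see which variable holds the wrong value. Once we find it, we replace flag with the corresponding number for the observation that has no duplicate.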
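The two checks described above can be tried on a single group. This is a minimal sketch, assuming only the first three rows of the sample data; the names `dup` and `sd_id` are hypothetical and not part of the answer. `duplicates tag` stores, for each observation, the number of other observations that match it, so 0 means the row is unique.

```stata
clear all
input float ID str9 Name str1 City
1 "IBM" "D"
1 "IBM" "D"
2 "IBM" "D"
end

* dup = number of other rows identical on all three variables (0 = unique)
duplicates tag ID Name City, gen(dup)
list, noobs

* rows 1 and 2 are copies of each other; row 3 has no copy
assert dup == 1 in 1/2
assert dup == 0 in 3

* ID varies within the group, so its standard deviation is nonzero
quietly summarize ID
local sd_id = r(sd)
display "sd of ID in this group: `sd_id'"
assert `sd_id' != 0

* Name is constant within the group, so its encoded version has sd 0
encode Name, gen(Name_n)
quietly summarize Name_n
assert r(sd) == 0
```

Because Name is constant within the group, only the ID check fires, which is why the loop assigns flag 1 to the one row without a duplicate.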

Answer 1 (score: 2)

This code is based on Eric's answer.

clear all
input float ID str9 Name str1 City
1 "IBM"       "D"
1 "IBM"       "D"
2 "IBM"       "D"
3 "Google"    "F"
3 "Microsoft" "F"
3 "Google"    "F"
8 "Microsoft" "A"
8 "Microsoft" "B"
8 "Microsoft" "A"
end

encode Name, gen(Name_n)
encode City, gen(City_n)

// count duplicates on each pair of columns and on all three (0 = unique)
duplicates tag ID Name, gen(col_12)
duplicates tag ID City, gen(col_13)
duplicates tag Name City, gen(col_23)
duplicates tag ID Name City, gen(col_123)

// generate the flag
gen flag = 0
replace flag = 1 if col_123 == 0 & col_23 ~= 0
replace flag = 2 if col_123 == 0 & col_13 ~= 0
replace flag = 3 if col_123 == 0 & col_12 ~= 0

drop Name_n City_n col_*
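
One way to confirm that the pairwise logic reproduces the table in the question is to type the expected flags in by hand and compare. A minimal sketch: the variable `expected` is an assumption added for this check and is not part of the answer. The rule it tests is that a row which is unique on all three columns, but matches another row on two of them, has its error in the remaining column.

```stata
clear all
input float ID str9 Name str1 City byte expected
1 "IBM"       "D" 0
1 "IBM"       "D" 0
2 "IBM"       "D" 1
3 "Google"    "F" 0
3 "Microsoft" "F" 2
3 "Google"    "F" 0
8 "Microsoft" "A" 0
8 "Microsoft" "B" 3
8 "Microsoft" "A" 0
end

* number of other rows matching on each pair and on all three columns
duplicates tag ID Name, gen(col_12)
duplicates tag ID City, gen(col_13)
duplicates tag Name City, gen(col_23)
duplicates tag ID Name City, gen(col_123)

* unique row + match on two columns => the third column is the bad one
gen byte flag = 0
replace flag = 1 if col_123 == 0 & col_23 != 0
replace flag = 2 if col_123 == 0 & col_13 != 0
replace flag = 3 if col_123 == 0 & col_12 != 0

* stops with an error if any row disagrees with the hand-typed flags
assert flag == expected

drop col_*
list ID Name City flag, noobs sepby(ID)
```

Unlike the loop in the first answer, this version does not depend on the rows being sorted or arriving in groups of exactly three, but it still relies on every bad row matching at least one correct row on its two good columns.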