Is there a way to trim observations using known formats?

Time: 2019-06-20 06:40:22

Tags: r dplyr tidyr

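I am working with a database that has more than 40 variables. Each case has a unique identifier for its property. Some of these identifiers have been entered into the address variable.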

Identifiers can only take the following formats:

NA123456 - First letter constant - N, 1 Letter A-K, Numbers 1-9
SA123456 - First 2 letters constant - SA, 6 Numbers 0-9
MABC1234 - First letter constant - M, 3 Letters A-Z, 4 Numbers 0-9
QABC1234 - First letter constant - Q, 3 Letters A-Z, 4 Numbers 0-9
WABC1234 - First letter constant - W, 3 Letters A-Z, 4 Numbers 1-9
TABC1234 - First letter constant - T, 3 Letters A-Z, 4 Numbers 1-9
3ABCD123 - First number constant - 3, 3 Letters A-Z, 3 Numbers 1-9

I'm not sure how to remove the unique identifier from the address text that contains it without creating a lookup table and using left_join. The lookup table would need constant updating, which makes it very cumbersome.

I haven't found an example of this kind of thing, though I may have missed something.

My data looks like this:

Clean data would end up with the unique identifier in the identifier column, and would not overwrite with NA any observation that already has data in the correct variable.

Thanks in advance for your help.

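2 Answers:

Answer 0: (score: 1)

A possible answer using regular expressions and stringr::str_extract_all()

I assumed your numbers should be 0-9, not 1-9. If not, change every [0-9] to [1-9].
Also, if you are looking for a specific number of letter/digit repetitions (say n), change + to {n}, as in the first pattern in vec.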

library( data.table )
library( stringr )

# NA123456 - First letter constant - N, Letter A-K, Numbers 1-9
# SA123456 - First 2 letters constant - SA, Numbers 1-9
# MABC1234 - First letter constant - M, Letters A-Z, Numbers 1-9
# QABC1234 - First letter constant - Q, Letters A-Z, Numbers 1-9
# WABC1234 - First letter constant - W, Letters A-Z, Numbers 1-9
# TABC1234 - First letter constant - T, Letters A-Z, Numbers 1-9
# 3ABCD123 - First number constant - 3, Letters A-Z, Numbers 1-9

#create a vector with all regex-patterns
#I assumed 1-9 should be 0-9 ??             <-- !!
vec <- c( "N[A-K]{1}[0-9]+", 
          "SA[0-9]+",
          "M[A-Z]+[0-9]+",
          "Q[A-Z]+[0-9]+",
          "W[A-Z]+[0-9]+",
          "T[A-Z]+[0-9]+",
          "3[A-Z]+[0-9]+" )
#paste patterns together to one large regex-OR-pattern
pattern <- paste( vec, collapse = "|" )
#extract all matches from the column 'Address', and put them (as a list column) in Aa_reference
DT[, Aa_reference := stringr::str_extract_all( Address, pattern )]

Output

#                           Property               Address Aa_reference
# 1:                   PIC: 3WABG086  260 SPRINGHURST ROAD             
# 2:                   PIC: 35PSR217       1350 RIVER ROAD             
# 3:                   PIC# NH244157    1038 QUONDONG ROAD             
# 4:                   PIC: 3GMUF425         70 DIGBY ROAD             
# 5:                   PIC# 3GMUF425         70 DIGBY ROAD             
# 6:                   PIC QTIWW0626               REMOLEA             
# 7:                    PIC#EBWSE235               BOX 191             
# 8:                   PIC #3WLKM019   198 MONTGOMERY ROAD             
# 9:                  PIC # 3BWMM021    149 ANDERSONS ROAD             
# 10:                   PIC: 3WCGN034              WERRIBEE             
# 11:         GARANGULA PIC: NH630488             PO BOX 84             
# 12:         GARANGULA PIC: NH630488             PO BOX 84             
# 13:                   PIC: 3GMTL320  2980 GLENELG HIGHWAY             
# 14:       GREENSLOPES PIC: MJKE0261 914 WEST KENTISH ROAD             
# 15:                   PIC: WFZB3246     859 PFEIFFER ROAD             
# 16:                   PIC: WFAY3549  34605 ALBANY HIGHWAY             
# 17:                   PIC: 3CEXK044 2244 LAVERS HILL ROAD             
# 18:                   PIC: QGWW0462            ELDERFIELD             
# 19:                   PIC: 3WCGN034              WERRIBEE             
# 20: KAYA DORPER & WHITE DORPER STUD         PIC: WABN0262     WABN0262
# 21:                      SPOTTSWOOD          PIC QKDR0078     QKDR0078
# 22:             COOMBOONA HOLSTEINS          PIC 3SPSR217     3SPSR217
# 23:                        ROSEVALE         PIC: QKEV0169     QKEV0169
# 24:                            <NA>          PIC 3EGON009     3EGON009
# 25:                            <NA>          PIC WFKPO316     WFKPO316
# 26:                         IVADENE          PIC 3WANP0T1       3WANP0
# 27:                            <NA>          PIC ND225813     ND225813
# 28:           HEAVENLY VALLEY FARMS         PIC #NF538645     NF538645
# 29:          C/- CED WISE AB CENTRE         PIC: QCST0158     QCST0158
# 30:                       GARANGULA        PIC # NH630488     NH630488
#                            Property               Address Aa_reference
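
Row 26 in the output above shows the cost of the open-ended `+` quantifiers: "3WANP0T1" is not a valid identifier, yet "3WANP0" is extracted from it. A minimal sketch of a stricter variant follows, using fixed repetition counts and word boundaries. The counts are assumptions read off the format list in the question (six digits after N or SA, three letters plus four digits after M/Q/W/T, and four letters plus three digits after a leading 3, following the example 3ABCD123 rather than its description). The names `strict_vec`, `strict_pattern` and `loose_pattern` are made up for this illustration.

```r
library(stringr)

# A few Address values taken from the sample data, plus one from Property
x <- c("PIC: WABN0262", "PIC 3WANP0T1", "PIC #NF538645",
       "PIC#EBWSE235", "PIC 3EGON009")

# Loose pattern, equivalent to the one built from 'vec' above
loose_pattern <- paste(c("N[A-K]{1}[0-9]+", "SA[0-9]+", "M[A-Z]+[0-9]+",
                         "Q[A-Z]+[0-9]+", "W[A-Z]+[0-9]+", "T[A-Z]+[0-9]+",
                         "3[A-Z]+[0-9]+"), collapse = "|")

# Stricter patterns: exact lengths (ASSUMED from the question's format list)
strict_vec <- c("N[A-K][0-9]{6}",
                "SA[0-9]{6}",
                "[MQWT][A-Z]{3}[0-9]{4}",
                "3[A-Z]{4}[0-9]{3}")

# \\b on both sides so that only whole tokens can match
strict_pattern <- paste0("\\b(?:", paste(strict_vec, collapse = "|"), ")\\b")

loose_ids  <- str_extract(x, loose_pattern)
strict_ids <- str_extract(x, strict_pattern)

loose_ids[2]   # "3WANP0" (partial match on a malformed identifier)
strict_ids     # "WABN0262" NA "NF538645" NA "3EGON009"
```

Malformed identifiers then come back as NA instead of a truncated value, which makes them easier to spot and correct by hand. Whether that is preferable depends on how much the entered data deviates from the documented formats.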

Sample data used

DT <- fread('
Property |                       Address |              Aa_reference
PIC: 3WABG086|                   260 SPRINGHURST ROAD|  NA            
PIC: 35PSR217|                   1350 RIVER ROAD      | NA            
PIC# NH244157|                   1038 QUONDONG ROAD    |NA            
PIC: 3GMUF425|                   70 DIGBY ROAD|         NA            
PIC# 3GMUF425|                   70 DIGBY ROAD |        NA            
PIC QTIWW0626 |                  REMOLEA        |       NA            
PIC#EBWSE235   |                 BOX 191         |      NA            
PIC #3WLKM019   |                198 MONTGOMERY ROAD|   NA            
PIC # 3BWMM021   |               149 ANDERSONS ROAD  |  NA            
PIC: 3WCGN034     |              WERRIBEE             | NA            
GARANGULA PIC: NH630488|         PO BOX 84             |NA            
GARANGULA PIC: NH630488 |        PO BOX 84|             NA            
PIC: 3GMTL320|                   2980 GLENELG HIGHWAY|  NA            
GREENSLOPES PIC: MJKE0261|       914 WEST KENTISH ROAD| NA            
PIC: WFZB3246           |        859 PFEIFFER ROAD|     NA            
PIC: WFAY3549|                   34605 ALBANY HIGHWAY|  NA            
PIC: 3CEXK044 |                  2244 LAVERS HILL ROAD| NA            
PIC: QGWW0462  |                 ELDERFIELD|            NA            
PIC: 3WCGN034   |                WERRIBEE|              NA            
KAYA DORPER & WHITE DORPER STUD| PIC: WABN0262|         NA            
SPOTTSWOOD|                      PIC QKDR0078  |        NA            
COOMBOONA HOLSTEINS|             PIC 3SPSR217   |       NA            
ROSEVALE            |            PIC: QKEV0169   |      NA            
NA|                              PIC 3EGON009     |     NA            
NA |                             PIC WFKPO316      |    NA            
IVADENE|                         PIC 3WANP0T1       |   NA            
NA      |                        PIC ND225813        |  NA            
HEAVENLY VALLEY FARMS|           PIC #NF538645        | NA            
C/- CED WISE AB CENTRE|          PIC: QCST0158         |NA            
GARANGULA|                       PIC # NH630488        |NA
', sep = "|")

Answer 1: (score: 0)

This is what finally worked:

library( dplyr )
library( tidyr )
library( stringr )

vec <- c( "N[A-K]{1}[0-9]+", 
          "SA[0-9]+",
          "M[A-Z]+[0-9]+",
          "Q[A-Z]+[0-9]+",
          "W[A-Z]+[0-9]+",
          "T[A-Z]+[0-9]+",
          "3[A-Z]+[0-9]+" )

#paste patterns together to one large regex-OR-pattern
pattern <- paste( vec, collapse = "|" )

#match against the combined 'pattern' (the vector 'vec' would be recycled across rows);
#str_extract() returns NA when nothing matches, so no na_if() step is needed
df <- df %>%
  mutate(id1 = str_extract(Property, pattern),
         id2 = str_extract(Address, pattern)) %>% 
  #join both candidates, dropping NAs, then keep the first identifier found
  unite(id3, id1, id2, remove = TRUE, sep = " ", na.rm = TRUE) %>% 
  mutate(id3 = str_extract(id3, pattern))
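
The question also asks for the identifier to be removed from the address text and for existing values not to be overwritten with NA, which the code above does not cover. A rough sketch of one way to do that is below. The mini data frame, the `id` column name and the `tagged` pattern for the "PIC" prefix are all assumptions made for this illustration, not part of the original data.

```r
library(stringr)

# Hypothetical mini data set; 'id' stands in for the existing identifier column
df <- data.frame(
  Property = c("PIC: QGWW0462", "SPOTTSWOOD", "ROSEVALE", NA),
  Address  = c("ELDERFIELD", "PIC QKDR0078", "12 HILL ROAD", "PIC ND225813"),
  id       = c(NA, NA, "QKEV0169", NA),
  stringsAsFactors = FALSE
)

# Same loose formats as in the answers, written compactly
pattern <- "N[A-K][0-9]+|SA[0-9]+|[MQWT][A-Z]+[0-9]+|3[A-Z]+[0-9]+"

# ASSUMED prefix variants: "PIC", "PIC:", "PIC#", "PIC #", then the identifier
tagged <- paste0("PIC\\s*[:#]?\\s*(?:", pattern, ")")

# Look in Address first, fall back to Property
from_address  <- str_extract(df$Address, pattern)
from_property <- str_extract(df$Property, pattern)
found <- ifelse(is.na(from_address), from_property, from_address)

# Fill only the rows where the identifier is still missing
df$id <- ifelse(is.na(df$id), found, df$id)

# Strip the tagged identifier from the address text and tidy what is left
df$Address <- str_squish(str_remove(df$Address, tagged))
df$Address[which(df$Address == "")] <- NA

df$id       # "QGWW0462" "QKDR0078" "QKEV0169" "ND225813"
df$Address  # "ELDERFIELD" NA "12 HILL ROAD" NA
```

No lookup table is involved, so nothing needs updating when new identifiers appear, as long as they follow the documented formats.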