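How do I split a string into separate variables?

Time: 2017-03-20 15:27:39

Tags: r data-analysis data-cleaning

I am trying to analyze a large dataset of Airbnb listings. The amenities column lists the amenities that each listing offers.

For example,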

{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire 
extinguisher",Essentials,Shampoo,Hangers} 

{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in 
building",Heating,"Suitable for events","Smoke detector","Carbon monoxide 
detector","First aid kit",Essentials,Shampoo,"Lock on bedroom 
door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation 
missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}

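I have two problems to solve:

  1. I want to split the string into separate columns, e.g. there would be a column titled TV. If the string contains TV, the entry in the corresponding cell would be 1, otherwise 0. How can I do this?

  2. How do I remove the variables containing translation missing:.....?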

3 answers:

Answer 0: (score: 0)

I believe this would be a quick solution to the problem:

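library(data.table)

# Assumes df is already in long format: one row per listing_id/amenity pair
setDT(df)

# Count rows per combination, which gives 1 (present) or 0 (absent) when pairs are unique
dcast(df, listing_id ~ amenities, fun.aggregate = length, value.var = "amenities")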
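Note that dcast() needs one row per listing/amenity pair, not the raw {...} strings. A minimal sketch of that reshaping step on made-up data (the table long_df and its values are hypothetical, only there to show the shape dcast() expects):

```r
library(data.table)

# Hypothetical long-format input: one row per listing/amenity pair
long_df <- data.table(
  listing_id = c(1, 1, 2, 2, 2),
  amenities  = c("Kitchen", "Heating", "TV", "Kitchen", "Heating")
)

# Count rows per listing/amenity combination; absent combinations are filled with 0
wide_df <- dcast(long_df, listing_id ~ amenities,
                 fun.aggregate = length, value.var = "amenities", fill = 0)
print(wide_df)
```

Each amenity becomes its own column, and a listing without that amenity gets 0 in it.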

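Answer 1: (score: 0)

Is this the Boston Airbnb Open Data from Kaggle? Here is one way. Not exactly pretty, but it seems to work:

The idea is to strip the {} and then use read_csv() to parse the string.

Then, list the unique amenities and create a column for each of them: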

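library(dplyr)
library(readr)
listings <- read_csv(file = "../data/boston-airbnb-open-data/listings.csv")
# Strip the enclosing {} and parse each string as a one-line CSV header,
# so that quoted names containing spaces stay intact
parsed_amenities <-
  listings %>%
  .$amenities %>%
  sub("^\\{(.*)\\}$", "\\1\n", x = .) %>%
  lapply(function(x) names(read_csv(x)))
# One logical column per unique amenity, dropping the "translation missing" entries
df <-
  unique(unlist(parsed_amenities)) %>%
  .[!grepl("translation missing", .)] %>%
  setNames(., .) %>%
  lapply(function(x) vapply(parsed_amenities, "%in%", logical(1), x = x)) %>%
  as_tibble()  # replaces the deprecated as_data_frame()
df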

# # A tibble: 3,585 × 43
#       TV `Wireless Internet` Kitchen `Free Parking on Premises` `Pets live on this property` `Dog(s)` Heating
#    <lgl>               <lgl>   <lgl>                      <lgl>                        <lgl>    <lgl>   <lgl>
# 1   TRUE                TRUE    TRUE                       TRUE                         TRUE     TRUE    TRUE
# 2   TRUE                TRUE    TRUE                      FALSE                         TRUE     TRUE    TRUE
# 3   TRUE                TRUE    TRUE                       TRUE                        FALSE    FALSE    TRUE
# 4   TRUE                TRUE    TRUE                       TRUE                        FALSE    FALSE    TRUE
# 5  FALSE                TRUE    TRUE                      FALSE                        FALSE    FALSE    TRUE
# 6  FALSE                TRUE    TRUE                       TRUE                         TRUE    FALSE    TRUE
# 7   TRUE                TRUE    TRUE                       TRUE                        FALSE    FALSE    TRUE
# 8   TRUE                TRUE   FALSE                       TRUE                         TRUE     TRUE    TRUE
# 9  FALSE                TRUE   FALSE                      FALSE                         TRUE    FALSE    TRUE
# 10  TRUE                TRUE    TRUE                       TRUE                        FALSE    FALSE    TRUE
# # ... with 3,575 more rows, and 36 more variables: `Family/Kid Friendly` <lgl>, Washer <lgl>, Dryer <lgl>, `Smoke
# #   Detector` <lgl>, `Fire Extinguisher` <lgl>, Essentials <lgl>, Shampoo <lgl>, `Laptop Friendly Workspace` <lgl>,
# #   Internet <lgl>, `Air Conditioning` <lgl>, `Pets Allowed` <lgl>, `Carbon Monoxide Detector` <lgl>, `Lock on Bedroom
# #   Door` <lgl>, Hangers <lgl>, `Hair Dryer` <lgl>, Iron <lgl>, `Cable TV` <lgl>, `First Aid Kit` <lgl>, `Safety
# #   Card` <lgl>, Gym <lgl>, Breakfast <lgl>, `Indoor Fireplace` <lgl>, `Cat(s)` <lgl>, `24-Hour Check-in` <lgl>, `Hot
# #   Tub` <lgl>, `Buzzer/Wireless Intercom` <lgl>, `Other pet(s)` <lgl>, `Washer / Dryer` <lgl>, `Smoking
# #   Allowed` <lgl>, `Suitable for Events` <lgl>, `Wheelchair Accessible` <lgl>, `Elevator in Building` <lgl>,
# #   Pool <lgl>, Doorman <lgl>, `Paid Parking Off Premises` <lgl>, `Free Parking on Street` <lgl>
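The same idea can be sketched in base R on the two sample strings from the question, this time producing the 1/0 entries that were asked for. This is only a sketch under a few assumptions: it swaps read_csv() for scan(), and the names amenities, parsed and indicators are chosen here for illustration.

```r
# The two sample strings from the question
amenities <- c(
  '{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}',
  '{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}'
)

# Strip the enclosing braces, then split on commas while respecting double quotes
stripped <- gsub("^\\{|\\}$", "", amenities)
parsed <- lapply(stripped, function(x) {
  scan(text = x, what = "", sep = ",", quote = "\"", quiet = TRUE)
})

# Unique amenity names without the "translation missing" placeholders
all_names <- unique(unlist(parsed))
all_names <- all_names[!grepl("^translation missing", all_names)]

# One integer 0/1 column per amenity, one row per listing
indicators <- sapply(all_names, function(a) {
  as.integer(vapply(parsed, function(p) a %in% p, logical(1)))
})
indicators <- as.data.frame(indicators)
print(indicators)
```

With a single listing, sapply() would return a vector rather than a matrix, so the last step would need adjusting for that case.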

Answer 2: (score: 0)

Here is an approach that, like another answer, uses dcast() from the data.table package, but it also takes care of the tedious but important details of data cleaning.

library(data.table)
# read data file, returning one column
raw <- fread("AirBnB.csv", header = FALSE, sep = "\n", col.names = "amenities")
# add column with row numbers
raw[, rn := seq_len(.N)]
# remove opening and closing curly braces
raw[, amenities := stringr::str_replace_all(amenities, "^\\{|\\}$", "")]
# split amenities, thereby reshaping from wide to long format
long <- raw[, strsplit(amenities, ",", fixed = TRUE), by = rn]
# remove double quotes and leading and trailing whitespace
long[, V1 := stringr::str_trim(stringr::str_replace_all(V1, '["]', ""))]
# reshape from long to wide format, omitting rows which contain "translation missing..."
dcast(long[!V1 %like% "^translation missing"], rn ~ V1, length, value.var = "rn", fill = 0)
#   rn Air conditioning Carbon monoxide detector Elevator in building Essentials
#1:  1                1                        0                    0          1
#2:  2                1                        1                    1          1
#   Fire extinguisher First aid kit Hair dryer Hangers Heating Iron Kitchen
#1:                 1             0          0       1       1    0       1
#2:                 0             1          1       1       1    1       1
#   Laptop friendly workspace Lock on bedroom door Shampoo Smoke detector
#1:                         0                    0       1              0
#2:                         1                    1       1              1
#   Suitable for events TV Wireless Internet
#1:                   0  0                 1
#2:                   1  1                 1

Data file

The OP provided only two data samples, which have been copied into a data file named "AirBnB.csv":

{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}
{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}
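To reproduce this answer, the two sample records from the question can be written to a file first. A minimal sketch, assuming a temporary directory is acceptable; the path data_file is hypothetical and would be passed to fread() in place of the bare "AirBnB.csv":

```r
# The two sample records from the question, one per line
sample_lines <- c(
  '{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}',
  '{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}'
)

# Hypothetical location: a temporary directory instead of the working directory
data_file <- file.path(tempdir(), "AirBnB.csv")
writeLines(sample_lines, data_file)

# Read the file back to confirm it holds one record per line
file_lines <- readLines(data_file)
print(length(file_lines))
```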