Question

我是R语言的新手，我正在尝试探索由R for Data Science在线电影社区提供的数据集-https://github.com/rfordatascience/tidytuesday/blob/master/data/2018/2018-10-23/movie_profit.csv。

在学习和深入了解dplyr的过滤器功能时，我注意到当我在“ distributor”列中搜索带有空格的值时，没有得到任何结果，例如“ Walt”迪斯尼”中的示例。搜索没有空格的值可以很好地工作，例如“ Universal”。

此外，我还尝试了数据集中的其他列，例如“电影”，在其中过滤出其中包含空格的特定电影；但是，当我这样做时，我没有遇到任何问题，所以我有点困惑。

library(tidyverse)

movies <- read_csv(url("https://github.com/rfordatascience/tidytuesday/raw/master/data/2018/2018-10-23/movie_profit.csv")) 

test <- "20th Century Fox"

movies %>%
  filter(movie == "Dawn of the Planet of the Apes") %>%
  View()

简而言之，我很想知道这背后的原因，以便我知道如果我探索的任何未来数据集中再次出现这种情况，该如何处理。 给我带来麻烦的代码如下。 我想绘制发行商“沃尔特·迪斯尼”（Walt Disney）中电影的流派分布。


movies <- read_csv(url("https://github.com/rfordatascience/tidytuesday/raw/master/data/2018/2018-10-23/movie_profit.csv")) 

test <- "20th Century Fox"

movies %>%
  filter(distributor == "Walt Disney") %>%
  ggplot(aes(x = genre)) +
  geom_bar()

Answer 1

字符串似乎不匹配

all.equal("Walt Disney", "Walt Disney" )
#[1] "1 string mismatch"

如果我们检查值

unique(movies$distributor)[3]
#[1] "Walt Disney"

charToRaw(unique(movies$distributor)[3])
#[1] 57 61 6c 74 c2 a0 44 69 73 6e 65 79
charToRaw("Wald Disney")
#[1] 57 61 6c 64 20 44 69 73 6e 65 79

有一些差异触发了不匹配

最好从“分销商”中的值进行复制

library(dplyr)
library(ggplot2)
library(stringr)
movies %>%
    filter(str_detect(distributor, "Walt\\s+Disney")) %>%
   count(genre) %>%
   ggplot(aes(x = genre, y = n)) +
       geom_col()

-输出

Answer 2

以下使用agrepl进行近似字符串匹配，并在akrun's answer中给出图形。

movies %>% 
  filter(agrepl("Walt Disney", distributor)) %>%
  ggplot(aes(x = genre)) +
  geom_bar()

Answer 3

如其他文章中所述，表中的特殊字符代替普通空格。您可以将它们替换为常规空格，并且您的代码应该可以正常工作，而无需手动复制字符串。我添加了trimws来删除任何尾随空格。请注意，这还会删除其他特殊字符。

library(tidyverse)

movies <- read_csv(url("https://github.com/rfordatascience/tidytuesday/raw/master/data/2018/2018-10-23/movie_profit.csv")) 

# this line replaces non-alphanumeric characters with a space and removes any trailing whitespace at the end. 
movies$distributor <- trimws(gsub("[^[:alnum:]]", " ", movies$distributor))
movies %>%
  filter(distributor == "Walt Disney") %>%
  ggplot(aes(x = genre)) +
  geom_bar()

Answer 4

dplyr过滤器与空格完美配合，但是您尝试过滤的数据包含特殊字符

为什么dplyr的过滤器不能与简单过滤器中的空格一起使用？

4 个答案: