Extracting numbers with decimals from a large string in R

Time: 2018-10-18 13:44:48

Tags: r regex gsub stringr

I want to extract numbers from a vector containing 15 observations:

rs <- c("\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.0\n                    (1 rating)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            9 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.7\n                    (4 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            34 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    3.1\n                    (5 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            22 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    2.4\n                    (14 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            2,106 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.3\n                    (67 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            1,287 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.6\n                    (3 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            30 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n                \n                    \n\n    \n        New\n    \n\n\n                \n\n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    0.0\n                    (0 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            8 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n                \n                    \n\n    \n        Highest Rated\n    \n\n\n                \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.6\n                    (12 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            42 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.4\n                    (6 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            41 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.2\n                    (12 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            115 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.8\n                    (6 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            25 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.6\n                    (19 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            151 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.5\n                    (10 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            385 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.8\n                    (166 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            754 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    3.6\n                    (34 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            3,396 students enrolled\n        \n    \n\n\n    \n\n    "
)

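As you can see, there are 15 very long elements full of junk. However, the pattern inside them is easy to recognize. Each element contains 3 numbers (taking the first observation as an example):

  • Rating: from 0 to 5. For example, 4.0
  • Number of ratings. For example, (1 rating)
  • Students enrolled. For example, 9 students enrolled

I want to extract all of these values and create a data frame with 3 columns, one for each variable.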

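I have already checked several questions on Stack Overflow, which mostly focus on using gsub() or the stringr package. However, I could not find an approach that solves my problem.

Update

This is the code I tried: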

3 answers:

Answer 0 (score: 3)

Using extract from tidyr, we can do:

library(dplyr)
library(tidyr)

data.frame(rs, stringsAsFactors = FALSE) %>%
  extract(rs, c("Rating", "Number_of_ratings", "Students_enrolled"),
          "(?s)(\\d\\.\\d).*?(\\d+)\\s*ratings?.*?(\\d+(?:,\\d+)?)\\s*students enrolled", 
          convert = TRUE) %>%
  mutate(Students_enrolled = as.numeric(sub(",", "", Students_enrolled)))

Output:

   Rating Number_of_ratings Students_enrolled
1     4.0                 1                 9
2     4.7                 4                34
3     3.1                 5                22
4     2.4                14              2106
5     4.3                67              1287
6     4.6                 3                30
7     0.0                 0                 8
8     4.6                12                42
9     4.4                 6                41
10    4.2                12               115
11    4.8                 6                25
12    4.6                19               151
13    4.5                10               385
14    4.8               166               754
15    3.6                34              3396

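Notes:

The regex looks complicated, but it really isn't. What extract does is pull the match from each capture group (the parts wrapped in parentheses) and turn each one into its own column.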

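  1. (?s) is a modifier that turns on "dotall" (single-line) mode, so that the dot . also matches newline characters.

  2. (\\d\\.\\d) matches the Rating pattern.

  3. (\\d+)\\s*ratings? matches the Number_of_ratings pattern, but only the digits (\\d+) are extracted.

  4. (\\d+(?:,\\d+)?)\\s*students enrolled matches the Students_enrolled pattern, but only the "number with or without a comma" part is extracted.

  5. convert = TRUE tries to convert the resulting columns to their best data type, but since Students_enrolled contains commas, an extra mutate is needed to convert it to numeric.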
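The effect of the (?s) modifier from item 1 can be checked on a small string. This is a minimal sketch that uses base R's grepl() with perl = TRUE instead of tidyr, and a shortened sample string that is an assumption made for illustration; other regex engines may treat the dot differently by default.

```r
# Shortened stand-in for one element of rs (an assumption for illustration)
x <- "4.0\n        (1 rating)"

# Without (?s), "." stops at the newline, so the pattern cannot span lines
without_dotall <- grepl("\\d\\.\\d.*rating", x, perl = TRUE)

# With (?s), "." also matches "\n", so the pattern spans both lines
with_dotall <- grepl("(?s)\\d\\.\\d.*rating", x, perl = TRUE)

print(c(without = without_dotall, with = with_dotall))
```

Only the second call should report a match, which is why the modifier is needed when the rating and its label sit on different lines.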

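Normally, extract throws an error if the number of capture groups does not equal the number of output columns. Because the modifier (?s) and the non-capturing group (?:...) do not count as capture groups, the capture group count here matches the number of columns.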
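To see how capture groups are counted, here is a small sketch in base R using regexec() and regmatches() with perl = TRUE. The sample string is hypothetical, and the check is done outside tidyr, so it only illustrates the counting rule rather than what extract does internally.

```r
# Hypothetical one-line sample in the style of rs
x <- "2,106 students enrolled"

# regexec() reports the full match followed by one entry per capture group
pattern <- "(?s)(\\d+(?:,\\d+)?)\\s*students enrolled"
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

print(m)

# Drop the full match to count capture groups; (?s) and (?:...) add none
n_groups <- length(m) - 1
print(n_groups)
```

The pattern contains three sets of parentheses, yet only the outer (\\d+(?:,\\d+)?) produces a captured value.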

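Answer 1 (score: 3)

A solution using base R plus one dependency (stringi), with a commented, readable regex.

It also shows how to clean up the text before processing (in a reusable way).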

library(stringi)

do.call(
  rbind.data.frame,
  lapply(
    stri_match_all_regex(
      stri_replace_all_regex(
        stri_trim_both(rs),             # clean up outer spaces
        "[[:blank:][:space:]]+", " "    # clean up inner spaces
      ),
      "
([[:digit:]\\.]+)[[:space:]]+\\(([[:digit:],]+)[[:space:]]+rating[s]*\\)# pick up the rating and total number of ratings
[^[:digit:]]*([[:digit:],]+)[[:space:]]+student[s]*[[:space:]]+enrolled                          # pick up the number of students enrolled
",
      opts_regex = stri_opts_regex(comments = TRUE)   # allow whitespace and # comments in the pattern
    ),
    function(x) {
      # x is a 1 x 4 match matrix: the full match, then the three capture groups
      as.list(
        setNames(
          x[2:4], c("rating", "n_ratings", "enrolled")
        )
      )
    }
  )
)

Result:

##    rating n_ratings enrolled
## 2     4.0         1        9
## 21    4.7         4       34
## 3     3.1         5       22
## 4     2.4        14    2,106
## 5     4.3        67    1,287
## 6     4.6         3       30
## 7     0.0         0        8
## 8     4.6        12       42
## 9     4.4         6       41
## 10    4.2        12      115
## 11    4.8         6       25
## 12    4.6        19      151
## 13    4.5        10      385
## 14    4.8       166      754
## 15    3.6        34    3,396

Turning the columns above into numbers after that is pretty basic.
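One way that conversion might look is sketched below. The data frame res is a hypothetical three-row stand-in for the result printed above, with every column stored as text; the name and the sample rows are assumptions for illustration.

```r
# Small stand-in for the result above, with every column stored as text
res <- data.frame(
  rating    = c("4.0", "2.4", "3.6"),
  n_ratings = c("1", "14", "34"),
  enrolled  = c("9", "2,106", "3,396"),
  stringsAsFactors = FALSE
)

# Remove thousands separators, then convert each column to numeric
res[] <- lapply(res, function(col) as.numeric(gsub(",", "", col, fixed = TRUE)))

str(res)
```

Assigning into res[] keeps the data frame structure and column names while replacing the contents of every column.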

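Answer 2 (score: 2)

So your problem is that it does not see the "." as part of the number, because it is inside a string. You therefore need to look explicitly for the digits together with the decimal point.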

library(stringr)
Rating <- as.numeric(str_extract(rs, "[0-9]\\.[0-9]"))
# "[0-9]+" keeps multi-digit counts such as (14 ratings) intact
NRatings <- as.numeric(str_replace(str_extract(rs, "\\([0-9]+"), "\\(", ""))

I'll let you figure out the last one based on these examples ;)
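For reference, one possible way to get the remaining value in the same style is sketched here. It is not part of the original answer: the lookahead pattern, the variable names, and the shortened sample_rs vector are all assumptions for illustration.

```r
library(stringr)

# Shortened stand-ins for two elements of rs (assumed for illustration)
sample_rs <- c(
  "\n 4.0\n (1 rating)\n \n 9 students enrolled\n",
  "\n 2.4\n (14 ratings)\n \n 2,106 students enrolled\n"
)

# Digits and commas that are directly followed by " students enrolled"
students_raw <- str_extract(sample_rs, "[0-9,]+(?=\\s+students enrolled)")

# Drop the thousands separator before converting to numeric
Students <- as.numeric(str_replace_all(students_raw, ",", ""))

print(Students)
```

The lookahead (?=...) checks for the label without including it in the match, so no second cleanup step is needed beyond removing the comma.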
