Extracting numbers from a sentence

Time: 2014-08-24 18:20:25

Tags: regex r string text-extraction

I need to extract some numbers from a piece of text. The text is

x <- "Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? Sequitur disserendi ratio cognitioque 295. naturae;"

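The numbers to extract are 325 and 232. These numbers are enclosed in brackets and sit at the end of a sentence. The other numbers should not be included. I tried strsplit(x, "[A-Za-z]+"), but it did not give me what I need.

4 answers:

Answer 0: (score: 5)

Here is a stringi approach: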

x <- "Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? Sequitur disserendi ratio cognitioque 295. naturae; Claudii libidini, qui tum erat summo ne imperio, dederetur"

library(stringi)
stri_extract_all_regex(x, "(?<=[\\[(])\\d+(?=[\\])][.?!])")

## [[1]]
## [1] "325" "232"

Answer 1: (score: 4)

Another approach, using base R:

r <- gregexpr("[[(]\\d+[])](?=\\.)", x, perl = TRUE)
(m <- regmatches(x, r)[[1]])
# [1] "(325)" "[232]"

as.integer(gsub("\\D", "", m))
# [1] 325 232

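Answer 2: (score: 3)

Here is a solution using strsplit. Note that it selects the results by position, so it depends on where the wanted numbers fall in this particular string:
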
> x <- 'Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? Sequitur disserendi ratio cognitioque 295. naturae;'
> strsplit(x, '[^0-9]+')[[1]][3:4]
## [1] "325" "232"

Or use regmatches in base R to extract these values by pattern:

> regmatches(x, gregexpr('[[(]\\K\\d+(?=[])](?!,))', x, perl=TRUE))[[1]]
## [1] "325" "232"

Answer 3: (score: 0)

Using Python's re module:

import re

string = "Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? Sequitur disserendi ratio cognitioque 295. naturae;"

print(string)

# Digits preceded by [ or ( and followed by ] or ) and then a period
pattern = re.compile(r'(?<=[\[(])\d+(?=[\])]\.)')

result = pattern.findall(string)

print(result)
# ['325', '232']
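
If integers are wanted instead of strings, or if sentences may end in `?` or `!` as the stringi answer allows, the pattern above can be wrapped in a small helper. This is only a minimal sketch: the function name `extract_bracketed_numbers` and the choice of `.`, `?` and `!` as sentence terminators are assumptions made for illustration, not part of the original answer.

```python
import re

# Assumption: a sentence ends with one of . ? ! directly after the closing bracket.
PATTERN = re.compile(
    r"""
    (?<=[\[(])      # lookbehind: an opening [ or ( comes just before
    \d+             # the digits to extract
    (?=[\])][.?!])  # lookahead: a closing ] or ), then sentence-ending punctuation
    """,
    re.VERBOSE,
)


def extract_bracketed_numbers(text):
    """Return bracketed numbers that close a sentence, converted to int."""
    return [int(digits) for digits in PATTERN.findall(text)]


text = (
    "Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). "
    "Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? "
    "Sequitur disserendi ratio cognitioque 295. naturae;"
)

print(extract_bracketed_numbers(text))
# [325, 232]
```

Like the original pattern, this sketch does not check that the opening and closing brackets are of the same kind, so a mismatched pair such as `(12].` would also match.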