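I have a text data file that I will probably read with readLines. The initial portion of each string contains a lot of gibberish, followed by the data I need. The gibberish and the data are usually separated by three dots. I would like to split each string after the last three dots, or replace the last three dots with a marker of some sort that tells R to treat everything to the left of those three dots as one column.
Here is a similar post on Stack Overflow that locates the last dot:
R: Find the last dot in a string
However, in my case some of the data contain decimals, so locating the last dot is not sufficient. Also, I think ... has a special meaning in R, which might complicate matters. Another potential complication is that some of the dots are bigger than others. In addition, in some lines one of the three dots is replaced by a comma.
In addition to gregexpr from the post above, I have tried using gsub, but could not figure out a solution.
Here is an example data set and the result I hope to achieve: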
aa <- matrix(c(
  'first string of junk... 0.2 0 1',
  'next string ........2 0 2',
  '%%%... ! 1959 ... 0 3 3',
  'year .. 2 .,. 7 6 5',
  'this_string is . not fine .•. 4 2 3'),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, c("C1")))
aa <- as.data.frame(aa, stringsAsFactors = FALSE)
aa
# desired result
#                          C1  C2 C3 C4
# 1      first string of junk 0.2  0  1
# 2         next string .....   2  0  2
# 3             %%%... ! 1959   0  3  3
# 4                 year .. 2   7  6  5
# 5 this_string is . not fine   4  2  3
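I hope this question is not too specific. The text data file was created using the steps outlined in my post from yesterday about reading an MSWord file in R.
Some of the lines contain no gibberish and no three dots, only data. However, that may be a complication for a follow-up post.
Thank you for any advice.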
Answer 0 (score: 5)
This does the trick, though it is not especially elegant...
options(stringsAsFactors = FALSE)
# Search for three consecutive delimiter characters, then pull out
# all of the characters after them
# (the part in parentheses, referred to in the replacement as \\1).
# The greedy ^.* means the last such run in each string is the one matched.
nums <- gsub(pattern = "^.*[.,•]{3}\\s*(.*)", replacement = "\\1", x = aa$C1)
# Use strsplit to break the results apart at spaces, which gives one
# character vector of numbers per original string.
# Use do.call(rbind, ...) to stack those vectors into a character
# matrix with one row per original string.
num.mat <- do.call(rbind, strsplit(nums, split = " "))
# Mash it back together with your original strings
result <- as.data.frame(cbind(aa, num.mat))
# Give it informative names
names(result) <- c("original.string", "num1", "num2", "num3")
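The result above keeps the full original string in the first column. If the goal is the four-column layout shown in the question (junk text in C1, numbers in C2 to C4), one possible variation is to capture the text before the last delimiter run as well. This is only a sketch: the helper name `split_last_delim`, the variable `pat`, and the sample vector `x` are illustrative and not part of the original answer, and it assumes every row has exactly three numeric fields after the delimiter.

```r
# Sample strings copied from the question (bullet written as \u2022).
x <- c("first string of junk... 0.2 0 1",
       "next string ........2 0 2",
       "%%%... ! 1959 ... 0 3 3",
       "year .. 2 .,. 7 6 5",
       "this_string is . not fine .\u2022. 4 2 3")

# Hypothetical helper: split each string at its last run of three delimiters.
split_last_delim <- function(x) {
  # The greedy (.*) pushes the match to the LAST run of three delimiter
  # characters (period, comma, or bullet); group 2 is whatever follows it.
  pat  <- "^(.*)[.,\u2022]{3}\\s*(.*)$"
  junk <- trimws(sub(pat, "\\1", x))
  nums <- trimws(sub(pat, "\\2", x))
  # Split the numeric tail on runs of whitespace.
  # Assumption: every row yields exactly three fields.
  parts <- do.call(rbind, strsplit(nums, "\\s+"))
  out <- data.frame(C1 = junk, parts, stringsAsFactors = FALSE)
  names(out) <- c("C1", "C2", "C3", "C4")
  # Convert the three data columns from character to numeric.
  out[-1] <- lapply(out[-1], as.numeric)
  out
}

res <- split_last_delim(x)
res
```

Rows that contain no delimiter run at all would pass through `sub()` unchanged, so they would need separate handling before a helper like this is used on the real file.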
Answer 1 (score: 2)
This will get you most of the way there, and it has no problems with numbers that include commas:
# First, use a regex to eliminate the bad pattern. This regex
# eliminates any three-character combination of periods, commas,
# and big dots (•), so long as the combination is followed by
# 0-2 spaces and then a digit.
aa.sub <- as.matrix(
  apply(aa, 1, function (x)
    gsub('[•.,]{3}(\\s{0,2}\\d)', '\\1', x, perl = TRUE)))
# Second: it looks as though you want your data split into columns.
# So this regex splits on spaces that are (a) preceded by a letter,
# digit, or space, and (b) followed by a digit. The result is a
# list, each element of which is a list containing the parts of
# one of the strings in aa.
aa.list <- apply(aa.sub, 1, function (x)
  strsplit(x, '(?<=[\\w\\d\\s])\\s(?=\\d)', perl = TRUE))
# Remove the second element of aa.list. There is no space before the
# first data column in this string. As a result, strsplit() split
# it into three columns, not 4. That in turn throws off the code
# below.
aa.list <- aa.list[-2]
# Make the data frame.
aa.list <- lapply(aa.list, unlist) # convert list of lists to list of vectors
aa.df <- data.frame(aa.list)
aa.df <- data.frame(t(aa.df), row.names = NULL, stringsAsFactors = FALSE)
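The only thing left is to modify the regex for strsplit() so that it can handle the second string in aa. Or perhaps it is better just to handle cases like that manually.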
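One way the second string might be handled without a manual special case is to replace the delimiter with a marker instead of deleting it, and then split relative to that marker. The sketch below is not the answerer's code. It assumes a tab never occurs in the raw strings (the `marker` choice is hypothetical), and it uses a greedy prefix so that only the last delimiter run is marked.

```r
# Sample strings copied from the question (bullet written as \u2022).
aa <- data.frame(
  C1 = c("first string of junk... 0.2 0 1",
         "next string ........2 0 2",
         "%%%... ! 1959 ... 0 3 3",
         "year .. 2 .,. 7 6 5",
         "this_string is . not fine .\u2022. 4 2 3"),
  stringsAsFactors = FALSE)

marker <- "\t"  # assumption: a tab never appears in the raw strings

# Step 1: replace the LAST run of three delimiters (plus any spaces after it)
# with the marker. The greedy (.*) is what selects the last run.
marked <- sub("^(.*)[.,\u2022]{3}\\s*", paste0("\\1", marker), aa$C1,
              perl = TRUE)

# Step 2: split on any whitespace run that has no marker after it. The marker
# itself is whitespace, so this splits at the marker and at every space that
# follows it, but never at spaces inside the junk text.
pieces <- strsplit(marked, "\\s+(?=[^\\t]*$)", perl = TRUE)

# Assumption: every row produced exactly four pieces.
aa.df <- data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
names(aa.df) <- c("C1", "C2", "C3", "C4")
aa.df
```

Because the split depends only on the marker, it does not matter whether a space precedes the first data column, which is the case that trips up the second string above. The data columns are left as character here and can be converted with as.numeric() if needed.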
Answer 2 (score: 0)
Reverse the string
Reverse the pattern you are searching for if necessary (it is not necessary in your case)
Reverse the result
[haiku-pseudocode]
a = 'first string of junk... 0.2 0 1' // string to search
b = 'junk' // pattern to match
ra = reverseString(a) // now equals '1 0 2.0 ...knuj fo gnirts tsrif'
rb = reverseString(b) // now equals 'knuj'
// run your regular expression search / replace - search in 'ra' for 'rb'
// put the result in rResult
// and then unreverse the result
// apologies for not knowing the syntax for 'R' regex
[/haiku-pseudocode]
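The reversal idea above could be sketched in R roughly as follows. Base R has no built-in string-reversal function that I am certain of, so `reverse_string` is a hypothetical helper written here, and the "|" marker is assumed not to occur in the data. The point of reversing is that the last delimiter run of the original string becomes the first run in the reversed string, so a plain sub(), which replaces only the first match, is enough.

```r
# Hypothetical helper: reverse each string by code point, so that multi-byte
# characters such as the bullet (U+2022) stay intact.
reverse_string <- function(s) {
  vapply(s,
         function(one) intToUtf8(rev(utf8ToInt(enc2utf8(one)))),
         character(1), USE.NAMES = FALSE)
}

a  <- "first string of junk... 0.2 0 1"
ra <- reverse_string(a)

# Replace the first delimiter run in the reversed string (the last one in the
# original), together with any spaces before it, with the marker.
r_marked <- sub("\\s*[.,\u2022]{3}", "|", ra)

# Un-reverse and split at the marker.
marked <- reverse_string(r_marked)
parts  <- strsplit(marked, "|", fixed = TRUE)[[1]]
parts
```

Here `parts` holds the junk text and the data portion as two strings; the data portion can then be split on spaces. In R the greedy `^(.*)` prefix gives the same effect without reversing, so this is mainly useful as an illustration of the technique.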