Approximate matching function for numbers

Time: 2014-06-13 16:48:46

Tags: r numeric

I have a numeric vector:

[1]  96.500  96.625  96.750  96.875  97.000  97.125  97.250  97.375  97.500  97.625  97.750  97.875  98.000
[14]  98.125  98.250  98.375  98.500  98.625  98.750  98.875  99.000  99.125  99.250  99.375  99.500  99.625
[27]  99.750  99.875 100.000 100.125 100.250 100.375 100.500

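I want to take a different number, 99.49, and find the indices of the values in the vector that it falls between. In this case I would want it to return c(24, 25), because the number of interest lies between 99.375 and 99.5.

Does anyone know a simple way (one or two lines of code) to do this in R? Assume that the number of interest may itself be in the vector. I currently have a "while" loop, but I am trying to see whether there is a simpler vectorized approach.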

3 answers:

Answer 0: (score: 2)

In this function, x is your vector and v is the given number:

between <- function(x, v) { c(max(which(x <= v)), min(which(x >= v))) }
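
As a quick check, here is a minimal sketch that applies the function to the vector from the question. Rebuilding the vector with seq() is an assumption made for brevity; the question only shows the printed values.

```r
# Rebuild the question's vector: 33 values from 96.5 to 100.5 in steps of 0.125
x <- seq(96.5, 100.5, by = 0.125)

between <- function(x, v) { c(max(which(x <= v)), min(which(x >= v))) }

between(x, 99.49)  # indices of 99.375 and 99.5
between(x, 99.5)   # v is in the vector, so both indices coincide
```

If v lies outside the range of x, one of the which() calls returns an empty vector, and max() or min() then gives -Inf or Inf with a warning, so a range check may be worth adding before relying on the result.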

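Answer 1: (score: 1)

Below is an efficient version of match for numeric data. It is efficient because my C++ implementation short-circuits and stops searching as soon as the first match is found. Maybe I am overlooking something, but I really think a function like this is missing from base R, and I stumble over this problem every now and then.

Note, however, that depending on the problem it may be more efficient to sort the target vector first (as well as the vector to be matched) and use findInterval, as suggested in the comments.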
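
For the original question, the findInterval route could look roughly like the sketch below. The helper name bracket is hypothetical, and the sketch assumes x is already sorted in increasing order and v is a single number.

```r
# Hypothetical helper: indices of the two neighbours of v in a sorted vector x
bracket <- function(x, v) {
  i <- findInterval(v, x)  # largest i with x[i] <= v; 0 if v < x[1]
  # If v itself is in x, report that index twice; otherwise the bracketing pair.
  # Note: for v outside the range of x, one index falls outside 1..length(x).
  if (i > 0L && x[i] == v) c(i, i) else c(i, i + 1L)
}

x <- seq(96.5, 100.5, by = 0.125)  # the vector from the question
bracket(x, 99.49)
bracket(x, 99.5)
```

findInterval uses a binary search on the sorted vector, which is why it can beat a linear scan when the table is large.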

Rcpp::cppFunction('
IntegerVector match_dbl_cpp(NumericVector x, NumericVector table,
                        int nomatch, double tolerance) {

  int n = x.size();
  int m = table.size();
  IntegerVector out(n, nomatch);

  for (int i = 0; i < n; ++i) {
    int j = 0;
    while (j < m) {
      if (std::abs(x[i] - table[j]) < tolerance) {
        out[i] = j + 1;
        break;
      }
      ++j;
    }
  }
  return out;
}
')

match_dbl <- function(x, table, nomatch = NA_integer_,
                      tolerance = sqrt(.Machine$double.eps)) {

  if (!is.integer(nomatch))
    stop("'nomatch' must be an integer")

  if (!is.numeric(tolerance) || tolerance <= 0.0)
    stop("'tolerance' must be a positive number")

  match_dbl_cpp(x, table, nomatch, tolerance)
}

# generate some random numeric data
set.seed(123)
table <- runif(1000L)
table <- sample(c(table, table)) # 'table' now contains duplicates
x <- sample(table, 100L)

m1 <- match(x, table)
m1_dbl <- match_dbl(x, table)
identical(m1, m1_dbl) # TRUE according to expectation
[1] TRUE

microbenchmark::microbenchmark(match(x, table),
                               match_dbl(x, table)) # speed is fine
Unit: microseconds
                expr    min      lq     mean  median     uq     max neval
     match(x, table) 45.622 48.6295 52.54944 49.5540 53.995 129.079   100
 match_dbl(x, table) 46.380 48.9325 53.13952 49.6335 52.054 106.160   100

# minimally disturb x
x <- x + runif(n = length(x), min = -1e-10, max = 1e-10)

identical(m1, match(x, table)) # now FALSE
[1] FALSE 
identical(m1_dbl, match_dbl(x, table)) # still TRUE
[1] TRUE
identical(m1_dbl, match_dbl(x, table, tolerance = 1e-11)) # also FALSE now
[1] FALSE

A version of %in% for numeric data can easily be written as:

`%in_dbl%` <- function(x, table) match_dbl(x, table, nomatch = 0L) > 0L

Suggestions for improvement are very welcome!

Answer 2: (score: 0)
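
For readers without a C++ toolchain, the same idea can be sketched in plain R. The name match_tol is hypothetical and not part of the answer; unlike the Rcpp version it does not short-circuit, since it compares each element of x against the whole table.

```r
# Hypothetical pure-R analogue of match_dbl: first index in 'table'
# whose value is within 'tolerance' of each element of 'x'
match_tol <- function(x, table, nomatch = NA_integer_,
                      tolerance = sqrt(.Machine$double.eps)) {
  vapply(x, function(xi) {
    hit <- which(abs(xi - table) < tolerance)
    if (length(hit)) hit[1L] else nomatch
  }, integer(1))
}

tbl <- c(0.1, 0.3, 0.5)
match(0.1 + 0.2, tbl)      # exact comparison fails: 0.1 + 0.2 is not exactly 0.3
match_tol(0.1 + 0.2, tbl)  # the tolerance-based comparison finds position 2
```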

z = scan(nmax = 33)
96.500  96.625  96.750  96.875  97.000  97.125  97.250  97.375  97.500  97.625  97.750  97.875  98.000
98.125  98.250  98.375  98.500  98.625  98.750  98.875  99.000  99.125  99.250  99.375  99.500  99.625
99.750  99.875 100.000 100.125 100.250 100.375 100.500

btw <- function(data, num){
c(min(which(num<data))-1, min(which(num<data)))
}

btw(data = z, num = 99.49)
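
Because scan() reads from the console, the snippet above only works interactively. The sketch below is a script-friendly variant that assumes the vector can be rebuilt with seq(); it also shows how this function treats a number that is itself in the vector.

```r
# Assumption: same 33 values as in the question, generated instead of scanned
z <- seq(96.5, 100.5, by = 0.125)

btw <- function(data, num) {
  upper <- min(which(num < data))  # first element strictly greater than num
  c(upper - 1, upper)
}

btw(data = z, num = 99.49)  # indices of 99.375 and 99.5
btw(data = z, num = 99.5)   # 99.5 is in z: returns its index and the next one
```

Since the comparison is strict, a number that is present in the vector is reported as the lower end of the interval that starts at it, which differs from Answer 0, where both indices coincide.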