Efficient and accurate age calculation (in years, months, or weeks) in R, given a birth date and an arbitrary date

Time: 2015-06-29 22:37:26

Tags: r lubridate

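I am faced with the common task of calculating age (in years, months, or weeks) given a birth date and an arbitrary date. The thing is that I quite often have to do this over a lot of records (> 300 million), so performance is the key issue here.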

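After a quick search on SO and Google, I found 3 alternatives:

  • A common arithmetic procedure (/365.25) (link)
  • The functions new_interval() and duration() from the package lubridate (link)
  • The function age_calc() from the package eeptools (link, link, link)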

So, here's my toy code:

# Some toy birthdates
birthdate <- as.Date(c("1978-12-30", "1978-12-31", "1979-01-01", 
                       "1962-12-30", "1962-12-31", "1963-01-01", 
                       "2000-06-16", "2000-06-17", "2000-06-18", 
                       "2007-03-18", "2007-03-19", "2007-03-20", 
                       "1968-02-29", "1968-02-29", "1968-02-29"))

# Given dates to calculate the age
givendate <- as.Date(c("2015-12-31", "2015-12-31", "2015-12-31", 
                       "2015-12-31", "2015-12-31", "2015-12-31", 
                       "2050-06-17", "2050-06-17", "2050-06-17",
                       "2008-03-19", "2008-03-19", "2008-03-19", 
                       "2015-02-28", "2015-03-01", "2015-03-02"))

# Using a common arithmetic procedure ("Time differences in days"/365.25)
(givendate-birthdate)/365.25

# Use the package lubridate
require(lubridate)
new_interval(start = birthdate, end = givendate) / 
                     duration(num = 1, units = "years")

# Use the package eeptools
library(eeptools)
age_calc(dob = birthdate, enddate = givendate, units = "years")

Let's talk about accuracy later and focus first on performance. Here's the code:

# Now let's compare the performance of the alternatives using microbenchmark
library(microbenchmark)
mbm <- microbenchmark(
    arithmetic = (givendate - birthdate) / 365.25,
    lubridate = new_interval(start = birthdate, end = givendate) /
                                     duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate, 
                        units = "years"),
    times = 1000
)

# And examine the results
# (autoplot() is a ggplot2 generic, so ggplot2 has to be attached)
library(ggplot2)
mbm
autoplot(mbm)

Here are the results:

[Figures: microbenchmark results, table and plot]

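Bottom line: the performance of the lubridate and eeptools functions is much worse than that of the arithmetic method (/365.25 is at least 10 times faster). Unfortunately, the arithmetic method is not accurate enough, and I cannot afford the few mistakes it makes.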

  

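"Because of the way the modern Gregorian calendar is constructed, there is no straightforward arithmetic method that produces a person's age, stated according to common usage (common usage meaning that a person's age should always be an integer that increases exactly on a birthday)." (link)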
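To make the failure concrete, here is a minimal base-R check on row 2 of the toy data, where the given date is exactly the 37th birthday. The variable names are illustrative only.

```r
# Row 2 of the toy data: the given date is exactly the 37th birthday.
birth <- as.Date("1978-12-31")
given <- as.Date("2015-12-31")

# Elapsed days: 37 calendar years containing 9 leap days (1980, ..., 2012).
days_elapsed <- as.numeric(given - birth)

# Dividing by the average year length lands just below 37,
# so truncating reports 36 on the day the person turns 37.
age_arith <- floor(days_elapsed / 365.25)
print(c(days = days_elapsed, age = age_arith))
```

The span happens to contain fewer leap days than the 365.25 average assumes, which is enough to push the quotient below the integer.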

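As I read in some posts, lubridate and eeptools make no such mistakes (though I haven't looked at the code or read more about these functions to learn which method they use), and that's why I wanted to use them, but their performance does not work for my real application.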

Any ideas on an efficient and accurate method to calculate age?

EDIT

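Oops, it seems lubridate also makes mistakes. And apparently, based on this toy example, it makes more mistakes than the arithmetic method (see rows 3, 6, 9, 12). (Am I doing something wrong?)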

toy_df <- data.frame(
    birthdate = birthdate,
    givendate = givendate,
    arithmetic = as.numeric((givendate - birthdate) / 365.25),
    lubridate = new_interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate,
                        units = "years")
)
toy_df[, 3:5] <- floor(toy_df[, 3:5])
toy_df

    birthdate  givendate arithmetic lubridate eeptools
1  1978-12-30 2015-12-31         37        37       37
2  1978-12-31 2015-12-31         36        37       37
3  1979-01-01 2015-12-31         36        37       36
4  1962-12-30 2015-12-31         53        53       53
5  1962-12-31 2015-12-31         52        53       53
6  1963-01-01 2015-12-31         52        53       52
7  2000-06-16 2050-06-17         50        50       50
8  2000-06-17 2050-06-17         49        50       50
9  2000-06-18 2050-06-17         49        50       49
10 2007-03-18 2008-03-19          1         1        1
11 2007-03-19 2008-03-19          1         1        1
12 2007-03-20 2008-03-19          0         1        0
13 1968-02-29 2015-02-28         46        47       46
14 1968-02-29 2015-03-01         47        47       47
15 1968-02-29 2015-03-02         47        47       47

4 answers:

Answer 0 (score: 17)

OK, so I found this function in another post:

age <- function(from, to) {
    from_lt = as.POSIXlt(from)
    to_lt = as.POSIXlt(to)

    age = to_lt$year - from_lt$year

    ifelse(to_lt$mon < from_lt$mon |
               (to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday),
           age - 1, age)
}

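It was posted by @Jim, who said: "The following function takes a vector of Date objects and calculates the ages, correctly accounting for leap years. Seems to be a simpler solution than any of the other answers."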

It is indeed simpler, and it does the trick I was looking for. On average, it is actually faster than the arithmetic method (about 75% faster).

mbm <- microbenchmark(
    arithmetic = (givendate - birthdate) / 365.25,
    lubridate = interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate, 
                        units = "years"),
    age = age(from = birthdate, to = givendate),
    times = 1000
)
mbm
autoplot(mbm)


And at least in my examples it doesn't make any mistakes (and it shouldn't in any example; it's a pretty straightforward function built on ifelse).
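As a side note, the ifelse() call is not strictly needed: a logical vector coerces to 0/1, so the "birthday not reached yet" condition can be subtracted directly. The sketch below restates the same logic under that assumption; age_no_ifelse is a hypothetical name, and whether it is actually faster on large vectors is not measured here.

```r
# Same rule as the age() function, restated: subtract 1 when the birthday
# has not yet occurred in the target year. TRUE/FALSE coerce to 1/0.
age_no_ifelse <- function(from, to) {
  from_lt <- as.POSIXlt(from)
  to_lt <- as.POSIXlt(to)
  before_birthday <- to_lt$mon < from_lt$mon |
    (to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday)
  (to_lt$year - from_lt$year) - before_birthday
}

# Rows 13-15 of the toy data: a Feb. 29 birth date around the 47th birthday.
leap_birth <- as.Date(rep("1968-02-29", 3))
around_bday <- as.Date(c("2015-02-28", "2015-03-01", "2015-03-02"))
leap_ages <- age_no_ifelse(leap_birth, around_bday)
print(leap_ages)
```

Under this rule, someone born on Feb. 29 turns a year older on March 1 in non-leap years, matching the age column in the table below.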

toy_df <- data.frame(
    birthdate = birthdate,
    givendate = givendate,
    arithmetic = as.numeric((givendate - birthdate) / 365.25),
    lubridate = interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate,
                        units = "years"),
    age = age(from = birthdate, to = givendate)
)
toy_df[, 3:6] <- floor(toy_df[, 3:6])
toy_df

    birthdate  givendate arithmetic lubridate eeptools age
1  1978-12-30 2015-12-31         37        37       37  37
2  1978-12-31 2015-12-31         36        37       37  37
3  1979-01-01 2015-12-31         36        37       36  36
4  1962-12-30 2015-12-31         53        53       53  53
5  1962-12-31 2015-12-31         52        53       53  53
6  1963-01-01 2015-12-31         52        53       52  52
7  2000-06-16 2050-06-17         50        50       50  50
8  2000-06-17 2050-06-17         49        50       50  50
9  2000-06-18 2050-06-17         49        50       49  49
10 2007-03-18 2008-03-19          1         1        1   1
11 2007-03-19 2008-03-19          1         1        1   1
12 2007-03-20 2008-03-19          0         1        0   0
13 1968-02-29 2015-02-28         46        47       46  46
14 1968-02-29 2015-03-01         47        47       47  47
15 1968-02-29 2015-03-02         47        47       47  47

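I don't consider it a complete solution, because I also want the age in months and weeks, and this function is specific to years. I'm posting it here anyway because it solves the problem for age in years. I won't accept it because:

  1. I'll wait for @Jim to post it as an answer.
  2. I'll wait to see whether someone else comes up with a complete solution (efficient, accurate, and producing age in years, months, or weeks as desired).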
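For the months case, one possible extension applies the same POSIXlt comparison at month granularity. This is only a sketch: age_months is a hypothetical helper, not part of the original post, and it assumes that a month counts as completed once the day of the month reaches the birth day. A birth on the 31st therefore only completes a month on the 1st of the following month when the target month is shorter.

```r
# Hypothetical helper (not from the original post): completed months,
# using the same POSIXlt fields as the age() function above.
age_months <- function(from, to) {
  from_lt <- as.POSIXlt(from)
  to_lt <- as.POSIXlt(to)
  months <- 12 * (to_lt$year - from_lt$year) + (to_lt$mon - from_lt$mon)
  # One month fewer if the day of the month has not been reached yet.
  months - (to_lt$mday < from_lt$mday)
}

# Rows 10-12 of the toy data.
m <- age_months(as.Date(c("2007-03-18", "2007-03-19", "2007-03-20")),
                as.Date(rep("2008-03-19", 3)))
print(m)
```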

Answer 1 (score: 14)

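The reason lubridate appears to be making mistakes above is that you are calculating a duration (the exact amount of time that passes between two instants, where 1 year = 31536000 s), rather than a period (the change in clock time that occurs between two instants).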
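The effect of a fixed-length year can be reproduced in base R without lubridate. The sketch below uses row 3 of the toy data and the 31536000 s figure quoted above; it illustrates the arithmetic and does not show lubridate's internals.

```r
# Row 3 of the toy data: the 37th birthday is still one day away.
birth <- as.Date("1979-01-01")
given <- as.Date("2015-12-31")

# Exact elapsed time in seconds (whole days * 86400).
secs_elapsed <- as.numeric(given - birth) * 86400

# A fixed-length year of 365 days = 31536000 seconds ignores the 9 leap
# days in this span, so the quotient overshoots 37.
dur_years <- secs_elapsed / 31536000
print(dur_years)
```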

To get the change in clock time (in years, months, days, etc.) you need to use

as.period(new_interval(start = birthdate, end = givendate))

which gives the following output

 "37y 0m 1d 0H 0M 0S"   
 "37y 0m 0d 0H 0M 0S"   
 "36y 11m 30d 0H 0M 0S" 
 ...
 "46y 11m 30d 1H 0M 0S" 
 "47y 0m 0d 1H 0M 0S"   
 "47y 0m 1d 1H 0M 0S" 

To extract just the years, you can use the following

as.period(new_interval(start = birthdate, end = givendate))$year
 [1] 37 37 36 53 53 52 50 50 49  1  1  0 46 47 47

Note that this throws the following warning message (not sure why):

 Warning message:
 In Ops.factor(left, right) : ‘-’ not meaningful for factors

And sadly it appears to be even slower than your methods above!

> mbm
Unit: microseconds
       expr       min        lq       mean    median         uq        max neval cld
 arithmetic   116.595   138.149   181.7547   184.335   196.8565   5556.306  1000  a 
  lubridate 16807.683 17406.255 20388.1410 18053.274 21378.8875 157965.935  1000   b

Answer 2 (score: 4)

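I was going to leave this in the comments, but I think it's worthy of a separate answer. As @Molx points out, your "arithmetic" method is not as simple as it seems. Take a look at the code for -.Date, most importantly: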

return(difftime(e1, e2, units = "days"))

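Thus, the "arithmetic" method on objects of class Date is really a wrapper for the difftime function. What about difftime? It, too, carries a lot of overhead if what you're after is raw speed.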

The key is that Date objects are stored as the number of days since (or until) Jan. 1, 1970 (though they are not actually stored as integer, hence the birth of the IDate class in data.table), so we can just subtract these and be done with it. To avoid the -.Date method being called, though, we have to unclass our inputs:

(unclass(givendate) - unclass(birthdate)) / 365.25

As far as bang for your buck goes, this approach is much faster than even @Jim's age function (roughly 60 times at the median in the benchmark below).

Here's some more scaled-up test data:

set.seed(20349)
NN <- 1e6
birthdate <- as.Date(sprintf('%d-%02d-%02d',
                             sample(1901:2030, NN, TRUE),
                             sample(12, NN, TRUE),
                             sample(28, NN, TRUE)))
#average 30 years, most data between 20 and 40 years
givendate <- birthdate + as.integer(rnorm(NN, mean = 10950, sd = 1000))

(eeptools is excluded because it is almost impossibly slower: a glance at the code of age_calc suggests it goes so far as to create a sequence of dates for each pair of dates (O(n^2)-ish), not to mention a peppering of ifelses.)

microbenchmark(
  arithmetic = (givendate - birthdate) / 365.25,
  lubridate = interval(start = birthdate, end = givendate) /
    duration(num = 1, units = "years"),
  age = age(from = birthdate, to = givendate),
  fastar = (unclass(givendate) - unclass(birthdate)) / 365.25,
  overlaps = get_age(birthdate, givendate),
  times = 50)
# Unit: milliseconds
#        expr        min         lq      mean     median         uq      max neval cld
#  arithmetic  28.153465  30.384639  62.96118  31.492764  34.052991 180.9556    50  b
#   lubridate  94.327968  97.233009 157.30420 102.751351 240.717065 265.0283    50   c
#         age 338.347756 479.598513 483.84529 483.580981 488.090832 770.1149    50    d
#      fastar   7.740098   7.831528  11.02521   7.913146   8.090902 153.3645    50 a
#    overlaps 316.408920 458.734073 459.58974 463.806255 470.320072 769.0929    50    d

Thus we also highlight the folly of benchmarking on small-scale data.

The big cost of @Jim's method is that as.POSIXlt becomes increasingly expensive as your vectors grow.

The issue of inaccuracy remains, but unless that accuracy is paramount, the unclass method seems to be unparalleled.
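A small base-R check of the two claims this answer rests on: a Date is just a day count since 1970-01-01, and subtracting the unclassed values gives the same numbers as going through the Date subtraction method. The sample dates and variable names are illustrative.

```r
# A Date is a double counting days since 1970-01-01.
d <- as.Date(c("1970-01-01", "1970-01-02", "2015-12-31"))
day_counts <- unclass(d)

birthdate <- as.Date(c("1978-12-30", "1968-02-29"))
givendate <- as.Date(c("2015-12-31", "2015-03-01"))

# Going through the `-.Date` method and difftime() ...
via_difftime <- as.numeric(givendate - birthdate) / 365.25
# ... versus plain numeric subtraction on the underlying day counts.
via_unclass <- (unclass(givendate) - unclass(birthdate)) / 365.25

print(day_counts)
print(all.equal(via_difftime, via_unclass))
```

The two results agree, so the speed-up comes purely from skipping method dispatch and difftime bookkeeping, and the accuracy limits of dividing by 365.25 are unchanged.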

Answer 3 (score: 4)

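I've been hammering away at this and finally have something that is a) perfectly accurate* (in contrast to all of the other options presented so far) and b) reasonably fast (see my benchmarks in the other answer). It relies on a bunch of arithmetic I did by hand and on the wonderful foverlaps function from the data.table package.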

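The essence of the approach is to work from the integer representation of Dates, and to recognize that every birth date falls into one of four 1461-day (= 365 * 4 + 1) cycles, depending on which of the coming years is the one in which it takes 366 days for your birthday to come around.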

Here's the function:

library(data.table)
get_age <- function(birthdays, ref_dates){
  x <- data.table(bday <- unclass(birthdays),
                  #rem: how many days has it been since the lapse of the
                  #  most recent quadrennium since your birth?
                  rem = ((ref <- unclass(ref_dates)) - bday) %% 1461)
  #cycle_type: which of the four years following your birthday
  #  was the one that had 366 days? 
  x[ , cycle_type := 
       foverlaps(data.table(start = bdr <- bday %% 1461L, end = bdr),
                 #these intervals were calculated by hand;
                 #  e.g., 59 is Feb. 28, 1970. I made the judgment
                 #  call to say that those born on Feb. 29 don't
                 #  have their "birthday" until the following March 1st.
                 data.table(start = c(0L, 59L, 424L, 790L, 1155L), 
                            end = c(58L, 423L, 789L, 1154L, 1460L), 
                            val = c(3L, 2L, 1L, 4L, 3L),
                            key = "start,end"))$val]
  I4 <- diag(4L)[ , -4L] #for conciseness below
  #The `by` approach might seem a little abstruse for those
  #  not familiar with `data.table`; see the edit history
  #  for a more palatable version (which is also slightly slower)
  x[ , extra := 
       foverlaps(data.table(start = rem, end = rem),
                 data.table(start = st <- cumsum(c(0L, rep(365L, 3L) +
                                                     I4[.BY[[1L]],])),
                            end = c(st[-1L] - 1L, 1461L),
                            int_yrs = 0:3, key = "start,end")
       )[ , int_yrs + (i.start - start) / (end + 1L - start)], by = cycle_type]
  #grand finale -- 4 years for every quadrennium, plus the fraction:
  4L * ((ref - bday) %/% 1461L) + x$extra
}
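The interval boundaries hard-coded in the first foverlaps() call appear to be the day counts of successive March 1sts within the first 1461-day block after 1970-01-01. That reading is an inference from the numbers rather than something the answer states outright, and it can be checked in base R:

```r
# March 1st of each year in the first 1461-day block after the epoch.
march_firsts <- as.Date(c("1970-03-01", "1971-03-01",
                          "1972-03-01", "1973-03-01"))
cut_points <- unclass(march_firsts)

# Four consecutive years containing exactly one Feb. 29 span 1461 days.
block_length <- as.numeric(as.Date("1974-01-01") - as.Date("1970-01-01"))

print(cut_points)
print(block_length)
```

The cut points match the `start` values 59L, 424L, 790L, and 1155L in the function, which fits the stated judgment call that Feb. 29 birthdays roll over on March 1.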

Comparing on your main example:

toy_df <- data.frame(
  birthdate = birthdate,
  givendate = givendate,
  arithmetic = as.numeric((givendate - birthdate) / 365.25),
  lubridate = interval(start = birthdate, end = givendate) /
    duration(num = 1, units = "years"),
  eeptools = age_calc(dob = birthdate, enddate = givendate,
                      units = "years"),
  mine = get_age(birthdate, givendate)
)

toy_df
#     birthdate  givendate arithmetic lubridate   eeptools       mine
# 1  1978-12-30 2015-12-31 37.0020534 37.027397 37.0027397 37.0027322 #eeptools wrong: will be 366 days until 12/31/16, so fraction is 1/366
# 2  1978-12-31 2015-12-31 36.9993155 37.024658 37.0000000 37.0000000
# 3  1979-01-01 2015-12-31 36.9965777 37.021918 36.9972603 36.9972603
# 4  1962-12-30 2015-12-31 53.0020534 53.038356 53.0027397 53.0027322 #same problem
# 5  1962-12-31 2015-12-31 52.9993155 53.035616 53.0000000 53.0000000
# 6  1963-01-01 2015-12-31 52.9965777 53.032877 52.9972603 52.9972603
# 7  2000-06-16 2050-06-17 50.0013689 50.035616 50.0000000 50.0027397 #eeptools wrong: not exactly the birthday
# 8  2000-06-17 2050-06-17 49.9986311 50.032877 50.9972603 50.0000000 #eeptools wrong: _is_ exactly the birthday
# 9  2000-06-18 2050-06-17 49.9958932 50.030137 49.9945205 49.9972603 #eeptools wrong: fraction should be 364/365
# 10 2007-03-18 2008-03-19  1.0047912  1.005479  1.0027322  1.0027397 #eeptools wrong: 2/29 already passed, only 365 days until 3/19/2009
# 11 2007-03-19 2008-03-19  1.0020534  1.002740  1.0000000  1.0000000
# 12 2007-03-20 2008-03-19  0.9993155  1.000000  0.9966839  0.9972678 #eeptools wrong: we passed 2/29, so should be 365/366
# 13 1968-02-29 2015-02-28 46.9979466 47.030137 46.9977019 46.9972603 #my judgment: birthday occurs on 3/1 for 2/29 babies, so 364/365 the way there
# 14 1968-02-29 2015-03-01 47.0006845 47.032877 47.0000000 47.0000000
# 15 1968-02-29 2015-03-02 47.0034223 47.035616 47.0027397 47.0027322

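This approach could be extended to handle months/weeks pretty easily. Months will be a bit long-winded (you have to specify 4 years' worth of month lengths), so I didn't bother; weeks are easy (weeks are unaffected by leap-year considerations, so we can just divide by 7).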
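The weeks case can be sketched directly on the unclassed day counts. age_weeks is a hypothetical name, and the function assumes both arguments are Date vectors; it returns completed weeks.

```r
# Completed weeks between two dates: integer-divide the day count by 7.
age_weeks <- function(from, to) {
  (unclass(to) - unclass(from)) %/% 7
}

# Row 10 of the toy data: 367 days apart (the span includes Feb. 29, 2008).
w <- age_weeks(as.Date("2007-03-18"), as.Date("2008-03-19"))
print(w)
```

Dropping the `%/%` in favor of `/` would give fractional weeks instead.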

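I also made a lot of progress on doing this with base functionality, but a) it was quite ugly (it needs a non-linear transformation of 0-1460 to avoid nested ifelse statements, etc.) and b) in the end a for loop (in the form of apply over the whole list of dates) was unavoidable, so I decided it would slow things down too much. (For posterity, the transformation is x1 = (unclass(birthdays) - 59) %% 1461; x2 = x1 * (729 - x1) / 402232 + x1.)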

I've added this function to my package.

*(For date ranges in which non-leap centuries are not a concern; I believe extending it to handle such dates shouldn't be too onerous.)