Create a new column based on a partial match in a second data frame

Time: 2014-03-11 09:42:18

Tags: regex r

I have two data frames, top3df:

http://dpaste.com/hold/1714336/

and qw:

qw <- structure(list(id = structure(1:25, .Label = c("w01", "w02", "w03", "w04", "w05", "w06", "w07", "w08", "w09", "w10", "w11", "w12", "w13", "w14", "w15", "w16", "w17", "w18", "w19", "w20", "w21", "w22", "w23", "w24", "w25"), class = "factor"), link = structure(c(5L, 4L, 19L, 2L, 18L, 24L, 20L, 23L, 7L, 12L, 14L, 15L, 21L, 17L, 10L, 13L, 16L, 25L, 22L, 6L, 11L, 3L, 1L, 9L, 8L), .Label = c("http://gezondheid.blog.nl/overgewicht/2008/06/07/dik-zijn-heeft-veel-nadelen", "http://home.deds.nl/~obesitasinfo.nl/", "http://mens-en-gezondheid.infonu.nl/ziekten/18079-risicos-van-overgewicht-en-de-gevolgen-van-obesitas.html", "http://nl.wikipedia.org/wiki/Obesitas", "http://overgewicht.pilliewillie.nl/obesitas/behandeling.overgewicht.3.php", "http://www.afslankacademie.nl/page/2634/overgewicht.html", "http://www.afvallen-voeding.nl/", "http://www.erfelijkheid.nl/node/325", "http://www.gewoongezond.nl/", "http://www.gezondafvallen.net/", "http://www.gezonderafvallen.nl/page/938/overgewicht-als-gevolg-van-de-evolutie.html", "http://www.gr.nl/nl/adviezen/overgewicht-en-obesitas", "http://www.hely.net/oorzaken.html", "http://www.kiloafvallen.nl/", "http://www.nisb.nl/kennisplein-sport-bewegen/dossiers/bewegen-en-overgewicht/oorzaken-obesitas.html", "http://www.novarum.nl/eetproblemen/obesitas/signalen-en-gevolgen", "http://www.obesitas.azdamiaan.be/nl/index.aspx?n=280", "http://www.obesitaskliniek.nl/", "http://www.obesitasvereniging.nl/", "http://www.sagbmaagband.nl/minder-gewicht/morbideobesitas.html", "http://www.tipsbijafvallen.nl/", "http://www.tweestedenziekenhuis.nl/script/Template_SubsubMenu.asp?PageID=1144&SSMID=1247", "http://www.vgz.nl/zorg-en-gezondheid/ziektes-en-aandoeningen/obesitas", "http://www.volkskrant.nl/vk/nl/2672/Wetenschap-Gezondheid/article/detail/3143483/2012/01/30/Balanstop-in-Madurodam-mueslireep-tegen-obesitas.dhtml", "http://www.zuivelengezondheid.nl/?pageID=332"), class = "factor"), quality = c(3.875, 6.25, 7.875, 3.5, 6, 4.75, 3.625, 4.125, 
2.375, 6, 2.125, 6.5, 2.5, 5.375, 2.5, 6.625, 5.125, 5, 6.875, 5.75, 6.125, 3.25, 1.75, 2.5, 7.375), q1 = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L), q2 = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L), q3 = c(0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L)), .Names = c("id", "link", "quality", "q1", "q2", "q3"), class = "data.frame", row.names = c(NA, -25L))

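With top3df$id = qw$id[match(top3df$url, qw$link)] I can look up exact matches, but this also produces NAs for every URL that has no exact counterpart. How can I look up a partial match on the link instead?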

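I need to match on the first part of the link (up to and including the top-level domain, but excluding everything after the TLD). For example, http://www.hely.net/oorzaken.html from qw should match http://www.hely.net/gevolgen.html from top3df.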

2 answers:

Answer 0 (score: 1)

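As @lukeA and @EDi mentioned, you can use a regular expression to extract the part of each URL up to and including the TLD, and then match on that part, for example: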

top3df$tld <- sub("(http[s]?://)?([^/]+)/.*$", "\\1\\2", top3df$url)
qw$tld <- sub("(http[s]?://)?([^/]+)/.*$", "\\1\\2", qw$link)

match(top3df$tld, qw$tld)
# [1] 22 11 25  5 14 16 18  2 16 25 
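The index vector returned by match() can then be used to fill the new column. Below is a minimal sketch of that last step, assuming small stand-in data frames (qw_demo and top3_demo are hypothetical toy data, since the real top3df is only available through the dpaste link); it reuses the same regular expression as above.

```r
# Hypothetical toy stand-ins for qw and top3df (not the question's real data).
qw_demo <- data.frame(
  id   = c("w01", "w02"),
  link = c("http://www.hely.net/oorzaken.html",
           "http://nl.wikipedia.org/wiki/Obesitas"),
  stringsAsFactors = FALSE
)
top3_demo <- data.frame(
  url = c("http://www.hely.net/gevolgen.html",
          "http://nl.wikipedia.org/wiki/Overgewicht",
          "http://example.org/page.html"),
  stringsAsFactors = FALSE
)

# Same regex as in the answer: keep the scheme and host, drop the path.
pattern <- "(http[s]?://)?([^/]+)/.*$"
qw_demo$tld   <- sub(pattern, "\\1\\2", qw_demo$link)
top3_demo$tld <- sub(pattern, "\\1\\2", top3_demo$url)

# match() gives row positions in qw_demo, or NA where no host matches;
# indexing qw_demo$id with it yields the id for each row of top3_demo.
top3_demo$id <- qw_demo$id[match(top3_demo$tld, qw_demo$tld)]
print(top3_demo[, c("url", "id")])
```

Note that match() only returns the first matching position, so if several rows of qw share a host, only the first one's id is picked up. URLs whose host does not occur in qw still get NA.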

Answer 1 (score: 1)

partial <- function(txt)  
  sub("http://(.*?)/.*", "\\1", txt) 

qw$id[match(partial(top3df$url), partial(qw$link))]
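The partial() helper above only rewrites URLs that start with http:// and contain a slash after the host. If the data might also contain https links, or bare hosts without a path, a slightly more tolerant variant could look like the sketch below. The name host_of is hypothetical and not part of either answer.

```r
# Hypothetical helper: reduce a URL to its host name.
host_of <- function(url) {
  url <- as.character(url)            # accept factor columns such as qw$link
  url <- sub("^https?://", "", url)   # drop an http:// or https:// prefix if present
  sub("/.*$", "", url)                # drop everything from the first slash onward
}

hosts <- host_of(c("http://www.hely.net/oorzaken.html",
                   "https://www.hely.net",
                   "www.hely.net/gevolgen.html"))
print(hosts)
```

It would be used the same way as partial(), for example qw$id[match(host_of(top3df$url), host_of(qw$link))]. Unlike the first answer, this compares hosts only, so an http and an https link to the same site are treated as a match.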