Split a string using alternate delimiters in R

Asked: 2017-06-13 20:17:26

Tags: r

I have a list of URLs like this:

mydata <- read.table(header=TRUE, text="
      Id
      https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrickpattern%3ADecorative%2FArt+Deco%3Abrickpattern%3AFloral%3Abrickpattern%3AGeometric%3Abrickpattern%3AGraphic%3Abrickpattern%3ATropical%3Aprice%3A300%2C10500&page=7&gridValue=4  
      https://www.example.com/dp/c/830216013?q=%3Arelevance%3Averticalsizegroupformat%3AIN%2040%3Averticalcolorfamily%3ABlack%3Averticalcolorfamily%3ABlue%3Averticalcolorfamily%3AWhite
      https://www.example.com/dp/c/830316016?q=%3Arelevance%3Averticalcolorfamily%3AWhite&gclid=CjwKEAjw9_jJBRCXycSarr3csWcSJABthk07W_H0RxQtOPZX7VdD9CSmK4S01BMYdXbtc0XxC0OeChoCky_w_wcB
      https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3AFLYING%20MACHINE%3Abrand%3AMUFTI%3Abrand%3AUNITED%20COLORS%20OF%20BENETTON
      https://www.example.com/dp/c/830216013?q=%3Arelevance%3Averticalsizegroupformat%3AIN%2038%3Averticalsizegroupformat%3AIN%2039%3Averticalsizegroupformat%3AIN%20M%3Averticalsizegroupformat%3AUK%2039%3Averticalsizegroupformat%3AUK%20M%3Averticalsizegroupformat%3AUK%20S%3Averticalsizegroupformat%3AUS%20M%3Averticalsizegroupformat%3AUS%20S%3Abrickpattern%3ASolid%3Averticalcolorfamily%3ABlack%3Averticalcolorfamily%3AWhite
      https://www.example.com/dp/c/830216013?q=%3Aprce-asc%3Abricksleeve%3AShort%3Aprice%3A300%2C10500&page=2&gridValue=4
      https://www.example.com/dp/c/830216013??q=%3Aprce-asc%3Abrand%3AUS+POLO%3Abricksleeve%3AShort%3Aprice%3A300%2C10500
      https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3AAJIO%3Abrand%3ABASICS%3Abrand%3ACelio%3Abrand%3ADNMX%3Abrand%3AGAS%3Abrand%3ALEVIS%3Abrand%3ANETPLAY%3Abrand%3ASIN%3Abrand%3ASUPERDRY%3Abrand%3AUS%20POLO%3Abrand%3AVIMAL%3Abrand%3AVIMAL%20APPARELS%3Abrand%3AVOI%20JEANS
      https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3ABritish+Club%3Abrand%3ACelio%3Abrand%3AFLYING+MACHINE%3Aprice%3A300%2C10500&page=1&gridValue=4          
                         ")      

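I need to extract the values of parameters such as brand, verticalcolorfamily, q=, and so on from the URLs. These parameters are the filters applied on the website. The output I am looking for is a data frame with three columns: the parameter, its value, and how often that value occurs. For example: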

parameter |      value     | frequency
----------|----------------|----------
brand     | FLYING+MACHINE | 2  
q=        | relevance      | 5  
price     | 300%2C10500    | 2  
brand     | BASICS         | 1

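So far, the only approach I can think of is to turn each URL into a vector of strings, split at every second "%3A": [q=%3Arelevance, brickpattern%3ADecorative%2FArt+Deco, brickpattern%3AFloral, brickpattern%3AGeometric, brickpattern%3AGraphic, brickpattern%3ATropical, price%3A300%2C10500].

I would then put each element into a column of a data frame, split it again on '%3A', and do a group-by. Suggestions for other approaches would be much appreciated. Also, if this approach is the right one, I don't know how to split on alternate occurrences of '%3A'.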
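
One way to treat every second "%3A" as the real separator is to split on all of them and then pair the pieces up. The following is only a minimal sketch on a single, shortened filter string; the names `q`, `parts` and `pairs` are illustrative, and it assumes the pieces strictly alternate between parameter and value.

```r
# Shortened filter string taken from one of the URLs above
q <- "q=%3Arelevance%3Abrickpattern%3AFloral%3Aprice%3A300%2C10500"

# Split on every "%3A"; fixed = TRUE treats the delimiter as a literal string
parts <- strsplit(q, "%3A", fixed = TRUE)[[1]]

# Fill a two-column matrix row by row:
# odd pieces become parameters, even pieces become values
pairs <- as.data.frame(
  matrix(parts, ncol = 2, byrow = TRUE,
         dimnames = list(NULL, c("parameter", "value"))),
  stringsAsFactors = FALSE
)

print(pairs)
```

If a string yields an odd number of pieces, `matrix()` recycles values and warns, so it is worth checking `length(parts) %% 2 == 0` before pairing.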

1 answer:

Answer 0 (score: 1)

urltools looks like a great package for what you need. In the meantime, here is a quick hack. Starting from your data.frame:

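# Convert the Id column to a character vector
L <- as.character(mydata$Id)
# Strip the URL prefix (note: this pattern only matches the
# category 830216013 followed by a single "?")
L <- gsub("https://www.example.com/dp/c/830216013\\?", "", L)
# Split on "%3A" and flatten into one "long" vector
L <- unlist(strsplit(L, "%3A"))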

head(L)
[1] "q="                    "relevance"             "brickpattern"         
[4] "Decorative%2FArt+Deco" "brickpattern"          "Floral"
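
The fixed prefix pattern leaves some URLs only partly cleaned: one uses a different category ID, one has a doubled "??", and trailing parameters such as "&page=7" stay attached to the last value. A more general cleanup could look like the sketch below. It assumes the filters always sit in the first query parameter and that the filter values contain no literal "&"; the names `urls`, `query` and `L2` are illustrative, and the gclid value is shortened.

```r
# Two URLs modelled on the question's data (gclid shortened for readability)
urls <- c(
  "https://www.example.com/dp/c/830316016?q=%3Arelevance%3Averticalcolorfamily%3AWhite&gclid=abc",
  "https://www.example.com/dp/c/830216013??q=%3Aprce-asc%3Abricksleeve%3AShort%3Aprice%3A300%2C10500"
)

# Drop everything up to and including the "?" (or "??") that starts the query
query <- sub("^[^?]*\\?+", "", urls)

# Drop any further parameters such as "&page=7" or "&gclid=..."
query <- sub("&.*$", "", query)

# Split on the literal "%3A" and flatten, as in the answer
L2 <- unlist(strsplit(query, "%3A", fixed = TRUE))

print(L2)
```

Because this also cleans the URLs that the fixed pattern misses, counts computed from its result can differ from the ones printed in this answer.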

Then:

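# dplyr provides %>%, group_by(), summarize() and filter()
library(dplyr)

# Pair odd elements (parameters) with even elements (values) in a
# 2-column data frame, then count each unique parameter:value pair
df <- data.frame(parameter = L[seq(1,length(L),2)], value = L[seq(2,length(L),2)]) %>%
      group_by(parameter, value) %>%
      summarize(frequency=sum(!is.na(value)))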

Below I show only the entries with frequency >= 2:

# Show only entries with frequency >= 2
filter(df, frequency >= 2)

            parameter     value frequency
               <fctr>    <fctr>     <int>
1               brand     Celio         2
2         bricksleeve     Short         2
3                  q= relevance         6
4 verticalcolorfamily     Black         2
5 verticalcolorfamily     White         2

Note that brand::FLYING+MACHINE != 2, because that brand appears in the URLs both as FLYING%20MACHINE and as FLYING+MACHINE.
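
If the two spellings should be counted as one value, the strings can be decoded before grouping. This is a sketch rather than part of the original answer: it assumes "+" always stands for a space in these URLs and uses base R's `utils::URLdecode`; the names `values` and `decoded` are illustrative.

```r
# Values as they appear after splitting on "%3A"
values <- c("FLYING+MACHINE", "FLYING%20MACHINE",
            "300%2C10500", "Decorative%2FArt+Deco")

# Turn "+" into a space first, then decode the percent escapes.
# vapply() applies URLdecode() one element at a time.
decoded <- vapply(gsub("+", " ", values, fixed = TRUE),
                  utils::URLdecode, character(1), USE.NAMES = FALSE)

print(decoded)

# Both spellings of the brand now fall into the same group
print(table(decoded))
```

Replacing "+" before decoding matters: a literal plus sign encoded as %2B is then left intact instead of being turned into a space.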
