Read/write data in libsvm format

Time: 2012-08-24 15:41:32

Tags: r libsvm

How can I read/write libsvm data in R?

The libsvm format is sparse data, like

<class/target>[ <attribute number>:<attribute value>]*

(cf. Compressed Row Storage (CRS)), e.g.,

1 10:3.4 123:0.5 34567:0.231
0.2 22:1 456:03

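I am sure I could whip something up myself, but I would much rather use something off the shelf. However, the R library foreign does not seem to provide the necessary functionality.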
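To make the format concrete, here is a minimal base R sketch that splits one line of the kind shown above into its label and its index:value pairs. The helper name `parse_libsvm_line` is hypothetical and not part of any package; it only illustrates the layout and is not meant as a fast reader.

```r
# Parse one libsvm-formatted line into its label and sparse features.
# parse_libsvm_line is a hypothetical helper, not part of any package.
parse_libsvm_line <- function(line) {
  # tokens are separated by whitespace; the first token is the label
  tokens <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  # every remaining token has the form "index:value"
  pairs  <- strsplit(tokens[-1], ":", fixed = TRUE)
  list(
    label = as.numeric(tokens[1]),
    index = as.integer(vapply(pairs, `[`, character(1), 1)),
    value = as.numeric(vapply(pairs, `[`, character(1), 2))
  )
}

parsed <- parse_libsvm_line("1 10:3.4 123:0.5 34567:0.231")
str(parsed)
```

Only the non-zero attributes are stored, which is why the attribute numbers can jump from 10 to 34567 within a single row.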

6 answers:

Answer 0: (score: 13)

e1071 is off the shelf:

install.packages("e1071")
library(e1071)
read.matrix.csr(...)
write.matrix.csr(...)

Note: it is implemented in R, not in C, so it is dog-slow.

It even has a special vignette, Support Vector Machines—the Interface to libsvm in package e1071.

r.vw is bundled with vowpal_wabbit.

Note: it is implemented in R, not in C, so it is dog-slow.

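Answer 1: (score: 10)

I had been running zygmuntz's solution on a dataset with 25k observations (rows) for almost 5 hours, and it had processed only 3k-ish rows. It took so long that I coded this up in the meantime (based on zygmuntz's code):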

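require(Matrix)
read.libsvm = function( filename ) {
  content = readLines( filename )
  num_lines = length( content )
  # one (row, column = -1, label) triplet per line; the label is the whole
  # first token, so multi-character labels such as "0.2" or "-1" survive
  tomakemat = cbind(1:num_lines, -1, sub(" .*$", "", content))

  # loop over lines, collecting (row, feature index, value) triplets
  makemat = rbind(tomakemat,
    do.call(rbind,
      lapply(1:num_lines, function(i){
        # split by spaces, drop the label, then split the index:value pairs
        line = as.vector( strsplit( content[i], ' ' )[[1]])
        cbind(i, t(simplify2array(strsplit(line[-1], ':'))))
      })))
  class(makemat) = "numeric"

  # the label lands in column 1; feature k lands in column k + 2
  yx = sparseMatrix(i = makemat[,1],
                    j = makemat[,2]+2,
                    x = makemat[,3])
  return( yx )
}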

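This ran in a matter of minutes on the same machine (zygmuntz's solution may also have had memory issues, not sure). Hope this helps anyone with the same problem.

Remember, if you need to do big computations in R, VECTORIZE!

EDIT: fixed an indexing error I found this morning.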
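The same vectorization advice applies to the writing direction. Below is a minimal base R sketch, assuming a dense numeric matrix `x` with at least one non-zero entry and a label vector `y`. The helper name `to_libsvm_lines` is hypothetical and does not come from this answer or from any package.

```r
# Convert a dense numeric matrix plus labels into libsvm-formatted lines.
# to_libsvm_lines is a hypothetical helper; zero cells are simply skipped.
# Assumes x contains at least one non-zero cell.
to_libsvm_lines <- function(x, y) {
  stopifnot(is.matrix(x), nrow(x) == length(y))
  nz <- which(x != 0, arr.ind = TRUE)          # (row, col) of non-zero cells
  nz <- nz[order(nz[, "row"], nz[, "col"]), , drop = FALSE]
  pairs <- paste0(nz[, "col"], ":", x[nz])     # "index:value" tokens
  # glue each row's tokens together; rows without non-zeros yield ""
  feats <- vapply(split(pairs, factor(nz[, "row"], levels = seq_len(nrow(x)))),
                  paste, character(1), collapse = " ")
  trimws(paste(y, feats))
}

x <- matrix(c(0, 3.4, 0,
              1, 0,   0.5), nrow = 2, byrow = TRUE)
lines <- to_libsvm_lines(x, y = c(1, 0))
writeLines(lines)   # pass a file name as second argument to write to disk
```

All string building happens on whole vectors, with a single `vapply` over rows at the end, rather than growing a result line by line.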

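Answer 2: (score: 4)

I came up with my own ad hoc solution, leveraging some data.table utilities.

It ran in almost no time on the test data set I found (Boston Housing data).

Converting that to a data.table (orthogonal to the solution, but added here for ease of reproducibility):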

library(data.table)
x = fread("/media/data_drive/housing.data.fw",
          sep = "\n", header = FALSE)
#usually fixed-width conversion is harder, but everything here is numeric
columns =  c("CRIM", "ZN", "INDUS", "CHAS",
             "NOX", "RM", "AGE", "DIS", "RAD", 
             "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
DT = with(x, fread(paste(gsub("\\s+", "\t", V1), collapse = "\n"),
                   header = FALSE, sep = "\t",
                   col.names = columns))

Here it is:

DT[ , fwrite(as.data.table(paste0(
  MEDV, " | ", sapply(transpose(lapply(
    names(.SD), function(jj)
      paste0(jj, ":", get(jj)))),
    paste, collapse = " "))), 
  "/path/to/output", col.names = FALSE, quote = FALSE),
  .SDcols = !"MEDV"]
#what gets sent to as.data.table:
#[1] "24 | CRIM:0.00632 ZN:18 INDUS:2.31 CHAS:0 NOX:0.538 RM:6.575 
#  AGE:65.2 DIS:4.09 RAD:1 TAX:296 PTRATIO:15.3 B:396.9 LSTAT:4.98 MEDV:24"      
#[2] "21.6 | CRIM:0.02731 ZN:0 INDUS:7.07 CHAS:0 NOX:0.469 RM:6.421 
#  AGE:78.9 DIS:4.9671 RAD:2 TAX:242 PTRATIO:17.8 B:396.9 LSTAT:9.14 MEDV:21.6"
# ...

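There may be a better way to get this understood by fwrite than as.data.table, but I can't think of one (until setDT works on vectors).

I replicated this to test its performance on a bigger data set (just blowing up the current data set):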

DT2 = rbindlist(replicate(1000, DT, simplify = FALSE))

The operation was pretty fast compared to some of the times reported here (I have not compared them directly yet):

system.time(.)
#    user  system elapsed 
#   8.392   0.000   8.385 

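I also tested writeLines instead of fwrite, but the latter was better.

Looking at this again, I see it might take a while to figure out what is going on. Maybe the magrittr-piped version will be easier to follow: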

DT[ , 
    #1) prepend each column's values with the column name
    lapply(names(.SD), function(jj)
      paste0(jj, ":", get(jj))) %>%
      #2) transpose this list (using data.table's fast tool)
      #   (was column-wise, now row-wise)
      #3) concatenate columns, separated by " "
      transpose %>% sapply(paste, collapse = " ") %>%
      #4) prepend each row with the target value
      #   (with Vowpal Wabbit in mind, separate with a pipe)
      paste0(MEDV, " | ", .) %>%
      #5) convert this to a data.table to use fwrite
      as.data.table %>%
      #6) fwrite it; exclude nonsense column name,
      #   and force quotes off
      fwrite("/path/to/data", 
             col.names = FALSE, quote = FALSE),
  .SDcols = !"MEDV"]

Reading such files is much easier**

#quickly read data; don't split within lines
x = fread("/path/to/data", sep = "\n", header = FALSE)

#tstrsplit is transpose(strsplit(.))
dt1 = x[ , tstrsplit(V1, split = "[| :]+")]

#even columns have variable names
nms = c("target_name", 
        unlist(dt1[1L, seq(2L, ncol(dt1), by = 2L), 
                   with = FALSE]))

#odd columns have values
DT = dt1[ , seq(1L, ncol(dt1), by = 2L), with = FALSE]
#add meaningful names
setnames(DT, nms)

**This will not work with "ragged"/sparse input data. I don't think there is a way to extend it to handle that case.
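For ragged rows, one possible workaround is to give up on the wide table and keep the data in long (row, name, value) form instead. The base R sketch below does not extend the tstrsplit approach above; it is a different technique. It assumes the lines are already in memory and use the "target | name:value ..." layout written earlier. The helper name `read_ragged_long` is hypothetical.

```r
# Read ragged "target | name:value ..." lines into long (row, name, value) form.
# read_ragged_long is a hypothetical helper; it takes lines already in memory.
read_ragged_long <- function(lines) {
  parts  <- strsplit(lines, " | ", fixed = TRUE)
  target <- as.numeric(vapply(parts, `[`, character(1), 1))
  tokens <- strsplit(vapply(parts, `[`, character(1), 2), " ", fixed = TRUE)
  # repeat each row number once per token found on that row
  row    <- rep(seq_along(tokens), lengths(tokens))
  kv     <- do.call(rbind, strsplit(unlist(tokens), ":", fixed = TRUE))
  list(target = target,
       long = data.frame(row = row, name = kv[, 1], value = as.numeric(kv[, 2]),
                         stringsAsFactors = FALSE))
}

res <- read_ragged_long(c("24 | CRIM:0.00632 ZN:18", "21.6 | CRIM:0.02731"))
res$long
```

Rows may carry different numbers of features here because nothing forces them into a common set of columns.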

Answer 3: (score: 2)

Answer 4: (score: 0)

I went with a two-hop solution: first convert the R data to another format, then convert that to LIBSVM:

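  1. Use the R package foreign to convert (and write out) the data frame to ARFF format (with a modified write.arff whose write.table call uses na = "0.0" instead of na = "?", otherwise step 2 fails)
  2. Use https://github.com/dat/svm-tools/blob/master/arff2svm.py to convert the ARFF format to LIBSVM
  3. My dataset is 200K x 500, and this took only 3-5 minutes.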
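Step 1 can be approximated without patching write.arff, as in the rough sketch below. It replaces the missing values in the data frame itself before writing, which is a swapped-in technique rather than the modification described above. It assumes all columns are numeric, and the output path is a temporary file chosen only for illustration.

```r
library(foreign)  # ships with R as a recommended package

# Toy data frame with missing values; numeric columns only (an assumption).
df <- data.frame(a = c(1, NA), b = c(NA, 2))

# write.arff would emit "?" for NA, so fill the gaps with 0 beforehand.
df[is.na(df)] <- 0

out <- tempfile(fileext = ".arff")
write.arff(df, file = out)
arff_lines <- readLines(out)
print(arff_lines)
```

The resulting file contains no "?" markers, so it can be handed to the ARFF-to-LIBSVM converter in step 2.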

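Answer 5: (score: 0)

This question was asked a long time ago and already has several answers. Most of them did not work for me because my data is in long format and I cannot one-hot encode it in R. So here is my take on it. I wrote a function that one-hot encodes the data and saves it, without first converting the matrix to a sparse matrix.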
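The general idea of one-hot encoding straight into index:value tokens can be illustrated in base R on a toy data frame. This is only a sketch of the index arithmetic: each factor level gets its own feature slot, and each numeric column gets one slot. It is not the function from this answer; `one_hot_lines` and the example columns are hypothetical.

```r
# Illustration only: map a mixed data frame to libsvm-style "index:value" rows
# by giving each factor level its own feature index (one-hot encoding).
# one_hot_lines is a hypothetical helper, not the function from this answer.
one_hot_lines <- function(df, target = "y") {
  feats  <- df[setdiff(names(df), target)]
  is_cat <- vapply(feats, function(col) is.factor(col) || is.character(col),
                   logical(1))
  feats[is_cat] <- lapply(feats[is_cat], as.factor)
  # number of feature slots per column: one per level, or one if numeric
  width  <- ifelse(is_cat, vapply(feats, nlevels, integer(1)), 1L)
  offset <- cumsum(width) - width              # slots used by earlier columns
  tokens <- Map(function(col, categorical, off) {
    if (categorical) paste0(off + as.integer(col), ":1")
    else paste0(off + 1L, ":", col)
  }, feats, is_cat, offset)
  paste(df[[target]], do.call(paste, unname(tokens)))
}

df <- data.frame(y = c(1.5, 2), colour = c("red", "blue"), size = c(0.3, 0.7))
encoded <- one_hot_lines(df)
print(encoded)
```

Here `colour` occupies feature indices 1 and 2 (levels "blue" and "red"), so the numeric column `size` starts at index 3.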

Rcpp code:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <Rcpp.h>
#include <iostream>
#include <fstream>
#include <string>
using namespace Rcpp;

// Reading data frame from R and saving it as a libFM file

// [[Rcpp::export]] 
std::string createNumber(int x, double y) {
  std::string s1 = std::to_string(x); 
  std::string s2 = std::to_string(y); 
  std::string X_elem = s1 + ":" + s2; 
  return X_elem;
}

// [[Rcpp::export]]
std::string createRowLibFM(arma::rowvec row_to_fm, arma::vec factor_levels, arma::vec position) {
  int n = factor_levels.n_elem; 
  std::string total =  std::to_string(row_to_fm[0]); 
  for (int i = 1; i < n; i++) { 
    if (factor_levels[i] > 1) { 
      total = total + " " + createNumber(position[i - 1] + row_to_fm[i], 1);
    } 
    if (factor_levels[i] == 1) {
      total = total + " " + createNumber(position[i], row_to_fm[i]);
    }
  }
  return total; 
}

// [[Rcpp::export]]
void writeFile(std::string file, arma::mat all_data, arma::vec factor_levels) {
  int n = all_data.n_rows;
  arma::vec position = arma::cumsum(factor_levels);
  std::ofstream temp_file;
  temp_file.open (file.c_str());
  for (int i = 0; i < n; i++) {
    std::string temp_row = createRowLibFM(all_data.row(i), factor_levels, position);
    temp_file << temp_row + "\n";
  }
  temp_file.close();
}

The R function acting as a wrapper:

library(dplyr)  # provides %>%, select() and everything() used below
writeFileFM <- function(temp.data, path = 'test.txt') { 
  ### Dealing with y function 
  if (!(any(colnames(temp.data) %in% 'y'))) { 
    stop('No y column is given')  
  } else { 
    temp.data <- temp.data %>% select(y, everything()) ## y is required to be first column for writeFile 
  }
  ### Dealing with factors/strings 
  temp.classes <- sapply(temp.data, class) 
  class.num    <- rep(0, length(temp.classes))
  map.list     <- list()
  for (i in 2:length(temp.classes)) { ### since y is always the first column 
    if (any(temp.classes[i] %in% c('factor', 'character'))) {
      temp.col         <- as.factor(temp.data[ ,i]) ### incase it is character 
      temp.unique      <- levels(temp.col)
      factors.new      <- seq(0, length(temp.unique) - 1, 1)
      levels(temp.col) <- factors.new 
      temp.data[ ,i]   <- temp.col
      ### Saving changes 
      class.num[i]  <- length(temp.unique)
      map.list[[i - 1]] <- data.frame('original.value'  = temp.unique, 
                                      'transform.value' = factors.new)
    } else { 
      class.num[i]  <- 1  ### Numeric values require only 1 column 
    }
  }
  ### Writing file 
  print('Writing file to disc')
  writeFile(all_data = sapply(temp.data, as.numeric), file = path, factor_levels = class.num)
  return(map.list) 
}

Comparing it with fake data:

library(stringi)  # provides stri_rand_strings()
### Creating data to save 
set.seed(999)
n <- 10000 
factor.lvl1 <- 3
factor.lvl2 <- 2 
temp.data <- data.frame('x1' = sample(stri_rand_strings(factor.lvl1, 7), n, replace = TRUE),
                        'x2' = sample(stri_rand_strings(factor.lvl2, 4), n, replace = TRUE), 
                        'x3' = rnorm(n), 
                        'x4' = rnorm(n),
                        'y'  = rnorm(n))

### Comparing to other method 
library(data.table)
library(e1071)

microbenchmark::microbenchmark(
  temp.data.table <- model.matrix( ~ 0 + x1 + x2 + x3 + x4, data = temp.data,
                                   contrasts = list(x2 = contrasts(temp.data$x2, contrasts = FALSE))),
  write.matrix.csr(temp.data.table, 'out.txt'), 
  writeFileFM(temp.data))

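Result: it is faster than the e1071 option, and that option fails as the number of observations increases, while the proposed method still works.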