R - low-level file I/O

Time: 2015-10-25 10:50:46

Tags: r matlab file

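I am trying to use R to read a "Flexible Data Format" file. I am given the number of bytes I should read, counted from EOF (for example, I should read the bytes from EOF-32 to EOF as my data).

I am looking for R equivalents of fseek and the related file-reading functions from MATLAB.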

1 Answer:

Answer 0: (score: 3)

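I think you would be better off with a different approach (assuming I have the right "Flexible Data Format" file format here). You can handle most of this (horrible) file format with the basic string functions in R:
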
library(stringr)

# read in fdf file
l <- readLines("http://rud.is/dl/Fe.fdf")

# some basic cleanup
l <- sub("#.*$", "", l)  # remove comments
l <- sub("^=.*$", "", l) # remove lines starting with "="
l <- gsub(" +", " ", l)  # compress spaces ("\ " is not a valid escape in an R string)
l <- str_trim(l)         # beg/end space trim
l <- grep("^$", l, value=TRUE, invert=TRUE) # ignore blank lines

# start of data blocks
blocks <- which(grepl("^%block", l))

# all "easy"/simple lines
simple <- str_split_fixed(grep("^[[:digit:]%]", l, value=TRUE, invert=TRUE),
                          "[[:space:]]+", 2)

# "simple" name/val [unit] conversions
convert_vals <- function(simple) {

  vals <- simple[,2]
  names(vals) <- simple[,1]

  lapply(vals, function(v) {

    # if logical
    if (tolower(v) %in% c("t", "true", ".true.", "f", "false", ".false.")) {
      # toupper() so that a bare "t"/"f" is also recognised by as.logical()
      return(as.logical(toupper(gsub("\\.", "", v))))
    }

    # if it's just a number
    # i may be missing a numeric fmt char in this horrible format
    if (grepl("^[[:digit:]\\.\\+\\-]+$", v)) {
      return(as.numeric(v))
    }

    # if value and unit convert to an actual number with a unit attribute
    # or convert it here from the table starting on line 927 of fdf.f
    if (grepl("^[[:digit:]]", v) && (!any(is.na(str_locate(v, " "))))) {
      vu <- str_split_fixed(v, " ", 2)
      x <- as.numeric(vu[,1])
      attr(x, "unit") <- vu[,2]
      return(x)
    }

    # handle "1.d-3" and other vals with other if's

    # anything not handled is returned
    return(v)

  })

}

# handle begin/end block "complex" data conversion
convert_blocks <- function(lines) {

  block_names <- sub("^%block ", "", grep("^%block", lines, value=TRUE))
  # locate block starts within `lines` instead of relying on the global `blocks`
  lapply(grep("^%block", lines), function(blk_start) {
    blk <- lines[blk_start]
    blk_info <- str_split_fixed(blk, " ", 2)
    blk_end <- which(grepl(sprintf("^%%endblock %s", blk_info[,2]), lines))

    # this is overly simplistic since you have to do some conversions, but you know the line
    # range of the data values now so you can process them however you need to
    read.table(text=lines[(blk_start+1):(blk_end-1)], 
               header=FALSE, stringsAsFactors=FALSE, fill=TRUE)

  }) -> blks

  names(blks) <- block_names

  return(blks)

}

fdf <- c(convert_vals(simple),
         convert_blocks(l))


str(fdf)

Output of str:

List of 32
 $ SystemName                       : chr "bcc Fe ferro GGA"
 $ SystemLabel                      : chr "Fe"
 $ WriteCoorStep                    : chr ""
 $ WriteMullikenPop                 : num 1
 $ NumberOfSpecies                  : num 1
 $ NumberOfAtoms                    : num 1
 $ PAO.EnergyShift                  : atomic [1:1] 50
  ..- attr(*, "unit")= chr "meV"
 $ PAO.BasisSize                    : chr "DZP"
 $ Fe                               : num 2
 $ LatticeConstant                  : atomic [1:1] 2.87
  ..- attr(*, "unit")= chr "Ang"
 $ KgridCutoff                      : atomic [1:1] 15
  ..- attr(*, "unit")= chr "Ang"
 $ xc.functional                    : chr "GGA"
 $ xc.authors                       : chr "PBE"
 $ SpinPolarized                    : logi TRUE
 $ MeshCutoff                       : atomic [1:1] 150
  ..- attr(*, "unit")= chr "Ry"
 $ MaxSCFIterations                 : num 40
 $ DM.MixingWeight                  : num 0.1
 $ DM.Tolerance                     : chr "1.d-3"
 $ DM.UseSaveDM                     : logi TRUE
 $ DM.NumberPulay                   : num 3
 $ SolutionMethod                   : chr "diagon"
 $ ElectronicTemperature            : atomic [1:1] 25
  ..- attr(*, "unit")= chr "meV"
 $ MD.TypeOfRun                     : chr "cg"
 $ MD.NumCGsteps                    : num 0
 $ MD.MaxCGDispl                    : atomic [1:1] 0.1
  ..- attr(*, "unit")= chr "Ang"
 $ MD.MaxForceTol                   : atomic [1:1] 0.04
  ..- attr(*, "unit")= chr "eV/Ang"
 $ AtomicCoordinatesFormat          : chr "Fractional"
 $ ChemicalSpeciesLabel             :'data.frame':  1 obs. of  3 variables:
  ..$ V1: int 1
  ..$ V2: int 26
  ..$ V3: chr "Fe"
 $ PAO.Basis                        :'data.frame':  5 obs. of  3 variables:
  ..$ V1: chr [1:5] "Fe" "0" "6." "2" ...
  ..$ V2: num [1:5] 2 2 0 2 0
  ..$ V3: chr [1:5] "" "P" "" "" ...
 $ LatticeVectors                   :'data.frame':  3 obs. of  3 variables:
  ..$ V1: num [1:3] 0.5 0.5 0.5
  ..$ V2: num [1:3] 0.5 -0.5 0.5
  ..$ V3: num [1:3] 0.5 0.5 -0.5
 $ BandLines                        :'data.frame':  5 obs. of  5 variables:
  ..$ V1: int [1:5] 1 40 28 28 34
  ..$ V2: num [1:5] 0 2 1 0 1
  ..$ V3: num [1:5] 0 0 1 0 1
  ..$ V4: num [1:5] 0 0 0 0 1
  ..$ V5: chr [1:5] "\\Gamma" "H" "N" "\\Gamma" ...
 $ AtomicCoordinatesAndAtomicSpecies:'data.frame':  1 obs. of  4 variables:
  ..$ V1: num 0
  ..$ V2: num 0
  ..$ V3: num 0
  ..$ V4: int 1
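
In the output above, DM.Tolerance is still the character string "1.d-3" because the value uses a Fortran-style "d" exponent, which the code leaves as a TODO. A minimal sketch of one way to handle it, assuming only the "d"/"D" exponent marker needs translating; the names is_fortran_number and parse_fortran_number are hypothetical helpers, not part of the original code:

```r
# Hypothetical helpers (not part of the original answer).

# TRUE if v looks like a number written with a Fortran "d"/"D" exponent,
# e.g. "1.d-3" or "2.5D+2".
is_fortran_number <- function(v) {
  grepl("^[+-]?([0-9]+\\.?[0-9]*|\\.[0-9]+)[dD][+-]?[0-9]+$", v)
}

# Swap the Fortran exponent marker for "e" so as.numeric() can parse it.
parse_fortran_number <- function(v) {
  as.numeric(sub("[dD]", "e", v))
}

is_fortran_number("1.d-3")     # TRUE
is_fortran_number("DZP")       # FALSE
parse_fortran_number("1.d-3")  # 0.001
```

Inside convert_vals() this could become one more branch, for example `if (is_fortran_number(v)) return(parse_fortran_number(v))`, placed before the final `return(v)`.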

You can see the output (along with the file and this code) in this gist, since it is easier to copy/paste/clone a gist.

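You will also need to:

  • handle unit conversion (though a grid::unit-like structure might make this more straightforward)
  • replace the naive read.table with a better "block reader"
  • handle file includes (pretty simple if you add a function or two)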
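
As a rough illustration of the first item, the sketch below converts a value carrying a "unit" attribute (as produced by convert_vals() above) to a common base unit. The lookup table holds a few standard physical constants chosen for illustration, not the table from fdf.f, and unit_factors, unit_base and to_base_unit are hypothetical names:

```r
# Illustrative conversion factors (assumption: eV is the base energy unit and
# Ang the base length unit). These are not copied from fdf.f.
unit_factors <- c(
  eV  = 1, meV  = 1e-3, Ry = 13.605693,  # energies -> eV
  Ang = 1, Bohr = 0.529177               # lengths  -> Ang
)
unit_base <- c(eV = "eV", meV = "eV", Ry = "eV", Ang = "Ang", Bohr = "Ang")

# Hypothetical helper: x is a numeric with a "unit" attribute.
to_base_unit <- function(x) {
  u <- attr(x, "unit")
  # Values without a unit, or with a unit missing from the table, pass through unchanged.
  if (is.null(u) || !(u %in% names(unit_factors))) return(x)
  out <- as.numeric(x) * unit_factors[[u]]
  attr(out, "unit") <- unit_base[[u]]
  out
}

x <- 50
attr(x, "unit") <- "meV"
converted <- to_base_unit(x)  # 0.05 with unit "eV"
```

Compound units such as "eV/Ang" are not in the table and are returned untouched, so a real implementation would need to parse those as well.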

With some tweaking/polish, this could become a new R package, not that I would ever want data files in this format.
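
If you really do need to read a fixed number of bytes counted back from EOF, as the question describes, base R's seek() and readBin() on a binary connection cover that use case. This is a minimal sketch assuming a regular file on disk; read_tail_bytes is a hypothetical helper name, and R's own documentation advises caution when using seek() on Windows:

```r
# Hypothetical helper: return the last n bytes of a file as a raw vector.
read_tail_bytes <- function(path, n) {
  size <- file.size(path)
  con <- file(path, open = "rb")
  on.exit(close(con))
  # Move the read position to n bytes before EOF (clamped to the file start).
  seek(con, where = max(0, size - n), origin = "start")
  readBin(con, what = "raw", n = n)
}

# Example: a 100-byte temporary file holding the byte values 0..99.
tmp <- tempfile()
writeBin(as.raw(0:99), tmp)
tail_bytes <- read_tail_bytes(tmp, 32)
length(tail_bytes)  # 32
```

The raw vector can then be decoded with further readBin() calls, for example `readBin(tail_bytes, what = "integer", n = 8, size = 4)`, with the type, size and endianness depending on how the file was written.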