Split a large dataset into subsets by column

Time: 2015-06-28 15:05:10

Tags: perl awk

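I have a very large dataset (20000x97) that I want to split into multiple subsets. Column 1 should be included in every subset, and each remaining column should go into its own file together with column 1. The output should be tab-separated. See the example below.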

Mydata (example):

Seq                     124R 239G 361R 267G
TGAGGTAGTAGTTTGTGCTGTTG 27  15  15  52
CACCCGTAGAACCGACCTT 58  32  44  69
TCAAGTAATCCAGGATAGGC    4   4   6   15
TTTGGCAATGGTAGAACTCACACTGGTGAGGT    7   45  0   33
CACCCGTAGAACCGACCTTGC   488 740 834 1784
CTGAGACCTCTGGGTTCTGAGCT 20  11  4   33
CCCATAAAGTAGAAAGCAC 47  53  56  235
TACCCATTGCATATCGGAGTTGT 174 257 206 333

I want to split the file into sub-files like this:

file1:

Seq                     124R
TGAGGTAGTAGTTTGTGCTGTTG 27
CACCCGTAGAACCGACCTT 58
TCAAGTAATCCAGGATAGGC    4
TTTGGCAATGGTAGAACTCACACTGGTGAGGT    7
CACCCGTAGAACCGACCTTGC   488
CTGAGACCTCTGGGTTCTGAGCT 20
CCCATAAAGTAGAAAGCAC 47
TACCCATTGCATATCGGAGTTGT 174

file2:

Seq                     239G
TGAGGTAGTAGTTTGTGCTGTTG 15
CACCCGTAGAACCGACCTT 32
TCAAGTAATCCAGGATAGGC    4
TTTGGCAATGGTAGAACTCACACTGGTGAGGT    45
CACCCGTAGAACCGACCTTGC   740
CTGAGACCTCTGGGTTCTGAGCT 11
CCCATAAAGTAGAAAGCAC 53
TACCCATTGCATATCGGAGTTGT 257

... file3, and so on

3 answers:

Answer 0: (score: 1)

If you don't have an answer yet, try the following:

use strict;
use warnings;

# Read the whole input into memory, splitting every non-blank line on whitespace.
open my $in, '<', 'input.txt' or die "Cannot open input.txt: $!";
my @rows = map { [split] } grep { /\S/ } <$in>;
close $in;
die "No data in input.txt\n" unless @rows;

# Write one file per data column: column 1 plus column $col, tab-separated.
# The column count comes from the header row, so this is not limited to 4 columns.
foreach my $col (1 .. $#{ $rows[0] })
{
    open my $out, '>', "FILE$col.txt" or die "Cannot open FILE$col.txt: $!";
    print $out join("\t", $_->[0], $_->[$col]), "\n" for @rows;
    close $out;
}

Answer 1: (score: 1)

Perhaps the following will help:

Usage:

perl script.pl dataFile

This method reads the dataset only one line at a time, so it should have no trouble with a "very large dataset".
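As an illustration of that line-at-a-time idea (not the script from this answer), here is a minimal awk sketch. It assumes whitespace-separated input and an awk that can keep one output file open per column; the names sample.txt and col_N.txt are made up for the example.

```shell
# Hypothetical sample input: the first rows of the example in the question.
printf '%s\n' \
  'Seq 124R 239G 361R 267G' \
  'TGAGGTAGTAGTTTGTGCTGTTG 27 15 15 52' \
  'CACCCGTAGAACCGACCTT 58 32 44 69' > sample.txt

# One pass over the input: each line is split into fields and written out
# immediately, so memory use does not grow with the number of rows.
awk -v OFS='\t' '
{
    for (i = 2; i <= NF; i++) {
        out = "col_" (i - 1) ".txt"   # hypothetical output name
        print $1, $i > (out)          # ">" truncates only on first open
    }
}' sample.txt
```

Because each output file stays open for the whole run, every row is appended to it without reopening the file. With 96 data columns that means 96 open files, which some awk implementations may not allow.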

Answer 2: (score: 1)

You can try this:

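awk -v OFS="\t" '{for(i=2;i<=NF;i++){ f=sprintf("file_%d.txt",i-1); if(f in F){ print $1,$i >>f }else{ print $1,$i >f; F[f]} close(f) }}' file

A more readable version:

awk -v OFS="\t" '
{
    for (i = 2; i <= NF; i++) {
        f = sprintf("file_%d.txt", i - 1)
        if (f in F) {
            print $1, $i >> f      # file already created: append
        } else {
            print $1, $i > f       # first write: create or truncate
            F[f]                   # remember that this file exists
        }
        close(f)                   # avoid running out of open file handles
    }
}' file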