Find files in a git repo over x megabytes, that don't exist in HEAD

Asked: 2008-11-18 10:00:45

Tags: git

I have a Git repository where I keep random stuff. Mostly random scripts, text files, websites I've designed, and so on.

There are some large binary files (generally 1-5MB) that I deleted over time; they sit around increasing the size of the repository, and they are not needed in the revision history.

Basically I want to be able to do...

me@host:~$ [magic command or script]
aad29819a908cc1c05c3b1102862746ba29bafc0 : example/blah.psd : 3.8MB : 130 days old
6e73ca29c379b71b4ff8c6b6a5df9c7f0f1f5627 : another/big.file : 1.12MB : 214 days old

..and then be able to go through each result, check whether it's still required, and if not remove it (probably with filter-branch).

10 answers:

Answer 0 (score: 53):

This is an adaptation of the git-find-blob script I posted previously:

#!/usr/bin/perl
use 5.008;
use strict;
use Memoize;

sub usage { die "usage: git-large-blob <size[b|k|m]> [<git-log arguments ...>]\n" }

@ARGV or usage();
my ( $max_size, $unit ) = ( shift =~ /^(\d+)([bkm]?)\z/ ) ? ( $1, $2 ) : usage();

my $exp = 10 * ( $unit eq 'b' ? 0 : $unit eq 'k' ? 1 : 2 );
my $cutoff = $max_size * 2**$exp; 

sub walk_tree {
    my ( $tree, @path ) = @_;
    my @subtree;
    my @r;

    {
        open my $ls_tree, '-|', git => 'ls-tree' => -l => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        while ( <$ls_tree> ) {
            my ( $type, $sha1, $size, $name ) = /\A[0-7]{6} (\S+) (\S+) +(\S+)\t(.*)/;
            if ( $type eq 'tree' ) {
                push @subtree, [ $sha1, $name ];
            }
            elsif ( $type eq 'blob' and $size >= $cutoff ) {
                push @r, [ $size, @path, $name ];
            }
        }
    }

    push @r, walk_tree( $_->[0], @path, $_->[1] )
        for @subtree;

    return @r;
}

memoize 'walk_tree';

open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %cr'
    or die "Couldn't open pipe to git-log: $!\n";

my %seen;
while ( <$log> ) {
    chomp;
    my ( $tree, $commit, $age ) = split " ", $_, 3;
    my $is_header_printed;
    for ( walk_tree( $tree ) ) {
        my ( $size, @path ) = @$_;
        my $path = join '/', @path;
        next if $seen{ $path }++;
        print "$commit $age\n" if not $is_header_printed++;
        print "\t$size\t$path\n";
    }
}
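
For reference, the `<size[b|k|m]>` argument handling above (a bare number counts as megabytes) can be sketched in Python; `parse_size` is a hypothetical name used for illustration, not part of the script:

```python
import re

def parse_size(arg):
    """Parse '1024b', '500k', '3m' or '3' (megabytes by default) into bytes,
    mirroring the Perl script's $max_size * 2**$exp computation."""
    m = re.fullmatch(r"(\d+)([bkm]?)", arg)
    if m is None:
        raise ValueError("usage: git-large-blob <size[b|k|m]>")
    number, unit = int(m.group(1)), m.group(2)
    # an empty suffix falls through to megabytes, as in the Perl original
    exponent = {"b": 0, "k": 10, "": 20, "m": 20}[unit]
    return number * 2 ** exponent
```

So `500k` and a bare `1` would scan for blobs over 512000 and 1048576 bytes respectively.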

Answer 1 (score: 44):

A more compact Ruby script:

#!/usr/bin/env ruby
head, threshold = ARGV
head ||= 'HEAD'
Megabyte = 1000 ** 2
threshold = (threshold || 0.1).to_f * Megabyte

big_files = {}

IO.popen("git rev-list #{head}", 'r') do |rev_list|
  rev_list.each_line do |commit|
    commit.chomp!
    for object in `git ls-tree -zrl #{commit}`.split("\0")
      bits, type, sha, size, path = object.split(/\s+/, 5)
      size = size.to_i
      big_files[sha] = [path, size, commit] if size >= threshold
    end
  end
end
    end
  end
end

big_files.each do |sha, (path, size, commit)|
  where = `git show -s #{commit} --format='%h: %cr'`.chomp
  puts "%4.1fM\t%s\t(%s)" % [size.to_f / Megabyte, path, where]
end

Usage:

ruby big_file.rb [rev] [size in MB]
$ ruby big_file.rb master 0.3
3.8M  example/blah.psd  (aad2981: 4 months ago)
1.1M  another/big.file  (6e73ca2: 2 weeks ago)

Answer 2 (score: 15):

A Python script that does the same thing (based on this post):

#!/usr/bin/env python

import os, sys

def getOutput(cmd):
    return os.popen(cmd).read()

if len(sys.argv) != 2:
    print "usage: %s size_in_bytes" % sys.argv[0]
else:
    maxSize = int(sys.argv[1])

    revisions = getOutput("git rev-list HEAD").split()

    bigfiles = set()
    for revision in revisions:
        files = getOutput("git ls-tree -zrl %s" % revision).split('\0')
        for file in files:
            if file == "":
                continue
            splitdata = file.split()
            commit = splitdata[2]
            if splitdata[3] == "-":
                continue
            size = int(splitdata[3])
            path = splitdata[4]
            if (size > maxSize):
                bigfiles.add("%10d %s %s" % (size, commit, path))

    bigfiles = sorted(bigfiles, reverse=True)

    for f in bigfiles:
        print f
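
A fragile step in scripts like this is splitting each `ls-tree -l` record into fields; a minimal Python sketch of that step (the function name and sample values are illustrative only):

```python
def parse_ls_tree_line(line):
    """Split one `git ls-tree -rl` record into mode, type, sha, size, path.
    Entries without a size (submodules print '-' there) yield None."""
    mode, otype, sha, size, path = line.split(None, 4)
    if size == "-":
        return None
    return int(size), sha, path
```

Capping the split at 4 keeps paths containing spaces in one piece, since only the final field can contain whitespace.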

Answer 3 (score: 6):

Ouch... that first script (Aristotle's) is pretty slow. Looking for files > 100k on the git.git repo, it chews up the CPU for about 6 minutes.

It also appears to print several wrong SHAs -- it will often print a SHA that has nothing to do with the filename mentioned on the next line.

Here's a faster version. The output format is different, but it is very fast, and it is also, as far as I can tell, correct.

The program is a bit longer, but a lot of that is verbiage.

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

use File::Temp qw(tempdir);
END { chdir( $ENV{HOME} ); }
my $tempdir = tempdir( "git-files_tempdir.XXXXXXXXXX", TMPDIR => 1, CLEANUP => 1 );

my $min = shift;
$min =~ /^\d+$/ or die "need a number";

# ----------------------------------------------------------------------

my @refs =qw(HEAD);
@refs = @ARGV if @ARGV;

# first, find blob SHAs and names (no sizes here)
open( my $objects, "-|", "git", "rev-list", "--objects", @refs) or die "rev-list: $!";
open( my $blobfile, ">", "$tempdir/blobs" ) or die "blobs out: $!";

my ( $blob, $name );
my %name;
my %size;
while (<$objects>) {
    next unless / ./;    # no commits or top level trees
    ( $blob, $name ) = split;
    $name{$blob} = $name;
    say $blobfile $blob;
}
close($blobfile);

# next, use cat-file --batch-check on the blob SHAs to get sizes
open( my $sizes, "-|", "< $tempdir/blobs git cat-file --batch-check | grep blob" ) or die "cat-file: $!";

my ( $dummy, $size );
while (<$sizes>) {
    ( $blob, $dummy, $size ) = split;
    next if $size < $min;
    $size{ $name{$blob} } = $size if ( $size{ $name{$blob} } || 0 ) < $size;
}

my @names_by_size = sort { $size{$b} <=> $size{$a} } keys %size;

say "
The size shown is the largest that file has ever attained.  But note
that it may not be that big at the commit shown, which is merely the
most recent commit affecting that file.
";

# finally, for each name being printed, find when it was last updated on each
# branch that we're concerned about and print stuff out
for my $name (@names_by_size) {
    say "$size{$name}\t$name";

    for my $r (@refs) {
        system("git --no-pager log -1 --format='%x09%h%x09%x09%ar%x09$r' $r -- $name");
    }
    print "\n";
}
print "\n";

Answer 4 (score: 6):

You want to use the BFG Repo-Cleaner, a faster, simpler alternative to git-filter-branch specifically designed for removing large files from Git repos.

Download the BFG jar (requires Java 6 or above) and run this command:

$ java -jar bfg.jar  --strip-blobs-bigger-than 1M  my-repo.git

Any files over 1M in size (that aren't in your latest commit) will be removed from your Git repository's history. You can then use git gc to clean away the dead data:

$ git gc --prune=now --aggressive

The BFG is typically 10-50x faster than running git-filter-branch, and the options are tailored around these two common use-cases:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data

Full disclosure: I'm the author of the BFG Repo-Cleaner.

Answer 5 (score: 4):

Aristotle's script will show you what you want. You should also know that deleted files will still take up space in the repo.

By default, Git keeps changes around for 30 days before they can be garbage-collected. If you want to remove them now:

$ git reflog expire --expire=1.minute refs/heads/master
     # all deletions up to 1 minute  ago available to be garbage-collected
$ git fsck --unreachable 
     # lists all the blobs(file contents) that will be garbage-collected 
$ git prune 
$ git gc

An aside: while I'm a big fan of Git, it doesn't bring any advantages to storing a collection of "random scripts, text files, websites" and binary files. Git tracks changes in content, particularly the history of coordinated changes among many text files, and it does so very efficiently and effectively. As your question illustrates, Git doesn't have good tools for tracking individual file changes, and it doesn't track changes within binaries, so any revision stores another complete copy in the repo.

Of course, using Git this way is an excellent way to become familiar with how it works.

Answer 6 (score: 3):

#!/bin/bash
if [ "$#" != 1 ]
then
  echo 'git large.sh [size]'
  exit
fi

declare -A big_files
big_files=()
echo printing results

while read commit
do
  while read bits type sha size path
  do
    if [ "$size" != "-" ] && [ "$size" -gt "$1" ]
    then
      big_files[$sha]="$sha $size $path"
    fi
  done < <(git ls-tree --abbrev -rl $commit)
done < <(git rev-list HEAD)

for file in "${big_files[@]}"
do
  read sha size path <<< "$file"
  if ! git ls-tree -r HEAD | grep -q $sha
  then
    echo $file
  fi
done

Source

Answer 7 (score: 1):

My python simplification of https://stackoverflow.com/a/10099633/131881:
#!/usr/bin/env python
import os, sys

bigfiles = []
for revision in os.popen('git rev-list HEAD'):
    for f in os.popen('git ls-tree -zrl %s' % revision.strip()).read().split('\0'):
        if f:
            mode, type, commit, size, path = f.split(None, 4)
            if size != '-' and int(size) > int(sys.argv[1]):
                bigfiles.append((int(size), commit, path))

for f in sorted(set(bigfiles)):
    print f

Answer 8 (score: 1):

This bash "one-liner" displays all blob objects in the repository that are larger than 10 MiB and are not present in HEAD, sorted from smallest to largest.

It's very fast, easy to copy & paste, and only requires standard GNU utilities.

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk -v min_mb=10 '/^blob/ && $3 >= min_mb*2^20 {print substr($0,6)}' \
| grep -vF "$(git ls-tree -r HEAD | awk '{print $3}')" \
| sort --numeric-sort --key=2 \
| cut --complement --characters=13-40 \
| numfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

It will generate output like this:

2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4

For more info, including an output format better suited to further script processing, see my original answer on a similar question.
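
If the pipe is hard to follow, its core is just a size filter plus a set difference against HEAD's blobs; here is a hedged Python sketch of that idea, with made-up data structures:

```python
def large_blobs_not_in_head(all_blobs, head_shas, min_size):
    """all_blobs: {sha: (size_in_bytes, path)} for every blob in history;
    head_shas: set of blob SHAs reachable from HEAD.
    Returns (size, sha, path) tuples sorted smallest to largest, like the pipe."""
    return sorted(
        (size, sha, path)
        for sha, (size, path) in all_blobs.items()
        if sha not in head_shas and size >= min_size
    )
```

The actual pipeline gets `all_blobs` from `git rev-list --objects --all` fed through `git cat-file --batch-check`, and `head_shas` from `git ls-tree -r HEAD`.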

Answer 9 (score: -1):

A bit late to the party, but git-fat has this functionality built in.

Just install it with pip and run git fat -a find 100000, where the number at the end is in bytes.