Question

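I have a lot of URLs to process. I store about 20,000,000 of them in a HashSet, and this causes some memory problems.

I tried to create a compressed string class: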

import java.io.*;//file writer
import java.util.*;
import java.util.zip.*;

class CompressedString2 implements Serializable{
    private int originalSize;
    private byte[] cstring;



    public CompressedString2 (){
        compress("");
    }


    public CompressedString2 (String string){
        compress(string);
    }


    public void compress(String str){
        try {
            byte[] bytes = str.getBytes("UTF-8");
            originalSize = bytes.length;

            ByteArrayOutputStream deflatedBytes = new ByteArrayOutputStream();
            DeflaterOutputStream dos = new DeflaterOutputStream(deflatedBytes,new Deflater(Deflater.DEFAULT_COMPRESSION));
            dos.write(bytes);
            dos.finish();
            cstring=deflatedBytes.toByteArray();
        }catch(Exception e){e.printStackTrace();}

    }


    public String decompress() throws Exception{
        String result="";
        try{
            InflaterInputStream iis = new InflaterInputStream(new ByteArrayInputStream(cstring));
            byte[] inflatedBytes = new byte[originalSize];
            // read() may return fewer bytes than requested, so loop until the buffer is full
            int off = 0;
            while (off < originalSize) {
                int n = iis.read(inflatedBytes, off, originalSize - off);
                if (n < 0) break; // end of stream
                off += n;
            }
            result= new String(inflatedBytes, "UTF-8");
        }catch(Exception e){e.printStackTrace();}
        return result;
    }
}

But in fact, when I store them with something like this:

HashSet<String> urlStr=new HashSet<String>();
HashSet<CompressedString2> urlComp=new HashSet<CompressedString2>();

String filePath=args[0];
int num=0;

try{
    BufferedReader br = new BufferedReader(new FileReader(filePath));

    String line = br.readLine();
    while (line != null) {
        num++;
        urlStr.add(line);
        urlComp.add(new CompressedString2(line));
        line = br.readLine();
    }
    br.close();
} catch(Exception e){
    System.out.println("error..:");
    e.printStackTrace();
}

ObjectOutputStream oos1 = new ObjectOutputStream(new FileOutputStream("testDeflator_rawurls.obj"));
oos1.writeObject(urlStr);
oos1.close();
ObjectOutputStream oos4 = new ObjectOutputStream(new FileOutputStream("testDeflator_compressed2.obj"));
oos4.writeObject(urlComp);
oos4.close();

The "compressed" URLs are bigger...

Does anyone have an idea how to compress URLs successfully?

Answer 1

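Well, if they are in a set, then all you can do is add, remove, and look up. You can also perform these operations on a "forest of characters", which can be a more compact representation. I'm thinking of a tree of nodes, each holding one character, linked to each other. The roots of the forest would contain "h", "f", and so on. Under the "h" node would be a "t" node, under that another "t", under that a "p", and so on. The "f" node would have "t" and "i" children. Eventually the tree branches out, but there could be a lot of sharing near the roots. Then you just walk the forest to see whether a URL is in it.

I think a node would need a boolean member to indicate that a URL in the set terminates there, a member to hold the character, and an array of links to other nodes.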
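
The structure described above is usually called a trie. A minimal sketch of it follows; the names `UrlTrie` and `Node` are made up for illustration, and it deviates from the description in one respect: instead of storing the character in the node plus an array of links, each node keeps a map from character to child. Note that a `HashMap` per node carries a lot of overhead in Java, so this naive form is not guaranteed to use less memory than a `HashSet<String>`; a real memory win would probably need a more compact node layout (for example merging single-child chains).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical character trie; class and method names are illustrative only.
class UrlTrie {
    // One node per character; the character is the key in the parent's map.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a stored URL ends at this node
    }

    private final Node root = new Node();
    private int size;

    /** Adds a URL; returns true if it was not already present. */
    boolean add(String url) {
        Node node = root;
        for (int i = 0; i < url.length(); i++) {
            node = node.children.computeIfAbsent(url.charAt(i), c -> new Node());
        }
        if (node.terminal) {
            return false;
        }
        node.terminal = true;
        size++;
        return true;
    }

    /** Walks the trie character by character to test membership. */
    boolean contains(String url) {
        Node node = root;
        for (int i = 0; i < url.length(); i++) {
            node = node.children.get(url.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.terminal;
    }

    int size() {
        return size;
    }
}
```

URLs that share a prefix such as `http://www.` share the nodes for that prefix, which is where the saving is supposed to come from.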

Answer 2

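Have you considered a different approach? 20 million strings in a hash set is a lot. Could you store them in a database and process them from there?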

Answer 3

Generally, for compression to pay off the string has to be longer, because compression works by exploiting patterns within the string.
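
To see this effect in isolation, here is a small sketch that deflates a single short URL and returns the compressed bytes. The class name `ShortStringDeflate` and the sample URL are made up; the `Deflater` calls are the standard `java.util.zip` ones. For a string this short there are no repeated patterns to exploit, and the zlib header and checksum add a few bytes on top.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical helper class, for illustration only.
class ShortStringDeflate {
    /** Returns the zlib-deflated form of the UTF-8 bytes of s. */
    static byte[] deflate(String s) {
        byte[] input = s.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end(); // release the native zlib memory
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/index.html"; // made-up sample URL
        int raw = url.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(raw + " bytes raw, " + deflate(url).length + " bytes deflated");
    }
}
```

Running it on typical short URLs should show the deflated form coming out larger than the raw bytes, which matches what the question observed.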

Answer 4

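Short strings may not compress to anything smaller than the uncompressed string. Have you tried {@ 1}}? It is on by default in some versions of Java 6.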

Answer 5

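You could compress n URLs at a time, where n might be anywhere from 10 to 100. That gives the compressor repeated strings and a skewed character probability distribution to work with. The downside is that every access has to decompress 10 to 100 URLs. So implement it, then trade memory use against speed and pick the compromise you prefer.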

Answer 6

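If many of your URLs share a common base, such as http://www.mysite.com/, you should consider using Ropes (project page), so that the first part of each string is represented only once.

See also this Wikipedia page.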

Answer 7

You could use tinyurl to shorten the URLs and then store them.
You can find a Java utility class for creating tiny URLs here.

Answer 8

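How about, for example, concatenating 100 links (separated by a special character) and trying to compress them into one CompressedString? Compression probably needs a minimum length to be effective. The CompressedString class could then return the 100 strings as a Collection.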
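
A rough sketch of that idea is below. The class name `CompressedUrlBlock` is invented for illustration, and it assumes `'\n'` as the separator (so no URL may contain a newline) and at least one URL per block. It is only meant to show the round trip, not to be a drop-in replacement for the class in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical block container; the name and the '\n' separator are assumptions.
class CompressedUrlBlock {
    private final byte[] data;

    /** Joins the URLs with '\n' and deflates them as a single block. */
    CompressedUrlBlock(List<String> urls) {
        byte[] raw = String.join("\n", urls).getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        data = out.toByteArray();
    }

    /** Inflates the whole block and splits it back into the original URLs. */
    List<String> urls() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InflaterInputStream iis =
                 new InflaterInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = iis.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String joined = new String(out.toByteArray(), StandardCharsets.UTF_8);
        return Arrays.asList(joined.split("\n", -1));
    }

    int compressedSize() {
        return data.length;
    }
}
```

The trade-off is the one already mentioned: looking up a single URL means inflating its whole block, so the block size decides how much speed you give up for the memory you save.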

Answer 9

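Compressing the URLs won't necessarily save you any memory, because of the extra overhead of the wrapper class. One alternative is to use a prefix map to shorten the URLs. If you do use a wrapper class, though, you must implement the hashCode and equals methods. Without them the hash set will not work as expected (it will allow duplicates). For CompressedString2, these could be implemented as: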

@Override
public int hashCode() {
    return Arrays.hashCode(cstring);
}

@Override
public boolean equals(Object other){
    if(other instanceof CompressedString2){
        return Arrays.equals(cstring, ((CompressedString2) other).cstring);
    }
    return false;
}

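Another thing that can greatly reduce the memory footprint is to use, for example, Trove's THashSet. Since you know the approximate number of URLs, you can also raise the load factor and set the initial size of the hash set, which saves a lot of rehashing and lets you use the allocated space more efficiently.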
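
As a sketch of the pre-sizing part only: the example below uses the standard `java.util.HashSet` constructor that takes an initial capacity and a load factor, rather than Trove, so as not to guess at Trove's API. The helper name `PresizedUrlSet` and the load factor in the usage line are illustrative assumptions, not recommendations.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, for illustration only.
class PresizedUrlSet {
    /** Creates a HashSet sized to hold expectedSize entries without rehashing. */
    static Set<String> create(int expectedSize, float loadFactor) {
        // A HashSet grows once its size exceeds capacity * loadFactor,
        // so request enough capacity up front to stay below that threshold.
        int initialCapacity = (int) Math.ceil(expectedSize / (double) loadFactor);
        return new HashSet<>(initialCapacity, loadFactor);
    }
}
```

For the case in the question this would be used as `Set<String> urls = PresizedUrlSet.create(20_000_000, 0.9f);`. Keep in mind that `java.util.HashMap` (which backs `HashSet`) rounds the table size up to a power of two, so the allocated table can be larger than the requested capacity.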

Compressing Java strings (URLs)

9 answers: