Question

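I have 104k string values, of which 89k are unique. I want to check whether a given string exists in this list.

Here is my class and the methods that hold all of these records.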

import java.util.ArrayList;
import java.util.List;

public class TestClass {
    private static TestClass singletonObj = null;
    private List<String> stringList= null;

    public static synchronized TestClass getInstance() {
        if(singletonObj == null) {
            singletonObj = new TestClass();
        }
        return singletonObj;
    }


    public boolean isValidString(String token) {
        if(stringList == null) {
            init();
        }
        if(stringList != null && token != null && !token.isEmpty())
            return stringList.contains(token.toLowerCase());
        return false;
    }

    private void init() {
        stringList = new ArrayList<String>();
        // put all 104k values in this data structure.
    }
}

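My application calls this isValidString() method concurrently, at about 20 requests per second. This works fine, but when I changed the data structure to a HashSet, CPU usage became very high. As I understand it, a HashSet lookup [O(1)] should perform better than an ArrayList lookup [O(n)]. Can anyone explain why this happens?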

Answer 1

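I created a simple class that spawns 20 threads, each of which calls your dictionary checker once per second; the code is at the bottom of this post.

I could not reproduce your results, but that may be down to the input data I have access to. I used your TestClass implementation and imported ~130,000 words from the English Open Word List (EOWL). With either ArrayList or HashSet as the type of stringList, I saw no sustained high CPU usage.

My guess is that your problem is caused by your input data. I tried adding the input dictionary twice to create duplicates. With an ArrayList this obviously just makes the list twice as long, but with a HashSet the duplicates are discarded. You note that about 1/5 of your input data is duplicated. In my test, with 1/2 duplicates, I did see a slight increase in CPU for about 2 seconds, and then almost nothing once stringList had been initialised.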
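The duplicate experiment described above can be sketched roughly as follows. The class name DuplicateDemo and the three sample words are made up for illustration; the real test used the EOWL word list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class, not part of the original answer.
class DuplicateDemo {
    // Returns a list holding every word twice, mimicking a dictionary loaded twice.
    static List<String> loadTwice(List<String> words) {
        List<String> all = new ArrayList<>(words);
        all.addAll(words);
        return all;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("baby", "cat", "dog"); // sample data only
        List<String> list = loadTwice(words);
        Set<String> set = new HashSet<>(list); // duplicates are dropped on insertion
        System.out.println("list size = " + list.size());
        System.out.println("set size  = " + set.size());
    }
}
```

The list keeps both copies of every word, while the set keeps one; each rejected duplicate still costs a hash computation and an equality check during loading.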

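This "blip" could last longer if your input strings are more complex than the single words I used. Perhaps that is your problem. Alternatively, perhaps some other code wrapped around this part is what is hogging the CPU.

N.B. I would caution you, as others have in the comments, about your implementation of init. In my experiments I saw that many threads could call the dictionary check before the dictionary was fully initialised, giving inconsistent results for the same test word. Why not call it from the constructor, given that this is a singleton object?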
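One way to act on that suggestion, as a minimal sketch: build the set inside the constructor and publish the singleton through a static final field, so that no thread can observe a half-filled dictionary. The class name EagerDictionary and the loadWords() helper are hypothetical stand-ins for the real loading code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical replacement for TestClass; names are illustrative.
class EagerDictionary {
    // Created once during class initialisation, which the JVM performs thread-safely.
    private static final EagerDictionary INSTANCE = new EagerDictionary(loadWords());

    private final Set<String> words;

    EagerDictionary(List<String> source) {
        Set<String> tmp = new HashSet<>();
        for (String s : source) {
            tmp.add(s.toLowerCase(Locale.ROOT));
        }
        // The set is fully populated before it becomes visible and is never modified afterwards.
        this.words = Collections.unmodifiableSet(tmp);
    }

    static EagerDictionary getInstance() {
        return INSTANCE;
    }

    // Assumption: stands in for reading the 104k values from wherever they live.
    private static List<String> loadWords() {
        return Arrays.asList("Baby", "cat", "dog");
    }

    boolean isValidString(String token) {
        return token != null && !token.isEmpty()
                && words.contains(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(getInstance().isValidString("BABY"));
        System.out.println(getInstance().isValidString("zebra"));
    }
}
```

Because the set is only read after construction, concurrent calls to isValidString need no further synchronisation in this sketch.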

Code listing

Your TestClass, with some code added to load the input data:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;

public class TestClass {
    private static TestClass singletonObj = null;
    //private List<String> stringList = null;

    private HashSet<String> stringList = null;

    public static synchronized TestClass getInstance() {
        if (singletonObj == null) {
            singletonObj = new TestClass();
        }
        return singletonObj;
    }

    public boolean isValidString(String token) {
        if (stringList == null) {
            init();
        }
        if (stringList != null && token != null && !token.isEmpty())
            return stringList.contains(token.toLowerCase());
        return false;
    }

    private void init() {
        String dictDir = "C:\\Users\\Richard\\Documents\\EOWL_CSVs";
        File[] csvs = (new File(dictDir)).listFiles();
        stringList = new HashSet<String>();
        Scanner inFile = null;

        for (File f : csvs) {
            try {
                inFile = new Scanner(new FileReader(f));
            } catch (FileNotFoundException e) {
                e.printStackTrace();
                System.exit(1);
            }

            while (inFile.hasNext()) {
                stringList.add(inFile.next().toLowerCase()
                        .replaceAll("[^a-zA-Z ]", ""));
            }
            inFile.close();
        }

        System.out.println("Dictionary initialised with " + stringList.size()
                + " members");
    }
}

The threads that access it:

import java.io.FileNotFoundException;

public class DictChecker extends Thread {

    TestClass t = null;
    public static int classId = 0;
    String className = null;


    public void doWork()
    {
        String testString = "Baby";
        if (t.isValidString(testString))
        {
            System.out.println("Got a valid string " + testString + " in class " + className);
        }
        else
        {
            System.out.println(testString + " not in the dictionary");
        }
    }

    public void run()
    {
        while (true)
        {
            try {
                DictChecker.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            doWork();
        }
    }

    public DictChecker()
    {
        t = TestClass.getInstance();
        className = "dChecker" + classId;
        classId += 1;
        System.out.println("Initialised " + className + " in thread " + this.getName());
    }

    public static void main(String[] args) throws FileNotFoundException
    {
        for (int i = 0; i < 20; i++)
        {
             (new DictChecker()).start();
             try {
                DictChecker.sleep(50);//simply to distribute load over the second
            } catch (InterruptedException e) {
                e.printStackTrace();
            } 
        }
    }
}

Answer 2

My guess is that HashSet, being a hash-based structure, computes the hashCode of every String at the moment it is inserted, that is, inside the init method. That may be the period in which CPU usage goes up, and it is part of the price paid for the better lookup throughput afterwards.

If I am right, CPU usage should drop once init has finished, and the program should then run considerably faster. That is the benefit of using a HashSet.

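By the way, a reliable optimisation is to pre-size the structures:

An ArrayList should be given an initial size equal to the maximum number of elements it will contain.

A HashSet should be given an initial size about 1.7 times that maximum.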
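A minimal sketch of pre-sizing both collections. The 1.7 factor is the figure given in this answer. The sketch instead derives the capacity from HashSet's documented default load factor of 0.75, which is the smallest value that avoids a resize; any larger factor, including 1.7, also avoids one at the cost of a bigger table. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not from the original answer.
class Presize {
    // Smallest initial capacity that holds n elements without a resize
    // at the default load factor of 0.75.
    static int hashCapacityFor(int n) {
        return (int) Math.ceil(n / 0.75);
    }

    public static void main(String[] args) {
        int expected = 104_000; // element count taken from the question
        List<String> list = new ArrayList<>(expected);
        Set<String> set = new HashSet<>(hashCapacityFor(expected));
        list.add("baby");
        set.add("baby");
        System.out.println(list.size() + " " + set.size());
    }
}
```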

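BTW: the standard hash used here (String.hashCode) is computed over all the characters of the string. You might be satisfied with hashing only the first 100 characters, for example (depending on the nature of the data you are handling, of course). You could then wrap your strings in a class of your own, overriding hashCode with your own hashing algorithm and overriding equals to perform a strict comparison.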
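A rough sketch of such a wrapper, assuming a 100-character prefix is distinctive enough for your data. The class name PrefixKey is made up for illustration. Note that String already caches its hash code after the first computation, so the saving applies only to the first time each string is hashed.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical wrapper class illustrating the idea; not from the original answer.
final class PrefixKey {
    private static final int PREFIX = 100; // illustrative cut-off
    private final String value;
    private final int hash;

    PrefixKey(String value) {
        this.value = value;
        int end = Math.min(PREFIX, value.length());
        int h = 0;
        // Same polynomial as String.hashCode, but over the prefix only.
        for (int i = 0; i < end; i++) {
            h = 31 * h + value.charAt(i);
        }
        this.hash = h;
    }

    @Override
    public int hashCode() {
        return hash;
    }

    // Strict comparison over the full string, so prefix collisions stay distinct.
    @Override
    public boolean equals(Object o) {
        return o instanceof PrefixKey && value.equals(((PrefixKey) o).value);
    }

    public static void main(String[] args) {
        Set<PrefixKey> set = new HashSet<>();
        set.add(new PrefixKey("baby"));
        System.out.println(set.contains(new PrefixKey("baby")));
    }
}
```

Strings that share their first 100 characters land in the same bucket, so this trades hashing time for more equals calls on such data.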

Answer 3

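The JDK HashSet is built on top of a HashMap<T, Object>, where the value is a single 'present' object. This means the memory consumption of a HashSet is the same as that of a HashMap: to store SIZE values you need 32 * SIZE + 4 * CAPACITY bytes (plus the size of the values themselves).

For an ArrayList, it is the capacity of the java.util.ArrayList multiplied by the reference size (4 bytes on 32-bit, 8 bytes on 64-bit) + [object header + one int and one reference].

So HashSet is definitely not a memory-friendly collection.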
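Plugging the question's numbers into the formulas above gives a feel for the difference. This is a back-of-the-envelope sketch with 4-byte references; it assumes the HashMap table grows by doubling from 16 at the default load factor of 0.75, and it ignores the small fixed per-object overhead. The class and method names are illustrative.

```java
// Hypothetical estimator based on the formulas quoted in this answer.
class MemoryEstimate {
    // 32 * SIZE + 4 * CAPACITY, as quoted above (4-byte references).
    static long hashSetBytes(long size, long capacity) {
        return 32 * size + 4 * capacity;
    }

    // Capacity times reference size; the fixed object header is ignored.
    static long arrayListBytes(long capacity, int refSize) {
        return capacity * refSize;
    }

    // Assumption: table doubles from 16 until size fits under 0.75 * capacity.
    static long tableCapacity(long size) {
        long cap = 16;
        while (size > cap * 0.75) {
            cap <<= 1;
        }
        return cap;
    }

    public static void main(String[] args) {
        long cap = tableCapacity(89_000);                     // 131072
        System.out.println(hashSetBytes(89_000, cap));        // 3372288
        System.out.println(arrayListBytes(104_000, 4));       // 416000
    }
}
```

Under these assumptions the set of 89k unique strings needs roughly 3.4 MB of bookkeeping against roughly 0.4 MB for the 104k-entry list, excluding the strings themselves.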

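The exact figures depend on whether you use a 32-bit or a 64-bit VM. That said, HashSet is hurt more by 8-byte references than ArrayList is: according to the linked memory-consumption chart, adding an extra 4 bytes per reference brings ArrayList to at most ~12 bytes per element and HashSet to at most ~52 bytes per element.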

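An ArrayList is implemented using an array of Objects. The following figure shows the memory usage and layout of an ArrayList on a 32-bit Java runtime:

[Figure: memory usage and layout of an ArrayList on a 32-bit Java runtime]

The figure shows that when an ArrayList is created, the result is an ArrayList object using 32 bytes of memory, along with an Object array at the default size of 10, for a total of 88 bytes of memory for an empty ArrayList. This means the ArrayList is not sized exactly and therefore has a default capacity, which happens to be 10 entries.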

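Properties of an ArrayList

Default capacity - 10 entries

Empty size - 88 bytes

Overhead - 48 bytes plus 4 bytes per entry

Overhead for a 10K collection - ~40K

Search/insert/delete performance - O(n) - the time taken is linear in the number of elements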

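A HashSet has fewer capabilities than a HashMap, in that it cannot contain more than one null entry and cannot have duplicate entries. The implementation is a wrapper around a HashMap, with the HashSet object managing what is allowed to be put into the HashMap object. This additional function of restricting the HashMap's capabilities means that HashSets have a slightly higher memory overhead.

[Figure: memory usage and layout of a HashSet on a 32-bit Java runtime]

The figure shows the shallow heap (the memory usage of the individual object) in bytes, along with the retained heap (the memory usage of the individual object and its child objects) in bytes, for a java.util.HashSet object. The shallow heap size is 16 bytes and the retained heap size is 144 bytes. When a HashSet is created, its default capacity (the number of entries that can be put into the set) is 16 entries. When a HashSet is created at the default capacity and no entries are put into it, it occupies 144 bytes. That is an extra 16 bytes over the memory usage of a HashMap.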

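The following table shows the properties of a HashSet:

Properties of a HashSet

Default capacity - 16 entries

Empty size - 144 bytes

Overhead - 16 bytes plus HashMap overhead

Overhead for a 10K collection - 16 bytes plus HashMap overhead

Search/insert/delete performance - O(1) - the time taken is constant regardless of the number of elements (assuming no hash collisions)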

High CPU usage with HashSet vs. ArrayList
