public class Dico {
   private String m_term;   // the term
   private double m_weight; // weight of the term
   private int m_Id_doc;    // id of the document that contains the term

   public Dico(int Id_Doc, String Term, double tf_ief) {
      this.m_Id_doc = Id_Doc;
      this.m_term = Term;
      this.m_weight = tf_ief;
   }

   public String getTerm() {
      return this.m_term;
   }

   public double getWeight() {
      return this.m_weight;
   }

   public void setWeight(double weight) {
      this.m_weight = weight;
   }

   public int getDocId() {
      return this.m_Id_doc;
   }
}


 public List<Dico> merge_list_map(List<Dico> list, Map<String, Double> map) {
    // in the map each term is unique, but the list contains repeated terms

    List<Dico> list_term_weight = new ArrayList<>();

    for (Map.Entry<String, Double> entrySet : map.entrySet()) {
        String key = entrySet.getKey();
        Double value = entrySet.getValue();

        for (Dico dic : list) {
            String term = dic.getTerm();
            double weight = dic.getWeight();

            // only combine the weights when the list term matches the map key
            if (key.equals(term)) {
                double new_weight = weight * value;
                list_term_weight.add(new Dico(dic.getDocId(), term, new_weight));
            }
        }
    }
    return list_term_weight;
 }



public List<Dico> merge_list_map(List<Dico> list, Map<String, Double> map) {
    // in the map each term is unique, but the list contains repeated terms
    List<Dico> list_term_weight = new ArrayList<>();

    for (Dico dic : list) {
        String term = dic.getTerm();
        double weight = dic.getWeight();

        Double value = map.get(term);  // <== fetch weight from Map
        if (value != null) {
            double new_weight = weight * value;

            list_term_weight.add(new Dico(dic.getDocId(), term, new_weight));
        }
    }

    return list_term_weight;
}


List<Dico> list = Arrays.asList(new Dico(1, "foo", 1), new Dico(2, "bar", 2), new Dico(3, "baz", 3));
Map<String, Double> weights = new HashMap<String, Double>();
weights.put("foo", 2d);
weights.put("bar", 3d);
System.out.println(merge_list_map(list, weights));


[Dico [m_term=foo, m_weight=2.0, m_Id_doc=1], Dico [m_term=bar, m_weight=6.0, m_Id_doc=2]]

Timing test - 10,000 elements

List<Dico> list = new ArrayList<Dico>();
Map<String, Double> weights = new HashMap<String, Double>();
for (int i = 0; i < 1e4; i++) {
    list.add(new Dico(i, "foo-" + i, i));
    if (i % 3 == 0) {
        weights.put("foo-" + i, (double) i);  // <== every 3rd has a weight
    }
}

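// merge_list_map_original / merge_list_map_fast: the nested-loop and Map.get() versions above, renamed
long t0 = System.currentTimeMillis();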
List<Dico> result1 = merge_list_map_original(list, weights);
long t1 = System.currentTimeMillis();
List<Dico> result2 = merge_list_map_fast(list, weights);
long t2 = System.currentTimeMillis();

System.out.println(String.format("Original: %d ms", t1 - t0));
System.out.println(String.format("Fast:     %d ms", t2 - t1));

// prove results equivalent, just different order
// requires Dico class to have hashCode/equals() - used eclipse default generator
System.out.println(new HashSet<Dico>(result1).equals(new HashSet<Dico>(result2)));


Original: 1005 ms
Fast:     16 ms  <=== loads quicker
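
The HashSet comparison above only works if Dico has value-based equals() and hashCode(). The original does not show the generated methods, so here is a minimal sketch of what they could look like, assuming equality on all three fields. The names EqualityDemo and DicoValue are made up for this example; DicoValue is only a stand-in for Dico.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;

class EqualityDemo {
    // Hypothetical stand-in for Dico with value-based equality on all three fields.
    static final class DicoValue {
        private final int docId;
        private final String term;
        private final double weight;

        DicoValue(int docId, String term, double weight) {
            this.docId = docId;
            this.term = term;
            this.weight = weight;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof DicoValue)) return false;
            DicoValue other = (DicoValue) o;
            return docId == other.docId
                    && Double.compare(weight, other.weight) == 0
                    && Objects.equals(term, other.term);
        }

        @Override
        public int hashCode() {
            // must use the same fields as equals()
            return Objects.hash(docId, term, weight);
        }
    }

    public static void main(String[] args) {
        // same elements, different order
        List<DicoValue> a = Arrays.asList(new DicoValue(1, "foo", 2.0), new DicoValue(2, "bar", 6.0));
        List<DicoValue> b = Arrays.asList(new DicoValue(2, "bar", 6.0), new DicoValue(1, "foo", 2.0));
        System.out.println(new HashSet<>(a).equals(new HashSet<>(b))); // prints true
    }
}
```

Without these overrides, HashSet falls back to identity comparison and the check would print false even for equivalent results.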

Also, check how the Map is initialized (http://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html). Rehashing the map is expensive in terms of performance.


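As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.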


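If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table.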


Map<String, Double> foo = new HashMap<String, Double>(maxSize * 2);
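
If you would rather size the map from the rule quoted above than use a flat factor of 2, something like the following could work. This is only a sketch: the helper name capacityFor and the 10,000 expected entries are assumptions for illustration, and it hard-codes the default load factor of 0.75.

```java
import java.util.HashMap;
import java.util.Map;

class CapacityDemo {
    static final float DEFAULT_LOAD_FACTOR = 0.75f;

    // Hypothetical helper: smallest integer capacity strictly greater than
    // expectedEntries / loadFactor, so the quoted "no rehash" condition holds.
    static int capacityFor(int expectedEntries) {
        return (int) (expectedEntries / DEFAULT_LOAD_FACTOR) + 1;
    }

    public static void main(String[] args) {
        int expected = 10_000; // assumed number of entries
        Map<String, Double> weights = new HashMap<>(capacityFor(expected));
        for (int i = 0; i < expected; i++) {
            weights.put("foo-" + i, (double) i);
        }
        System.out.println(capacityFor(expected)); // prints 13334
        System.out.println(weights.size());        // prints 10000
    }
}
```

For 10,000 entries this asks for 13,334 buckets instead of 20,000, which still satisfies the condition from the documentation.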


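For your merge_list_map function to be efficient, you need to actually use the Map for what it is: an efficient data structure for key lookup. Looping over the Map entries and searching the List for a match, as you are doing, makes the algorithm O(N*M), where M is the size of the map and N is the size of the list. That is certainly the worst you can get.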

If you loop over the List first and, for each term, do a lookup in the Map with Map#get(Object key), you get a time complexity of O(N), since a map lookup can be considered O(1).
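
To make the difference concrete, here is a small sketch that counts the loop steps of both strategies instead of timing them. It reuses the shape of the 10,000-element test data (every third term has a weight); the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ComplexityDemo {
    // Nested loops: every map entry is compared against every list element.
    static long nestedSteps(List<String> terms, Map<String, Double> weights) {
        long steps = 0;
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            for (String term : terms) {
                entry.getKey().equals(term); // one key comparison per (entry, term) pair
                steps++;
            }
        }
        return steps;
    }

    // Single pass: one hash lookup per list element.
    static long lookupSteps(List<String> terms, Map<String, Double> weights) {
        long steps = 0;
        for (String term : terms) {
            weights.get(term);
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        Map<String, Double> weights = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            terms.add("foo-" + i);
            if (i % 3 == 0) {
                weights.put("foo-" + i, (double) i);
            }
        }
        System.out.println(nestedSteps(terms, weights)); // prints 33340000 (3334 * 10000)
        System.out.println(lookupSteps(terms, weights)); // prints 10000
    }
}
```

The step counts only illustrate the growth rate; actual run time also depends on the cost of each comparison and hash lookup.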

In terms of design, if you can use Java 8, your problem can be expressed with Streams:

// Note: this version assumes a two-argument Dico(term, weight) constructor
public static List<Dico> merge_list_map(List<Dico> dico, Map<String, Double> weights) {
    List<Dico> wDico = dico.stream()
            .filter  (d -> weights.containsKey(d.getTerm()))
            .map     (d -> new Dico(d.getTerm(), d.getWeight()*weights.get(d.getTerm())))
            .collect (Collectors.toList());
    return wDico;
}


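  1. stream(): turns the list into a stream of Dico elements
  2. filter(): keeps only the Dico elements whose term is in the weights map
  3. map(): for each filtered element, creates a new Dico() instance with the computed weight
  4. collect(): collects all the new instances into a new list
  5. returns the new list containing the filtered Dico elements with their new weights.

Performance-wise, I tested it against some text, The Narrative of Arthur Gordon Pym by E. A. Poe: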

    String text = null;
    try (InputStream url = new URL("http://www.gutenberg.org/files/2149/2149-h/2149-h.htm").openStream())  {
        // read the whole page into a single String
        text = new Scanner(url, "UTF-8").useDelimiter("\\A").next();
    }
    String[] words = text.split("[\\p{Punct}\\s]+");
    System.out.println(words.length); // => 108028


    List<Dico> dico = initDico(words);
    List<Dico> bigDico = new ArrayList<>(10*dico.size());
    for (int i = 0; i < 10; i++) {
        bigDico.addAll(dico); // repeat the word list 10 times
    }
    System.out.println(bigDico.size()); // 1080280


    Map<String, Double> weights = initWeights(words);
    System.out.println(weights.size()); // 9449 distinct words

Testing the merge of 1M words with the weights map:

    long start = System.currentTimeMillis();
    List<Dico> wDico = merge_list_map(bigDico, weights);
    long end = System.currentTimeMillis();
    System.out.println("===== Elapsed time (ms): "+(end-start)); 
    // => 105 ms

The weights map is significantly smaller than yours, but that should not affect the timing, since lookups take quasi-constant time.



    // Note: assumes a two-argument Dico(term, weight) constructor
    private static List<Dico> initDico(String[] terms) {
        List<Dico> dico = Arrays.stream(terms)
                .map(s -> new Dico(s, 1.0))
                .collect(Collectors.toList());
        return dico;
    }

    // weight of a word is its relative frequency * 1000
    private static Map<String, Double> initWeights(String[] terms) {
        Map<String, Long> wfreq = termFreq(terms);
        long total = wfreq.values().stream().reduce(0L, Long::sum);
        return wfreq.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey, e -> (double)(1000.0*e.getValue()/total)));
    }

    // number of occurrences of each distinct term
    private static Map<String, Long> termFreq(String[] terms) {
        Map<String, Long> wfreq = Arrays.stream(terms)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return wfreq;
    }

