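Efficient tuple search algorithm

Date: 2015-08-31 23:58:33

Tags: algorithm search data-structures tuples

Given a store of 3-tuples where:

  • All elements are numbers, e.g. (1, 3, 4), (1300, 3, 15), (1300, 3, 15), ...
  • Tuples are removed and added frequently
  • The store typically holds no more than 100,000 elements
  • All tuples are held in memory
  • The application is interactive and requires 100 searches per second

What is the most efficient algorithm/data structure for performing wildcard (*) searches such as:

(1, *, 6)  (3601, *, *)  (*, 1935, *)

The goal is to have a Linda-like tuple space, but at the application level.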

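2 answers:

Answer 0 (score: 2)

Well, there are only 8 possible arrangements of wildcards, so you can easily build 6 multimaps and one set to serve as indexes: one for each arrangement of wildcards in a query. You don't need an 8th index because the query (*,*,*) trivially returns all tuples. The set is for queries with no wildcards; in that case only a membership test is needed.

A multimap takes a key to a set. In your example, the query (1,*,6) would consult the multimap for queries of the form (X,*,Y), which takes the key <X,Y> to the set of all tuples with X in the first position and Y in the third. In this case, X = 1 and Y = 6.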
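As a minimal sketch of just one of those indexes, the one serving queries of the form (X,*,Y), something like the following could work. The class name FirstThirdIndex and its methods are made up for illustration, and tuples are represented as List<Integer> only to get value-based equals/hashCode for free:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical index for queries of the form (X, *, Y) only.
class FirstThirdIndex {
  // Key <X,Y> -> all stored tuples with X in the first position and Y in the third.
  private final Map<List<Integer>, Set<List<Integer>>> index = new HashMap<>();

  void add(int a, int b, int c) {
    index.computeIfAbsent(Arrays.asList(a, c), k -> new HashSet<>())
         .add(Arrays.asList(a, b, c));
  }

  // Returns an empty set when no tuple matches (X, *, Y).
  Set<List<Integer>> lookup(int x, int y) {
    return index.getOrDefault(Arrays.asList(x, y), Collections.emptySet());
  }
}
```

After add(1, 3, 6) and add(1, 9, 6), lookup(1, 6) returns both tuples with a single hash probe. The full scheme keeps one such map per wildcard arrangement.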

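With any reasonable hash-based multimap implementation, lookups ought to be very fast. Several hundred per second ought to be easy, and several thousand per second doable (e.g. on a contemporary x86 CPU).

Insertions and deletions require updating the maps and the set. Again, this ought to be reasonably fast, though of course not as fast as lookups. Several hundred per second ought to be doable.

With only about 10^5 tuples, this approach ought to be fine for memory as well. You can save a bit of space with tricks, e.g. keeping a single copy of each tuple in an array and storing array indices in the maps/set to represent both keys and values. Manage the array slots with a free list.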
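The slot array with a free list could look roughly like this. It is only a sketch: TupleSlots and its method names are assumptions, and the indexes that would store the returned int ids instead of tuple objects are not shown:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical slot store: each tuple is kept once; indexes would hold the int ids.
class TupleSlots {
  private final List<int[]> slots = new ArrayList<>();
  private final Deque<Integer> free = new ArrayDeque<>(); // ids of vacated slots

  // Stores the tuple and returns its slot id, reusing a freed slot when one exists.
  int allocate(int a, int b, int c) {
    int[] t = { a, b, c };
    Integer id = free.poll();
    if (id == null) {
      slots.add(t);
      return slots.size() - 1;
    }
    slots.set(id, t);
    return id;
  }

  // Vacates a slot so that a later allocate() can reuse it.
  void release(int id) {
    if (slots.get(id) == null) {
      throw new IllegalStateException("slot " + id + " is already free");
    }
    slots.set(id, null);
    free.push(id);
  }

  int[] get(int id) {
    return slots.get(id);
  }
}
```

Because freed ids are reused, the array does not grow under a steady mix of inserts and deletes, which fits the "added and removed frequently" requirement.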

To make this concrete, here is pseudocode. I'll use angle brackets <a,b,c> for tuples to avoid too many parentheses:

  # Definitions
  For a query Q = <k0,k1,k2> where each k_i is either * or an integer,
  let I(Q) be the 3-digit binary number b2|b1|b0 where b_i = 0 if k_i is *
  and 1 if k_i is an integer.
  Let N(i) be the number of 1's in the binary representation of i.
  Let M(i) be a multimap taking a tuple with N(i) elements to a set of
  tuples with 3 elements.
  Let t be a 3-element tuple. Then T(t,i) returns a new tuple with only
  the elements of t in positions where i has a 1.
  For example, T(<1,2,3>,0) = <> and T(<1,2,3>,6) = <2,3>.
  Note that function T works fine on query tuples with wildcards.

  # Algorithm to insert tuple t into the database:
  fun insert(t)
    for i = 0 to 7
      add the entry T(t,i)->t to M(i)

  # Algorithm to delete tuple t from the database:
  fun delete(t)
    for i = 0 to 7
      delete the entry T(t,i)->t from M(i)

  # Query algorithm
  fun query(Q)
    let i = I(Q)
    return M(i).lookup(T(Q, i))  # lookup failure returns empty set

Note that for simplicity, I have not shown the "optimizations" for M(0) and M(7). For i=0, the algorithm above would create a multimap taking the empty tuple to the set of all 3-tuples in the database. You can avoid this by treating i=0 as a special case. Similarly, M(7) would take each tuple to a set containing only itself.

An "optimized" version:

  fun insert(t)
    for i = 1 to 6
      add the entry T(t,i)->t to M(i)
    add t to set S

  fun delete(t)
    for i = 1 to 6
      delete the entry T(t,i)->t from M(i)
    remove t from set S

  fun query(Q)
    let i = I(Q)
    if i = 0, return S
    elsif i = 7 return if Q in S { Q } else {}
    else return M(i).lookup(T(Q, i))

Addition

For fun, a Java implementation:

package hacking;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Random;
import java.util.Scanner;
import java.util.Set;

public class Hacking {
  public static void main(String [] args) {
    TupleDatabase db = new TupleDatabase();
    int n = 200000;
    long start = System.nanoTime();
    for (int i = 0; i < n; ++i) {
      db.insert(db.randomTriple());
    }
    long stop = System.nanoTime();
    double elapsedSec = (stop - start) * 1e-9;
    System.out.println("Inserted " + n + " tuples in " + elapsedSec
        + " seconds (" + (elapsedSec / n * 1000.0) + "ms per insert).");
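    // Read queries as three integers per line; -1 (Tuple.STAR) is the wildcard.
    Scanner in = new Scanner(System.in);
    for (;;) {
      System.out.print("Query: ");
      if (!in.hasNextInt()) {
        break; // stop on end of input or non-numeric input
      }
      int a = in.nextInt();
      int b = in.nextInt();
      int c = in.nextInt();
      System.out.println(db.query(new Tuple(a, b, c)));
    }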
  }
}

class Tuple {
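  // N_ONES[code] is the number of 1 bits in the 3-bit pattern code.
  static final int [] N_ONES = new int[] { 0, 1, 1, 2, 1, 2, 2, 3 };
  // Wildcard marker in query tuples; stored values are assumed to be non-negative.
  static final int STAR = -1;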

  final int [] vals;

  Tuple(int a, int b, int c) {
    vals = new int[] { a, b, c };
  }

  Tuple(Tuple t, int code) {
    vals = new int[N_ONES[code]];
    int m = 0;
    for (int k = 0; k < 3; ++k) {
      if (((1 << k) & code) > 0) {
        vals[m++] = t.vals[k];
      }
    }
  }

  @Override 
  public boolean equals(Object other) {
    if (other instanceof Tuple) {
      Tuple triple = (Tuple) other;
      return Arrays.equals(this.vals, triple.vals);
    }
    return false;
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(this.vals);
  }

  @Override
  public String toString() {
    return Arrays.toString(vals);
  }

  int code() {
    int c = 0;
    for (int k = 0; k < 3; k++) {
      if (vals[k] != STAR) {
        c |= (1 << k);
      }
    }
    return c;
  }

  Set<Tuple> setOf() {
    Set<Tuple> s = new HashSet<>();
    s.add(this);
    return s;
  }
}

class Multimap extends HashMap<Tuple, Set<Tuple>> {
  @Override
  public Set<Tuple> get(Object key) {
    Set<Tuple> r = super.get(key);
    return r == null ? Collections.<Tuple>emptySet() : r;
  }

  void put(Tuple key, Tuple value) {
    if (containsKey(key)) {
      super.get(key).add(value);
    } else {
      super.put(key, value.setOf());
    }
  }

  void remove(Tuple key, Tuple value) {
    Set<Tuple> set = super.get(key);
    if (set == null) {
      return; // nothing stored under this key, e.g. deleting an absent tuple
    }
    set.remove(value);
    if (set.isEmpty()) {
      super.remove(key);
    }
  }
}

class TupleDatabase {
  final Set<Tuple> set;
  final Multimap [] maps;

  TupleDatabase() {
    set = new HashSet<>();
    maps = new Multimap[7];
    for (int i = 1; i < 7; i++) {
      maps[i] = new Multimap();
    }
  }

  void insert(Tuple t) {
    set.add(t);
    for (int i = 1; i < 7; i++) {
      maps[i].put(new Tuple(t, i), t);
    }
  }

  void delete(Tuple t) {
    set.remove(t);
    for (int i = 1; i < 7; i++) {
      maps[i].remove(new Tuple(t, i), t);
    }
  }

  Set<Tuple> query(Tuple q) {
    int c = q.code();
    switch (c) {
    case 0: return set;
    case 7: return set.contains(q) ? q.setOf() : Collections.<Tuple>emptySet();
    default: return maps[c].get(new Tuple(q, c));
    }
  }

  Random gen = new Random();

  int randPositive() {
    return gen.nextInt(1000);
  }

  Tuple randomTriple() {
    return new Tuple(randPositive(), randPositive(), randPositive());
  }
}

Some output from the Java implementation:

Inserted 200000 tuples in 2.981607358 seconds (0.014908036790000002ms per insert).
Query: -1 -1 -1
[[504, 296, 987], [500, 446, 184], [499, 482, 16], [488, 823, 40], ...
Query: 500 446 -1
[[500, 446, 184], [500, 446, 762]]
Query: -1 -1 500
[[297, 56, 500], [848, 185, 500], [556, 351, 500], [779, 986, 500], [935, 279, 500], ...

Answer 1 (score: 0)

If you think of the tuples like an IP address, then a radix tree (trie) type structure might work. Radix trees are used for IP discovery.
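To illustrate the idea, here is a rough sketch that swaps the radix tree for a plain three-level trie keyed on whole elements rather than on bits. TupleTrie, STAR and the method names are assumptions of the sketch, deletion is left out, and values are assumed to be non-negative so that -1 can mark a wildcard:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical 3-level trie: one level per tuple position.
class TupleTrie {
  static final int STAR = -1; // wildcard marker (assumes stored values are >= 0)

  private static final class Node {
    final Map<Integer, Node> children = new HashMap<>();
  }

  private final Node root = new Node();

  void insert(int a, int b, int c) {
    Node n = root;
    for (int v : new int[] { a, b, c }) {
      n = n.children.computeIfAbsent(v, k -> new Node());
    }
  }

  List<int[]> query(int a, int b, int c) {
    List<int[]> out = new ArrayList<>();
    search(root, new int[] { a, b, c }, 0, new int[3], out);
    return out;
  }

  private void search(Node n, int[] q, int depth, int[] path, List<int[]> out) {
    if (depth == 3) {
      out.add(path.clone()); // reached a stored tuple
      return;
    }
    if (q[depth] == STAR) {
      // Wildcard: every child at this level has to be visited.
      for (Map.Entry<Integer, Node> e : n.children.entrySet()) {
        path[depth] = e.getKey();
        search(e.getValue(), q, depth + 1, path, out);
      }
    } else {
      Node child = n.children.get(q[depth]);
      if (child != null) {
        path[depth] = q[depth];
        search(child, q, depth + 1, path, out);
      }
    }
  }
}
```

Note the trade-off: a query such as (3601, *, *) follows one branch, but (*, 1935, *) has to fan out over every first-level child, so a trie favors queries whose leading positions are fixed.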

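Another approach might be to use bit operations: compute a bit hash for each tuple and apply bitwise operations (OR, AND) during the search for quick discovery.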
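One possible reading of this suggestion, sketched below, is a linear scan with a cheap bitmask pre-filter: each tuple gets a 64-bit signature with one bit per position, a query sets bits only for its non-wildcard positions, and a tuple can match only if it has every bit of the query set. The class SignatureScan, the 21-bits-per-position layout and the modulo hash are all assumptions of this sketch, not something the answer specifies:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical signature-based scan; candidates are always verified exactly.
class SignatureScan {
  static final int STAR = -1; // wildcard marker (assumes stored values are >= 0)
  private static final int BITS_PER_FIELD = 21; // 3 positions * 21 bits fit in a long

  private final List<int[]> tuples = new ArrayList<>();
  private final List<Long> sigs = new ArrayList<>();

  // Sets one bit per non-wildcard position, chosen by the value modulo 21.
  static long signature(int[] t) {
    long sig = 0L;
    for (int i = 0; i < 3; i++) {
      if (t[i] != STAR) {
        int bit = Math.floorMod(t[i], BITS_PER_FIELD);
        sig |= 1L << (i * BITS_PER_FIELD + bit);
      }
    }
    return sig;
  }

  void insert(int a, int b, int c) {
    int[] t = { a, b, c };
    tuples.add(t);
    sigs.add(signature(t));
  }

  List<int[]> query(int a, int b, int c) {
    int[] q = { a, b, c };
    long qsig = signature(q);
    List<int[]> out = new ArrayList<>();
    for (int k = 0; k < tuples.size(); k++) {
      // Cheap reject: every bit set in the query must also be set in the tuple.
      if ((sigs.get(k) & qsig) != qsig) {
        continue;
      }
      // Signatures can collide, so confirm with an exact comparison.
      if (matches(tuples.get(k), q)) {
        out.add(tuples.get(k));
      }
    }
    return out;
  }

  private static boolean matches(int[] t, int[] q) {
    for (int i = 0; i < 3; i++) {
      if (q[i] != STAR && q[i] != t[i]) {
        return false;
      }
    }
    return true;
  }
}
```

This still visits every tuple on each query, so it trades the extra index memory of the multimap approach for O(n) work per search.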