Question

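I have a use case in which I want to use ElasticSearch for real-time analytics. As part of that, I want to be able to calculate some simple affinity scores.

These are currently defined as the number of transactions performed by a user base filtered by some criteria, compared with the full user base.

As I understand it, I need to do the following:

1. Get the distinct transactions for the filtered user base
2. Query for these transactions (types) in the full user base
3. Do the calculation (normalization etc.)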
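
The calculation in the last step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the score is taken to be a "lift" (the share of users in the filtered segment who performed a transaction type, divided by the share in the full user base), and the function name `affinity_scores` and its parameters are hypothetical, not part of ElasticSearch.

```python
def affinity_scores(segment_counts, overall_counts, segment_users, total_users):
    """Return {transaction_type: lift} for every type seen in the segment.

    Assumption: affinity = (segment share) / (overall share).
    segment_counts / overall_counts map a transaction type to the number
    of distinct users who performed it in each population.
    """
    scores = {}
    if segment_users == 0 or total_users == 0:
        return scores
    for tt_name, seg_count in segment_counts.items():
        overall_count = overall_counts.get(tt_name, 0)
        if overall_count == 0:
            continue  # no baseline to compare against
        segment_share = seg_count / segment_users
        overall_share = overall_count / total_users
        scores[tt_name] = segment_share / overall_share
    return scores


# Made-up numbers: 50 of 100 segment users purchased, 250 of 1000 overall.
example = affinity_scores({"purchase": 50}, {"purchase": 250}, 100, 1000)
```

A value above 1 would mean the transaction type is over-represented in the filtered segment, a value below 1 that it is under-represented.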

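To get the "distinct transactions" for the filtered user base, I currently use a terms filter query with facets, which returns all the terms (transaction types). As I understand it, I then need to use this result as the input to a terms filter query in order to get the result I want.

I read that there is a pull request on GitHub which seems to implement this (https://github.com/elasticsearch/elasticsearch/pull/3278), but it is not obvious to me whether it is already available in the current release.

If not, are there any workarounds for implementing this?

As additional information, here is my sample mapping: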

curl -XPUT 'http://localhost:9200/store/user/_mapping' -d '
{
  "user": {
    "properties": {
      "user_id": { "type": "integer" },
      "gender": { "type": "string", "index" : "not_analyzed" },
      "age": { "type": "integer" },
      "age_bracket": { "type": "string", "index" : "not_analyzed" },
      "current_city": { "type": "string", "index" : "not_analyzed" },
      "relationship_status": { "type": "string", "index" : "not_analyzed" },
      "transactions" : {
        "type": "nested",
        "properties" : {
          "t_id": { "type": "integer" },
          "t_oid": { "type": "string", "index" : "not_analyzed" },
          "t_name": { "type": "string", "index" : "not_analyzed" },
          "tt_id": { "type": "integer" },
          "tt_name": { "type": "string", "index" : "not_analyzed" }
        }
      }
    }
  }
}'

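So, for the actual expected result of my sample use case, I have the following:

1. My filtered user base uses this sample filter: "gender": "male" & "relationship_status": "single". For these users, I want to get the distinct transaction types (field "tt_name" of the nested documents) and count the number of distinct user_ids.
2. Next, I want to query my full user base (no filter other than the list of transaction types from 1) and count the number of distinct user_ids.
3. Do the "affinity" calculation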
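
Steps 1 and 2 can be sketched as two request bodies built in Python, with the terms returned by the first request fed into the second on the client side. This is a hedged sketch, not a tested solution: it assumes one document per user (so that a `reverse_nested` doc count equals the number of distinct users), it assumes a release that has the `nested`, `filter` and `reverse_nested` aggregations, and it uses the `bool`/`filter` query form, which older 1.x releases would need to replace with a `filtered` query. The function and aggregation names are hypothetical.

```python
def _types_with_user_counts(size):
    # terms agg on the nested field; reverse_nested steps back to the
    # parent (user) document so its doc_count counts users, not transactions
    return {
        "by_type": {
            "terms": {"field": "transactions.tt_name", "size": size},
            "aggs": {"users": {"reverse_nested": {}}},
        }
    }


def segment_query(filters, size=50):
    """Step 1: transaction types and user counts inside the filtered segment."""
    term_filters = [{"term": {field: value}} for field, value in sorted(filters.items())]
    return {
        "size": 0,  # no hits needed, only aggregations
        "query": {"bool": {"filter": term_filters}},
        "aggs": {
            "transactions": {
                "nested": {"path": "transactions"},
                "aggs": _types_with_user_counts(size),
            }
        },
    }


def full_base_query(tt_names, size=50):
    """Step 2: the same counts over all users, restricted to the given types."""
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {
            "transactions": {
                "nested": {"path": "transactions"},
                "aggs": {
                    "selected": {
                        "filter": {"terms": {"transactions.tt_name": list(tt_names)}},
                        "aggs": _types_with_user_counts(size),
                    }
                },
            }
        },
    }


step1 = segment_query({"gender": "male", "relationship_status": "single"})
step2 = full_base_query(["purchase", "subscribe"])
```

Under these assumptions, the per-type user count would be read from each `by_type` bucket as `bucket["users"]["doc_count"]`. The bodies are plain dicts, so they can be serialized with `json.dumps` and sent with curl or any HTTP client.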

Answer 1

Here is a link to a runnable example:

http://sense.qbox.io/gist/9da6a30fc12c36f90ae39111a08df283b56ec03c

It assumes documents that look like:

{ "transaction_type" : "some_transaction", "user_base" : "some_user_base_id" }

The query is set up to return no hits, because the aggregations take care of computing the stats you are looking for:

{
  "size" : 0,
  "query" : {
    "match_all" : {}
  },
  "aggs" : {
    "distinct_transactions" : {
      "terms" : {
        "field" : "transaction_type",
        "size" : 20
      },
      "aggs" : {
        "by_user_base" : {
          "terms" : {
            "field" : "user_base",
            "size" : 20
          }
        }
      }
    }
  }
}

Here are the results:

  "aggregations": {
      "distinct_transactions": {
         "buckets": [
            {
               "key": "subscribe",
               "doc_count": 4,
               "by_user_base": {
                  "buckets": [
                     {
                        "key": "2",
                        "doc_count": 3
                     },
                     {
                        "key": "1",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "purchase",
               "doc_count": 3,
               "by_user_base": {
                  "buckets": [
                     {
                        "key": "1",
                        "doc_count": 2
                     },
                     {
                        "key": "2",
                        "doc_count": 1
                     }
                  ]
               }
            }
         ]
      }
   }

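So, inside "aggregations", you will have a list of "distinct_transactions" buckets. The key will be the transaction type, and doc_count will represent the total number of transactions by all users.

Inside each "distinct_transactions" bucket there is "by_user_base", which is another terms agg nested inside the first. Just as with the transactions, the key will represent the user base name (or ID or whatever), and doc_count will represent the number of transactions for that particular user base.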
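
To make the structure concrete, here is a small Python sketch that walks a response shaped like the one above and turns the counts into per-user-base shares. The helper name `shares_by_user_base` is hypothetical, and the sample dict just repeats the numbers from the result shown earlier.

```python
def shares_by_user_base(aggregations):
    """Return {transaction_type: {user_base: share of that type's transactions}}."""
    shares = {}
    for tx_bucket in aggregations["distinct_transactions"]["buckets"]:
        total = tx_bucket["doc_count"]
        if total == 0:
            continue  # avoid dividing by zero on an empty bucket
        shares[tx_bucket["key"]] = {
            ub_bucket["key"]: ub_bucket["doc_count"] / total
            for ub_bucket in tx_bucket["by_user_base"]["buckets"]
        }
    return shares


# Same numbers as the response above.
sample = {
    "distinct_transactions": {
        "buckets": [
            {"key": "subscribe", "doc_count": 4, "by_user_base": {"buckets": [
                {"key": "2", "doc_count": 3}, {"key": "1", "doc_count": 1}]}},
            {"key": "purchase", "doc_count": 3, "by_user_base": {"buckets": [
                {"key": "1", "doc_count": 2}, {"key": "2", "doc_count": 1}]}},
        ]
    }
}

shares = shares_by_user_base(sample)
```

With these numbers, user base "2" accounts for 3 of the 4 "subscribe" transactions and user base "1" for 2 of the 3 "purchase" transactions.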

Is that what you were trying to do? I hope this helps.

Answer 2

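With the current version of ElasticSearch there is the new significant_terms aggregation type, which can be used to calculate the affinity scores for my use case in a simpler way.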

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_significant_terms_demo.html#_recommending_based_on_statistics

All the metrics relevant to me can then be calculated in a single step, which is very nice!
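
For illustration, a request body along these lines might look as follows when built in Python against the mapping from the question. This is an untested sketch under assumptions: the `bool`/`filter` query form depends on the ElasticSearch release, the `significant_terms` aggregation is placed inside a `nested` aggregation because "tt_name" lives in nested documents, and the function and aggregation names are made up for the example.

```python
import json


def significant_types_query(filters, size=10):
    """Build a body asking which transaction types stand out in the segment."""
    term_filters = [{"term": {field: value}} for field, value in sorted(filters.items())]
    return {
        "size": 0,  # only the aggregation result is of interest
        "query": {"bool": {"filter": term_filters}},  # the filtered user base
        "aggs": {
            "transactions": {
                "nested": {"path": "transactions"},
                "aggs": {
                    "affine_types": {
                        "significant_terms": {
                            "field": "transactions.tt_name",
                            "size": size,
                        }
                    }
                },
            }
        },
    }


body = significant_types_query({"gender": "male", "relationship_status": "single"})
payload = json.dumps(body)  # ready to send to the _search endpoint
```

Each bucket returned by `significant_terms` carries a `doc_count` (foreground), a `bg_count` (background) and a `score`, so the comparison between the filtered segment and the full index comes back in one response.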

ElasticSearch Join Filter: use subquery result as filter input?
