Question

I have to store some messages in Elasticsearch from my Python program. This is how I currently try to store a message:

d = {"message": "this is message"}
for index_nr in range(1, 5):
    ElasticSearchAPI.addToIndex(index_nr, d)
    print(d)

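This means that if I have 10 messages, I have to repeat my code 10 times. So what I want to do is create a script file or batch file instead. I checked the Elasticsearch Guide, and the Bulk API can be used for this. The format should look like the following: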

{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "index1"} }
{ "doc" : {"field2" : "value2"} }

What I did is:

{"index":{"_index":"test1","_type":"message","_id":"1"}}
{"message":"it is red"}
{"index":{"_index":"test2","_type":"message","_id":"2"}}
{"message":"it is green"}

I also used the curl tool to store the documents:

$ curl -s -XPOST localhost:9200/_bulk --data-binary @message.json

Now I want to use my Python code to store the file in Elasticsearch.

Answer 1

from datetime import datetime

from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch()

actions = [
  {
    "_index": "tickets-index",
    "_type": "tickets",
    "_id": j,
    "_source": {
        "any":"data" + str(j),
        "timestamp": datetime.now()}
  }
  for j in range(0, 10)
]

helpers.bulk(es, actions)
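
Applied to the messages from the question, the same pattern might look like the sketch below. The helper names build_actions and index_messages, the index name test1, and the sample messages are assumptions made for illustration; _type is kept only because the examples in this thread target older Elasticsearch versions.

```python
def build_actions(messages, index_name, doc_type="message"):
    """Turn plain message strings into the action dicts helpers.bulk() expects."""
    return [
        {
            "_index": index_name,
            "_type": doc_type,  # as in this thread's examples (older ES versions)
            "_id": doc_id,
            "_source": {"message": text},
        }
        for doc_id, text in enumerate(messages, start=1)
    ]


def index_messages(messages, index_name):
    """Send all messages in one bulk call (needs a reachable Elasticsearch node)."""
    # Imported here so build_actions() can be tried without the client installed.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()
    return helpers.bulk(es, build_actions(messages, index_name))


# Hypothetical sample data taken from the question's message.json
actions = build_actions(["it is red", "it is green"], "test1")
```

Calling index_messages(["it is red", "it is green"], "test1") would then replace the loop of individual addToIndex calls with a single bulk request.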

Answer 2

Although @justinachen's code helped me get started with py-elasticsearch, after looking at the source code, let me make a simple improvement:

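from datetime import datetime

from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch()
j = 0
actions = []
# Build one action per document; j runs from 0 to 10 inclusive
while j <= 10:
    action = {
        "_index": "tickets-index",
        "_type": "tickets",
        "_id": j,
        "_source": {
            "any": "data" + str(j),
            "timestamp": datetime.now()
        }
    }
    actions.append(action)
    j += 1

# Send all actions with a single helper call
helpers.bulk(es, actions)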

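helpers.bulk() already does the segmentation for you. By segmentation I mean the chunks sent to the server with each request. If you want to reduce the number of documents sent per chunk (the default chunk_size is 500), do: helpers.bulk(es, actions, chunk_size=100)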
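
To picture what chunk_size does, here is a minimal sketch of the chunking idea in plain Python. It is an illustration only, not the library's own implementation, and the function name chunked is an assumption.

```python
from itertools import islice


def chunked(actions, chunk_size):
    """Yield successive lists of at most chunk_size actions."""
    iterator = iter(actions)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# 250 stand-in actions split the way chunk_size=100 would group them
sizes = [len(chunk) for chunk in chunked(range(250), 100)]
```

Each chunk corresponds to one request to the server, so a smaller chunk_size means more requests that are each smaller.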

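Some handy information to get started:

helpers.bulk() is just a wrapper around helpers.streaming_bulk(), but it accepts a list, which makes it convenient to use.

helpers.streaming_bulk() is itself built on Elasticsearch.bulk(), so you do not need to worry about which one to choose.

So in most cases, helpers.bulk() should be all you need.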
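
If you do want per-document results, helpers.streaming_bulk() yields an (ok, item) pair for each action. The sketch below shows one way to consume such pairs; the function names summarize and index_and_summarize and the sample results are assumptions for illustration, not part of the library.

```python
def summarize(results):
    """Count successes and collect failures from (ok, item) pairs."""
    succeeded, failed = 0, []
    for ok, item in results:
        if ok:
            succeeded += 1
        else:
            failed.append(item)
    return succeeded, failed


def index_and_summarize(es, actions):
    """Stream the actions to Elasticsearch and summarize the outcome."""
    from elasticsearch import helpers  # needs the elasticsearch package

    # raise_on_error=False makes failures show up as (False, item) pairs
    return summarize(helpers.streaming_bulk(es, actions, raise_on_error=False))


# Stand-in results shaped like the pairs streaming_bulk yields (made-up data)
fake_results = [
    (True, {"index": {"_id": "1", "status": 201}}),
    (False, {"index": {"_id": "2", "status": 400, "error": "made-up error"}}),
]
succeeded, failed = summarize(fake_results)
```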

Answer 3

(The other approaches mentioned in this thread build a Python list for the ES update, which is not a good solution today, especially when you need to add millions of documents to ES.)

A better approach is to use Python generators, which process the data without running out of memory or compromising much on speed.

A practical use case is adding data from an nginx log file to ES for analysis.

A generator-based skeleton for this can even run on a bare machine if needed, and you can keep extending it to tailor it to your needs quickly.
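
A minimal sketch of the generator approach for nginx logs could look like the following. The log pattern (a simplified take on the "combined" format), the index name nginx-logs, the document type doc, and the function names are all assumptions for illustration rather than the answer's original code.

```python
import re

# Assumed pattern for the start of an nginx "combined" access-log line
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+)'
)


def generate_actions(lines, index_name="nginx-logs", doc_type="doc"):
    """Lazily yield one bulk action per parsable log line."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not look like access-log entries
        yield {
            "_index": index_name,
            "_type": doc_type,  # kept for the older ES versions in this thread
            "_source": match.groupdict(),
        }


def index_log_file(path, index_name="nginx-logs"):
    """Stream a log file into Elasticsearch without loading it into memory."""
    from elasticsearch import Elasticsearch, helpers  # needs the client package

    es = Elasticsearch()
    with open(path) as log_file:
        # A file object is iterated line by line, so memory use stays flat
        return helpers.bulk(es, generate_actions(log_file, index_name))


# Made-up sample input: one valid line and one that should be skipped
sample_lines = [
    '127.0.0.1 - - [10/Oct/2020:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "curl/7.68.0"',
    "not a log line",
]
parsed = list(generate_actions(sample_lines))
```

Because helpers.bulk() pulls actions from the generator one chunk at a time, only a chunk's worth of documents is held in memory at once, however large the log file is.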

The Python Elasticsearch reference is here.

Answer 4

Defining the index name and document type with each entity:

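from elasticsearch import Elasticsearch

es_client = Elasticsearch()

# `entries` (a list of dicts with an 'id' key) and `index` (the index name)
# are assumed to be defined elsewhere
body = []
for entry in entries:
    # Action line first, then the document itself
    body.append({'index': {'_index': index, '_type': 'doc', '_id': entry['id']}})
    body.append(entry)

response = es_client.bulk(body=body)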

Providing the default index and document type with the method call:

es_client = Elasticsearch()

body = []
for entry in entries:
    body.append({'index': {'_id': entry['id']}})
    body.append(entry)

response = es_client.bulk(index='my_index', doc_type='doc', body=body)

Works with:

ES version: 6.4.0

ES Python library: 6.3.1
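
When calling the raw bulk() method, it is worth checking the response, because individual items can fail even when the request as a whole succeeds. The sketch below assumes the usual bulk response shape (a top-level errors flag plus an items list); the function names and the sample response are made up for illustration.

```python
def build_bulk_body(entries, index_name, doc_type="doc"):
    """Interleave an action line and a document for every entry."""
    body = []
    for entry in entries:
        body.append({"index": {"_index": index_name, "_type": doc_type, "_id": entry["id"]}})
        body.append(entry)
    return body


def failed_items(response):
    """Return the per-item results that carry an error."""
    if not response.get("errors"):
        return []
    failures = []
    for item in response["items"]:
        # Each item has a single key naming the action, e.g. "index"
        for result in item.values():
            if "error" in result:
                failures.append(result)
    return failures


body = build_bulk_body([{"id": 1, "message": "it is red"}], "my_index")

# Stand-in for the dict returned by es_client.bulk(body=body) (made-up data)
fake_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 400, "error": {"type": "made_up_error"}}},
    ],
}
failures = failed_items(fake_response)
```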

How to use the Bulk API to store keywords in ES with Python

4 answers: