"Elasticsearch: The Definitive Guide" Document APIs -- Update API

License: CC 4.0 BY-SA

Original article: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-update.html#_scripted_updates

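This article explains how to update documents with scripts in Elasticsearch, including incrementing a counter, adding or removing array elements, conditional operations, and partial document updates. It covers when to use the upsert, scripted_upsert, and doc_as_upsert parameters, and what options such as retry_on_conflict and timeout do.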

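The update API updates a document based on a provided script. The operation gets the document from the index (collocated with the shard), runs the script (with an optional script language and parameters), and indexes the result back (the script can also delete the document or skip the operation). It uses versioning to make sure no other update happened between the "get" and the "reindex".

Note that this operation still means a full reindex of the document; it only removes some network round trips and reduces the chance of a version conflict between the get and the index. The _source field must be enabled for this feature to work.

For example, let's index a simple document: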

PUT test/_doc/1
{
    "counter" : 1,
    "tags" : ["red"]     // tags is an array
}

Scripted updates

Now we can run a script that increments the counter:

POST test/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.counter += params.count",    // adds params.count to the counter field in _source; values passed in params are available to the script
        "lang": "painless",
        "params" : {            // per-request script parameters
            "count" : 4
        }
    }
}

Fetch the modified document:

GET test/_doc/1/

{
    "_index": "test",
    "_type": "_doc",
    "_id": "1",
    "_version": 2,      // version incremented
    "_seq_no": 1,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "counter": 5,      // existing field updated: 1 + 4 = 5
        "tags": [
            "red"
        ]
    }
}
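
The get, run-script, reindex cycle described above can be modelled in a few lines of Python. This is only a toy sketch on an in-memory dict, not how Elasticsearch is implemented; `scripted_update`, `increment_counter`, and the `store` layout are hypothetical names chosen for illustration.

```python
import copy

def scripted_update(store, doc_id, script, params):
    """Toy model of get -> run script -> reindex on an in-memory store."""
    current = store[doc_id]                               # "get" step
    ctx = {"_source": copy.deepcopy(current["_source"])}
    script(ctx, params)                                   # "run script" step
    # "reindex" step: the whole document is written back with a new version
    store[doc_id] = {"_version": current["_version"] + 1,
                     "_source": ctx["_source"]}
    return store[doc_id]

def increment_counter(ctx, params):
    # Python stand-in for the Painless script: ctx._source.counter += params.count
    ctx["_source"]["counter"] += params["count"]

store = {"1": {"_version": 1, "_source": {"counter": 1, "tags": ["red"]}}}
result = scripted_update(store, "1", increment_counter, {"count": 4})
print(result)
```

The point of the sketch is that the script only mutates a copy of `_source`; the full document is then written back, which is why the version still increases by one.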

Add a value to the tags field (because it is an array, the script can call add to append an element):

POST test/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.tags.add(params.tag)",
        "lang": "painless",
        "params" : {
            "tag" : "blue"
        }
    }
}

The updated document:

{
    "_index": "test",
    "_type": "_doc",
    "_id": "1",
    "_version": 3,
    "_seq_no": 2,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "counter": 5,
        "tags": [
            "red",
            "blue"   // the newly added element
        ]
    }
}

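We can also remove an element from tags. To avoid a runtime error, the script first checks that the element exists. Note that if the tag occurs more than once in the list, only one occurrence is removed:

POST test/_doc/1/_update
{
    "script" : {
        "source": "if (ctx._source.tags.contains(params.tag)) { ctx._source.tags.remove(ctx._source.tags.indexOf(params.tag)) }",   // check for existence first
        "lang": "painless",
        "params" : {
            "tag" : "blue"   // remove blue; only red remains afterwards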
        }
    }
}

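Fetching the document now shows that only red remains.

Counter-example: removing the element without the guard fails:

POST test/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.tags.remove(ctx._source.tags.indexOf(params.tag)) ",   // remove without checking
        "lang": "painless",
        "params" : {
            "tag" : "aaa"    // aaa is not in tags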
        }
    }
}

Response (indexOf returns -1 for a missing element, so remove(-1) throws):

{
	"error": {
		"root_cause": [{
			"type": "remote_transport_exception",
			"reason": "[wKUeBA7][10.40.164.63:9300][indices:data/write/update[s]]"
		}],
		"type": "illegal_argument_exception",
		"reason": "failed to execute script",
		"caused_by": {
			"type": "script_exception",
			"reason": "runtime error",
			"script_stack": [
				"java.util.ArrayList.elementData(ArrayList.java:422)",
				"java.util.ArrayList.remove(ArrayList.java:499)",
				"ctx._source.tags.remove(ctx._source.tags.indexOf(params.tag)) ",
				" ^---- HERE"
			],
			"script": "ctx._source.tags.remove(ctx._source.tags.indexOf(params.tag)) ",
			"lang": "painless",
			"caused_by": {
				"type": "array_index_out_of_bounds_exception",
				"reason": "-1"
			}
		}
	},
	"status": 400
}
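
To make the difference between the guarded and unguarded scripts concrete, here is a small Python sketch that imitates the Java list behaviour the Painless script relies on. `index_of` and `remove_at` are hypothetical helpers written for this illustration; they exist because Python's own `list.pop(-1)` would silently remove the last element instead of failing.

```python
def index_of(items, value):
    """Mimic java.util.List.indexOf: return -1 when the value is absent."""
    try:
        return items.index(value)
    except ValueError:
        return -1

def remove_at(items, index):
    """Mimic java.util.ArrayList.remove(int): reject out-of-range indexes."""
    if index < 0 or index >= len(items):
        raise IndexError(index)
    return items.pop(index)

def remove_tag_guarded(tags, tag):
    # Stand-in for the guarded script: check membership before removing.
    if tag in tags:
        remove_at(tags, index_of(tags, tag))
    return tags

def remove_tag_unguarded(tags, tag):
    # Stand-in for the unguarded script: fails when the tag is absent.
    remove_at(tags, index_of(tags, tag))
    return tags

print(remove_tag_guarded(["red", "blue", "blue"], "blue"))
print(remove_tag_guarded(["red"], "aaa"))
```

Only the first matching element is removed, and the guarded version leaves the list unchanged when the tag is missing, whereas `remove_tag_unguarded(["red"], "aaa")` raises `IndexError`.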

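In addition to _source, the following variables are available through the ctx map: _index, _type, _id, _version, _routing, _parent, and _now (the current timestamp).

We can also add a new field to the document: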

POST test/_doc/1/_update
{
    "script" : "ctx._source.new_field = 'value_of_new_field'"
}

Or remove a field from the document:

POST test/_doc/1/_update
{
    "script" : "ctx._source.remove('new_field')"
}

We can even change the operation that is executed. This example deletes the document if tags contains green, and otherwise does nothing (noop):

POST test/_doc/1/_update
{
    "script" : {
        "source": "if (ctx._source.tags.contains(params.tag)) { ctx.op = 'delete' } else { ctx.op = 'none' }",   
        "lang": "painless",
        "params" : {
            "tag" : "green"
        }
    }
}

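Because tags does not contain green, the script sets ctx.op to 'none' and the document is left untouched (the response reports "result": "noop"). If the parameter were red instead, the document would be deleted. Here ctx is the script's context object and ctx.op names the operation the update will perform; setting it to 'delete' makes this same _update request delete the document instead of reindexing it.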
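
The effect of ctx.op can be sketched as a small dispatch in Python. This is a simplified model under the assumption that the operation defaults to index; `apply_update` and `delete_if_tagged` are hypothetical names, and the store is a plain dict of document id to source.

```python
import copy

def apply_update(store, doc_id, script, params):
    """Run the script, then act on whatever operation it selected."""
    ctx = {"op": "index", "_source": copy.deepcopy(store[doc_id])}
    script(ctx, params)
    if ctx["op"] == "delete":
        del store[doc_id]              # the update turns into a delete
        return "deleted"
    if ctx["op"] == "none":
        return "noop"                  # nothing is written
    store[doc_id] = ctx["_source"]     # default: reindex the modified source
    return "updated"

def delete_if_tagged(ctx, params):
    # Stand-in for the Painless script that sets ctx.op from the tags check.
    if params["tag"] in ctx["_source"]["tags"]:
        ctx["op"] = "delete"
    else:
        ctx["op"] = "none"

store = {"1": {"counter": 5, "tags": ["red"]}}
print(apply_update(store, "1", delete_if_tagged, {"tag": "green"}))
print(apply_update(store, "1", delete_if_tagged, {"tag": "red"}))
```

The first call leaves the document in place and the second removes it, mirroring the green and red cases discussed above.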

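Updates with a partial document

See the related article for more on using partial document updates.

The update API also supports passing a partial document, which is merged into the existing document (a simple recursive merge: inner objects are merged, while plain key/value pairs and arrays are replaced). For example: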

POST test/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    }
}
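
The merge rule just described (objects merge recursively, scalars and arrays are replaced) can be sketched in Python as follows. This is an illustration of the rule, not Elasticsearch's own code; `merge_partial` and the `user` field are hypothetical.

```python
def merge_partial(source, partial):
    """Return a new dict with `partial` merged into `source`."""
    merged = dict(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # inner objects are merged key by key
            merged[key] = merge_partial(merged[key], value)
        else:
            # scalars and arrays are replaced wholesale
            merged[key] = value
    return merged

existing = {"counter": 5, "tags": ["red"], "user": {"first": "a", "last": "b"}}
partial = {"name": "new_name", "tags": ["blue"], "user": {"last": "c"}}
print(merge_partial(existing, partial))
```

Note that the tags array is replaced rather than concatenated, which is why appending to an array needs a script rather than a partial document.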

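If both doc and script are specified, doc is ignored. It is best to put the field pairs of the partial document into the script itself.

Note that a partial update requires both the index and the document to exist already. If either is missing, the request fails with an exception, unless an upsert is supplied as described in the Upserts section.

Note: if the URL does not end in _update, the request replaces the stored document entirely, so any field not present in the request body is lost.
test/_doc/1/_update is a partial update; test/_doc/1/ is a full overwrite.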

Detecting noop updates

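For details, see the separate article on Detecting noop updates. In short, a partial update that would not change the document is detected by default and returns "result": "noop"; set "detect_noop": false in the request body to disable this.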

Upserts

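If the specified document does not exist, the content of upsert is indexed as a new document.
If the specified document exists, the partial update given by doc or script is applied.
In other words, exactly one of the two branches runs for a given request, and upsert is paired with either doc or script (doc + upsert, or script + upsert).

First delete the test index, then run the following request:

POST test/_doc/1/_update
{
    "script" : {           // not executed, because the document does not exist
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    },
    "upsert" : {
        "counter" : 1      // indexed as the new document, because the document does not exist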
    }
}

The response shows that a new document was created:

{
    "_index": "test",
    "_type": "_doc",
    "_id": "1",
    "_version": 1,
    "result": "created",     // same result as an index request that creates a document
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 0,
    "_primary_term": 1
}

Fetch the document with GET:

{
    "_index": "test",
    "_type": "_doc",
    "_id": "1",
    "_version": 1,
    "_seq_no": 0,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "counter": 1    // matches the content of upsert
    }
}
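
The two branches of an upsert can be summarised with a short Python sketch. It is a simplified model on an in-memory dict; `update_with_upsert` and `increment_counter` are hypothetical names used only for this illustration.

```python
import copy

def update_with_upsert(store, doc_id, script, params, upsert):
    if doc_id not in store:
        # Document missing: index the upsert content as-is, skip the script.
        store[doc_id] = copy.deepcopy(upsert)
        return "created"
    # Document exists: run the script against the stored source.
    ctx = {"_source": store[doc_id]}
    script(ctx, params)
    return "updated"

def increment_counter(ctx, params):
    # Stand-in for: ctx._source.counter += params.count
    ctx["_source"]["counter"] += params["count"]

store = {}
print(update_with_upsert(store, "1", increment_counter, {"count": 4}, {"counter": 1}))
print(store)
print(update_with_upsert(store, "1", increment_counter, {"count": 4}, {"counter": 1}))
print(store)
```

Sending the same request twice therefore first creates the document with counter 1 and then increments it to 5, which makes the request safe to use when the caller does not know whether the document exists.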

scripted_upsert

If you want the script to run whether or not the document exists, set the scripted_upsert parameter to true.

POST sessions/session/dh3sgudg8gsrgl/_update
{
    "scripted_upsert":true,
    "script" : {
        "id": "my_web_session_summariser",
        "params" : {
            "pageViewEvent" : {
                "url":"foo.com/bar",
                "response":404,
                "time":"2014-01-01 12:32"
            }
        }
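    },
    "upsert" : {}   // initial document passed to the script as ctx._source when the document does not exist
}

Running this example as-is fails, but the request is not contradictory. With scripted_upsert set to true, the upsert content is not indexed directly; it is the starting document that the script receives when the document does not exist yet, so the script (rather than the upsert body) decides what the new document looks like. The failure below occurs only because the stored script my_web_session_summariser has not been created in this cluster.

Response: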

{
    "error": {
        "root_cause": [
            {
                "type": "remote_transport_exception",
                "reason": "[wKUeBA7][10.40.164.63:9300][indices:data/write/update[s]]"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "failed to execute script",
        "caused_by": {
            "type": "resource_not_found_exception",
            "reason": "unable to find script [my_web_session_summariser] in cluster state"
        }
    },
    "status": 400
}
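
The following Python sketch shows how scripted_upsert changes the missing-document branch. It is a simplified model; `update` is a hypothetical function, and `summarise` is an invented stand-in for the stored script my_web_session_summariser, whose real body is not shown in this article.

```python
import copy

def update(store, doc_id, script, params, upsert, scripted_upsert=False):
    exists = doc_id in store
    if not exists and not scripted_upsert:
        store[doc_id] = copy.deepcopy(upsert)   # plain upsert: no script run
        return "created"
    # Existing document, or scripted upsert: the script always runs.
    start = store[doc_id] if exists else upsert
    ctx = {"_source": copy.deepcopy(start)}
    script(ctx, params)
    store[doc_id] = ctx["_source"]
    return "updated" if exists else "created"

def summarise(ctx, params):
    # Hypothetical summariser: collect the URLs of page view events.
    views = ctx["_source"].setdefault("page_views", [])
    views.append(params["pageViewEvent"]["url"])

store = {}
event = {"pageViewEvent": {"url": "foo.com/bar"}}
print(update(store, "s1", summarise, event, {}, scripted_upsert=True))
print(store)
```

With an empty upsert and scripted_upsert enabled, the script builds the first version of the document itself, so the create path and the update path share one piece of logic.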

doc_as_upsert

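Setting doc_as_upsert to true uses the content of doc as the upsert value, so you do not have to send the same structure twice as both doc and upsert.

Here is an example: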

POST test/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    },
    "doc_as_upsert" : true     // without this line the request fails with document_missing_exception when document 1 does not exist
}

This is equivalent to:

POST test/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    },
    "upsert" : {
        "name" : "new_name"
    }
}

Parameters

Besides the parameters introduced above, the following parameters can also be used with _update.

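retry_on_conflict
The number of times to retry the update after a version conflict. A conflict arises like this:

(1) Clients A and B fetch the same document at almost the same time and both obtain its version, say _version=1.

(2) Client A modifies part of the document and writes the change to the index.

(3) When writing, Elasticsearch compares the version submitted by client A (still 1) with the version of the stored document (also 1). They match, so the write is performed and the version becomes _version=2.

(4) Client B also modifies part of the document, but its write arrives slightly later. Step (3) runs again: Elasticsearch finds that client B submitted version 1 while the stored document is at version 2, so there is a conflict and this partial update fails.

(5) After the failed update, steps (1) to (3) are repeated for client B's request; retry_on_conflict is the maximum number of such retries.

routing
Routes the update request to the correct shard and, if the document does not exist, sets the routing for the upsert request. It cannot be used to change the routing of an existing document.

timeout
How long to wait for a shard to become available.

wait_for_active_shards
The number of shard copies that must be active before the update operation proceeds.

refresh
Controls when the changes made by this request become visible to search.

_source
Controls whether and how the updated source is returned in the response. By default the updated source is not returned.

version
The update API uses versioning internally to make sure the document has not changed between fetching it and writing the update. You can also pass a specific version so that the document is updated only if its current version matches. Only internal versioning is supported by the update API; the external and force version types cannot be used with it.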
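
The conflict-and-retry sequence described under retry_on_conflict can be simulated with a short Python sketch of optimistic concurrency control. It is a toy model, not Elasticsearch code; `Store`, `VersionConflict`, `update`, and the `before_write` hook (used only to inject a competing write) are hypothetical.

```python
class VersionConflict(Exception):
    pass

class Store:
    def __init__(self, source):
        self.version, self.source = 1, dict(source)

    def get(self):
        return self.version, dict(self.source)

    def index(self, source, expected_version):
        # The write succeeds only if nobody changed the document since the get.
        if expected_version != self.version:
            raise VersionConflict(f"expected {expected_version}, found {self.version}")
        self.version, self.source = self.version + 1, dict(source)

def update(store, change, retry_on_conflict=0, before_write=None):
    for attempt in range(retry_on_conflict + 1):
        version, source = store.get()      # step (1): fetch document and version
        change(source)                     # step (2): modify it
        if before_write:
            before_write(attempt)          # test hook: simulate another client
        try:
            store.index(source, version)   # step (3): versioned write
            return attempt                 # number of retries that were needed
        except VersionConflict:
            if attempt == retry_on_conflict:
                raise                      # retries exhausted: report the conflict

def bump(source):
    source["counter"] += 1

store = Store({"counter": 1})

def client_a_writes_first(attempt):
    if attempt == 0:                       # client A sneaks in once
        version, source = store.get()
        source["counter"] += 10
        store.index(source, version)

retries = update(store, bump, retry_on_conflict=1, before_write=client_a_writes_first)
print(retries, store.version, store.source)
```

With retry_on_conflict=1 the second attempt re-reads client A's result and succeeds; with the default of 0 the same scenario raises `VersionConflict`, which is the failure described in step (4).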