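Reposted from 霸爷's blog:
Reposted from 系统技术非业余研究
Original link: Erlang节点重启导致的incarnation问题 (The incarnation problem caused by restarting an Erlang node)
I ran into the following problem: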
=ERROR REPORT==== 10-Mar-2016::09:44:07 ===
Discarding message {'$gen_cast',close_be_covered} from <0.771.0> to <0.774.0> in an old incarnation (2) of this node (3)
Searching online led me to 霸爷's blog post, which was eye-opening on this problem.
Tonight mingchaoyan asked the following question online:
=ERROR REPORT==== 2013-06-28 19:57:53 ===
Discarding message {send,<<19 bytes>>} from <0.86.1> to <0.6743.0> in an old incarnation (1) of this node (2)

=ERROR REPORT==== 2013-06-28 19:57:55 ===
Discarding message {send,<<22 bytes>>} from <0.1623.1> to <0.6743.0> in an old incarnation (1) of this node (2)
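After we updated our servers at noon, the logs filled up with these errors. Have you run into anything similar? Or could you suggest some ideas for locating and fixing the problem? Thanks.
This question is rather interesting. Starting from the log message and reading the source code, we can quickly find the place that prints it: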
do_send(Process *p, Eterm to, Eterm msg, int suspend) {
    /* ... */
    } else if (is_external_pid(to)) {
        dep = external_pid_dist_entry(to);
        if (dep == erts_this_dist_entry) {
            erts_dsprintf_buf_t *dsbufp = erts_create_logger_dsbuf();
            erts_dsprintf(dsbufp,
                          "Discarding message %T from %T to %T in an old "
                          "incarnation (%d) of this node (%d)\n",
                          /* the three %T arguments are elided in this excerpt */
                          external_pid_creation(to),
                          erts_this_node->creation);
            erts_send_error_to_logger(p->group_leader, dsbufp);
            /* ... */
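Both of the following conditions must hold to trigger this warning:
1. The target pid must be an external pid.
2. The dist_entry of the external node that the pid belongs to is the same as the dist_entry of the current node.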
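The two conditions can be sketched outside the emulator as follows. This is a simplified model, not the ERTS implementation: `struct node_ref`, `enum send_route` and `classify_send` are hypothetical names, and the sketch assumes that a pid counts as external whenever its node name or its creation differs from the local node's, while the dist_entry comparison depends on the node name alone.

```c
#include <string.h>

/* Hypothetical, simplified identity of the node a pid belongs to. */
struct node_ref {
    const char *sysname;   /* node name, e.g. "a@host" */
    unsigned creation;     /* incarnation number of that node */
};

enum send_route {
    SEND_LOCAL,                   /* internal pid: deliver locally */
    SEND_REMOTE,                  /* external pid on another node */
    SEND_DISCARD_OLD_INCARNATION  /* external pid, but same node name */
};

/* Classify a send from node `self` to a pid whose node identity is `to`. */
enum send_route classify_send(struct node_ref self, struct node_ref to)
{
    int same_name = strcmp(self.sysname, to.sysname) == 0;

    if (same_name && self.creation == to.creation)
        return SEND_LOCAL;
    if (same_name)                /* condition 1 and condition 2 both hold */
        return SEND_DISCARD_OLD_INCARNATION;
    return SEND_REMOTE;
}
```

Under this model, a pid created by incarnation 2 of a node and used after the node has come back as incarnation 3 lands in the discard branch, which matches the first log entry above.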
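Searching Google, I found a question that closely matches this description: see here. Its author describes and reproduces the phenomenon well, but does not explain the actual cause.
OK, let's follow his approach and reproduce the problem.
Before the demonstration, let's go over the basics. First we need to understand the format of a pid:
See this article:
The key points about pids are excerpted below: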
Printed process ids < A.B.C > are composed of [6]:
A, the node number (0 is the local node, an arbitrary number for a remote node)
B, the first 15 bits of the process number (an index into the process table) [7]
C, bits 16-18 of the process number (the same process number as B) [7]
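To make the three printed components concrete, here is a small sketch that splits a printed pid such as the ones in the logs above into A, B and C. The names `struct printed_pid` and `parse_printed_pid` are hypothetical, and the sketch only handles the textual form; it says nothing about how the emulator stores a pid internally.

```c
#include <stdio.h>

/* The three numbers of a printed pid <A.B.C>. */
struct printed_pid {
    unsigned a;   /* node number (0 for the local node) */
    unsigned b;   /* low bits of the process number */
    unsigned c;   /* high bits of the process number */
};

/* Returns 1 if `text` has the form "<A.B.C>", 0 otherwise. */
int parse_printed_pid(const char *text, struct printed_pid *out)
{
    char close = 0;
    int fields = sscanf(text, "<%u.%u.%u%c",
                        &out->a, &out->b, &out->c, &close);

    return fields == 4 && close == '>';
}
```

For the target pid in the second log, `parse_printed_pid("<0.6743.0>", &p)` yields A = 0, B = 6743 and C = 0.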
Also see section 9.10 of the Erlang External Term Format documentation,
which describes the layout of PID_EXT:
Bytes:  1    N     4   4       1
Field:  103  Node  ID  Serial  Creation
Table 9.16
Encode a process identifier object (obtained from spawn/3 or friends). The ID and Creation fields works just like in REFERENCE_EXT, while the Serial field is used to improve safety. In ID, only 15 bits are significant; the rest should be 0.
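The layout above can be turned into a small decoder. This is a minimal sketch under stated assumptions: `struct pid_ext`, `read_be32` and `decode_pid_ext` are hypothetical names, integers are taken to be big-endian, and the Node field is assumed to be an atom encoded with one of the tags 100 (ATOM_EXT), 115 (SMALL_ATOM_EXT), 118 (ATOM_UTF8_EXT) or 119 (SMALL_ATOM_UTF8_EXT). Atom cache references are not handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decoded fields of one PID_EXT term. */
struct pid_ext {
    char node[256];     /* node name, NUL-terminated */
    uint32_t id;
    uint32_t serial;
    uint8_t creation;
};

static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns the number of bytes consumed, or 0 if the input is malformed. */
size_t decode_pid_ext(const uint8_t *buf, size_t len, struct pid_ext *out)
{
    size_t pos = 0;
    size_t atom_len;
    uint8_t atom_tag;

    if (len < 2 || buf[pos++] != 103)       /* 103 is the PID_EXT tag */
        return 0;

    /* Node: tags 100/118 use a 2-byte length, 115/119 a 1-byte length. */
    atom_tag = buf[pos++];
    if (atom_tag == 100 || atom_tag == 118) {
        if (len - pos < 2)
            return 0;
        atom_len = ((size_t)buf[pos] << 8) | buf[pos + 1];
        pos += 2;
    } else if (atom_tag == 115 || atom_tag == 119) {
        if (len - pos < 1)
            return 0;
        atom_len = buf[pos++];
    } else {
        return 0;
    }

    /* The name must fit, and 4 + 4 + 1 bytes must follow it. */
    if (atom_len >= sizeof out->node || len - pos < atom_len + 9)
        return 0;
    memcpy(out->node, buf + pos, atom_len);
    out->node[atom_len] = '\0';
    pos += atom_len;

    out->id = read_be32(buf + pos);
    out->serial = read_be32(buf + pos + 4);
    out->creation = buf[pos + 8];
    return pos + 9;
}
```

The last byte read is the Creation field, the value that shows up as the "old incarnation" number in the warning.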
We can see a field named Creation. Why have we never noticed it before?
From the Erlang documentation we learn:
creation
Returns the creation of the local node as an integer. The creation is changed when a node is restarted. The creation of a node is stored in process identifiers, port identifiers, and references. This makes it (to some extent) possible to distinguish between identifiers from different incarnations of a node. Currently valid creations are integers in the range 1..3, but this may (probably will) change in the future. If the node is not alive, 0 is returned.
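The phrase "to some extent" matters because only three values are available. The sketch below illustrates why, assuming that each restart moves the creation to the next value in a 1, 2, 3 cycle. That successor rule is an assumption made for illustration (it is consistent with the logs above, where incarnation 2 precedes 3 and 1 precedes 2), not something stated by the quoted documentation; `next_creation` and `creation_after_restarts` are hypothetical names.

```c
/* Assumed successor rule: 1 -> 2 -> 3 -> 1. */
unsigned next_creation(unsigned creation)
{
    return creation % 3 + 1;
}

/* Creation of a node that started at `start` and was restarted `restarts` times. */
unsigned creation_after_restarts(unsigned start, unsigned restarts)
{
    unsigned creation = start;

    while (restarts-- > 0)
        creation = next_creation(creation);
    return creation;
}
```

Under this assumption a pid from incarnation 1 can be told apart from the node after one or two restarts, but after the third restart the creation is 1 again and the stale pid is no longer distinguishable by creation alone.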
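Tracing where the creation comes from, we find that it originates in epmd. More precisely, every time a node starts it registers its name with epmd, and epmd returns the creation to the node. net_kernel then records this creation in the node's erts_this_dist_entry->creation through the setnode BIF (erlang:setnode/2):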
erts_set_this_node(Eterm sysname, Uint creation)
{
    /* ... */
    erts_this_dist_entry->sysname = sysname;
    erts_this_dist_entry->creation = creation;
    /* ... */