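[Redis 6.0.8] Source Code Analysis of Redis Master-Slave Replication (Part 2)

This article walks through how the PSYNC command is sent and received during Redis master-slave replication: how slaveTryPartialResynchronization is used, how the master handles a PSYNC request from a slave, and the state transitions and partial versus full resynchronization strategies involved.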

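0. Reading references

Analysis of the Redis master-slave replication mechanism

Redis source code analysis, part 16: replication

Explains the meaning of enum values such as PSYNC_CONTINUE

Explains how slaveTryPartialResynchronization is used

1974 Master-slave replication: analysis of the master node flow (rdbSaveToSlavesSockets and related content)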

1. Continuing the slave-side source analysis, and the master-side source analysis

1.1 The slave sends the PSYNC command to the master

1.1.1 Issuing PSYNC in syncWithMaster

/* When the replication state is REPL_STATE_SEND_PSYNC, try a partial resynchronization */

void syncWithMaster(connection *conn) {
    char tmpfile[256], *err = NULL;
    ...
    /* Try a partial resynchronization. If we don't have a cached master
     * slaveTryPartialResynchronization() will at least try to use PSYNC
     * to start a full resynchronization so that we get the master run id
     * and the global offset, to try a partial resync at the next
     * reconnection attempt. */
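    /*
      In short: even without a cached master, slaveTryPartialResynchronization() still sends
      PSYNC (which ends in a full resync), so the slave learns the master's run id and global
      replication offset and can try a partial resync on the next reconnection.
    */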
    if (server.repl_state == REPL_STATE_SEND_PSYNC) {
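        /* Send PSYNC to the master. The second argument 0 selects the write half:
           the command is only written to the socket and the reply is not read here;
           the reply is read by a later call with read_reply set to 1.
        */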
        if (slaveTryPartialResynchronization(conn,0) == PSYNC_WRITE_ERROR) {
            /* On PSYNC_WRITE_ERROR, record the error message and jump to write_error */
            err = sdsnew("Write error sending the PSYNC command.");
            goto write_error;
        }
        server.repl_state = REPL_STATE_RECEIVE_PSYNC;
        return;
    }
   ...
}

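The slave attempts a partial resynchronization. If server.cached_master is not NULL, the command sent to the master is "PSYNC <server.cached_master->replid> <server.cached_master->reploff+1>"; if it is NULL, the command is "PSYNC ? -1". The slave then sets server.repl_state to REPL_STATE_RECEIVE_PSYNC and waits for the master's reply. After this step the slave's replication state has changed as follows:

  • REPL_STATE_SEND_PSYNC--->REPL_STATE_RECEIVE_PSYNC

Note: PSYNC ? -1 requests a full resynchronization.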
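The choice between the two command forms can be sketched in isolation. The following is a minimal illustration, not Redis code: the cached_master_info struct and the format_psync_command function are hypothetical names introduced here, and the real implementation sends the arguments through sendSynchronousCommand rather than building a single string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the cached master state;
 * the real client struct in Redis has many more fields. */
typedef struct {
    char replid[41];      /* 40-character replication ID plus '\0' */
    long long reploff;    /* last replication offset already processed */
} cached_master_info;

/* Format the PSYNC line a slave would send. Pass NULL for 'cm' when
 * there is no cached master. Returns what snprintf returns: the number
 * of characters needed, excluding the terminating '\0'. */
int format_psync_command(char *out, size_t outlen,
                         const cached_master_info *cm) {
    assert(out != NULL && outlen > 0);
    if (cm != NULL) {
        /* Ask for the first byte not seen yet: reploff + 1. */
        return snprintf(out, outlen, "PSYNC %s %lld",
                        cm->replid, cm->reploff + 1);
    }
    /* No cached master: "?" and -1 force a full resynchronization. */
    return snprintf(out, outlen, "PSYNC ? -1");
}
```

With a cached master whose reploff is 99, the sketch asks for offset 100, which matches the reploff+1 rule described above.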

1.1.2 slaveTryPartialResynchronization

/* Try a partial resynchronization with the master if we are about to reconnect.
 * If there is no cached master structure, at least try to issue a
 * "PSYNC ? -1" command in order to trigger a full resync using the PSYNC
 * command in order to obtain the master run id and the master replication
 * global offset.
 *
 * This function is designed to be called from syncWithMaster(), so the
 * following assumptions are made:
 *
 * 1) We pass the function an already connected socket "fd".
 * 2) This function does not close the file descriptor "fd". However in case
 *    of successful partial resynchronization, the function will reuse
 *    'fd' as file descriptor of the server.master client structure.
 *
 * The function is split in two halves: if read_reply is 0, the function
 * writes the PSYNC command on the socket, and a new function call is
 * needed, with read_reply set to 1, in order to read the reply of the
 * command. This is useful in order to support non blocking operations, so
 * that we write, return into the event loop, and read when there are data.
 *
 * When read_reply is 0 the function returns PSYNC_WRITE_ERROR if there
 * was a write error, or PSYNC_WAIT_REPLY to signal we need another call
 * with read_reply set to 1. However even when read_reply is set to 1
 * the function may return PSYNC_WAIT_REPLY again to signal there were
 * insufficient data to read to complete its work. We should re-enter
 * into the event loop and wait in such a case.
 *
 * The function returns:
 *
 * PSYNC_CONTINUE: If the PSYNC command succeeded and we can continue.
 * PSYNC_FULLRESYNC: If PSYNC is supported but a full resync is needed.
 *                   In this case the master run_id and global replication
 *                   offset is saved.
 * PSYNC_NOT_SUPPORTED: If the server does not understand PSYNC at all and
 *                      the caller should fall back to SYNC.
 * PSYNC_WRITE_ERROR: There was an error writing the command to the socket.
 * PSYNC_WAIT_REPLY: Call again the function with read_reply set to 1.
 * PSYNC_TRY_LATER: Master is currently in a transient error condition.
 *
 * Notable side effects:
 *
 * 1) As a side effect of the function call the function removes the readable
 *    event handler from "fd", unless the return value is PSYNC_WAIT_REPLY.
 * 2) server.master_initial_offset is set to the right value according
 *    to the master reply. This will be used to populate the 'server.master'
 *    structure replication offset.
 */

#define PSYNC_WRITE_ERROR 0
#define PSYNC_WAIT_REPLY 1
#define PSYNC_CONTINUE 2
#define PSYNC_FULLRESYNC 3
#define PSYNC_NOT_SUPPORTED 4
#define PSYNC_TRY_LATER 5
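/*
PSYNC_WRITE_ERROR   the slave hit an error while writing the command to the socket
PSYNC_WAIT_REPLY    the master has not replied yet; the function is called again on the next readable event
PSYNC_CONTINUE      partial resynchronization starts; syncWithMaster is not called again afterwards
PSYNC_FULLRESYNC    the master answered with a full resync; before returning PSYNC_FULLRESYNC the function
                    has already removed the readable event handler from this connection, and
                    syncWithMaster is not called again afterwards
PSYNC_NOT_SUPPORTED the master does not support PSYNC; the slave then sends SYNC and starts a full synchronization
PSYNC_TRY_LATER     the master is in a transient error state; try again later
*/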

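/*
The first argument is the connection to the master.
The second argument selects which half to run: 0 = write, 1 = read.
The function lets the slave find out whether a partial resynchronization is possible.
slaveTryPartialResynchronization is used in two situations, one per half:
(1) sending PSYNC and then waiting for the reply (write half, second argument 0);
(2) reading the master's reply to PSYNC (read half, second argument 1).
*/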
int slaveTryPartialResynchronization(connection *conn, int read_reply) {
    char *psync_replid;
    char psync_offset[32];
    sds reply;

    /* Writing half */
    /* Write half: send the PSYNC command */
    if (!read_reply) {
        /* Initially set master_initial_offset to -1 to mark the current
         * master run_id and offset as not valid. Later if we'll be able to do
         * a FULL resync using the PSYNC command we'll set the offset at the
         * right value, so that this information will be propagated to the
         * client structure representing the master into server.master. */
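        /*
           Setting master_initial_offset to -1 marks the current master run_id and offset as invalid.
           If a full resync via PSYNC follows, the correct offset is stored here and later
           propagated to the client structure that represents the master (server.master).
         */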
        server.master_initial_offset = -1;
        
        /* If server.cached_master is not NULL, use its replid and offset to attempt a partial resync */
        if (server.cached_master) {
            psync_replid = server.cached_master->replid;
            snprintf(psync_offset,sizeof(psync_offset),"%lld", server.cached_master->reploff+1);
            serverLog(LL_NOTICE,"Trying a partial resynchronization (request %s:%s).", psync_replid, psync_offset);
        } 
        else 
        /* If server.cached_master is NULL, a partial resync is impossible:
           set psync_replid to "?" and psync_offset to "-1".
           With no cached master, "PSYNC ? -1" is sent to request a full resync.
        */
        {
            serverLog(LL_NOTICE,"Partial resynchronization not possible (no cached master)");
            psync_replid = "?";
            memcpy(psync_offset,"-1",3);
        }

        /* Issue the PSYNC command */
        /* Write the PSYNC command to the connection */
        reply = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"PSYNC",psync_replid,psync_offset,NULL);
        if (reply != NULL) {
            serverLog(LL_WARNING,"Unable to send PSYNC to master: %s",reply);
            sdsfree(reply);
            connSetReadHandler(conn, NULL);
            return PSYNC_WRITE_ERROR;
        }
        return PSYNC_WAIT_REPLY;
    }

    /* Reading half */
    reply = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);
    if (sdslen(reply) == 0) {
        /* The master may send empty newlines after it receives PSYNC
         * and before to reply, just to keep the connection alive. */
        sdsfree(reply);
        return PSYNC_WAIT_REPLY;
    }

    connSetReadHandler(conn, NULL);

    if (!strncmp(reply,"+FULLRESYNC",11)) {
        char *replid = NULL, *offset = NULL;

        /* FULL RESYNC, parse the reply in order to extract the run id
         * and the replication offset. */
        replid = strchr(reply,' ');
        if (replid) {
            replid++;
            offset = strchr(replid,' ');
            if (offset) offset++;
        }
        if (!replid || !offset || (offset-replid-1) != CONFIG_RUN_ID_SIZE) {
            serverLog(LL_WARNING,
                "Master replied with wrong +FULLRESYNC syntax.");
            /* This is an unexpected condition, actually the +FULLRESYNC
             * reply means that the master supports PSYNC, but the reply
             * format seems wrong. To stay safe we blank the master
             * replid to make sure next PSYNCs will fail. */
            memset(server.master_replid,0,CONFIG_RUN_ID_SIZE+1);
        } else {
            memcpy(server.master_replid, replid, offset-replid-1);
            server.master_replid[CONFIG_RUN_ID_SIZE] = '\0';
            server.master_initial_offset = strtoll(offset,NULL,10);
            serverLog(LL_NOTICE,"Full resync from master: %s:%lld",
                server.master_replid,
                server.master_initial_offset);
        }
        /* We are going to full resync, discard the cached master structure. */
        replicationDiscardCachedMaster();
        sdsfree(reply);
        return PSYNC_FULLRESYNC;
    }

    if (!strncmp(reply,"+CONTINUE",9)) {
        /* Partial resync was accepted. */
        serverLog(LL_NOTICE,
            "Successful partial resynchronization with master.");

        /* Check the new replication ID advertised by the master. If it
         * changed, we need to set the new ID as primary ID, and set or
         * secondary ID as the old master ID up to the current offset, so
         * that our sub-slaves will be able to PSYNC with us after a
         * disconnection. */
        char *start = reply+10;
        char *end = reply+9;
        while(end[0] != '\r' && end[0] != '\n' && end[0] != '\0') end++;
        if (end-start == CONFIG_RUN_ID_SIZE) {
            char new[CONFIG_RUN_ID_SIZE+1];
            memcpy(new,start,CONFIG_RUN_ID_SIZE);
            new[CONFIG_RUN_ID_SIZE] = '\0';

            if (strcmp(new,server.cached_master->replid)) {
                /* Master ID changed. */
                serverLog(LL_WARNING,"Master replication ID changed to %s",new);

                /* Set the old ID as our ID2, up to the current offset+1. */
                memcpy(server.replid2,server.cached_master->replid,
                    sizeof(server.replid2));
                server.second_replid_offset = server.master_repl_offset+1;

                /* Update the cached master ID and our own primary ID to the
                 * new one. */
                memcpy(server.replid,new,sizeof(server.replid));
                memcpy(server.cached_master->replid,new,sizeof(server.replid));

                /* Disconnect all the sub-slaves: they need to be notified. */
                disconnectSlaves();
            }
        }

        /* Setup the replication to continue. */
        sdsfree(reply);
        replicationResurrectCachedMaster(conn);

        /* If this instance was restarted and we read the metadata to
         * PSYNC from the persistence file, our replication backlog could
         * be still not initialized. Create it. */
        if (server.repl_backlog == NULL) createReplicationBacklog();
        return PSYNC_CONTINUE;
    }

    /* If we reach this point we received either an error (since the master does
     * not understand PSYNC or because it is in a special state and cannot
     * serve our request), or an unexpected reply from the master.
     *
     * Return PSYNC_NOT_SUPPORTED on errors we don't understand, otherwise
     * return PSYNC_TRY_LATER if we believe this is a transient error. */

    if (!strncmp(reply,"-NOMASTERLINK",13) ||
        !strncmp(reply,"-LOADING",8))
    {
        serverLog(LL_NOTICE,
            "Master is currently unable to PSYNC "
            "but should be in the future: %s", reply);
        sdsfree(reply);
        return PSYNC_TRY_LATER;
    }

    if (strncmp(reply,"-ERR",4)) {
        /* If it's not an error, log the unexpected event. */
        serverLog(LL_WARNING,
            "Unexpected reply to PSYNC from master: %s", reply);
    } else {
        serverLog(LL_NOTICE,
            "Master does not support PSYNC or is in "
            "error state (reply: %s)", reply);
    }
    sdsfree(reply);
    replicationDiscardCachedMaster();
    return PSYNC_NOT_SUPPORTED;
}
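The +FULLRESYNC branch of the read half depends on a small piece of string parsing: find the two spaces, check that the ID between them has the expected length, then convert the offset. The sketch below isolates that step under stated assumptions: parse_fullresync and REPLID_LEN are hypothetical names introduced for illustration (REPLID_LEN mirrors CONFIG_RUN_ID_SIZE, which is 40), and the function reports a malformed reply with -1 instead of blanking server.master_replid as the real code does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define REPLID_LEN 40  /* mirrors CONFIG_RUN_ID_SIZE in Redis */

/* Parse "+FULLRESYNC <replid> <offset>". On success copy the ID into
 * replid_out (which must hold REPLID_LEN + 1 bytes), store the offset
 * in *offset_out and return 0. Return -1 if the reply is malformed. */
int parse_fullresync(const char *reply, char *replid_out,
                     long long *offset_out) {
    assert(reply != NULL && replid_out != NULL && offset_out != NULL);
    if (strncmp(reply, "+FULLRESYNC", 11) != 0) return -1;

    const char *replid = strchr(reply, ' ');
    if (replid == NULL) return -1;
    replid++;                         /* skip the first space */

    const char *offset = strchr(replid, ' ');
    if (offset == NULL) return -1;
    offset++;                         /* skip the second space */

    /* The ID between the two spaces must be exactly REPLID_LEN long. */
    if ((offset - replid - 1) != REPLID_LEN) return -1;

    memcpy(replid_out, replid, REPLID_LEN);
    replid_out[REPLID_LEN] = '\0';
    *offset_out = strtoll(offset, NULL, 10);
    return 0;
}
```

The length check is the same guard as (offset-replid-1) != CONFIG_RUN_ID_SIZE in the listing above: a reply with a short or missing ID is rejected before anything is copied.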

1.2 The master executes the PSYNC command from the slave

1.2.1 A first look at syncCommand

1. In redisCommandTable, the psync command is dispatched to the syncCommand function
struct redisCommand redisCommandTable[] = {   
    ...
    {"sync",syncCommand,1,
     "admin no-script",
     0,NULL,0,0,0,0,0,0},

    {"psync",syncCommand,3,
     "admin no-script",
     0,NULL,0,0,0,0,0,0},
    ...
}

2. The implementation of syncCommand
/* SYNC and PSYNC command implementation. */
void syncCommand(client *c) {
    /* ignore SYNC if already slave or in monitor mode */
    /* If this client is already flagged as a slave (it has already issued SYNC/PSYNC), ignore the command */
    if (c->flags & CLIENT_SLAVE) return;

    /* Refuse SYNC requests if we are a slave but the link with our master
     * is not ok... */
    /* If this instance is itself a slave and the link to its own master is not established, refuse the SYNC request */
    if (server.masterhost && server.repl_state != REPL_STATE_CONNECTED) {
        addReplySds(c,sdsnew("-NOMASTERLINK Can't SYNC while not connected with my master\r\n"));
        return;
    }

    /* SYNC can't be issued when the server has pending data to send to
     * the client about already issued commands. We need a fresh reply
     * buffer registering the differences between the BGSAVE and the current
     * dataset, so that we can copy to other slaves if needed. */
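    /*
       SYNC cannot be served while replies to earlier commands are still queued for this client.
       The output buffer must start empty so that it records only the differences between the
       BGSAVE snapshot and the current dataset, and can then be copied to other slaves if needed.
    */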

    if (clientHasPendingReplies(c)) {
        addReplyError(c,"SYNC and PSYNC are invalid with pending output");
        return;
    }

    serverLog(LL_NOTICE,"Replica %s asks for synchronization",
        replicationGetSlaveName(c));

    /* Try a partial resynchronization if this is a PSYNC command.
     * If it fails, we continue with usual full resynchronization, however
     * when this happens masterTryPartialResynchronization() already
     * replied with:
     *
     * +FULLRESYNC <replid> <offset>
     *
     * So the slave knows the new replid and offset to try a PSYNC later
     * if the connection with the master is lost. */
     /*
       For PSYNC, first try a partial resynchronization. If that fails, fall through to the
       usual full resynchronization; the slave is then told +FULLRESYNC <replid> <offset>,
       so it knows the replid and offset to use for a later PSYNC if the link to the master is lost.
     */
     /* The command is PSYNC */
    if (!strcasecmp(c->argv[0]->ptr,"psync")) {
        /* Partial resync succeeded (C_OK); masterTryPartialResynchronization is covered below */
        if (masterTryPartialResynchronization(c) == C_OK) {
            /* Increment the counter of accepted partial resyncs */
            server.stat_sync_partial_ok++;
            return; /* No full resync needed, return. */
        }
        /* The partial resync attempt did not return C_OK */
        else {
            /* The replication ID the slave asked for */
            char *master_replid = c->argv[1]->ptr;

            /* Increment stats for failed PSYNCs, but only if the
             * replid is not "?", as this is used by slaves to force a full
             * resync on purpose when they are not able to partially
             * resync. */
            /*
              Only count a failed partial resync when master_replid[0] is not '?',
              because slaves send "?" deliberately to force a full resync when they
              cannot partially resync.
            */
            if (master_replid[0] != '?') server.stat_sync_partial_err++;
        }
    } 
    /* Not PSYNC, so the command must be SYNC */
    else {
        /* If a slave uses SYNC, we are dealing with an old implementation
         * of the replication protocol (like redis-cli --slave). Flag the client
         * so that we don't expect to receive REPLCONF ACK feedbacks. */
     
        /* Flag the client with CLIENT_PRE_PSYNC (it only speaks the old replication protocol) */
        c->flags |= CLIENT_PRE_PSYNC;
    }

    /* Full resynchronization. */
    /* Increment the full resynchronization counter */
    server.stat_sync_full++;

    /* Setup the slave as one waiting for BGSAVE to start. The following code
     * paths will change the state if we handle the slave differently. */
    
    /* Set the slave's replication state to "waiting for BGSAVE to start" */
    c->replstate = SLAVE_STATE_WAIT_BGSAVE_START;
    /* If repl-disable-tcp-nodelay is enabled, turn TCP_NODELAY off on this slave's connection */
    if (server.repl_disable_tcp_nodelay)
        connDisableTcpNoDelay(c->conn); /* Non critical if it fails. */
    /* Set the client's repldbfd to -1 (no RDB file is open for it yet) */
    c->repldbfd = -1;
    /* Flag the client as a slave */
    c->flags |= CLIENT_SLAVE;
    /* Append this client to the master's server.slaves list */
    listAddNodeTail(server.slaves,c);

    /* Create the replication backlog if needed. */
    /* The replication backlog is created only when this is the first slave
       and the master has no backlog yet */
    if (listLength(server.slaves) == 1 && server.repl_backlog == NULL) {
        /* When we create the backlog from scratch, we always use a new
         * replication ID and clear the ID2, since there is no valid
         * past history. */
        changeReplicationId();
        clearReplicationId2();
        createReplicationBacklog();
        serverLog(LL_NOTICE,"Replication backlog created, my new "
                            "replication IDs are '%s' and '%s'",
                            server.replid, server.replid2);
    }

    /* CASE 1: BGSAVE is in progress, with disk target. */
    /* Case 1: a BGSAVE is in progress and writes to disk; once it finishes,
       the resulting RDB file is sent to the slave */
    if (server.rdb_child_pid != -1 &&
        server.rdb_child_type == RDB_CHILD_TYPE_DISK)
    {
        /* Ok a background save is in progress. Let's check if it is a good
         * one for replication, i.e. if there is another slave that is
         * registering differences since the server forked to save. */
        client *slave;
        listNode *ln;
        listIter li;

        listRewind(server.slaves,&li);
        /* Iterate over the slaves */
        while((ln = listNext(&li))) {
            slave = ln->value;
            /* Stop at the first slave that is waiting for the BGSAVE to end */
            if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_END) break;
        }
        /* To attach this slave, we check that it has at least all the
         * capabilities of the slave that triggered the current BGSAVE. */
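        /*
            Check that a matching slave was found (ln is not NULL) and that this client's
            capabilities include every capability of that slave, i.e. of the slave that
            triggered the current BGSAVE. Only then can the client share the same RDB.
        */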
        if (ln && ((c->slave_capa & slave->slave_capa) == slave->slave_capa)) {
            /* Perfect, the server is already registering differences for
             * another slave. Set the right state, and copy the buffer. */
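             /*
                The master is already buffering the changes made since the fork for
                another slave, so set the right state and copy that buffer.
             */
            /* Copy the whole output buffer of 'slave' into the output buffer of 'c' */
            copyClientOutputBuffer(c,slave);
            /* Set up the full resync for this slave; covered later in this article */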
            replicationSetupSlaveForFullResync(c,slave->psync_initial_offset);
            serverLog(LL_NOTICE,"Waiting for end of BGSAVE for SYNC");
        } else {
            /* No way, we need to wait for the next BGSAVE in order to
             * register differences. */
            /* This slave cannot be attached to the current BGSAVE; it has to wait for
               the next BGSAVE started for replication.
             */
            serverLog(LL_NOTICE,"Can't attach the replica to the current BGSAVE. Waiting for next BGSAVE for SYNC");
        }

    /* CASE 2: BGSAVE is in progress, with socket target. */
    /* Case 2: a BGSAVE is in progress and streams the RDB directly to slave sockets
    */
    } else if (server.rdb_child_pid != -1 &&
               server.rdb_child_type == RDB_CHILD_TYPE_SOCKET)
    {
        /* There is an RDB child process but it is writing directly to
         * children sockets. We need to wait for the next BGSAVE
         * in order to synchronize. */
        serverLog(LL_NOTICE,"Current BGSAVE has socket target. Waiting for next BGSAVE for SYNC");

    
    } 
    else 
    /* CASE 3: There is no BGSAVE in progress. */
    {
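        /*
           server.repl_diskless_sync is set by the repl-diskless-sync configuration
           parameter and defaults to 0; by default the master first persists the data to a
           local file and then sends that file to the slave.
           SLAVE_CAPA_EOF means the slave can parse the RDB EOF streaming format; the other
           capabilities are defined below.
           repl-diskless-sync no       ---maps to server.repl_diskless_sync
           repl-diskless-sync-delay 5  ---maps to server.repl_diskless_sync_delay
        */
       /* The server is configured for diskless replication and this slave can parse the RDB EOF streaming format */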
        if (server.repl_diskless_sync && (c->slave_capa & SLAVE_CAPA_EOF)) {
            /* Diskless replication RDB child is created inside
             * replicationCron() since we want to delay its start a
             * few seconds to wait for more slaves to arrive. */
            /*
               The RDB child for diskless replication is created in replicationCron(),
               so that its start can be delayed a few seconds to let more slaves arrive.
            */
            if (server.repl_diskless_sync_delay)
                serverLog(LL_NOTICE,"Delay next BGSAVE for diskless SYNC");
        }
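        /* Otherwise (diskless replication is not configured, or the slave cannot parse
           the RDB EOF streaming format). In practice only the first case should matter,
           since slaves have been able to parse the streaming format for a long time;
           the RDB is then written to disk before it is transferred.
        */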
        else 
        {
            /* Target is disk (or the slave is not capable of supporting
             * diskless replication) and we don't have a BGSAVE in progress,
             * let's start one. */
            /* No child process is running: start writing the RDB file for replication */
            if (!hasActiveChildProcess()) {
                startBgsaveForReplication(c->slave_capa);
            } else {
            /* No BGSAVE child, but another child process is running: writing the RDB
               for replication is delayed */
                serverLog(LL_NOTICE,
                    "No BGSAVE in progress, but another BG operation is active. "
                    "BGSAVE for replication delayed");
            }
        }
    }
    return;
}
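The condition that decides whether a new slave may share a BGSAVE already in progress, (c->slave_capa & slave->slave_capa) == slave->slave_capa, is a bitmask superset test rather than an equality test. A minimal sketch follows; can_attach_to_bgsave is a hypothetical helper introduced here, and the flag values repeat the Redis definitions listed in section 1.2.2.

```c
#include <assert.h>

/* Same bit values as the Redis definitions listed in section 1.2.2. */
#define SLAVE_CAPA_NONE   0
#define SLAVE_CAPA_EOF    (1 << 0)
#define SLAVE_CAPA_PSYNC2 (1 << 1)

/* Return 1 if a newly arrived slave with capabilities 'new_capa' can
 * share a BGSAVE that was started for a slave with capabilities
 * 'trigger_capa', and 0 otherwise. */
int can_attach_to_bgsave(int new_capa, int trigger_capa) {
    assert(new_capa >= 0 && trigger_capa >= 0);
    /* Every bit set in trigger_capa must also be set in new_capa. */
    return (new_capa & trigger_capa) == trigger_capa;
}
```

A slave that supports both EOF and PSYNC2 can join a BGSAVE triggered by an EOF-only slave, but not the other way round, because the payload format was chosen for the triggering slave's capabilities.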

1.2.2 Slave capability flags

/* Slave capabilities. */
#define SLAVE_CAPA_NONE 0
#define SLAVE_CAPA_EOF (1<<0)    /* Can parse the RDB EOF streaming format. */
#define SLAVE_CAPA_PSYNC2 (1<<1) /* Supports PSYNC2 protocol. */

1.2.3 Implementation of masterTryPartialResynchronization

/* This function handles the PSYNC command from the point of view of a
 * master receiving a request for partial resynchronization.
 *
 * On success return C_OK, otherwise C_ERR is returned and we proceed
 * with the usual full resync. */
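/*
  This function checks whether the node that sent PSYNC can be served with a partial
  resynchronization. It returns C_OK if so; otherwise it returns C_ERR and the caller
  proceeds with a full resync.
*/
int masterTryPartialResynchronization(client *c) {
    long long psync_offset, psync_len;
    char *master_replid = c->argv[1]->ptr; /* replication ID requested by the slave */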
    char buf[128];
    int buflen;

    /* Parse the replication offset asked by the slave. Go to full sync
     * on parse error: this should never happen but we try to handle
     * it in a robust way compared to aborting. */
    /*
       Parse the replication offset sent by the slave. A parse error should never happen,
       but falling back to a full resync is more robust than aborting.
    */
    if (getLongLongFromObjectOrReply(c,c->argv[2],&psync_offset,NULL) !=
       C_OK) goto need_full_resync;

    /* Is the replication ID of this master the same advertised by the wannabe
     * slave via PSYNC? If the replication ID changed this master has a
     * different replication history, and there is no way to continue.
     *
     * Note that there are two potentially valid replication IDs: the ID1
     * and the ID2. The ID2 however is only valid up to a specific offset. */
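    /*
       Is this master's replication ID the one the slave advertised via PSYNC? If the ID
       changed, this master has a different replication history and there is no way to continue.
       Note that two replication IDs are potentially valid, ID1 (server.replid) and
       ID2 (server.replid2), but ID2 is only valid up to server.second_replid_offset.
    */
    /* strcasecmp compares two strings ignoring case and returns 0 when they are equal */
    /* 
       A partial resync is refused when both of the following hold:
       (1) the replication ID sent by the slave differs from server.replid;
       (2) the replication ID sent by the slave differs from server.replid2, or the
           requested offset is greater than server.second_replid_offset.
    */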
    if (strcasecmp(master_replid, server.replid) &&
        (strcasecmp(master_replid, server.replid2) ||
         psync_offset > server.second_replid_offset))
    {
        /* Branch taken when a partial resync is not possible */
        /* Run id "?" is used by slaves that want to force a full resync. */
        /* The slave did not use "?" to force a full resync */
        if (master_replid[0] != '?') {
            /* The requested ID matches neither server.replid nor server.replid2:
               log the mismatch */
            if (strcasecmp(master_replid, server.replid) &&
                strcasecmp(master_replid, server.replid2))
            {
                serverLog(LL_NOTICE,"Partial resynchronization not accepted: "
                    "Replication ID mismatch (Replica asked for '%s', my "
                    "replication IDs are '%s' and '%s')",
                    master_replid, server.replid, server.replid2);
            } 
            else 
            /* Otherwise the requested ID matched server.replid2, but the requested
               offset is beyond what this master can serve for that ID
            */
            {
                serverLog(LL_NOTICE,"Partial resynchronization not accepted: "
                    "Requested offset for second ID was %lld, but I can reply "
                    "up to %lld", psync_offset, server.second_replid_offset);
            }
        } else {
            /* The slave explicitly requested a full resync */
            serverLog(LL_NOTICE,"Full resync requested by replica %s",
                replicationGetSlaveName(c));
        }
        /* A full resync is needed */
        goto need_full_resync;
    }

    /* We still have the data our slave is asking for? */
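    /* Do we really still have the data the slave asks for? (The replication backlog has a fixed, limited size.) */
    /* A partial resync is refused if any one of these holds:
      (1) the master has no replication backlog (!server.repl_backlog is true);
      (2) the requested offset is smaller than server.repl_backlog_off, the offset of the
          oldest byte still in the backlog: the data the slave needs has already been
          overwritten;
      (3) the requested offset is greater than
          server.repl_backlog_off + server.repl_backlog_histlen: the slave asks for
          data beyond the end of the backlog.

      How the fields relate: server.repl_backlog_off is the replication offset of the first
      byte in the backlog, so server.repl_backlog_off + server.repl_backlog_histlen
      equals server.master_repl_offset + 1.
    */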
    if (!server.repl_backlog ||
        psync_offset < server.repl_backlog_off ||
        psync_offset > (server.repl_backlog_off + server.repl_backlog_histlen))
    {
        serverLog(LL_NOTICE,
            "Unable to partial resync with replica %s for lack of backlog (Replica request was: %lld).", replicationGetSlaveName(c), psync_offset);
        if (psync_offset > server.master_repl_offset) {
            serverLog(LL_WARNING,
                "Warning: replica %s tried to PSYNC with an offset that is greater than the master replication offset.", replicationGetSlaveName(c));
        }
        goto need_full_resync;
    }


    /* From here on the partial resync is accepted; the three steps are listed in
       the upstream comment that follows.
    */  
 
    /* If we reached this point, we are able to perform a partial resync:
     * 1) Set client state to make it a slave.
     * 2) Inform the client we can continue with +CONTINUE
     * 3) Send the backlog data (from the offset to the end) to the slave. */
    /* Flag the client as a slave */
    c->flags |= CLIENT_SLAVE;
    /* Mark the slave as online: no RDB transfer is needed, only the stream of new data follows */
    c->replstate = SLAVE_STATE_ONLINE;
    /* Initialize the time of the last ACK from this slave to the current time */
    c->repl_ack_time = server.unixtime;
    /* Set repl_put_online_on_ack to 0.
       The upstream description of the field is
       "Install slave write handler on first ACK",
       so 0 presumably means that nothing is deferred until the slave's first ACK.
   */
    c->repl_put_online_on_ack = 0;
    /* Append this client to the server.slaves list */
    listAddNodeTail(server.slaves,c);
    
    /* We can't use the connection buffers since they are used to accumulate
     * new commands at this stage. But we are sure the socket send buffer is
     * empty so this write will never fail actually. */
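    /* The client output buffers cannot be used for this reply because at this stage they
       accumulate new commands. The socket send buffer is known to be empty, so this
       direct write should not fail.
    */
    /* If the slave supports PSYNC2, buf is "+CONTINUE <server.replid>\r\n";
       otherwise buf is "+CONTINUE\r\n".
    */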
    if (c->slave_capa & SLAVE_CAPA_PSYNC2) {
        buflen = snprintf(buf,sizeof(buf),"+CONTINUE %s\r\n", server.replid);
    } else {
        buflen = snprintf(buf,sizeof(buf),"+CONTINUE\r\n");
    }
    /* Write buf directly to the slave's connection */
    if (connWrite(c->conn,buf,buflen) != buflen) {
        freeClientAsync(c);
        return C_OK;
    }
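    /* Queue the requested backlog data in the client's reply buffers
       (char buf[PROTO_REPLY_CHUNK_BYTES] and list *reply). The data is not written to
       the socket here; it is sent later by the event loop when the client's pending
       output is flushed */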
    psync_len = addReplyReplicationBacklog(c,psync_offset);
    serverLog(LL_NOTICE,
        "Partial resynchronization request from %s accepted. Sending %lld bytes of backlog starting from offset %lld.",
            replicationGetSlaveName(c),
            psync_len, psync_offset);
    /* Note that we don't need to set the selected DB at server.slaveseldb
     * to -1 to force the master to emit SELECT, since the slave already
     * has this state from the previous connection with the master. */
    /* Recount the "good" slaves, i.e. those whose lag is at most min-slaves-max-lag;
       the implementation is shown in 1.2.3.4
     */
    refreshGoodSlavesCount();

    /* Fire the replica change modules event. */
    /* Fire the REDISMODULE_EVENT_REPLICA_CHANGE event */
    moduleFireServerEvent(REDISMODULE_EVENT_REPLICA_CHANGE,
                          REDISMODULE_SUBEVENT_REPLICA_CHANGE_ONLINE,
                          NULL);

    return C_OK; /* The caller can return, no full resync needed. */

need_full_resync:
    /* We need a full resync for some reason... Note that we can't
     * reply to PSYNC right now if a full SYNC is needed. The reply
     * must include the master offset at the time the RDB file we transfer
     * is generated, so we need to delay the reply to that moment. */
    return C_ERR;
}
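The two gates in masterTryPartialResynchronization (replication ID match, then backlog window) can be condensed into a single predicate. The sketch below is an illustration only: master_state and can_partial_resync are hypothetical names, the struct holds just the fields the checks read, and logging and the reply to the slave are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical snapshot of the master fields read by the checks. */
typedef struct {
    const char *replid;               /* server.replid */
    const char *replid2;              /* server.replid2 */
    long long second_replid_offset;   /* server.second_replid_offset */
    int has_backlog;                  /* server.repl_backlog != NULL */
    long long backlog_off;            /* server.repl_backlog_off */
    long long backlog_histlen;        /* server.repl_backlog_histlen */
} master_state;

/* Return 1 if "PSYNC req_replid req_offset" can be served with a
 * partial resync, 0 if a full resync is required. */
int can_partial_resync(const master_state *m, const char *req_replid,
                       long long req_offset) {
    assert(m != NULL && req_replid != NULL);

    /* Gate 1: the ID must match ID1, or match ID2 within its valid range. */
    int id1_ok = strcasecmp(req_replid, m->replid) == 0;
    int id2_ok = strcasecmp(req_replid, m->replid2) == 0 &&
                 req_offset <= m->second_replid_offset;
    if (!id1_ok && !id2_ok) return 0;

    /* Gate 2: the requested offset must fall inside the backlog window. */
    if (!m->has_backlog) return 0;
    if (req_offset < m->backlog_off) return 0;
    if (req_offset > m->backlog_off + m->backlog_histlen) return 0;
    return 1;
}
```

The first gate is the De Morgan form of the condition in the listing: the request is rejected exactly when the ID differs from server.replid and either differs from server.replid2 or exceeds second_replid_offset. A request with replid "?" fails the first gate, which is how a slave forces a full resync.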

1.2.3.1 addReplyReplicationBacklog

/* Feed the slave 'c' with the replication backlog starting from the
 * specified 'offset' up to the end of the backlog. */
long long addReplyReplicationBacklog(client *c, long long offset) {
    long long j, skip, len;

    serverLog(LL_DEBUG, "[PSYNC] Replica request offset: %lld", offset);

    if (server.repl_backlog_histlen == 0) {
        serverLog(LL_DEBUG, "[PSYNC] Backlog history len is zero");
        return 0;
    }

    serverLog(LL_DEBUG, "[PSYNC] Backlog size: %lld",
             server.repl_backlog_size);
    serverLog(LL_DEBUG, "[PSYNC] First byte: %lld",
             server.repl_backlog_off);
    serverLog(LL_DEBUG, "[PSYNC] History len: %lld",
             server.repl_backlog_histlen);
    serverLog(LL_DEBUG, "[PSYNC] Current index: %lld",
             server.repl_backlog_idx);

    /* Compute the amount of bytes we need to discard. */
    skip = offset - server.repl_backlog_off;
    serverLog(LL_DEBUG, "[PSYNC] Skipping: %lld", skip);

    /* Point j to the oldest byte, that is actually our
     * server.repl_backlog_off byte. */
    j = (server.repl_backlog_idx +
        (server.repl_backlog_size-server.repl_backlog_histlen)) %
        server.repl_backlog_size;
    serverLog(LL_DEBUG, "[PSYNC] Index of first byte: %lld", j);

    /* Discard the amount of data to seek to the specified 'offset'. */
    j = (j + skip) % server.repl_backlog_size;

    /* Feed slave with data. Since it is a circular buffer we have to
     * split the reply in two parts if we are cross-boundary. */
    len = server.repl_backlog_histlen - skip;
    serverLog(LL_DEBUG, "[PSYNC] Reply total length: %lld", len);
    while(len) {
        long long thislen =
            ((server.repl_backlog_size - j) < len) ?
            (server.repl_backlog_size - j) : len;

        serverLog(LL_DEBUG, "[PSYNC] addReply() length: %lld", thislen);
        addReplySds(c,sdsnewlen(server.repl_backlog + j, thislen));
        len -= thislen;
        j = 0;
    }
    return server.repl_backlog_histlen - skip;
}
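The index arithmetic in addReplyReplicationBacklog is easier to follow on a tiny buffer. The sketch below reproduces the same steps (locate the oldest byte, skip forward, copy in at most two chunks) under stated assumptions: backlog_copy_from is a hypothetical function, and it copies into a caller-supplied array instead of queuing sds replies with addReplySds.

```c
#include <assert.h>
#include <string.h>

/* Copy the history of a circular buffer, minus the first 'skip' bytes,
 * into 'out'. 'idx' is the next write position, 'histlen' the number of
 * valid bytes. 'out' must hold at least histlen - skip bytes.
 * Returns the number of bytes copied. */
long long backlog_copy_from(const char *backlog, long long size,
                            long long idx, long long histlen,
                            long long skip, char *out) {
    assert(size > 0 && histlen >= 0 && histlen <= size);
    assert(skip >= 0 && skip <= histlen);
    if (histlen == 0) return 0;

    /* Index of the oldest valid byte. */
    long long j = (idx + (size - histlen)) % size;
    /* Seek forward to the first requested byte. */
    j = (j + skip) % size;

    long long len = histlen - skip;
    long long copied = 0;
    while (len > 0) {
        /* Copy up to the physical end of the buffer, then wrap to 0. */
        long long thislen = (size - j < len) ? (size - j) : len;
        memcpy(out + copied, backlog + j, (size_t)thislen);
        copied += thislen;
        len -= thislen;
        j = 0;
    }
    return copied;
}
```

For example, writing "ABCDEFGHIJ" into an 8-byte backlog leaves the array holding "IJCDEFGH" with idx 2 and histlen 8, so the oldest byte is 'C' at index 2. Skipping 3 bytes starts at 'F', and the copy runs in two chunks, "FGH" and then "IJ" after wrapping.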

1.2.3.2 addReplySds

void addReplySds(client *c, sds s) {
    if (prepareClientToWrite(c) != C_OK) {
        /* The caller expects the sds to be free'd. */
        sdsfree(s);
        return;
    }
    if (_addReplyToBuffer(c,s,sdslen(s)) != C_OK)
        _addReplyProtoToList(c,s,sdslen(s));
    sdsfree(s);
}

1.2.3.3 moduleFireServerEvent 

/* Vocabulary note: "intercept" means to catch or stop something on its way;
   here, a module can hook an event fired by the server.
*/
/* This is called by the Redis internals every time we want to fire an
 * event that can be intercepted by some module. The pointer 'data' is useful
 * in order to populate the event-specific structure when needed, in order
 * to return the structure with more information to the callback.
 *
 * 'eid' and 'subid' are just the main event ID and the sub event associated
 * with the event, depending on what exactly happened. */
void moduleFireServerEvent(uint64_t eid, int subid, void *data) {
    /* Fast path to return ASAP if there is nothing to do, avoiding to
     * setup the iterator and so forth: we want this call to be extremely
     * cheap if there are no registered modules. */
    if (listLength(RedisModule_EventListeners) == 0) return;

    listIter li;
    listNode *ln;
    listRewind(RedisModule_EventListeners,&li);
    while((ln = listNext(&li))) {
        RedisModuleEventListener *el = ln->value;
        if (el->event.id == eid) {
            RedisModuleCtx ctx = REDISMODULE_CTX_INIT;
            ctx.module = el->module;

            if (ModulesInHooks == 0) {
                ctx.client = moduleFreeContextReusedClient;
            } else {
                ctx.client = createClient(NULL);
                ctx.client->flags |= CLIENT_MODULE;
                ctx.client->user = NULL; /* Root user. */
            }

            void *moduledata = NULL;
            RedisModuleClientInfoV1 civ1;
            RedisModuleReplicationInfoV1 riv1;
            RedisModuleModuleChangeV1 mcv1;
            /* Start at DB zero by default when calling the handler. It's
             * up to the specific event setup to change it when it makes
             * sense. For instance for FLUSHDB events we select the correct
             * DB automatically. */
            selectDb(ctx.client, 0);

            /* Event specific context and data pointer setup. */
            if (eid == REDISMODULE_EVENT_CLIENT_CHANGE) {
                modulePopulateClientInfoStructure(&civ1,data,
                                                  el->event.dataver);
                moduledata = &civ1;
            } else if (eid == REDISMODULE_EVENT_REPLICATION_ROLE_CHANGED) {
                modulePopulateReplicationInfoStructure(&riv1,el->event.dataver);
                moduledata = &riv1;
            } else if (eid == REDISMODULE_EVENT_FLUSHDB) {
                moduledata = data;
                RedisModuleFlushInfoV1 *fi = data;
                if (fi->dbnum != -1)
                    selectDb(ctx.client, fi->dbnum);
            } else if (eid == REDISMODULE_EVENT_MODULE_CHANGE) {
                RedisModule *m = data;
                if (m == el->module)
                    continue;
                mcv1.version = REDISMODULE_MODULE_CHANGE_VERSION;
                mcv1.module_name = m->name;
                mcv1.module_version = m->ver;
                moduledata = &mcv1;
            } else if (eid == REDISMODULE_EVENT_LOADING_PROGRESS) {
                moduledata = data;
            } else if (eid == REDISMODULE_EVENT_CRON_LOOP) {
                moduledata = data;
            }

            ModulesInHooks++;
            el->module->in_hook++;
            el->callback(&ctx,el->event,subid,moduledata);
            el->module->in_hook--;
            ModulesInHooks--;

            if (ModulesInHooks != 0) freeClient(ctx.client);
            moduleFreeContext(&ctx);
        }
    }
}
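
The control flow of moduleFireServerEvent (fast path when nobody is subscribed, filtering listeners by event ID, and a nesting counter around each callback) can be sketched without any Redis types. Everything named here (`listener`, `subscribe_event`, `fire_event`, `hooks_depth`, `record_depth`) is hypothetical and only for illustration; the per-event data setup and the client and context handling of the real function are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LISTENERS 8

typedef void (*event_cb)(uint64_t eid, int subid, void *data);

typedef struct listener {
    uint64_t eid;   /* event this listener subscribed to */
    event_cb cb;    /* callback to invoke for that event */
} listener;

static listener listeners[MAX_LISTENERS];
static size_t listener_count = 0;
static int hooks_depth = 0;   /* plays the role of ModulesInHooks */

/* Register a callback for one event ID. Returns 0, or -1 if the table is full. */
int subscribe_event(uint64_t eid, event_cb cb) {
    assert(cb != NULL);
    if (listener_count == MAX_LISTENERS) return -1;
    listeners[listener_count].eid = eid;
    listeners[listener_count].cb = cb;
    listener_count++;
    return 0;
}

/* Invoke every listener subscribed to 'eid'. Returns how many were called. */
int fire_event(uint64_t eid, int subid, void *data) {
    if (listener_count == 0) return 0;   /* fast path: nothing registered */

    int fired = 0;
    for (size_t i = 0; i < listener_count; i++) {
        if (listeners[i].eid != eid) continue;
        hooks_depth++;                   /* greater than 1 means a nested fire */
        listeners[i].cb(eid, subid, data);
        hooks_depth--;
        fired++;
    }
    return fired;
}

/* Example callback: stores the nesting depth observed during the call. */
void record_depth(uint64_t eid, int subid, void *data) {
    (void)eid;
    (void)subid;
    *(int *)data = hooks_depth;
}
```

In the real function the counter serves a similar purpose: it decides whether the shared moduleFreeContextReusedClient can be used or a temporary client has to be created for a nested call.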

1.2.3.4 refreshGoodSlavesCount

/* ------------------------- MIN-SLAVES-TO-WRITE  --------------------------- */

/* This function counts the number of slaves with lag <= min-slaves-max-lag.
 * If the option is active, the server will prevent writes if there are not
 * enough connected slaves with the specified lag (or less). */
void refreshGoodSlavesCount(void) {
    listIter li;
    listNode *ln;
    int good = 0;

    if (!server.repl_min_slaves_to_write ||
        !server.repl_min_slaves_max_lag) return;

    listRewind(server.slaves,&li);
    while((ln = listNext(&li))) {
        client *slave = ln->value;
        time_t lag = server.unixtime - slave->repl_ack_time;

        if (slave->replstate == SLAVE_STATE_ONLINE &&
            lag <= server.repl_min_slaves_max_lag) good++;
    }
    server.repl_good_slaves_count = good;
}
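
The counting rule in refreshGoodSlavesCount is easy to check with concrete numbers. The following is a minimal sketch, not Redis code: `replica_info`, `count_good_replicas` and `enough_good_replicas` are hypothetical names, and an array replaces the server.slaves list. The second helper assumes, based on the comment above, that writes are refused when the good count is below the configured minimum.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified view of one attached replica. */
typedef struct replica_info {
    int online;        /* non-zero if the replica is in the ONLINE state */
    time_t ack_time;   /* unix time of its last acknowledgement */
} replica_info;

/* Count replicas that are online and whose lag is at most 'max_lag'
 * seconds. Returns -1 when the feature is disabled (either option is 0),
 * mirroring the early return in refreshGoodSlavesCount. */
int count_good_replicas(const replica_info *replicas, size_t n, time_t now,
                        int min_to_write, time_t max_lag) {
    if (!min_to_write || !max_lag) return -1;
    assert(replicas != NULL || n == 0);

    int good = 0;
    for (size_t i = 0; i < n; i++) {
        time_t lag = now - replicas[i].ack_time;
        if (replicas[i].online && lag <= max_lag) good++;
    }
    return good;
}

/* Assumed policy check: writes are accepted only with enough good replicas. */
int enough_good_replicas(int good, int min_to_write) {
    return good >= min_to_write;
}
```

With a current time of 100, a maximum lag of 10 and replicas that last acknowledged at 100, 95, 100 (not online) and 80, two replicas count as good.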

 

1.2.4 Slave states from the master's point of view

/* POV: point of view */

/* State of slaves from the POV of the master. Used in client->replstate.
 * In SEND_BULK and ONLINE state the slave receives new updates
 * in its output queue. In the WAIT_BGSAVE states instead the server is waiting
 * to start the next background saving in order to send updates to it. */
#define SLAVE_STATE_WAIT_BGSAVE_START 6 /* We need to produce a new RDB file. */
#define SLAVE_STATE_WAIT_BGSAVE_END 7 /* Waiting RDB file creation to finish. */
#define SLAVE_STATE_SEND_BULK 8 /* Sending RDB file to slave. */
#define SLAVE_STATE_ONLINE 9 /* RDB file transmitted, sending just updates. */

1.2.5 replicationSetupSlaveForFullResync

/* Send a FULLRESYNC reply in the specific case of a full resynchronization,
 * as a side effect setup the slave for a full sync in different ways:
 *
 * 1) Remember, into the slave client structure, the replication offset
 *    we sent here, so that if new slaves will later attach to the same
 *    background RDB saving process (by duplicating this client output
 *    buffer), we can get the right offset from this slave.
 * 2) Set the replication state of the slave to WAIT_BGSAVE_END so that
 *    we start accumulating differences from this point.
 * 3) Force the replication stream to re-emit a SELECT statement so
 *    the new slave incremental differences will start selecting the
 *    right database number.
 *
 * Normally this function should be called immediately after a successful
 * BGSAVE for replication was started, or when there is one already in
 * progress that we attached our slave to. */
int replicationSetupSlaveForFullResync(client *slave, long long offset) {
    char buf[128];
    int buflen;

    slave->psync_initial_offset = offset;
    slave->replstate = SLAVE_STATE_WAIT_BGSAVE_END;
    /* We are going to accumulate the incremental changes for this
     * slave as well. Set slaveseldb to -1 in order to force to re-emit
     * a SELECT statement in the replication stream. */
    server.slaveseldb = -1;

    /* Don't send this reply to slaves that approached us with
     * the old SYNC command. */
    if (!(slave->flags & CLIENT_PRE_PSYNC)) {
        buflen = snprintf(buf,sizeof(buf),"+FULLRESYNC %s %lld\r\n",
                          server.replid,offset);
        if (connWrite(slave->conn,buf,buflen) != buflen) {
            freeClientAsync(slave);
            return C_ERR;
        }
    }
    return C_OK;
}
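
The reply written here is a single text line, "+FULLRESYNC <replid> <offset>\r\n", so both ends of the exchange can be sketched in a few lines. The helpers `format_fullresync` and `parse_fullresync` are hypothetical names, not Redis functions; the 40-character ID length is assumed to match CONFIG_RUN_ID_SIZE, and the parser is a simplified illustration rather than the slave's actual parsing code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REPLID_LEN 40   /* assumed to match CONFIG_RUN_ID_SIZE */

/* Build the reply line. Returns its length, or -1 if 'buf' is too small. */
int format_fullresync(char *buf, size_t buflen,
                      const char *replid, long long offset) {
    int n = snprintf(buf, buflen, "+FULLRESYNC %s %lld\r\n", replid, offset);
    if (n < 0 || (size_t)n >= buflen) return -1;
    return n;
}

/* Extract the replication ID and offset from a reply line.
 * Returns 0 on success, -1 if the line is not a well-formed FULLRESYNC. */
int parse_fullresync(const char *line, char replid[REPLID_LEN + 1],
                     long long *offset) {
    static const char prefix[] = "+FULLRESYNC ";
    size_t plen = sizeof(prefix) - 1;
    if (strncmp(line, prefix, plen) != 0) return -1;

    const char *id = line + plen;
    const char *sp = strchr(id, ' ');   /* separator before the offset */
    if (sp == NULL || (size_t)(sp - id) != REPLID_LEN) return -1;

    memcpy(replid, id, REPLID_LEN);
    replid[REPLID_LEN] = '\0';

    char *end;
    *offset = strtoll(sp + 1, &end, 10);
    if (end == sp + 1) return -1;       /* no digits after the ID */
    return 0;
}
```

A line produced by `format_fullresync` round-trips through `parse_fullresync`, while a line such as "+CONTINUE\r\n" is rejected with -1.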

1.3 The slave receives and parses the master's reply to the PSYNC command

 

 

 
