CheckPoint Not Executing Automatically [TimesTen Operations Basics]

CheckPoint not executing automatically:
A customer called today to report that one of their data stores had strange CheckPoint history timestamps, and that its transaction logs were never being purged.
1. Checked the transaction log holds. They did look odd: the logs were being held by the CheckPoint files, yet there was no active/standby replication and no long-running transaction involved.
Command> call ttlogholds;
< 11302, 54794696, Checkpoint                   , ocstt.ds0 >
< 11831, 3753696, Checkpoint                    , ocstt.ds1 >
2 rows found.
2. Checked the CheckPoint history:
Command> select sysdate from dual;
< 2014-05-19 17:28:54 >
1 row found.
Command> call ttckpthistory;
< 2014-06-09 09:40:32.625312, 2014-06-09 09:40:33.600128, Fuzzy           , Completed       , Checkpointer    , <NULL>, 0, 11032, 54794784, 94, 629145600, 80, 5586352, 40, 2354992, 2346248, <NULL> >
< 2014-06-09 09:30:32.410709, 2014-06-09 09:30:32.531441, Fuzzy           , Completed       , Checkpointer    , <NULL>, 1, 11032, 13583360, 94, 629145600, 80, 5586352, 40, 2354992, 2346248, <NULL> >
< 2014-06-06 16:16:18.869326, 2014-06-06 16:16:19.013980, Fuzzy           , Completed       , Checkpointer    , <NULL>, 0, 10864, 12890096, 94, 629145600, 80, 5586352, 35, 2198336, 2243848, <NULL> >
< 2014-06-06 16:06:18.671965, 2014-06-06 16:06:18.853700, Fuzzy           , Completed       , Checkpointer    , <NULL>, 1, 10863, 50556048, 91, 629145600, 74, 5296896, 27, 1934448, 1879304, <NULL> >
< 2014-06-06 15:56:18.577550, 2014-06-06 15:56:18.659042, Fuzzy           , Completed       , Checkpointer    , <NULL>, 0, 10863, 21179784, 91, 629145600, 76, 5504656, 25, 1689048, 1867016, <NULL> >
< 2014-06-06 15:46:18.444551, 2014-06-06 15:46:18.564260, Fuzzy           , Completed       , Checkpointer    , <NULL>, 1, 10862, 58908472, 91, 629145600, 76, 5504656, 33, 2225416, 2215176, <NULL> >
< 2014-06-06 15:36:18.280709, 2014-06-06 15:36:18.431814, Fuzzy           , Completed       , Checkpointer    , <NULL>, 0, 10862, 29515448, 91, 629145600, 76, 5504656, 33, 2225416, 2215176, <NULL> >
< 2014-05-18 10:22:21.430088, 2014-05-18 10:22:23.745732, Static          , Completed       , Subdaemon       , <NULL>, 1, 11831, 3753784, 775, 629145600, 763, 53787560, 775, 629145600, 53789960, <NULL> >
8 rows found.
The first few rows of the CheckPoint history are dated 2014-06-09 and 2014-06-06, but today is only 2014-05-19, so no wonder CheckPoint never executed automatically: the recorded history is in the future.
3. Checked the settings in sys.odbc.ini and the data store configuration (also viewed via the ttConfiguration built-in procedure). All CheckPoint settings were normal. The customer confirmed that the operating system clock had been adjusted.
Suspected the customer's OS clock change as the cause, and finally confirmed it via MetaLink document ID 1379020.1 (Checkpointing Not Occurring).

4. Manually executed CheckPoint twice. Both checkpoints completed normally and the transaction logs were purged, but automatic CheckPoint still did not resume at the next interval. It will not run automatically again until the operating system time passes the latest timestamp in CkptHistory.
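The manual checkpoints in step 4 can be issued from ttIsql; ttCkpt requests a fuzzy checkpoint and ttCkptBlocking a transaction-consistent one. This is a session sketch for illustration, not output from the customer's system:

```sql
Command> call ttCkpt;          -- request a fuzzy checkpoint
Command> call ttCkptBlocking;  -- or a blocking (transaction-consistent) checkpoint
Command> call ttCkptHistory;   -- verify the new entries appear in the history
```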
5. Solutions:
There are three options:
a) Export the data with ttBulkCp or ttMigrate, destroy and recreate the DSN, then import the data back.

b) Schedule a crontab job that executes CheckPoint manually at a fixed interval.

c) Change the CheckPoint trigger from a time interval to accumulated transaction log volume (e.g. call ttCkptConfig(0,1000,0); ).
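Option b) can be sketched as a cron entry. The ttIsql path and DSN name ("ocstt") below are assumptions for illustration, not values confirmed from the customer's system:

```shell
# Hypothetical crontab entry: issue a manual checkpoint every 10 minutes.
# Adjust the ttIsql path and DSN ("ocstt") for the actual TimesTen instance.
*/10 * * * * /opt/TimesTen/tt1122/bin/ttIsql -e "call ttCkpt; quit;" ocstt >> /var/log/ttckpt.log 2>&1
```

This is a config fragment; cron runs it with the instance owner's environment, so the TimesTen environment variables must be available to the job.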

----------------------------------End---------------------------------------------
Reference document:

Checkpointing Not Occurring (Doc ID 1379020.1)

Applies to:
TimesTen Data Server - Version 7.0.5.0.0 to 11.2.1 [Release 7.0 to 11.2]
Information in this document applies to any platform.
This problem could potentially occur in any TimesTen data store.


Symptoms

-Customer reported that automatic checkpointing was not occurring in a production application. All attempts to restart checkpointing failed.

-The problem occurred in 2 different data stores on the same server. It had previously been observed by another customer, where it occurred on 8 different data stores running on the same server.

-The customer was using time-interval based checkpointing, which is the default TimesTen checkpoint configuration. The default is for the checkpointer to execute a checkpoint once every 600 seconds (10 minutes).
Changes

No changes were specifically made to either of the data stores themselves. However, it turned out that a field engineer had been making changes to the system clock while TimesTen was operational.
Cause


Checkpoint histories from both nodes showed a checkpoint entry 8 years in the future. Both data stores had a checkpoint history showing that a checkpoint was performed on 11-Nov-2019:

/* NODE_1 */

Command> call ttCkptHistory;
< 2019-11-11 17:03:00.711478, 2019-11-11 17:03:23.156722, Fuzzy , Completed , Checkpointer , <NULL>, 0, 746589, 30757968, 212794, 3221225472, 210492, 2312255864, 212794, 3221225472, 2315922824, <NULL> >
< 2011-11-17 01:19:31.976204, 2011-11-17 01:19:32.489286, Fuzzy , Completed , Checkpointer , <NULL>, 0, 746498, 3216136, 212794, 3221225472, 210478, 2311956160, 28, 1373176, 1224072, <NULL> >
< 2011-11-17 01:14:31.758497, 2011-11-17 01:14:54.678123, Fuzzy , Completed , Checkpointer , <NULL>, 1, 746498, 3215984, 212794, 3221225472, 210478, 2311956160, 104516, 941230496, 1181552008, <NULL> >
< 2011-11-17 01:09:31.809785, 2011-11-17 01:09:56.464715, Fuzzy , Completed , Checkpointer , <NULL>, 0, 746498, 3148336, 212794, 3221225472, 210492, 2312255864, 119230, 1106716584, 1348410760, <NULL> >
< 2011-11-17 01:04:31.866130, 2011-11-17 01:04:56.486027, Fuzzy , Completed , Checkpointer , <NULL>, 1, 746493, 8825888, 212794, 3221225472, 210492, 2312255864, 122153, 1135240656, 1373654408, <NULL> >
< 2011-11-17 00:59:31.964454, 2011-11-17 00:59:56.541392, Fuzzy , Completed , Checkpointer , <NULL>, 0, 746487, 61057424, 212794, 3221225472, 210492, 2312255864, 124353, 1161729472, 1404247432, <NULL> >
< 2011-11-17 00:54:31.498796, 2011-11-17 00:54:58.654976, Fuzzy , Completed , Checkpointer , <NULL>, 1, 746482, 38247720, 212794, 3221225472, 210492, 2312255864, 153016, 1407583008, 1602108808, <NULL> >
< 2011-11-17 22:40:01.549312, 2011-11-17 22:40:25.057321, Fuzzy , Completed , User , <NULL>, 1, 749263, 14956640, 212793, 3221225472, 210490, 2312125496, 193103, 1889295352, 1961270664, <NULL> >
8 rows found.


/* NODE_2 */

Command> call ttCkptHistory;
< 2019-11-11 17:03:00.711478, 2019-11-11 17:03:23.156722, Fuzzy , Completed , Checkpointer , <NULL>, 0, 746589, 30757968, 212794, 3221225472, 210492, 2312255864, 212794, 3221225472, 2315922824, <NULL> >
< 2011-11-17 01:19:31.976204, 2011-11-17 01:19:32.489286, Fuzzy , Completed , Checkpointer , <NULL>, 0, 746498, 3216136, 212794, 3221225472, 210478, 2311956160, 28, 1373176, 1224072, <NULL> >
< 2011-11-17 01:14:31.758497, 2011-11-17 01:14:54.678123, Fuzzy , Completed , Checkpointer , <NULL>, 1, 746498, 3215984, 212794, 3221225472, 210478, 2311956160, 104516, 941230496, 1181552008, <NULL> >
< 2011-11-17 01:09:31.809785, 2011-11-17 01:09:56.464715, Fuzzy , Completed , Checkpointer , <NULL>, 0, 746498, 3148336, 212794, 3221225472, 210492, 2312255864, 119230, 1106716584, 1348410760, <NULL> >
< 2011-11-17 01:04:31.866130, 2011-11-17 01:04:56.486027, Fuzzy , Completed , Checkpointer , <NULL>, 1, 746493, 8825888, 212794, 3221225472, 210492, 2312255864, 122153, 1135240656, 1373654408, <NULL> >
< 2011-11-17 00:59:31.964454, 2011-11-17 00:59:56.541392, Fuzzy , Completed , Checkpointer , <NULL>, 0, 746487, 61057424, 212794, 3221225472, 210492, 2312255864, 124353, 1161729472, 1404247432, <NULL> >
< 2011-11-17 00:54:31.498796, 2011-11-17 00:54:58.654976, Fuzzy , Completed , Checkpointer , <NULL>, 1, 746482, 38247720, 212794, 3221225472, 210492, 2312255864, 153016, 1407583008, 1602108808, <NULL> >
< 2011-11-17 22:40:01.777759, 2011-11-17 22:40:28.797874, Fuzzy , Completed , User , <NULL>, 0, 749261, 15723336, 213130, 3221225472, 210827, 2315085008, 195219, 2827231536, 1997065608, <NULL> >
8 rows found.



The customer subsequently determined that changes had been made to the server's system clock, which resulted in a checkpoint being registered as having been performed on Nov 11, 2019, i.e., 8 years in the future.

Because a checkpoint with a date 8 years in the future resides in the checkpoint history structure, the next time-interval based checkpoint will not occur until <ckpt time interval> after Nov 11, 2019; in this case, no automatic checkpoint will happen until about 17:09 on Nov 11, 2019. Because of the logic used to update the internal checkpoint history structure, the bad checkpoint date will not be flushed out until 8 checkpoints with timestamps later than it have been performed. So unless the customer chooses to rebuild the data stores, they will have to operate for the next 8 years with a corrupted checkpoint history structure in the data store header.
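The scheduling arithmetic can be illustrated with a small sketch. This is a deliberate simplification of the checkpointer's real logic, using the 600-second default interval from the Symptoms section:

```python
from datetime import datetime, timedelta

def next_auto_ckpt(last_ckpt, interval_secs=600):
    """A time-interval checkpointer fires only once the wall clock
    passes: last recorded checkpoint time + interval."""
    return last_ckpt + timedelta(seconds=interval_secs)

# Corrupted history: the latest "checkpoint" is 8 years in the future.
last_ckpt = datetime(2019, 11, 11, 17, 3, 0)
now = datetime(2011, 11, 17, 22, 40, 0)   # actual wall-clock time

due = next_auto_ckpt(last_ckpt)
print(now >= due)   # False: no automatic checkpoint until late 2019
```

With the bad entry at the head of the history, `now >= due` stays false for 8 years, which matches the observed symptom.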
Solution

Customer has the following possible solutions and workarounds:

(1) Rebuild the affected data stores: export the data using ttBulkCp or ttMigrate, destroy the current data store, create a new data store with attributes identical to the old one, and import the data back in. This is the safest solution, but also the most time-consuming.

(2) Enable a cron job that wakes up at a defined interval, connects to the data store and performs a manual checkpoint by calling 'ttCkpt'.

(3) Modify the automatic checkpointing algorithm of the data store so that it depends on accumulated transaction log volume instead of a time interval. The customer can execute the following command in ttIsql:
 
call ttCkptConfig (0,1000,0);


then the checkpointer will automatically execute a checkpoint each time the amount of transaction log data generated since the last checkpoint exceeds 1000 megabytes (about 1 gigabyte). A checkpoint algorithm based on accumulated log volume causes the checkpointer thread to ignore the date stamp information in the checkpoint history structure, thus working around the date corruption in the checkpoint history. See the TimesTen Reference for more information on using 'ttCkptConfig' to change the default checkpointing behavior.
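Why this workaround sidesteps the corruption: a log-volume trigger compares byte counts rather than timestamps. A minimal sketch of the decision, not the actual TimesTen internals:

```python
def volume_ckpt_due(log_bytes_since_ckpt, threshold_mb=1000):
    """Log-volume based checkpointing: fire once the transaction log
    generated since the last checkpoint exceeds the threshold.
    No dates are consulted, so a corrupted history timestamp is harmless."""
    return log_bytes_since_ckpt > threshold_mb * 1024 * 1024

print(volume_ckpt_due(500 * 1024 * 1024))    # False: only 500 MB of log so far
print(volume_ckpt_due(1200 * 1024 * 1024))   # True: past the 1000 MB threshold
```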

References
BUG:13402829 - CORRUPTED DATES IN CHECKPOINT HISTORY ARE BLOCKING AUTOMATIC CHECKPOINTING
=======================End=================================================================