Chapter 1 covered installing TFA and resolving some common problems. This chapter looks at how to use TFA.
I. Collecting diagnostic logs: diagcollect
The diagcollect command collects diagnostic logs and trace files. Its syntax is as follows:
tfactl diagcollect [-all | [component_name1] [component_name2] ... [component_nameN]] [-node all|local|n1,n2,..] [-tag description] [-z filename] [-since nh|d | -from time -to time | -for time] [-nocopy] [-notrim] [-silent] [-nocores] [-collectalldirs] [-collectdir dir1,dir2..] [-examples]
components: -ips|-database|-asm|-crsclient|-dbclient|-dbwlm|-tns|-rhp|-procinfo|-afd|-crs|-wls|-emagent|-oms|-ocm|-emplugins|-em|-acfs|-install|-cfgtools|-os|-ashhtml|-ashtext|-awrhtml|-awrtext
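It looks long and complicated. The meaning of each parameter is documented on Oracle's website (see the diagcollect parameter reference). The commonly used forms are listed below.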
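Because the option list is long, it can help to assemble the argument list in a script instead of typing it by hand. The following Python sketch shows one way to do that. `build_diagcollect_args` and `COMPONENTS` are hypothetical names; the component flags and the `-node`, `-since` and `-z` options come from the syntax above, and the `<n>h`/`<n>d` check for `-since` is an assumption based on the examples in this article.

```python
import re
from typing import List, Optional, Sequence

# Component flags taken from the diagcollect syntax shown above.
COMPONENTS = {
    "ips", "database", "asm", "crsclient", "dbclient", "dbwlm", "tns", "rhp",
    "procinfo", "afd", "crs", "wls", "emagent", "oms", "ocm", "emplugins",
    "em", "acfs", "install", "cfgtools", "os", "ashhtml", "ashtext",
    "awrhtml", "awrtext",
}


def build_diagcollect_args(
    components: Sequence[str] = (),
    database_names: Sequence[str] = (),
    nodes: Optional[Sequence[str]] = None,
    since: Optional[str] = None,
    zip_name: Optional[str] = None,
) -> List[str]:
    """Assemble a tfactl diagcollect argument list (hypothetical helper)."""
    args = ["tfactl", "diagcollect"]
    if not components:
        # No component given: collect everything.
        args.append("-all")
    for comp in components:
        if comp not in COMPONENTS:
            raise ValueError(f"unknown component: {comp}")
        args.append("-" + comp)
        # -database takes a comma-separated list of database names.
        if comp == "database" and database_names:
            args.append(",".join(database_names))
    if nodes:
        args += ["-node", ",".join(nodes)]
    if since is not None:
        # Assumed forms: n hours (e.g. 8h) or n days (e.g. 1d).
        if not re.fullmatch(r"\d+[hd]", since):
            raise ValueError(f"bad -since value: {since!r}")
        args += ["-since", since]
    if zip_name:
        args += ["-z", zip_name]
    return args
```

For example, `build_diagcollect_args(["database"], ["hrdb", "fdb"], since="1d", zip_name="foo")` produces the arguments of example 3 below. Passing the list to `subprocess.run` instead of a shell string avoids quoting problems.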
1. Collect all trace logs from all nodes
tfactl> diagcollect -all
Collecting data for the last 12 hours for this component ...
Collecting data for all nodes
Creating ips package in master node ...
Trying ADR basepath /u01/app/oracle
Trying to use ADR homepath diag/rdbms/orcl/orcl ...
Submitting request to generate package for ADR homepath /u01/app/oracle/diag/rdbms/orcl/orcl
Trying to use ADR homepath diag/rdbms/prod1/PROD1 ...
Submitting request to generate package for ADR homepath /u01/app/oracle/diag/rdbms/prod1/PROD1
Master package completed for ADR homepath /u01/app/oracle/diag/rdbms/orcl/orcl
Created package 9 based on time range 2017-03-21 20:26:00.000000 +08:00 to 2017-03-22 08:26:00.000000 +08:00, correlation level basic
Master package completed for ADR homepath /u01/app/oracle/diag/rdbms/prod1/PROD1
Created package 9 based on time range 2017-03-21 20:26:00.000000 +08:00 to 2017-03-22 08:26:00.000000 +08:00, correlation level basic
Collection Id : 2017032208265012cr2
Detailed Logging at : /u01/app/oracle/tfa/repository/collection_Wed_Mar_22_08_26_50_CST_2017_node_all/diagcollect_20170322082650_12cr2.log
2017/03/22 08:27:06 CST : Collection Name : tfa_Wed_Mar_22_08_26_50_CST_2017.zip
2017/03/22 08:27:06 CST : Collecting diagnostics from hosts : [12cr2]
2017/03/22 08:27:06 CST : Scanning of files for Collection in progress...
2017/03/22 08:27:06 CST : Collecting additional diagnostic information...
2017/03/22 08:27:11 CST : Getting list of files satisfying time range [03/21/2017 20:27:06 CST, 03/22/2017 08:27:11 CST]
2017/03/22 08:27:12 CST : Collecting ADR incident files...
2017/03/22 08:27:28 CST : Completed collection of additional diagnostic information...
2017/03/22 08:27:32 CST : Completed Local Collection
.----------------------------------.
| Collection Summary |
+-------+-----------+-------+------+
| Host | Status | Size | Time |
+-------+-----------+-------+------+
| 12cr2 | Completed | 244kB | 26s |
'-------+-----------+-------+------'
Logs are being collected to: /u01/app/oracle/tfa/repository/collection_Wed_Mar_22_08_26_50_CST_2017_node_all
/u01/app/oracle/tfa/repository/collection_Wed_Mar_22_08_26_50_CST_2017_node_all/12cr2.tfa_Wed_Mar_22_08_26_50_CST_2017.zip
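As you can see, I have two instances, and both were collected. To save space, the remaining examples show only the commands, not their output.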
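When diagcollect runs from a script, the Collection Summary box in example 1 is the quickest place to check whether every host finished. A minimal Python sketch that extracts the rows of that box, assuming the layout shown above (one `| host | status | size | time |` row per host); `parse_collection_summary` is a hypothetical helper name:

```python
import re
from typing import Dict, List

# One data row of the Collection Summary box, e.g. "| 12cr2 | Completed | 244kB | 26s |".
ROW = re.compile(r"^\|\s*(\S+)\s*\|\s*(\S+)\s*\|\s*(\S+)\s*\|\s*(\S+)\s*\|$")


def parse_collection_summary(output: str) -> List[Dict[str, str]]:
    """Return one dict per host row found in diagcollect console output."""
    rows = []
    for line in output.splitlines():
        m = ROW.match(line.strip())
        # Border lines and the one-cell title line do not match;
        # the header row ("| Host | Status | Size | Time |") is skipped explicitly.
        if m and m.group(1) != "Host":
            host, status, size, elapsed = m.groups()
            rows.append({"host": host, "status": status, "size": size, "time": elapsed})
    return rows
```

A caller could then check `all(r["status"] == "Completed" for r in rows)` before picking up the zip file.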
2. Collect all logs from the last 8 hours
tfactl diagcollect -all -since 8h
3. Collect trace logs from the last day for the hrdb and fdb databases, and name the package file foo
tfactl diagcollect -database hrdb,fdb -since 1d -z foo
4. Collect CRS and OS logs from the last 6 hours on RAC nodes node1 and node2
tfactl diagcollect -crs -os -node node1,node2 -since 6h
5. Collect ASM logs from node1 between 2016-09-22 and 2016-09-23 21:00:00 (to include node 2 as well, use -node node1,node2)
tfactl diagcollect -asm -node node1 -from Sep/22/2016 -to "Sep/23/2016 21:00:00"
6. Collect trace logs for 2016-09-23
tfactl diagcollect -for Sep/23/2016
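7. If a time is added to the -for value, TFA collects logs from 12 hours before to 12 hours after that time.
The following collects trace logs from 2016-09-22 09:00 to 2016-09-23 09:00: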
tfactl diagcollect -for "September/22/2016 21:00:00"
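The 12-hour rule in example 7 is simple date arithmetic. A Python sketch, assuming the window is exactly 12 hours on each side as described above; `for_window` is a hypothetical name:

```python
from datetime import datetime, timedelta
from typing import Tuple


def for_window(center: datetime, half_width_hours: int = 12) -> Tuple[datetime, datetime]:
    """Return the (start, end) range covered by -for "<date> <time>",
    assuming the 12-hours-before/after rule described in this article."""
    delta = timedelta(hours=half_width_hours)
    return center - delta, center + delta
```

`for_window(datetime(2016, 9, 22, 21, 0, 0))` returns 2016-09-22 09:00 and 2016-09-23 09:00, the range given for example 7.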
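The relevant parts of these logs are trimmed and packaged. By default the package is stored under $ORACLE_BASE/tfa/repository/, and the console output during collection shows the exact file location. You can go to that directory to inspect the logs.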
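To pick up the latest package without copying the path from the console, a script could look for the newest zip under the repository. A Python sketch assuming the `collection_*/<host>.tfa_*.zip` layout shown in example 1; `newest_collection_zip` is a hypothetical helper:

```python
from pathlib import Path
from typing import Optional


def newest_collection_zip(repository: Path) -> Optional[Path]:
    """Return the most recently modified *.zip in a collection_* sub-directory.

    Assumes the layout seen in example 1, e.g.
    repository/collection_Wed_Mar_22_..._node_all/12cr2.tfa_Wed_Mar_22_....zip
    """
    zips = list(repository.glob("collection_*/*.zip"))
    if not zips:
        return None
    # Pick the zip with the latest modification time.
    return max(zips, key=lambda p: p.stat().st_mtime)
```

With the default location this would be called as `newest_collection_zip(Path(os.environ["ORACLE_BASE"]) / "tfa" / "repository")`; otherwise pass the directory reported in the collection output.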
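You can also add or remove collection directories, set up automatic collection, add or remove nodes, and so on. These are not covered here; for details, see the official documentation, Diagnostic Collection Commands.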
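II. Analyzing all logs automatically: analyze
Collecting logs would be of limited use if they could not also be analyzed automatically. Oracle provides the analyze command for analyzing the database's current trace files. Some commonly used forms are described below.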
First, the syntax (for a detailed description of each parameter, see Analyze parameter description):
tfactl analyze [-search "pattern"] [-comp db | asm | crs | acfs | os | osw | oswslabinfo | all] [-type error | warning | generic] [-since nh|d] [-from "MMM/DD/YYYY HH24:MI:SS"] [-to "MMM/DD/YYYY HH24:MI:SS"] [-for "MMM/DD/YYYY HH24:MI:SS"] [-node all | local | n1,n2,...] [-verbose] [-o file]
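The same approach works for analyze. This Python sketch assembles an argument list and rejects `-comp` and `-type` values outside the sets listed in this article. `build_analyze_args` is a hypothetical name, and it assumes tfactl does not care about option order:

```python
import re
from typing import List, Optional

# Values taken from the analyze syntax and the -comp/-type descriptions in this article.
COMPS = {"db", "asm", "crs", "acfs", "os", "osw", "oswslabinfo", "all"}
TYPES = {"error", "warning", "generic"}


def build_analyze_args(search: Optional[str] = None, comp: str = "all",
                       msg_type: Optional[str] = None,
                       since: Optional[str] = None) -> List[str]:
    """Assemble a tfactl analyze argument list (hypothetical helper)."""
    args = ["tfactl", "analyze"]
    if search is not None:
        args += ["-search", search]
    if comp not in COMPS:
        raise ValueError(f"unknown -comp value: {comp!r}")
    if comp != "all":
        # "all" is the default, so it is left out.
        args += ["-comp", comp]
    if msg_type is not None:
        if msg_type not in TYPES:
            raise ValueError(f"unknown -type value: {msg_type!r}")
        args += ["-type", msg_type]
    if since is not None:
        if not re.fullmatch(r"\d+[hd]", since):
            raise ValueError(f"bad -since value: {since!r}")
        args += ["-since", since]
    return args
```

For example, `build_analyze_args(search="ORA-", since="7d")` matches the ORA- search in item 1 below; when the list is passed to `subprocess.run`, `ORA-` needs no shell quoting.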
1. Search the last two days for "error" messages
tfactl> analyze -search "error" -since 2d
INFO: analyzing all (Alert and Unix System Logs) logs for the last 2880 minutes... Please wait...
INFO: analyzing host: 12cr2
Report title: Analysis of Alert,System Logs
Report date range: last ~2 day(s)
Report (default) time zone: CST - China Standard Time
Analysis started at: 22-Mar-2017 09:12:44 AM CST
Elapsed analysis time: 5 second(s).
Configuration file: /u01/app/oracle/tfa/12cr2/tfa_home/ext/tnt/conf/tnt.prop
Configuration group: all
Parameter: error
Total message count: 18,645, from 14-Mar-2017 09:47:46 AM CST to 22-Mar-2017 09:10:02 AM CST
Messages matching last ~2 day(s): 4,755, from 20-Mar-2017 09:21:15 AM CST to 22-Mar-2017 09:10:02 AM CST
Matching regex: error
Case sensitive: false
Match count: 228
[Source: /var/log/messages-20170320, Line: 9021]
Mar 20 09:21:25 2017
12cr2 rngd: read error
[Source: /var/log/messages-20170320, Line: 9022]
Mar 20 09:21:25 2017
12cr2 rngd: read error
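To save space, only part of the key output is shown above. In the last two days there were 228 matches. The keyword after -search is up to you; for example, the ORA- errors we usually care about are easy to find: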
tfactl> analyze -search "ORA-" -since 7d
INFO: analyzing all (Alert and Unix System Logs) logs for the last 10080 minutes... Please wait...
INFO: analyzing host: 12cr2
Report title: Analysis of Alert,System Logs
Report date range: last ~7 day(s)
Report (default) time zone: CST - China Standard Time
Analysis started at: 22-Mar-2017 09:17:48 AM CST
Elapsed analysis time: 5 second(s).
Configuration file: /u01/app/oracle/tfa/12cr2/tfa_home/ext/tnt/conf/tnt.prop
Configuration group: all
Parameter: ORA-
Total message count: 21,276, from 14-Mar-2017 09:47:46 AM CST to 22-Mar-2017 09:17:18 AM CST
Messages matching last ~7 day(s): 11,382, from 15-Mar-2017 09:20:01 AM CST to 22-Mar-2017 09:17:18 AM CST
Matching regex: ORA-
Case sensitive: false
Match count: 373
[Source: /u01/app/oracle/diag/rdbms/prod1/PROD1/trace/alert_PROD1.log, Line: 459]
Mar 15 10:59:47 2017
Dispatchers and shared servers shutdown
ALTER DATABASE CLOSE NORMAL
Stopping Emon pool
ORA-1109 signalled during: ALTER DATABASE CLOSE NORMAL...
2. Analyze database instance logs from the last two days
tfactl> analyze -comp db -since 2d
INFO: analyzing db (DB Alert Logs) logs for the last 2880 minutes... Please wait...
INFO: analyzing host: 12cr2
Report title: DB Alert Logs
Report date range: last ~2 day(s)
Report (default) time zone: CST - China Standard Time
Analysis started at: 22-Mar-2017 09:33:29 AM CST
Elapsed analysis time: 2 second(s).
Configuration file: /u01/app/oracle/tfa/12cr2/tfa_home/ext/tnt/conf/tnt.prop
Configuration group: db
Total message count: 5,514, from 14-Mar-2017 04:20:29 PM CST to 17-Mar-2017 02:10:31 PM CST
Messages matching last ~2 day(s): 0
last ~2 day(s) error count: 0
last ~2 day(s) ignored error count: 0
last ~2 day(s) unique error count: 0
-comp selects the component to analyze: os, db, asm, acfs, crs, osw, oswslabinfo, or all. The default is all, which analyzes every component.
3. Analyze logs from the last 5 hours
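The INFO line at the top of each report restates the -since window in minutes (2d as 2880, 7d as 10080). A small Python sketch of that conversion, useful for sanity-checking a window in a script; `since_to_minutes` is a hypothetical name, and only the `<n>h` and `<n>d` forms used in this article are handled:

```python
import re


def since_to_minutes(since: str) -> int:
    """Convert a -since value such as "2d" or "5h" to minutes."""
    m = re.fullmatch(r"(\d+)([hd])", since)
    if not m:
        raise ValueError(f"expected <n>h or <n>d, got {since!r}")
    n, unit = int(m.group(1)), m.group(2)
    # Hours are 60 minutes; days are 24 * 60 minutes.
    return n * 60 if unit == "h" else n * 24 * 60
```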
tfactl analyze -since 5h
4. Analyze error-type messages from the last hour
tfactl> analyze -since 1h -type error
INFO: analyzing all (Alert and Unix System Logs) logs for the last 60 minutes... Please wait...
INFO: analyzing host: 12cr2
Report title: Analysis of Alert,System Logs
Report date range: last ~1 hour(s)
Report (default) time zone: CST - China Standard Time
Analysis started at: 22-Mar-2017 09:41:22 AM CST
Elapsed analysis time: 1 second(s).
Configuration file: /u01/app/oracle/tfa/12cr2/tfa_home/ext/tnt/conf/tnt.prop
Configuration group: all
Total message count: 3,467, from 20-Mar-2017 10:48:02 AM CST to 22-Mar-2017 09:40:01 AM CST
Messages matching last ~1 hour(s): 60, from 22-Mar-2017 08:43:16 AM CST to 22-Mar-2017 09:40:01 AM CST
last ~1 hour(s) error count: 0
last ~1 hour(s) ignored error count: 0
last ~1 hour(s) unique error count: 0
Message types for last ~1 hour(s)
Occurrences percent server name type
----------- ------- -------------------- -----
60 100.0% 12cr2 generic
----------- -------
60 100.0%
Unique error messages for last ~1 hour(s)
Occurrences percent server name error
----------- ------- -------------------- -----
----------- -------
0 100.0%
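To track these numbers over time, the "Message types" table can be turned into per-type counts. A Python sketch assuming the column layout shown above (occurrences, percent, server name, type); rows for several servers are summed per type. `message_type_counts` is a hypothetical name:

```python
import re
from typing import Dict

# A data row such as "60 100.0% 12cr2 generic"; header, dash and total lines do not match.
ROW = re.compile(r"^\s*(\d+)\s+([\d.]+)%\s+(\S+)\s+(\S+)\s*$")


def message_type_counts(report: str) -> Dict[str, int]:
    """Sum occurrences per message type from the "Message types" section."""
    counts: Dict[str, int] = {}
    in_section = False
    for line in report.splitlines():
        if line.startswith("Message types"):
            in_section = True
            continue
        if line.startswith("Unique error messages"):
            break  # the next table lists error texts, not types
        if in_section:
            m = ROW.match(line)
            if m:
                occurrences, _pct, _server, msg_type = m.groups()
                counts[msg_type] = counts.get(msg_type, 0) + int(occurrences)
    return counts
```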
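The error in -type does not mean the same thing as the error in -search: here it is a message category. -type accepts three values: error, warning, and generic, which can be thought of as severe, warning, and general messages.
Oracle's documentation describes them as follows: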
| Argument | Description |
|---|---|
error | Error message patterns for database and Oracle ASM alert logs: .*ORA-00600:.* .*ORA-07445:.* .*IPC Send timeout detected. Sender: ospid.* .*Direct NFS: channel id .* path .* to filer .* PING timeout.* .*Direct NFS: channel id .* path .* to filer .* is DOWN.* .*ospid: .* has not called a wait for .* secs.* .*IPC Send timeout to .* inc .* for msg type .* from opid.* .*IPC Send timeout: Terminating pid.* .*Receiver: inst .* binc .* ospid.* .* terminating instance due to error.* .*: terminating the instance due to error.* .*Global Enqueue Services Deadlock detected Error message patterns for Oracle Grid Infrastructure alert logs: .*CRS-8011:.*,.*CRS-8013:.*,.*CRS-1607:.*,.*CRS-1615:.*, .*CRS-1714:.*,.*CRS-1656:.*,.*PRVF-5305:.*,.*CRS-1601:.*, .*CRS-1610:.*,.*PANIC. CRSD exiting:.*,.*Fatal Error from AGFW Proxy:.* |
warning | Warning message patterns for database and Oracle ASM alert logs: NOTE: process .* initiating offline of disk .* .*WARNING: cache read a corrupted block group.* .*NOTE: a corrupted block from group FRA was dumped to |
generic | Any messages that do not match any of the preceding patterns. |
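To see how these categories work, here is a Python sketch that classifies a single message using a few of the patterns from the table. It is an illustration only: it uses a subset of the patterns, and it assumes each pattern must match the whole message line, which may not be exactly how TFA applies them. `classify` is a hypothetical name:

```python
import re

# A subset of the error patterns from the table above (not the full list).
ERROR_PATTERNS = [
    r".*ORA-00600:.*",
    r".*ORA-07445:.*",
    r".*IPC Send timeout detected. Sender: ospid.*",
    r".* terminating instance due to error.*",
    r".*CRS-1607:.*",
    r".*PANIC. CRSD exiting:.*",
]
# A subset of the warning patterns from the table above.
WARNING_PATTERNS = [
    r"NOTE: process .* initiating offline of disk .*",
    r".*WARNING: cache read a corrupted block group.*",
]


def classify(message: str) -> str:
    """Return "error", "warning" or "generic" for one log line (assumed full-line match)."""
    if any(re.fullmatch(p, message) for p in ERROR_PATTERNS):
        return "error"
    if any(re.fullmatch(p, message) for p in WARNING_PATTERNS):
        return "warning"
    # Anything not matching the patterns above is generic.
    return "generic"
```

Under the patterns in the table, the `ORA-1109 signalled during: ...` line from the earlier search falls into generic, because neither the error nor the warning patterns include ORA-1109.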
TFA can also analyze logs for a specified time, as described in the official documentation, but in practice the command failed:
tfactl> analyze -comp os -for "Mar/21/2017 11:00:00" -search "ORA"
ERROR: Invalid value for -for/-from. Supported format MMM/DD/YYYY HH24:MI:SS
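Before concluding that TFA is at fault, it is worth checking the value itself. A Python sketch that tests a string against the format named in the error message, MMM/DD/YYYY HH24:MI:SS, mapped to the strptime pattern `%b/%d/%Y %H:%M:%S`. The mapping is an assumption, and `%b` relies on the English month abbreviations of the default C locale; `is_valid_tfa_time` is a hypothetical name:

```python
from datetime import datetime

# Assumed strptime equivalent of TFA's MMM/DD/YYYY HH24:MI:SS.
TFA_TIME_FORMAT = "%b/%d/%Y %H:%M:%S"


def is_valid_tfa_time(value: str) -> bool:
    """Return True if value parses with the assumed TFA time format."""
    try:
        datetime.strptime(value, TFA_TIME_FORMAT)
    except ValueError:
        return False
    return True
```

`is_valid_tfa_time("Mar/21/2017 11:00:00")` returns True, so the value used above does follow the documented format.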
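Since the value does follow the documented MMM/DD/YYYY HH24:MI:SS format, this appears to be a bug that had not yet been fixed in this version, so the time parameters are best skipped for now. That covers basic usage. TFA has many other features and commands, but I find them less commonly used; if you are interested, see the official documentation.
This article described how to use Oracle Trace File Analyzer (TFA): using diagcollect to collect logs in different scenarios (all nodes, specific instances, alert logs, and so on), and using analyze to search and analyze errors in the logs when troubleshooting database instances.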