Oracle RAC Node Eviction – How to analyze

WHAT IS NODE EVICTION

The Oracle Clusterware is designed to perform a node eviction by removing one or more nodes from the cluster if some critical problem is detected. A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung or severely degraded machine, or a hung ocssd.bin process. The purpose of this node eviction is to maintain the overall health of the cluster by removing bad members.

PROCESS ROLES FOR REBOOTS

OCSSD (aka CSS daemon)
The primary responsibility of this daemon is internode health monitoring and RDBMS instance endpoint discovery.
The health monitoring includes a network heartbeat and a disk heartbeat (to the voting files).
OCSSD can also evict a node after escalation of a member kill from a client (such as a database LMON process).

CSSDAGENT
This process provides the following functionality. (These services were formerly, in 10g and 11.1, provided by oprocd.)

  • Monitoring for node hangs (via oprocd functionality)
  • Monitoring the OCSSD process for hangs (via oclsomon functionality)
  • Monitoring vendor clusterware (via vmon functionality)
  • This is a multi-threaded process that runs at an elevated priority and runs as the root user.

CSSDMONITOR

  • This process monitors for node hangs (via oprocd functionality)
  • Monitors the OCSSD process for hangs (via oclsomon functionality)
  • Monitors vendor clusterware (via vmon functionality)
  • This is a multi-threaded process that runs at an elevated priority and runs as the root user.

Review these files to determine what happened:
Clusterware alert log in $GRID_HOME/log/nodename
The cssdagent log(s) in $GRID_HOME/log/nodename/agent/ohasd/oracssdagent_root
The cssdmonitor log(s) in $GRID_HOME/log/nodename/agent/ohasd/oracssdmonitor_root
The ocssd log(s) in $GRID_HOME/log//cssd
The lastgasp log(s) in /etc/oracle/lastgasp or /var/opt/oracle/lastgasp
IPD/OS or OS Watcher data
‘opatch lsinventory -detail’ output for the GRID home
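
As a rough aid for gathering these files, the sketch below builds the log locations from a Grid home and a node name. The function name clusterware_log_paths and the sample values are hypothetical; only the directory layout comes from the list above. It covers only the locations whose paths are given in full, so the ocssd log is not included.

```python
from pathlib import PurePosixPath


def clusterware_log_paths(grid_home: str, nodename: str) -> dict[str, list[str]]:
    """Return the log locations named in the list above (illustrative helper)."""
    base = PurePosixPath(grid_home) / "log" / nodename
    agent = base / "agent" / "ohasd"
    return {
        "clusterware alert log": [str(base)],
        "cssdagent": [str(agent / "oracssdagent_root")],
        "cssdmonitor": [str(agent / "oracssdmonitor_root")],
        # lastgasp files are in one of these two directories
        "lastgasp": ["/etc/oracle/lastgasp", "/var/opt/oracle/lastgasp"],
    }


if __name__ == "__main__":
    # Sample Grid home and node name are assumptions for illustration only.
    for name, paths in clusterware_log_paths("/u01/app/11.2.0/grid", "rac1").items():
        print(f"{name}: {', '.join(paths)}")
```

Checking both lastgasp directories avoids having to know in advance which one a given platform uses.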

Messages files:
Linux: /var/log/messages
Sun: /var/adm/messages
HP-UX: /var/adm/syslog/syslog.log
IBM: /bin/errpt -a > messages.out
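
The per-platform locations above can be captured in a small lookup, which may be handy in a collection script. This is a minimal sketch: the names MESSAGES_SOURCES and messages_source are hypothetical, and the dictionary keys assume the OS names that Python's platform.system() reports (Linux, SunOS, HP-UX, AIX).

```python
import platform

# Keys are assumed to match platform.system() on each OS; the locations are
# the ones listed above. On IBM (AIX) the data comes from a command, not a file.
MESSAGES_SOURCES = {
    "Linux": ("file", "/var/log/messages"),
    "SunOS": ("file", "/var/adm/messages"),
    "HP-UX": ("file", "/var/adm/syslog/syslog.log"),
    "AIX": ("command", "/bin/errpt -a > messages.out"),
}


def messages_source(system=None):
    """Return (kind, location) for the given OS name, or for the local host."""
    name = system or platform.system()
    if name not in MESSAGES_SOURCES:
        raise ValueError(f"no messages location known for {name!r}")
    return MESSAGES_SOURCES[name]
```

Calling messages_source() with no argument looks up the local host; passing a name such as "SunOS" looks up another platform.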

Common Causes of OCSSD Evictions

  • Network failure or latency between nodes. By default, it takes 30 consecutive missed check-ins (determined by the CSS misscount) to cause a node eviction.
  • Problems writing to or reading from the CSS voting disk. If the node cannot perform a disk heartbeat to the majority of its voting files, then the node will be evicted.
  • A member kill escalation. For example, a database LMON process may request CSS to remove an instance from the cluster via the instance eviction mechanism. If this request times out, it can escalate to a node kill.
  • An unexpected failure of the OCSSD process. This can be caused by any of the above issues or by something else.
  • An Oracle bug.
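
The first two conditions in the list above can be expressed as simple checks. The sketch below only models the rules as stated here; it is not how ocssd implements them. The function names are hypothetical, and representing check-ins as a sequence of booleans (True = received, False = missed) is an assumption made for illustration.

```python
DEFAULT_MISSCOUNT = 30  # default consecutive missed check-ins, per the text above


def exceeds_misscount(checkins, misscount=DEFAULT_MISSCOUNT):
    """True if `checkins` contains a run of at least `misscount` consecutive misses.

    `checkins` is an iterable of booleans: True = received, False = missed.
    """
    run = 0
    for received in checkins:
        # A received check-in resets the run; a miss extends it.
        run = 0 if received else run + 1
        if run >= misscount:
            return True
    return False


def has_voting_majority(reachable, total):
    """True if the node can heartbeat to a strict majority of its voting files."""
    if total <= 0 or not 0 <= reachable <= total:
        raise ValueError("invalid voting file counts")
    return reachable > total // 2
```

For example, with three voting files a node that can still reach two of them keeps its majority, while a node that can reach only one does not. Likewise, 29 misses followed by one successful check-in resets the count, so the misscount threshold is not reached.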

Common Causes of CSSDAgent or CSSDMonitor Evictions

  • An OS scheduler problem. For example, the OS is locked up in a driver or hardware, or there is excessive load on the machine, preventing the scheduler from behaving reasonably.
  • One or more threads within the CSS daemon are hung.
  • An Oracle bug.
