When enabling ZooKeeper quorum-based HA via the ZKFC service, you may observe that the default fencing method (i.e., sshfence) alone is not sufficient. The fencing methods and the relevant failure scenarios are discussed below:
dfs.ha.fencing.methods - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover
The fencing methods used during a failover are configured as a newline-separated list (one method per line), and they are attempted in order until one indicates that fencing has succeeded. Two methods ship with Hadoop: shell and sshfence. For information on implementing your own custom fencing method, see the org.apache.hadoop.ha.NodeFencer class.
sshfence - SSH to the Active NameNode and kill the process
The sshfence option SSHes to the target node and uses fuser to kill the process listening on the service’s TCP port. For this fencing option to work, it must be able to SSH to the target node without providing a passphrase. Thus, one must also configure the dfs.ha.fencing.ssh.private-key-files option, which is a comma-separated list of SSH private key files. For example:
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/exampleuser/.ssh/id_rsa</value>
</property>
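Because sshfence depends on non-interactive, key-based SSH and on fuser being present on the target, it can help to check both before relying on it. The following is a minimal sketch, not part of Hadoop: the function name check_passwordless_ssh and the SSH_BIN override are hypothetical, and the host name in the usage line is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: verify the sshfence preconditions from the node that runs ZKFC.
# check_passwordless_ssh and SSH_BIN are hypothetical names used for illustration.

check_passwordless_ssh() {
  local key_file="$1" target="$2"
  # BatchMode=yes makes ssh fail instead of prompting for a password or passphrase.
  # The remote command succeeds only if fuser is available on the target node.
  "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 \
    -i "$key_file" "$target" 'command -v fuser >/dev/null'
}
```

For example, `check_passwordless_ssh /home/exampleuser/.ssh/id_rsa exampleuser@nn1.example.com && echo ok` should print ok only when key-based login works without a prompt and fuser is installed on that host.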
shell - run an arbitrary shell command to fence the Active NameNode
The shell fencing method runs an arbitrary shell command. It may be configured like so:
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
</property>
The string between ‘(’ and ‘)’ is passed directly to a bash shell and may not include any closing parentheses.
If the shell command returns an exit code of 0, the fencing is considered successful. If it returns any other exit code, the fencing was not successful and the next fencing method in the list will be attempted.
Note: This fencing method does not implement any timeout. If timeouts are necessary, they should be implemented in the shell script itself (e.g., by forking a subshell to kill its parent after some number of seconds).
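One way to add such a timeout is sketched below. It is a variation on the subshell idea: a background watchdog terminates the fencing command (rather than the parent script) if it runs too long. The function name run_with_timeout is hypothetical and not part of Hadoop.

```shell
#!/usr/bin/env bash
# Sketch of a self-imposed timeout for a fencing command.
# run_with_timeout is a hypothetical helper, not a Hadoop API.

run_with_timeout() {
  local limit="$1"; shift
  "$@" &                                  # run the real fencing command in the background
  local cmd_pid=$!
  # Watchdog subshell: after $limit seconds, terminate the command if it is still running.
  ( sleep "$limit"; kill -TERM "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
  local watchdog_pid=$!
  wait "$cmd_pid"
  local status=$?                         # 0 on success; non-zero on failure or when killed
  kill "$watchdog_pid" 2>/dev/null        # command finished first: stop the watchdog
  wait "$watchdog_pid" 2>/dev/null
  return "$status"
}

# Possible last line of a wrapper script (path and limit are placeholders):
#   run_with_timeout 30 /path/to/my/script.sh "$@"
```

If the command is killed by the watchdog, the wrapper returns a non-zero status, so the next fencing method in the list is attempted.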
Scenario #1
If the NameNode process is killed or the active NameNode host is rebooted, the ZKFC process on the standby node is able to transition the standby NameNode to the ‘active’ state using sshfence.
SSH works fine in this case, which allows the active NameNode to be fenced.
Scenario #2
If the active NameNode host is shut down, its network cable is unplugged, or it suffers a major hardware/power failure, the SSH connection cannot be established.
Because SSH fails, fencing of the active NameNode fails, and the HA failover never completes.
Best Approach
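To avoid Scenario #2, configure both fencing methods. ZKFC tries them in the order listed: if the first one fails, the second one is tried.
If sshfence fails, shell(/bin/true) always exits with 0, so fencing is reported as successful and the failover can proceed. Note that /bin/true does not actually stop the old NameNode; it only tells ZKFC that fencing succeeded.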
Example:
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence
shell(/bin/true)
</value>
</property>
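The "try each method in order until one succeeds" behaviour can be illustrated with a small shell sketch. This only mimics the ordering logic described above and is not Hadoop's NodeFencer implementation; the function name try_fencing_methods is hypothetical, and "exit 1" stands in for an sshfence attempt against an unreachable host.

```shell
#!/usr/bin/env bash
# Illustration only: ordered fallback between fencing methods.
# try_fencing_methods is a hypothetical name, not a Hadoop API.

try_fencing_methods() {
  local method
  for method in "$@"; do
    echo "trying: $method"
    # A method succeeds only if its command exits with 0.
    if bash -c "$method"; then
      echo "fencing reported success by: $method"
      return 0
    fi
    echo "failed: $method"
  done
  echo "all fencing methods failed"
  return 1
}

# First method fails (stand-in for sshfence when the host is down),
# then /bin/true exits 0 and the loop stops.
try_fencing_methods "exit 1" "/bin/true"
```

With only the failing method in the list, the function returns non-zero, which corresponds to the stuck failover in Scenario #2.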
I tested this approach and it worked fine, especially when the primary NameNode host was down due to a major hardware/power failure, as described in Scenario #2.