I recently came across an error where nodes in a Windows Server 2012 R2 cluster stopped being active members of the cluster and continually cycled round trying to join again.
We were seeing errors:
Event 1070: Failover cluster nodes must have the ability to start the Cluster service, form a cluster (when a given node starts but no other nodes are up) and join a cluster (when a given node starts and discovers that one or more nodes are already up). This requires that certain conditions be met, for example, failover cluster nodes must run compatible versions of the operating system.
Event 1145: Cluster resource <resource> timed out. If the pending timeout is too short for this resource consider increasing the pending timeout value.
These appeared in the event log over and over again.
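If you want to see how often these events are recurring on a node, something like the following may help. It is only a sketch: it assumes the cluster events are written to the System log on the affected node, and the 50-event limit is an arbitrary choice.

```powershell
# Assumption: events 1070 and 1145 are logged to the System log on this node.
$filter = @{
    LogName = 'System'
    Id      = 1070, 1145
}

# Get-WinEvent raises an error when nothing matches, so suppress that case.
Get-WinEvent -FilterHashtable $filter -MaxEvents 50 -ErrorAction SilentlyContinue |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, ProviderName, Message -Wrap
```

Run it from an elevated PowerShell session on the node that keeps dropping out.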
We did some basic troubleshooting to check network connectivity, configuration, etc. One of our troubleshooting steps was to evict a node and use the Clear-ClusterNode PowerShell command to try to clear any configuration issues on the node. When we tried to add the node back to the cluster we were presented with a new error: “Event ID: 7024 The Cluster Service service terminated with the following service-specific error: Keyset does not exist”.
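For reference, the evict, clear and re-add sequence looks roughly like this. It is a sketch rather than the exact commands we ran: the node name NODE3 is hypothetical, and it assumes the FailoverClusters module is available and that the first and last commands are run from a node that is still a healthy member of the cluster.

```powershell
Import-Module FailoverClusters

# Hypothetical node name; replace with the node that is misbehaving.
$node = 'NODE3'

# Evict the node from the cluster (run on a healthy cluster member).
Remove-ClusterNode -Name $node -Force

# Clear the leftover cluster configuration from the evicted node.
Clear-ClusterNode -Name $node -Force

# Try to add the node back into the cluster.
Add-ClusterNode -Name $node
```

In our case it was the final step that failed with the "Keyset does not exist" error.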
We did some more digging and found that the permissions on the folder and files within C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys were largely missing. This is the folder that holds the certificate keys that the cluster uses to connect. Rather than change all 38 files in the folder manually, we came up with this wee script:
## This grants ownership of the folder and the files below it to the Administrators group.
takeown /f "C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys" /R /A
## This grants the SYSTEM and Administrators accounts Full Control of the MachineKeys folder and, through the (CI)(OI) inheritance flags, all its subfolders/files, and removes any inherited permissions from the folder.
icacls "C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys" /INHERITANCE:R /GRANT ("SYSTEM" + ':(CI)(OI)F')
icacls "C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys" /INHERITANCE:R /GRANT ("Administrators" + ':(CI)(OI)F')
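To check what the permissions on the key files look like, before or after running the script, a quick listing along these lines may be useful. This is a sketch and not part of our original fix: it assumes the default MachineKeys location and an elevated session, and files you cannot read will show up as errors instead of rows.

```powershell
# Assumption: the default MachineKeys location on the node.
$machineKeys = 'C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys'

# List each key file with its owner and the accounts named in its ACL.
Get-ChildItem -Path $machineKeys -Force | ForEach-Object {
    $acl = Get-Acl -Path $_.FullName
    [pscustomobject]@{
        File       = $_.Name
        Owner      = $acl.Owner
        Identities = ($acl.Access.IdentityReference.Value | Sort-Object -Unique) -join ', '
    }
} | Format-Table -AutoSize -Wrap
```

Files with an empty Identities column, or without SYSTEM and Administrators listed, are the ones to look at.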
Once the permissions were set as above, we were able to add the node back into the cluster successfully, and all four nodes were active again.
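A quick way to confirm that every node is back is to list the node states from any cluster member. This is a minimal sketch and assumes the FailoverClusters module is installed on the machine you run it from.

```powershell
Import-Module FailoverClusters

# Show each node in the local cluster with its current state.
Get-ClusterNode | Format-Table Name, State -AutoSize
```

A healthy node reports a State of Up.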