On a few occasions recently, we have been investigating an issue only to find that one or more of our on-premises SOLR instances was not running. In this article, I will outline the steps we have taken to monitor these services and trigger an alert if any of them is found not to be running.
The Issue
We currently run SOLR on our own virtual machines: one VM hosts the master node, and a further two host slave instances that replicate the data and make it available to our Sitecore CD front ends. On a small number of occasions, normally after automatic Azure patching, the SOLR services have failed to start again after the machines rebooted, despite being set to start automatically.
The Solution
Our solution to this problem came in two parts:
- The first part was to create a simple service that runs on each VM with the sole purpose of monitoring the SOLR service. Every 10 minutes, it checks whether the SOLR service is running; if the service is found to be healthy, it registers a custom event (along the lines of "SOLR on VM1 is RUNNING") in Application Insights.
- The second part was to install the Application Insights for Sitecore Module (with Alerting). We then used the predefined 'Custom Event' alert to continually check whether the given event, "SOLR on VM1 is RUNNING", has been recorded in the last 10 minutes. If that check ever returns false, an alert is triggered and an email is sent to subscribers.
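The first part can be sketched roughly as follows. This is a minimal illustration in Python, not our actual service: the Windows service name "solr", the function names, and the `track_event` callable (a stand-in for whatever Application Insights client is used to record a custom event) are all assumptions made for the example. It reads the service state from the output of the Windows `sc query` command.

```python
import subprocess
import time
from typing import Callable

# Check every 10 minutes, matching the interval described above.
CHECK_INTERVAL_SECONDS = 10 * 60


def query_windows_service(service_name: str) -> str:
    """Return the raw text output of `sc query <service_name>`."""
    result = subprocess.run(
        ["sc", "query", service_name],
        capture_output=True,
        text=True,
        check=False,  # a missing service should not raise, just report not running
    )
    return result.stdout


def is_service_running(service_name: str, run_query: Callable[[str], str]) -> bool:
    """Return True if the STATE line of the query output reports RUNNING."""
    for line in run_query(service_name).splitlines():
        if line.strip().startswith("STATE"):
            return "RUNNING" in line
    return False  # no STATE line found, treat as not running


def event_name(vm_name: str) -> str:
    """Build the custom event name, e.g. "SOLR on VM1 is RUNNING"."""
    return f"SOLR on {vm_name} is RUNNING"


def check_once(
    vm_name: str,
    service_name: str,
    run_query: Callable[[str], str],
    track_event: Callable[[str], None],
) -> bool:
    """Run one check and record the event only when the service is healthy."""
    running = is_service_running(service_name, run_query)
    if running:
        track_event(event_name(vm_name))
    return running


def monitor(vm_name: str, service_name: str, track_event: Callable[[str], None]) -> None:
    """Loop forever, checking the service at the configured interval."""
    while True:
        check_once(vm_name, service_name, query_windows_service, track_event)
        time.sleep(CHECK_INTERVAL_SECONDS)


# Hypothetical usage, with print standing in for the telemetry client:
# monitor("VM1", "solr", print)
```

Note that the sketch records nothing when the service is down. The alert relies on the absence of the event rather than on an explicit failure message, so it should also fire if the VM or the monitoring service itself stops.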
With these two pieces of functionality, we are able to monitor all three of our SOLR instances and be alerted quickly if any of them fails to start. This allows us to investigate and fix any issues before they are reported.
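To make the alerting logic concrete, here is a small sketch of the "has the event been seen in the last 10 minutes?" check applied across several instances. It only illustrates the idea and is not how the Application Insights for Sitecore Module is implemented; the function name, the VM names other than VM1, and the `last_seen` mapping (VM name to the time its event was last recorded) are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Mapping

# The alert looks for the event within the last 10 minutes.
ALERT_WINDOW = timedelta(minutes=10)


def instances_missing_event(
    last_seen: Mapping[str, datetime],
    expected_vms: Iterable[str],
    now: datetime,
) -> List[str]:
    """Return the VMs whose RUNNING event is absent from the alert window."""
    missing = []
    for vm in expected_vms:
        seen = last_seen.get(vm)
        # Never seen, or seen too long ago: both count as missing.
        if seen is None or now - seen > ALERT_WINDOW:
            missing.append(vm)
    return missing
```

Any VM returned by this check would be one for which an alert email is due.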