I want to share an issue we faced every time we made a deployment to our production environment. After each deployment, our content management environment would become unresponsive. This often led to the full deployment failing, because the unresponsiveness caused the final Unicorn sync steps to fail.
This was a bit of a strange one, because the problem didn't always happen and the level of disruption varied from deployment to deployment. When it caused the deployment to fail, we had to decide to run with an incomplete deployment and run the Unicorn syncs manually. It also led to a much longer deployment window and a frustrated marketing team, waiting for approval to get back into Sitecore.
Investigating the issue
At the institution where I work, we work on a two-week deployment window: new features are developed in the first week, then released for testing in the second, before sign-off to release. With the issue only happening during the deployment, it was hard to investigate before or after. It also left us with a sense of foreboding in advance of each deployment, with little confidence in the deployment process.
There wasn't a lot to see in the Sitecore logs, other than lots of entries like the ones below:
"HttpModule is being initialized" and
"A connection was successfully established with the server, but then an error occurred during the pre-login handshake.".
When monitoring the various instances in our environment, we could see a dramatic spike in CPU on the lead machine in our CA SQL Cluster.
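To get a feel for how often the two log messages quoted earlier appear around a deployment, a small tally like the one below can help. This is a minimal sketch: the sample lines and their layout are made up for illustration, and only the two message fragments come from the logs described above.

```python
from collections import Counter

# Message fragments taken from the log entries quoted earlier in the post.
PATTERNS = (
    "HttpModule is being initialized",
    "error occurred during the pre-login handshake",
)


def count_log_patterns(lines, patterns=PATTERNS):
    """Count how many lines contain each pattern (case-sensitive substring match)."""
    counts = Counter({pattern: 0 for pattern in patterns})
    for line in lines:
        for pattern in patterns:
            if pattern in line:
                counts[pattern] += 1
    return counts


# Hypothetical sample lines; a real log would be read from a file instead.
sample = [
    "INFO  HttpModule is being initialized",
    "ERROR A connection was successfully established with the server, "
    "but then an error occurred during the pre-login handshake.",
    "INFO  HttpModule is being initialized",
    "INFO  Some unrelated entry",
]
counts = count_log_patterns(sample)
for pattern, total in counts.items():
    print(f"{total:>4}  {pattern}")
```

Running the same tally over the log from a quiet period gives a baseline to compare against.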
We sent off a memory dump to Sitecore. Their analysis showed a bottleneck resolving media items, with some 600 managed threads waiting to access the web database. They also noted:
No abnormal load on Sitecore instance.. However existing threads block on database access and ThreadPool is forced to create new worker threads to handle incoming requests, which causes throttling
Mariia Hubenko
This provided some very useful insights, but it didn't ultimately tell us what the problem was. Why were all these requests getting stacked up?
The solution
To find the answer, we needed to take a step back and ask: what could be causing so many requests to the CA web database? The answer lay in a slight misconfiguration of our Azure Content Delivery Network (CDN).
When configuring a CDN to work with Sitecore, the general approach is to overwrite the xxx, so that when an image is output as part of a web component, its source is the path to the CDN rather than a path to the media library. The web browser makes the request to the CDN. If the CDN does not have that media resource, it calls back to the specified origin for a copy of the resource, stores it and then fulfills the request.
The problem in our situation was that the origin had been specified as our Content Management (CM) Sitecore instance.
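The pull-from-origin behaviour described above can be sketched as a toy cache. This is only an illustration of the general pattern, not how Azure CDN is implemented; the class name, the `fake_origin` function and the media path are all made up for the example.

```python
class PullThroughCache:
    """Toy model of a CDN edge: serve from cache, fetch from the origin on a miss."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # callable: path -> bytes
        self._store = {}
        self.origin_hits = 0  # how many times the origin had to be called

    def get(self, path):
        if path not in self._store:
            # Cache miss: call back to the origin, then keep a copy.
            self.origin_hits += 1
            self._store[path] = self._fetch(path)
        return self._store[path]


def fake_origin(path):
    """Hypothetical stand-in for the origin server."""
    return f"bytes for {path}".encode()


cdn = PullThroughCache(fake_origin)
first = cdn.get("/~/media/image.jpg")   # miss: goes to the origin
second = cdn.get("/~/media/image.jpg")  # hit: served from the cache
print(cdn.origin_hits)
```

The key point is that every miss turns into a request against whatever host is configured as the origin, so the choice of origin decides which server absorbs that traffic.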
Deployment scenario: the CM instance goes offline. Requests are made to the CD instances, which output images referring to the CDN. In a small number of cases, the CDN does not have the resource, so it sends a request (or 600) to the CM instance for the resource. Sitecore is still initialising at this point. Once threads become available, they are overwhelmed by these outstanding requests, and the instance becomes unresponsive.
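A rough way to picture the pile-up is a tick-by-tick backlog model. This is a deliberately simplified sketch: every number in it is invented for illustration, and it ignores the thread and database behaviour that the memory dump actually showed.

```python
def simulate_origin_backlog(startup_ticks, misses_per_tick, served_per_tick, total_ticks):
    """Return the number of pending origin requests at the end of each tick."""
    pending = 0
    history = []
    for tick in range(total_ticks):
        pending += misses_per_tick  # CDN misses forwarded to the origin
        if tick >= startup_ticks:   # the origin only serves once it has initialised
            pending = max(0, pending - served_per_tick)
        history.append(pending)
    return history


# Made-up figures: 5 ticks of start-up, 10 misses arriving per tick,
# and capacity to serve 15 requests per tick once the origin is up.
history = simulate_origin_backlog(
    startup_ticks=5, misses_per_tick=10, served_per_tick=15, total_ticks=12
)
print(history)
```

Even with spare capacity once it is running, the origin in this model starts with a backlog and needs a long time to clear it, because new misses keep arriving while it works through the old ones.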
Conclusion
This was clearly a misconfiguration issue. However, I can see why someone setting this up for the first time might think to put the CA domain as the origin.
The way Sitecore was configured to use the CDN was to:
- display images on CD servers with paths to CDN
- display images on CM servers with paths to Media Library
So when faced with the question of what to use as the origin, it's not so strange to incorrectly pick the CM, because the above shows the CM as the environment using links to the Media Library (i.e. where the source images are).
But the point that was missed was that the CD also has access to the media library when called via "https://<domain>/~/media/image.jpg". It's only the generated HTML that references the CDN.
Therefore, with the CDN origin set to the CD domain, there would always be an origin endpoint waiting to serve requests from the CDN during a load-balanced deployment of green / blue production CD targets.
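The distinction between the URL written into the HTML and the URL the CDN fetches from can be sketched as two small functions. The host names and function names here are hypothetical placeholders, and this is not Sitecore's own link-generation code.

```python
CDN_HOST = "cdn.example.com"  # hypothetical CDN host name
CD_HOST = "www.example.com"   # hypothetical CD domain


def rendered_image_src(role, media_path):
    """URL written into the generated HTML for an image, by server role."""
    if role == "CD":
        return f"https://{CDN_HOST}{media_path}"  # CD pages point at the CDN
    if role == "CM":
        return media_path                         # CM pages use the media library path
    raise ValueError(f"unknown role: {role}")


def origin_fetch_url(origin_host, media_path):
    """URL the CDN requests from its origin when it has no copy of the resource."""
    return f"https://{origin_host}{media_path}"


path = "/~/media/image.jpg"
print(rendered_image_src("CD", path))   # what the browser is told to request
print(origin_fetch_url(CD_HOST, path))  # what the CDN requests on a miss
```

Only the first function changes between CM and CD. The second works against any host that serves the media library path, which is why the CD domain is a valid origin.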
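The difference between the two origin choices can be checked against a deployment timeline. The timeline below is invented for illustration, as are the instance names; it simply assumes the CM is down while the blue and green CD targets are swapped one at a time.

```python
# Each entry is one step of a hypothetical deployment: which instances are serving.
TIMELINE = [
    {"CM": True,  "CD-blue": True,  "CD-green": True},   # before deployment
    {"CM": False, "CD-blue": True,  "CD-green": False},  # green being deployed
    {"CM": False, "CD-blue": False, "CD-green": True},   # blue being deployed
    {"CM": True,  "CD-blue": True,  "CD-green": True},   # after deployment
]


def steps_without_origin(timeline, origin_members):
    """Return the step indexes at which no instance behind the origin is serving."""
    return [
        index
        for index, step in enumerate(timeline)
        if not any(step[member] for member in origin_members)
    ]


cm_gaps = steps_without_origin(TIMELINE, ["CM"])
cd_gaps = steps_without_origin(TIMELINE, ["CD-blue", "CD-green"])
print("CM as origin, steps with no origin:", cm_gaps)
print("CD as origin, steps with no origin:", cd_gaps)
```

With the CM as origin there are steps where a CDN miss has nowhere healthy to go. With the load-balanced CD domain as origin, at least one target is serving at every step.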