Domain not mapping through to Sitecore site in Kubernetes cluster

A while back, when deploying Sitecore to Kubernetes, I came across an issue that had me stumped for several weeks. It took a support ticket with Sitecore and a Sitecore Stack Exchange bounty to find a solution, so I thought it might be worth sharing again in case someone else has a similar issue.

Background

I had previously played around with Kubernetes, first containerizing our solution and then nailing down the scripts needed to spin up a K8s cluster that delivered our main Sitecore solution, using the latest copies of both Solr and SQL from backup. As far as I was concerned it worked… but then months went by before the powers that be decided that they wanted to explore Kubernetes further. When I came back to start tinkering again, I couldn’t get it to work?!

Some issues

There were a few issues relating to Kubernetes versions (no longer being available in Azure) and API versions in spec files (and the formatting issues that come with them). But despite fixing these, I could not get the domain to map through to the running container within my cluster.

Vanilla Install

I tried taking a step back and deploying the vanilla Sitecore instance, rather than our custom solution. Again, everything looked to be set up properly:

cm pod was deployed and shows TRUE for (Initialized / Ready / Containers Ready / Pod Scheduled)
kubectl get ingress returns (cd.globalhost, cm.globalhost, id.globalhost)
kubectl get services returns an entry for (nginx-ingress-ingress-nginx-controller LoadBalancer (valid External IP) 80:20269/TCP 443:32419)
Ran the script from the deployment document to generate TLS .crt and .key files
Deployed the secrets
Created a local host entry mapping cm.globalhost (and others) to the external load balancer IP address for nginx
All other pods are healthy
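
For anyone wanting to repeat these checks, they can be gathered into a small shell sketch. The helper name check_cluster and its pod-name argument are placeholders of my own invention, and the service name is the one from my cluster (shown above), so swap in whatever yours is called.

```shell
# Sketch: re-run the checks from the list above in one go.
# check_cluster is a hypothetical helper name; pass it the name of your cm pod.
check_cluster() {
  cm_pod="$1"

  # Pod conditions, one per line (Initialized / Ready / ContainersReady / PodScheduled)
  kubectl get pod "$cm_pod" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

  # Ingress hosts: expect cd.globalhost, cm.globalhost and id.globalhost
  kubectl get ingress

  # The controller service should be a LoadBalancer with an external IP
  kubectl get service nginx-ingress-ingress-nginx-controller

  # Every other pod should be healthy too
  kubectl get pods
}
```

Call it as check_cluster followed by the cm pod name reported by kubectl get pods.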

I found a great article that talks through some tips on how to troubleshoot connectivity to your Kubernetes cluster:

https://medium.com/@ManagedKube/kubernetes-troubleshooting-ingress-and-services-traffic-flows-547ea867b120

Based on this article, everything seemed fine until the point of exec’ing into the ingress pod and running:

curl -H "HOST: cm.globalhost" localhost

Which resulted in:

<html>
<head><title>308 Permanent Redirect</title></head>
<body>
<center><h1>308 Permanent Redirect</h1></center>
<hr><center>nginx</center>
</body>
</html>

Interestingly, if I added the following annotation to the ingress.yaml specification file:

nginx.ingress.kubernetes.io/ssl-redirect: "false"

I got a successful 200 response, with the "Welcome to sitecore" HTML.
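
It is worth noting that ingress-nginx redirects HTTP to HTTPS with a 308 by default when TLS is configured for an ingress, so the redirect on its own may not indicate a fault. A minimal sketch for checking where the redirect points and for repeating the request over HTTPS follows. The function names are hypothetical, and it assumes the commands are run inside the ingress controller pod like the curl above.

```shell
# Sketch: run inside the ingress controller pod, as with the curl above.

# Print only the response headers, to see where the 308 points (Location header)
show_redirect() {
  curl -sI -H "Host: cm.globalhost" http://localhost
}

# Repeat the request over HTTPS and print just the status code.
# -k skips certificate verification, since the generated certificate may not
# be trusted inside the pod; --resolve pins cm.globalhost to this pod so the
# hostname is used for both SNI and the Host header.
probe_https() {
  curl -sk -o /dev/null -w '%{http_code}\n' \
    --resolve cm.globalhost:443:127.0.0.1 https://cm.globalhost/
}
```

If probe_https returns a 200 while the site is still unreachable from outside, the controller is routing correctly and the problem is more likely to sit between the load balancer and the nodes.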

This all seemed to point to an issue surrounding the ingress controller, but still I had no real idea as to what the cause was.

The solution

At this stage I went back to the installation guide and went through it with a fine-tooth comb, looking for anything that I might have missed… and that’s when I noticed a line missing from the PowerShell script that creates the ingress controller:

--set controller.service.externalTrafficPolicy=Local

Adding this setting to the install script magically got everything working again!
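
As a rough sketch of where the flag sits: the release name nginx-ingress and the chart ingress-nginx/ingress-nginx below are assumptions inferred from the service name in my cluster, and the script in the Sitecore guide may pass further --set flags that are left out here. The second function is an alternative I did not use myself, which patches an existing service in place instead of reinstalling.

```shell
# Sketch only: assumed release and chart names; other --set flags from the
# installation guide's script are deliberately omitted.
install_ingress_controller() {
  helm upgrade --install nginx-ingress ingress-nginx/ingress-nginx \
    --set controller.service.externalTrafficPolicy=Local
}

# Alternative: switch an already-deployed service to Local without reinstalling.
# Pass the service name, e.g. nginx-ingress-ingress-nginx-controller.
set_policy_local() {
  kubectl patch service "$1" \
    -p '{"spec":{"externalTrafficPolicy":"Local"}}'
}
```

Bear in mind that a patched service will revert on the next helm upgrade unless the same value is also passed to helm.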

When I initially worked on the solution (months / years prior), this setting was not mentioned in the Sitecore install guides, so it was missing from the resulting solution. In the most recent guide (10.3), however, it had been added without my spotting it.

I queried this with Sitecore support and received the following response:

After our investigation, we can confirm that "externalTrafficPolicy = Local" is needed for correct traffic transferring.
ExternalTrafficPolicy parameter was added to the 2.15.0 NGINX ingress controller version in the PR. This should be the explanation why previously it was not needed to adjust this setting during deployment.
Tamara Proshenko

So what does the externalTrafficPolicy do?

According to the Kubernetes website, the setting can be either Cluster or Local. The choice is a trade-off between preserving the client IP and spreading traffic more evenly across the cluster.

.spec.externalTrafficPolicy - denotes if this Service desires to route external traffic to node-local or cluster-wide endpoints. There are two available options: Cluster (default) and Local. Cluster obscures the client source IP and may cause a second hop to another node, but should have good overall load-spreading. Local preserves the client source IP and avoids a second hop for LoadBalancer and NodePort type Services, but risks potentially imbalanced traffic spreading.
https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/
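
To see the effect of the setting on a live cluster, something like the sketch below may help. The function names are hypothetical, and the label selector assumes the standard labels applied by the ingress-nginx chart. With Local, Kubernetes also allocates a health check node port (spec.healthCheckNodePort) on a LoadBalancer service, so that the load balancer can tell which nodes have a local endpoint; under Cluster that field is empty.

```shell
# Sketch: print the traffic policy and the health check node port for a service.
# The second value is empty when the policy is Cluster.
describe_traffic_policy() {
  kubectl get service "$1" \
    -o jsonpath='{.spec.externalTrafficPolicy}{" "}{.spec.healthCheckNodePort}{"\n"}'
}

# List the nodes the ingress controller pods run on. With Local, only these
# nodes receive external traffic for the service.
controller_pod_nodes() {
  kubectl get pods -o wide --selector app.kubernetes.io/name=ingress-nginx
}
```

For example, describe_traffic_policy nginx-ingress-ingress-nginx-controller prints the policy followed by the port number, if one has been allocated.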

Sitecore's own documentation includes the setting in its example, then goes on to explain:

To preserve the client source IP address, set the ingress controller externalTrafficPolicy setting to Local.
Sitecore

Both sources seem to imply that the choice of setting is optional… so I am unsure how failing to set the property resulted in a completely inaccessible Kubernetes cluster.

Summary

If you have landed on this page (and read to the bottom) then I am guessing you may have experienced a similar issue (if so let me know in the comments below). Hopefully this article might help you get traffic flowing into your cluster.

I guess the lesson to be learnt here is to pay close attention to the installation guide, even if you think it looks the same as a previous one you have followed!

If anyone can share some insight into why omitting the setting leads to the result I experienced, I would be interested to hear what you think.