The size of our master database has continued to grow at an exceptional rate ever since we initially installed our Sitecore 8.2 solution. Over the years (and two upgrades) the size of the database has become unmanegable.
I believe intially the issue may have been a poorly planned batch process that took a snapshot PDF of a large group of pages, which were stored in the media library for reporting / legal purposes. This daily process was leaving large numbers of orphaned files, as new ones were created and old ones not handled properly.
At some point in time, this process was refactored and the PDFs were instead saved to Azure Data Lake, which is a far more sensible and cost effective place to store large swaths of files/data that does not need to be access quickly.
database that was in excess of 1.3 Terrabytes
With the original problem solved, we still left with a database that was in excess of 1.3 Terrabytes (with the data having to be split across multiple data drives). In a new Sitecore world of containers and kubernetes, this simply was not feasible, so something needed to be done.
These blog posts will look at the steps we took to clean up the media library.
The first step was to do a bit of spring cleaning in the media library. This invloved running two scripts to Remove all Items that are not referenced by others and then Remove all items that dont have media attached.
The next step was to cleanup the database and remove orphaned blobs. This didnt prove as straight forward as running the OTB commands from the control panel.
Finally, with all the media items removed that we no longer needed, we looked to move all our media into blob storage. This involved using the migration tools provided by Sitecore, but again, it wasnt as straight forward as you initially thought.
Summary
The result after taking the above actions was a master database that when backed up to a .BAK file was less than 2 Gigabytes. This as you might imagine is far more manageable and now allows us to create a SQL snapshot image of our production master database.
This is very handy for replicating the production environment locally, but essential for another development. To have available a set of images to rapidly deploy to Kubernetes cluster as a 'contingency' in the event of a major disruption or cyber attack on the production environment.