A: Become a Grandmaster at Whack-a-mole!
“…used with reference to a situation in which attempts to solve a problem are piecemeal or superficial, resulting only in temporary or minor improvement…”
Much (very much!) has been written about the security threats posed by the Spectre and Meltdown vulnerabilities in modern processors. And as if these vulnerabilities weren’t serious enough already, a quick Google search yields headlines like: “Meltdown-Spectre flaws: We’ve found new attack variants, say researchers – Intel and AMD may need to revisit their microcode fixes for Meltdown and Spectre.”
A paper by Caroline Trippel et al., MeltdownPrime and SpectrePrime: Automatically synthesized attacks exploiting invalidation-based coherence protocols, explains that new variants of Meltdown and Spectre are easily created and leak the same type of information with the same precision. In fact, “Averaged over 100 runs, [they] observed SpectrePrime to achieve the same average accuracy as Spectre on the same hardware — 97.9 percent for Spectre and 99.95 percent for SpectrePrime”.
As most of us know by now, the latest software and hardware fixes from Microsoft, Intel, and others carry performance overhead, and while these fixes may be effective now (despite the performance hit), Intel and AMD will have to respond to the variants likely to appear soon.
The Linux bad news
Increasingly, it seems that when it comes to cyber security, there’s never a dull moment. The news for those over on the Linux side of things also took a turn for the worse, as this headline implies: “Linux Meltdown patch: ‘Up to 800 percent CPU overhead’, Netflix tests show”.
Like the Intel and AMD fixes, the Linux fix – known as KPTI (kernel page-table isolation) – may also carry an unacceptable performance overhead: the largest CPU performance hit Netflix performance analyst Brendan Gregg says he has ever seen.
Exactly how much a given system is impacted apparently depends on the characteristics of the application. Applications with high system call rates, such as proxies and databases doing lots of tiny I/O, suffer the largest losses, and the impact also rises with higher context switch and page fault rates.
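If you want a rough feel for the syscall penalty on a Linux box, a tiny microbenchmark run before and after patching will show it. Here is a minimal sketch in Python; the interpreter adds overhead of its own, so don’t read much into the absolute number – compare only the before/after delta on the same machine:

    # syscall_bench.py - rough gauge of per-syscall cost. Run it before and
    # after applying the KPTI patch and compare the two results.
    import os
    import time

    N = 1_000_000
    fd = os.open("/dev/zero", os.O_RDONLY)

    start = time.perf_counter()
    for _ in range(N):
        os.read(fd, 1)  # one real system call per iteration
    elapsed = time.perf_counter() - start

    os.close(fd)
    print(f"{N} read() calls in {elapsed:.2f}s "
          f"(~{elapsed / N * 1e9:.0f} ns per call, interpreter overhead included)")

KPTI makes every system call pay for an extra page-table switch, so the delta grows in step with exactly the rates described above.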
Running with Windows-based servers?
No good news to report here either – Microsoft recently had to issue an emergency update to disable an Intel microcode update that was causing server reboots and instability(!) This is starting to look like mayhem. Meanwhile, Microsoft also issued a patch that makes mitigation-controlling registry tweaks available – for a reason. Paraphrasing: “…the slowdown can be particularly bad with IO (input-output) applications on a Windows Server system. On Windows Server systems, Microsoft says you want ‘to evaluate the risk of untrusted code for each Windows Server instance, and balance the security versus performance tradeoff for your environment.’ In other words, you may want to disable the patch on some server systems if you’re sure they won’t run untrusted code. Bear in mind that even JavaScript code running in a web browser or code running inside a virtual machine could exploit these bugs. The usual sandboxes that restrict what this code can do won’t fully protect your computer…”
Not sure about you, but I can foresee more than a few issues going forward with this approach. For example, if in the future you need to load software or security patches into environments that are new or unfamiliar to you, which ones will you install, and which will you skip? Have you identified and tested the environments where you’ve previously applied registry tweaks? (A quick inventory check, sketched below, can help.)
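If Python is available on your Windows hosts, a short script can at least take inventory of which machines have the override values set. The sketch below reads the FeatureSettingsOverride values Microsoft documented in its Windows Server guidance; verify the current guidance before acting on the output:

    # check_overrides.py - report whether the Spectre/Meltdown mitigation
    # override values are present on this Windows host. Based on the registry
    # values Microsoft documented in its Windows Server guidance.
    import winreg

    KEY = r"SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"

    def read_value(name):
        try:
            with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY) as k:
                value, _ = winreg.QueryValueEx(k, name)
                return value
        except FileNotFoundError:
            return None  # value not set: Windows falls back to its defaults

    for name in ("FeatureSettingsOverride", "FeatureSettingsOverrideMask"):
        value = read_value(name)
        print(f"{name}: {'not set' if value is None else hex(value)}")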
Running with an IBM Mainframe?
As is typical of IBM, their approach to such serious matters is to stay pretty much tight-lipped until they are as certain as possible that they have made any and all necessary changes. Based on my own efforts to get a feel for which of their systems are affected and which are not, there seems to be some good news and some not-so-good news. The following seems (currently, at least) to summarize their stance:
IBM strongly advises that our customers make performance assessments of patched operating systems which run any messaging workload, including MQ queue managers, clients, and Managed File Transfer agents, to ensure that the systems are still capable of achieving their peak processing objectives prior to applying any operating system patches to messaging systems. End-to-end assessments should also be completed where messaging throughput depends on other components such as application processing or databases, or systems which do not require patching…
They also explain that their recent program temporary fixes (PTFs) are needed to overcome the Meltdown and Spectre vulnerabilities on POWER8 systems running the IBM i operating system; however, the fixes do not appear to be needed on other Power Systems models.
The wider picture
It’s chaotic! So many moving parts are affected by these vulnerabilities – not just the chips, but firmware and software components as well – that fixes applied to one component have not been properly tested against the other components they must interact with, especially given the less-than-cohesive, let alone coordinated, approach from the respective vendors.
The internet is full of information on the messy legacy of this whole affair – here’s one example: Meltdown and Spectre FAQ. There have been many attempts to address the issues, and none has yet provided a satisfactory solution. You can also find extensive lists of all the processors affected – here’s one: The Complete List Of CPUs Affected By Meltdown + Spectre. It doesn’t paint a pretty picture.
Really, it looks like a game of whack-a-mole that you can’t ever win…
A way forward?
Is there an immediate and perfect solution to this problem? No. All we can hope to do in the enterprise datacenter at this time is to mitigate. Mitigate and measure the effects of mitigation. We’ve talked about mitigation here, and it’s not a very warm-and-fuzzy discussion – nothing works to everyone’s satisfaction.
We haven’t talked about measurement yet. For any Meltdown/Spectre mitigation you apply, you need to measure its effects on your everyday processes. Do you have monitoring tools in place? If so, great – hopefully you’re already using them to chart the effects of each Meltdown/Spectre “fix” you apply in the datacenter. The truth is that you need to measure the effects on all systems – not just your Linux systems or your Microsoft servers, and not just your mainframes – but any and every system to which fixes are applied.
If you’re armed to the teeth with monitoring tools for every platform in your datacenter, then great; you’re in good shape. If not, you need a multiplatform monitoring solution, or better yet, a multiplatform monitoring/reporting and business intelligence solution.
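Even before you pick a solution, a crude sampler can start building the baseline you’ll need. A minimal sketch, assuming the third-party psutil package (pip install psutil) is acceptable in your shop:

    # cpu_sampler.py - log CPU utilization samples to a CSV so pre- and
    # post-patch behaviour can be compared on any platform psutil supports.
    import csv
    import time
    import psutil

    SAMPLES = 60      # one minute of data
    INTERVAL = 1.0    # seconds per sample

    with open("cpu_samples.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "cpu_percent"])
        for _ in range(SAMPLES):
            # cpu_percent() blocks for INTERVAL and returns utilization over it
            writer.writerow([time.time(), psutil.cpu_percent(interval=INTERVAL)])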
A multiplatform monitoring/reporting and business intelligence solution can tell you what is happening performance-wise under any Meltdown/Spectre fix, and how much that fix is costing you over and above your pre-mitigation costs.
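The cost side is back-of-the-envelope arithmetic once you have the measurements. A sketch, assuming compute cost scales roughly linearly with CPU utilization (a simplification), with all figures hypothetical:

    # mitigation_cost.py - rough added monthly cost implied by a fix's overhead.
    def added_cost(baseline_cpu_pct, patched_cpu_pct, monthly_compute_cost):
        # Assumes spend scales linearly with CPU utilization.
        overhead = (patched_cpu_pct - baseline_cpu_pct) / baseline_cpu_pct
        return overhead * monthly_compute_cost

    # Hypothetical example: utilization rose from 40% to 46% after patching,
    # on a platform costing $20,000/month to run.
    print(f"${added_cost(40, 46, 20_000):,.0f} per month")  # -> $3,000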
There are a few of these solutions out there; you should look into them. Knowledge is power!
Neil’s focus is on developing cloud technology and big data. You can often find him advising CxOs on cloud strategy.