Curious about what happened while you were celebrating the holidays? Well, what may be the worst CPU bug ever found was disclosed. It affects nearly every computing device, including laptops, desktops, tablets, smartphones, and even cloud computing systems.
Happy New Year, CIO! Your agenda has been hijacked… at least for a little while.
How It Impacts You
Here’s a paraphrase from Botmetric about the bug:
A group of researchers reported a serious vulnerability in the CPU architecture of Intel, ARM, and AMD which has impacted most of the devices across the world.
“Most of the devices around the world.” Talk about newsworthy.
Meltdown breaks the most fundamental isolation between the user application and the operating system. This attack allows a program to access the memory and all the secrets of other programs and the operating system.
Spectre, on the other hand, breaks the isolation between different applications. It allows an attacker to trick error-free programs (which follow best practices) into leaking their secrets. In fact, the safety checks required by those best practices actually increase the attack surface.
Spectre is significantly harder to exploit than Meltdown.
So, what does this mean in layman’s terms? Well, imagine that you check into a hotel room. The door has a lock and you have four walls. Safe, right? Well, these exploits would allow someone to see through all the walls of the hotel. They could see what is going on in each and every room.
The good news is that taking advantage of these exploits is very complex. The bad news is that they impact just about every machine within your health system.
What to Do
When I became the CIO of a 5,000-server, 18,000-computer system in 2011, several things had to be taken care of before any strategic work could even be discussed. Over 50% of the servers in our data center were end-of-life, and the data centers themselves needed upwards of $5M in repairs and replacement.
This is similar to what the healthcare CIO is facing today with the revelation of these bugs. Strategy has to be put on hold to simply fix the hole in the ship.
At one point, the release said that every CPU would have to be replaced in order to remediate the security issue. That was published on January 3; it has since been removed. I’m not sure if it was removed because another method was found, or if there was a realization by Intel and others that this would bankrupt one of the country’s largest chip manufacturers.
I’d like to suggest a longer view of the situation. Consider this an opportunity. If right now you are looking at significant man-hours and purchases to remediate this issue, you will want to address not only this problem but also plan for the next one… and there will be a next one.
In our particular case, prior to 2011, every server we added only increased the complexity, workload, and vulnerability of the system. Maybe you feel the same way. So, choose to do things differently moving forward.
You are going to purchase servers this year. So, make it part of a strategic plan to change the way you handle compute and storage in your environment. Create an infrastructure plan. Every time you add a server it should be taking you closer to an environment that improves service, security, and reliability.
Build a Private Internal Cloud
I don’t see a time in the near future when any sizeable health system will be able to move completely to the cloud. Legacy systems make this a longer road for healthcare. Therefore, you will need a highly efficient onsite compute and storage plan. There is nothing more efficient than the cloud today.
We architected a hyperconverged environment that leveraged increased server densities, virtualization, and software-defined characteristics. We made sure that everything in the environment was addressable programmatically and through a browser. Automation and agility were at the core of the design.
The key point here: don’t just patch the problem at hand. Plan for the next one in your design as well.
Adopt External Cloud Infrastructure
This vulnerability impacted cloud providers, but I would like to make two points about this before I proceed: (1) Cloud is more prepared for this, and (2) Your budget needs the cloud.
Here is the bulletin on Amazon’s response to the vulnerability: https://aws.amazon.com/security/security-bulletins/AWS-2018-013/. I challenge you to find a health system infrastructure team that is as on top of this situation as Amazon. They are addressing this problem much faster than even a health system with only a few servers could. I would love to have this team on my indirect payroll.
The second point is that if this does actually require new servers, your budget for 2018 is shot. I doubt anyone dropped a multimillion-dollar contingency in their budget to replace all of their servers this year. If the problem does, in fact, end up being this kind of liability, I trust Amazon to remediate the problem with less impact on my overall budget.
With all that out of the way, a cloud infrastructure partner will provide you access to much more than meets the eye. Yes, there are compute and storage environments but there is also access to a whole lot more.
Cloud is the gateway to a host of new options for disaster recovery and business continuity, as well as new technologies such as automation, database, big data, machine learning, and AI frameworks. Look to create an environment where you can leverage the best of Internal and External cloud environments with the same skills and toolsets.
Finally, no matter how this goes down, remediation will require man-hours. You have to plan to minimize this in the future, as well. Most of us are on this journey, but it has led to an explosion of tools. This too requires some planning and architecture.
We built a matrix of all the tools we had in our environment, what services we utilized them for, and what their capabilities were. The list was alarming in its size. We then consolidated our toolset down to a handful of highly effective tools. We chose those that took our server provisioning from days down to minutes, and that turned the patching and fixing of servers into scripted functions rather than people-dependent endeavors.
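For readers who want to try the matrix exercise themselves, here is a minimal Python sketch of the idea. It is not the process we actually ran. The tool names and service categories are hypothetical placeholders, and the greedy selection is just one simple heuristic I am assuming for illustration. A real decision would also weigh capabilities, cost, and staff skills.

```python
from typing import Dict, List, Set

# Hypothetical inventory (placeholder names): tool -> services it is used for.
TOOL_MATRIX: Dict[str, Set[str]] = {
    "tool_a": {"provisioning", "patching"},
    "tool_b": {"patching"},
    "tool_c": {"monitoring", "provisioning", "patching"},
    "tool_d": {"backup"},
    "tool_e": {"monitoring"},
}


def overlapping_services(matrix: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return each service covered by more than one tool, with the tools that cover it."""
    by_service: Dict[str, List[str]] = {}
    for tool, services in matrix.items():
        for service in services:
            by_service.setdefault(service, []).append(tool)
    return {s: sorted(tools) for s, tools in by_service.items() if len(tools) > 1}


def consolidate(matrix: Dict[str, Set[str]]) -> List[str]:
    """Greedy heuristic: keep picking the tool that covers the most uncovered services."""
    uncovered: Set[str] = set().union(*matrix.values())
    kept: List[str] = []
    while uncovered:
        # Iterating in sorted order makes ties resolve deterministically by name.
        best = max(sorted(matrix), key=lambda tool: len(matrix[tool] & uncovered))
        kept.append(best)
        uncovered -= matrix[best]
    return kept


if __name__ == "__main__":
    print("Overlap:", overlapping_services(TOOL_MATRIX))
    print("Keep:", consolidate(TOOL_MATRIX))
```

Even a toy version like this makes the redundancy visible: three of the five placeholder tools cover patching, while two tools are enough to cover every service in the list.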
These kinds of events are challenging, but see them for the opportunity that they are.