Meltdown and Spectre created something of a meltdown in the cloud computing world. And by extension, the flaws found in the processors at the heart of much of the world's computing infrastructure have had a direct or indirect effect on the interconnected services driving today's Internet. That is especially true for one variant of the Spectre vulnerability revealed abruptly by Google on January 3, since this particular vuln could allow malware running in one user's virtual machine or other "sandboxed" environment to read data from another, or from the host server itself.
In June 2017, Intel learned of these threats from researchers who kept the information under wraps so hardware and operating system vendors could furiously work on fixes. But while places like Amazon, Google, and Microsoft were clued in early because of their "Tier 1" nature, most smaller infrastructure companies and data center operators were left in the dark until the news broke on January 3. This sent many organizations immediately scrambling: they received no warning of the exploits until proof-of-concept code for them was already public.
Tory Kulick, Director of Operations and Security at the hosting company Linode, described this as chaos. "How could something this big be disclosed like this without any proper warning? We were feeling out of the loop, like 'What did we miss? Which of the POCs [proofs of concept of the vulnerabilities] are out there now?' All that was going through my mind."
"When this stuff broke, nobody had heard a peep from Intel or from anybody else directly," Zachary Smith, CEO of the hosting service Packet, told Ars. "All we could see is what was going on Google's blog about how to exploit this stuff. So we were all scrambling. The big guys—Google, Amazon, and Microsoft—have had 60 days at least of prep time, and we've had negative prep time."
Even the teams behind some operating system distributions—including developers of BSD distributions—weren't aware of the flaws until Google published the Project Zero blog. "Only Tier-1 companies received advance information, and that is not responsible disclosure—it is selective disclosure," said Theo de Raadt, the lead on the OpenBSD project, when talking to ITWire. "Everyone below Tier-1 has just gotten screwed."
The nature and timing of Google's disclosure, driven at least in part by independent discovery of the vulnerabilities, have made the response even more chaotic and painful for cloud hosts and users. Processor microcode fixes to firmware have been pushed out incomplete, in some cases being recalled afterward. Some applications have taken big performance hits. And no one is really sure how all the permutations of software and firmware patching will affect cloud services as they're rolled out.
So to overcome the chaos, these companies did something kind of novel: they decided to work side by side. A group of second-tier service providers banded together to formally share information about patches from various vendors, metrics on their impact, and best practices for rolling them out. Over the past week, this ad hoc war council (a group of at least 25 companies operating over a simple shared Slack) has attracted a number of higher-profile members including Netflix and Amazon Web Services. And this impromptu centralization has even allowed the researchers originally behind the Spectre/Meltdown discovery to interact directly with affected companies.
"Probably one of the best things that came out of the whole ordeal was this cross-cloud hosting collaboration," said Linode's Kulick. "Sharing links and things like that was absolutely critical."
And Kulick, like others in the group, hopes that this episode will lead to a more permanent sort of collaboration across the industry—giving smaller organizations and big cloud customers a seat at the table for future security issues of this magnitude.
"Our industry has grown up," said Smith. "We're not a ragtag team of people running little hosting racks and putting some websites online anymore—we're running major portions of peoples' lives on our infrastructure for them, and it would kind of be a problem if we didn't figure out a way to coordinate."
"Thank god this wasn't a state actor," Smith added.
The dumpster fire begins
As the world was shaking off the hangovers of New Year's Eve, another sort of headache was taking shape amongst the chatter in Slack channels at Packet, a New York-based "bare metal" hosting company.
"Monday night and Tuesday, some of the AMD commits and comments to Kernel.org that were happening came into our internal Slack channels," said Smith (Kernel.org is where contributors push the latest updates to versions of the Linux kernel). "We host Kernel.org, so we watch it pretty carefully. Everyone was like, 'Something is going on.'"
There had been a long discussion in the Kernel.org change logs that dated back to May 2017 about a new feature called KAISER ("Kernel Address Isolation to have Side-channels Efficiently Removed"). This feature was prompted by long-standing concerns about the potential for the kinds of attacks the Meltdown and Spectre proofs of concept are based on. Commits for KAISER started about a month before Meltdown and Spectre were revealed to Intel, so there was already work going on to try to mitigate the potential threat of these classes of attack. As the year passed, and by the time Packet and others started monitoring this, the kernel updates related to KAISER were coming with increasing frequency, and with more subtle references to a potential exploit.
"I thought people were seeing things through commits and were starting to piece it together," said Kulick.
A comment accompanying a Linux kernel commit by AMD's Tom Lendacky on December 27 really set off speculation, infuriating executives at several of the companies in the know about the vulnerabilities. The commit comment essentially spelled out AMD's position at that point on the embargoed bugs: the company believed its processors were not subject to the types of attacks that the kernel page table isolation feature protects against. AMD also believed its microarchitecture does not allow memory references, including speculative references, that access higher-privileged data when running in a lesser-privileged mode when that access would result in a page fault.
Of course, AMD's architecture would later turn out not to be as immune to side-channel attacks as Lendacky asserted.
"AMD didn't help with their kind of snarky kernel commits," said Smith, who suggested that the comment may have played a role in Google's early release of the information on Spectre and Meltdown. Even if it did, however, other researchers were starting to independently discover the flaws at the core of Spectre and Meltdown. Researcher Anders Fogh had publicly written about what would later be defined as Meltdown in late July 2017.
Whatever triggered the ultimate release, Jann Horn of Google's Project Zero security research team published details of Meltdown and Spectre on January 3, nearly a week before the embargo originally set for the vulnerability disclosures was due to lift. At that point, according to Smith, "you know, all kind of hell broke loose."
Kulick said he thought Google's disclosure caused problems, but "even if it was disclosed on the ninth as planned, we would have all been in a world of hurt. It would have been a different thing if there had been some lead time."
Given how reliant all sorts of applications have become on cloud services, it's telling that nobody at Intel, Red Hat, AMD, or Google clued in anyone outside the top-tier hardware and operating system vendors.
"The Tier 2 providers that are represented in this little working group we formed control hundreds of thousands, if not millions, of servers," said Smith. "But individually we're too small… Google never thought to call Packet. Intel didn't think to call Packet, and they certainly didn't call OVH or Digital Ocean. And yet we're just as important from a customer standpoint, because our customers need a lot more help."
Once the details were out, communications from Intel, AMD, and other hardware vendors about Spectre and Meltdown were (and have continued to be) spotty. Even today, there's no central communications channel for everyone affected. "My impression is that [Intel's communications with customers] were going through different teams based on regions," Kulick said. "They're getting hit pretty hard, so there have been delays in communication."
"Intel was just behind the eight ball," Smith said. He suggested Intel was too consumed with the public relations problem and not focused on talking with customers like him. "I've encouraged [Intel]… I'm petitioning their data center group to do kind of a fireside chat online to answer questions. We need to have some open conversations, which are not all going to be positive, but we have to work together; people have to be heard. And I generally think our community wants to help—we just need to have more of a more open dialogue."
Of course, the communication issue hasn't been helped by the absence of any sort of established channel for communication. "Frankly, this is exposing how immature of an industry the public cloud is," said Smith. "We don't really have any really good working groups. So where, if you're Red Hat or you're Intel or you're Supermicro, do you go under some sort of a common code of conduct to work with everybody around a security issue? There's no place."