Using Wall Street secrets to reduce the cost of cloud infrastructure

Stock market investors often rely on financial risk theories that help them maximize returns while minimizing financial loss due to market fluctuations. These theories help investors maintain a balanced portfolio to ensure they’ll never lose more money than they’re willing to part with at any given time.

Inspired by those theories, MIT researchers in collaboration with Microsoft have developed a “risk-aware” mathematical model that could improve the performance of cloud-computing networks across the globe. Notably, cloud infrastructure is extremely expensive and consumes a lot of the world’s energy.

Their model takes into account failure probabilities of links between data centers worldwide — akin to predicting the volatility of stocks. Then, it runs an optimization engine to allocate traffic through optimal paths to minimize loss, while maximizing overall usage of the network.

The model could help major cloud-service providers, such as Microsoft, Amazon, and Google, better utilize their infrastructure. The conventional approach is to keep links idle to handle unexpected traffic shifts resulting from link failures, which wastes energy, bandwidth, and other resources. The new model, called TeaVar, instead guarantees that for a target percentage of time (say, 99.9 percent) the network can handle all data traffic, so there is no need to keep any links idle. During the remaining 0.1 percent of the time, the model also keeps the amount of dropped data as low as possible.
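To make the "target percentage of time" idea concrete, here is a minimal sketch in Python of value at risk, a standard financial risk measure, applied to a toy network. It is an illustration under stated assumptions, not TeaVar's actual formulation: the two parallel links, their failure probabilities, the demand figure, and the function names `scenario_losses` and `value_at_risk` are all hypothetical, links are assumed to fail independently, and link capacities are not modeled.

```python
from itertools import product


def scenario_losses(fail_probs, allocation, demand):
    """Enumerate every up/down combination of links.

    Returns a list of (probability, loss_fraction) pairs, assuming links
    fail independently and traffic on a failed link is simply dropped.
    """
    scenarios = []
    for state in product([True, False], repeat=len(fail_probs)):  # True = link is up
        prob = 1.0
        for up, p_fail in zip(state, fail_probs):
            prob *= (1.0 - p_fail) if up else p_fail
        delivered = sum(a for up, a in zip(state, allocation) if up)
        loss = max(0.0, 1.0 - delivered / demand)
        scenarios.append((prob, loss))
    return scenarios


def value_at_risk(scenarios, beta):
    """Smallest loss level L such that P(loss <= L) >= beta."""
    cumulative = 0.0
    for prob, loss in sorted(scenarios, key=lambda s: s[1]):
        cumulative += prob
        if cumulative >= beta:
            return loss
    # Only reached if rounding keeps the cumulative sum just below beta.
    return max(loss for _, loss in scenarios)


# Hypothetical numbers: link 0 fails 0.01% of the time, link 1 fails 1% of the time.
fail_probs = [0.0001, 0.01]
demand = 10.0

even_split = value_at_risk(scenario_losses(fail_probs, [5.0, 5.0], demand), 0.999)
reliability_aware = value_at_risk(scenario_losses(fail_probs, [10.0, 0.0], demand), 0.999)

print(even_split)         # loss level not exceeded 99.9% of the time, even split
print(reliability_aware)  # same measure when traffic favors the reliable link
```

With these made-up numbers, the even split risks losing half the traffic within the 99.9 percent target, while sending everything over the more reliable link keeps the loss at zero for that same share of time. An optimization engine would search over allocations to minimize this kind of measure rather than compare two by hand.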

In experiments based on real-world data, the model supported three times the traffic throughput of traditional traffic-engineering methods, while maintaining the same high level of network availability. A paper describing the model and results will be presented at the ACM SIGCOMM conference this week.

Better network utilization can save service providers millions of dollars, but benefits will “trickle down” to consumers, says co-author Manya Ghobadi, the TIBCO Career Development Assistant Professor in the MIT Department of Electrical Engineering and Computer Science and a researcher at the Computer Science and Artificial Intelligence Laboratory (CSAIL).

“Having greater utilized infrastructure isn’t just good for cloud services — it’s also better for the world,” Ghobadi says. “Companies don’t have to purchase as much infrastructure to sell services to customers. Plus, being able to efficiently utilize datacenter resources can save enormous amounts of energy consumption by the cloud infrastructure. So, there are benefits both for the users and the environment at the same time.”

Joining Ghobadi on the paper are her students Jeremy Bogle and Nikhil Bhatia, both of CSAIL; Ishai Menache and Nikolaj Bjorner of Microsoft Research; and Asaf Valadarsky and Michael Schapira of Hebrew University.

On the money

Cloud service providers use networks of fiber-optic cables running underground, connecting data centers in different cities. To route traffic, the providers rely on “traffic engineering” (TE) software that optimally allocates data bandwidth (the amount of data that can be transferred at one time) through all network paths.

The goal is to ensure maximum availability to users around the world. But that’s challenging when some links can fail unexpectedly, due to drops in optical signal quality resulting from outages or lines cut during construction, among other factors. To stay robust to failure, providers keep many links at very low utilization, lying in wait to absorb full data loads from downed links.

Thus, there is a tricky tradeoff between network availability and utilization, where higher utilization would enable higher data throughput. And that’s where traditional TE methods fail, the researchers say. Those methods find optimal paths based on various factors, but never quantify the reliability of links. “They don’t say, ‘This link has a higher probability of being up and running, so that means you should be sending more traffic here,’” Bogle says. “Most links in a network are operating at low utilization and aren’t sending as much traffic as they could be sending.”
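The cost of the conventional idle-capacity approach can be seen with a little arithmetic. The sketch below is a simplified illustration, not a description of any provider's practice: it assumes a hypothetical set of identical parallel links whose traffic can be spread evenly over the surviving links, and the function name `max_safe_utilization` is made up for this example.

```python
def max_safe_utilization(num_links: int, failures_to_survive: int) -> float:
    """Highest per-link utilization that still lets the surviving links
    absorb all traffic after the given number of link failures.

    Assumes identical links and perfectly even redistribution of traffic.
    """
    if not 0 <= failures_to_survive < num_links:
        raise ValueError("failures_to_survive must be in [0, num_links)")
    return (num_links - failures_to_survive) / num_links


# Two links that must survive one failure can each run at only 50 percent.
print(max_safe_utilization(2, 1))
# Four links that must survive one failure can each run at 75 percent.
print(max_safe_utilization(4, 1))
```

Under these assumptions, guarding against every possible failure regardless of how likely it is forces links to sit well below capacity, which is the low utilization Bogle describes.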