Use Machine Learning to Predict Data Center Outages
Equinix, Inc. connects the world’s leading businesses to their customers, employees, and partners inside the most-interconnected data centers. On this global platform for digital business, companies come together across more than 50 markets on five continents to reach everywhere, interconnect everyone, and integrate everything they need to create their digital futures.
Equinix’s data centers continuously produce large amounts of data about their internal operation. This machine-generated data flows 24x7x365, yet it is seldom used to improve the health of the processes or of the business itself. The information usually comes in two flavors: application events or system logs, and periodic measurements of a changing quantity (processor load, memory usage, and so on). Equinix wanted to use Machine Learning to understand historical behavior and predict when an outage may occur.
Google Cloud Implementation
Equinix partnered with Google Cloud Platform (GCP) and SpringML to build a solution that reads the raw logs generated in data centers and applies a Machine Learning model to predict outages. SpringML used a mix of structured data (CSV files) and unstructured data (log messages) to build the model. In this project, we developed an anomaly detection model based on TensorFlow. Its results were combined with those of a supervised model wherever labels were available. The resulting ensemble could then predict when the next outage in the data center might occur.
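To make the ensemble idea concrete, here is a minimal sketch of how an unsupervised anomaly score and a supervised outage probability might be blended into one risk value. It is not the model built in this project: a simple z-score detector stands in for the TensorFlow anomaly model, and the function names, the sample CPU readings, the supervised probabilities, and the equal weighting are all assumptions made for illustration.

```python
from statistics import mean, stdev


def anomaly_scores(values, baseline):
    """Score each value by its distance from the baseline mean, in standard deviations.

    Assumption: a z-score stands in for the TensorFlow anomaly detection model.
    """
    mu = mean(baseline)
    sigma = stdev(baseline) or 1.0  # guard against a perfectly flat baseline
    return [abs(v - mu) / sigma for v in values]


def squash(score, scale=3.0):
    """Map a non-negative anomaly score into the range [0, 1)."""
    return score / (score + scale)


def ensemble_risk(anomaly, supervised_prob, weight=0.5):
    """Blend the squashed anomaly score with a supervised outage probability."""
    return weight * squash(anomaly) + (1 - weight) * supervised_prob


# Hypothetical periodic measurements (processor load, in percent).
baseline_cpu = [40.0, 42.0, 41.0, 39.0, 38.0]
recent_cpu = [41.0, 95.0]

# Hypothetical outputs of a supervised classifier trained on labeled outages.
supervised = [0.1, 0.8]

scores = anomaly_scores(recent_cpu, baseline_cpu)
risks = [ensemble_risk(a, p) for a, p in zip(scores, supervised)]

for load, risk in zip(recent_cpu, risks):
    print(f"cpu={load:.0f}% risk={risk:.2f}")
```

In this sketch the second reading sits far outside the baseline and also receives a high supervised probability, so its combined risk is much higher than that of the first reading. A real system would tune the weighting and the alert threshold against historical outages.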