
Lift, Shift And Drift: When Cloud Migrations Fail Miserably

Forbes Technology Council


Vice President of Cloud, Cybersecurity and Digital Infrastructure solutions at ConvergeOne, a leading cloud solutions and services provider.

Organizations are embarking on a cloud voyage — often a maiden voyage — at a pace faster than most predicted. In fact, 90% of enterprises are accelerating cloud usage due to Covid-19, according to the 2021 State of the Cloud Report.

As organizations embrace this change, some choose to move workloads rapidly and minimize migration costs while others elect a more methodical approach to maximize their cloud adoption benefits.

The rapid migration of workloads is referred to as rehosting, commonly known as a lift and shift. A typical lift and shift is quick, simple and highly automated, which may appeal to organizations early in their cloud journey. RFP evaluations may lean toward a lift and shift due to the lower upfront cost and accelerated timeline, and boards and executive leadership are putting increasing pressure on CIOs to leverage cloud services faster. Expiring data center equipment leases, aging infrastructure or business requirements for improved SLAs may also drive a tight timeline that makes a lift and shift look like the natural path forward.

Organizations that completed a traditional lift and shift might be patting themselves on the back for a fast, low-cost and successful migration, but that celebratory spirit is likely to fade after the first few months. Monthly costs start to spiral out of control due to a lack of optimization, important cloud features turn out to be inaccessible because workloads were never modified, and performance management is challenging at best.

It’s at this point of a lift and shift that organizations add a third component: drift. CIOs and CFOs team up to plan their lift, shift and drift back to their traditional data centers, closing the loop on an unsuccessful cloud journey. This happens to executive teams at Fortune 100s and SMBs alike, along with the lightbulb realization that the cloud may not be ready for the demands of their business.


This realization, however, is in stark contrast to the many organizations that have successfully voyaged to the cloud and regularly reap rich business benefits. Those experienced in cloud will quickly note that the journey is not just to cloud (that’s the starting mile) — it’s operationalizing an environment for long-term use of cloud.

Setting a proper course is a critical success factor for any cloud voyage to reach its desired destination. Savvy IT executives realize they will need to modify their applications to operate efficiently in a cloud environment and utilize native features. After all, the real advantages of cloud center on a fundamental change to the IT operating model that enables organizations to increase their agility, scalability and resiliency.

These modifications can range from minor changes to completely rewriting an application. A popular approach is called minimum viable refactoring, which ensures the application can leverage native cloud functionality while keeping an eye on budget and timeline. Minimum viable refactoring can be utilized as a standard expectation for most applications; however, in a widescale migration, each application should undergo a basic analysis to consider whether a rebuild, replacement or retirement strategy is appropriate.

This level of analysis is often referred to as an application-based cloud assessment and migration plan to chart an optimal path to a cloud provider. This engagement should categorize applications against the 6R framework, which evaluates six common options for a migration: rehost (lift and shift), replatform, refactor, repurchase (or replace), retire and retain. One of the most valuable outcomes of categorizing workloads is identifying legacy applications that can be retired rather than moved to cloud or easily replaced with a better SaaS offering.
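For illustration, here is a minimal sketch (ours, not the author's) of how a migration team might record 6R categorizations as data so each application's disposition and rationale can be tracked and reported; the application names and rationales are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    """The six common migration options of the 6R framework."""
    REHOST = "rehost"          # lift and shift
    REPLATFORM = "replatform"
    REFACTOR = "refactor"
    REPURCHASE = "repurchase"  # replace, often with a SaaS offering
    RETIRE = "retire"
    RETAIN = "retain"


@dataclass
class AppAssessment:
    name: str
    disposition: Disposition
    rationale: str


# Hypothetical entries produced by an application-based cloud assessment.
portfolio = [
    AppAssessment("legacy-reporting", Disposition.RETIRE,
                  "Usage near zero; data already lives in the warehouse."),
    AppAssessment("hr-timesheets", Disposition.REPURCHASE,
                  "Commodity workload; a SaaS offering covers all requirements."),
    AppAssessment("order-api", Disposition.REFACTOR,
                  "High-value service that should use managed queues and autoscaling."),
]

for app in portfolio:
    print(f"{app.name}: {app.disposition.value} -- {app.rationale}")
```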

As you embark on your own voyage to the cloud, remember: The cloud ocean is vast, expansive, rich with beauty and contains many routes to reach the same destination. But well-planned routes with optimized ships will help you arrive faster than others.

Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives. Do I qualify?  

Tim Femister



Why cloud migration failures happen and how to prevent them

Companies are moving more applications than ever to the cloud, but many of these initiatives fail. Learn how to avoid making your own cloud migration mistakes.

Mary K. Pratt


The use of cloud computing for enterprise apps continues to grow, as organizations put more of their workloads in public clouds and pursue multi-cloud strategies to generate lower costs, increased agility and greater flexibility.

Not all cloud deployments, however, deliver those benefits -- or any benefits at all. Many IT leaders face failed cloud migration projects because they move apps into the cloud only to find that they don't work as well there as they did on premises, which forces a reverse migration.

A recent study from security provider Fortinet, conducted by IHS Markit, found that most companies have moved a cloud-based app back on premises after they failed to see anticipated returns. In the survey of 350 global IT decision makers, 74% reported they had moved an application back to their own infrastructure.

"When companies repatriate workloads, it's often an indication that something has gone wrong," said Yugal Joshi, vice president of information technology services at Everest Group, a management consulting company.

That's far from ideal. Moving workloads is costly and often disruptive, according to experts. There could be performance issues, additional security exposure and work interruptions, as well as a drain on IT and business resources. As Joshi noted, "Changing the location of a workload isn't easy, and there is a lot of risk in moving workloads around."

Cloud migration faces challenges

That level of failed cloud migrations doesn't surprise Asif Malik, senior vice president and CIO of SilkRoad Technology. He has been in that situation in the past at a prior company, he said.


Malik detailed one particular case to illustrate the problems he faced with a move to the cloud. He and his team moved a data analytics application from the company's data center to a public cloud offering, opting to have the application hosted by Microsoft Azure so they could more easily scale up or down as needed at a lower cost.

"We thought it was Capex versus Opex. We thought we could save a lot of money and get rid of managing infrastructure," Malik explained. "But we were wrong."

There were problems from the start. His IT workers noticed latency issues right away, and they identified limitations within their networking equipment that further hindered the app's performance.

"We kept throwing compute resources and storage resources at it," Malik said, and that drove up costs.

Given such problems and no financial benefits, Malik opted to move the app out of the cloud and back on premises. This process presented its own challenges and took about eight months of his team's time to complete.

Why migrations fail

Before you move a workload or full application to the cloud, take stock of the challenges you'll likely encounter that could hamper a smooth cloud migration.

Underestimated performance problems and costs. Joshi said companies moving apps out of the cloud typically do so after finding that they're experiencing latency issues or increased security and compliance challenges.

Those observations track with the results of the Fortinet survey. According to the report, 52% of those who moved workloads from the cloud back on premises said either performance or security issues were the primary reasons for their decision. An additional 21% cited regulatory issues as the driving factor.

"If I think about the times that I've seen people move to the cloud and then move backward, it's been a combination of things," said Scott Buchholz, a managing director with Deloitte Consulting LLP who serves as the government and public services chief technology officer and the national emerging technologies research director.

Some companies see higher costs than they expected. Some find they're not getting the uptime they expected from the cloud vendor. Still others hit complexities that slow down their systems.


Misunderstood applications and operations. Some very high-volume systems that have particular technical requirements, such as high-volume transactional databases, don't work well in the cloud, Buchholz said. "And there are some apps that we don't think are really connected to other things and they have more connectivity and talk to more things than was realized. So by the time you go through all the hops and links and security, things are much slower in the cloud than you thought it would be," he added.

Know what should go, and what should stay. Malik said that his cloud migration misstep gave him deeper insights into migration best practices. It drove home one point in particular, he said: "Not every application belongs in the cloud."

That, indeed, was what he determined was the main cause of failure with the data analytics app he moved to the cloud -- it wasn't ready to make the move. According to Malik, the problem started with the decision to simply move the app as it was to the cloud -- a straight lift-and-shift project.

"The application wasn't a multi-tenant application, it wasn't an elastic application, and it did not use a virtualized environment very well," he said. Also, the app relied on data that resided within the data center, a factor that contributed to the app's poor performance in the cloud.

Experts said that's a typical scenario for IT departments. "They treat the cloud like a virtual data center and they don't change their operations or procedures when they move to the cloud," Buchholz added.

Application evaluation is crucial

That is changing, though, as more organizations gain experience with cloud migration projects. IT advisors and researchers said they're seeing more CIOs doing a better job evaluating their on-premises applications to determine which can move as they are into the cloud and run successfully, which ones should be modernized and moved to the cloud, and which ones should stay put.

James Fairweather, chief innovation officer at Pitney Bowes, a global technology company offering customer information management, location intelligence, customer engagement, shipping and mailing, and global ecommerce products, said the company embarked on a transformation initiative about five years ago. Part of that involved moving workloads as well as individual capabilities and services into the cloud.

To help smooth those moves to the cloud, Fairweather said the company rigorously evaluated applications to determine which could be shifted as-is to the cloud and which needed to be optimized for the cloud in order to deliver returns.

"In all these workload migrations, we've been very planful (sic) about them," he said, explaining that staff conducts security reviews, code testing and other analyses on applications before mapping out the best path forward .

The company also invested in new technologies, such as automation tools and API management from Apigee, to ensure successful cloud migrations.


Distillery

10 Important Cloud Migration Case Studies You Need to Know

Aug 1, 2019 | Engineering


For most businesses considering cloud migration, the move is filled with promise and potential. Scalability, flexibility, reliability, cost-effectiveness, improved performance and disaster recovery, and simpler, faster deployment — what’s not to like? 


It’s important to understand that cloud platform benefits come alongside considerable challenges, including the need to improve availability and latency, auto-scale orchestration, manage tricky connections, scale the development process effectively, and address cloud security challenges. While advancements in virtualization and containerization (e.g., Docker, Kubernetes) are helping many businesses solve these challenges, cloud migration is no simple matter. 

That’s why, when considering your organization’s cloud migration strategy, it’s beneficial to look at case studies and examples from other companies’ cloud migration experiences. Why did they do it? How did they go about it? What happened? What benefits did they see, and what are the advantages and disadvantages of cloud computing for these businesses? Most importantly, what lessons did they learn — and what can you learn from them? 

With that in mind, Distillery has put together 10 cloud migration case studies your business can learn from. While most of the case studies feature companies moving from on-premise, bare metal data centers to cloud, we also look at companies moving from cloud to cloud, cloud to multi-cloud, and even off the cloud. Armed with all these lessons, ideas, and strategies, you’ll feel readier than ever to make the cloud work for your business.

Challenges for Cloud Adoption: Is Your Organization Ready to Scale and Be Cloud-first?

We examine several of these case studies from a more technical perspective in our white paper on Top Challenges for Cloud Adoption in 2019. In this white paper, you’ll learn:

  • Why cloud platform development created scaling challenges for businesses
  • How scaling fits into the big picture of the Cloud Maturity Framework
  • Why advancements in virtualization and containerization have helped businesses solve these scaling challenges
  • How companies like Betabrand, Shopify, Spotify, Evernote, Waze, and others have solved these scaling challenges while continuing to innovate their businesses and provide value to users


#1 Betabrand: Bare Metal to Cloud


Betabrand (est. 2005) is a crowd-funded, crowd-sourced retail clothing e-commerce company that designs, manufactures, and releases limited-quantity products via its website. 

Migration Objective 

The company struggled with the maintenance difficulties and lack of scalability of the bare metal infrastructure supporting their operations. 

Planning for and adding capacity took too much time and added costs. They also needed the ability to better handle website traffic surges.

Migration Strategy and Results 

In anticipation of increased web traffic on Black Friday 2017, Betabrand migrated to a Google Cloud infrastructure managed by Kubernetes (Google Kubernetes Engine, or GKE). They experienced no issues related to the migration, and Black Friday 2017 was a success.

By Black Friday 2018, early load testing and auto-scaling cloud infrastructure helped them to handle peak loads with zero issues. The company hasn’t experienced a single outage since migrating to the cloud.

Key Takeaways

  • With advance planning, cloud migration can be a simple process. Betabrand’s 2017 on-premise to cloud migration proved smooth and simple. In advance of actual migration, they created multiple clusters in GKE and performed several test migrations, thereby identifying the right steps for a successful launch.
  • Cloud streamlines load testing. Betabrand was able to quickly create a replica of its production services that they could use in load testing. Tests revealed poorly performing code paths that would only be revealed by heavy loads. They were able to fix the issues before Black Friday. 
  • Cloud’s scalability is key to customer satisfaction. As a fast-growing e-commerce business, Betabrand realized they couldn’t afford the downtime or delays of bare metal. Their cloud infrastructure scales automatically, helping them avoid issues and keep customers happy. This factor alone underlines the strategic importance of cloud computing in business organizations like Betabrand. 

#2 Shopify: Cloud to Cloud


Shopify (est. 2006) provides a proprietary e-commerce software platform upon which businesses can build and run online stores and retail point-of-sale (POS) systems. 

Shopify wanted to ensure they were using the best tools possible to support the evolution needed to meet increasing customer demand. Though they’d always been a cloud-based organization, building and running their e-commerce cloud with their own data centers, they sought to capitalize on the container-based cloud benefits of immutable infrastructure to provide better support to their customers. Specifically, they wanted to ensure predictable, repeatable builds and deployments; simpler and more robust rollbacks; and elimination of configuration management drift. 

By building out their cloud with Google, building a “Shop Mover” database migration tool, and leveraging Docker containers and Kubernetes, Shopify has been able to transform its data center to better support customers’ online shops, meeting all their objectives. For Shopify customers, the increasingly scalable, resilient applications mean improved consistency, reliability, and version control.

  • Immutable infrastructure vastly improves deployments. Since cloud servers are never modified post-deployment, configuration drift — in which undocumented changes to servers can cause them to diverge from one another and from the originally deployed configuration — is minimized or eliminated. This means deployments are easier, simpler, and more consistent.
  • Scalability is central to meeting the changing needs of dynamic e-commerce businesses. Shopify is home to online shops like Kylie Cosmetics, which hosts flash sales that can sell out in 20 seconds. Shopify’s cloud-to-cloud migration helped its servers flex to meet fluctuating demand, ensuring that commerce isn’t slowed or disrupted.

#3 Spotify: Bare Metal to Cloud


Spotify (est. 2006) is a media services provider primarily focused on its audio-streaming platform, which lets users search for, listen to, and share music and podcasts.

Spotify’s leadership and engineering team agreed: The company’s massive in-house data centers were difficult to provision and maintain, and they didn’t directly serve the company’s goal of being the “best music service in the world.” They wanted to free up Spotify’s engineers to focus on innovation. They started planning for migration to Google Cloud Platform (GCP) in 2015, hoping to minimize disruption to product development, and minimize the cost and complexity of hybrid operation. 

Spotify invested two years pre-migration in preparing, assigning a dedicated Spotify/Google cloud migration team to oversee the effort. Ultimately, they split the effort into two parts, services and data, which took a year apiece. For services migration, engineering teams moved services to the cloud in focused two-week sprints, pausing on product development. For data migration, teams were allowed to choose between “forklifting” or rewriting options to best fit their needs. Ultimately, Spotify’s on-premise to cloud migration succeeded in increasing scalability while freeing up developers to innovate. 

  • Gaining stakeholder buy-in is crucial. Spotify was careful to consult its engineers about the vision. Once they could see what their jobs looked like in the future, they were all-in advocates. 
  • Migration preparation shouldn’t be rushed. Spotify’s dedicated migration team took the time to investigate various cloud strategies and build out the use case demonstrating the benefits of cloud computing to the business. They carefully mapped all dependencies. They also worked with Google to identify and orchestrate the right cloud strategies and solutions. 
  • Focus and dedication pay huge dividends. Spotify’s dedicated migration team kept everything on track and in focus, making sure everyone involved was aware of past experience and lessons already learned. In addition, since engineering teams were fully focused on the migration effort, they were able to complete it more quickly, reducing the disruption to product development.

#4 Evernote: Bare Metal to Cloud


Evernote (est. 2008) is a collaborative, cross-platform note-taking and task management application that helps users capture, organize, and track ideas, tasks, and deadlines.

Evernote, which had maintained its own servers and network since inception, was feeling increasingly limited by its infrastructure. It was difficult to scale, and time-consuming and expensive to maintain. They wanted more flexibility, as well as to improve Evernote’s speed, reliability, security, and disaster recovery planning. To minimize service disruption, they hoped to conduct the on-premise to cloud migration as efficiently as possible. 

Starting in 2016, Evernote used an iterative approach: They built a strawman based on strategic decisions, tested its viability, and rapidly iterated. They then settled on a cloud migration strategy that used a phased cutover approach, enabling them to test parts of the migration before committing. They also added important levels of security by using GCP service accounts, achieving “encryption at rest,” and improving disaster recovery processes. Evernote successfully migrated 5 billion notes and 5 billion attachments to GCP in only 70 days.

  • Cloud migration doesn’t have to happen all at once. You can migrate services in phases or waves grouped by service or user. Evernote’s phased cutover approach allowed for rollback points if things weren’t going according to plan, reducing migration risk.
  • Ensuring data security in the cloud may require extra steps. Cloud security challenges may require extra focus in your cloud migration effort. Evernote worked with Google to create the additional security layers their business required. GCP service accounts can be customized and configured to use built-in public/private key pairs managed and rotated daily by Google.
  • Cloud capabilities can improve disaster recovery planning. Evernote wanted to ensure that they would be better prepared to quickly recover customer data in the event of a disaster. Cloud’s reliable, redundant, and robust data backups help make this possible. 

#5 Etsy: Bare Metal to Cloud


Etsy (est. 2005) is a global e-commerce platform that allows sellers to build and run online stores selling handmade and vintage items and crafting supplies.

Etsy had maintained its own infrastructure from inception. In 2018, they decided to re-evaluate whether cloud was right for the company’s future. In particular, they sought to improve site performance, engineering efficiency, and UX. They also wanted to ensure long-term scalability and sustainability, as well as to spend less time maintaining infrastructure and more time executing strategy.

Migration Strategy and Results

Etsy undertook a detailed vendor selection process, ultimately identifying GCP as the right choice for their cloud migration strategy. Since they’d already been running their own Kubernetes cluster inside their data center, they already had a partial solution for deploying to GKE. They initially deployed in a hybrid environment (private data center and GKE), providing redundancy, reducing risk, and allowing them to perform A/B testing. They’re on target to complete the migration and achieve all objectives.

Key Takeaways 

  • Business needs and technology fit should be periodically reassessed. While bare metal was the right choice for Etsy when it launched in 2005, improvements in infrastructure as a service (IaaS) and platform as a service (PaaS) made cloud migration the right choice in 2018.
  • Detailed analysis can help businesses identify the right cloud solution for their needs. Etsy took a highly strategic approach to assessment that included requirements definition, RACI (responsible, accountable, consulted, informed) matrices, and architectural reviews. This helped them ensure that their cloud migration solution would genuinely help them achieve all their goals.
  • Hybrid deployment can be effective for reducing cloud migration risk. Dual deployment on their private data center and GKE was an important aspect of Etsy’s cloud migration strategy. 

#6 Waze: Cloud to Multi-cloud


Waze (est. 2006; acquired by Google in 2013) is a GPS-enabled navigation application that uses real-time user location data and user-submitted reports to suggest optimized routes.

Though Waze moved to the cloud very early on, their fast growth quickly led to production issues that caused painful rollbacks, bottlenecks, and other complications. They needed to find a way to get faster feedback to users while mitigating or eliminating their production issues.  

Waze decided to run an active-active architecture across multiple cloud providers — GCP and Amazon Web Services (AWS) — to improve the resiliency of their production systems. This means they’re better positioned to survive a DNS DDoS attack, or a regional or global failure. An open source continuous delivery platform called Spinnaker helps them deploy software changes while making rollbacks easy and reliable. Spinnaker makes it easy for Waze’s engineers to deploy across both cloud platforms, using a consistent conceptual model that doesn’t rely on detailed knowledge of either platform.

  • Some business models may be a better fit for multiple clouds. Cloud strategies are not one-size-fits-all. Waze’s stability and reliability depends on avoiding downtime, deploying quick fixes to bugs, and ensuring the resiliency of their production systems. Running on two clouds at once helps make it all happen. 
  • Your engineers don’t necessarily have to be cloud experts to deploy effectively. Spinnaker streamlines multi-cloud deployment for Waze such that developers can focus on development, rather than on becoming cloud experts. 

  • Deploying software more frequently doesn’t have to mean reduced stability/reliability. Continuous delivery can get you to market faster, improving quality while reducing risk and cost.

#7 AdvancedMD: Bare Metal to Cloud


AdvancedMD (est. 1999) is a software platform used by medical professionals to manage their practices, securely share information, and manage workflow, billing, and other tasks. 

AdvancedMD was being spun off from its parent company, ADP; to operate independently, it had to move all its data out of ADP’s data center. Since they handle highly sensitive, protected patient data that must remain available to practitioners at a moment’s notice, security and availability were top priorities. They sought an affordable, easy-to-manage, and easy-to-deploy solution that would scale to fit their customers’ changing needs while keeping patient data secure and available.

AdvancedMD’s on-premise to cloud migration would avoid the need to hire in-house storage experts, save them and their customers money, ensure availability, and let them quickly flex capacity to accommodate fluctuating needs. It also offered the simplicity and security they needed. Since AdvancedMD was already running NetApp storage arrays in its data center, it was easy to use NetApp’s Cloud Volumes ONTAP to move their data to AWS. ONTAP also provides the enterprise-level data protection and encryption they require.

  • Again, ensuring data security in the cloud may require extra steps. Though cloud has improved or mitigated some security concerns (e.g., vulnerable OS dependencies, long-lived compromised servers), hackers have turned their focus to the vulnerabilities that remain. Thus, your cloud migration strategy may need extra layers of controls (e.g., permissions, policies, encryption) to address these cloud security challenges.
  • When service costs are a concern, cloud’s flexibility may help. AdvancedMD customers are small to mid-sized budget-conscious businesses. Since cloud auto-scales, AdvancedMD never pays for more cloud infrastructure than they’re actually using. That helps them keep customer pricing affordable.

#8 Dropbox: Cloud to Hybrid


Dropbox (est. 2007) is a file hosting service that provides cloud storage and file synchronization solutions for customers.

Dropbox had developed its business by using the cloud — specifically, Amazon S3 (Simple Storage Service) — to house data while keeping metadata housed on-premise. Over time, they began to fear they’d become overly dependent on Amazon: not only were costs increasing as their storage needs grew, but Amazon was also planning a similar service offering, Amazon WorkDocs. Dropbox decided to take back their storage to help them reduce costs, increase control, and maintain their competitive edge. 

While the task of moving all that data to an in-house infrastructure was daunting, the company decided it was worth it — at least in the US (Dropbox assessed that in Europe, AWS is still the best fit). Dropbox designed and built in-house a massive network of new-breed machines orchestrated by software written in an entirely new programming language, moving about 90% of its files back to its own servers. Dropbox’s expanded in-house capabilities have enabled them to offer Project Infinite, which provides desktop users with universal compatibility and unlimited real-time data access.

  • On-premise infrastructure may still be right for some businesses. Since Dropbox’s core product relies on fast, reliable data access and storage, they need to ensure consistently high performance at a sustainable cost. Going in-house required a huge investment, but improved performance and reduced costs may serve them better in the long run. Once Dropbox understood that big picture, they had to recalculate the strategic importance of cloud computing to their organization.  
  • Size matters. As Wired lays out in its article detailing the move, cloud businesses are not charities. There’s always going to be margin somewhere. If a business is big enough — like Dropbox — it may make sense to take on the difficulties of building a massive in-house network. But it’s a huge risk for businesses that aren’t big enough, or whose growth may stall.

#9 GitLab: Cloud to Cloud


GitLab (est. 2011) is an open core company that provides a single application supporting the entire DevOps life cycle for more than 100,000 organizations. 

GitLab’s core application enables software development teams to collaborate on projects in real time, avoiding both handoffs and delays. GitLab wanted to improve performance and reliability, accelerating development while making it as seamless, efficient, and error-free as possible. While they acknowledged that Microsoft Azure had been a great cloud provider, they strongly believed that GCP’s Kubernetes was the future, calling it “a technology that makes reliability at massive scale possible.” 

In 2018, GitLab migrated from Azure to GCP so that it could run as a cloud-native application on GKE. They used their own Geo product to migrate the data, initially mirroring the data between Azure and GCP. Post-migration, GitLab reported improved performance (including fewer latency spikes) and a 61% improvement in availability.

  • Containers are seen by many as the future of DevOps. GitLab was explicit that they view Kubernetes as the future. Indeed, containers provide notable benefits, including a smaller footprint, predictability, and the ability to scale up and down in real time. For GitLab’s users, the company’s cloud-to-cloud migration makes it easier to get started with using Kubernetes for DevOps.
  • Improved stability and availability can be a big benefit of cloud migration. In GitLab’s case, mean-time between outage events pre-migration was 1.3 days. Excluding the first day post-migration, they’re up to 12 days between outage events. Pre-migration, they averaged 32 minutes of downtime weekly; post-migration, they’re down to 5. 

#10 Cordant Group: Bare Metal to Hybrid


The Cordant Group (est. 1957) is a global social enterprise that provides a range of services and solutions, including recruitment, security, cleaning, health care, and technical and electrical services.

Over the years, the Cordant Group had grown tremendously, requiring an extensive IT infrastructure to support their vast range of services. While they’d previously focused on capital expenses, they’d shifted to looking at operational expenses (OpEx) — which meant cloud’s “pay as you go” model made increasing sense. It was also crucial to ensure ease of use and robust data backups.

They began by moving to a virtual private cloud on AWS, but found that the restriction to use Windows DFS for file server resource management was creating access problems. NetApp Cloud ONTAP, a software storage appliance that runs on AWS server and storage resources, solved the issue. File and storage management is easier than ever, and backups are robust, which means that important data restores quickly. The solution also monitors resource costs over time, enabling more accurate planning that drives additional cost savings.

  • Business and user needs drive cloud needs. That’s why cloud strategies will absolutely vary based on a company’s unique needs. The Cordant Group needed to revisit its cloud computing strategy when users were unable to quickly access the files they needed. In addition, with such a diverse user group, ease of use had to be a top priority.
  • Cloud ROI ultimately depends on how your business measures ROI. The strategic importance of cloud computing in business organizations is specific to each organization. Cloud became the right answer for the Cordant Group when OpEx became the company’s dominant lens. 

Which Cloud Migration Strategy Is Right for You?

As these 10 diverse case studies show, cloud strategies are not one-size-fits-all. Choosing the right cloud migration strategy for your business depends on several factors, including your:

  • Goals. What business results do you want to achieve as a result of the migration? How does your business measure ROI? What problems are you trying to solve via your cloud migration strategy? 
  • Business model. What is your current state? What are your core products/services and user needs, and how are they impacted by how and where data is stored? What are your development and deployment needs, issues, and constraints? What are your organization’s cost drivers? How is your business impacted by lack of stability or availability? Can you afford downtime? 
  • Security needs. What are your requirements regarding data privacy, confidentiality, encryption, identity and access management, and regulatory compliance? Which cloud security challenges pose potential problems for your business?
  • Scaling needs. Do your needs and usage fluctuate? Do you expect to grow or shrink? 
  • Disaster recovery and business continuity needs. What are your needs and capabilities in this area? How might your business be impacted in the event of a major disaster — or even a minor service interruption? 
  • Technical expertise. What expertise do you need to run and innovate your core business? What expertise do you have in-house? Are you allocating your in-house expertise to the right efforts? 
  • Team focus and capacity. How much time and focus can your team dedicate to the cloud migration effort? 
  • Timeline. What business needs constrain your timeline? What core business activities must remain uninterrupted? How much time can you allow for planning and testing your cloud migration strategy? 

Of course, this list isn’t exhaustive. These questions are only a starting point. But getting started — with planning, better understanding your goals and drivers, and assessing potential technology fit — is the most important step of any cloud migration process. We hope these 10 case studies have helped to get you thinking in the right direction. 

While the challenges of cloud migration are considerable, the right guidance, planning, and tools can lead you to the cloud strategies and solutions that will work best for your business. So don’t delay: Take that first step to helping your business reap the potential advantages and benefits of cloud computing. 

Ready to take the next step on your cloud journey? As a Certified Google Cloud Technology Partner, Distillery is here to help. Download our white paper on top challenges for cloud adoption to get tactical and strategic about using cloud to transform your business.


Cloud computing is often described as a savior for businesses. Early success stories have shown that cloud can be used not only for improving business operations but also as an invaluable tool for driving business growth. Innovative, cloud-based platforms such as customer relationship management (CRM), e-commerce and analytics are making it easier than ever for businesses to experiment and pilot cutting-edge capabilities to increase revenue and gain market share.

These success stories have opened the eyes of CEOs and senior business executives to the value of cloud computing. Our research shows that 86% of CEOs believe that cloud is essential to deliver the results they need over the next 2-3 years.

Yet many organizations are struggling to make the business case for their cloud investments, and upwards of 30% of cloud initiatives fail. The cloud is a new way of doing things, and for most companies, it requires a different set of skills, processes and tools. Many companies apply traditional practices and existing capabilities to the cloud and fail.

In our experience, there are four primary issues that hold businesses back from realizing value:

1. Business and IT misalignment

An overwhelming majority of cloud programs tend to be driven by IT organizations. However, the bulk of cloud value is typically unlocked within business operations. To realize that value, businesses must change the ways they work. But CIOs are rarely positioned to drive these changes themselves. Business executives, on the other hand, are reluctant to take responsibility for these cloud programs because they are uncomfortable working with the cloud (and often with technology in general).

2. Underestimation of technology complexity

CIOs consistently underestimate the technology complexity associated with successful execution of cloud modernization. Cloud undoubtedly appeals to CIOs that wish to “get out of the data center business” and focus on more value-adding capabilities. There is also the undeniable beauty and promise of a highly distributed, event-driven microservices architecture on cloud that can power the next generation of intelligent applications. However, despite the many virtues of cloud, businesses are discovering that they are unable to shed the full responsibility of platform and infrastructure management. Quite the opposite, many companies find that managing hybrid environments adds a significant layer of complexity to platform and infrastructure operations.

3. Over-indexing on organization

Although CIOs appreciate the need to change their operating model to be able to work in this new cloud environment, they often treat these changes as a “boxes and arrows” exercise. In other words, they shuffle and regroup resources rather than breaking down barriers, addressing inefficiencies, and fundamentally changing the way their teams work. The only way technology teams have a hope of keeping up with expectations is if they drive very high levels of automation. Unfortunately, many companies fail to explicitly invest in the automation and AI necessary to transform IT delivery to take advantage of these new technologies.

4. Poor financial discipline

Many IT organizations lack the financial management capabilities to measure and manage cloud value. Our research shows that less than 40% of cloud programs have a well-articulated business and financial case. Some technology leaders are less well versed in IT economics and lack an understanding of how operating decisions lead to financial outcomes. Further, in many companies, there can be a lack of adequate visibility into assets and metrics. The lack of an analytically driven culture makes it difficult to derive a clear view of the value and efficiency of cloud initiatives.

Being aware of these four failure patterns is the first step to being a proactive leader who can lead successful modernization programs and deliver business results.


Journal of Cloud Computing: Advances, Systems and Applications

  • Open access
  • Published: 19 September 2022

Cloud failure prediction based on traditional machine learning and deep learning

Tengku Nazmi Tengku Asmawi (1), Azlan Ismail (1,2) & Jun Shen (3)

Journal of Cloud Computing, volume 11, Article number: 47 (2022)


Abstract

Cloud failure is a critical issue, since it can cost cloud service providers millions of dollars, in addition to the loss of productivity suffered by industrial users. Fault tolerance management is the key approach to addressing this issue, and failure prediction is one of the techniques to prevent the occurrence of a failure. One of the main challenges in failure prediction is to produce a highly accurate predictive model. Although some failure prediction models have been proposed, there is still a lack of a comprehensive evaluation of models based on different types of machine learning algorithms. Therefore, in this paper, we propose a comprehensive comparison and model evaluation of predictive models for job and task failure. These models are built and trained using five traditional machine learning algorithms and three variants of deep learning algorithms. We use a benchmark dataset, called Google Cluster Traces, for training and testing the models. We evaluate the performance of the models using multiple metrics, determine their important features, and measure their scalability. Our analysis yields the following findings. First, for job failure prediction, we found that Extreme Gradient Boosting produces the best model, where the disk space request and CPU request are the most important features that influence the prediction. Second, for task failure prediction, we found that Decision Tree and Random Forest produce the best models, where the priority of the task is the most important feature for both models. Our scalability analysis determined that the Logistic Regression model is the most scalable compared to the others.

Introduction

Cloud computing is at the forefront of a global digital transformation [1]. It allows businesses to add an extra layer of information security and to raise their operational efficiency to a new level. According to Fortune Business Insights, the North American market has spent approximately 78.28 billion dollars on cloud services. The cloud computing market is expected to continue to expand, from $219B in 2020 to $791.48B in 2028 [2].

The implementation and deployment of a cloud system opens it up to different types of cloud failure [3]. Failing to handle these failures will result in degradation of quality of service (QoS), availability, and reliability, and will ultimately lead to economic loss for both cloud consumers and providers [4]. This challenge is commonly addressed with fault tolerance management, which offers the ability to detect, identify, and handle faults without damaging the final result of cloud computing [5]. Categories of fault tolerance techniques include redundancy techniques, fault-aware policies (i.e., reactive and proactive policies), and load balancing [5]. In this paper, we focus on proactive policies, which can be implemented with a failure prediction technique trained using machine learning algorithms. Failure prediction is significant in preventing the occurrence of failure and in minimizing the maintenance costs of fault tolerance management. As there are different types of cloud failure, we pay particular attention to the prediction of job and task failure. The two are interconnected (i.e., a job contains one or more tasks) and should be tackled simultaneously.

Therefore, in this paper, our aim is to build and evaluate a set of trained models to predict the job and task termination status (i.e., failure or success). For this reason, we have chosen five traditional machine learning (TML) algorithms and three variants of deep learning (DL) algorithms. The TML algorithms are logistic regression (LR), decision tree (DT), random forest (RF), gradient boost (GB), and extreme gradient boost (XGBoost). The DL algorithms are single-layer long short-term memory (LSTM), two-layer (bi-layer) LSTM, and three-layer (tri-layer) LSTM. We used the benchmark dataset, Google Cluster Traces (GCT), published in 2011, to train and test the models. We then perform a series of evaluations to find the best models. This work therefore makes three contributions. First, an approach to comprehensively produce and evaluate predictive failure models (Section 3). Second, the results and findings of four types of analysis, namely exploratory data analysis, feature analysis, performance analysis, and scalability analysis (Section 4). Third, a review of cloud failure prediction and machine learning approaches specifically related to GCT, as well as other datasets (Section 5).

The remainder of this paper is organized as follows. Section 2 explains the dataset used and the fundamental background of cloud failures. Section 3 presents the approach to conducting this study. Section 4 provides the results and findings of the analysis. Section 5 summarizes the related works. Section 6 concludes the paper and outlines future work.

In this section, the dataset used in this study is introduced, followed by fundamental background on cloud failures.

Google Cluster Traces

To date, Google has released three public trace datasets for research and academic use. The first was released in 2007 and contains seven hours of workload details. The dataset contains only basic information, including time, job ID, task ID, job category, and the number of cores and amount of memory. Both the core and memory counts have been normalized.

The second dataset contains traces of 29 days of workload from about 12,500 machines in May 2011. The data consist of 672,074 jobs and around 26 million tasks submitted by users [6]. Unlike the previous dataset, this one includes more information about each task’s resource utilization, as well as information about the task itself, such as scheduling class and task priority.

The third dataset is the most recent, released in early 2020; it documents the use of cloud resources from eight separate clusters in May 2019, with one cluster containing roughly 12,000 machines [7]. Compared to the 2011 dataset, it focuses on resource requests and utilization and contains no information about end users. The 2019 dataset has three additions: histograms of CPU utilization for each 5-minute period, information regarding the reservation of shared resources by a job, and job-parent information for master/worker relationships such as MapReduce jobs.

2011 Dataset

The data provided in this version can generally be divided into machine details, and job and task details.

The machine details are provided in two tables, machine events and machine attributes. Every unique machine in both of these tables is identified by a machine ID, a unique 64-bit identifier. The machine events table consists of a timestamp, machine ID, event type, platform ID, and the capacity of each machine. This table records all events related to a machine, such as adding and removing machines from the cluster. Each record has its own timestamp indicating when the event occurred. There is also information on the machine platform, representing the microarchitecture and chipset version of the machine, and the normalized values of the machine’s CPU and memory capacity. The machine attributes table contains details of machine properties such as kernel version, machine clock speed, and the external IP address linked to the machine.

Four tables are provided to describe the details of jobs and tasks: the job events table, the task events table, the task constraints table, and the resource usage table.

The job events and task events tables describe the life cycle of a job or task. Each job is identified by a unique 64-bit identifier, while each unique task is identified by the combination of job ID and task index. Every event from submission to termination is recorded in these tables. The type of event is identified by the event type column, where the value zero indicates that the job or task was submitted by the user, the value one indicates that it has been scheduled to run on a machine, values two to six indicate that it has been terminated, and values seven and eight indicate that its details or requirements have been updated.

Furthermore, there is a column that indicates whether a record has been synthesized as a replacement for a missing record. Alongside all the columns mentioned above, both tables have a scheduling class column. The scheduling class is indicated by a single integer value from zero to three, where zero indicates a non-production task and three indicates a latency-sensitive task. Lastly, there is one column for a username and two more columns for job names. These columns have been anonymized by combining data from several internal name fields.

In the case of the task events table, there are six more columns in addition to all those mentioned above. One is the machine ID, which identifies the machine the task runs on. The second is the priority of the task, valued from zero (least important) to 11 (most important). Next, there are three columns detailing the normalized amount of each requested resource: CPU, memory, and disk space. Lastly, there is a column for the different-machines constraint; if it is set, the task must be executed on a machine different from the one currently running another task from the same job.

The task constraints table discloses the constraints associated with each task, of which there may be zero, one, or more. Task constraints prevent a task from running on certain machines. Each record in the task constraints table corresponds to exactly one task event record. Finally, there is the resource usage table, which discloses the amount of computing resources, such as CPU, memory, and disk space, used by each task. Usage is recorded for each 5-minute (300-second) measurement period.
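As a concrete illustration, the sketch below loads one shard of the 2011 task events table with pandas and labels the termination records, roughly the preparation needed for a task-level termination dataset. The file layout, column names, and event-type codes reflect our reading of the published trace-format documentation; verify them against the schema file shipped with your copy of the traces.

```python
import pandas as pd

# Column layout of the 2011 task events table, per the trace format
# document (verify against the schema distributed with the traces).
TASK_EVENT_COLUMNS = [
    "timestamp", "missing_info", "job_id", "task_index", "machine_id",
    "event_type", "user", "scheduling_class", "priority",
    "cpu_request", "memory_request", "disk_space_request",
    "different_machines_restriction",
]

# Event type codes: 0 = submit, 1 = schedule, 2-6 = termination
# (evict, fail, finish, kill, lost), 7-8 = updates.
TERMINATION_EVENTS = {2: "evict", 3: "fail", 4: "finish", 5: "kill", 6: "lost"}

# One shard of the table; the trace is split across many such files.
events = pd.read_csv(
    "task_events/part-00000-of-00500.csv.gz",
    header=None, names=TASK_EVENT_COLUMNS, compression="gzip",
)

# Keep only termination records and label them with a readable status.
terminated = events[events["event_type"].isin(TERMINATION_EVENTS.keys())].copy()
terminated["status"] = terminated["event_type"].map(TERMINATION_EVENTS)
print(terminated["status"].value_counts())
```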

Failure in Cloud

This section explains the failure and fault tolerance categories in the cloud environment.

Categories of Cloud Failure

Cloud computing, like any other computing system, is susceptible to failure. A cloud computing system fails when it cannot perform its predefined function due to hardware failures or unexpected software failures. The more complex the computing system, the higher the probability that the system will fail. There are two classifications of failures, namely architecture-based and occurrence-based failures [3]. These failure categories are summarized in Table 1.

Architecture-based failures include two types of failure, defined as service and resource failures. Service failure usually occurs due to software failure, such as unplanned reboots and cyberattacks, or scheduling failure, such as service timeout. Meanwhile, resource failure is caused by hardware failure, such as power outages, system breakdown, memory problems, and complex circuit design. The occurrence-based failure consists of two types of failure, specifically correlated and independent failures. Correlated failure is a failure that occurs due to another domain of failure. For example, a cloud resource is unavailable due to a power outage that affects the cloud infrastructure. Meanwhile, independent failure is caused by external factors, such as human error and computer overheating.

As we are concerned about the termination status of the job and task, our work can be associated with service failure, where the main cause is driven by scheduling failure.

Fault Tolerance in Cloud Computing

Despite the advancement of cloud computing technology, cloud computing performance is still hampered by its vulnerability to failure. Therefore, fault tolerance is one of the fundamental requirements of cloud computing. There are different ways of classifying fault tolerance approaches [ 5 , 8 , 9 ]. Here, we highlight two major tolerance approaches, namely the reactive and proactive approach.

The reactive fault tolerance approach reduces the effect of failure on the execution of an application; it takes effect when a failure occurs in the cloud environment. Several techniques are used in this approach. One of them is task replication [10], which duplicates tasks on multiple resources; this increases the likelihood that at least one copy of the task will be completed correctly. The second technique is the re-submission of tasks [11]: when a task fails, it is rerun on the same node or on a different resource.

The proactive fault tolerance approach is based on the principle of preventing failures from occurring in the first place. Under this approach, the condition of the physical system is constantly monitored and the occurrence of a system failure is predicted. If the probability of a fault is high, the cloud service provider takes a preventive measure, such as removing hardware from the service cluster or applying corrective measures to the software. Failure prediction can be built from information collected about previous cloud failures, and machine learning is an excellent tool for predicting software and hardware failures in cloud infrastructures. Failure prediction is considered a proactive fault tolerance approach when it is implemented in the cloud infrastructure [ 12 ].

Existing work has implemented the proactive approach, for example, predicting hardware failure in a cloud farm [ 13 ] and predicting memory failure in a computer system [ 14 ]. Another example is the work of [ 15 ], where a machine learning algorithm is constructed to predict task failure in a cloud system. Such a prediction model enables the system to manage the available cloud resources efficiently, minimizing failures that could disrupt the availability of those resources.

The Proposed Approach

This section introduces and explains the approach proposed for carrying out this study.

Figure 1 shows the approach used to implement the comprehensive comparison study of job and task failure prediction driven by the GCT dataset. We divide the activities involved into three phases, namely data handling , model generation , and analysis and experiments . The data handling phase comprises extracting data from the published tables (that is, the job event and task event tables) and preparing two datasets: dataset A, which comprises the job-level termination data, and dataset B, which comprises the task-level termination data. Table 2 shows a sample of dataset A, while Table 3 shows a sample of dataset B. The model generation phase involves developing and training predictive models based on the TML and DL algorithms. The predictive models address a classification problem, that is, classifying the termination status of each job or task as either success or failure. The analysis and experiments phase comprises four analyses, namely exploratory data analysis (EDA), performance analysis (PA), feature importance analysis (FA), and scalability analysis (SA). The EDA is applied to uncover the behavior of the dataset; its input data come from the published dataset. PA is used to determine the quality of the predictive models against certain metrics and thereby identify the best models. FA is used to identify the feature importance of the predictive models. SA is applied to understand the scalability of the predictive models in relation to different data sizes. The input data for PA, FA, and SA come from datasets A and B.

figure 1

This figure shows the approach to implement a comprehensive comparison study of the prediction of job and task failure driven by the GCT dataset

Furthermore, Fig. 2 shows the technical workflow and platforms for implementing the proposed approach presented in Fig. 1 . Two platforms are used to implement the overall phases. First, we use the Google Cloud Platform (GCP) [ 16 ] for data handling and model generation. Second, we use a local machine to perform the analysis and experiments. On GCP, we configure a virtual machine with 4 CPU cores, 16 GB of memory, and an NVIDIA Tesla T4 GPU. Meanwhile, the local machine has 2 CPU cores with a clock speed of 2.30 GHz and 20 GB of memory.

figure 2

This figure shows the technical workflow and platforms to implement the proposed approach

Data Handling

In this section, we elaborate on the two main tasks for data handling, namely, data extraction and preparation, as presented in Fig.  1 .

Data Extraction

The source data are retrieved from Google Cloud Storage (GCS). Information on the 2011 version of the dataset is available at [ 17 ]. We downloaded the GCT dataset, which is approximately 400 GB in size. The urllib library was used to access the data stored in GCS. The files are stored as partitioned, gzipped archives, and the number of partitions is embedded in the Uniform Resource Locator (URL) of the GCS bucket. All data were then extracted using the gzip library to generate the intermediate dataset, namely the job event table and the task event table in csv format. To automate this task, a function was built to load, extract, and save the intermediate dataset into the respective directory, roughly as sketched below.
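A minimal sketch of such a function follows; the URL template mirrors the trace's public bucket layout, but the exact paths and partition counts should be taken from the trace documentation [ 17 ]:

```python
import gzip
import os
import shutil
import urllib.request

# Illustrative URL template; the real partition count is embedded in the
# file names published with the trace (e.g., 500 parts for job_events).
URL = ("https://commondatastorage.googleapis.com/clusterdata-2011-2/"
       "job_events/part-{:05d}-of-00500.csv.gz")

def fetch_and_extract(part, out_dir="job_events"):
    os.makedirs(out_dir, exist_ok=True)
    gz_path = os.path.join(out_dir, f"part-{part:05d}.csv.gz")
    csv_path = gz_path[:-3]
    urllib.request.urlretrieve(URL.format(part), gz_path)  # download one partition
    with gzip.open(gz_path, "rb") as fin, open(csv_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)                       # decompress to csv
    os.remove(gz_path)

for part in range(3):  # fetch the first few partitions as a demonstration
    fetch_and_extract(part)
```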

Data Preparation

This task involves several subtasks, namely data cleaning , data integration , data reduction , data transformation , and class balancing . We utilized the Dask library instead of Pandas to process the data, since Dask enables parallelism while providing a similar capability to Pandas, on top of which it is built. The intermediate datasets, namely the job event table and the task event table, are the main input for this task. The aim is to prepare two new datasets: a job-level termination dataset (i.e., dataset A) and a task-level termination dataset (i.e., dataset B). The details of each subtask are as follows.

Data Cleaning

In the case of the GCT dataset, data cleaning chiefly means handling missing values in the task event table. Three columns are affected by this problem, namely CPU request , memory request , and disk space request . We solve it by removing the records that contain missing values using the dropna function. We then use visualization to check the data: a boxplot to observe the distribution of the continuous data, and bar and pie charts to observe the categorical data. It is important to note that we also found some outliers in the task event table. However, we decided to keep these records because they can potentially contribute to identifying the termination status of a job and predicting task failure.
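A sketch of this cleaning step with Dask might look as follows (the column names follow the trace schema, since the raw csv files carry no header):

```python
import dask.dataframe as dd

task_cols = ["timestamp", "missing_info", "job_id", "task_index", "machine_id",
             "event_type", "user_name", "scheduling_class", "priority",
             "cpu_request", "memory_request", "disk_space_request",
             "different_machine_constraint"]

tasks = dd.read_csv("task_events/part-*.csv", names=task_cols)

# Drop records whose resource requests are missing, as described above.
tasks = tasks.dropna(subset=["cpu_request", "memory_request",
                             "disk_space_request"])
```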

Data Integration

In this study, we need to integrate data from the job event and task event tables into a single table to produce the job-level termination dataset (i.e., dataset A). This step is needed to increase the number of relevant features for dataset A. To support the integration, we must identify the most appropriate values from the task event table to merge with each record in the job event table, which matters because one job event can have multiple task events. For this reason, we apply an aggregation step in which the set of task event records associated with each job is aggregated over three features, namely CPU request , memory request , and disk space request , taking the maximum value of each. These maximum values are then merged with the job event records. The data integration task is implemented using the merge function of the Dask library, as sketched below. Dataset B is not involved in the integration task because its existing features are sufficient for modeling purposes.
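Assuming the job event table has been loaded into a Dask frame jobs in the same way as tasks above, the aggregation and merge can be sketched as:

```python
# Take the per-job maximum of each resource request across its task events.
per_job_max = (tasks.groupby("job_id")
                    [["cpu_request", "memory_request", "disk_space_request"]]
                    .max()
                    .reset_index())

# Merge the aggregated maxima into the job event records to form dataset A.
dataset_a = jobs.merge(per_job_max, on="job_id", how="inner")
```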

Data Reduction

This task is carried out to obtain a subset of the entire dataset, which enables us to conduct the analysis and experiments on the available platform within a limited budget. The task has two objectives. The first is to reduce the number of columns or features, supported by a correlation analysis. The second is to reduce the number of rows by selecting a subset of the dataset.

For the first objective, we first convert the string datatype of the respective features into a numerical datatype using Pandas’ astype function. After that, we construct a heatmap to identify the correlations between the features and to determine which of them to remove from the dataset. In general, features that have a weak or no correlation with the outcome of a job or task are removed. Once this is done, the only features that remain in dataset A are the scheduling class and the resource-related requests (i.e., CPU request , memory request , and disk space request ). Meanwhile, the remaining features of dataset B are scheduling class , priority , and the resource-related requests. A sketch of this step follows.
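In the sketch below, seaborn is an assumption made only for drawing the heatmap, and the 0.05 cut-off is illustrative rather than the threshold used in the study:

```python
import matplotlib.pyplot as plt
import seaborn as sns  # assumed here only for plotting the heatmap

df = dataset_a.compute()              # materialize the Dask frame

corr = df.corr(numeric_only=True)     # pairwise feature correlations
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# Drop features with essentially no correlation to the outcome column.
weak = corr["event_type"].abs() < 0.05
df = df.drop(columns=corr.index[weak].difference(["event_type"]))
```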

For the second objective, we reduce the rows or records on the basis of two filters. The first filter uses the timestamp feature, where we select data from the first fourteen (14) days only. As the value is in a timestamp format, we convert the day count into microseconds to enable the filtering. The second filter removes records using the event type feature, which has 9 distinct values. We remove those whose event types represent an incomplete lifecycle, that is, job/task submission (0), job/task scheduled (1), update while still in queue (7), and update during running (8). Both filters are sketched below.
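Continuing from the frame above, both filters reduce to two boolean selections:

```python
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000

# Filter 1: keep only events from the first fourteen days of the trace.
df = df[df["timestamp"] < 14 * MICROS_PER_DAY]

# Filter 2: drop event types that represent an incomplete lifecycle
# (0 = submit, 1 = schedule, 7 = update-pending, 8 = update-running).
df = df[~df["event_type"].isin([0, 1, 7, 8])]
```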

Data Transformation

This task focuses on re-categorizing the termination status in both datasets A and B, using the event type feature. After the preceding data reduction, the remaining event type values are evict (2), fail (3), finish (4), kill (5), and lost (6), which indicate the termination status of each job or task. Based on these values, we categorize the event type into two classes: success (where the event type equals 4) and failure (any value other than 4).
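The re-categorization is a short mapping over the remaining event types:

```python
# Remaining event types: 2 = evict, 3 = fail, 4 = finish, 5 = kill, 6 = lost.
# Success (1) when the event type equals 4; failure (0) otherwise.
df["termination_status"] = (df["event_type"] == 4).astype(int)
df = df.drop(columns=["event_type"])
```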

Class Balancing

This task is needed because the numbers of records in the failure and success classes are imbalanced. Several techniques can be used to balance them; here, we apply the synthetic minority over-sampling technique (SMOTE) [ 18 ]. In general, SMOTE selects examples that are close together in the feature space, draws a line between them, and synthesizes a new sample at a point along that line. The SMOTE process increases the amount of data in the minority class and aids model construction and training, especially for the DL models in the later stage.
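A sketch using the SMOTE implementation from the imbalanced-learn library (an assumption; the paper does not name the implementation used):

```python
from imblearn.over_sampling import SMOTE

X = df.drop(columns=["termination_status"])
y = df["termination_status"]

# Oversample the minority class so both classes are equally represented.
X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(X, y)
print(y.value_counts(), y_balanced.value_counts(), sep="\n")
```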

Model Generation

During this phase, two categories of machine learning algorithms are implemented, namely TML and DL algorithms. Each dataset, A and B, is divided into two parts: 70% for training and 30% for testing. The algorithms involved are as follows.

Traditional Machine Learning Algorithms

We have implemented three types of TML algorithms to address the classification problem: regression, tree, and ensemble. For regression, LR [ 19 ] is chosen since it is the most investigated regression algorithm in machine learning. For the tree type, we selected DT [ 20 ], the primary tree-based machine learning algorithm for classification. Lastly, for the ensemble category, we chose RF [ 21 ], GB [ 22 ], and XGBoost [ 23 ]. These algorithms were implemented using the scikit-learn library [ 24 ], except XGBoost, which was implemented using the XGBoost library [ 23 ]. All models were built with their default arguments, with one small change: for the LR model, the maximum number of iterations was increased because the default value is insufficient for all solvers to converge. The solver is the optimization algorithm that minimizes the loss of the LR model. A sketch of this setup follows.
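In the sketch below, the raised max_iter value is illustrative, as the paper does not state the exact number used:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

# 70/30 split, as used for both datasets A and B.
X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_balanced, test_size=0.3, random_state=42)

models = {
    "LR": LogisticRegression(max_iter=1000),  # default is too low to converge
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "GB": GradientBoostingClassifier(),
    "XGBoost": XGBClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```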

Deep Learning Algorithms

The three DL algorithms are variants of LSTM [ 25 ]-based algorithms, differentiated by the number of hidden layers: the Single-Layer LSTM, the Bi-Layer LSTM (two hidden layers), and the Tri-Layer LSTM (three hidden layers). Finally, we add a dense layer to ensure that each model produces a single value per prediction. The number of epochs is set to 100 to obtain the best model possible. To reduce training time and prevent overfitting, training is stopped automatically if the validation loss does not improve for 10 epochs.
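A sketch of the Bi-Layer variant follows, assuming a Keras implementation (the paper does not name the framework) and an illustrative 64 units per layer; the tabular records are treated as length-1 sequences to satisfy the LSTM input shape:

```python
import numpy as np
from tensorflow import keras  # assumed framework

n_features = X_train.shape[1]
X_train_3d = np.asarray(X_train, dtype="float32").reshape(-1, 1, n_features)

def build_lstm(n_hidden_layers, units=64):
    model = keras.Sequential()
    model.add(keras.layers.Input(shape=(1, n_features)))
    for i in range(n_hidden_layers):
        # Every LSTM layer except the last must return full sequences.
        model.add(keras.layers.LSTM(units,
                                    return_sequences=i < n_hidden_layers - 1))
    model.add(keras.layers.Dense(1, activation="sigmoid"))  # single output
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Stop training when the validation loss has not improved for 10 epochs.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)
model = build_lstm(2)  # 1, 2, or 3 hidden layers for the three variants
model.fit(X_train_3d, np.asarray(y_train), epochs=100,
          validation_split=0.2, callbacks=[early_stop])
```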

Analysis & Experiments

This phase comprises exploration activities, identification of feature importance, and determination of the best models in terms of performance and scalability. Each of these is explained below.

Exploratory Data Analysis

This analysis focuses on exploring the data in the job event table and the task event table before preparing the data for the machine learning task. The job event table comprises eight columns, namely Timestamp, Missing Info, Job ID, Event Type, User Name, Scheduling Class, Job Name, and Logical Job Name. Meanwhile, the task event table comprises thirteen columns, namely Timestamp, Missing Info, Job ID, Task Index, Machine ID, Event Type, User Name, Scheduling Class, Priority, CPU Request, Memory Request, Disk Space Request, and Different Machine Constraint. Several types of analysis are performed, namely data distribution analysis for the continuous features and data classification analysis for the discrete features. We then produce a series of visualizations: a bar chart and a pie chart to visualize the results of data categorization, and a box plot to visualize the results of data distribution.

Performance Analysis

This analysis focuses on measuring the performance of the predictive models towards building highly accurate classifiers, and is divided into job-level and task-level predictive models. A large amount of data is used to train and test the models, namely 1 million records for the job-level prediction and 14 million records for the task-level prediction. A set of evaluation metrics is applied to each model, namely error rate, precision, sensitivity, specificity, and F-score. These measures are derived from the generated confusion matrix, which consists of four parameters: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
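For reference, these metrics follow from the confusion matrix in the standard way (with accuracy stated here as the complement of the error rate):

```latex
\begin{aligned}
\text{Error rate} &= \frac{FP + FN}{TP + TN + FP + FN}, \qquad
\text{Accuracy} = 1 - \text{Error rate},\\[4pt]
\text{Precision} &= \frac{TP}{TP + FP}, \qquad
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP},\\[4pt]
\text{F-score} &= \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}
                       {\text{Precision} + \text{Sensitivity}}.
\end{aligned}
```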

Feature Importance Analysis

Feature importance refers to a score that measures how much each feature contributes to a predictive model. It helps in understanding the relative influence of each feature on the model's estimates; it does not relate to the accuracy of the model. The importance of each feature is determined by a score calculated from predictions over the training data: a high score means that the feature has a high priority in determining the outcome of the prediction. In this study, the feature importance scores are calculated using the Dalex library [ 26 ].
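A minimal sketch with Dalex, which computes permutation-based importance by default (the XGBoost model is chosen here only as an example):

```python
import dalex as dx

explainer = dx.Explainer(models["XGBoost"], X_train, y_train, label="XGBoost")
importance = explainer.model_parts()  # permutation feature importance
print(importance.result)              # per-feature dropout-loss scores
importance.plot()
```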

Scalability Analysis

This analysis focuses on measuring the scalability of the predictive models, that is, their ability to scale with the amount of data, determined by measuring the time taken to make predictions over a given input. For this analysis, we prepared several sets of data for the job-level and task-level predictive models. For the job level, there are three sets of input data of different sizes: (1) 10,000 rows, (2) 100,000 rows, and (3) 1,000,000 rows. For the task level, we prepared four sets: (1) 10,000 rows, (2) 100,000 rows, (3) 1,000,000 rows, and (4) 10,000,000 rows. These input data are taken from the job event table and the task event table. For this analysis, the event type (i.e., the dependent variable) is ignored and only the independent variables are considered. The data preparation process is similar to the data handling discussed in Section 3.2 , excluding data reduction, data transformation, and class balancing.
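The measurement itself reduces to timing batch predictions of increasing size, roughly as below (assuming the prepared input frame holds at least the listed row counts):

```python
import time

def time_predictions(model, X, sizes=(10_000, 100_000, 1_000_000)):
    # Wall-clock time to predict batches of increasing size.
    for n in sizes:
        batch = X[:n]
        start = time.perf_counter()
        model.predict(batch)
        print(f"{n:>10,} rows: {time.perf_counter() - start:.3f} s")

time_predictions(models["LR"], X_test)
```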

Results and Findings

This section explains the results of the analysis and experiments related to exploratory data analysis, performance, feature importance, and scalability.

Results of Exploratory Data Analysis

We explain the results based on two tables, namely the job event table and the task event table.

Job Event Table

This table contains eight columns (or features) and 2,012,242 rows, where three of the columns take a hashed string type and the rest are defined as integers. There are only 672,074 unique job identifiers, as each job has at least three event records, namely job submission, job scheduling, and job termination. Figure 3 shows the distribution of scheduling classes in the job event table, while Fig. 4 shows the job termination status classified by scheduling class.

figure 3

The figure shows the distribution of scheduling classes in the job event table. Most of the job events are classified as 0, which indicates a non-production job

figure 4

The figure shows the distribution of termination status categorized by scheduling class. Most of the job events terminate with the finish status

There are four scheduling classes. Scheduling class 0 indicates non-production jobs (e.g., non-business-critical analyses), scheduling class 3 represents latency-sensitive jobs (e.g., serving revenue-generating user requests), while scheduling classes 1 and 2 represent jobs that fall between the least and most latency-sensitive. Based on Fig. 3 , almost half of the jobs are identified as class 0. There are around 5,000 jobs designated as class 3, 215,000 jobs as class 1, and 194,000 jobs as class 2. From Fig. 4 , about 385,582 jobs finished normally, and 274,341 jobs were killed due to user interruption or because a job they depended on died. About 10,000 jobs failed due to task failures. Lastly, about 22 jobs were evicted, either to make room for higher-priority jobs, because the scheduler overcommitted, because actual demand exceeded the machine capacity, or because machines became unusable or disks failed.

figure 5

Job Event Table Correlation. The figure shows the heat map of the correlation analysis of the job event table

Figure 5 shows the correlation between the features in the job event table as a heatmap. We can observe that event type is highly correlated with missing info , while the other features have a weak or no correlation with event type . Two features, namely user name and job name , are not considered in this correlation analysis, as they contain only anonymized hash values.

Task Event Table

This table contains 13 columns (or features) and 144,648,288 rows, where eight of the columns are integer types, three are floating types, and one column is a string type. There are 25,424,731 unique tasks in the table, where a unique task is identified by the combination of the job ID and the task index. Similarly to the job event table, a single task has at least three event records, namely task submission, task scheduling, and task termination. Of the 25,424,731 unique tasks, 18,375 records were synthesized, which is approximately 0.07% of the total. Figure 6 shows the distribution of task priorities, while Fig. 7 shows the termination status of tasks categorized by priority.

figure 6

The figure shows the distribution of task priority in the task event table. The left chart depicts the overall distribution of task priority, whereas the right chart depicts a detailed distribution of the priorities whose share is smaller than 5%

figure 7

The figure shows the distribution of the tasks’ termination status in the task event table. Level 0 is the lowest-priority task, while level 11 is the highest-priority task

From Fig. 6 , we can see that more than half of the tasks in the dataset have priority level 0, the lowest priority. Unlike for jobs, task priority plays a significant role in determining the resource access of each task. From Fig. 7 , we can see that the ratio of failed tasks to finished tasks is about 3 to 4; that is, of the tasks that reached either outcome, roughly three in seven did not finish correctly. This ratio is abnormal, as a cloud service provider usually aims for 99.9% service availability. We can therefore assume that there were a series of outages at the site while the traces were being collected.

figure 8

The figure shows the distributions of the resource requests (CPU request, memory request, disk space request)

The distribution of resource requests is shown in Fig. 8 . The average CPU request per task is around 0.0125 to 0.0625 of the total CPU resources of a machine. The memory request is usually in the range of 0.01553 to 0.03180, while the disk space request is in the range of 0.000058 to 0.000404. The values of all resources have been normalized to between 0 and 1. We may therefore conclude that the amount of disk space requested is insignificant compared to the available resources, and that each task requests less than 10% (0.1) of a machine's total CPU and memory.

figure 9

This figure shows the heatmap of the correlation between features of the task event table

Figure 9 shows the correlation of the features in the task event table as a heatmap. From the figure, we can see that event type has a weak correlation with the other features. In addition, we notice a positive correlation between resource usage and priority . Scheduling class does not appear to be affected by the amount of resources requested or by the status of the task. Although the correlation is weak, missing info is negatively correlated with the resource requests .

Feature Importance Results

Herein, we present the results of feature importance based on job-level and task-level failure prediction.

Job Level Failure Prediction

Figure 10 shows the features that matter most for each of the machine-learning-driven models. With the exception of the GB and XGBoost models, scheduling class and CPU request are in general the most significant features in determining the job termination status, with importance scores of around 30 to 35%. For the three DL algorithms, CPU request is the most important feature in determining the termination status of a cloud job, whereas for the LR model, memory request is the most important feature.

figure 10

This figure shows the feature importance for the job-level prediction models

Task Level Failure Prediction

As shown in Fig. 11 , priority is the most important feature in predicting the termination status of a task. For the TML algorithms, with the exception of GB, the priority importance score is 0.2, meaning that it determines 20% of the result on the given dataset. Among the resource-related requests (i.e., CPU request , memory request , disk space request ), memory request is the most important feature for the TML models, except for the DT model, where disk space request is the most important. For the DL models, memory request is the most important feature for the Single-Layer and Bi-Layer LSTM, while for the Tri-Layer LSTM, disk space request is the most important resource request.

figure 11

This figure shows the feature importance for the task-level prediction models

Performance Results

We present the performance results for job-level and task-level failure prediction.

Table 4 shows the performance of each model on the training dataset, while Table 5 shows the performance of all models on the test dataset. We can observe that the performance of all LSTM models is lower than that of the TML models, excluding the LR model. Specifically, the XGBoost model has the best accuracy, at 93.25% during training and 93.10% during testing. Its F-scores of 0.9325 and 0.9310 further demonstrate that XGBoost is the best of all the models generated in this experiment. XGBoost also demonstrated the highest precision, correctly identifying 94.31% of the predicted job termination statuses. Furthermore, the XGBoost sensitivity and specificity scores are 91.92% and 96.07%, respectively, indicating that it can successfully predict the termination status of a job.

The second highest accuracy is recorded by the DT model. The results for the XGBoost and DT models differ only slightly in accuracy, precision, and specificity across the training and testing phases. The DT model achieved 93.27% accuracy during training and 93.23% during testing, with a precision of 92.19% during training and 91.95% during testing. Similar to XGBoost, the DT model also obtained high sensitivity and specificity scores in both phases, demonstrating that it can correctly classify the job termination status.

The GB model has the third highest accuracy during the testing phase, at 90.72%, compared to the RF model at 89.65%. During the training phase, however, the GB model shows a lower accuracy, at 90.72%, than the RF model at 93.27%. GB also performed better during testing than training, with an F-score of 0.8874 in the testing phase against 0.8098 in the training phase. The RF model is therefore the fourth-best model for job failure prediction, despite its slightly lower testing accuracy of 89.65%. The LR model has the lowest accuracy and the lowest F-score among the TML models in this experiment: it obtained an accuracy score of 65%, incorrectly classifying the outcome of 35% of the jobs in the dataset.

Finally, with the exception of the LR model, the three LSTM models developed for this experiment performed somewhat worse than the TML algorithms. This may be caused by underfitting, where there is not enough data for the models to capture the complexity needed to make the correct classification. We also observed few differences in performance between the LSTM variants despite their different numbers of hidden layers. We can further conclude that the suitable number of LSTM hidden layers for this experiment is two, as there is a dip in performance for the Tri-Layer LSTM compared to the Bi-Layer LSTM.

Table 6 shows the performance of each model on the training dataset, while Table 7 shows the performance of each model on the test dataset. As shown, the best models for task-level failure prediction are the RF and DT models: both achieved 89.75% accuracy during the testing phase. However, the RF model gains a slight edge during the training phase with 92.47% accuracy, while the DT model achieved 89.75%. Both models managed to correctly classify 78.4% of the tasks that ended normally, a trend that can be seen in most of the machine learning models.

XGBoost has the third highest accuracy during the testing phase, at 89.35%, compared to GB at 87.87%. The same holds for the training phase, where the accuracy of the GB model is 87.81% and that of XGBoost is 89.34%. The F-scores for XGBoost are 0.9120 and 0.9119, while the F-score for the GB model is 0.9003. The LR model has the lowest accuracy and the lowest F-score among the TML models in this experiment: it achieved 70% accuracy, wrongly classifying 30% of the tasks in the dataset.

Finally, as with the job-level failure prediction, the three LSTM models performed marginally worse than all TML models except the LR model. The accuracy scores of all LSTM models lie between 86% and 87%. We also observed few differences in performance between the three LSTM models despite their different numbers of hidden layers.

Scalability Results

We present the scalability results for the job-level and task-level prediction models.

Figure 12 illustrates the scalability results of the job-level prediction models. For the TML models, we found that a single prediction takes around 2 to 3 microseconds on average, whereas the DL models take around 2 to 3 milliseconds per prediction. The most scalable TML model is the LR model, which can predict up to 1 million inputs in under one second.

figure 12

This figure shows the scalability of the job-level failure prediction models in relation to different amounts of data

The time taken for a task-level prediction is similar to that of a job-level prediction. Figure 13 illustrates the scalability results of the task-level prediction models. Again, a single prediction takes around 2 to 3 microseconds on average for the TML models, whereas the DL models take around 2 to 3 milliseconds. The most scalable TML model is again the LR model, which can predict up to 10 million inputs in less than two seconds.

figure 13

This figure shows the scalability of the task-level failure prediction models in relation to different amounts of data

Related Works

In this section, we discuss related work in three respects. First, we review work that has addressed the prediction of job and task failure using the GCT dataset, as shown in Table 8 . Second, we review work that has addressed other types of failure prediction using other datasets, as shown in Table 9 . Third, we review other types of prediction that have specifically used the GCT dataset, as shown in Table 10 .

The review resulting in Table 8 is based on three key elements. First, we identify the prediction scope, to determine whether the related work addresses job failure, task failure, or both. Second, we identify the features studied in relation to the GCT dataset. Third, we determine the machine learning algorithms applied to produce the predictive models, which we categorize into SML, TML, and DL. For comparison purposes, we focus on the first and third elements, while the second is meant to provide more information about the related works. From the prediction scope, we can conclude that most related work addresses either job or task failure prediction; limited work has addressed both. With regard to the algorithms applied, most of the related work has applied TML algorithms; few studies applied DL, and only one applied SML. Our work addresses both job and task failure prediction, constructing and evaluating highly accurate predictive models from two categories of algorithms, namely TML and DL. Compared to the related works, our work is thus more comprehensive in determining the best predictive models, since it covers both types of failure and utilizes more algorithms from TML and DL.

We then expand our review beyond the prediction of job and task failures, focusing on the coverage of the algorithms applied in related work. The review resulting in Table 9 is based on the same three key elements as Table 8 ; the main difference is that it covers related works that utilized datasets other than GCT, so their failure prediction scope does not target job and task failure. We take them into account because these works are still within the cloud computing area, and the algorithms they apply can also be categorized into SML, TML, and DL. This review can provide additional references for readers interested in the broader context of failure prediction beyond the GCT dataset. Although these works are not directly comparable to ours with respect to the targeted failure prediction problem, we can conclude that few of them comprehensively evaluate failure predictive models (i.e., using multiple categories of machine learning algorithms), as presented in the table.

Furthermore, the review that resulted in Table 10 is also based on the same three key elements as Table 8 . The main similarity is that the related works are those that utilized the GCT dataset; the key difference is that we broaden the scope of prediction beyond failure prediction. This summary may provide a wider context for the use of GCT to interested readers. As shown, we can conclude that, in addition to failure prediction, GCT has mainly been used to predict workload. We can also conclude that more of these works applied multiple categories of machine learning algorithms to find the best models, which supports our strategy of comprehensively addressing the failure prediction problem with TML and DL. Furthermore, our work also contributes to the use of GCT from a failure prediction perspective with a comprehensive evaluation.

Conclusion and Future Work

In this paper, we have proposed a comprehensive evaluation approach to predict job and task failure. To this end, we constructed five TML models and three DL models and compared their performance in predicting job and task failure using the GCT dataset. Our performance analysis showed that the best-performing model for predicting job-level failures is the XGBoost classifier, which achieved an accuracy score of 94.35% and an F-score of 0.9310. For task-level prediction, we found two best-performing models, based on DT and RF; both obtained an accuracy score of 89.75% and an F-score of 0.9154. Overall, the results show that the TML models perform slightly better than the DL models in classifying job and task termination status. Furthermore, our feature importance analysis determined that scheduling class and CPU request are the most significant features for TML, while disk space request and memory request are the most important features for DL, in the context of job failure prediction. Meanwhile, for task-level prediction, priority is the dominant feature for the TML models, while memory request is the most important feature for the DL models. Finally, our scalability analysis found that the TML models can make predictions in a reasonably short time, even when executed on consumer-level hardware.

There are several recommendations for future work. First, multi-objective prediction would be useful for supporting cloud resource management; for example, workload and failure prediction could be addressed simultaneously when automatically deciding on resource allocation, scheduling, or provisioning, and training such predictive models could benefit from the GCT dataset. Second, the scope of prediction could be expanded to energy efficiency, an important challenge for cloud service providers seeking to minimize their costs that also supports the Sustainable Development Goals agenda; however, training such models would require other datasets. Third, transfer learning techniques could be explored to produce the best-quality predictive models for cloud resource management.

Availability of data and materials

Not applicable.

Stein M, Campitelli V, Mezzio S (2020) Managing the Impact of Cloud Computing. CPA J N Y 90(6):20–27


Fortune Business Insight (2021) Cloud Computing Market Size, Share & COVID-19 Impact Analysis, By Type (Public Cloud, Private Cloud, Hybrid Cloud), By Service (Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS)), By Industry (Banking, Financial Services, and Insurance (BFSI), IT and Telecommunications, Government, Consumer Goods and Retail, Healthcare, Manufacturing, Others), and Regional Forecast, 2021–2028. Technical report, Fortune Business Insight

Gill SS, Buyya R (2018) Failure management for reliable cloud computing: a taxonomy, model, and future directions. Comput Sci Eng 22(3):52–63


Press Association (2017) British Airways IT failure caused by ‘uncontrolled return of power’. The Guardian. https://www.theguardian.com/business/2017/may/31/ba-it-shutdown-caused-by-uncontrolled-return-of-power-after-outage . Accessed 24 Jan 2022

Nazari Cheraghlou M, Khadem-Zadeh A, Haghparast M (2016) A survey of fault tolerance architecture in cloud computing. J Netw Comput Appl 61:81–92

Abdul-Rahman OA, Aida K (2014) Towards understanding the usage behavior of Google cloud users: the mice and elephants phenomenon. In: 2014 IEEE 6th International Conference on Cloud Computing Technology and Science. IEEE, Los Alamitos, p 272–277

Verma A, Pedrosa L, Korupolu MR, Oppenheimer D, Tune E, Wilkes J (2015) Large-scale cluster management at Google with Borg. In: Proceedings of the European Conference on Computer Systems (EuroSys). Association for Computing Machinery (ACM), France, p 1–17

Bala A, Chana I (2012) Fault tolerance-challenges, techniques and implementation in cloud computing. Int J Comput Sci Issues (IJCSI) 9(1):288

Shahid MA, Islam N, Alam MM, Mazliham M, Musa S (2021) Towards Resilient Method: An exhaustive survey of fault tolerance methods in the cloud computing environment. Comput Sci Rev 40:100398

Setlur AR, Nirmala SJ, Singh HS, Khoriya S (2020) An efficient fault tolerant workflow scheduling approach using replication heuristics and checkpointing in the cloud. J Parallel Distrib Comput 136:14–28

Kochhar D, Jabanjalin H (2017) An approach for fault tolerance in cloud computing using machine learning technique. Int J Pur Appl Math 117(22):345–351

Mukwevho MA, Celik T (2018) Toward a smart cloud: A review of fault-tolerance methods in cloud systems. IEEE Trans Serv Comput 14(2):589–605

Li Y, Jiang ZM, Li H, Hassan AE, He C, Huang R et al (2020) Predicting node failures in an ultra-large-scale cloud computing platform: an aiops solution. ACM Trans Softw Eng Methodol (TOSEM) 29(2):1–24

Costa CH, Park Y, Rosenburg BS, Cher CY, Ryu KD (2014) A system software approach to proactive memory-error avoidance. In: SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Los Alamitos, p 707–718

Gao J, Wang H, Shen H (2020) Task failure prediction in cloud data centers using deep learning. IEEE Trans Serv Comput 15(3):1411–22

Bisong E (2019) An overview of google cloud platform services. In: Building Machine Learning and Deep Learning Models on Google Cloud Platform. Apress, Berkeley, p 7–10

Reiss C, Wilkes J, Hellerstein JL (2011) Google cluster-usage traces: format + schema. Mountain View, Google Inc. Revised 2014-11-17 for version 2.1. Posted at  https://github.com/google/cluster-data . Accessed 24 Jan 2022

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

Tolles J, Meurer WJ (2016) Logistic regression: relating patient characteristics to outcomes. JAMA 316(5):533–534

Myles AJ, Feudale RN, Liu Y, Woody NA, Brown SD (2004) An introduction to decision tree modeling. J Chemom J Chemometr Soc 18(6):275–285

Breiman L (2001) Random forests. Mach Learn 45(1):5–32

Natekin A, Knoll A (2013) Gradient boosting machines, a tutorial. Front Neurorobotics 7:21

Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H et al (2015) Xgboost: extreme gradient boosting. R package version 04-2 1(4):1–4

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O et al (2011) Scikit-learn: Machine Learning in Python. J Mach Learn Res 12:2825–2830


Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

Baniecki H, Kretowicz W, Piatyszek P, Wisniewski J, Biecek P (2020) dalex: Responsible Machine Learning with Interactive Explainability and Fairness in Python. arXiv preprint arXiv:2012.14406

Chen X, Lu CD, Pattabiraman K (2014) Failure prediction of jobs in compute clouds: A google cluster case study. In: 2014 IEEE International Symposium on Software Reliability Engineering Workshops. IEEE, Los Alamitos, p 341–346

Soualhia M, Khomh F, Tahar S (2015) Predicting scheduling failures in the cloud: A case study with google clusters and hadoop on amazon EMR. In: 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems. IEEE, Los Alamitos, p 58–65

Rosa A, Chen LY, Binder W (2015) Predicting and mitigating jobs failures in big data clusters. In: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. IEEE, Los Alamitos, p 221–230

Tang H, Li Y, Jia T, Wu Z (2016) Hunting Killer Tasks for Cloud System through Machine Learning: A Google Cluster Case Study. In: 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS). IEEE, Los Alamitos, p 1–12

Islam T, Manivannan D (2017) Predicting application failure in cloud: A machine learning approach. In: 2017 IEEE International Conference on Cognitive Computing (ICCC). IEEE, Los Alamitos, p 24–31

Liu C, Han J, Shang Y, Liu C, Cheng B, Chen J (2017) Predicting of job failure in compute cloud based on online extreme learning machine: a comparative study. IEEE Access 5:9359–9368

El-Sayed N, Zhu H, Schroeder B (2017) Learning from failure across multiple clusters: A trace-driven approach to understanding, predicting, and mitigating job terminations. In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). IEEE, Los Alamitos, p 1333–1344

Jassas MS, Mahmoud QH (2019) Failure characterization and prediction of scheduling jobs in google cluster traces. In: 2019 IEEE 10th GCC Conference & Exhibition (GCC). IEEE, Los Alamitos, p 1–7

Shetty J, Sajjan R, Shobha G (2019) Task resource usage analysis and failure prediction in cloud. In: 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence). IEEE, Los Alamitos, p 342–348

Jassas MS, Mahmoud QH (2021) A Failure Prediction Model for Large Scale Cloud Applications using Deep Learning. In: 2021 IEEE International Systems Conference (SysCon). IEEE, Los Alamitos, p 1–8

Guan Q, Zhang Z, Fu S (2012) Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems. J Commun 7(1):52–61

Adamu H, Mohammed B, Maina AB, Cullen A, Ugail H, Awan I (2017) An approach to failure prediction in a cloud based environment. In: 2017 IEEE 5th International Conference on Future Internet of Things and Cloud (FiCloud). IEEE, Los Alamitos, p 191–197

Pitakrat T, Okanović D, van Hoorn A, Grunske L (2018) Hora: Architecture-aware online failure prediction. J Syst Softw 137:669–685

Zhang S, Liu Y, Meng W, Luo Z, Bu J, Yang S et al (2018) Prefix: Switch failure prediction in datacenter networks. Proc ACM on Measurement and Analysis of Computing Systems 2(1):1–29

Lin Q, Hsieh K, Dang Y, Zhang H, Sui K, Xu Y, et al (2018) Predicting node failure in cloud service systems. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. Association for Computing Machinery, New York, p 480–490

Han S, Wu J, Xu E, He C, Lee PP, Qiang Y, et al (2019) Robust data preprocessing for machine-learning-based disk failure prediction in cloud production environments. arXiv preprint arXiv:1912.09722

Mohammed B, Awan I, Ugail H, Younas M (2019) Failure prediction using machine learning in a virtualised HPC system and application. Clust Comput 22(2):471–485

Chen Y, Yang X, Lin Q, Zhang H, Gao F, Xu Z, et al (2019) Outage prediction and diagnosis for cloud service systems. In: The World Wide Web Conference. Association for Computing Machinery, New York, p 2659–2665

Rawat A, Sushil R, Agarwal A, Sikander A (2021) A new approach for vm failure prediction using stochastic model in cloud. IETE J Res 67(2):165–172

Yu F, Xu H, Jian S, Huang C, Wang Y, Wu Z (2021) DRAM Failure Prediction in Large-Scale Data Centers. In: 2021 IEEE International Conference on Joint Cloud Computing (JCC). IEEE, Los Alamitos, p 1–8

Rasheduzzaman M, Islam MA, Islam T, Hossain T, Rahman RM (2014) Study of different forecasting models on Google cluster trace. In: 16th Int’l Conf. Computer and Information Technology. IEEE, Los Alamitos, p 414–419

Liu B, Lin Y, Chen Y (2016) Quantitative workload analysis and prediction using Google cluster traces. In: 2016 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, Los Alamitos, p 935–940

Zhang W, Li B, Zhao D, Gong F, Lu Q (2016) Workload prediction for cloud cluster using a recurrent neural network. In: 2016 International Conference on Identification, Information and Knowledge in the Internet of Things (IIKI). IEEE, Los Alamitos, p 104–109

Hemmat RA, Hafid A (2016) SLA violation prediction in cloud computing: A machine learning perspective. arXiv preprint arXiv:1611.10338

Zhang W, Duan P, Yang LT, Xia F, Li Z, Lu Q et al (2017) Resource requests prediction in the cloud computing environment with a deep belief network. Softw Pract Experience 47(3):473–488

Chen Z, Hu J, Min G, Zomaya AY, El-Ghazawi T (2019) Towards accurate prediction for high-dimensional and highly-variable cloud workloads with deep learning. IEEE Trans Parallel Distrib Syst 31(4):923–934

Gao J, Wang H, Shen H (2020) Machine learning based workload prediction in cloud computing. In: 2020 29th international conference on computer communications and networks (ICCCN). IEEE, Los Alamitos, p 1–9

Di S, Kondo D, Cirne W (2012) Host load prediction in a Google compute cloud with a Bayesian model. In: SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, Los Alamitos, p 1–11


Acknowledgements

Azlan Ismail acknowledges the support of the Fundamental Research Grant Scheme, FRGS/1/2018/ICT01/UITM/02/3, funded by the Ministry of Education Malaysia.

Funding

As stated in the Acknowledgements.

Author information

Authors and affiliations

Faculty of Computer and Mathematical Sciences (FSKM), Universiti Teknologi MARA (UiTM), 40450, Shah Alam, Selangor, Malaysia

Tengku Nazmi Tengku Asmawi & Azlan Ismail

Institute for Big Data Analytics and Artificial Intelligence (IBDAAI), Kompleks Al-Khawarizmi, Universiti Teknologi MARA (UiTM), 40450, Shah Alam, Selangor, Malaysia

Azlan Ismail

Faculty of Engineering and Information Sciences, School of Computing and Information Technology, University of Wollongong, 2522, Wollongong, NSW, Australia

Jun Shen

Contributions

T.N. conducted the experiments, analyzed the data, and wrote the paper; A.I. proposed the idea, reviewed the experiments, and revised the paper; J.S. revised the paper. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Azlan Ismail .

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Tengku Asmawi, T.N., Ismail, A. & Shen, J. Cloud failure prediction based on traditional machine learning and deep learning. J Cloud Comp 11 , 47 (2022). https://doi.org/10.1186/s13677-022-00327-0


Received : 14 May 2022

Accepted : 09 September 2022

Published : 19 September 2022

DOI : https://doi.org/10.1186/s13677-022-00327-0


Keywords

  • Cloud computing
  • Job and task failure
  • Failure prediction
  • Deep learning
  • Machine learning
