
Dynamically Right-Sizing Your Cloud Infrastructure

The benefits of cloud service providers such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform cannot be overstated. With the click of a button (or execution of a script), a user can create anything from a simple website to hundreds of servers, each with hundreds of cores of CPU power. This allows development resources to be created and scaled on demand, giving development teams access to bleeding-edge technologies without having to wait for a dedicated infrastructure team to design, request, and build the supporting infrastructure. However, as we will see, with great power comes great fiscal responsibility.

A common objection to cloud migrations, as well as a motivation behind rolling back cloud systems to traditional on-premises infrastructure, is the persistent cost involved. In an on-prem setup, machines are right-sized up front to account for bursts of resource demand: they are sized to handle some level of peak usage and sit idle when not needed. A high initial cost is paid, but after that investment, costs drop significantly as the servers largely run in maintenance mode. In this case, right-sized infrastructure is tied directly to absolute peak system demand, regardless of how often that capacity is actually required.

In the case of a cloud subscription, the up-front cost is a mere fraction of the up-front investment in physical infrastructure, but the costs persist for the life of the systems – many machines are billed with a formula along the lines of base machine cost * hours of operation. This recurring spend is arguably the most common cost-related objection to cloud providers – physical systems are a one-time expense (until the systems need to be upgraded). That’s not to say significant savings in cloud systems are impossible: cloud services today provide on-demand scaling of resources, allowing you to ensure your systems are not only right-sized for peak demand but also scaled down when demand is lower. There are many tactics for managing cloud costs effectively; this article dives into two ways to alter one or both of the multipliers above to reduce compute costs. Both require insight into how – and perhaps more importantly, when – your systems are being used.
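
To make the billing arithmetic concrete, here is a minimal sketch comparing an always-on high tier against a schedule-based approach. The hourly rates are placeholder assumptions, not any provider's actual pricing; the point is only that shrinking either multiplier – machine cost or hours at that cost – shrinks the bill.

```python
# Illustrative comparison with placeholder hourly rates -- not real provider pricing.
PREMIUM_RATE = 1.50    # assumed $/hour for a high-performance tier
STANDARD_RATE = 0.15   # assumed $/hour for a baseline tier
HOURS_PER_MONTH = 730  # average hours in a month

# Option 1: run the high tier around the clock (sized for the nightly spike).
always_on_cost = PREMIUM_RATE * HOURS_PER_MONTH

# Option 2: run the high tier ~2 hours per day around the spike, baseline otherwise.
premium_hours = 2 * 30
standard_hours = HOURS_PER_MONTH - premium_hours
scheduled_cost = PREMIUM_RATE * premium_hours + STANDARD_RATE * standard_hours

print(f"Always-on high tier: ${always_on_cost:,.2f}/month")  # ~$1,095
print(f"Scheduled scaling:   ${scheduled_cost:,.2f}/month")  # ~$190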

Example 1: Scaling Vertically
You are using Platform-as-a-Service (PaaS) managed services for SQL databases. One such database is used internally for batch processing operations. Investigation into the activity of the resource shows a heartbeat-like graph: extreme write-heavy activity for about one hour every day at 4 AM, corresponding to a scheduled file import, while usage never climbs above 10% the rest of the time. In the on-prem world, the server would be built to handle the spikes and sit idle the rest of the day; taking the same approach with managed resources, however, would incur substantial unnecessary costs for the 23 hours per day during which 90% of the available capacity goes unused.

In this case, we have a very predictable pattern of system demand with respect to time of day and can take proactive action to supply resources as needed. The actual implementation can vary based on cloud provider and personal preference; examples include Automation accounts or Functions in Azure, Lambda functions in AWS, or an orchestrator machine running scheduled tasks in any environment.

The basic logical flow would read as follows:
At 3:50 AM (10 minutes before the daily spike), scale the database server up to the premium tier.

At 5:10 AM (10 minutes after the spike normally ends), verify usage has dropped back to normal levels; if so, scale down to the standard tier.

If usage does not ramp up or drop down within the expected window, notify the relevant teams.
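
This flow can be expressed as a pair of small scheduled jobs. The sketch below is a provider-agnostic Python illustration: scale_database, get_write_utilization, and notify_team are hypothetical helpers standing in for whatever your provider's SDK, CLI, or monitoring API actually offers, and the tier names are assumptions.

```python
# Provider-agnostic sketch of the scheduled scale-up/scale-down flow.
# scale_database, get_write_utilization, and notify_team are hypothetical
# helpers standing in for your provider's SDK, CLI, or monitoring API.
from cloud_helpers import scale_database, get_write_utilization, notify_team  # assumed module

PREMIUM_TIER = "P1"        # assumed tier names; substitute your provider's SKUs
STANDARD_TIER = "S0"
NORMAL_UTILIZATION = 0.10  # ~10% baseline observed outside the import window

def scale_up_for_import():
    """Scheduled for 3:50 AM, ten minutes before the daily file import."""
    scale_database(tier=PREMIUM_TIER)

def scale_down_after_import():
    """Scheduled for 5:10 AM, ten minutes after the import normally ends."""
    if get_write_utilization() <= NORMAL_UTILIZATION:
        scale_database(tier=STANDARD_TIER)
    else:
        # The import is still running (or something unexpected is);
        # stay on the premium tier and let a human investigate.
        notify_team("Database still busy at 5:10 AM; skipped scale-down.")
```

In Azure this might live in an Automation runbook, in AWS a scheduled Lambda function, or anywhere else as a cron-driven script on an orchestrator machine.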

Example 2: Scaling Horizontally
You are operating a set of load-balanced Infrastructure-as-a-Service (IaaS) virtual machines hosting a consumer-facing web application. During off-peak hours, two machines are enough to handle incoming traffic while allowing for failover if necessary. During the workday, traffic gradually ramps up, requiring one to three additional machines, varying by day and time of day. This usage pattern has much more variance than the previous example, with some days seeing higher load than others, but the load never arrives as a sudden spike.

In this scenario, we can be more reactive with our resizing and use threshold-based metrics to tell us when to scale up or down. Again, there are a multitude of ways to accomplish this – for example, a cloud monitor that triggers a scale-out event when the machines reach 60% CPU load, and a second monitor that sends the ‘scale in’ signal when load has dropped below 40% for more than five minutes.

The logical flow here would read as follows:
If total CPU usage exceeds 60%, add another machine to the pool.

If the VM pool contains more than two machines and total usage drops below 40% for more than five minutes, remove one machine from the pool. Report any anomalies to relevant teams.

As you can see, this approach makes no assumptions about time of day: it may expand the pool for three hours beginning at 7 AM on Monday but only need the extra capacity for two hours starting at 1:30 PM on Wednesday.
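
The sketch below shows one way to express this threshold logic, again with hypothetical helpers (get_pool_cpu, add_machine, remove_machine, pool_size, notify_team) in place of a real SDK. In practice this logic usually lives in the provider's built-in autoscale rules or monitoring alerts rather than a hand-rolled loop; the loop form is only meant to make the decision points explicit.

```python
import time

# Threshold-based sketch of the scale-out / scale-in flow.
# All imported helpers are assumptions, not a real cloud API.
from cloud_helpers import get_pool_cpu, add_machine, remove_machine, pool_size, notify_team  # assumed module

MIN_MACHINES = 2           # off-peak baseline, with room for failover
MAX_MACHINES = 5           # baseline plus up to three extra machines
SCALE_OUT_THRESHOLD = 0.60
SCALE_IN_THRESHOLD = 0.40
SCALE_IN_WINDOW = 5 * 60   # seconds load must stay low before scaling in

low_load_since = None
while True:
    cpu = get_pool_cpu()  # average CPU across the load-balanced pool

    if cpu > SCALE_OUT_THRESHOLD:
        if pool_size() < MAX_MACHINES:
            add_machine()
        else:
            # Anomaly: already at maximum size but still overloaded.
            notify_team("Pool at maximum size but CPU still above 60%.")
        low_load_since = None
    elif cpu < SCALE_IN_THRESHOLD and pool_size() > MIN_MACHINES:
        low_load_since = low_load_since or time.time()
        if time.time() - low_load_since >= SCALE_IN_WINDOW:
            remove_machine()
            low_load_since = None
    else:
        low_load_since = None

    time.sleep(60)  # evaluate once per minute
```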

Complex Scaling Strategies in the Real World
In all likelihood, you will see benefit from mixing and matching these (and other) cost-management strategies in your cloud environment. Oftentimes a single piece of the solution architecture will be a candidate for multiple scaling strategies; for example, in an internal build system, one may scale up an individual agent to give more power to resource-intensive jobs while also scaling out to create more build agents and allow builds to run in parallel. This is the principle behind many build orchestration tools, which can spawn ephemeral build agents in response to demand. Similarly, a tax preparation service may need to create significantly more instances of its web server farm around tax day to handle higher levels of internet traffic, while also scaling vertically to increase the processing power of any given database server.
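
As a rough illustration of how the two strategies can coexist on a single system, the sketch below shows a build-farm rebalancer that scales out when jobs are queuing behind busy agents and scales up when a single job outgrows the current agent size. Every helper name here is a placeholder assumption, not a real orchestrator API.

```python
# Hypothetical sketch combining horizontal and vertical scaling for a build farm.
# All imported helpers are assumptions standing in for an orchestrator's API.
from cloud_helpers import (queued_jobs, idle_agents, spawn_ephemeral_agent,
                           resize_agent, largest_queued_job_memory, current_agent_memory)  # assumed module

def rebalance_build_farm():
    # Horizontal: jobs are waiting and no agent is free -> add an ephemeral agent.
    if queued_jobs() > 0 and idle_agents() == 0:
        spawn_ephemeral_agent()

    # Vertical: a queued job needs more memory than any agent offers -> resize one.
    if largest_queued_job_memory() > current_agent_memory():
        resize_agent(memory=largest_queued_job_memory())
```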

Conclusion
Like development itself, a cloud migration is forever a work in progress. Simply lifting and shifting your systems from physical hardware to managed solutions provides immense benefits in ease of use and reliability; however, best practices for things like right-sizing your systems have changed, and there are significant benefits to continually evaluating your system architecture and looking for areas of improvement. By routinely monitoring how and when your systems are most heavily used, you can implement solutions that get the most out of your infrastructure as well as your budget.

 

About the Author:
Chris Gutmanis is an engineering consultant based out of the Milwaukee Development Center. He’s worked as a software and systems engineer and has been focusing lately on DevOps and cloud computing. Chris has a wide range of experience covering everything from startups to large financial services and health care companies. He lives in Milwaukee with his wife, two dogs and three cats and enjoys Brazilian jiu-jitsu, heavy metal music, making guacamole and trips to the dog park.
