Resizing an Azure VM is a disruptive operation and must be treated as such in automation, especially for production workloads.
From the provided context, the following points are supported:
- Detecting whether applications are actively running inside a VM
The context does not define an Azure-native, generic way to detect “application activity” inside a VM (for example, which processes or business apps are running). That detection is application-specific and must be implemented using guest-level monitoring or custom health checks.
However, Azure Advisor does use VM-level performance metrics (CPU, memory, disk IOPS, bandwidth) over time to decide when a VM is under- or over-sized, which is a useful pattern for automation.
- Using performance metrics to infer activity/idle state
Azure Advisor’s “right-sizing” logic for highly utilized VMs is based on:
- CPU utilization
- Memory utilization
- VM Cached IOPS Consumed Percentage
- VM Uncached Bandwidth Consumed Percentage
Advisor evaluates these metrics as follows:
- Aggregates metrics over a minimum of seven days.
- Samples every 30 seconds, aggregates to 1 minute, then to 30 minutes.
- Identifies resize candidates when:
- Both CPU and memory are ≥ 90% of current SKU limits, or
- Disk metrics are ≥ 95% of limits under specific conditions.
This shows that Azure’s own guidance for resize decisions is based on sustained utilization of CPU, memory, and storage bandwidth/IOPS, not on a single instantaneous snapshot.
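The aggregation-then-threshold pattern above can be sketched in a few lines. This is an illustrative simplification, not Advisor's actual implementation: the window sizes and the 90% threshold are taken from the description above, and the helper names are hypothetical.

```python
from statistics import mean

def aggregate(samples, window):
    """Average raw samples into fixed-size windows, mirroring the
    30 s -> 1 min -> 30 min roll-up described above."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]

def is_resize_candidate(cpu_pct, mem_pct, threshold=90.0):
    """Flag a scale-up candidate when BOTH CPU and memory stay at or
    above the threshold in every aggregated window (Advisor-style rule
    for highly utilized VMs; simplified sketch)."""
    return (all(c >= threshold for c in cpu_pct)
            and all(m >= threshold for m in mem_pct))

# Hypothetical 30-minute aggregates over an observation period:
cpu = [95, 93, 97, 91]
mem = [92, 96, 90, 94]
print(is_resize_candidate(cpu, mem))  # True
```

A production version would pull these series from Azure Monitor over at least seven days rather than hard-coding them.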
For an “idle” check in automation, the same metric categories (CPU, memory, disk, network) are appropriate, but the context only confirms their use for identifying high utilization, not for guaranteeing that an application is idle; any “idle” threshold remains a design decision for the workload owner.
- Azure-native services/APIs for activity and configuration
From the context:
- Azure Advisor: Provides recommendations to resize VMs based on sustained high utilization across CPU, memory, and disk metrics. This can be used as an input signal for resize automation when VMs are consistently constrained.
- Change tracking and inventory using Azure Monitoring Agent: Tracks OS configuration drift, installed software, services/daemons, and key files on Azure VMs and Arc-enabled VMs. This is useful for understanding what is installed and running, but the context does not state that it directly exposes “application activity” or “request traffic” semantics.
No context is provided for an Azure-native API that directly tells whether a business application is actively serving traffic. That must be implemented via application health probes, logs, or custom metrics.
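One cheap, guest-agnostic building block for such a custom check is a TCP probe against the application's listening port. This is a hypothetical example of a signal the workload owner might implement, not an Azure-native API; a real deployment would more likely hit an application-level health endpoint.

```python
import socket

def port_accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Liveness hint: does anything accept TCP connections on the
    application's port? A refused or timed-out connection suggests the
    app is not serving, though it cannot prove the app is idle."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

This only tells you a listener exists; combining it with logs or custom metrics is still required to infer real request traffic.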
- Whether the VM must be deallocated vs stopped for resize
The context distinguishes between VM states and resize behavior:
- Resizing behavior:
- A VM can be resized while running or deallocated.
- In some cases, the VM must be deallocated before resizing, particularly when the requested size is not available on the current hardware cluster.
- Changing the size of a running VM causes a restart and is disruptive.
- Stopped vs deallocated:
- “Stopped” (OS shutdown) keeps compute resources reserved and continues to incur compute charges.
- “Deallocated” (stopped via Azure control plane) releases compute resources and stops compute charges.
The resize article confirms this: after creation, a VM can be scaled up or down by changing its size, whether the VM is running or already deallocated, and deallocation is required when the new size is not available on the current hardware cluster.
From this, for automation:
- It is not universally mandatory to deallocate a VM before resizing; Azure supports resizing a running VM (with restart) or a deallocated VM.
- However, automation should be prepared for cases where deallocation is required to complete the resize, especially when changing to a size that may not be available on the current cluster.
The context does not state that a “stopped but not deallocated” state is sufficient or supported for resize; it only explicitly mentions running and deallocated states for resize operations.
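The "try in place, fall back to deallocate" behavior described above can be expressed as a small control-flow sketch. The `client` object here is a hypothetical wrapper around the Azure compute API (the method names are illustrative, not real SDK calls), and `AllocationFailure` stands in for the allocation error Azure raises when the size is unavailable on the current cluster.

```python
class AllocationFailure(Exception):
    """Stand-in for Azure's allocation error when the requested size
    is unavailable on the VM's current hardware cluster."""

def resize_vm(client, vm_name: str, new_size: str) -> str:
    """Attempt an in-place resize first (this restarts the VM);
    if allocation fails, deallocate, resize, then start again."""
    try:
        client.update_size(vm_name, new_size)
        return "resized-running"
    except AllocationFailure:
        client.deallocate(vm_name)          # releases compute resources
        client.update_size(vm_name, new_size)
        client.start(vm_name)               # may land on a new cluster
        return "resized-after-deallocate"
```

The key design point is that both paths are disruptive; the fallback path merely trades a restart for a full stop/start cycle.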
- Recommended validation steps for safe resize automation
Based on the context, the following validations and behaviors are supported and recommended for a safe automation workflow:
- Treat resize as disruptive:
- Resizing a running VM causes a restart and should be considered disruptive, especially for stateful workloads.
- Automation should only proceed when the workload can tolerate a restart or when the VM is intentionally taken offline.
- Power state validation:
- Confirm the VM power state before resize.
- Decide a policy:
- Either resize while running (accepting a restart), or
- Explicitly deallocate the VM via the Azure control plane (not merely an OS-level shutdown), then resize.
- Be aware that deallocation releases dynamic IP addresses; automation must handle IP changes if dynamic IPs are used.
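The power-state policy above can be captured as a pure decision function. The step names and policy flags are illustrative; the power-state strings follow Azure's instance-view convention.

```python
def resize_plan(power_state: str, restart_tolerated: bool,
                uses_dynamic_ip: bool) -> list[str]:
    """Derive the resize steps from the current power state and the
    workload owner's policy. Illustrative sketch, not an Azure API."""
    if power_state == "PowerState/deallocated":
        return ["resize", "start"]          # already offline; safe path
    if restart_tolerated:
        return ["resize"]                   # in-place resize restarts the VM
    steps = ["deallocate", "resize", "start"]
    if uses_dynamic_ip:
        # Deallocation releases dynamic IPs, so downstream references
        # (DNS, firewall rules) must be refreshed afterwards.
        steps.append("refresh-ip-references")
    return steps
```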
- Capacity and allocation considerations:
- If resizing within an availability set, capacity constraints on the original cluster can cause allocation failures.
- Workarounds include:
- Choosing a different VM size with better availability.
- Stopping (deallocating) all VMs in the availability set and starting them together to allow allocation from all available clusters.
- Automation should handle allocation failures gracefully and possibly fall back to alternative sizes or retry strategies.
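A minimal sketch of the fallback strategy: try each acceptable size in preference order until one allocates. `try_resize` is a caller-supplied seam for the actual Azure call (the SKU names below are examples, not guaranteed availability).

```python
def resize_with_fallback(try_resize, candidate_sizes):
    """Attempt each candidate size in preference order; return the
    first one that allocates successfully. `try_resize` returns True
    on success and False on an allocation failure."""
    for size in candidate_sizes:
        if try_resize(size):
            return size
    raise RuntimeError(
        "no candidate size could be allocated; retry later or "
        "deallocate the whole availability set and restart it together")
```

The terminal error deliberately surfaces the availability-set workaround described above rather than retrying silently forever.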
- Metric-based validation (patterned after Azure Advisor):
- Use Azure metrics for CPU, memory, disk IOPS, and bandwidth as inputs to resize decisions.
- For “scale up” decisions, follow the Azure Advisor pattern of looking at sustained high utilization over time.
- For “safe to resize now” checks, the same metrics can be used to ensure utilization is below chosen thresholds, but the context does not prescribe specific “idle” thresholds.
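A "safe to resize now" gate can then be a simple all-metrics-below-threshold check. The metric names and threshold values below are entirely a workload-owner design decision, as noted above, not values prescribed by Azure documentation.

```python
def is_idle(metrics: dict[str, list[float]],
            thresholds: dict[str, float]) -> bool:
    """Declare the VM idle only if EVERY metric stays strictly below
    its owner-chosen threshold across all samples in the window."""
    return all(max(samples) < thresholds[name]
               for name, samples in metrics.items())

# Hypothetical recent samples and thresholds:
metrics = {"cpu_pct": [3, 5, 2], "mem_pct": [20, 22, 19],
           "disk_iops_pct": [1, 0, 2], "net_mbps": [0.1, 0.3, 0.2]}
thresholds = {"cpu_pct": 10, "mem_pct": 30,
              "disk_iops_pct": 5, "net_mbps": 1.0}
```

Using `max()` rather than the mean makes the gate conservative: a single spike within the window blocks the resize.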
- Cost and state management:
- Use deallocation when VMs are not needed to avoid compute charges, as recommended in FinOps best practices.
- Ensure automation differentiates between “stopped” (OS-level) and “deallocated” (Azure control plane) and uses the appropriate API to achieve the desired state.
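The stopped/deallocated distinction can be made explicit by classifying the instance-view power-state codes, which Azure reports in the `PowerState/...` format. The mapping logic below is a sketch; the billing notes in the comments restate the context above.

```python
def classify_power_state(statuses: list[str]) -> str:
    """Map instance-view status codes to the billing-relevant state."""
    for code in statuses:
        if code == "PowerState/deallocated":
            return "deallocated"  # compute released, compute billing stopped
        if code == "PowerState/stopped":
            return "stopped"      # OS shut down, compute still reserved and billed
        if code == "PowerState/running":
            return "running"
    return "unknown"
```

Automation that wants the no-charge state must then call the deallocate API when it sees "stopped", rather than treating the two states as equivalent.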
Summary relative to the example validations:
- CPU utilization threshold check: Supported as a key signal (used by Azure Advisor for resize recommendations).
- Memory utilization threshold check: Supported and used by Azure Advisor.
- Network traffic activity: Not explicitly called out in the context for resize decisions, but is a reasonable additional signal; not mandated by the documentation.
- Disk read/write activity: Supported via disk IOPS and bandwidth metrics, used by Azure Advisor.
- VM power state validation: Supported and important; resize is disruptive and may require deallocation depending on size and cluster capacity.
The context does not define a single “recommended” full workflow for safe resize automation, but it clearly establishes that:
- Resizing is disruptive and may require deallocation.
- VM metrics (CPU, memory, disk) are the primary signals Azure uses for resize recommendations.
- Stopped vs deallocated states have different billing and resource implications, and deallocation is sometimes required for resize.
- Capacity and allocation behavior in availability sets must be considered when resizing.