Several infrastructure changes occur in your cloud environment hourly, daily, weekly, and monthly.
As your cloud environment grows, monitoring and governing these changes becomes increasingly challenging.
Policies are sets of rules that allow you to govern various aspects of your cloud infrastructure, such as cost, availability, security, performance, and usage. For each aspect, you can identify a desired operational state and configure policies to monitor for conditions that deviate from that state. When conditions do change, policies can take actions such as notifying you of the change or, in some cases, remediating the condition.
Policies are an effective way to eliminate noise and focus on the key aspects of your cloud infrastructure that require attention.
Policies differ in the types of actions they can take.
Type | When to use |
---|---|
Standard policy | To get email notifications of broader infrastructure changes or to automate remedial actions based on a condition. |
Rightsizing policy | To rightsize your instances or volumes by specifying underutilization thresholds. Results appear in the Health Check Pulse report. |
At the core of each policy is a rule, which monitors for one or more conditions and, optionally, responds with an action.
Conditions specify thresholds for one or more of these aspects: (a) cost, (b) configuration, (c) usage, (d) performance, and (e) security.
A good way to think about building conditions is as follows:
Example: When S3 Storage GB increased by more than 50 GB over 3 days.
Policy evaluation periods begin and end at midnight UTC. Therefore, the evaluation period covers the last full day, but not the current unfinished day. For example, let’s consider a policy that monitors for when S3 Storage GB increases by more than 50 GB over 3 days. If today is March 15, then the policy checks for the last 3 complete days - March 12, 13, and 14.
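The window arithmetic above can be sketched in a few lines. This is only an illustration: the function name `evaluation_window` and the year 2024 are assumptions chosen for the example, not part of the Tanzu CloudHealth platform or its API.

```python
from datetime import date, datetime, timedelta, timezone


def evaluation_window(now_utc: datetime, days: int) -> list[date]:
    """Return the `days` most recent complete UTC days before `now_utc`.

    Hypothetical helper: it mirrors the rule that evaluation periods
    begin and end at midnight UTC and skip the current unfinished day.
    """
    today = now_utc.astimezone(timezone.utc).date()
    # Count back from yesterday, oldest day first.
    return [today - timedelta(days=offset) for offset in range(days, 0, -1)]


# March 15 (the year is arbitrary), part-way through the day in UTC.
window = evaluation_window(datetime(2024, 3, 15, 9, 30, tzinfo=timezone.utc), days=3)
print(window)  # the three complete days: March 12, 13, and 14
```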
When a policy checks for an increase or decrease over an evaluation period, the policy checks the total increase or decrease value between the start and end of the evaluation period. For example, let’s consider a policy that monitors whether AWS Billing Statement total cost has increased by more than 20% over 30 days. The policy checks whether the total cost has increased by more than 20% between the start date and end date of the evaluation period. If the total cost is $100 on the first day of the evaluation period and $130 on the last day of the evaluation period, then the total increase is greater than 20% and the policy’s action triggers.
Note that the policy action only triggers if the measure is met over the entire evaluation period. If the total cost increases by more than 20% over one day in the evaluation period, but the total cost increase between the first and last day of the evaluation period is still less than 20%, the policy action does not trigger. To create a day-over-day cost alert, you can create a second policy with an evaluation period of 1 day.
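As a rough sketch of this start-to-end comparison, the following code compares only the first and last values of an evaluation period. The function names and sample figures are hypothetical and are not taken from the platform.

```python
def percent_change(start_value: float, end_value: float) -> float:
    """Percentage change from the first to the last day of a period."""
    if start_value == 0:
        raise ValueError("percent change is undefined for a zero start value")
    return (end_value - start_value) * 100.0 / start_value


def increase_exceeds(daily_values: list[float], threshold_pct: float) -> bool:
    """True if the start-to-end increase is greater than the threshold.

    Intermediate days are ignored on purpose: only the first and last
    values of the evaluation period are compared.
    """
    return percent_change(daily_values[0], daily_values[-1]) > threshold_pct


# $100 on the first day, $130 on the last day: a 30% increase, so the action triggers.
print(increase_exceeds([100.0, 110.0, 130.0], threshold_pct=20.0))

# A one-day jump to $125 that falls back to $110 by the last day: only a 10% increase.
print(increase_exceeds([100.0, 125.0, 110.0], threshold_pct=20.0))
```

A second check over a one-day period (two values) would catch the day-over-day spike that the longer period misses.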
Actions define how you want to respond when thresholds set by a condition or a set of conditions are violated. The following actions are possible:
If the results of an evaluation period are the same as those in the previous evaluation, no actions are taken.
Blocks are combinations of rules based on an organizing principle, for example, by functional cluster (Cassandra) or by department (Engineering). You can specify which type of resource each block evaluates. You can also enable and disable blocks. One or more blocks make up a policy.
By default, blocks are evaluated daily at 6:00 AM EST. If a block contains multiple rules, block evaluation stops at the rule that is violated first. The evaluation of all subsequent rules within the block is ignored.
Note: Daylight Saving Time is not taken into account when selecting the run time of the policy. To account for Daylight Saving Time, manually adjust the evaluation time as needed.
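The stop-at-first-violation behavior of a block can be pictured with a small model. This is a simplified sketch, not the platform's implementation: the `Rule` class, the `evaluate_block` function, the metric key `avg_cpu_pct`, and the rule thresholds are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    # Hypothetical predicate over a resource's metrics.
    is_violated: Callable[[dict], bool]


def evaluate_block(rules: list[Rule], metrics: dict) -> Optional[str]:
    """Return the name of the first violated rule, or None.

    Rules are checked in order; once one is violated, the remaining
    rules in the block are not evaluated.
    """
    for rule in rules:
        if rule.is_violated(metrics):
            return rule.name
    return None


rules = [
    Rule("Increase EC2 Instance Size", lambda m: m["avg_cpu_pct"] > 70),
    Rule("Heavy EC2 CPU Utilization", lambda m: m["avg_cpu_pct"] > 60),
]

print(evaluate_block(rules, {"avg_cpu_pct": 75}))  # first rule matches, second is skipped
print(evaluate_block(rules, {"avg_cpu_pct": 65}))  # only the second rule matches
print(evaluate_block(rules, {"avg_cpu_pct": 40}))  # no rule is violated
```

Because evaluation stops early, rule order matters: placing the stricter threshold first ensures the more specific rule wins.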
Types of governance approaches and when to use them
Tanzu CloudHealth Policies differ in the types of actions they allow you to take.
Governance Approach | Type | When to use |
---|---|---|
Notifications and Actions | Standard policy | To get email notifications of broader infrastructure changes and, optionally, automate remedial actions based on a condition. |
Rightsizing | Rightsizing policy | To rightsize your instances or volumes by specifying underutilization thresholds. Results appear in the Health Check Pulse report and the rightsizing report. |
Receive alerts if the cost of one or more services in your environment has increased by more than a certain percentage for a given time period. By defining business groups using Perspectives, you can limit these alerts to certain groups. For example, you can be alerted whenever the cost of your production environment increases by more than 10% over the last 30 days.
Receive alerts when the Tanzu CloudHealth platform finds opportunities for instance or volume rightsizing. Tanzu CloudHealth can monitor for these changes:
Only available for AWS configurations.
In a fast-changing, distributed environment, receive notifications for security risks resulting from inadvertent or noncompliant changes to services. Tanzu CloudHealth provides the following capabilities.
In addition to receiving notifications of changes in your cloud infrastructure, you can build a policy that will execute actions based on certain conditions. You can configure actions to run at a specific time and date. The following are examples of actions you can configure for AWS:
The following are examples of actions you can configure for Azure:
The following are examples of actions you can configure for GCP:
The Tanzu CloudHealth platform can take these actions on its own or after an approver has signed off on the action.
Organizations tend to provision more instances than necessary, either to give themselves more headroom or because they are unaware of their performance requirements. This overprovisioning can lead to significantly higher costs. Rightsizing of volumes, instances, and VMs is an optimization technique that helps you reduce costs, but the dynamic nature of cloud infrastructures makes it difficult to perform this optimization continuously. Tanzu CloudHealth Policies help you automate the continuous optimization and reduce these costs.
Tanzu CloudHealth analyzes usage, read throughput, and write throughput on your volumes to determine if they need to be rightsized based on performance thresholds that you specify.
For example, if a volume is attached to an instance and has very few read or write operations, either the instance is inactive or the volume is unnecessary.
CPU utilization, memory utilization, disk utilization, and network in/out utilization of an instance determine whether it meets your performance requirements. Tanzu CloudHealth monitors these metrics over time and reports deviations from thresholds that you specify. It is common for instances to be underutilized, so you can reduce costs by ensuring that all your instances are of the right size.
Memory, CPU, and Disk utilization of a VM determine whether it meets your performance requirements. Tanzu CloudHealth evaluates the performance of individual VMs and compares it against the published performance specifications from Microsoft. This comparison produces a rightsizing score. You can use this score to make intelligent decisions around downgrading or upgrading VMs to reduce cost or improve performance.
To enable the rightsizing assessment, enable diagnostics on the VM or install the Tanzu CloudHealth Agent. In addition, configure a Service Principal.
Build a policy that notifies one or more people or groups when specific conditions are met
A common use of policies is to inform one or more individuals about changes to your cloud infrastructure. A Standard Policy helps you set up these notifications.
A policy contains one or more blocks, each containing a specific rule that checks for operational conditions that you specify.
Blocks help you organize and manage rules within a policy, and each block is associated with a resource you want to monitor. A block can contain one or more rules, each of which can be enabled or disabled.
Rules monitor for one or more conditions and, optionally, respond with an action. A default rule is created when you associate the block with a resource.
Specify the severity of the rule.
The severity appears on the console as well as in notification emails that are sent through the rule. If you want to create two separate notifications for conditions (e.g., Critical and Warning), define two separate Blocks.
Rules contain two sections: Conditions, which monitor changes to specific aspects, and Actions, which respond to those changes.
Click Save Condition. You can build multiple conditions, where each condition specifies a threshold for a different aspect of your infrastructure. For example, one condition can monitor whether the average CPU usage is less than 5% and the other, whether maximum CPU usage is less than 20% for at least 1 day.
When multiple conditions define a rule, notifications are dispatched only if all of the conditions are met.
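The way multiple conditions combine can be sketched as a logical AND. The snippet below is illustrative only: `rule_triggers` and the metric keys `avg_cpu_pct` and `max_cpu_pct` are hypothetical names, and the metrics are assumed to be already aggregated over the 1-day period.

```python
from typing import Callable

Condition = Callable[[dict], bool]


def rule_triggers(conditions: list[Condition], metrics: dict) -> bool:
    """A rule triggers only when every one of its conditions is met."""
    return all(condition(metrics) for condition in conditions)


# Two conditions from the example: average CPU below 5% and maximum CPU below 20%.
conditions = [
    lambda m: m["avg_cpu_pct"] < 5,
    lambda m: m["max_cpu_pct"] < 20,
]

print(rule_triggers(conditions, {"avg_cpu_pct": 3, "max_cpu_pct": 12}))  # both met
print(rule_triggers(conditions, {"avg_cpu_pct": 3, "max_cpu_pct": 45}))  # only one met
```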
Define what happens when one or more conditions are met. Click Add Action.
Type | When to use |
---|---|
Notify individuals or groups. | |
Email Owner | Notify the IAM user who launched the resource. If the IAM user is not connected to a Tanzu CloudHealth user, notify the user who owns unassociated IAM users. |
Actions | Initiate remedial actions defined in a previously defined automated task. |
Just as you can specify multiple conditions within a rule, so too can you define multiple rules within a policy block. You can organize rules by dragging them and reordering them.
Example: You can set a policy around EC2 utilization, and choose to have multiple rules based on different conditions.
Increase EC2 Instance Size if your EC2 Instance Average CPU is greater than 70% for at least 1 day.
Heavy EC2 CPU Utilization if your EC2 Average CPU is greater than 60% for at least 1 day.
You can add multiple blocks to a policy and evaluate your infrastructure in a way that aligns with your business. With multiple blocks, you can measure multiple aspects of the same assets or, conversely, the same aspect of different assets.
Example: Consider a policy for your AWS infrastructure that monitors the utilization of your EC2 and RDS instances. You can create a policy named Instance Utilization that contains two blocks. The first block is named EC2 Utilization with its Resource Type set to AWS EC2 Instance. This block can contain rules and conditions for EC2 Instance Average CPU being greater than 70% for at least 1 day.
You can add a second block to the same policy and name it RDS Utilization and set its Resource Type to AWS RDS Instance. Perhaps you want your RDS instances running at no higher than 70%. You can set the threshold for those instances to 60%.
Using this multiblock structure, you can measure two different assets using separate criteria within the same policy.
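One way to picture the multiblock structure is as a small data model. The classes, field names, and resource records below are assumptions made for this sketch; they do not reflect how Tanzu CloudHealth stores or evaluates policies.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str
    resource_type: str
    avg_cpu_threshold_pct: float  # hypothetical single-condition rule per block
    enabled: bool = True


@dataclass
class Policy:
    name: str
    blocks: list[Block] = field(default_factory=list)

    def violations(self, resources: list[dict]) -> list[tuple[str, str]]:
        """Return (block name, resource id) pairs that exceed a block's threshold."""
        found = []
        for block in self.blocks:
            if not block.enabled:
                continue  # disabled blocks are skipped
            for res in resources:
                same_type = res["type"] == block.resource_type
                if same_type and res["avg_cpu_pct"] > block.avg_cpu_threshold_pct:
                    found.append((block.name, res["id"]))
        return found


policy = Policy(
    name="Instance Utilization",
    blocks=[
        Block("EC2 Utilization", "AWS EC2 Instance", avg_cpu_threshold_pct=70),
        Block("RDS Utilization", "AWS RDS Instance", avg_cpu_threshold_pct=60),
    ],
)

# Made-up resources: both run at 65% average CPU, but only the RDS block's 60% threshold is exceeded.
resources = [
    {"id": "i-example", "type": "AWS EC2 Instance", "avg_cpu_pct": 65},
    {"id": "db-example", "type": "AWS RDS Instance", "avg_cpu_pct": 65},
]
print(policy.violations(resources))
```

Keeping the threshold on the block rather than on the policy is what lets the two resource types share one policy while being measured against different criteria.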
Click Save Policy. The policy appears on the Setup > Governance > Policies page.
On this page, you can select a policy and view policy violations, edit the policy, duplicate the policy, or delete the policy.
Use policies to help optimize your environment by setting specific rules that include starting and stopping a resource. Create a standard policy to automate this action, and optionally send an email when the policy is triggered.
Tanzu CloudHealth requires certain permissions to stop or start resources in your environment. This section provides information on the permissions needed for each cloud provider.
Configure your AWS IAM role to allow Tanzu CloudHealth to start and stop instances as follows:
Associate a custom role with each Azure Service Principal to allow Tanzu CloudHealth to start and stop virtual machines as follows:
For more details, see Automate Azure VM Management Using Actions.
Include the compute.instances.stop and compute.instances.start permissions in the Google Console. You can add these permissions in one of the following ways:
Currently, AWS EC2 instances, AWS RDS instances, GCP compute instances, and Azure virtual machines are the only resource types that can be stopped or started using a policy.
To verify the status of the policy in the Tanzu CloudHealth platform, go to Dashboard > Notification > Actions, and click the view icon. The status of the action can be:
If you choose to send an email notification, the email will include IDs of the resources that succeeded and/or failed to start or stop as of 40 minutes after the policy action was initiated. If you choose to only be notified of failures, and no actions failed, no email is sent.
The following example shows that the Start GCP Compute Instance action executed successfully:
Build a policy that notifies particular business groups when specific conditions are met
As you scale your cloud environment, stakeholders across your organization will want to analyze, measure, and report across a wide variety of infrastructure services, assets, and resources. A CFO may want to evaluate by departments or cost centers; the COO, by environments or products and services; and engineering, by functions. Tanzu CloudHealth Perspectives simplify this process by giving you a framework for categorizing assets in your cloud infrastructure based on business groups in your organization.
You can introduce perspectives in your policies by associating blocks with the business groups you defined using Tanzu CloudHealth Perspectives. In this way, you can develop policies that are relevant to specific business groups in your organization. The approach also ensures that each block in a policy is only evaluated once for a given cloud asset.
$500 for at least a month.
Build a policy that sends Slack messages to a workspace when specific conditions are met
This feature is in private beta. If you want to enable this feature, contact the support team.
Slack is a tool that provides a digital workspace to organizations. You can use Slack to communicate across teams, collaborate on projects, and automate workflows using bots. In the context of the Tanzu CloudHealth Platform, you can use Slack to drive and create workflows via Policy Actions.