These CIM providers report on the operating environment for management nodes. They must be monitored on all nodes.
Linux_OperatingSystem
- Description
There is only a single instance of this class per appliance.
- Properties
- FreePhysicalMemory: If this value reaches 0, that is a critical fault and must be resolved immediately (see the calculation below).
- FreeVirtualMemory: If this value reaches 0, that is a critical fault and must be resolved immediately (see the calculation below).
- HealthState: Anything but a value of 5 (OK) indicates a problem.
- OperationalStatus: Anything but a value of 2 (OK) indicates a problem. However, an occasional value of 4 (stressed) may appear. If repeated samplings indicate a value other than 2, you should raise an alert.
- TotalVirtualMemorySize: The total amount of swap space available to the system.
- Calculations
- PercentSwapUsed: 100 * ( TotalSwapSpaceSize - FreeSpaceInPagingFiles ) / TotalSwapSpaceSize
- It is useful to monitor swap space usage. When the system begins using swap space, performance degrades. The free memory alert should trigger before the system starts using swap space, so any use of swap indicates a serious problem.
- Mitigation
Recommendation is to warn if PercentSwapUsed > 5% and alert if PercentSwapUsed > 20%.
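The PercentSwapUsed formula and the thresholds above can be sketched as follows. This is a minimal illustration, not part of the product: the function names are hypothetical, the inputs are assumed to have already been read from the Linux_OperatingSystem instance, and treating a node with no swap configured as 0% used is an assumption made here to avoid a division by zero.

```python
def percent_swap_used(total_swap_space_size, free_space_in_paging_files):
    """Return swap usage as a percentage, following the PercentSwapUsed formula."""
    if total_swap_space_size <= 0:
        # Assumption: no swap configured is reported as 0% used
        # instead of dividing by zero.
        return 0.0
    used = total_swap_space_size - free_space_in_paging_files
    return 100.0 * used / total_swap_space_size


def swap_severity(percent_used, warn_at=5.0, alert_at=20.0):
    """Map a PercentSwapUsed value to 'ok', 'warn', or 'alert'."""
    if percent_used > alert_at:
        return "alert"
    if percent_used > warn_at:
        return "warn"
    return "ok"
```

For example, with TotalSwapSpaceSize = 1000 and FreeSpaceInPagingFiles = 750, percent_swap_used returns 25.0 and swap_severity(25.0) returns "alert".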
If memory usage reaches high levels, run top on the node in question and press Shift-M to sort processes by memory usage, then check whether any memory-intensive processes must be restarted:
$ top
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6816 root      20   0 2069m 389m  13m S  0.0 19.6  3:36.97 java
 6634 root      20   0  755m  84m 9.8m S  0.0  4.2  1:21.70 java
...
If no single application appears to be the problem, restart the node.
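The OperationalStatus guidance for this class (an occasional 4 is tolerable, repeated values other than 2 are not) could be implemented along these lines. The function name is hypothetical, and the number of consecutive bad samples that counts as "repeated" is an assumption, because this guide does not state one. Each sample is assumed to be a single status value.

```python
OPERATIONAL_STATUS_OK = 2


def should_raise_alert(samples, consecutive_bad=3):
    """Return True if the most recent `consecutive_bad` samples are all non-OK.

    `samples` is a list of OperationalStatus values, oldest first.
    The default of 3 is an assumed value, not a documented one.
    """
    if len(samples) < consecutive_bad:
        return False
    recent = samples[-consecutive_bad:]
    return all(value != OPERATIONAL_STATUS_OK for value in recent)
```

With this sketch, the history [2, 4, 2, 2] does not raise an alert, while [2, 4, 4, 4] does.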
Linux_EthernetPort
- Description
There are typically two instances of this class, one for the eth0 interface (tenant or service-provider network) and one for the eth1 interface (management backbone).
- Properties
- EnabledState: Anything but the value 2 is a problem.
- Status: Anything but OK is a problem.
- Mitigation
If the eth0 status is not OK, then use ifconfig to check that the interfaces are up and have an IP address. You should also be able to ping the IPv4 gateway for each node.
If the eth1 status is not OK, then try to connect to that appliance using ssh from the transit server. If this works, then the eth1 interface is OK.
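The two property checks for this class can be expressed as a small helper. This is only a sketch: the function name is hypothetical, and each Linux_EthernetPort instance is assumed to have been converted into a plain dictionary keyed by property name. How the instance is retrieved from the CIM provider is outside its scope.

```python
def ethernet_port_problems(port):
    """Return a list of problem descriptions for one Linux_EthernetPort instance.

    `port` is assumed to be a dict such as
    {"Name": "eth0", "EnabledState": 2, "Status": "OK"}.
    """
    problems = []
    if port.get("EnabledState") != 2:
        problems.append("EnabledState is %r, expected 2" % port.get("EnabledState"))
    if port.get("Status") != "OK":
        problems.append("Status is %r, expected 'OK'" % port.get("Status"))
    return problems
```

An empty list means that neither property indicates a problem. Any other result is a cue to run the eth0 or eth1 checks described above.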
Linux_ComputerSystem
- Description
There is only a single instance of this class per appliance.
- Properties
- EnabledState: Anything but a value of 2 indicates an issue.
- Mitigation
If EnabledState is anything but 2, attempt to ping the node, ssh to the node, and check the status of the dtService (service dtService status) on the node.
CIM_FileSystem
- Description
There are several subclasses of this class. (You can also check the CIM_LocalFileSystem class if you do not want to view remote file systems.) The most important instances to focus on are the Linux_Ext4FileSystem instances. In addition to the root file system, check that the other important file systems are not in ReadOnly mode. Currently, you should check these file systems:
- / (root)
- /boot
- /data
- /tmp
- /usr/local
- /var
On the resource manager nodes and the DB nodes, there are a number of Linux_NFS instances. These are remotely mounted file systems. You can choose to monitor these mounts using our appliances or an alternate mechanism based on the storage system.
- Properties
- EnabledState: Any value other than 2 (enabled) on a remotely mounted NFS file system is cause for alarm. However, local file systems in management nodes might show up with an EnabledState of 3.
- ReadOnly: This value should be FALSE. A value of TRUE is cause for alarm. If the CIM_FileSystem class does not respond for a particular file system, the file system might be read-only and you should restart the node. Contact VMware support if the restart fails.
- Status: Any value other than OK is cause for alarm. Go to the node and use mount to check that the file system is mounted. If the file system is mounted, try to create a file.
- PercentageSpaceUsed: Displays the percentage of available disk space that is used. Recommendation is to warn at 70% and then increase the alert priority in 10% increments (that is, 70, 80, 90).
- Mitigation
If any of the file systems report high usage, contact VMware support for corrective action.
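The property rules for file systems can be combined into one check, sketched below. The function names and the numeric priority levels (1 to 3) are hypothetical, and each file system instance is assumed to be available as a plain dictionary keyed by property name. Treating "warn at 70%" as "greater than or equal to 70" is also an assumption.

```python
def space_alert_level(percentage_space_used):
    """Return 0 below 70%, then 1, 2, 3 at the 70, 80, and 90 percent marks."""
    level = 0
    for threshold in (70, 80, 90):
        if percentage_space_used >= threshold:
            level += 1
    return level


def file_system_problems(fs, is_remote_nfs=False):
    """Return a list of problem descriptions for one file system instance."""
    problems = []
    if fs.get("ReadOnly"):
        problems.append("file system is read-only")
    if fs.get("Status") != "OK":
        problems.append("Status is %r, expected 'OK'" % fs.get("Status"))
    # Local file systems on management nodes might report EnabledState 3,
    # so EnabledState is only checked for remotely mounted NFS file systems.
    if is_remote_nfs and fs.get("EnabledState") != 2:
        problems.append("EnabledState is %r, expected 2" % fs.get("EnabledState"))
    return problems
```

For example, space_alert_level(85) returns 2, and a local /data file system with ReadOnly FALSE, Status OK, and EnabledState 3 yields an empty problem list.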