These CIM providers report on the operating environment for DaaS management nodes. They should be monitored on all DaaS nodes.

Linux_OperatingSystem

  • Description

    There will only be a single instance of this class per appliance.

  • Properties

    • FreePhysicalMemory: A value of 0 is a critical fault that must be resolved immediately (see the calculation below).

    • FreeVirtualMemory: A value of 0 is a critical fault that must be resolved immediately (see the calculation below).

    • HealthState: Anything but a value of 5 (OK) indicates a problem.

    • OperationalStatus: Anything but a value of 2 (OK) indicates a problem. An occasional value of 4 (stressed) may appear and can be tolerated. Raise an alert only if repeated samples return a value other than 2.

    • TotalVirtualMemorySize: The total amount of swap space available to the system.
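
    The "repeated samplings" rule for OperationalStatus can be sketched as a small sliding-window check. This is an illustration only: the class name, the window of three samples, and the assumption that each sample arrives as a single integer are hypothetical choices, not part of the DaaS product.

```python
from collections import deque

OK = 2  # OperationalStatus value for OK, as stated in the property description


class OperationalStatusWatcher:
    """Signal an alert only after several consecutive non-OK samples."""

    def __init__(self, consecutive_needed=3):
        # consecutive_needed is an assumed tuning value, not taken from the guide.
        self.recent = deque(maxlen=consecutive_needed)

    def observe(self, status):
        """Record one sample and return True when an alert should be raised."""
        self.recent.append(status)
        window_full = len(self.recent) == self.recent.maxlen
        # A single stressed (4) sample is tolerated; a full window of
        # non-OK samples is not.
        return window_full and all(s != OK for s in self.recent)
```

    With a window of three, the sequence 2, 4, 4 does not alert, while a third consecutive 4 does.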

  • Calculations

    • PercentSwapUsed: 100 * ( TotalSwapSpaceSize - FreeSpaceInPagingFiles ) / TotalSwapSpaceSize

    • Monitor swap space usage. Once the system begins using swap space, performance degrades. The free memory alert should fire before the system starts using swap space, so any use of swap should be treated as a serious problem.

  • Mitigation

    The recommendation is to warn if PercentSwapUsed > 5% and alert if PercentSwapUsed > 20%.
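
    As a rough sketch, the PercentSwapUsed calculation and the two thresholds could be combined as follows. The function names and the "ok"/"warn"/"alert" labels are hypothetical, and returning 0 when no swap is configured is an assumption. Both inputs are expected in the same unit.

```python
def percent_swap_used(total_swap_space_size, free_space_in_paging_files):
    """Return the percentage of swap in use, following the PercentSwapUsed formula."""
    if total_swap_space_size <= 0:
        # Assumption: a node with no swap configured is reported as 0% used.
        return 0.0
    used = total_swap_space_size - free_space_in_paging_files
    return 100.0 * used / total_swap_space_size


def swap_severity(percent_used, warn_above=5.0, alert_above=20.0):
    """Map a swap percentage to a hypothetical severity label."""
    if percent_used > alert_above:
        return "alert"
    if percent_used > warn_above:
        return "warn"
    return "ok"
```

    For example, a node reporting a TotalSwapSpaceSize of 1000 and a FreeSpaceInPagingFiles of 900 is at 10% swap used, which falls in the warning range.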

    If memory usage is high, run top on the node in question and press Shift-M to sort processes by memory usage, then check whether any memory-intensive processes need to be restarted:

    $ top
    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    6816 root      20   0 2069m 389m  13m S  0.0 19.6   3:36.97 java
    6634 root      20   0  755m  84m 9.8m S  0.0  4.2   1:21.70 java
    ...

    If no single application appears to be the problem, restart the node.

Linux_EthernetPort

  • Description

    There will typically be two instances of this class, one for the eth0 interface (tenant or service-provider network) and one for the eth1 (management backbone) interface.

  • Properties

    • EnabledState: Anything but the value 2 is a problem.

    • Status: Anything but OK is a problem.

  • Mitigation

    If the eth0 status is not OK, then use ifconfig to check that the interfaces are up and have an IP address. You should also be able to ping the IPv4 gateway for each node.

    If the eth1 status is not OK, then try to connect to that appliance via ssh from the transit server. If this works, then the eth1 interface is OK.

Linux_ComputerSystem

  • Description

    There will only be a single instance of this class per appliance.

  • Properties

    • EnabledState: Anything but a value of 2 indicates an issue.

  • Mitigation

    If EnabledState is anything but 2, attempt to ping the node, ssh to the node, and check the status of the dtService (service dtService status) on the node.

CIM_FileSystem

  • Description

    There are several subclasses of this. (You can also check the CIM_LocalFileSystem class if you don't want to view remote file systems.) The most important to focus on are the Linux_Ext4FileSystem instances. In addition to the root file system, there may be other file systems that you should check to confirm they are not in ReadOnly mode. Currently you should check these file systems:

    • / (root)

    • /boot

    • /data

    • /tmp

    • /usr/local

    • /var

    Additionally, on the resource manager nodes and the DB nodes there will be some number of Linux_NFS instances. These are remotely mounted file systems. You can choose to monitor these mounts via our appliances or an alternate mechanism based on the storage system.

  • Properties

    • EnabledState: Any value other than 2 (enabled) on a remotely mounted NFS file system is cause for alarm. However, local file systems in management nodes may show up with an EnabledState of 3.

    • ReadOnly: This value should be FALSE. A value of TRUE is cause for alarm. If the CIM_FileSystem class does not respond for a particular file system, the file system may be read-only and you should restart the node. Contact DaaS support if the restart fails.

    • Status: Any value other than OK is cause for alarm. Go to the node and use mount to check that the file system is mounted. If the file system is mounted, try to create a file.

    • PercentageSpaceUsed: The percentage of available disk space that is used. The recommendation is to warn at 70% and then raise the alert priority at each further 10% increment (that is, at 70%, 80%, and 90%).
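
    The tiered PercentageSpaceUsed thresholds might be expressed as the sketch below. The function name and the severity labels ("warning", "major", "critical") are hypothetical; only the 70, 80, and 90 boundaries come from the recommendation above.

```python
def space_used_severity(percentage_space_used):
    """Map PercentageSpaceUsed to a hypothetical severity label."""
    # Tiers are checked from highest to lowest so the first match wins.
    tiers = [(90, "critical"), (80, "major"), (70, "warning")]
    for threshold, label in tiers:
        if percentage_space_used >= threshold:
            return label
    return "ok"
```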

  • Mitigation

    If any of the file systems report high usage, please contact DaaS support for corrective action.
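
    The manual check described under Status (confirm the file system is mounted, then try to create a file) could be scripted on the node along these lines. This is a minimal sketch using only the Python standard library. The helper names are hypothetical, and it assumes the script runs with permission to write to the directory being tested.

```python
import os
import tempfile


def is_mounted(path):
    """Return True if path is a mount point on this node."""
    return os.path.ismount(path)


def can_create_file(directory):
    """Try to create a temporary file in directory; it is removed on close."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers a read-only file system, a missing directory,
        # and permission errors.
        return False
```

    A file system for which is_mounted returns True but can_create_file returns False is a candidate for the ReadOnly handling described above.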