Archive for the 'Performance' Category

Nagios XI 2012r1.4 Improvements

With each new version of Nagios XI, we do our best to include the most important bug fixes, improvements, and features that we can accomplish in a few weeks time. The upcoming Nagios XI 2012r1.4 is going to be a notable release of XI for both performance improvements and internationalization.

Internationalization

For our international users, we’ve been hard at work to update XI appropriately for internationalization, as well as kick-starting multiple translations using Google translate. We’ve been working to balance code updates with community contributions for languages, and this upcoming release will ship with a default.pot file that can be used to update user’s PO files that they may have begun populating. This release of XI will ship with kick-started translations in the following languages.

  • German
  • Spanish
  • French
  • Italian
  • Portuguese
  • Russian
  • Korean
  • Chinese

Performance Improvements in 1.4

For customers with larger installs, we’ve been analyzing bottlenecks in both the monitoring process and the UI to try and make XI run faster and leaner. Users with hosts+services in the thousands will almost certainly see an improvement both in CPU load and page load times in the UI. For changes that affect the monitoring process, we updated the Monitoring Engine Event Queue dashlet and the Monitoring Engine Check Statistics Dashlets to all pull data from the same status information that the rest of XI uses, which reduces an enormous amount of data from needing to be logged to mysql from the monitoring process. The end result of this change is that mysql will only need to be doing about 30% of the work that it was having to do in previous releases. For large installs, this is a big deal!

The other key change that all users will probably see a benefit from is a refactoring of data queries for AJAX loaded content in the XI interface. Load times for dashlets that contain tactical or summary data went from 15-20 seconds per dashlet down to .05 seconds in local tests with 10k checks. The other upside of this change is that the CPU usage from XI users accessing the interface is substantially reduced. The Tactical Overview dashlets see the largest benefit in load times by far. For users who had to utilize the unified Tactical Overview for performance reasons, we encourage you to try the dashlet version in 1.4.

We hope to have 1.4 ready to release sometime this week, we appreciate our community of users and the feedback that we continue to get for our product. Thanks for helping us make XI better!

 

 

Building a Nagios 4 / Nagios XI Prototype Box

So after an awesome set of presentations at the Nagios World Conference 2012, one of the hot topics for discussion was clearly the upcoming Nagios Core 4 release. Andreas Ericsson has been hard at work overhauling the Core engine to optimize performance and reduce Disk and CPU usage for Nagios, and initial tests are showing his work has paid off in a substantial way. For this experiment, we’re going to use a system with the following specs:

  • Virtual Machine running under Vmware Workstation 8
  • 2GB of RAM
  • 1 CPU, 4 Cores
  • 80GB Hard Disk
  • Nagios XI Installed
  • Nagios binaries replace with Nagios 4 monitoring engine
  • ndoutils binaries replaced with with the latest SVN code for ndoutils: nagios/ndoutils/branches/ndoutils-2-0
  • No initial performance tweaks other than Nagios 4 and ndoutils 2

I’ll post setup instructions below for users who also want to play around with this setup. Note: This setup is not intended for production installs, use this in test environments only!

Start with Nagios XI installed, either through the pre-installed VM or with a manual installation. I chose a manual installation for this demo so I could set up the hardware to my liking and give it sufficient hard drive space to test a LOT of hosts. My first attempt at the prototype only had 10GB on the box, and filled up quite quickly because of performance data. .I ran the following commands after initial Nagios XI installation and setup was completed.

From the command-line:

You can verify the upgrade succeeded by reviewing the /usr/local/nagios/var/nagios.log file. There should be some new warnings about obsolete definitions like “failure_prediction_enabled”, which we won’t worry about for now. For now I’d like to see what kind of performance impact I can expect for a large number of checks being run on this machine, so I need to quickly create a large number of checks.  I’ll achieve this by running a tools script that we include with every installation of Nagios XI.

I chose to use static configs instead of the CCM for this benchmark for ease of setup time, and also easy removal later on. This also creates a list of checks with 25% of the services showing up as critica, which is useful in testing a system stressed with alerts and notifications. However, I’m also going to turn off notifications and event handlers during this setup phase just to make sure I don’t bottleneck somewhere and tank the entire box. Now lets restart Nagios to start using the new configs.

After adding 1000 hosts and 4000 services all at a 5mn interval the CPU load is running at a nominal level, averaging anywhere from .30 – .70, which is pretty impressive for a 4 core system! There is still some Disk IO because performance data processing is happening for each service, and this will likely be one of the noticeable bottlenecks as we add more checks to this system. After the system levels out and all of the checks are settled into a hard state, I turn on notifications and event handlers and begin watching the system and testing for bottlenecks. I’ll post back with some results soon! If there are any XI users out there who want to give this a shot in their test environments and post back with their results we’d love to hear what you find!

 

 

 

Nagios V-Shell 1.9 Released

Nagios V-Shell 1.9 includes major performance updates, and a re-implementation of PHP caching that should decrease V-Shell page load times anywhere from 40-75%.  I ran some benchmarking tests on a test system(Dual core desktop with 4GB of RAM) with 1800 hosts, and 7200 services.  This system runs with an average CPU load of 2.0-6.0 throughout the day, so the hardware is being pushed pretty hard already from the check load. V-Shell 1.8 created page load times anywhere from 18-28 seconds throughout the interface without APC caching enabled.  Needless to say, this is problematic for many users with larger environments.  The Core cgi’s were able to load anywhere from 2-11 seconds, with the service status page taking around 9-11 seconds to load all of the data.  My goal for 1.9 was to minimize any unnecessary processing, and optimize any functions that were inefficient or using slower PHP built-in functions.  The differences in 1.9 are substantial.  Without any caching enabled at all, I was able to decrease the average page load time to 9-14 seconds, which is 40-50% faster by itself.  Once I had the code optimized, I reworked the APC caching functionality.  If a user has PHP’s APC caching packages installed and enabled on their web server, V-Shell will cached the objects.cache file until it detects any changes in the file, while the data in the status.dat file will be cached based on a TTL (time to live) config option which now exists in 1.9.  Once the data is cached in APC, the page load times throughout the interface averaged between 4-5 seconds for all pages, which is a 75% decrease in load time on average.

My goal for the next version of V-Shell is to add support for mklivestatus and ndoutils for backend data, which will eliminate the need to parse the objects.cache file and status.dat files for systems with those backends.  This should further improve performance for larger installations.

Download Nagios V-Shell 1.9

CHANGELOG

 

Nagios Performance Tuning – Tech Tips: Understanding Disk I\O

We often get questions about the kind of hardware requirements needed for a particular Nagios installation.  As covered in a previous article, this is often a very difficult question to answer since monitoring environments differ so much.  Most people assume that for a large Nagios installation, it’s a matter of simply adding enough CPU’s to the machine to handle the workload that it’s given.  Although having enough CPU power is important, I’ve found that it’s ultimately not the biggest hardware limitation to the system.  A large Nagios installation creates an enormous amount of disk activity, and if the hard disk can’t keep up with the constant traffic flow that needs to happen, all of those precious CPU’s are simply going to wait in line to be able to do what they need to do on the system.  I’ve talked to some users who have spent some serious money on hardware to have insanely fast disks to handle their workload, but I wanted to do some experiments in-house for those users who may need to have better performance on a budget.  I want to give special thanks to Nagios community members Dan Wittenberg and Max Schubert for documenting some of the tricks that you guys pioneered on this topic.

Continue reading ‘Nagios Performance Tuning – Tech Tips: Understanding Disk I\O’