With each new version of Nagios XI, we do our best to include the most important bug fixes, improvements, and features that we can accomplish in a few weeks' time. The upcoming Nagios XI 2012r1.4 is going to be a notable release of XI for both performance improvements and internationalization.
For our international users, we’ve been hard at work updating XI for internationalization, as well as kick-starting multiple translations using Google Translate. We’ve been working to balance code updates with community contributions for languages, and this upcoming release will ship with a default.pot file that users can use to update any PO files they may have begun populating. This release of XI will ship with kick-started translations in the following languages.
Performance Improvements in 1.4
For customers with larger installs, we’ve been analyzing bottlenecks in both the monitoring process and the UI to make XI run faster and leaner. Users with hosts and services numbering in the thousands will almost certainly see an improvement in both CPU load and page load times in the UI. On the monitoring side, we updated the Monitoring Engine Event Queue dashlet and the Monitoring Engine Check Statistics dashlets to pull data from the same status information that the rest of XI uses, which greatly reduces the amount of data the monitoring process has to log to MySQL. The end result of this change is that MySQL will only need to do about 30% of the work that it had to do in previous releases. For large installs, this is a big deal!
The other key change that all users will probably benefit from is a refactoring of data queries for AJAX-loaded content in the XI interface. Load times for dashlets that contain tactical or summary data went from 15-20 seconds per dashlet down to 0.05 seconds in local tests with 10k checks. The other upside of this change is that the CPU usage from XI users accessing the interface is substantially reduced. The Tactical Overview dashlets see the largest benefit in load times by far. For users who had to use the unified Tactical Overview for performance reasons, we encourage you to try the dashlet version in 1.4.
We hope to have 1.4 ready to release sometime this week. We appreciate our community of users and the feedback that we continue to get for our product. Thanks for helping us make XI better!
Nagios V-Shell 1.9 includes major performance updates and a re-implementation of PHP caching that should decrease V-Shell page load times by anywhere from 40-75%. I ran some benchmarking tests on a test system (dual-core desktop with 4GB of RAM) with 1800 hosts and 7200 services. This system runs with an average CPU load of 2.0-6.0 throughout the day, so the hardware is already being pushed pretty hard by the check load.
V-Shell 1.8 produced page load times of anywhere from 18-28 seconds throughout the interface without APC caching enabled. Needless to say, this is problematic for many users with larger environments. The Core CGIs were able to load in anywhere from 2-11 seconds, with the service status page taking around 9-11 seconds to load all of the data.
My goal for 1.9 was to minimize any unnecessary processing and to optimize any functions that were inefficient or relied on slower PHP built-in functions. The differences in 1.9 are substantial. Without any caching enabled at all, I was able to decrease the average page load time to 9-14 seconds, which is 40-50% faster by itself.
Once I had the code optimized, I reworked the APC caching functionality. If a user has PHP’s APC caching packages installed and enabled on their web server, V-Shell will cache the objects.cache file until it detects any changes in the file, while the data in the status.dat file will be cached based on a TTL (time to live) config option that now exists in 1.9. Once the data is cached in APC, page load times throughout the interface averaged between 4-5 seconds for all pages, which is a 75% decrease in load time on average.
My goal for the next version of V-Shell is to add support for MK Livestatus and NDOUtils as backend data sources, which will eliminate the need to parse the objects.cache and status.dat files on systems with those backends. This should further improve performance for larger installations.
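The two caching rules described above (reuse the parsed objects.cache until the file changes, and reuse the parsed status.dat until a TTL expires) can be sketched roughly as follows. This is an illustration in Python rather than V-Shell's actual PHP/APC code. The `FileCache` class, its `parse` callback, and the in-memory dictionary standing in for APC are all hypothetical names chosen for the example.

```python
import os
import time


class FileCache:
    """Cache parsed file contents, invalidated by file change or by TTL.

    Hypothetical sketch: a plain dict stands in for APC, and `parse` is
    any callable that turns a file path into parsed data.
    """

    def __init__(self, parse, ttl=None, clock=time.time):
        self.parse = parse    # callable: path -> parsed data
        self.ttl = ttl        # seconds; None means "until the file changes"
        self.clock = clock    # injectable clock, handy for testing
        self._entries = {}    # path -> (stamp, data)

    def get(self, path):
        entry = self._entries.get(path)
        if self.ttl is None:
            # objects.cache style: compare the file's modification time.
            stamp = os.stat(path).st_mtime_ns
            fresh = entry is not None and entry[0] == stamp
        else:
            # status.dat style: reuse the data until the TTL has elapsed.
            stamp = self.clock()
            fresh = entry is not None and stamp - entry[0] < self.ttl
        if not fresh:
            entry = (stamp, self.parse(path))
            self._entries[path] = entry
        return entry[1]
```

A caller would create one `FileCache(parse_objects)` for objects.cache and one `FileCache(parse_status, ttl=90)` for status.dat, where both parser functions and the 90-second TTL are made-up values for illustration. The split reflects how the files behave: object definitions change rarely, so checking for a change is cheap and precise, while status data changes constantly, so a short TTL trades a little staleness for far fewer parses.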
Download Nagios V-Shell 1.9
In the past months we’ve had several requests for better control and time specifications for Nagios performance graphs. Being a big fan of fancy visualizations, I’ve been staring at the old PNP graphs for a while and wondering if there’s a way we can create graphs that look like they’re actually from this decade. After reviewing several different visualization libraries, we decided to take a stab at developing some new tools with graphing libraries from Highcharts. Although some of the fine details are still being polished, our first prototype has us pretty excited about where this project is headed.
jQuery Performance Graphs in XI
Our first prototype is a zoomable performance graph that allows you to specify start/stop times and then dynamically zoom the graph all the way down to a 5-minute interval for closer examination. Although these graphs are rendered client-side, they can all be exported as PNG, PDF, JPG, or SVG images to use in external reporting or presentations. Let us know what you think!
We came across an issue about a month ago where a user was losing data with a distributed/passive checks setup. Upon closer investigation we uncovered that all of the passive checks were being executed every 5 minutes from servers that were all synced to the same time server. The result? Hundreds of checks were all coming in within a few seconds, putting a heavy load on Nagios, while the other 4 minutes and 50 seconds were going virtually unused by the server. After some discussion we decided to make use of a built-in tool for Nagios, nagiostats, and create a wizard that could monitor Nagios itself to see how the checks were coming in and being processed. Although multiple checks against nagiostats have been written in the past, our new wizard allows you to quickly create several checks against the nagiostats binary to monitor the monitoring environment itself. We’ve just released a 1.0 version of this wizard and we’re curious to know what users think of it. Feel free to give it a try and send us your feedback!
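To make the bunching problem concrete, here is a small Python sketch that buckets check-result arrival times by their offset within the 5-minute interval. It does not show how the wizard or nagiostats works. The function names and the sample timestamps are hypothetical and only illustrate how lopsided arrivals show up once they are counted.

```python
from collections import Counter


def arrival_offsets(timestamps, interval=300, bucket=10):
    """Count check results per `bucket`-second slice of the check interval.

    `timestamps` are Unix times at which passive results arrived
    (hypothetical input; this sketch does not read them from Nagios).
    """
    counts = Counter((int(ts) % interval) // bucket * bucket for ts in timestamps)
    # Include empty slices so idle stretches of the interval are visible.
    return {offset: counts.get(offset, 0) for offset in range(0, interval, bucket)}


def busiest_share(histogram):
    """Fraction of all results that landed in the single busiest slice."""
    total = sum(histogram.values())
    return max(histogram.values()) / total if total else 0.0
```

With evenly spread checks, each of the thirty 10-second slices would receive about 3% of the results. A `busiest_share` close to 1.0 describes the situation above, where nearly everything arrives in the first few seconds of each interval.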
Nagiostats Wizard on Exchange
Graphs from the Nagiostats Wizard