Archive for the 'Performance' Category

Nagios XI 2012r1.4 Improvements

With each new version of Nagios XI, we do our best to include the most important bug fixes, improvements, and features that we can accomplish in a few weeks’ time. The upcoming Nagios XI 2012r1.4 is going to be a notable release for both performance improvements and internationalization.

Internationalization

For our international users, we’ve been hard at work updating XI for internationalization, as well as kick-starting multiple translations using Google Translate. We’ve been working to balance code updates with community contributions for languages, and this upcoming release will ship with a default.pot file that users can use to update any PO files they may have begun populating. This release of XI will ship with kick-started translations in the following languages:

  • German
  • Spanish
  • French
  • Italian
  • Portuguese
  • Russian
  • Korean
  • Chinese

Performance Improvements in 1.4

For customers with larger installs, we’ve been analyzing bottlenecks in both the monitoring process and the UI to try to make XI run faster and leaner. Users with hosts+services in the thousands will almost certainly see an improvement in both CPU load and page load times in the UI. On the monitoring side, we updated the Monitoring Engine Event Queue dashlet and the Monitoring Engine Check Statistics dashlets to pull data from the same status information that the rest of XI uses, which greatly reduces the amount of data the monitoring process needs to log to MySQL. The end result is that MySQL only needs to do about 30% of the work it had to do in previous releases. For large installs, this is a big deal!

The other key change, which all users will probably benefit from, is a refactoring of the data queries for AJAX-loaded content in the XI interface. Load times for dashlets that contain tactical or summary data went from 15-20 seconds per dashlet down to 0.05 seconds in local tests with 10k checks. The other upside of this change is that CPU usage from users accessing the interface is substantially reduced. The Tactical Overview dashlets see the largest benefit in load times by far. For users who had to fall back on the unified Tactical Overview for performance reasons, we encourage you to try the dashlet version in 1.4.

We hope to have 1.4 ready to release sometime this week. We appreciate our community of users and the feedback that we continue to get for our product. Thanks for helping us make XI better!

 

 

Building a Nagios 4 / Nagios XI Prototype Box

After an awesome set of presentations at the Nagios World Conference 2012, one of the hot topics for discussion was clearly the upcoming Nagios Core 4 release. Andreas Ericsson has been hard at work overhauling the Core engine to optimize performance and reduce disk and CPU usage, and initial tests show that his work has paid off in a substantial way. For this experiment, we’re going to use a system with the following specs:

  • Virtual Machine running under VMware Workstation 8
  • 2GB of RAM
  • 1 CPU, 4 Cores
  • 80GB Hard Disk
  • Nagios XI Installed
  • Nagios binaries replaced with the Nagios 4 monitoring engine
  • ndoutils binaries replaced with the latest SVN code for ndoutils: nagios/ndoutils/branches/ndoutils-2-0
  • No initial performance tweaks other than Nagios 4 and ndoutils 2

I’ll post setup instructions below for users who also want to play around with this setup. Note: This setup is not intended for production installs. Use it in test environments only!

Start with Nagios XI installed, either through the pre-installed VM or with a manual installation. I chose a manual installation for this demo so I could set up the hardware to my liking and give it sufficient hard drive space to test a LOT of hosts. My first attempt at the prototype only had 10GB on the box, and it filled up quite quickly because of performance data. I ran the following commands after the initial Nagios XI installation and setup was completed.

From the command-line:

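# Fetch the ndoutils 2.0 branch and the Nagios Core trunk from SVN
cd /tmp
yum install -y subversion
svn co https://nagios.svn.sourceforge.net/svnroot/nagios/ndoutils/branches/ndoutils-2-0
svn export ndoutils-2-0/ ndoutils
svn co https://nagios.svn.sourceforge.net/svnroot/nagios/nagioscore/trunk/ coretrunk
svn export coretrunk/ nagioscore

# Stop the monitoring engine and ndo2db before replacing the binaries
service nagios stop
service ndo2db stop

# Build and install Nagios Core 4
cd nagioscore
./configure --with-command-group=nagcmd
make all
make install

# Build and install ndoutils 2 (stop at the first failing step)
cd /tmp/ndoutils
./configure && make && make install

# Upgrade the ndoutils database schema
cd db
./upgradedb -u root -p nagiosxi -h localhost -d nagios

# Bring the services back up
service ndo2db start
service nagios start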

You can verify that the upgrade succeeded by reviewing the /usr/local/nagios/var/nagios.log file. There should be some new warnings about obsolete definitions like “failure_prediction_enabled”, which we won’t worry about for now. Next, I’d like to see what kind of performance impact I can expect from a large number of checks being run on this machine, so I need to quickly create a large number of checks. I’ll achieve this by running a tools script that we include with every installation of Nagios XI.

cd /usr/local/nagiosxi/tools
./create_checks.php --hosts=1000 --prefix=_MASS1_ > /usr/local/nagios/etc/static/_MASS1.cfg
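As a quick sanity check on the generated file, you can count the object definitions it contains. This is only a minimal sketch: `count_definitions` is a hypothetical helper, not something shipped with Nagios XI, and it assumes the generated file uses the standard `define host {` / `define service {` object syntax.

```shell
# Hypothetical helper (not part of Nagios XI): count the object
# definitions of one type in a Nagios config file.
# Assumes the standard "define <type> {" object syntax.
count_definitions() {
    # $1 = object type (e.g. host, service), $2 = config file
    grep -Ec "^[[:space:]]*define[[:space:]]+$1[[:space:]]*\{" "$2"
}

# Report on the file generated above, if it exists on this machine.
cfg=/usr/local/nagios/etc/static/_MASS1.cfg
if [ -r "$cfg" ]; then
    echo "hosts:    $(count_definitions host "$cfg")"
    echo "services: $(count_definitions service "$cfg")"
fi
```

If the counts are not what you asked the script for, it is easier to find out now than after a restart.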

I chose to use static configs instead of the CCM for this benchmark because they are quick to set up and easy to remove later on. The script also creates a list of checks with 25% of the services showing up as critical, which is useful for testing a system stressed with alerts and notifications. However, I’m also going to turn off notifications and event handlers during this setup phase just to make sure I don’t bottleneck somewhere and tank the entire box. Now let’s restart Nagios to start using the new configs.

service nagios restart

After adding 1000 hosts and 4000 services, all at a 5-minute interval, the CPU load is running at a nominal level, averaging anywhere from 0.30 to 0.70, which is pretty impressive for a 4-core system! There is still some disk I/O because performance data processing is happening for each service, and this will likely be one of the noticeable bottlenecks as we add more checks to this system. After the system levels out and all of the checks have settled into a hard state, I turn on notifications and event handlers and begin watching the system and testing for bottlenecks. I’ll post back with some results soon! If there are any XI users out there who want to give this a shot in their test environments and post back with their results, we’d love to hear what you find!
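For anyone comparing their own numbers with the load figures above, it helps to look at load per core rather than the raw load average. The snippet below is a small sketch of that arithmetic. `load_per_core` is a hypothetical helper name, and the live-values part assumes a Linux host with `/proc/loadavg` and `nproc` available.

```shell
# Hypothetical helper: express a load average per CPU core.
load_per_core() {
    # $1 = load average, $2 = number of cores
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.3f\n", load / cores }'
}

# Figures from the test box above: load between 0.30 and 0.70 on 4 cores.
load_per_core 0.30 4
load_per_core 0.70 4

# Live 1-minute value on a Linux host (assumes /proc/loadavg and nproc).
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one_min _ < /proc/loadavg
    load_per_core "$one_min" "$(nproc)"
fi
```

For the test box, that works out to 0.075 and 0.175 per core, so each core still has plenty of headroom at this check volume.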

 

 

 

Nagios V-Shell 1.9 Released


Nagios V-Shell 1.9 includes major performance updates and a re-implementation of PHP caching that should decrease V-Shell page load times by anywhere from 40-75%. I ran some benchmarks on a test system (dual-core desktop with 4GB of RAM) with 1800 hosts and 7200 services. This system runs with an average CPU load of 2.0-6.0 throughout the day, so the hardware is already being pushed pretty hard by the check load. Without APC caching enabled, V-Shell 1.8 had page load times of anywhere from 18-28 seconds throughout the interface. Needless to say, this is problematic for many users with larger environments. The Core CGIs were able to load in anywhere from 2-11 seconds, with the service status page taking around 9-11 seconds to load all of the data.

My goal for 1.9 was to minimize any unnecessary processing and to optimize any functions that were inefficient or relied on slower PHP built-in functions. The differences in 1.9 are substantial. Without any caching enabled at all, I was able to decrease the average page load time to 9-14 seconds, which is 40-50% faster by itself.

Once I had the code optimized, I reworked the APC caching functionality. If a user has PHP’s APC caching packages installed and enabled on their web server, V-Shell will cache the objects.cache file until it detects a change in the file, while the data in the status.dat file is cached based on a TTL (time to live) config option that is new in 1.9. Once the data is cached in APC, page load times throughout the interface averaged 4-5 seconds for all pages, which is a 75% decrease in load time on average.
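The two caching strategies described above can be sketched in a few lines of shell. To be clear, this is not V-Shell’s code: V-Shell is written in PHP and stores its data in APC, whereas this sketch uses files on disk. The function names and the `/tmp` cache paths are hypothetical, comparing modification times is an assumption about how a change might be detected, and `stat -c %Y` is the GNU coreutils form.

```shell
# Illustrative only: V-Shell itself is PHP and caches in APC, not in files.

# Fresh while the source file has not been modified since the cached
# copy was written (the strategy described for objects.cache).
cache_fresh_by_mtime() {
    # $1 = source file, $2 = cached copy
    [ -f "$2" ] && [ ! "$1" -nt "$2" ]
}

# Fresh while the cached copy is younger than a TTL in seconds
# (the strategy described for status.dat).
cache_fresh_by_ttl() {
    # $1 = cached copy, $2 = TTL in seconds
    [ -f "$1" ] || return 1
    age=$(( $(date +%s) - $(stat -c %Y "$1") ))   # GNU stat assumed
    [ "$age" -lt "$2" ]
}

# Hypothetical cache locations and a 30-second TTL.
if cache_fresh_by_mtime /usr/local/nagios/var/objects.cache /tmp/objects.cached; then
    echo "objects: serve cached copy"
else
    echo "objects: re-parse objects.cache"
fi
if cache_fresh_by_ttl /tmp/status.cached 30; then
    echo "status: serve cached copy"
else
    echo "status: re-parse status.dat"
fi
```

The split makes sense because object definitions only change when the configuration is reloaded, while status data changes constantly and a short TTL trades a little staleness for far fewer parses.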

My goal for the next version of V-Shell is to add support for mklivestatus and ndoutils as backend data sources, which will eliminate the need to parse the objects.cache and status.dat files on systems with those backends. This should further improve performance for larger installations.

Download Nagios V-Shell 1.9

CHANGELOG

 

Nagios Performance Tuning – Tech Tips: Understanding Disk I/O

We often get questions about the kind of hardware requirements needed for a particular Nagios installation. As covered in a previous article, this is often a very difficult question to answer since monitoring environments differ so much. Most people assume that for a large Nagios installation, it’s a matter of simply adding enough CPUs to the machine to handle the workload that it’s given. Although having enough CPU power is important, I’ve found that it’s ultimately not the biggest hardware limitation on the system. A large Nagios installation creates an enormous amount of disk activity, and if the hard disk can’t keep up with the constant traffic, all of those precious CPUs are simply going to wait in line to do what they need to do on the system. I’ve talked to some users who have spent some serious money on insanely fast disks to handle their workload, but I wanted to do some experiments in-house for those users who may need better performance on a budget. I want to give special thanks to Nagios community members Dan Wittenberg and Max Schubert for documenting some of the tricks that they pioneered on this topic.


Nagios XI Graph Explorer Component Released

My brother (a fellow programmer) once told me, “the solution is easy once you know what it is.” That’s been the case for the finishing touches needed to finally release a component that I’ve been excited about for a long time: the Nagios XI Graph Explorer. This component utilizes a JavaScript visualization library and allows users to easily zoom graphs, select custom time frames, and even stack time periods on top of each other to compare performance from one time period to the next. If you like data visualization, you’ll love this tool. For now, this component is available to current Nagios XI customers only and can be downloaded from the Nagios XI Customer Downloads page. I recommend using it with Firefox for maximum reliability. Special thanks to Nicholas Scott for accidentally pointing out the solution to the problem that’s been in front of my face the whole time ; )

 

Helping MySQL Move Out And Find Its Own Server

Anybody keeping tabs on the performance of their Nagios XI server knows that mysqld, httpd, and nagios all play an intense game of king-of-the-CPU. The cool thing about Nagios XI is that it comes with NDOUtils out of the box, which is a great tool for offloading the MySQL server, and that helps if you need to stack on more checks. If you run a Nagios XI server that is completely loaded down and have another server that could host MySQL for it, this PDF is definitely worth a read. It is a step-by-step guide to migrating your existing MySQL server to a remote server, and an interesting look at just how extensible Nagios XI is.

Offloading MySQL to a Remote Server

Nagios Visualization Toolkit (Under Construction)

In the past months we’ve had several requests for better control and time specifications for Nagios performance graphs. Being a big fan of fancy visualizations, I’ve been staring at the old PNP graphs for a while and wondering if there’s a way we can create graphs that look like they’re actually from this decade. After reviewing several different visualization libraries, we decided to take a stab at developing some new tools with the graphing libraries from Highcharts. Although some of the fine details are still being polished, our first prototype has us pretty excited about where this project is headed.


JQuery Performance Graphs in XI

Our first prototype is a zoomable performance graph that allows you to specify start/stop times and then dynamically zoom the graph all the way down to a 5-minute interval for closer examination. Although these graphs are rendered client-side, they can all be exported as PNG, PDF, JPG, or SVG images to use in external reporting or presentations. Let us know what you think!

Distributed Monitoring Solutions For Nagios

Interested in scaling your Nagios deployment to monitor a large environment? Distributed monitoring may be the solution you’re looking for. We just created a document that describes different methods for configuring a distributed monitoring solution with Nagios Core and Nagios XI.

Distributed_Monitoring_Solutions.pdf

Analyze Nagios Performance With The Nagiostats Wizard

We came across an issue about a month ago where a user was losing data with a distributed/passive checks setup. On closer investigation, we discovered that all of the passive checks were being executed every 5 minutes from servers that were all synced to the same time server. The result? Hundreds of checks were all coming in within a few seconds, putting a heavy load on Nagios, while the other 4 minutes and 50 seconds went virtually unused by the server. After some discussion, we decided to make use of a tool built into Nagios, nagiostats, and create a wizard that could monitor Nagios itself to see how the checks were coming in and being processed. Although multiple checks of this kind have been written in the past, our new wizard allows you to quickly create several checks against the nagiostats binary to monitor the monitoring environment itself. We’ve just released a 1.0 version of this wizard and we’re curious to know what users think of it. Feel free to give it a try and send us your feedback!
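Once monitoring shows that passive results are arriving in bursts like this, one common mitigation is to give each remote server its own fixed delay within the check interval. The sketch below is an assumption about how that could be done, not a description of how this user’s issue was resolved. `splay_offset` is a hypothetical helper that derives a stable offset from the host name with `cksum`.

```shell
# Hypothetical helper: derive a stable per-host delay, in seconds,
# within the check interval so that hosts do not all report at once.
splay_offset() {
    # $1 = host name, $2 = interval in seconds
    sum=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $(( sum % $2 ))
}

# Example: work out this host's delay within a 5-minute (300 s) interval.
# A cron-driven passive check script could sleep this long before
# submitting its results.
offset=$(splay_offset "$(uname -n)" 300)
echo "would sleep ${offset}s before submitting results"
```

Because the offset is derived from the host name rather than drawn at random, each server reports at the same point in every interval, which keeps the check spacing predictable.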

Nagiostats Wizard on Exchange

Wizard Preview


Graphs from the Nagiostats Wizard