If you’re looking for a reverse proxy server, Varnish is an excellent choice. It’s fast, and it’s used by Facebook and Twitter, as well as plenty of others. For most sites, it can be used effectively pretty much out of the box with minimal tuning.
Like many decently-sized Rails apps, we leverage a lot of open source code. Dozens of gems and plugins, a variety of cloud services, Varnish and Nginx for caching and load balancing, and various persistence solutions. The point is, as our app usage has grown over the last year, we’ve had our share of stressful, on-the-fly debugging while our app was down. That’s not the best time to learn about all the fun nuances and interactions of your technology stack.
It’s a good idea to know what your services are doing and the key metrics to watch, so you’re better prepared when you hit those inevitable scaling pain points. New Relic has been tremendously useful for monitoring and debugging our database and Rails app. The rest of this post goes over some key metrics for Varnish and setting up Munin to monitor them.
Optimizing and Inspecting Varnish
Unless your application has an extremely high volume of traffic, you likely won’t have to optimize Varnish itself (e.g., cache sizes, thread pool settings, etc). Most of the work will be in verifying that your resources have appropriate HTTP caching parameters (Expires/max-age and ETag/Last-Modified). You’re most of the way there if you do the following:
- Run Varnish on a 64-bit machine. It’ll run on a 32-bit machine, but it likes the virtual address space of a 64-bit machine. Also, Varnish’s test suites are only run on 64-bit distributions.
- Normalize the hostname. e.g., www.website.com => website.com, to avoid caching the same resource multiple times. Details here.
- Unset cookies for any resource that should be cacheable. Details here.
Varnish includes a variety of command line tools to inspect what Varnish is doing. SSH into the server running Varnish, and let’s take a look.
Inspecting an individual resource
First, let’s look at how Varnish handles an individual resource. On a client machine, point a web browser to an resource cached by Varnish. On the server, type:
$ varnishlog -c -o ReqStart <IP address of client machine>
The output of this command will be communication between the client machine and Varnish. In another SSH terminal, type:
$ varnishlog -b -o TxHeader <IP address of client machine>
The output of this command will be communication between Varnish and a backend server (i.e., an origin server, the actual application). Try reloading the resource in the browser. If it is cached correctly, you shouldn’t see any communication between Varnish and any backend servers. If you do see something printed there, inspect the HTTP caching headers and verify they are correct.
Now that we’ve seen that Varnish is working for an individual resource, let’s see how it’s doing overall. In your SSH session, type:
The most important metrics to note here are the hitrate and the uptime. Varnish has a parent process whose only function is to monitor and restart a child process. If Varnish is restarting itself frequently, that’s something to be investigated by looking at its output in /var/log/syslog.
Other than that, check out Varnishstat For Dummies for a good overview.
It’s great that we can check on Varnish fairly easily, but the key is to automate this process; otherwise, it can be very difficult to detect warning patterns early. Also, it’s not realistic to have a huge, manual, pre-flight checklist to check on the health of all your services. Enter Munin…
Get Started with Munin in 15 minutes
Munin is a monitoring tool with a plug-in framework. Munin nodes periodically report back to a Munin server. The Munin server collects the data and generates an HTML page with graphs. The default install of Munin contains a plug-in for reporting Varnish statistics. The Varnish plug-in includes a variety of graphs, including the one below.
If you’re installing Munin on an Ubuntu machine (or any distribution that uses apt), use the commands below. For other platforms, see the installation instructions here.
For every server you want to monitor, type:
$ sudo apt-get install munin-node
Designate a server to collect the data. The server can also be a Munin node. On the server, type:
$ sudo apt-get install munin
For each node, open the configuration file at /etc/munin/munin-node.conf. Add the IP address of the Munin server.
After you modify the configuration file, restart the Munin node by typing:
$ sudo service munin-node restart
For the server, open the configuration file at /etc/munin/munin.conf. Add each node that you want to monitor.
[Domain;serverA] address xxx.xxx.xxx.xxx use_node_name yes
Choose any value you like for Domain and serverA above; the names are purely for organization. When the Munin server was installed, it also installed a cron job that runs every 5 minutes and collects data from each node. After editing the configuration file, wait 5 minutes for the charts to be generated. If you’re impatient, type:
$ sudo -u munin /usr/bin/munin-cron
View Munin Graphs
If you have lighttpd or Apache, point it at /var/cache/munin/www. If the charts have been generated properly, there should be an index.html file in that directory.
If the Munin charts aren’t being generated, make sure that the directories listed in /etc/munin/munin.conf exist and have appropriate permissions for the user, munin.
Try manually executing munin-cron and see if there is any error output.
Look at /var/log/syslog for any Munin-related errors.
That’s it! Varnish is optimized and working correctly, and Munin is reporting the important stats so you can sleep easy at night. Enjoy!
Web caching references
Caching Tutorial – Excellent overview of web caching by Mark Nottingham.
Things Caches Do – Overview of reverse proxy caches like Varnish and Rack-Cache.
HTTP 1.1 Caching Specification – Official HTTP 1.1 Caching Specification.