Monitoring Unicorn with Monit

Matteo Latini

on 14 Aug 2012 in Development, DevOps


It’s been a while since we switched all our infrastructure from Apache2 + Passenger to Nginx + Unicorn and we’ve really been satisfied with such a choice.

Unicorn gives you tremendous geek power and responsibility. It may seem initially hard (it’s so distant from that “php feeling” you get when working with Passenger) but once you unleash some Unicorn power, you can never go back.

One of the most amazing things you can do with Unicorn is monitor its behavior in depth with ease and peace of mind (read UNIX Signals). To monitor a Unicorn process you can use a number of tools, such as bluepill, god and monit. The first two are Ruby applications, while monit is written in C.

Since we read so many horror stories about Ruby memory leaking all over that great unicorn fur, we chose monit as our main monitoring solution (not that its syntax is so much more complex than Ruby’s).

Monitoring Unicorn

Unicorn’s structure is simple and effective: it has a “master process” which administers a number of “worker processes” that actually serve HTTP requests. The worker processes are forked from the master process, which means (from a monitoring perspective) that we can monitor not only each worker individually but also the master process as the sum of the resources used by its workers. For a much more in-depth explanation, read the great article “I like Unicorn because it’s Unix”.
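
To make the master/worker relationship concrete, here is a toy sketch of a parent process forking two workers. It is not Unicorn's actual code; spawn_workers and reap_workers are hypothetical helper names used only for illustration, and the workers exit immediately instead of serving requests.

```ruby
# Toy illustration of a master forking workers (NOT Unicorn's real code).
# spawn_workers and reap_workers are hypothetical names for this sketch.

# Fork `count` workers and return a hash of worker number => pid.
def spawn_workers(count)
  (0...count).each_with_object({}) do |nr, pids|
    pids[nr] = fork do
      # A real worker would loop here, accepting HTTP requests.
      exit!(0)
    end
  end
end

# Wait for every worker and return a hash of worker number => exit status.
def reap_workers(pids)
  pids.transform_values { |pid| Process.wait2(pid).last.exitstatus }
end

workers = spawn_workers(2)
workers.each do |nr, pid|
  puts "worker #{nr} has pid #{pid} (master is #{Process.pid})"
end
reap_workers(workers)
```

Every worker is a separate operating system process with its own pid, which is what lets a tool like monit watch each one individually.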

We’ll see the approach we use in each of our applications:

  1. monitor the application as a whole via the unicorn master and alert if there is any status change;
  2. monitor each worker process to gracefully kill any worker which is using up too many resources.

All the examples will refer to the nebulab application.

Master

Monitoring the unicorn master process is simple; we should already have everything in place by default. This is the required monit configuration (/etc/monit/conf.d/nebulab.monitrc):

check process nebulab with pidfile /var/run/unicorn/nebulab/pid
  start program "/etc/init.d/nebulab start"
  stop program "/etc/init.d/nebulab stop"
  if 5 restarts within 5 cycles then timeout

Monit will then make sure that this process is always running and will email us if anything happens to the process (i.e. it gets restarted without monit’s consent). If you want to keep track of how much memory/cpu unicorn is using up as a whole, you can add some more triggers like so:

check process nebulab with pidfile /var/run/unicorn/nebulab/pid
  start program "/etc/init.d/nebulab start"
  stop program "/etc/init.d/nebulab stop"
  if 5 restarts within 5 cycles then timeout
  if totalcpu is greater than 50% for 2 cycles then alert
  if totalcpu is greater than 90% for 3 cycles then restart
  if totalmem is greater than 60% for 1 cycles then restart

but we usually don’t use those since we have many more monit configurations that monitor the system as a whole.

Workers

Monitoring the worker processes is a little more complicated: it requires a way of signaling a single worker that it needs to shut down after finishing its last request. We also need to make monit aware of that process (we need a pid for each worker). This requires a unicorn configuration that is a bit more elaborate than the “default”.

We begin by creating a pidfile for each worker. This can be done with the unicorn after_fork hook:

after_fork do |server, worker|
  # Create worker pids too
  child_pid = server.config[:pid].sub(/pid$/, "worker.#{worker.nr}.pid")
  system("echo #{Process.pid} > #{child_pid}")
end

This ensures that, immediately after the worker process has been forked, a pidfile named worker.nr.pid is created (nr being the worker number). Here is the complete configuration file:

##
# Unicorn config at /etc/unicorn/nebulab.rb
# Managed by Chef - Local Changes will be Nuked from Orbit (just to be sure)
##

# What ports/sockets to listen on, and what options for them.
listen '/var/run/unicorn/nebulab/socket', :backlog => 100

working_directory '/var/www/nebulab/current'

# What the timeout for killing busy workers is, in seconds
timeout 60

# Whether the app should be pre-loaded
preload_app true

# How many worker processes
worker_processes 2

if GC.respond_to?(:copy_on_write_friendly=)
  GC.copy_on_write_friendly = true
end

before_exec do |server|
  ENV['BUNDLE_GEMFILE'] = "/var/www/nebulab/current/Gemfile"
end

# What to do before we fork a worker
before_fork do |server, worker|
  defined?(ActiveRecord::Base) && ActiveRecord::Base.connection.disconnect!

  system "test -x /opt/scripts/repair_permissions_nebulab.sh && /opt/scripts/repair_permissions_nebulab.sh"

  old_pid = "/var/run/unicorn/nebulab/pid.oldbin"
  if File.exist?(old_pid) && server.pid != old_pid
    begin
      Process.kill('QUIT', File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
      # someone else did our job for us
    end
  end
end

# What to do after we fork a worker
after_fork do |server, worker|
  defined?(ActiveRecord::Base) && ActiveRecord::Base.establish_connection

  # Create worker pids too
  child_pid = server.config[:pid].sub(/pid$/, "worker.#{worker.nr}.pid")
  system("echo #{Process.pid} > #{child_pid}")
end

# Where to drop a pidfile
pid '/var/run/unicorn/nebulab/pid'

# Where stderr gets logged
stderr_path '/var/log/unicorn/nebulab/error.log'

# Where stdout gets logged
stdout_path '/var/log/unicorn/nebulab/output.log'

# The user/group to run unicorn as
user 'www-data', 'www-data'
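
The pidfile naming used in the after_fork hook can be checked outside of Unicorn. The sketch below reuses the same substitution; worker_pidfile and write_worker_pidfile are hypothetical helper names, and writing the file with File.write instead of shelling out to echo is a swapped-in alternative, not what the configuration above does.

```ruby
require "tmpdir"

# Derive a worker pidfile path from the master pidfile path, using the same
# substitution as the after_fork hook. (Hypothetical helper name.)
def worker_pidfile(master_pidfile, worker_nr)
  master_pidfile.sub(/pid$/, "worker.#{worker_nr}.pid")
end

# Write the pid with File.write rather than spawning a shell for `echo`.
# (An alternative to the hook above, shown for illustration only.)
def write_worker_pidfile(master_pidfile, worker_nr, pid = Process.pid)
  path = worker_pidfile(master_pidfile, worker_nr)
  File.write(path, "#{pid}\n")
  path
end

puts worker_pidfile("/var/run/unicorn/nebulab/pid", 0)
# prints /var/run/unicorn/nebulab/worker.0.pid

# Try the writer in a temporary directory so nothing under /var/run is touched.
Dir.mktmpdir do |dir|
  path = write_worker_pidfile(File.join(dir, "pid"), 1)
  puts "#{File.basename(path)} contains #{File.read(path).strip}"
end
```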

We now need a way to signal each worker when to die… Since we already have an init.d script, we will edit it to support signaling a single worker. A unicorn worker can be gracefully killed by sending it a QUIT signal, and we can do that with the init.d script:

#!/bin/sh
set -e

# Include Bundler path
PATH=$PATH:/usr/local/bin

### Unicorn Variables ###
TIMEOUT=60
APP_ROOT=/var/www/nebulab/current
PID_PATH=/var/run/unicorn/nebulab
PID=$PID_PATH/pid
CMD="bundle exec unicorn -D -c /etc/unicorn/nebulab.rb -E production"
INIT_CONF=$APP_ROOT/config/init.conf

action="$1"
set -u

test -f "$INIT_CONF" && . $INIT_CONF

old_pid="$PID.oldbin"

cd $APP_ROOT || exit 1

sig () {
  test -s "$PID" && kill -$1 `cat $PID`
}

oldsig () {
  test -s $old_pid && kill -$1 `cat $old_pid`
}

workersig () {
  workerpid=$PID_PATH/worker.$2.pid
  test -s "$workerpid" && kill -$1 `cat $workerpid`
}

create_pid_path () {
  test -d $PID_PATH || ( mkdir -p $PID_PATH && chown www-data:www-data $PID_PATH )
}

case $action in
start)
  create_pid_path
  sig 0 && echo >&2 "Already running" && exit 0
  $CMD
;;
stop)
  sig QUIT && exit 0
  echo >&2 "Not running"
;;
force-stop)
  sig TERM && exit 0
  echo >&2 "Not running"
;;
restart|reload)
  sig HUP && echo reloaded OK && exit 0
  echo >&2 "Couldn't reload, starting '$CMD' instead"
  $CMD
;;
upgrade)
  sig USR2 && exit 0
  echo >&2 "Couldn't upgrade, starting '$CMD' instead"
  $CMD
;;
status)
  sig 0 && echo "running with pid `cat $PID`" && exit 0
  echo stopped && exit 1
;;
kill_worker)
  workersig QUIT $2 && exit 0
  echo >&2 "Worker not running"
;;
reopen-logs)
  sig USR1
;;
*)
  echo >&2 "Usage: $0 <start|stop|status|restart|upgrade|force-stop|reopen-logs|kill_worker>"
  exit 1
;;
esac

We added a kill_worker command to the init script that sends the QUIT signal so that we can kill a worker by running:

$ /etc/init.d/nebulab kill_worker nr

where nr is the number of the worker. This gives us a way of killing a worker without worrying about lost connections, and the unicorn master process will make sure to spawn a brand new worker.
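
For readers who prefer Ruby to shell, the workersig function can be sketched roughly as follows. signal_worker is a hypothetical name, and the forked child below merely stands in for a Unicorn worker: it traps QUIT, finishes its loop and exits, which is only an approximation of the graceful shutdown Unicorn performs.

```ruby
require "tmpdir"

# Ruby counterpart of the init script's workersig (hypothetical helper).
# Returns true if the signal was sent, false if the pidfile is missing or
# empty, holds an invalid pid, or the process no longer exists.
def signal_worker(pid_dir, worker_nr, signal = "QUIT")
  pidfile = File.join(pid_dir, "worker.#{worker_nr}.pid")
  return false unless File.size?(pidfile) # same check as `test -s`
  pid = File.read(pidfile).to_i
  return false unless pid > 0             # never signal pid 0 (the whole process group)
  Process.kill(signal, pid)
  true
rescue Errno::ESRCH
  false
end

# Demo: a stand-in worker that exits cleanly once it receives QUIT.
Dir.mktmpdir do |dir|
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    quitting = false
    trap("QUIT") { quitting = true }
    writer.puts "ready"      # tell the parent the trap is installed
    writer.close
    sleep 0.05 until quitting
    exit!(0)
  end
  writer.close
  reader.gets                # wait for the child to be ready
  File.write(File.join(dir, "worker.0.pid"), "#{pid}\n")

  signal_worker(dir, 0)
  Process.wait(pid)
  puts "worker exited with status #{$?.exitstatus}"
end
```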

Now that we have the correct pieces in place, we can configure monit. We should have a configuration for each of the worker processes. The nebulab application will have two monit configurations (one for each worker). Below is the configuration for worker 0:

check process nebulab_worker_0 with pidfile /var/run/unicorn/nebulab/worker.0.pid
  alert root@localhost only on { pid }
  if changed pid 2 times within 60 cycles then alert
  if memory usage > 16% for 1 cycles then
    exec "/etc/init.d/nebulab kill_worker 0"
  if cpu is greater than 50% for 2 cycles then
    exec "/etc/init.d/nebulab kill_worker 0"

Assuming monit polls every 30 seconds (one cycle), this will ensure that our worker gets gracefully killed if:

  • it takes more than 16% memory for 30 seconds
  • it takes more than 50% cpu for 1 minute
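
Since every worker needs its own, nearly identical stanza, the per-worker configurations could also be generated instead of written by hand. The following is only a sketch: worker_monitrc and its keyword arguments are hypothetical, and the thresholds default to the values used above.

```ruby
# Render the monit stanza for one worker. (Hypothetical helper; the defaults
# mirror the worker 0 configuration shown above.)
def worker_monitrc(app, worker_nr, pid_dir: "/var/run/unicorn/#{app}", mem_pct: 16, cpu_pct: 50)
  <<~MONIT
    check process #{app}_worker_#{worker_nr} with pidfile #{pid_dir}/worker.#{worker_nr}.pid
      alert root@localhost only on { pid }
      if changed pid 2 times within 60 cycles then alert
      if memory usage > #{mem_pct}% for 1 cycles then
        exec "/etc/init.d/#{app} kill_worker #{worker_nr}"
      if cpu is greater than #{cpu_pct}% for 2 cycles then
        exec "/etc/init.d/#{app} kill_worker #{worker_nr}"
  MONIT
end

# One stanza per worker; the count matches worker_processes in the unicorn config.
2.times { |nr| puts worker_monitrc("nebulab", nr) }
```

Keeping the worker count in one place makes it harder to forget a stanza when worker_processes changes.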

Since we expect the worker to be killed often, we don’t want to be notified every time something happens to the workers:

alert root@localhost only on { pid }
if changed pid 2 times within 60 cycles then alert

This will notify us only of pid changes (when the worker is killed and a new one is spawned), and only when the pid has changed at least twice in 30 minutes. There is no need to spam our inbox: the more mail we receive, the more we risk disregarding a serious problem as “just another of monit’s alerts”. This way, we only get an email when a worker goes crazy for some reason and keeps eating up cpu/memory, thus being continuously restarted.
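
The time spans quoted above follow from the number of cycles in each rule. The 30-second poll interval below is an assumption inferred from those figures (1 cycle = 30 seconds, 60 cycles = 30 minutes); the actual value depends on how the monit daemon is configured.

```ruby
# Assumed poll interval, inferred from the article's numbers.
POLL_INTERVAL_SECONDS = 30

# Convert a number of monit cycles into seconds. (Hypothetical helper.)
def cycles_to_seconds(cycles, interval = POLL_INTERVAL_SECONDS)
  cycles * interval
end

puts cycles_to_seconds(1)        # memory rule, in seconds: prints 30
puts cycles_to_seconds(2)        # cpu rule, in seconds: prints 60
puts cycles_to_seconds(60) / 60  # pid-change window, in minutes: prints 30
```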

Once you have everything in place you should be able to restart monit and wait for the magic to happen… You can keep track of what monit is doing with the monit status command.
