In addition to this helpful guide, note that statsd / graphite both spring some unfortunate surprises on new users, e.g., graphite changing your data across retention rates and time scales [0], graphite changing your data at different plot widths (?!) [1], statsd believing that only count and time data deserve to be aggregated [2], etc.
I have no alternative to suggest, however. Perhaps Cube [3], but unclear if it has any user community.
Re [0]: If you never want your data downsampled, keep data at a single resolution which is equal to the flush interval used to push data to Graphite. Carbon will never "change your data" under such a configuration.
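Concretely, if statsd flushes every 10 seconds, a single-resolution storage-schemas.conf entry like this keeps every point at its native resolution, so carbon never has anything to downsample (the section name and retention length are just illustrations, match the pattern to your own metric namespace):

```ini
[statsd]
pattern = ^stats\.
retentions = 10s:30d
```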
Re [1]: How would you expect the presentation layer to present >n data points using n pixels?
Graphite doesn't "change your data". Presentation of data != the data itself, just as a map of a city != the city itself.
> If you never want your data downsampled, keep data at a single resolution...
Sure, and many people do exactly that. The point is that a new user to graphite is likely to be surprised by this behavior. (I would further bet that a reasonable fraction of statsd+graphite users end up viewing incorrect data without realizing it, especially given the statsd focus on count data, for which the default aggregationMethod setting is exactly the wrong choice.)
(And even awareness of this behavior isn't quite enough, since every user needs to also remember their server's exact storage configuration, lest they inadvertently expand their plot across a retention boundary.)
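For anyone bitten by the count-data problem above: counter metrics should be downsampled by summing, not by the default averaging, which you can set per-pattern in storage-aggregation.conf (the section name and pattern here are illustrative, adjust them to your statsd prefix):

```ini
[statsd_counts]
pattern = ^stats_counts\.
xFilesFactor = 0
aggregationMethod = sum
```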
> How would you expect the presentation layer to present >n data points using n pixels?
The same way that most plotting tools do so: by overdrawing. Yes, one ends up with a solid block of pixels if the data are noisy and the plot is small, but that outcome is easily understood and has the easily understood solution of explicitly aggregating appropriately. Graphite instead takes the approach of implicitly aggregating based on how wide the plot is rendered in a given interface. That behavior is, at the very least, surprising.
It's not a bug; carbon was behaving exactly the way it was configured to behave. This wouldn't be surprising to anyone who is familiar with RRDTool. However, since one of the reasons graphite uses its own file format (whisper) instead of RRD is to better handle intermittent values, I could see the argument that the default xFilesFactor should be higher.
"xFilesFactor should be a floating point number between 0 and 1, and specifies what fraction of the previous retention level’s slots must have non-null values in order to aggregate to a non-null value. The default is 0.5." - http://graphite.readthedocs.org/en/1.0/config-carbon.html
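So if your metrics are intermittent, you can loosen that per-pattern in storage-aggregation.conf. A 0 here means "aggregate to a non-null value even if only one slot in the window was non-null" (section name and pattern are made up for illustration):

```ini
[sparse_metrics]
pattern = \.sparse\.
xFilesFactor = 0
aggregationMethod = average
```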
Oh, yeah, it's weird but not even a serious limitation: it's being fixed, other statsd clones have more features, and you can always pretend your data are time intervals. statsd is limited but nicely simple.
One nitpick: you don't need to use statsd as an intermediary in order for your application to send metrics via UDP; just set ENABLE_UDP_LISTENER to True in carbon.conf and graphite will accept metrics over UDP itself. Other options are TCP (obviously) and AMQP.
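A minimal sketch of sending a metric straight to carbon over UDP from Python (the host and metric path are placeholders; 2003 is carbon's default plaintext port):

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_udp(path, value, host="graphite.example.com", port=2003):
    """Fire-and-forget one datapoint at carbon's UDP listener."""
    line = format_metric(path, value)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(line.encode("ascii"), (host, port))
    sock.close()
```

Since it's UDP there's no delivery guarantee, which is exactly the tradeoff you're making when you skip statsd anyway.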
I love how simple Graphite's plaintext protocol is; it's nothing more than a line of text with <metric path> <metric value> <metric timestamp>. This has led lots of software to integrate graphite support and makes it easy to do yourself. In a pinch I've even set up a cronjob reading a value from /proc and sending it to graphite via netcat.
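The /proc trick is a few lines in any language; here's a Python sketch of the same idea (the metric name and graphite host are assumptions, and it uses plain TCP to port 2003):

```python
import socket
import time

def loadavg_line(proc_text, now=None):
    """Turn the first field of /proc/loadavg into one Graphite plaintext line."""
    load1 = proc_text.split()[0]
    if now is None:
        now = int(time.time())
    return "system.loadavg.1min %s %d\n" % (load1, now)

def report_loadavg(host="graphite.example.com", port=2003):
    with open("/proc/loadavg") as f:
        line = loadavg_line(f.read())
    sock = socket.create_connection((host, port), timeout=5)
    sock.sendall(line.encode("ascii"))
    sock.close()
```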
Graphite shines at generating graphs, but its ability to return JSON is also very useful. For example, I've written a script (https://github.com/sciurus/grallect) that plugs into Nagios and generates alerts based on system metrics sent by Collectd to Graphite.
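As a sketch of that pattern (not grallect itself; the URL shape and threshold logic are placeholders), the render API with format=json returns each series as a list of [value, timestamp] pairs, which makes threshold checks straightforward:

```python
import json
from urllib.request import urlopen

def latest_value(series):
    """Most recent non-null value from one render-API series, or None."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None

def check_threshold(payload, warn):
    """payload is the parsed JSON list the render API returns;
    return (target, value) pairs whose latest value exceeds warn."""
    alerts = []
    for series in payload:
        value = latest_value(series)
        if value is not None and value > warn:
            alerts.append((series["target"], value))
    return alerts

def fetch(url):
    # e.g. http://graphite.example.com/render?target=...&format=json
    return json.loads(urlopen(url).read().decode("utf-8"))
```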
My two frustrations with graphite:
You have to choose a single aggregation method. I'd like to be able to store the average, minimum, and maximum values.
Sometimes I find it hard to query for the data I want. E.g., to check the percentage of space used on each filesystem I have to fetch example.com.df-*.df_complex-used and example.com.df-*.df_complex-free separately and calculate the percentages myself, because asPercent(example.com.df-*.df_complex-{used,free}) would combine all the filesystems.
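Until asPercent can pair series up per filesystem, the workaround is to fetch used and free with format=json and match them yourself on the df-<fs> path component. A sketch, assuming the collectd naming above:

```python
def used_percentages(used_series, free_series):
    """Pair df_complex-used/-free series by their df-<fs> path component
    and return {filesystem: percent used} from the latest datapoints."""
    def latest(series):
        for value, _ts in reversed(series["datapoints"]):
            if value is not None:
                return value
        return None

    def fs_name(target):
        # e.g. "example.com.df-root.df_complex-used" -> "df-root"
        return [part for part in target.split(".") if part.startswith("df-")][0]

    free_by_fs = {fs_name(s["target"]): latest(s) for s in free_series}
    result = {}
    for s in used_series:
        fs = fs_name(s["target"])
        used, free = latest(s), free_by_fs.get(fs)
        if used is not None and free is not None and used + free > 0:
            result[fs] = 100.0 * used / (used + free)
    return result
```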
Feel free to point out useful things graphite could do better (constructively only) and/or some of your favorite posts or tools used with graphite. We aren't too far off from two quite massive releases (0.9.11 / 0.10) and are thinking about departing from some of the legacy bits moving forward. I'm looking at you, Python 2.4.
The process of getting graphite web up and running on Mac seemed pretty involved since it's broken up into 3 packages and depends on Cairo which can be finicky.
First of all, thanks a lot for such an awesome tool!
I think graphite would greatly improve by having the "web" part split into different apps/packages (API, graphing and frontend/dashboard).
That way, people could install whatever they wanted. Imagine "only" having to improve the backend while other people create amazing dashboards (which, right now, is already happening anyway... there's such a fragmentation in the available frontends...)
I'd love to help with the split if you deem it worthy & if you need a hand :) I will create a GH issue anyway :D
I use a statsd-compatible alternative called statsite.[1]
It's written in pure C and behaves like you would expect statsd to, with some additional improvements. I'm definitely more comfortable deploying it than installing and managing a node.js application.
Interesting: 37signals released their own Go-based version of statsd, https://github.com/noahhl/go-batsd, probably for the same reasons you rewrote it in pure C.
The main reason being that StatsD will max out at about 10K OPS (unless they've improved it recently) whereas Statsite will reach 10 MM. Also, look at the difference between the implementation of sets. StatsD uses a JS object[1] versus statsite using a C implementation of HyperLogLog[2][3]. If you're doing anything significant, you should not be using the node.js version of StatsD.
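For the curious, the core of HyperLogLog fits in a few lines. This is a toy Python sketch of the algorithm, not statsite's C implementation: hash each item, use the top p bits to pick a register, and record the longest run of leading zeros seen in the remaining bits; the harmonic mean of the registers estimates the cardinality.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item
        h = int(hashlib.sha1(item.encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)     # top p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        # small-range correction: fall back to linear counting
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(float(self.m) / zeros)
        return int(est)
```

The memory cost is fixed (m small registers) no matter how many distinct items you feed it, which is exactly why it beats keeping every member in a JS object.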
CentOS isn't right for everyone, oddly because of the Enterprise focus for stability. The system libraries on CentOS end up being quite old from the viewpoint of a lot of developers.
Sometimes the best solution is to ignore the system provided libraries and build your own environment.
I only ran into one problem (and they quickly accepted my pull request to fix it) building and using RPMs from the spec files at https://github.com/dcarley/graphite-rpms
[0] http://stackoverflow.com/questions/10820119/graphite-is-not-...
[1] http://graphite.readthedocs.org/en/1.0/functions.html#graphi...
[2] https://github.com/etsy/statsd/issues/98
[3] https://github.com/square/cube