Writing exporters

If you are instrumenting your own code, the general rules of how to instrument code with a Prometheus client library should be followed. When taking metrics from another monitoring or instrumentation system, things tend not to be so black and white.

This document contains things you should consider when writing an exporter or custom collector. The theory covered will also be of interest to those doing direct instrumentation.

If you are writing an exporter and are unclear on anything here, please contact us on IRC (#prometheus on Freenode) or the mailing list.

Maintainability and purity

The main decision you need to make when writing an exporter is how much work you’re willing to put in to get perfect metrics out of it.

If the system in question has only a handful of metrics that rarely change, then getting everything perfect is an easy choice; a good example of this is the HAProxy exporter.

On the other hand, if you try to get things perfect when the system has hundreds of metrics that change frequently with new versions, then you’ve signed yourself up for a lot of ongoing work. The MySQL exporter is on this end of the spectrum.

The node exporter is a mix of these, with complexity varying by module. For example, the mdadm collector hand-parses a file and exposes metrics created specifically for that collector, so we may as well get the metrics right. For the meminfo collector the results vary across kernel versions so we end up doing just enough of a transform to create valid metrics.

Configuration

When working with applications, you should aim for an exporter that requires no custom configuration by the user beyond telling it where the application is. You may also need to offer the ability to filter out certain metrics if they may be too granular and expensive on large setups; for example, the HAProxy exporter allows filtering of per-server stats. Similarly, there may be expensive metrics that are disabled by default.

When working with other monitoring systems, frameworks and protocols you will often need to provide additional configuration or customization to generate metrics suitable for Prometheus. In the best case scenario, a monitoring system has a similar enough data model to Prometheus that you can automatically determine how to transform metrics. This is the case for Cloudwatch, SNMP and collectd. At most, we need the ability to let the user select which metrics they want to pull out.

In other cases, metrics from the system are completely non-standard, depending on the usage of the system and the underlying application. In that case the user has to tell us how to transform the metrics. The JMX exporter is the worst offender here, with the Graphite and StatsD exporters also requiring configuration to extract labels.

Ensuring the exporter works out of the box without configuration, and providing a selection of example configurations for transformation if required, is advised.

YAML is the standard Prometheus configuration format; all configuration should use YAML by default.

Metrics

Naming

Follow the best practices on metric naming.

Generally metric names should allow someone who is familiar with Prometheus but not a particular system to make a good guess as to what a metric means. A metric named http_requests_total is not extremely useful - are these being measured as they come in, in some filter or when they get to the user’s code? And requests_total is even worse: what type of requests?

With direct instrumentation, a given metric should exist within exactly one file. Accordingly, within exporters and collectors, a metric should apply to exactly one subsystem and be named accordingly.

Metric names should never be procedurally generated, except when writing a custom collector or exporter.

Metric names for applications should generally be prefixed by the exporter name, e.g. haproxy_up.

Metrics must use base units (e.g. seconds, bytes) and leave converting them to something more readable to graphing tools. No matter what units you end up using, the units in the metric name must match the units in use. Similarly, expose ratios, not percentages. Even better, specify a counter for each of the two components of the ratio.
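
For example, if the system being exported reports latency in milliseconds, convert it before exposing it. A minimal sketch using the Python client; the metric and function names are illustrative rather than taken from any real exporter:

    from prometheus_client.core import GaugeMetricFamily

    def latency_metric(latency_ms):
        # The underlying system reports milliseconds; expose base units
        # (seconds) and say so in the metric name.
        return GaugeMetricFamily(
            'myapp_request_latency_seconds',
            'Request latency reported by the application.',
            value=latency_ms / 1000.0)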

Metric names should not include the labels that they’re exported with, e.g. by_type, as that won’t make sense if the label is aggregated away.

The one exception is when you’re exporting the same data with different labels via multiple metrics, in which case that’s usually the sanest way to distinguish them. For direct instrumentation, this should only come up when exporting a single metric with all the labels would have too high a cardinality.

Prometheus metrics and label names are written in snake_case. Converting camelCase to snake_case is desirable, though doing so automatically doesn’t always produce nice results for things like myTCPExample or isNaN so sometimes it’s best to leave them as-is.
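
To illustrate why, here is what a naive automatic conversion does to those names. The helper below is only a sketch, not part of any client library:

    import re

    def camel_to_snake(name):
        # Naive conversion: underscore before each uppercase letter (not at the start).
        return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

    camel_to_snake('myTCPExample')  # -> 'my_t_c_p_example', not nice
    camel_to_snake('isNaN')         # -> 'is_na_n', also not nice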

Exposed metrics should not contain colons; these are reserved for user-defined recording rules to use when aggregating.

Only [a-zA-Z0-9:_] are valid in metric names; any other characters should be sanitized to an underscore.
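
A sanitization step can be as simple as the following sketch (the helper name is illustrative):

    import re

    def sanitize_metric_name(name):
        # Replace anything outside [a-zA-Z0-9:_] with an underscore.
        return re.sub(r'[^a-zA-Z0-9:_]', '_', name)

    sanitize_metric_name('my.metric-name')  # -> 'my_metric_name'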

The _sum, _count, _bucket and _total suffixes are used by Summaries, Histograms and Counters. Unless you’re producing one of those, avoid these suffixes.

_total is a convention for counters; you should use it if you’re using the COUNTER type.

The process and scrape prefixes are reserved. It’s okay to add your own prefix on to these if they follow the matching semantics. For example, Prometheus has scrape_duration_seconds for how long a scrape took; it’s good practice to also have an exporter-centric metric, e.g. jmx_scrape_duration_seconds, saying how long the specific exporter took to do its thing. For process stats where you have access to the PID, both Go and Python offer collectors that’ll handle this for you. A good example of this is the HAProxy exporter.

When you have a successful request count and a failed request count, the best way to expose this is as one metric for total requests and another metric for failed requests. This makes it easy to calculate the failure ratio. Do not use one metric with a failed or success label. Similarly, with hit or miss for caches, it’s better to have one metric for total and another for hits.
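
As a sketch with the Python client, using illustrative names, the pair of counters might be built like this; the failure ratio is then rate(myapp_requests_failed_total[5m]) / rate(myapp_requests_total[5m]) in PromQL:

    from prometheus_client.core import CounterMetricFamily

    def request_metrics(total, failed):
        # Two metrics rather than one metric with a success/failed label.
        yield CounterMetricFamily(
            'myapp_requests_total', 'Total requests processed.', value=total)
        yield CounterMetricFamily(
            'myapp_requests_failed_total', 'Requests that failed.', value=failed)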

Consider the likelihood that someone using monitoring will do a code or web search for the metric name. If the names are very well-established and unlikely to be used outside of the realm of people used to those names, for example SNMP and network engineers, then leaving them as-is may be a good idea. This logic doesn’t apply for all exporters, for example the MySQL exporter metrics may be used by a variety of people, not just DBAs. A HELP string with the original name can provide most of the same benefits as using the original names.

Labels

Read the general advice on labels.

Avoid type as a label name; it’s too generic and often meaningless. You should also try where possible to avoid names that are likely to clash with target labels, such as region, zone, cluster, availability_zone, az, datacenter, dc, owner, customer, stage, service, environment and env. If, however, that’s what the application calls some resource, it’s best not to cause confusion by renaming it.

Avoid the temptation to put things into one metric just because they share a prefix. Unless you’re sure something makes sense as one metric, multiple metrics are safer.

The label le has special meaning for Histograms, and quantile for Summaries. Avoid these labels generally.

Read/write and send/receive are best as separate metrics, rather than as a label. This is usually because you care about only one of them at a time, and it is easier to use them that way.

The rule of thumb is that one metric should make sense when summed or averaged. There is one other case that comes up with exporters, and that’s where the data is fundamentally tabular and doing otherwise would require users to do regexes on metric names to be usable. Consider the voltage sensors on your motherboard: while doing math across them is meaningless, it makes sense to have them in one metric rather than having one metric per sensor. All values within a metric should (almost) always have the same unit; for example, consider if fan speeds were mixed in with the voltages, and you had no way to automatically separate them.

Don’t do this:

    my_metric{label=a} 1
    my_metric{label=b} 6
    my_metric{label=total} 7

or this:

    my_metric{label=a} 1
    my_metric{label=b} 6
    my_metric{} 7

The former breaks for people who do a sum() over your metric, and the latter breaks sum and is quite difficult to work with. Some client libraries, for example Go, will actively try to stop you doing the latter in a custom collector, and all client libraries should stop you from doing the latter with direct instrumentation. Never do either of these; rely on Prometheus aggregation instead.

If your monitoring exposes a total like this, drop the total. If you have to keep it around for some reason, for example the total includes things not counted individually, use different metric names.

Instrumentation labels should be minimal; every extra label is one more that users need to consider when writing their PromQL. Accordingly, avoid having instrumentation labels which could be removed without affecting the uniqueness of the time series. Additional information around a metric can be added via an info metric; for an example, see below how to handle version numbers.

However, there are cases where it is expected that virtually all users of a metric will want the additional information. If so, adding a non-unique label, rather than an info metric, is the right solution. For example, the mysqld_exporter’s mysqld_perf_schema_events_statements_total has a digest label that is a hash of the full query pattern and is sufficient for uniqueness. However, it is of little use without the human-readable digest_text label, which for long queries will contain only the start of the query pattern and is thus not unique. Thus we end up with both the digest_text label for humans and the digest label for uniqueness.
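
A sketch of that shape with the Python client, using made-up label values purely for illustration:

    from prometheus_client.core import CounterMetricFamily

    # One label for uniqueness (digest) plus a human-readable, non-unique
    # companion (digest_text). Help string and values are illustrative.
    c = CounterMetricFamily(
        'mysqld_perf_schema_events_statements_total',
        'Statement events by normalized query digest.',
        labels=['digest', 'digest_text'])
    c.add_metric(['abc123', 'SELECT * FROM users WHERE ...'], 1042)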

Target labels, not static scraped labels

If you ever find yourself wanting to apply the same label to all of your metrics, stop.

There are generally two cases where this comes up.

The first is when there is some label that would be useful to have on all the metrics, such as the version number of the software. Instead, use the approach described at https://www.robustperception.io/how-to-have-labels-for-machine-roles/.
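
In practice that means exposing a single info-style metric with the value 1 and joining it onto other series in PromQL when needed. A minimal sketch with the Python client, with illustrative names and values:

    from prometheus_client import Info

    # Exposed as something like: myapp_build_info{version="1.2.3",revision="abc123"} 1
    build_info = Info('myapp_build', 'Build information for myapp.')
    build_info.info({'version': '1.2.3', 'revision': 'abc123'})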

The second case is when a label is really a target label. These are things like region, cluster names, and so on, that come from your infrastructure setup rather than the application itself. It’s not for an application to say where it fits in your label taxonomy; that’s for the person running the Prometheus server to configure, and different people monitoring the same application may give it different names.

Accordingly, these labels belong up in the scrape configs of Prometheus via whatever service discovery you’re using. It’s okay to apply the concept of machine roles here as well, as it’s likely useful information for at least some people scraping it.

Types

You should try to match up the types of your metrics to Prometheus types. This usually means counters and gauges. The _count and _sum of summaries are also relatively common, and on occasion you’ll see quantiles. Histograms are rare; if you come across one, remember that the exposition format exposes cumulative values.

Often it won’t be obvious what the type of metric is, especially if you’re automatically processing a set of metrics. In general UNTYPED is a safe default.

Counters can’t go down, so if you have a counter type coming from another instrumentation system that can be decremented, for example Dropwizard metrics, then it’s not a counter, it’s a gauge. UNTYPED is probably the best type to use there, as GAUGE would be misleading if it were being used as a counter.

Help strings

When you’re transforming metrics it’s useful for users to be able to track back to what the original was, and what rules were in play that caused that transformation. Putting the name of the collector or exporter, the ID of any rule that was applied, and the name and details of the original metric into the help string will greatly aid users.

Prometheus doesn’t like one metric having different help strings. If you’re making one metric from many others, choose one of them to put in the help string.

For examples of this, the SNMP exporter uses the OID and the JMX exporter puts in a sample mBean name. The HAProxy exporter has hand-written strings. The node exporter also has a wide variety of examples.

Drop less useful statistics

Some instrumentation systems expose 1m, 5m, 15m rates, average rates since application start (these are called mean in Dropwizard metrics for example) in addition to minimums, maximums and standard deviations.

These should all be dropped, as they’re not very useful and add clutter. Prometheus can calculate rates itself, and usually more accurately, as the averages exposed are usually exponentially decaying. You don’t know what time the min or max were calculated over, and the standard deviation is statistically useless; you can always expose sum of squares, _sum and _count if you ever need to calculate it.

Quantiles have related issues; you may choose to drop them or put them in a Summary.

Dotted strings

Many monitoring systems don’t have labels, instead doing things like my.class.path.mymetric.labelvalue1.labelvalue2.labelvalue3.

The Graphite and StatsD exporters share a way of transforming these with a small configuration language. Other exporters should implement the same. The transformation is currently implemented only in Go, and would benefit from being factored out into a separate library.

Collectors

When implementing the collector for your exporter, you should never use the usual direct instrumentation approach and then update the metrics on each scrape.

Rather, create new metrics each time. In Go this is done with MustNewConstMetric in your Update() method. For Python see https://github.com/prometheus/client_python#custom-collectors and for Java generate a List<MetricFamilySamples> in your collect method; see StandardExports.java for an example.
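
For instance, a minimal Python custom collector along those lines might look like the following sketch; the metric names and the hard-coded stats stand in for whatever you fetch from the application:

    import time

    from prometheus_client import start_http_server
    from prometheus_client.core import (CounterMetricFamily, GaugeMetricFamily,
                                        REGISTRY)

    class MyAppCollector(object):
        """Builds fresh metrics on every scrape instead of mutating globals."""

        def collect(self):
            # Hypothetical stats fetched from the application being exported.
            stats = {'queue_length': 7, 'requests_total': 1024}
            yield GaugeMetricFamily(
                'myapp_queue_length', 'Current length of the work queue.',
                value=stats['queue_length'])
            c = CounterMetricFamily(
                'myapp_requests_total', 'Requests handled by the application.',
                labels=['handler'])
            c.add_metric(['index'], stats['requests_total'])
            yield c

    if __name__ == '__main__':
        REGISTRY.register(MyAppCollector())
        start_http_server(8000)  # Serves /metrics on port 8000.
        while True:
            time.sleep(60)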

The reason for this is two-fold. Firstly, two scrapes could happen at the same time, and direct instrumentation uses what are effectively file-level global variables, so you’ll get race conditions. Secondly, if a label value disappears, it’ll still be exported.

Instrumenting your exporter itself via direct instrumentation is fine, e.g. total bytes transferred or calls performed by the exporter across all scrapes. For exporters such as the blackbox exporter and SNMP exporter, which aren’t tied to a single target, these should only be exposed on a vanilla /metrics call, not on a scrape of a particular target.

Metrics about the scrape itself

Sometimes you’d like to export metrics that are about the scrape, like how long it took or how many records you processed.

These should be exposed as gauges, as they’re about an event, the scrape, and the metric name should be prefixed by the exporter name, for example jmx_scrape_duration_seconds. Usually the _exporter is excluded, and if the exporter also makes sense to use as just a collector, then definitely exclude it.
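
As a sketch of this, a collector written with the Python client can time its own work and expose it alongside the other metrics; the myapp prefix is illustrative:

    import time

    from prometheus_client.core import GaugeMetricFamily

    class MyAppCollector(object):
        def collect(self):
            start = time.time()
            # ... fetch stats from the application and yield its metrics here ...
            yield GaugeMetricFamily(
                'myapp_scrape_duration_seconds',
                'Time this scrape of the underlying application took.',
                value=time.time() - start)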

Machine and process metrics

Many systems, for example Elasticsearch, expose machine metrics such as CPU, memory and filesystem information. As the node exporter provides these in the Prometheus ecosystem, such metrics should be dropped.

In the Java world, many instrumentation frameworks expose process-level and JVM-level stats such as CPU and GC. The Java client and JMX exporter already include these in the preferred form via DefaultExports.java, so these should also be dropped.

Similarly with other languages and frameworks.

Deployment

Each exporter should monitor exactly one application instance, preferably sitting right beside it on the same machine. That means for every HAProxy you run, you run a haproxy_exporter process. For every machine with a Mesos worker, you run the Mesos exporter on it, and another one for the master, if a machine has both.

The theory behind this is that for direct instrumentation this is what you’d be doing, and we’re trying to get as close to that as we can in other layouts. This means that all service discovery is done in Prometheus, not in exporters. This also has the benefit that Prometheus has the target information it needs to allow users to probe your service with the blackbox exporter.

There are two exceptions:

The first is where running beside the application you’re monitoring is completely nonsensical. The SNMP, blackbox and IPMI exporters are the main examples of this. The IPMI and SNMP exporters because the devices are often black boxes that it’s impossible to run code on (though if you could run a node exporter on them instead that’d be better), and the blackbox exporter because you’re monitoring something like a DNS name, where there’s also nothing to run on. In this case, Prometheus should still do service discovery, and pass on the target to be scraped. See the blackbox and SNMP exporters for examples.

Note that it is only currently possible to write this type of exporter with the Go, Python and Java client libraries.

The second exception is where you’re pulling some stats out of a random instance of a system and don’t care which one you’re talking to. Consider a set of MySQL replicas you want to run some business queries against, the results of which you then export. Having an exporter that uses your usual load balancing approach to talk to one replica is the sanest approach.

This doesn’t apply when you’re monitoring a system with master-election; in that case you should monitor each instance individually and deal with the "masterness" in Prometheus. This is because there isn’t always exactly one master, and changing what a target is underneath Prometheus’s feet will cause oddities.

Scheduling

Metrics should only be pulled from the application when Prometheus scrapes them; exporters should not perform scrapes based on their own timers. That is, all scrapes should be synchronous.

Accordingly, you should not set timestamps on the metrics you expose; let Prometheus take care of that. If you think you need timestamps, then you probably need the Pushgateway instead.

If a metric is particularly expensive to retrieve, i.e. takes more than a minute, it is acceptable to cache it. This should be noted in the HELP string.

The default scrape timeout for Prometheus is 10 seconds. If your exporter can be expected to exceed this, you should explicitly call this out in your user documentation.
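
A cached collector could look roughly like the following sketch; the TTL, names and query are all illustrative, and the caching is called out in the help string as suggested above:

    import time

    from prometheus_client.core import GaugeMetricFamily

    CACHE_TTL_SECONDS = 300  # Refresh at most every five minutes.

    class CachedCollector(object):
        def __init__(self):
            self._value = None
            self._fetched_at = 0.0

        def _expensive_query(self):
            # Stands in for a query that takes well over a minute to run.
            return 42.0

        def collect(self):
            now = time.time()
            if self._value is None or now - self._fetched_at > CACHE_TTL_SECONDS:
                self._value = self._expensive_query()
                self._fetched_at = now
            yield GaugeMetricFamily(
                'myapp_expensive_stat',
                'Expensive statistic (cached; refreshed at most every 5 minutes).',
                value=self._value)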

Pushes

Some applications and monitoring systems only push metrics, for example StatsD, Graphite and collectd.

There are two considerations here.

Firstly, when do you expire metrics? Collectd and things talking to Graphite both export regularly, and when they stop we want to stop exposing the metrics. Collectd includes an expiry time so we use that; Graphite doesn’t, so it is a flag on the exporter.

StatsD is a bit different, as it is dealing with events rather than metrics. The best model is to run one exporter beside each application and restart them when the application restarts so that the state is cleared.

Secondly, these sorts of systems tend to allow your users to send either deltas or raw counters. You should rely on the raw counters as far as possible, as that’s the general Prometheus model.

For service-level metrics, e.g. service-level batch jobs, you should have your exporter push into the Pushgateway and exit after the event rather than handling the state yourself. For instance-level batch metrics, there is no clear pattern yet. The options are either to abuse the node exporter’s textfile collector, rely on in-memory state (probably best if you don’t need to persist over a reboot) or implement similar functionality to the textfile collector.
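
With the Python client, pushing from a batch job at the end of its run looks roughly like this; the job and metric names and the Pushgateway address are illustrative:

    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

    registry = CollectorRegistry()
    g = Gauge('myjob_last_success_timestamp_seconds',
              'Unixtime the batch job last succeeded.', registry=registry)
    g.set_to_current_time()
    # Push once when the job finishes, then exit; keep no state in the exporter.
    push_to_gateway('pushgateway.example.org:9091', job='myjob', registry=registry)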

Failed scrapes

There are currently two patterns for failed scrapes where the application you’re talking to doesn’t respond or has other problems.

The first is to return a 5xx error.

The second is to have a myexporter_up, e.g. haproxy_up, variable that has a value of 0 or 1 depending on whether the scrape worked.

The latter is better where there are still some useful metrics you can get even with a failed scrape, such as the HAProxy exporter providing process stats. The former is a tad easier for users to deal with, as up works in the usual way, although you can’t distinguish between the exporter being down and the application being down.
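
A sketch of the second pattern with the Python client; fetch_stats and the metric names are illustrative placeholders for however your exporter talks to the application:

    from prometheus_client.core import GaugeMetricFamily

    def fetch_stats():
        # Stands in for the call to the monitored application; raises on failure.
        return {'current_connections': 12}

    class MyAppCollector(object):
        def collect(self):
            up = 1.0
            try:
                stats = fetch_stats()
                yield GaugeMetricFamily(
                    'myapp_current_connections',
                    'Connections currently open in the application.',
                    value=stats['current_connections'])
            except Exception:
                up = 0.0
            # 1 if the scrape of the application worked, 0 if it failed.
            yield GaugeMetricFamily(
                'myapp_up',
                'Whether the last scrape of the application succeeded.',
                value=up)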

Landing page

It’s nicer for users if visiting http://yourexporter/ has a simple HTML page with the name of the exporter, and a link to the /metrics page.
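
One way to do this, sketched with the Python client and the standard library WSGI server; the paths, port and HTML are illustrative:

    from wsgiref.simple_server import make_server

    from prometheus_client import make_wsgi_app

    metrics_app = make_wsgi_app()

    LANDING_PAGE = (b'<html><head><title>My Exporter</title></head>'
                    b'<body><h1>My Exporter</h1>'
                    b'<p><a href="/metrics">Metrics</a></p></body></html>')

    def app(environ, start_response):
        # Serve the metrics on /metrics and a simple landing page elsewhere.
        if environ.get('PATH_INFO') == '/metrics':
            return metrics_app(environ, start_response)
        start_response('200 OK', [('Content-Type', 'text/html')])
        return [LANDING_PAGE]

    if __name__ == '__main__':
        make_server('', 8000, app).serve_forever()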

Port numbers

A user may have many exporters and Prometheus components on the same machine, so to make that easier each has a unique port number.

https://github.com/prometheus/prometheus/wiki/Default-port-allocations is where we track them; this is publicly editable.

Feel free to grab the next free port number when developing your exporter, preferably before publicly announcing it. If you’re not ready to release yet, putting your username and WIP is fine.

This is a registry to make our users’ lives a little easier, not a commitment to develop particular exporters. For exporters for internal applications we recommend using ports outside of the range of default port allocations.

Announcing

Once you’re ready to announce your exporter to the world, email the mailing list and send a PR to add it to the list of available exporters.