[tor-relays] MetricsPort: tor_relay_connections_total type confusion

Hello David,

again, thanks for your work on adding more metrics to tor's MetricsPort!
Many relay operators will love this and documentation will be useful [1].

I reported

which got closed yesterday, but there was likely a misunderstanding and
the changes did not solve the underlying issue.

The core issue is: The metric called
tor_relay_connections(_total) [2][3]
contains a mix of different types (gauge, counter).

When mixing types in a single metric, the TYPE definition will always be
wrong for one or the other value.
For example grafana will show this if you use a counter metric without
rate():
"Selected metric is a counter. Consider calculating rate of counter by
adding rate()."

It is a best practice to avoid mixing different types in a single metric.
From the prometheus best practices [4]:
"As a rule of thumb, either the sum() or the avg() over all dimensions of
a given metric should be meaningful (though not necessarily useful). If it
is not meaningful, split the data up into multiple metrics. For example,
having the capacity of various queues in one metric is good, while mixing
the capacity of a queue with the current number of elements in the queue
is not."

An idea to address the underlying issue:
One connection metric for counter and one for gauge:

- tor_relay_connections_total for counters, like the current label
state="created"
- tor_relay_connections for gauge metrics, like the current label
state="opened". "rejected" also appears to be a gauge metric.

Another nice feature of these metrics would be to have a label for what
type of system is connecting (src="relay", src="non-relay") - more on that
in yesterday's email.
A tool by toralf [4] also shows these and uses the source IP but tor
itself does not need to look at the source IP to determine the type,
something discussed in last week's relay operator meetup.

best regards,
nat

[1] Add connection stats to MetricsPort documentation (#312) · Issues · The Tor Project / Web / support · GitLab
[2]
relay: Change the connection metrics name (06a26f18) · Commits · The Tor Project / Core / Tor · GitLab
[3]
metrics: Treat relay connections as gauge, not counter (6d40e980) · Commits · The Tor Project / Core / Tor · GitLab
[4] Metric and label naming | Prometheus

···

_______________________________________________
tor-relays mailing list
tor-relays@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-relays

Hello David,

again, thanks for your work on adding more metrics to tor's MetricsPort!
Many relay operators will love this and documentation will be useful [1].

I reported
MetricsPort: tor_relay_connections incorrect type / mixture of types in single metric (#40699) · Issues · The Tor Project / Core / Tor · GitLab
which got closed yesterday, but there was likely a misunderstanding and
the changes did not solve the underlying issue.

The core issue is: The metric called
tor_relay_connections(_total) [2][3]
contains a mix of different types (gauge, counter).

The patch I got in makes all the tor_relay_connections_total{} metrics
"gauges" because in effect, some can go up and down and some might only go up
but I figured even the one that only accumulate can also be gauges.

Is that a problem to your knowledge from a Prometheus standpoint?

Cheers!
David

···

On 28 Oct (11:04:09), nat@danwin1210.de wrote:

When mixing types in a single metric, the TYPE definition will always be
wrong for one or the other value.
For example grafana will show this if you use a counter metric without
rate():
"Selected metric is a counter. Consider calculating rate of counter by
adding rate()."

It is a best practice to avoid mixing different types in a single metric.
From the prometheus best practices [4]:
"As a rule of thumb, either the sum() or the avg() over all dimensions of
a given metric should be meaningful (though not necessarily useful). If it
is not meaningful, split the data up into multiple metrics. For example,
having the capacity of various queues in one metric is good, while mixing
the capacity of a queue with the current number of elements in the queue
is not."

An idea to address the underlying issue:
One connection metric for counter and one for gauge:

- tor_relay_connections_total for counters, like the current label
state="created"
- tor_relay_connections for gauge metrics, like the current label
state="opened". "rejected" also appears to be a gauge metric.

Another nice feature of these metrics would be to have a label for what
type of system is connecting (src="relay", src="non-relay") - more on that
in yesterday's email.
A tool by toralf [4] also shows these and uses the source IP but tor
itself does not need to look at the source IP to determine the type,
something discussed in last week's relay operator meetup.

best regards,
nat

[1] Add connection stats to MetricsPort documentation (#312) · Issues · The Tor Project / Web / support · GitLab
[2]
relay: Change the connection metrics name (06a26f18) · Commits · The Tor Project / Core / Tor · GitLab
[3]
metrics: Treat relay connections as gauge, not counter (6d40e980) · Commits · The Tor Project / Core / Tor · GitLab
[4] Metric and label naming | Prometheus

--
RRmWZi+kxUk/ehwUda6Z6UE/zsCYNl2ts0zzPswJAPI=

to remind ourselves about the prometheus types and their differences
and when to use what:

The patch I got in makes all the tor_relay_connections_total{} metrics
"gauges" because in effect, some can go up and down and some might only go
up but I figured even the one that only accumulate can also be gauges.

you are right that prometheus will scrape it anyway even if the type is
incorrectly defined but by defining a counter as gauge (or vice versa) you
tell people what it is and how to make sense of it, for example the type
implies "do (not) use rate() on this".

By looking at

# TYPE tor_relay_connections_total gauge
[...]

Anyone familiar with prometheus would expect it to be a gauge
- so for example would not use rate() on it - but that is not correct for
all labels.
For example creating graphs for state="created" without using rate() would
produce boring graphs because it is a counter. They might also wonder why
a gauge ends with .._total, because that is used for accumulating count
[1].

Is that a problem to your knowledge from a Prometheus standpoint?

It is a problem because prometheus users/tools are used to
consume that TYPE line to learn what type it is.
They will get confused if they treat it according to the current type
definition.
graphing counters as if they were gauges will result in questions
like "oh why is my rate of x always increasing?"

before:
# TYPE tor_relay_connections counter
tor_relay_connections{type="OR",direction="initiated",state="opened"}

but that value is not a counter, it is a gauge. It can go down.

after yesterday's change:

# TYPE tor_relay_connections_total gauge
tor_relay_connections_total{type="OR",direction="received",state="created"}

but that is a counter, it can never go down.

To prevent this and to follow the usual prometheus practices
is best to have the current metric split into one for counters and
another one for gauges.

thanks for working on this!

best regards,
nat

[1] Metric and label naming | Prometheus

···

_______________________________________________
tor-relays mailing list
tor-relays@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-relays