[tor-relays] Massive CPU load on high capacity guard node

tor-relays · November 19, 2021, 9:41am

Hello Everybody,

my relay is now almost two weeks old and has the following flags:

Fast, Guard, Running, Stable, V2Dir, Valid.

I lost the HSDir flag because I had to restart the Tor process. My downtime was just a few seconds, which may be why I kept the Guard flag.

I was expecting a drop in traffic when I got the Guard flag (as mentioned in the FAQ), but the opposite happened.

At the moment there are around 15000 active connections, over 11000 inbound and just 4000 outbound. I looked at the connections in Nyx, and it seems that my relay is indeed used as a Guard node (most of the IPs are “scrubbed” and the outgoing connections are to middle nodes).
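The same inbound/outbound split can also be checked without Nyx. The following is a minimal sketch that assumes Linux's /proc/net/tcp format (IPv4 only) and a hypothetical ORPort of 9001. It counts established connections to the ORPort as inbound and everything else as outbound.

```python
"""Split established IPv4 TCP connections into inbound and outbound.

Assumptions: Linux /proc/net/tcp format and a hypothetical ORPort of 9001
(use the ORPort from your own torrc). IPv6 (/proc/net/tcp6) is ignored.
"""

OR_PORT = 9001        # hypothetical ORPort, not taken from the thread
ESTABLISHED = "01"    # state code for ESTABLISHED in /proc/net/tcp


def classify(proc_net_tcp: str, or_port: int = OR_PORT) -> dict:
    counts = {"inbound": 0, "outbound": 0}
    for line in proc_net_tcp.splitlines()[1:]:      # first line is the header
        fields = line.split()
        if len(fields) < 4 or fields[3] != ESTABLISHED:
            continue                                 # skip listeners, TIME_WAIT, ...
        local_port = int(fields[1].rsplit(":", 1)[1], 16)   # ports are hex
        if local_port == or_port:
            counts["inbound"] += 1    # someone connected to our ORPort
        else:
            # Mostly Tor's own connections to other relays, but this also
            # catches anything else on the box (e.g. an SSH session).
            counts["outbound"] += 1
    return counts


if __name__ == "__main__":
    try:
        with open("/proc/net/tcp") as f:
            print(classify(f.read()))
    except FileNotFoundError:
        print("/proc/net/tcp not found; this sketch assumes Linux")
```

If the relay also has an IPv6 ORPort, /proc/net/tcp6 needs the same treatment.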

Before I got the Guard flag, I had around 5000 connections at the same time and was relaying traffic at peaks of 55MB/s. My server is connected to a Gigabit link.

It’s not a regular VPS, I have a dedicated CPU with two cores and dedicated 8GB RAM. Traffic is unlimited.

The problem is that I’m now relaying traffic at ~25MB/s, and whenever there are spikes of over 30MB/s the CPU load on both cores (!) is very high.

I’m still moving ~5TB per day. That’s a lot, I know, but my server’s internet connection could handle even more.

My server has two dedicated CPU cores of an AMD EPYC 7702, but unfortunately I only get the base frequency of 2GHz inside the VM, not the boost frequency of 3.35GHz (the information on the hoster’s website is misleading).

I could relay far more traffic if it weren’t for this CPU load issue. The CPU is the bottleneck; the 1Gbit link is guaranteed.

I read in the FAQ that a modern CPU with hardware acceleration can relay traffic at ~500Mbit in both directions. The EPYC 7702 supports AES-NI, and I checked that it is enabled in my VM.
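One way to confirm that the guest actually sees AES-NI is to look for the `aes` flag in /proc/cpuinfo. Here is a minimal sketch, assuming a Linux x86 guest where the CPU flags appear on a line starting with `flags`:

```python
"""Check whether the CPU flags visible inside the VM include AES-NI.

Assumption: Linux on x86, where /proc/cpuinfo has a "flags" line per CPU
and AES-NI shows up as the exact token "aes".
"""


def has_aes_ni(cpuinfo: str) -> bool:
    for line in cpuinfo.splitlines():
        if line.startswith("flags"):
            # e.g. "flags\t\t: fpu vme ... aes ... avx2"; compare whole tokens
            return "aes" in line.split(":", 1)[1].split()
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("AES-NI visible in VM:", has_aes_ni(f.read()))
    except FileNotFoundError:
        print("/proc/cpuinfo not found; this sketch assumes Linux")
```

For a rough idea of raw AES throughput inside the VM, `openssl speed -evp aes-128-ctr` is a complementary check. The numbers depend on the OpenSSL build.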

I’m running Debian 11 Bullseye and tweaked the networking settings following some instructions from torservers.net (mostly sysctl.conf tweaks).

There is no additional software installed that uses a lot of resources, just a few tools.

Here is a screenshot of Glances during a traffic peak (I set the Tor process to +10 on purpose):

https://i.ibb.co/8brmZkf/glances.png

The average CPU load is ~1.50. That is still OK for a dual core, but it should stay below 2.0 (at least, it should not stay above 2.0 for more than a few minutes).
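The rule of thumb here is that the load average should stay below the number of cores. That is the same as saying load per core should stay below 1.0. The following is a minimal sketch using Python's `os.getloadavg()` and `os.cpu_count()`, which are Unix only. On Linux the load average also counts tasks in uninterruptible sleep (often waiting on I/O), so it is not a pure CPU measure.

```python
import os


def load_per_core(load: float, cores: int) -> float:
    """Load average divided by core count; sustained values above 1.0
    mean more tasks want to run than there are cores to run them."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


if __name__ == "__main__":
    if hasattr(os, "getloadavg"):
        one, five, fifteen = os.getloadavg()
        cores = os.cpu_count() or 1
        for label, load in (("1 min", one), ("5 min", five), ("15 min", fifteen)):
            print(f"{label}: {load:.2f} total, {load_per_core(load, cores):.2f} per core")
    else:
        print("os.getloadavg() is not available on this platform")
```

With the figures from the post, `load_per_core(1.5, 2)` is 0.75, which is below the 1.0 mark.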

Does anyone here have an idea what I could do?

Since the load on both cores is pretty high, I don’t think it makes much sense to set up a second relay on the same server.

Of course I could throttle the traffic, but is there anything else I can do? I rented this rather expensive server to help the Tor network with a really fast Guard node…

Thank you everyone for your time and responses!

Have a great weekend!

Best Regards,

Elias

lists · November 19, 2021, 8:21pm

> Hello Everybody,
>
> my relay is now almost two weeks old and has the following flags:
> Fast, Guard, Running, Stable, V2Dir, Valid.

> I lost the HSDir flag because I had to restart the Tor process. My downtime
> was just a few seconds, which may be why I kept the Guard flag.

This is normal: the HSDir flag is always lost after a reboot or restart. The
other flags remain.

> At the moment there are around 15000 active connections, over 11000 inbound
> and just 4000 outbound. I looked at the connections in Nyx, and it seems
> that my relay is indeed used as a Guard node (most of the IPs are
> "scrubbed" and the outgoing connections are to middle nodes).

> Before I got the Guard flag, I had around 5000 connections at the same time
> and was relaying traffic at peaks of 55MB/s. My server is connected to a
> Gigabit link. It's not a regular VPS, I have a dedicated CPU with two cores
> and dedicated 8GB RAM. Traffic is unlimited.

Many VMs with 1G are still throttled. You share the server bandwidth with all
other VM customers.

> The problem is that I'm now relaying traffic at ~25MB/s, and whenever there
> are spikes of over 30MB/s the CPU load on both cores (!) is very high. I'm
> still moving ~5TB per day. That's a lot, I know, but my server's internet
> connection could handle even more.

~5TB per day ≈ 150 TB/month
You usually don't even get that on a dedicated bare-metal root server that
costs $30-100 a month. One of my hosters limited bandwidth to 300Mbit after
10TB of traffic.
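The unit conversions behind these figures are easy to mix up: MB/s versus Mbit/s, and per day versus per month. Here is a small sketch, assuming decimal units (1 TB = 10^6 MB) and a 30-day month:

```python
SECONDS_PER_DAY = 86_400


def mbit_per_s(mb_per_s: float) -> float:
    """Megabytes per second to megabits per second."""
    return mb_per_s * 8


def tb_per_day(mb_per_s: float) -> float:
    """Sustained MB/s to TB per day, decimal units (1 TB = 1,000,000 MB)."""
    return mb_per_s * SECONDS_PER_DAY / 1_000_000


def tb_per_month(tb_day: float, days: int = 30) -> float:
    """Daily volume to monthly volume for a month of `days` days."""
    return tb_day * days


if __name__ == "__main__":
    print(f"55 MB/s = {mbit_per_s(55):.0f} Mbit/s")   # pre-Guard peak
    print(f"25 MB/s = {mbit_per_s(25):.0f} Mbit/s")   # current rate
    print(f"25 MB/s sustained = {tb_per_day(25):.2f} TB/day in one direction")
    print(f"5 TB/day = {tb_per_month(5):.0f} TB per 30-day month")
```

A relay writes roughly as much as it reads. At 25 MB/s each way, that comes to about 4.3 TB/day combined. The thread does not say whether the ~5TB/day figure counts one direction or both.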

Uh, welcome to the club.
Because of DDoS, I have had 40 cores at around 90% for weeks. Until 3 weeks
ago the ixgbe driver was killed every 2-3 days. I hope I have solved the
problem now.

> My server has two dedicated CPU cores of an AMD EPYC 7702, but unfortunately
> I only get the base frequency of 2GHz inside the VM, not the boost
> frequency of 3.35GHz (the information on the hoster's website is misleading).

> I could relay far more traffic if it weren't for this CPU load issue. The
> CPU is the bottleneck; the 1Gbit link is guaranteed.

> I read in the FAQ that a modern CPU with hardware acceleration can relay
> traffic at ~500Mbit in both directions. The EPYC 7702 supports AES-NI, and
> I checked that it is enabled in my VM.

> I'm running Debian 11 Bullseye and tweaked the networking settings following
> some instructions from torservers.net (mostly sysctl.conf tweaks).

The old stuff from their GitHub?
I would delete it again. You are in a VM, and the torservers.net sysctl.conf
settings are over 10 years old! (A joke by niftybunny: from the days when low
traffic meant RFC 2549.) A 1G NIC has long been standard. With Debian 9, 10
and 11 I have only used the default sysctl settings, i.e. no tweaks at all.
tcp_syncookies has also been enabled by default in Debian for many, many years.
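Before reverting to the defaults, it can help to see exactly which keys the old tweaks override and what the kernel is currently using. The following is a minimal sketch built on two assumptions. The first is the plain `key = value` syntax of sysctl.conf(5), with `#` or `;` comments. The second is the usual mapping of a key such as `net.ipv4.tcp_syncookies` to /proc/sys/net/ipv4/tcp_syncookies. Files under /etc/sysctl.d/ would need the same treatment.

```python
from pathlib import Path


def parse_sysctl_conf(text: str) -> dict:
    """Return {key: value} for every assignment in a sysctl.conf-style file."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue                      # blank line or comment
        if "=" not in line:
            continue                      # not a key = value assignment
        key, value = (part.strip() for part in line.split("=", 1))
        settings[key] = value
    return settings


def live_value(key: str) -> str:
    """Current kernel value, read from /proc/sys (Linux only).

    Keys whose interface names contain dots (e.g. VLANs) are not handled.
    """
    return (Path("/proc/sys") / key.replace(".", "/")).read_text().strip()


if __name__ == "__main__":
    conf = Path("/etc/sysctl.conf")
    if conf.exists():
        for key, wanted in parse_sysctl_conf(conf.read_text()).items():
            try:
                current = live_value(key)
            except OSError:
                current = "<unreadable>"
            print(f"{key}: file={wanted} live={current}")
```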

> The average CPU load is ~1.50. That is still OK for a dual core, but it
> should stay below 2.0 (at least, it should not stay above 2.0 for more than
> a few minutes).

> Does anyone here have an idea what I could do?
> Since the load on both cores is pretty high, I don't think it makes much
> sense to set up a second relay on the same server.

Maybe this helps:

I have iptables-persistent on my guard servers. Sample rules:
https://github.com/boldsuck/tor-relay-bootstrap/tree/master/etc/iptables

or try

MaxAdvertisedBandwidth
   If set, we will not advertise more than this amount of bandwidth
   for our BandwidthRate. Server operators who want to reduce the
   number of clients who ask to build circuits through them (since
   this is proportional to advertised bandwidth rate) can thus reduce
   the CPU demands on their server without impacting network performance
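The man-page text says client load scales with advertised bandwidth. Taken literally, a relay's expected share of new circuits is its weight divided by the sum of all weights. The toy sketch below models only that. In the live network, clients weight relays by the consensus weights that the bandwidth authorities measure, so the effect of MaxAdvertisedBandwidth is indirect. The relay names and numbers are made up.

```python
from __future__ import annotations


def expected_share(weights: dict[str, float], relay: str) -> float:
    """Fraction of selections a relay gets when picks are proportional to weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return weights[relay] / total


if __name__ == "__main__":
    # Hypothetical weights, e.g. in MiB/s; "mine" is the relay being tuned.
    before = {"mine": 125.0, "relay_a": 250.0, "relay_b": 125.0}
    after = dict(before, mine=62.5)          # halve the advertised value
    print(f"before: {expected_share(before, 'mine'):.1%}")
    print(f"after:  {expected_share(after, 'mine'):.1%}")
```

Halving the weight cuts the share from 25% to about 14.3%, not 12.5%, because the total shrinks as well.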


On Friday, November 19, 2021 10:41:27 AM CET Elias via tor-relays wrote:

> Of course I could throttle the traffic, but is there anything else I can do?
> I rented this rather expensive server to help the Tor network with a really
> fast Guard node...

--
╰_╯ Ciao Marco!

Johan_Nilsson · November 20, 2021, 7:57am

It does, in my experience. Run two relays on the machine and set
RelayBandwidthRate to the same value for both. Look at the total
throughput, not the peak of a single relay.
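One likely reason this helps is that tor does most of its cell processing on a single main thread. Only some work, such as circuit handshakes, is handed to worker threads. A second instance can therefore make use of the second core. The following is a minimal sketch that renders matching torrc fragments for two instances. The option names are real torrc options. The nicknames, ports, data directories, and the 30/40 MBytes rate and burst are hypothetical placeholders.

```python
def render_torrc(nickname: str, or_port: int, data_dir: str,
                 rate: str, burst: str) -> str:
    """Render a minimal torrc fragment for one relay instance."""
    return "\n".join([
        f"Nickname {nickname}",
        f"ORPort {or_port}",
        f"DataDirectory {data_dir}",
        f"RelayBandwidthRate {rate}",    # same value for every instance
        f"RelayBandwidthBurst {burst}",
    ]) + "\n"


if __name__ == "__main__":
    rate, burst = "30 MBytes", "40 MBytes"   # hypothetical; identical for both instances
    for i, port in enumerate((9001, 9002), start=1):
        print(f"# --- instance {i} ---")
        print(render_torrc(f"ExampleRelay{i}", port,
                           f"/var/lib/tor-instances/relay{i}", rate, burst))
```

Two relays run by the same operator should also list each other in MyFamily, so that clients never pick both for the same circuit.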

Regards,
Johan


On Fri, Nov 19, 2021 at 09:41:27AM +0000, Elias via tor-relays wrote:

> Does anyone here have an idea what I could do?
> Since the load on both cores is pretty high, I don't think it makes
> much sense to set up a second relay on the same server.

_______________________________________________
tor-relays mailing list
tor-relays@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-relays

tor-relays · November 19, 2021, 9:48pm

First of all, thank you very much for your response!

> This is normal: the HSDir flag is always lost after a reboot or restart. The
> other flags remain.
I know; it wouldn’t even bother me if I lost the Guard flag.
The Tor network can decide whatever it wants to use my relay for.

> Many VMs with 1G are still throttled. You share the server bandwidth with all
> other VM customers.
This one is not. The hoster sells this machine as a “Root Server”; it’s actually
connected to a 2.5Gbit link. The 1Gbit speed is guaranteed, and before I
set up the relay I ran multiple speed tests: I definitely get 1Gbit.

>> The problem is that I’m now relaying traffic at ~25MB/s, and whenever there
>> are spikes of over 30MB/s the CPU load on both cores (!) is very high. I’m
>> still moving ~5TB per day. That’s a lot, I know, but my server’s internet
>> connection could handle even more.

> ~5TB per day ≈ 150 TB/month
> You usually don’t even get that on a dedicated bare-metal root server that
> costs $30-100 a month. One of my hosters limited bandwidth to 300Mbit after
> 10TB of traffic.
I paid close attention to any limit rules, and there is one, but I can’t
actually trigger it: they limit my bandwidth to 200Mbit only when I have used
more than 120TB of traffic within one month and at the same time (!) averaged
more than 1Gbit for more than 60 minutes. I set MaxAdvertisedBandwidth to
1000Mbit, so I will never get throttled by the hoster.
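The hoster's rule only applies when both conditions hold at once. Below is a minimal sketch of that check. It assumes per-minute average rates in Mbit/s, a sliding 60-minute window, and decimal TB; the thread does not say how the hoster actually measures. Note also the man-page text quoted earlier in the thread: MaxAdvertisedBandwidth only limits what the relay advertises. The options that cap actual relayed throughput are BandwidthRate and RelayBandwidthRate.

```python
from __future__ import annotations


def sustained_over(rates_mbit: list[float], threshold: float, window: int) -> bool:
    """True if any `window` consecutive samples average above `threshold`."""
    if window < 1 or len(rates_mbit) < window:
        return False
    total = sum(rates_mbit[:window])
    if total / window > threshold:
        return True
    for i in range(window, len(rates_mbit)):
        # Slide the window one sample forward: add the newest, drop the oldest.
        total += rates_mbit[i] - rates_mbit[i - window]
        if total / window > threshold:
            return True
    return False


def would_throttle(month_tb: float, rates_mbit: list[float],
                   cap_tb: float = 120.0, threshold_mbit: float = 1000.0,
                   window_min: int = 60) -> bool:
    """Both conditions must hold: volume over the cap AND a sustained high rate."""
    return month_tb > cap_tb and sustained_over(rates_mbit, threshold_mbit, window_min)


if __name__ == "__main__":
    # Hypothetical month: ~460 Mbit/s around the clock is roughly 150 TB in 30 days,
    # over the volume cap, but never above 1 Gbit/s for an hour.
    minutes = [460.0] * 60 * 24 * 30
    print(would_throttle(150.0, minutes))   # False
```

At a constant 1 Gbit/s (125 MB/s) a relay moves about 10.8 TB per day, so the 120 TB volume condition alone would be met in roughly 11 days.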

> Uh, welcome to the club.
> Because of DDoS, I have had 40 cores at around 90% for weeks. Until 3 weeks
> ago the ixgbe driver was killed every 2-3 days. I hope I have solved the
> problem now.
Yeah, and this wasn’t even a DDoS. If I don’t change my config, it’s pretty
easy to knock my server off the internet with a small-scale DDoS. And we
both know they do this, especially to high-capacity Guard nodes…
I secured the server as well as I could before it went online, but there is no
real DDoS protection in place, and it seems I need it.
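Per-IP connection-limit firewall rules act on one quantity: how many connections a single address holds to the ORPort. The following is a minimal sketch that measures this from Linux's /proc/net/tcp (IPv4 only). The ORPort 9001 and the limit of 4 are hypothetical and are not taken from the rules linked in the thread. Many legitimate clients can share one address behind NAT, so a high count is a reason to look closer, not proof of abuse.

```python
from __future__ import annotations

import socket
import struct
from collections import Counter

OR_PORT = 9001   # hypothetical ORPort
LIMIT = 4        # hypothetical per-IP threshold, not taken from the linked rules


def remote_ip(hex_addr: str) -> str:
    """Decode an IPv4 address as stored in /proc/net/tcp (hex, host byte order)."""
    return socket.inet_ntoa(struct.pack("=I", int(hex_addr, 16)))


def heavy_hitters(proc_net_tcp: str, or_port: int = OR_PORT,
                  limit: int = LIMIT) -> dict[str, int]:
    """Remote IPs holding more than `limit` established connections to or_port."""
    per_ip = Counter()
    for line in proc_net_tcp.splitlines()[1:]:          # skip the header row
        fields = line.split()
        if len(fields) < 4 or fields[3] != "01":        # 01 = ESTABLISHED
            continue
        if int(fields[1].rsplit(":", 1)[1], 16) != or_port:
            continue                                     # only inbound ORPort connections
        per_ip[remote_ip(fields[2].rsplit(":", 1)[0])] += 1
    return {ip: n for ip, n in per_ip.items() if n > limit}


if __name__ == "__main__":
    try:
        with open("/proc/net/tcp") as f:
            for ip, n in sorted(heavy_hitters(f.read()).items(), key=lambda kv: -kv[1]):
                print(f"{ip}: {n} connections")
    except FileNotFoundError:
        print("/proc/net/tcp not found; this sketch assumes Linux")
```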

> The old stuff from their GitHub?
> I would delete it again. You are in a VM, and the torservers.net sysctl.conf
> settings are over 10 years old!
The old stuff from this mailing list. But you’re right, that stuff was from 2010;
I will revert to the defaults.

> I have iptables-persistent on my guard servers. Sample rules:
> https://github.com/boldsuck/tor-relay-bootstrap/tree/master/etc/iptables
Thank you, I’ll give that a try!

> If set, we will not advertise more than this amount of bandwidth
> for our BandwidthRate. Server operators who want to reduce the
> number of clients who ask to build circuits through them (since
> this is proportional to advertised bandwidth rate) can thus reduce
> the CPU demands on their server without impacting network performance
This will be my next step if the iptables rules have no effect.
At the moment I advertise 125 MiB, which is obviously very optimistic…
I have by far the fastest relay at this hoster in terms of bandwidth, but
that’s nothing to be proud of if the relay crashes or is overloaded all
the time.
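The thread uses both "1000Mbit" and "125 MiB". These figures are close but not equal, because a mebibyte is 2^20 bytes while "Mbit" means 10^6 bits. Here is a small conversion sketch:

```python
MIB = 2 ** 20          # bytes in a mebibyte (binary prefix)
BITS_PER_BYTE = 8


def mib_s_to_mbit_s(mib_per_s: float) -> float:
    """MiB/s to Mbit/s (1 Mbit = 10**6 bits)."""
    return mib_per_s * MIB * BITS_PER_BYTE / 10 ** 6


def mbit_s_to_mib_s(mbit_per_s: float) -> float:
    """Mbit/s to MiB/s."""
    return mbit_per_s * 10 ** 6 / BITS_PER_BYTE / MIB


if __name__ == "__main__":
    print(f"125 MiB/s   = {mib_s_to_mbit_s(125):.1f} Mbit/s")   # 1048.6
    print(f"1000 Mbit/s = {mbit_s_to_mib_s(1000):.1f} MiB/s")   # 119.2
```

Advertising 125 MiB/s is therefore slightly more than the guaranteed 1 Gbit/s. It is also about five times the ~25MB/s the relay currently sustains.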

Thanks again for your suggestions!

All the best!
Elias