[tor-relays] How to reduce tor CPU load on a single bridge?

[I'm about to go off-line for some days, so I am sending my current
suboptimally-organized reply, which I hope is better than waiting another
week to respond :)]

Let's make a distinction between the "frontend" snowflake-server
pluggable transport process, and the "backend" tor process. These don't
necessarily have to be 1:1; either one could be run in multiple
instances. Currently, the "backend" tor is the limiting factor, because
it uses only 1 CPU core. The "frontend" snowflake-server can scale to
multiple cores in a single process and is comparatively unrestrained.

Excellent point, and yes, this simplifies things. Great.

I believe that the "pinning" of a client session to particular tor
instance will work automatically by the fact that snowflake-server keeps
an outgoing connection alive (i.e., through the load balancer) as long
as a KCP session exists.
[...]
But before starting the second instance the first time, copy keys from
the first instance:

Hm. It looks promising! But we might still have a Tor-side problem
remaining. I think it boils down to how long the KCP sessions last.

The details on how exactly these bridge instances will diverge over time:

The keys directory will start out the same, but after four weeks
(DEFAULT_ONION_KEY_LIFETIME_DAYS, used to be one week but in Tor
0.3.1.1-alpha, proposal 274, we bumped it up to four weeks) each
bridge will rotate its onion key (the one clients use for circuit-level
crypto). That is, each instance will generate its own fresh onion key.

The two bridge instances actually haven't diverged completely at that
point, since Tor remembers the previous onion key (i.e. the onion key
from the previous period) and is willing to receive create cells that
use it for one further week (DEFAULT_ONION_KEY_GRACE_PERIOD_DAYS). So it
is after 5 weeks that the original (shared) onion key will no longer work.

Where this matters is (after these 5 weeks have passed) if the client
connects to the bridge, fetches and caches the bridge descriptor of
instance A, and then later it connects to the bridge again and gets
passed to instance B. In this case, the create cell that the client
generates will use the onion key for instance A, and instance B won't
know how to decrypt it so it will send a destroy cell back.

If this is an issue, we can definitely work around it, by e.g. disabling
the onion key rotation on the bridges, or setting up a periodic rsync+hup
between the bridges, or teaching clients to use createfast cells in this
situation (this type of circuit crypto doesn't use the onion key at all,
and just relies on TLS for security -- which can only be done for the
first hop of the circuit but that's the one we're talking about here).
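
For example, the periodic rsync+hup workaround could be a small cron job on each secondary bridge instance; a rough sketch, assuming the primary is reachable as "bridge-primary" over SSH, a standard /var/lib/tor DataDirectory, and a systemd-managed tor (all of these are illustrative assumptions):

#!/bin/sh
# Sketch: copy the shared onion keys from the primary instance, then HUP tor
# so both instances keep answering create cells with the same onion keys.
rsync -a bridge-primary:/var/lib/tor/keys/secret_onion_key* /var/lib/tor/keys/
systemctl reload tor   # reload typically sends SIGHUP to the tor daemon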

But before we think about workarounds, maybe we don't need one: how long
does "the KCP session" last?

Tor clients try to fetch a fresh bridge descriptor every three-ish
hours, and once they fetch a bridge descriptor from their "current"
bridge instance, they should know the onion key that it wants to use. So
it is that up-to-three-hour window where I think things could go wrong.
And that timeframe sounds promising.

(I also want to double-check that clients don't try to use the onion
key from the current cached descriptor while fetching the updated
descriptor. That could become an ugly bug in the wrong circumstances,
and would be something we want to fix if it's happening.)

Here's how you can simulate a pair of bridge instances that have diverged
after five weeks, so you can test how things would work with them:

Copy the keys directory as before, but "rm secret_onion_key*" in the
keys directory on n-1 of the instances, before starting them.
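
Concretely, a sketch of that simulation with two instances (the data directories and systemd unit names are assumptions; only the identity keys remain shared):

systemctl stop tor@bridge1 tor@bridge2
rsync -a --delete /var/lib/tor-instances/bridge1/keys/ /var/lib/tor-instances/bridge2/keys/
rm /var/lib/tor-instances/bridge2/keys/secret_onion_key*   # bridge2 generates fresh onion keys on start
systemctl start tor@bridge1 tor@bridge2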

Thanks!
--Roger

···

On Thu, Dec 30, 2021 at 10:42:51PM -0700, David Fifield wrote:

_______________________________________________
tor-relays mailing list
tor-relays@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-relays

2 Likes

Kristian,

I am not really concerned about my IPs being blacklisted as these are normal relays, not bridges.

I suppose if you have the address space and are running your relays in a server environment, it’s your prerogative. In my case, I’m running my super relay from home with limited address space, so this setup is better suited to my needs.

In that area I am a little bit old school, and I am indeed running them manually for now. I don’t think there is a technical reason for it. It’s just me being me.

I’m a proponent of individuality. Keep being you.

Respectfully,

Gary

Hi Gary,

thanks!

As an aside… Presently, are you using a single, public address with many ports or many, public addresses with a single port for your Tor deployments? Have you ever considered putting all those Tor instances behind a single, public address:port (fingerprint) to create one super bridge/relay? I’m just wondering if it makes sense to conserve and rotate through public address space to stay ahead of the blacklisting curve?

Almost all of my dedicated servers have multiple IPv4 addresses, and you can have up to two tor relays per IPv4. So, the answer is multiple IPs and on multiple different ports. A “super relay” still has no real merit for me. I am not really concerned about my IPs being blacklisted as these are normal relays, not bridges.

What I am doing now for new servers is running them for a week or two as bridges and only then I move them over to hosting relays. In the past I have not seen a lot of traffic on bridges, but this has changed very recently. I saw 200+ unique users in the past 6 hours on one of my new bridges yesterday with close to 100 Mbit/s of consistent traffic. There appears to be an increased need right now, which I am happy to tend to.

Also… Do you mind disclosing what all your screen instances are for? Are you running your Tor instances manually and not in daemon mode? “Inquiring minds want to know.” :grin:

In that area I am a little bit old school, and I am indeed running them manually for now. I don’t think there is a technical reason for it. It’s just me being me.

Best Regards,

Kristian

Dec 29, 2021, 01:46 by tor-relays@lists.torproject.org:

···

On Wednesday, December 29, 2021, 03:32:55 AM MST, abuse— via tor-relays tor-relays@lists.torproject.org wrote:

Hi Kristian,

Thanks for the screenshot. Nice Machine! Not everyone is as fortunate as you when it comes to resources for their Tor deployments. While a cpu affinity option isn’t high on the priority list, as you point out, many operating systems do a decent job of load management and there are third-party options available for cpu affinity, but it might be helpful for some to have an application layer option to tune their implementations natively.

As an aside… Presently, are you using a single, public address with many ports or many, public addresses with a single port for your Tor deployments? Have you ever considered putting all those Tor instances behind a single, public address:port (fingerprint) to create one super bridge/relay? I’m just wondering if it makes sense to conserve and rotate through public address space to stay ahead of the blacklisting curve?

Also… Do you mind disclosing what all your screen instances are for? Are you running your Tor instances manually and not in daemon mode? “Inquiring minds want to know.” :grin:

As always… It is great to engage in dialogue with you.

Respectfully,

Gary

On Tuesday, December 28, 2021, 1:39:31 PM MST, abuse@lokodlare.com abuse@lokodlare.com wrote:

Hi Gary,

why would that be needed? Linux has a pretty good thread scheduler imo and will shuffle loads around as needed.

Even Windows’ thread scheduler is quite decent these days and tools like “Process Lasso” exist if additional fine tuning is needed.

Attached is one of my servers running multiple tor instances on a 12/24C platform. The load is spread quite evenly across all cores.

Best Regards,

Kristian

Dec 27, 2021, 22:08 by tor-relays@lists.torproject.org:

BTW… I just fact-checked my post-script and the cpu affinity configuration I was thinking of is for Nginx (not Tor). Tor should consider adding a cpu affinity configuration option. What happens if you configure additional Tor instances on the same machine (my Tor instances are on different machines) and start them up? Do they bind to a different or the same cpu core?

Respectfully,

Gary

On Monday, December 27, 2021, 2:44:59 PM MST, Gary C. New via tor-relays tor-relays@lists.torproject.org wrote:

David/Roger:

Search the tor-relay mail archive for my previous responses on loadbalancing Tor Relays, which I’ve been successfully doing for the past 6 months with Nginx (it’s possible to do with HAProxy as well). I haven’t had time to implement it with a Tor Bridge, but I assume it will be very similar. Keep in mind it’s critical to configure each Tor instance to use the same DirectoryAuthority and to disable the upstream timeouts on Nginx/HAProxy.

Happy Tor Loadbalancing!

Respectfully,

Gary

P.S. I believe there’s a torrc config option to specify which cpu core a given Tor instance should use, too.

On Monday, December 27, 2021, 2:00:50 PM MST, Roger Dingledine arma@torproject.org wrote:

On Mon, Dec 27, 2021 at 12:05:26PM -0700, David Fifield wrote:

I have the impression that tor cannot use more than one CPU core -- is that correct? If so, what can be done to permit a bridge to scale beyond 1×100% CPU? We can fairly easily scale the Snowflake-specific components around the tor process, but ultimately, a tor client process expects to connect to a bridge having a certain fingerprint, and that is the part I don’t know how to easily scale.

  • Surely it’s not possible to run multiple instances of tor with the same fingerprint? Or is it? Does the answer change if all instances are on the same IP address? If the OR ports are never used?

Good timing – Cecylia pointed out the higher load on Flakey a few days ago, and I’ve been meaning to post a suggestion somewhere. You actually can run more than one bridge with the same fingerprint. Just set it up in two places, with the same identity key, and then whichever one the client connects to, the client will be satisfied that it’s reaching the right bridge.

There are two catches to the idea:

(A) Even though the bridges will have the same identity key, they won’t have the same circuit-level onion key, so it will be smart to “pin” each client to a single bridge instance – so when they fetch the bridge descriptor, which specifies the onion key, they will continue to use that bridge instance with that onion key. Snowflake in particular might also want to pin clients to specific bridges because of the KCP state. (Another option, instead of pinning clients to specific instances, would be to try to share state among all the bridges on the backend, e.g. so they use the same onion key, can resume the same KCP sessions, etc. This option seems hard.)

(B) It’s been a long time since anybody tried this, so there might be surprises. :slight_smile: But it should work, so if there are surprises, we should try to fix them.

This overall idea is similar to the “router twins” idea from the distant distant past:

https://lists.torproject.org/pipermail/tor-dev/2002-July/001122.html
https://lists.torproject.org/pipermail/tor-commits/2003-October/024388.html
https://lists.torproject.org/pipermail/tor-dev/2003-August/000236.html

  • Removing the fingerprint from the snowflake Bridge line in Tor Browser would permit the Snowflake proxies to round-robin clients over several bridges, but then the first hop would be unauthenticated (at the Tor layer). It would be nice if it were possible to specify a small set of permitted bridge fingerprints.

This approach would also require clients to pin to a particular bridge, right? Because of the different state that each bridge will have?

–Roger


_______________________________________________
tor-relays mailing list
tor-relays@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-relays

1 Like

David, Roger, et al.,

I just got back from holidays and really enjoyed this thread!

I run my Loadbalanced Tor Relay as a Guard/Middle Relay, very similar to David’s topology diagram, without the Snowflake-Server proxy. I’m using Nginx (which forks a child process per core) instead of HAProxy. My Backend Tor Relay Nodes are running on several different Physical Servers; thus, I’m using Private Address Space instead of Loopback Address Space.

In this configuration, I discovered that I had to: configure Nginx/HAProxy to use Transparent Streaming Mode with Source IP Address Sticky Sessions (Pinning); configure the Loadbalancer (Kernel & IPTables) so the Backend Tor Relay Nodes’ return traffic is routed back through Nginx/HAProxy; give every Backend Tor Relay Node a copy of the same .tordb (I wasn’t able to get the Backend Tor Relay Nodes working with a shared .tordb over NFS without the DirectoryAuthorities complaining); and configure all Backend Tor Relay Nodes to use the same DirectoryAuthority, to ensure each node sends its Meta-Data to the same DirectoryAuthority. Moreover, I’ve enabled logging to a central Syslog Server for each Backend Tor Relay Node and created a number of Shell Scripts to help remotely manage each node.

Here are some sample configurations for reference.

Nginx Config:

upstream orport_tornodes {
#least_conn;
hash $remote_addr consistent;
#server 192.168.0.1:9001 weight=1 max_fails=1 fail_timeout=10s;
#server 192.168.0.1:9001 down;
server 192.168.0.11:9001 weight=4 max_fails=0 fail_timeout=0s;
server 192.168.0.21:9001 weight=4 max_fails=0 fail_timeout=0s;
#server 192.168.0.31:9001 weight=4 max_fails=3 fail_timeout=300s;
server 192.168.0.41:9001 weight=4 max_fails=0 fail_timeout=0s;
server 192.168.0.51:9001 weight=4 max_fails=0 fail_timeout=0s;
#zone orport_torfarm 64k;
}
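
For context, in Nginx the upstream above lives inside the stream {} context together with a server block that actually listens and forwards. A sketch of what that server block could look like (the listen address, the proxy_timeout value, and the transparent proxy_bind line are assumptions chosen to match the transparent-mode setup described here, not part of the posted config):

server {
listen 0.0.0.0:9001;
proxy_bind $remote_addr transparent;  # transparent streaming mode; needs the kernel/iptables setup below
proxy_timeout 24h;                    # effectively disable the idle timeout for long-lived circuits
proxy_pass orport_tornodes;
}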

HAProxy Config (Alternate):

frontend tornodes
    # Log to global config
    log global
    # Bind to port 443 on a specified interface
    bind 0.0.0.0:9001 transparent
    # We're proxying TCP here...
    mode tcp
    default_backend orport_tornodes

# Simple TCP source consistent over several servers using the specified
# source 0.0.0.0 usesrc clientip

backend orport_tornodes
    balance source
    hash-type consistent
    #server tornode1 192.168.0.1:9001 check disabled
    #server tornode11 192.168.0.11:9001 source 192.168.0.1
    server tornode11 192.168.0.11:9001 source 0.0.0.0 usesrc clientip check disabled
    server tornode21 192.168.0.21:9001 source 0.0.0.0 usesrc clientip check disabled
    #server tornode31 192.168.0.31:9001 source 0.0.0.0 usesrc clientip check disabled
    server tornode41 192.168.0.41:9001 source 0.0.0.0 usesrc clientip check disabled
    server tornode51 192.168.0.51:9001 source 0.0.0.0 usesrc clientip check disabled

Linux Kernel & IPTables Config:

modprobe xt_socket
modprobe xt_TPROXY

echo 1 > /proc/sys/net/ipv4/ip_forward; cat /proc/sys/net/ipv4/ip_forward
echo 1 > /proc/sys/net/ipv4/ip_nonlocal_bind; cat /proc/sys/net/ipv4/ip_nonlocal_bind
echo 15000 64000 > /proc/sys/net/ipv4/ip_local_port_range; cat /proc/sys/net/ipv4/ip_local_port_range

ip rule del fwmark 1 lookup 100 2>/dev/null # Ensure Duplicate Rule is not Created
ip rule add fwmark 1 lookup 100 # ip rule show
ip route add local 0.0.0.0/0 dev lo table 100 # ip route show table wan0; ip route show table 100

iptables -I INPUT -p tcp --dport 9001 -j ACCEPT
iptables -t mangle -N TOR
iptables -t mangle -A PREROUTING -p tcp -m socket -j TOR
iptables -t mangle -A TOR -j MARK --set-mark 1
iptables -t mangle -A TOR -j ACCEPT
#iptables -t mangle -A PREROUTING -p tcp -s 192.168.0.0/24 --sport 9001 -j MARK --set-xmark 0x1/0xffffffff
#iptables -t mangle -A PREROUTING -p tcp --dport 9001 -j TPROXY --tproxy-mark 0x1/0x1 --on-port 9001 --on-ip 127.0.0.1

Backend Tor Relay Node Configs:

cat /tmp/torrc

Nickname xxxxxxxxxxxxxxxxxx
ORPort xxx.xxx.xxx.xxx:9001 NoListen
ORPort 192.168.0.11:9001 NoAdvertise
SocksPort 9050
SocksPort 192.168.0.11:9050
ControlPort 9051
DirAuthority longclaw orport=443 no-v2 v3ident=23D15D965BC35114467363C165C4F724B64B4F66 199.58.81.140:80 74A910646BCEEFBCD2E874FC1DC997430F968145
FallbackDir 193.23.244.244:80 orport=443 id=7BE683E65D48141321C5ED92F075C55364AC7123
DirCache 0
ExitRelay 0
MaxMemInQueues 192 MB
GeoIPFile /opt/share/tor/geoip
Log notice file /tmp/torlog
Log notice syslog
VirtualAddrNetwork 10.192.0.0/10
AutomapHostsOnResolve 1
TransPort 192.168.0.11:9040
DNSPort 192.168.0.11:9053
RunAsDaemon 1
DataDirectory /tmp/tor/torrc.d/.tordb
AvoidDiskWrites 1
User tor
ContactInfo tor-operator@your-emailaddress-domain

cat /tmp/torrc

Nickname xxxxxxxxxxxxxxxxxx
ORPort xxx.xxx.xxx.xxx:9001 NoListen
ORPort 192.168.0.41:9001 NoAdvertise
SocksPort 9050
SocksPort 192.168.0.41:9050
ControlPort 9051
DirAuthority longclaw orport=443 no-v2 v3ident=23D15D965BC35114467363C165C4F724B64B4F66 199.58.81.140:80 74A910646BCEEFBCD2E874FC1DC997430F968145
FallbackDir 193.23.244.244:80 orport=443 id=7BE683E65D48141321C5ED92F075C55364AC7123
DirCache 0
ExitRelay 0
MaxMemInQueues 192 MB
GeoIPFile /opt/share/tor/geoip
Log notice file /tmp/torlog
Log notice syslog
VirtualAddrNetwork 10.192.0.0/10
AutomapHostsOnResolve 1
TransPort 192.168.0.41:9040
DNSPort 192.168.0.41:9053
RunAsDaemon 1
DataDirectory /tmp/tor/torrc.d/.tordb
AvoidDiskWrites 1
User tor
ContactInfo tor-operator@your-emailaddress-domain

Shell Scripts to Remotely Manage Tor Relay Nodes:

cat /usr/sbin/stat-tor-nodes

#!/bin/sh
uptime-all-nodes; memfree-all-nodes; netstat-tor-nodes

cat /usr/sbin/uptime-all-nodes

#!/bin/sh
/usr/bin/ssh -t admin@192.168.0.11 'hostname; uptime'
/usr/bin/ssh -t admin@192.168.0.21 'hostname; uptime'
/usr/bin/ssh -t admin@192.168.0.31 'hostname; uptime'
/usr/bin/ssh -t admin@192.168.0.41 'hostname; uptime'
/usr/bin/ssh -t admin@192.168.0.51 'hostname; uptime'

cat /usr/sbin/memfree-all-nodes

#!/bin/sh
/usr/bin/ssh -t admin@192.168.0.11 'hostname; grep MemFree /proc/meminfo'
/usr/bin/ssh -t admin@192.168.0.21 'hostname; grep MemFree /proc/meminfo'
/usr/bin/ssh -t admin@192.168.0.31 'hostname; grep MemFree /proc/meminfo'
/usr/bin/ssh -t admin@192.168.0.41 'hostname; grep MemFree /proc/meminfo'
/usr/bin/ssh -t admin@192.168.0.51 'hostname; grep MemFree /proc/meminfo'

cat /usr/sbin/netstat-tor-nodes

#!/bin/sh
/usr/bin/ssh -t admin@192.168.0.11 'hostname; netstat -anp | grep -i tor | grep -v 192.168.0.1: | wc -l'
/usr/bin/ssh -t admin@192.168.0.21 'hostname; netstat -anp | grep -i tor | grep -v 192.168.0.1: | wc -l'
/usr/bin/ssh -t admin@192.168.0.31 'hostname; netstat -anp | grep -i tor | grep -v 192.168.0.1: | wc -l'
/usr/bin/ssh -t admin@192.168.0.41 'hostname; netstat -anp | grep -i tor | grep -v 192.168.0.1: | wc -l'
/usr/bin/ssh -t admin@192.168.0.51 'hostname; netstat -anp | grep -i tor | grep -v 192.168.0.1: | wc -l'

cat /jffs/sbin/ps-tor-nodes

#!/bin/sh
/usr/bin/ssh -t admin@192.168.0.11 'hostname; ps w | grep -i tor'
/usr/bin/ssh -t admin@192.168.0.21 'hostname; ps w | grep -i tor'
/usr/bin/ssh -t admin@192.168.0.31 'hostname; ps w | grep -i tor'
/usr/bin/ssh -t admin@192.168.0.41 'hostname; ps w | grep -i tor'
/usr/bin/ssh -t admin@192.168.0.51 'hostname; ps w | grep -i tor'

cat /usr/sbin/killall-tor-nodes

#!/bin/sh
read -r -p "Are you sure? [y/N] " input
case "$input" in
[yY])
/usr/bin/ssh -t admin@192.168.0.11 'killall tor'
/usr/bin/ssh -t admin@192.168.0.21 'killall tor'
#/usr/bin/ssh -t admin@192.168.0.31 'killall tor'
/usr/bin/ssh -t admin@192.168.0.41 'killall tor'
/usr/bin/ssh -t admin@192.168.0.51 'killall tor'
exit 0
;;
*)
exit 1
;;
esac

cat /usr/sbin/restart-tor-nodes

#!/bin/sh
read -r -p "Are you sure? [y/N] " input
case "$input" in
[yY])
/usr/bin/ssh -t admin@192.168.0.11 '/usr/sbin/tor -f /tmp/torrc --quiet'
/usr/bin/ssh -t admin@192.168.0.21 '/usr/sbin/tor -f /tmp/torrc --quiet'
#/usr/bin/ssh -t admin@192.168.0.31 '/usr/sbin/tor -f /tmp/torrc --quiet'
/usr/bin/ssh -t admin@192.168.0.41 '/usr/sbin/tor -f /tmp/torrc --quiet'
/usr/bin/ssh -t admin@192.168.0.51 '/usr/sbin/tor -f /tmp/torrc --quiet'
exit 0
;;
*)
exit 1
;;
esac

I’ve been meaning to put together a tutorial on Loadbalancing Tor Relays, but haven’t found the time as of yet. Perhaps, this will help, until I am able to find the time.

I appreciate your knowledge sharing and for furthering the topic of Loadbalancing Tor Relays; especially, with regard to Bridging and Exit Relays.

Keep up the Great Work!

Respectfully,

Gary

···

On Tuesday, January 4, 2022, 09:57:52 PM MST, Roger Dingledine arma@torproject.org wrote:
On Thu, Dec 30, 2021 at 10:42:51PM -0700, David Fifield wrote:


tor-relays mailing list
tor-relays@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-relays

1 Like

Hm. It looks promising! But we might still have a Tor-side problem remaining. I think it boils down to how long the KCP sessions last.

The details on how exactly these bridge instances will diverge over time:

The keys directory will start out the same, but after four weeks (DEFAULT_ONION_KEY_LIFETIME_DAYS, used to be one week but in Tor 0.3.1.1-alpha, proposal 274, we bumped it up to four weeks) each bridge will rotate its onion key (the one clients use for circuit-level crypto). That is, each instance will generate its own fresh onion key.

The two bridge instances actually haven't diverged completely at that point, since Tor remembers the previous onion key (i.e. the onion key from the previous period) and is willing to receive create cells that use it for one further week (DEFAULT_ONION_KEY_GRACE_PERIOD_DAYS). So it is after 5 weeks that the original (shared) onion key will no longer work.

Where this matters is (after this 5 weeks have passed) if the client connects to the bridge, fetches and caches the bridge descriptor of instance A, and then later it connects to the bridge again and gets passed to instance B. In this case, the create cell that the client generates will use the onion key for instance A, and instance B won't know how to decrypt it so it will send a destroy cell back.

I've done an experiment with a second snowflake bridge that has the same identity keys but different onion keys. A client can bootstrap with either one starting from a clean state, but it fails if you bootstrap with one and then try to bootstrap with the other using the same DataDirectory. The error you get is

onion_skin_ntor_client_handshake(): Invalid result from curve25519 handshake: 4

The first bridge is the existing "prod" snowflake bridge with nickname:
* flakey

The other "staging" bridge is the load-balanced configuration with four instances. All four instances currently have the same onion keys; which however are different from the "prod"'s onion keys. (The onion keys actually come from a backup I made.)
* flakey1
* flakey2
* flakey3
* flakey4

Bootstrapping "prod" with a fresh DataDirectory "datadir.prod" works. Here is torrc.prod:

UseBridges 1
SocksPort auto
DataDirectory datadir.prod
ClientTransportPlugin snowflake exec ./client -keep-local-addresses -log snowflake.log
Bridge snowflake 192.0.2.3:1 2B280B23E1107BB62ABFC40DDCC8824814F80A72 url=https://snowflake-broker.torproject.net/ max=1 ice=stun:stun.voip.blackberry.com:3478,stun:stun.altar.com.pl:3478,stun:stun.antisip.com:3478,stun:stun.bluesip.net:3478,stun:stun.dus.net:3478,stun:stun.epygi.com:3478,stun:stun.sonetel.com:3478,stun:stun.sonetel.net:3478,stun:stun.stunprotocol.org:3478,stun:stun.uls.co.za:3478,stun:stun.voipgate.com:3478,stun:stun.voys.nl:3478

Notice `new bridge descriptor 'flakey' (fresh)`:

snowflake/client$ tor -f torrc.prod
[notice] Tor 0.3.5.16 running on Linux with Libevent 2.1.8-stable, OpenSSL 1.1.1d, Zlib 1.2.11, Liblzma 5.2.4, and Libzstd 1.3.8.
[notice] Bootstrapped 0%: Starting
[notice] Starting with guard context "bridges"
[notice] Delaying directory fetches: No running bridges
[notice] Bootstrapped 5%: Connecting to directory server
[notice] Bootstrapped 10%: Finishing handshake with directory server
[notice] Bootstrapped 15%: Establishing an encrypted directory connection
[notice] Bootstrapped 20%: Asking for networkstatus consensus
[notice] new bridge descriptor 'flakey' (fresh): $2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3
[notice] Bootstrapped 25%: Loading networkstatus consensus
[notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
[notice] Bootstrapped 40%: Loading authority key certs
[notice] The current consensus has no exit nodes. Tor can only build internal paths, such as paths to onion services.
[notice] Bootstrapped 45%: Asking for relay descriptors for internal paths
[notice] I learned some more directory information, but not enough to build a circuit: We need more microdescriptors: we have 0/6673, and can only build 0% of likely paths. (We have 100% of guards bw, 0% of midpoint bw, and 0% of end bw (no exits in consensus, using mid) = 0% of path bw.)
[notice] Bootstrapped 50%: Loading relay descriptors for internal paths
[notice] The current consensus contains exit nodes. Tor can build exit and internal paths.
[notice] Bootstrapped 57%: Loading relay descriptors
[notice] Bootstrapped 64%: Loading relay descriptors
[notice] Bootstrapped 73%: Loading relay descriptors
[notice] Bootstrapped 78%: Loading relay descriptors
[notice] Bootstrapped 80%: Connecting to the Tor network
[notice] Bootstrapped 85%: Finishing handshake with first hop
[notice] Bootstrapped 90%: Establishing a Tor circuit
[notice] Bootstrapped 100%: Done

Bootstrapping "staging" with a fresh DataDirectory "datadir.staging" also works. Here is torrc.staging:

UseBridges 1
SocksPort auto
DataDirectory datadir.staging
ClientTransportPlugin snowflake exec ./client -keep-local-addresses -log snowflake.log
Bridge snowflake 192.0.2.3:1 2B280B23E1107BB62ABFC40DDCC8824814F80A72 url=http://127.0.0.1:8000/ max=1 ice=stun:stun.voip.blackberry.com:3478,stun:stun.altar.com.pl:3478,stun:stun.antisip.com:3478,stun:stun.bluesip.net:3478,stun:stun.dus.net:3478,stun:stun.epygi.com:3478,stun:stun.sonetel.com:3478,stun:stun.sonetel.net:3478,stun:stun.stunprotocol.org:3478,stun:stun.uls.co.za:3478,stun:stun.voipgate.com:3478,stun:stun.voys.nl:3478

Notice `new bridge descriptor 'flakey4' (fresh)`:

snowflake/broker$ ./broker -disable-tls -addr 127.0.0.1:8000
snowflake/proxy$ ./proxy -capacity 10 -broker http://127.0.0.1:8000/ -keep-local-addresses -relay wss://snowflake-staging.bamsoftware.com/
snowflake/client$ tor -f torrc.staging
[notice] Tor 0.3.5.16 running on Linux with Libevent 2.1.8-stable, OpenSSL 1.1.1d, Zlib 1.2.11, Liblzma 5.2.4, and Libzstd 1.3.8.
[notice] Bootstrapped 0%: Starting
[notice] Starting with guard context "bridges"
[notice] Delaying directory fetches: No running bridges
[notice] Bootstrapped 5%: Connecting to directory server
[notice] Bootstrapped 10%: Finishing handshake with directory server
[notice] Bootstrapped 15%: Establishing an encrypted directory connection
[notice] Bootstrapped 20%: Asking for networkstatus consensus
[notice] new bridge descriptor 'flakey4' (fresh): $2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey4 at 192.0.2.3
[notice] Bootstrapped 25%: Loading networkstatus consensus
[notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
[notice] Bootstrapped 40%: Loading authority key certs
[notice] The current consensus has no exit nodes. Tor can only build internal paths, such as paths to onion services.
[notice] Bootstrapped 45%: Asking for relay descriptors for internal paths
[notice] I learned some more directory information, but not enough to build a circuit: We need more microdescriptors: we have 0/6673, and can only build 0% of likely paths. (We have 100% of guards bw, 0% of midpoint bw, and 0% of end bw (no exits in consensus, using mid) = 0% of path bw.)
[notice] Bootstrapped 50%: Loading relay descriptors for internal paths
[notice] The current consensus contains exit nodes. Tor can build exit and internal paths.
[notice] Bootstrapped 57%: Loading relay descriptors
[notice] Bootstrapped 63%: Loading relay descriptors
[notice] Bootstrapped 72%: Loading relay descriptors
[notice] Bootstrapped 77%: Loading relay descriptors
[notice] Bootstrapped 80%: Connecting to the Tor network
[notice] Bootstrapped 85%: Finishing handshake with first hop
[notice] Bootstrapped 90%: Establishing a Tor circuit
[notice] Bootstrapped 100%: Done

But now, if we try running torrc.staging but give it the DataDirectory "datadir.prod", it fails at 90%. Notice `new bridge descriptor 'flakey' (cached)`: if the descriptor had not been cached it would have been flakey[1234] instead.

$ tor -f torrc.staging DataDirectory datadir.prod Log "notice stderr" Log "info file info.log"
[notice] Tor 0.3.5.16 running on Linux with Libevent 2.1.8-stable, OpenSSL 1.1.1d, Zlib 1.2.11, Liblzma 5.2.4, and Libzstd 1.3.8.
[notice] Bootstrapped 0%: Starting
[notice] Starting with guard context "bridges"
[notice] new bridge descriptor 'flakey' (cached): $2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3
[notice] Delaying directory fetches: Pluggable transport proxies still configuring
[notice] Bootstrapped 5%: Connecting to directory server
[notice] Bootstrapped 10%: Finishing handshake with directory server
[notice] Bootstrapped 80%: Connecting to the Tor network
[notice] Bootstrapped 90%: Establishing a Tor circuit
[notice] Delaying directory fetches: No running bridges

Here is an excerpt from the info-level log that shows the error. The important part seems to be `onion_skin_ntor_client_handshake(): Invalid result from curve25519 handshake: 4`.

[notice] new bridge descriptor 'flakey' (cached): $2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3
[notice] Delaying directory fetches: Pluggable transport proxies still configuring
[info] extend_info_from_node(): Including Ed25519 ID for $2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3
[info] onion_pick_cpath_exit(): Using requested exit node '$2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3'
[info] circuit_handle_first_hop(): Next router is [scrubbed]: Not connected. Connecting.
[notice] Bootstrapped 5%: Connecting to directory server
[info] connection_or_set_canonical(): Channel 0 chose an idle timeout of 247.
[info] connection_or_set_identity_digest(): Set identity digest for 0x55c3f9356770 ([scrubbed]): 2B280B23E1107BB62ABFC40DDCC8824814F80A72 1zOHpg+FxqQfi/6jDLtCpHHqBTH8gjYmCKXkus1D5Ko.
[info] connection_or_set_identity_digest():    (Previously: 0000000000000000000000000000000000000000 <unset>)
[info] connection_or_set_canonical(): Channel 1 chose an idle timeout of 232.
[info] circuit_predict_and_launch_new(): Have 0 clean circs (0 internal), need another exit circ.
[info] choose_good_exit_server_general(): Found 1336 servers that might support 0/0 pending connections.
[info] choose_good_exit_server_general(): Chose exit server '$0F1C8168DFD0AADBE61BD71194D37C867FED5A21~FreeExit at 81.17.18.60'
[info] extend_info_from_node(): Including Ed25519 ID for $0F1C8168DFD0AADBE61BD71194D37C867FED5A21~FreeExit at 81.17.18.60
[info] select_primary_guard_for_circuit(): Selected primary guard $2B280B23E1107BB62ABFC40DDCC8824814F80A72 ($2B280B23E1107BB62ABFC40DDCC8824814F80A72) for circuit.
[info] extend_info_from_node(): Including Ed25519 ID for $2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3
[info] extend_info_from_node(): Including Ed25519 ID for $7158D1E0D9F90F7999ACB3B073DA762C9B2C3275~maltimore at 207.180.224.17
[info] circuit_handle_first_hop(): Next router is [scrubbed]: Connection in progress; waiting.
[info] connection_edge_process_inbuf(): data from edge while in 'waiting for circuit' state. Leaving it on buffer.
[info] connection_edge_process_inbuf(): data from edge while in 'waiting for circuit' state. Leaving it on buffer.
[notice] Bootstrapped 10%: Finishing handshake with directory server
[notice] Bootstrapped 80%: Connecting to the Tor network
[info] parse_socks_client(): SOCKS 5 client: need authentication.
[info] parse_socks_client(): SOCKS 5 client: authentication successful.
[info] connection_read_proxy_handshake(): Proxy Client: connection to 192.0.2.3:1 successful
[info] circuit_predict_and_launch_new(): Have 1 clean circs (0 internal), need another exit circ.
[info] choose_good_exit_server_general(): Found 1336 servers that might support 0/0 pending connections.
[info] choose_good_exit_server_general(): Chose exit server '$D8A1F5A8EA1AF53E3414B9C48FE6B10C31ACC9B2~privexse1exit at 185.130.44.108'
[info] extend_info_from_node(): Including Ed25519 ID for $D8A1F5A8EA1AF53E3414B9C48FE6B10C31ACC9B2~privexse1exit at 185.130.44.108
[info] select_primary_guard_for_circuit(): Selected primary guard $2B280B23E1107BB62ABFC40DDCC8824814F80A72 ($2B280B23E1107BB62ABFC40DDCC8824814F80A72) for circuit.
[info] extend_info_from_node(): Including Ed25519 ID for $2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3
[info] extend_info_from_node(): Including Ed25519 ID for $2F9AFDE43DC8E3F05803304C01BD3DBF329169AC~dutreuil at 213.152.168.27
[info] circuit_handle_first_hop(): Next router is [scrubbed]: Connection in progress; waiting.
[info] circuit_predict_and_launch_new(): Have 2 clean circs (0 uptime-internal, 0 internal), need another hidden service circ.
[info] extend_info_from_node(): Including Ed25519 ID for $8967A8912E61070FCFA9B8EC9869E5AC8F94949A~4Freunde at 145.239.154.56
[info] select_primary_guard_for_circuit(): Selected primary guard $2B280B23E1107BB62ABFC40DDCC8824814F80A72 ($2B280B23E1107BB62ABFC40DDCC8824814F80A72) for circuit.
[info] extend_info_from_node(): Including Ed25519 ID for $2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3
[info] extend_info_from_node(): Including Ed25519 ID for $9367EB01DF75DE6265A0971249204029D6A55877~oddling at 5.182.210.231
[info] circuit_handle_first_hop(): Next router is [scrubbed]: Connection in progress; waiting.
[info] circuit_predict_and_launch_new(): Have 3 clean circs (1 uptime-internal, 1 internal), need another hidden service circ.
[info] extend_info_from_node(): Including Ed25519 ID for $AF85E6556FD5692BC554A93BAC9FACBFC2D79EFD~whoUSicebeer09b at 192.187.103.74
[info] select_primary_guard_for_circuit(): Selected primary guard $2B280B23E1107BB62ABFC40DDCC8824814F80A72 ($2B280B23E1107BB62ABFC40DDCC8824814F80A72) for circuit.
[info] extend_info_from_node(): Including Ed25519 ID for $2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3
[info] extend_info_from_node(): Including Ed25519 ID for $9515B435D8D063E537AB137FCF5A97B1ACE3CA2A~corvuscorone at 135.181.178.197
[info] circuit_handle_first_hop(): Next router is [scrubbed]: Connection in progress; waiting.
[info] circuit_predict_and_launch_new(): Have 4 clean circs (2 uptime-internal, 2 internal), need another hidden service circ.
[info] extend_info_from_node(): Including Ed25519 ID for $68A9F0DFFC7C8F57B3DEA3801D6CF001652A809F~vpskilobug at 213.164.206.145
[info] select_primary_guard_for_circuit(): Selected primary guard $2B280B23E1107BB62ABFC40DDCC8824814F80A72 ($2B280B23E1107BB62ABFC40DDCC8824814F80A72) for circuit.
[info] extend_info_from_node(): Including Ed25519 ID for $2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3
[info] extend_info_from_node(): Including Ed25519 ID for $2C13A54E3E8A6AFB18E0DE5890E5B08AAF5B0F36~history at 138.201.123.109
[info] circuit_handle_first_hop(): Next router is [scrubbed]: Connection in progress; waiting.
[info] channel_tls_process_versions_cell(): Negotiated version 5 with [scrubbed]:1; Waiting for CERTS cell
[info] connection_or_client_learned_peer_id(): learned peer id for 0x55c3f9356770 ([scrubbed]): 2B280B23E1107BB62ABFC40DDCC8824814F80A72, 1zOHpg+FxqQfi/6jDLtCpHHqBTH8gjYmCKXkus1D5Ko
[info] channel_tls_process_certs_cell(): Got some good certificates from [scrubbed]:1: Authenticated it with RSA and Ed25519
[info] circuit_send_first_onion_skin(): First hop: finished sending CREATE cell to '$2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3'
[notice] Bootstrapped 90%: Establishing a Tor circuit
[info] circuit_send_first_onion_skin(): First hop: finished sending CREATE cell to '$2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3'
[info] circuit_send_first_onion_skin(): First hop: finished sending CREATE cell to '$2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3'
[info] circuit_send_first_onion_skin(): First hop: finished sending CREATE cell to '$2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3'
[info] circuit_send_first_onion_skin(): First hop: finished sending CREATE cell to '$2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3'
[info] circuit_send_first_onion_skin(): First hop: finished sending CREATE cell to '$2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 192.0.2.3'
[info] channel_tls_process_netinfo_cell(): Got good NETINFO cell from [scrubbed]:1; OR connection is now open, using protocol version 5. Its ID digest is 2B280B23E1107BB62ABFC40DDCC8824814F80A72. Our address is apparently [scrubbed].
[info] onion_skin_ntor_client_handshake(): Invalid result from curve25519 handshake: 4
[info] circuit_mark_for_close_(): Circuit 3457244666 (id: 1) marked for close at ../src/core/or/command.c:443 (orig reason: 1, new reason: 0)
[info] onion_skin_ntor_client_handshake(): Invalid result from curve25519 handshake: 4
[info] circuit_mark_for_close_(): Circuit 4237434553 (id: 2) marked for close at ../src/core/or/command.c:443 (orig reason: 1, new reason: 0)
[info] onion_skin_ntor_client_handshake(): Invalid result from curve25519 handshake: 4
[info] circuit_mark_for_close_(): Circuit 3082862549 (id: 6) marked for close at ../src/core/or/command.c:443 (orig reason: 1, new reason: 0)
[info] onion_skin_ntor_client_handshake(): Invalid result from curve25519 handshake: 4
[info] circuit_mark_for_close_(): Circuit 2596950236 (id: 4) marked for close at ../src/core/or/command.c:443 (orig reason: 1, new reason: 0)
[info] circuit_build_failed(): Our circuit 3457244666 (id: 1) failed to get a response from the first hop (192.0.2.3:1). I'm going to try to rotate to a better connection.
[info] connection_ap_fail_onehop(): Closing one-hop stream to '$2B280B23E1107BB62ABFC40DDCC8824814F80A72/192.0.2.3' because the OR conn just failed.
[info] circuit_free_(): Circuit 0 (id: 1) has been freed.
[info] circuit_build_failed(): Our circuit 4237434553 (id: 2) failed to get a response from the first hop (192.0.2.3:1). I'm going to try to rotate to a better connection.
[info] circuit_free_(): Circuit 0 (id: 2) has been freed.
[info] circuit_build_failed(): Our circuit 3082862549 (id: 6) failed to get a response from the first hop (192.0.2.3:1). I'm going to try to rotate to a better connection.
[info] circuit_free_(): Circuit 0 (id: 6) has been freed.
[info] circuit_build_failed(): Our circuit 2596950236 (id: 4) failed to get a response from the first hop (192.0.2.3:1). I'm going to try to rotate to a better connection.
[info] circuit_free_(): Circuit 0 (id: 4) has been freed.
[info] connection_free_minimal(): Freeing linked Socks connection [waiting for circuit] with 121 bytes on inbuf, 0 on outbuf.
[info] connection_dir_client_reached_eof(): 'fetch' response not all here, but we're at eof. Closing.
[info] entry_guards_note_guard_failure(): Recorded failure for primary confirmed guard $2B280B23E1107BB62ABFC40DDCC8824814F80A72 ($2B280B23E1107BB62ABFC40DDCC8824814F80A72)
[info] connection_dir_client_request_failed(): Giving up on serverdesc/extrainfo fetch from directory server at '192.0.2.3'; retrying
[info] connection_free_minimal(): Freeing linked Directory connection [client reading] with 0 bytes on inbuf, 0 on outbuf.
[info] onion_skin_ntor_client_handshake(): Invalid result from curve25519 handshake: 4
[info] circuit_mark_for_close_(): Circuit 2912328161 (id: 5) marked for close at ../src/core/or/command.c:443 (orig reason: 1, new reason: 0)
[info] onion_skin_ntor_client_handshake(): Invalid result from curve25519 handshake: 4
[info] circuit_mark_for_close_(): Circuit 2793970028 (id: 3) marked for close at ../src/core/or/command.c:443 (orig reason: 1, new reason: 0)
[info] circuit_build_failed(): Our circuit 2912328161 (id: 5) failed to get a response from the first hop (192.0.2.3:1). I'm going to try to rotate to a better connection.
[info] circuit_free_(): Circuit 0 (id: 5) has been freed.
[info] circuit_build_failed(): Our circuit 2793970028 (id: 3) failed to get a response from the first hop (192.0.2.3:1). I'm going to try to rotate to a better connection.
[info] circuit_free_(): Circuit 0 (id: 3) has been freed.
[info] connection_ap_make_link(): Making internal direct tunnel to [scrubbed]:1 ...
[info] connection_ap_make_link(): ... application connection created and linked.
[info] should_delay_dir_fetches(): Delaying dir fetches (no running bridges known)
[notice] Delaying directory fetches: No running bridges

As you suggested, CREATE_FAST in place of CREATE works. I hacked `should_use_create_fast_for_circuit` to always return true:

diff --git a/src/core/or/circuitbuild.c b/src/core/or/circuitbuild.c
index 2bcc642a97..4005ba56ce 100644
--- a/src/core/or/circuitbuild.c
+++ b/src/core/or/circuitbuild.c
@@ -801,6 +801,7 @@ should_use_create_fast_for_circuit(origin_circuit_t *circ)
   tor_assert(circ->cpath);
   tor_assert(circ->cpath->extend_info);

+  return true;
   return ! circuit_has_usable_onion_key(circ);
 }

And then the mixed configuration with the "staging" bridge and the "prod" DataDirectory bootstraps. Notice `new bridge descriptor 'flakey' (cached)` followed later by `new bridge descriptor 'flakey1' (fresh)`.

$ ~/tor/src/app/tor -f torrc.staging DataDirectory datadir.prod
[notice] Tor 0.4.6.8 (git-d5efc2c98619568e) running on Linux with Libevent 2.1.8-stable, OpenSSL 1.1.1d, Zlib 1.2.11, Liblzma 5.2.4, Libzstd N/A and Glibc 2.28 as libc.
[notice] Bootstrapped 0% (starting): Starting
[notice] Starting with guard context "bridges"
[notice] new bridge descriptor 'flakey' (cached): $2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey [1zOHpg+FxqQfi/6jDLtCpHHqBTH8gjYmCKXkus1D5Ko] at 192.0.2.3
[notice] Delaying directory fetches: Pluggable transport proxies still configuring
[notice] Bootstrapped 1% (conn_pt): Connecting to pluggable transport
[notice] Bootstrapped 2% (conn_done_pt): Connected to pluggable transport
[notice] Bootstrapped 10% (conn_done): Connected to a relay
[notice] Bootstrapped 14% (handshake): Handshaking with a relay
[notice] Bootstrapped 15% (handshake_done): Handshake with a relay done
[notice] Bootstrapped 75% (enough_dirinfo): Loaded enough directory info to build circuits
[notice] Bootstrapped 95% (circuit_create): Establishing a Tor circuit
[notice] new bridge descriptor 'flakey1' (fresh): $2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey1 [1zOHpg+FxqQfi/6jDLtCpHHqBTH8gjYmCKXkus1D5Ko] at 192.0.2.3
[notice] Bootstrapped 100% (done): Done

If this is an issue, we can definitely work around it, by e.g. disabling the onion key rotation on the bridges, or setting up a periodic rsync+hup between the bridges, or teaching clients to use createfast cells in this situation (this type of circuit crypto doesn't use the onion key at all, and just relies on TLS for security -- which can only be done for the first hop of the circuit but that's the one we're talking about here).

What do you recommend trying? I guess the quickest way to get more capacity on the snowflake bridge is to disable onion key rotation by patching the tor source code, though I wouldn't want to maintain that long-term.
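
For reference, a very rough sketch of what such a patch could look like, assuming the rotation entry point is rotate_onion_key() in src/feature/relay/router.c (illustrative only, not a patch that was actually deployed):

/* src/feature/relay/router.c -- sketch only */
void
rotate_onion_key(void)
{
  /* Hypothetical hack: bail out immediately so every load-balanced
   * instance keeps the shared onion keys indefinitely. */
  log_notice(LD_GENERAL, "Onion key rotation disabled by local patch.");
  return;
  /* ...original rotation logic, now unreachable... */
}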

Gary, I was wondering how you are dealing with the changing onion key issue, and I suppose it is this (from your earlier message in this thread):

use Source IP Address Sticky Sessions (Pinning)

The same client source address gets pinned to the same tor instance and therefore the same onion key. If I understand correctly, there's a potential failure if a client changes its IP address and later gets mapped to a different instance. Is that right?

···

On Tue, Jan 04, 2022 at 11:57:36PM -0500, Roger Dingledine wrote:
_______________________________________________
tor-relays mailing list
tor-relays@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-relays

Yes… That is correct. As long as circuits originate from the same Source IP Address, Nginx/HAProxy ensures they are pinned to the same loadbalanced Upstream Tor Node, unless the originating Source IP Address changes (low-risk) or one of the Upstream Tor Nodes goes down (low-risk with UPS); in that case the surviving circuits migrate to the remaining Upstream Tor Nodes, which effectively forces building of new circuits with the relevant keys.

The issue I find more challenging, in loadbalancing Upstream Tor Nodes, is when the Medium-Term Key is updated after running for some time (it’s consistent with the previously mentioned 4 - 5 week time period). At that point I notice all circuits bleed off from the Upstream Tor Nodes, with the exception of the Tor Node where the Medium-Term Key was successfully updated. When that happens, I am forced to shut down all Upstream Tor Nodes, copy the .tordb containing the updated Medium-Term Key to the other Upstream Tor Nodes, and restart all Upstream Tor Nodes.
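
As an illustration of that manual resync (the addresses and paths follow the configs quoted earlier in this thread; which node holds the freshly rotated key is an assumption):

#!/bin/sh
# Sketch: stop every upstream node, copy the .tordb holding the updated
# Medium-Term Key from 192.168.0.11 to the rest, then restart them all.
SRC=192.168.0.11
NODES="192.168.0.21 192.168.0.41 192.168.0.51"
for node in $SRC $NODES; do /usr/bin/ssh -t admin@$node 'killall tor'; done
/usr/bin/rsync -a admin@$SRC:/tmp/tor/torrc.d/.tordb/ /tmp/tordb-sync/
for node in $NODES; do /usr/bin/rsync -a --delete /tmp/tordb-sync/ admin@$node:/tmp/tor/torrc.d/.tordb/; done
for node in $SRC $NODES; do /usr/bin/ssh -t admin@$node '/usr/sbin/tor -f /tmp/torrc --quiet'; done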

If there was a way for a Family of Tor Instances to share a Medium-Term Key, I believe that might solve the long-term issue of running a Loadbalanced Tor Relay.

As it stands… I can run my Loadbalanced Tor Relay for 4 - 5 weeks without any intervention.

Hope that answers your question.

Respectfully,

Gary

···

On Monday, January 17, 2022, 11:47:11 AM MST, David Fifield david@bamsoftware.com wrote:

Gary, I was wondering how you are dealing with the changing onion key issue, and I suppose it is this:

use Source IP Address Sticky Sessions (Pinning)

The same client source address gets pinned to the same tor instance and therefore the same onion key. If I understand correctly, there’s a potential failure if a client changes its IP address and later gets mapped to a different instance. Is that right?


This Message Originated by the Sun.
iBigBlue 63W Solar Array (~12 Hour Charge)

  • 2 x Charmast 26800mAh Power Banks
    = iPhone XS Max 512GB (~2 Weeks Charged)
1 Like

The DNS record for the Snowflake bridge was switched to a temporary staging server, running the load balancing setup, at 2022-01-25 17:41:00. We were debugging some initial problems until 2022-01-25 18:47:00. You can read about it here:

https://bugs.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/40095#note_2772325

Snowflake sessions are now using the staging bridge, except for those that started before the change happened and haven't finished yet, and perhaps some proxies that still have the IP address of the production bridge in their DNS cache. I am not sure yet what will happen with metrics, but we'll see after a few days.

On the matter of onion key rotation, I had the idea of making the onion key files read-only. Roger did some source code investigation and said that it might work to prevent onion key rotation, with some minor side effects. I plan to give the idea a try on a different bridge. The possible side effects are that tor will continue trying and failing to rotate the onion key every hour, and "force a router descriptor rebuild, so it will try to publish a new descriptor each hour."

https://gitweb.torproject.org/tor.git/tree/src/feature/relay/router.c?h=tor-0.4.6.9#n523

  if (curve25519_keypair_write_to_file(&new_curve25519_keypair, fname,
                                       "onion") < 0) {
    log_err(LD_FS,"Couldn't write curve25519 onion key to \"%s\".",fname);
    goto error;
  }
  // ...
 error:
  log_warn(LD_GENERAL, "Couldn't rotate onion key.");
  if (prkey)
    crypto_pk_free(prkey);
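
For anyone wanting to try the read-only idea, one possible sketch (paths assume a standard DataDirectory; chattr +i also blocks the rename() tor performs when writing a new key; this is not the procedure actually used on the bridge):

cd /var/lib/tor/keys
chattr +i secret_onion_key secret_onion_key_ntor   # make the onion key files immutable
# expect hourly "Couldn't rotate onion key." warnings in the log afterwards
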
···

_______________________________________________
tor-relays mailing list
tor-relays@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-relays

1 Like

David,

Excellent documentation of your loadbalanced Snowflake endeavors!

The DNS record for the Snowflake bridge was switched to a temporary staging server, running the load balancing setup, at 2022-01-25 17:41:00. We were debugging some initial problems until 2022-01-25 18:47:00. You can read about it here:

https://bugs.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/40095#note_2772325

It’s nice to see that the Snowflake daemon offers a native configuration option for LimitNOFile. I ran into a similar issue with my initial loadbalanced Tor Relay Nodes that was solved at the O/S level using ulimit. It would be nice if torrc had a similar option.

From your documentation, it sounds like you’re running everything on the same machine? When expanding to additional machines, similar to the file limit issue, you’ll have to expand the usable ports as well.

I’d like to see more of your HAProxy configuration. Do you not have to use transparent proxy mode with Snowflake instances as you do with Tor Relay instances? I hadn’t realized HAProxy had a client timeout. Thank you for that tidbit. And thank you for referencing my comments as well.

Snowflake sessions are now using the staging bridge, except for those that started before the change happened and haven’t finished yet, and perhaps some proxies that still have the IP address of the production bridge in their DNS cache. I am not sure yet what will happen with metrics, but we’ll see after a few days.

Currently, as I only use IPv4, I can’t offer much insight as to the lack of IPv6 connections being reported (that’s what my logs report, too). Your Heartbeat messages are looking good with a symmetric balance of connections and data. They look very similar to my Heartbeat logs; except, you can tell you offer more computing power, which is great to see extrapolated! I’ve found that the Heartbeat logs are key to knowing the health of your loadbalanced Tor implementation. You might consider setting up syslog with a Snowflake filter to aggregate your Snowflake logs for easier readability.

Regarding metrics.torproject.org… I expect you’ll see that written-bytes and read-bytes only reflect that of a single Snowflake instance. However, your consensus weight will reflect the aggregate of all Snowflake instances.

On the matter of onion key rotation, I had the idea of making the onion key files read-only. Roger did some source code investigation and said that it might work to prevent onion key rotation, with some minor side effects. I plan to give the idea a try on a different bridge. The possible side effects are that tor will continue trying and failing to rotate the onion key every hour, and “force a router descriptor rebuild, so it will try to publish a new descriptor each hour.”

I’m interested to hear how the prospective read-only file fix plays out. However, from my observations, I would assume that connections will eventually start bleeding off any instances that fail to update the key. We really need a long-term solution to this issue for this style of deployment.

Keep up the Great Work!

Respectfully,

Gary

···


This Message Originated by the Sun.
iBigBlue 63W Solar Array (~12 Hour Charge)

  • 2 x Charmast 26800mAh Power Banks
    = iPhone XS Max 512GB (~2 Weeks Charged)
1 Like

David,

I’d like to see more of your HAProxy configuration. Do you not have to use transparent proxy mode with Snowflake instances as you do with Tor Relay instances? I hadn’t realized HAProxy had a client timeout. Thank you for that tidbit. And thank you for referencing my comments as well.

I found your HAProxy configuration in your “Draft installation guide.” It seems you’re using regular TCP streaming mode with the Snowflake instances vs transparent TCP streaming mode, which is a notable difference from the directly loadbalanced Tor Relay configuration. I also noticed you’ve configured the backend node timeout globally vs per node, which is just a nuance. You might test using a timeout value of 0s (to disable the timeout at the loadbalancer) and allow the Snowflake instances to perform state checking, to ensure HAProxy isn’t throttling your bridge. I’ve tested both and I’m still not sure which timeout configuration makes most sense for this style of implementation. Currently, I’m running with the 0s (disabled) timeout.
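
A sketch of the timeout-disabled variant described above (the section placement and the connect value are illustrative; per the suggestion, a value of 0s turns the idle timeouts off):

defaults
    mode tcp
    timeout connect 10s
    timeout client 0s   # disable the client-side idle timeout at the loadbalancer
    timeout server 0s   # disable the server-side idle timeout at the loadbalancer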

Any reason why you chose HAProxy over Nginx?

I did notice that you’re using the AssumeReachable 1 directive in your torrc files. Are you running into an issue where your Tor instances are failing the reachability test? Initially, I ran into a reachability issue and after digging through mountains of Tor debug logs discovered I needed to use transparent TCP streaming mode along with the Linux kernel and iptables changes to route the Tor traffic back from the Tor Relay Nodes to the loadbalancer. You shouldn’t need to run your Tor instances with the AssumeReachable 1 directive. This might suggest something in your configuration isn’t quite right.

One of my initial tests was staggering the startup of my instances to see how they randomly reported to the DirectoryAuthorities. It’s how I discovered that Tor instances push rather than poll meta-data (different uptimes). The latter would work better in a loadbalanced-style deployment.

Do your Snowflake instances not have issues reporting to different DirectoryAuthorities? My Tor instances have issues if I don’t have them all report to the same DirectoryAuthority.

Keep up the excellent work.

Respectfully,

Gary

···


This Message Originated by the Sun.
iBigBlue 63W Solar Array (~12 Hour Charge)

  • 2 x Charmast 26800mAh Power Banks
    = iPhone XS Max 512GB (~2 Weeks Charged)

It's nice to see that the Snowflake daemon offers a native configuration option for LimitNOFile. I ran into a similar issue with my initial loadbalanced Tor Relay Nodes that was solved at the O/S level using ulimit. It would be nice if torrc had a similar option.

LimitNOFile is actually not a Snowflake thing, it's a systemd thing. It's the same as `ulimit -n`. See:
https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Process%20Properties
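
For reference, on a systemd system the usual way to set this for a unit (assuming the unit name snowflake-server.service) is a drop-in override:

# systemctl edit snowflake-server

and, in the editor that opens:

[Service]
LimitNOFILE=65536

then restart the service.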

From your documentation, it sounds like you're running everything on the same machine? When expanding to additional machines, similar to the file limit issue, you'll have to expand the usable ports as well.

I don't think I understand your point. At 64K simultaneous connections, you run out of source ports for making connection 4-tuple unique, but I don't see how the same or different hosts makes a difference, in that respect.

I found your HAProxy configuration in your “Draft installation guide.” It seems you’re using regular TCP streaming mode with the Snowflake instances vs transparent TCP streaming mode, which is a notable difference with the directly loadbalanced Tor Relay configuration.

I admit I did not understand your point about transparent proxying. If it's about retaining the client's source IP address for source IP address pinning, I don't think that helps us. This is a bridge, not a relay, and the source IP address that haproxy sees is several steps removed from the client's actual IP address. haproxy receives connections from a localhost web server (the server pluggable transport that receives WebSocket connections); the web server receives connections from Snowflake proxies (which can and do have different IP addresses during the lifetime of a client session); only the Snowflake proxies themselves receive direct traffic from the client's own source IP address. The client's IP address is tunnelled all the way through to tor, for metrics purposes, but that uses the ExtORPort protocol and the load balancer isn't going to understand that. I think that transparent proxying would only transparently proxy the localhost IP addresses from the web server, which doesn't have any benefit, I don't think.

What's written in the draft installation guide is not the whole file. In addition, there are the default settings, as follows:

global
        log /dev/log    local0
        log /dev/log    local1 notice
        chroot /var/lib/haproxy
        stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
        stats timeout 30s
        user haproxy
        group haproxy
        daemon

        # Default SSL material locations
        ca-base /etc/ssl/certs
        crt-base /etc/ssl/private

        # See: https://ssl-config.mozilla.org/#server=haproxy&server-version=2.0.3&config=intermediate
        ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
        ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
        ssl-default-bind-options ssl-min-ver TLSv1.2 no-tls-tickets

defaults
        log     global
        mode    http
        option  httplog
        option  dontlognull
        timeout connect 5000
        timeout client  50000
        timeout server  50000
        errorfile 400 /etc/haproxy/errors/400.http
        errorfile 403 /etc/haproxy/errors/403.http
        errorfile 408 /etc/haproxy/errors/408.http
        errorfile 500 /etc/haproxy/errors/500.http
        errorfile 502 /etc/haproxy/errors/502.http
        errorfile 503 /etc/haproxy/errors/503.http
        errorfile 504 /etc/haproxy/errors/504.http

You might test using a timeout value of 0s (to disable the timeout at the loadbalancer) and allow the Snowflake instances to perform state checking to ensure HAProxy isn’t throttling your bridge.

Thanks for that hint. So far, 10-minute timeouts seem not to be causing a problem. I don't know this software too well, but I think it's an idle timeout, not an absolute limit on connection lifetime.

Currently, as I only use IPv4, I can't offer much insight as to the lack of IPv6 connections being reported (that's what my logs report, too).

On further reflection, I don't think there's a problem here. The instances' bridge-stats and end-stats show a mix of countries and v4/v6.

Regarding metrics.torproject.org... I expect you'll see that written-bytes and read-bytes only reflect that of a single Snowflake instance. However, your consensus weight will reflect the aggregate of all Snowflake instances.

Indeed, the first few data points after the switchover show an apparent decrease in read/written bytes per second, even though the on-bridge bandwidth monitors show much more bandwidth being used than before. I suppose it could be selecting from any of 5 instances that currently share the same identity fingerprint: the 4 new load-balanced instances on the "staging" bridge, plus the 1 instance which is still running concurrently on the "production" bridge. When we finish the upgrade and get all the instances back on the production bridge, if the metrics are wrong, they will at least be uniformly wrong.
https://metrics.torproject.org/rs.html#details/5481936581E23D2D178105D44DB6915AB06BFB7F

Any reason why you chose HAProxy over Nginx?

Shelikhoo drafted a configuration using Nginx, which for the time being you can see here:

https://pad.riseup.net/p/pvKoxaIcejfiIbvVAV7j#L416

I don't have a strong preference and I don't have a lot of experience with either one. haproxy seemed to offer fewer opportunities for error, because the default Nginx installation expects to run a web server, which I would have to disable and ensure it did not fight with snowflake-server for port 443. It just seemed simpler to have one configuration file to edit and restart the daemon.

I did notice that you’re using the AssumeReachable 1 directive in your torrc files. Are you running into an issue where your Tor instances are failing the reachability test?

It's because this bridge does not expose its ORPort, which is the recommended configuration for default bridges. The torrc has `ORPort 127.0.0.1:auto`, so the bridges will never be reachable over their ORPort, which is intentional. Bridges that want to be distributed by BridgeDB need to expose their ORPort, which is an unfortunate technical limitation that makes the bridges more detectable (https://bugs.torproject.org/tpo/core/tor/7349), but for default bridges it's not necessary. To be honest, I'm not sure that `AssumeReachable` is even required anymore for this kind of configuration; it's just something I remember having to do years ago for some reason. It may be superfluous now that we have `BridgeDistribution none`.

Do your Snowflake instances not have issues reporting to different DirectoryAuthorities?

Other than the possible metrics anomalies, I don't know what kind of issue you mean. It could be that, being a bridge, it has fewer constraints than your relays. A bridge doesn't have to be listed in the consensus, for example.

···

On Tue, Jan 25, 2022 at 11:21:10PM +0000, Gary C. New via tor-relays wrote:


David,

Snowflake sessions are now using the staging bridge, except for those that started before the change happened and haven’t finished yet, and perhaps some proxies that still have the IP address of the production bridge in their DNS cache. I am not sure yet what will happen with metrics, but we’ll see after a few days.

With regard to loadbalanced Snowflake sessions, I’m curious to know what connections (i.e., inbound, outbound, directory, control, etc) are being displayed within nyx?

Much Appreciated.

Gary

···


This Message Originated by the Sun.
iBigBlue 63W Solar Array (~12 Hour Charge)

  • 2 x Charmast 26800mAh Power Banks
    = iPhone XS Max 512GB (~2 Weeks Charged)

David,

I’ve been following your progress in the “Add load balancing to bridge (#40095)” issue.

The apparent decrease has to be spurious, since even at the beginning the bridge was moving more than 10 MB/s in both directions. A couple of hypotheses about what might be happening:

  • Onionoo is only showing us one instance out of the four. The actual numbers are four times higher.
    Per my previous response, my findings are consistent with yours in that Onionoo only shows metrics for a single instance, except for consensus weight.

Here are the most recent heartbeat logs. It looks like the load is fairly balanced, with each of the four tor instances having sent between 400 and 500 GB since being started.

Your Heartbeat logs continue to appear to be in good health. When keys are rotated, the Heartbeat logs will be a key indicator for validating whether connections are bleeding off from or remaining with a particular instance.

I worried a bit about the “0 with IPv6” in a previous comment. Looking at the bridge-stats files, I don’t think there’s a problem.

I’m glad to hear you feel the IPv6 reporting appears to be a false-negative. Does this mean there’s something wrong with IPv6 Heartbeat reporting?

Despite the load balancing, the 8 CPUs are pretty close to maxed. I would not mind having 16 cores right now. We may be in an induced demand situation where we make the bridge faster → the bridge gets more users → bridge gets slower.

I believe your observation is correct with regard to an induced-traffic situation. As CPU resources increase, they will likely be outpaced by increased traffic, until demand is satisfied or you run out of CPU resources again. Are your existing 8 CPUs only single cores? Is it too difficult to upgrade with your VPS provider? The O/S should detect the virtual hardware changes and add them accordingly. My current resource constraint is RAM, but I’m using bare-metal machines.

Great Progress!

Gary

···


This Message Originated by the Sun.
iBigBlue 63W Solar Array (~12 Hour Charge)

  • 2 x Charmast 26800mAh Power Banks
    = iPhone XS Max 512GB (~2 Weeks Charged)

With regard to loadbalanced Snowflake sessions, I'm curious to know what connections (i.e., inbound, outbound, directory, control, etc) are being displayed within nyx?

I'm not using nyx. I'm just looking at the bandwidth on the network
interface.

Your Heartbeat logs continue to appear to be in good health. When keys are rotated,

We're trying to avoid rotating keys at all. If the read-only files do not work, we'll instead probably periodically rewrite the state file to push the rotation into the future.
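
A rough, untested sketch of what that periodic rewrite could look like (instance names follow the snowflake# naming convention mentioned earlier; run from cron every week or two, well before the 5-week expiry):

# Keep LastRotatedOnionKey recent so tor never thinks the key is due for rotation.
for inst in snowflake1 snowflake2 snowflake3 snowflake4; do
    service tor@$inst stop
    sed -i "s/^LastRotatedOnionKey .*/LastRotatedOnionKey $(date -u '+%Y-%m-%d %H:%M:%S')/" \
        /var/lib/tor-instances/$inst/state
    service tor@$inst start
done

The stop/start is there because tor rewrites its state file while running and would otherwise overwrite the edit.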

> I worried a bit about the "0 with IPv6" in a previous comment. Looking at the bridge-stats files, I don't think there's a problem.

I'm glad to hear you feel the IPv6 reporting appears to be a false-negative. Does this mean there's something wrong with IPv6 Heartbeat reporting?

I don't know if it's wrong, exactly. It's reporting something different than what ExtORPort is providing. The proximate connections to tor are indeed all IPv4.

Are your existing 8 cpu's only single cores? Is it too difficult to upgrade with your VPS provider?

Sure, there are plenty of ways to increase resources of the bridge, but I feel that's a different topic.

Thanks for your comments.

···


On the matter of onion key rotation, I had the idea of making the onion key files read-only. Roger did some source code investigation and said that it might work to prevent onion key rotation, with some minor side effects. I plan to give the idea a try on a different bridge. The possible side effects are that tor will continue trying and failing to rotate the onion key every hour, and "force a router descriptor rebuild, so it will try to publish a new descriptor each hour."

Making secret_onion_key and secret_onion_key_ntor read-only does not quite work, because tor first renames them to secret_onion_key.old and secret_onion_key_ntor.old before writing new files. (Making the *.old files read-only does not work either, because the `tor_rename` function first unlinks the destination.)

But a slight variation does work: make secret_onion_key.old and secret_onion_key_ntor.old *directories*, so that tor_rename cannot rename a file over them. It does result in an hourly `BUG` stack trace, but otherwise it seems effective.

I did a test with two tor instances. The rot1 instance had the directory hack to prevent onion key rotation. The rot2 instance had nothing to prevent onion key rotation.

# tor-instance-create rot1
# tor-instance-create rot2

/etc/tor/instances/rot1/torrc:

Log info file /var/lib/tor-instances/rot1/onionrotate.info.log
BridgeRelay 1
AssumeReachable 1
BridgeDistribution none
ORPort 127.0.0.1:auto
ExtORPort auto
SocksPort 0
Nickname onionrotate1

/etc/tor/instances/rot2/torrc:

Log info file /var/lib/tor-instances/rot2/onionrotate.info.log
BridgeRelay 1
AssumeReachable 1
BridgeDistribution none
ORPort 127.0.0.1:auto
ExtORPort auto
SocksPort 0
Nickname onionrotate2

Start rot1, copy its keys to rot2, then start rot2:

# service tor@rot1 start
# cp -r /var/lib/tor-instances/rot1/keys /var/lib/tor-instances/rot2/
# chown -R _tor-rot2:_tor-rot2 /var/lib/tor-instances/rot2/keys
# service tor@rot2 start

Stop the two instances, check that the onion keys are the same, and that `LastRotatedOnionKey` is set in both state files:

# service tor@rot1 stop
# service tor@rot2 stop
# ls -l /var/lib/tor-instances/rot*/keys/secret_onion_key*
-rw------- 1 _tor-rot1 _tor-rot1 888 Jan 28 22:57 /var/lib/tor-instances/rot1/keys/secret_onion_key
-rw------- 1 _tor-rot1 _tor-rot1  96 Jan 28 22:57 /var/lib/tor-instances/rot1/keys/secret_onion_key_ntor
-rw------- 1 _tor-rot2 _tor-rot2 888 Jan 28 23:05 /var/lib/tor-instances/rot2/keys/secret_onion_key
-rw------- 1 _tor-rot2 _tor-rot2  96 Jan 28 23:05 /var/lib/tor-instances/rot2/keys/secret_onion_key_ntor
# md5sum /var/lib/tor-instances/rot*/keys/secret_onion_key*
fb2a8a8f9de56f061eccbb3fedd700c4  /var/lib/tor-instances/rot1/keys/secret_onion_key
2066ab7e01595adf42fc791ad36e1fc5  /var/lib/tor-instances/rot1/keys/secret_onion_key_ntor
fb2a8a8f9de56f061eccbb3fedd700c4  /var/lib/tor-instances/rot2/keys/secret_onion_key
2066ab7e01595adf42fc791ad36e1fc5  /var/lib/tor-instances/rot2/keys/secret_onion_key_ntor
# grep LastRotatedOnionKey /var/lib/tor-instances/rot*/state
/var/lib/tor-instances/rot1/state:LastRotatedOnionKey 2022-01-28 22:57:14
/var/lib/tor-instances/rot2/state:LastRotatedOnionKey 2022-01-28 23:11:04

Set `LastRotatedOnionKey` 6 weeks into the past to force an attempt to rotate the keys the next time tor is restarted:

# sed -i -e 's/^LastRotatedOnionKey .*/LastRotatedOnionKey 2021-12-15 00:00:00/' /var/lib/tor-instances/rot*/state
# grep LastRotatedOnionKey /var/lib/tor-instances/rot*/state
/var/lib/tor-instances/rot1/state:LastRotatedOnionKey 2021-12-15 00:00:00
/var/lib/tor-instances/rot2/state:LastRotatedOnionKey 2021-12-15 00:00:00

Create the secret_onion_key.old and secret_onion_key_ntor.old directories in the rot1 instance.

# mkdir -m 700 /var/lib/tor-instances/rot1/keys/secret_onion_key{,_ntor}.old

Check the identity of keys before starting:

# md5sum /var/lib/tor-instances/rot*/keys/secret_onion_key*
fb2a8a8f9de56f061eccbb3fedd700c4  /var/lib/tor-instances/rot1/keys/secret_onion_key
2066ab7e01595adf42fc791ad36e1fc5  /var/lib/tor-instances/rot1/keys/secret_onion_key_ntor
md5sum: /var/lib/tor-instances/rot1/keys/secret_onion_key_ntor.old: Is a directory
md5sum: /var/lib/tor-instances/rot1/keys/secret_onion_key.old: Is a directory
fb2a8a8f9de56f061eccbb3fedd700c4  /var/lib/tor-instances/rot2/keys/secret_onion_key
2066ab7e01595adf42fc791ad36e1fc5  /var/lib/tor-instances/rot2/keys/secret_onion_key_ntor

Start both instances:

# service tor@rot1 start
# service tor@rot2 start

Verify that the rot1 instance is still using the same onion keys, while rot2 has rotated them:

# ls -ld /var/lib/tor-instances/rot*/keys/secret_onion_key*
-rw------- 1 _tor-rot1 _tor-rot1  888 Jan 28 23:45 /var/lib/tor-instances/rot1/keys/secret_onion_key
-rw------- 1 _tor-rot1 _tor-rot1   96 Jan 28 23:45 /var/lib/tor-instances/rot1/keys/secret_onion_key_ntor
drwx--S--- 2 root      _tor-rot1 4096 Jan 28 23:44 /var/lib/tor-instances/rot1/keys/secret_onion_key_ntor.old
drwx--S--- 2 root      _tor-rot1 4096 Jan 28 23:44 /var/lib/tor-instances/rot1/keys/secret_onion_key.old
-rw------- 1 _tor-rot2 _tor-rot2  888 Jan 28 23:47 /var/lib/tor-instances/rot2/keys/secret_onion_key
-rw------- 1 _tor-rot2 _tor-rot2   96 Jan 28 23:47 /var/lib/tor-instances/rot2/keys/secret_onion_key_ntor
-rw------- 1 _tor-rot2 _tor-rot2   96 Jan 28 23:05 /var/lib/tor-instances/rot2/keys/secret_onion_key_ntor.old
-rw------- 1 _tor-rot2 _tor-rot2  888 Jan 28 23:05 /var/lib/tor-instances/rot2/keys/secret_onion_key.old
# md5sum /var/lib/tor-instances/rot*/keys/secret_onion_key*
fb2a8a8f9de56f061eccbb3fedd700c4  /var/lib/tor-instances/rot1/keys/secret_onion_key
2066ab7e01595adf42fc791ad36e1fc5  /var/lib/tor-instances/rot1/keys/secret_onion_key_ntor
md5sum: /var/lib/tor-instances/rot1/keys/secret_onion_key_ntor.old: Is a directory
md5sum: /var/lib/tor-instances/rot1/keys/secret_onion_key.old: Is a directory
fb8a5e8787141dba4e935267f818cc2a  /var/lib/tor-instances/rot2/keys/secret_onion_key
2c3f7d81e96641e2c04fb9c452296337  /var/lib/tor-instances/rot2/keys/secret_onion_key_ntor
2066ab7e01595adf42fc791ad36e1fc5  /var/lib/tor-instances/rot2/keys/secret_onion_key_ntor.old
fb2a8a8f9de56f061eccbb3fedd700c4  /var/lib/tor-instances/rot2/keys/secret_onion_key.old

The rot1 instance's `LastRotatedOnionKey` remains the same, while rot2's is updated:

# grep LastRotatedOnionKey /var/lib/tor-instances/rot*/state
/var/lib/tor-instances/rot1/state:LastRotatedOnionKey 2021-12-15 00:00:00
/var/lib/tor-instances/rot2/state:LastRotatedOnionKey 2022-01-28 23:47:02

The rot1 instance's log shows the failure to rotate the keys:

/var/lib/tor-instances/rot1/onionrotate.info.log

Jan 28 23:46:59.000 [info] rotate_onion_key_callback(): Rotating onion key.
Jan 28 23:46:59.000 [warn] Couldn't rotate onion key.
Jan 28 23:46:59.000 [info] router_rebuild_descriptor(): Rebuilding relay descriptor (forced)
...
Jan 28 23:46:59.000 [info] check_onion_keys_expiry_time_callback(): Expiring old onion keys.

While the rot2 rotation was successful:

/var/lib/tor-instances/rot2/onionrotate.info.log

Jan 28 23:47:02.000 [info] rotate_onion_key_callback(): Rotating onion key.
Jan 28 23:47:02.000 [info] rotate_onion_key(): Rotating onion key
Jan 28 23:47:02.000 [info] mark_my_descriptor_dirty(): Decided to publish new relay descriptor: rotated onion key

After 1 hour, the rot1 instance tries to rebuild its relay descriptor, and triggers a `BUG` non-fatal assertion failure in `router_rebuild_descriptor` (src/feature/relay/router.c). I let it run for 1 more hour after that, and it happened again.

/var/lib/tor-instances/rot1/onionrotate.info.log

Jan 29 00:46:59.000 [info] router_rebuild_descriptor(): Rebuilding relay descriptor (forced)
Jan 29 00:46:59.000 [warn] The IPv4 ORPort address 127.0.0.1 does not match the descriptor address 172.105.3.197. If you have a static public IPv4 address, use 'Address <IPv4>' and 'OutboundBindAddress <IPv4>'. If you are behind a NAT, use two ORPort lines: 'ORPort <PublicPort> NoListen' and 'ORPort <InternalPort> NoAdvertise'.
Jan 29 00:46:59.000 [info] extrainfo_dump_to_string_stats_helper(): Adding stats to extra-info descriptor.
Jan 29 00:46:59.000 [info] read_file_to_str(): Could not open "/var/lib/tor-instances/rot1/stats/bridge-stats": No such file or directory
Jan 29 00:46:59.000 [warn] tor_bug_occurred_(): Bug: ../src/feature/relay/router.c:2452: router_rebuild_descriptor: Non-fatal assertion !(desc_gen_reason == NULL) failed. (on Tor 0.4.5.10 )
Jan 29 00:46:59.000 [warn] Bug: Tor 0.4.5.10: Non-fatal assertion !(desc_gen_reason == NULL) failed in router_rebuild_descriptor at ../src/feature/relay/router.c:2452. Stack trace: (on Tor 0.4.5.10 )
Jan 29 00:46:59.000 [warn] Bug:     /usr/bin/tor(log_backtrace_impl+0x57) [0x5638b9538047] (on Tor 0.4.5.10 )
Jan 29 00:46:59.000 [warn] Bug:     /usr/bin/tor(tor_bug_occurred_+0x16b) [0x5638b954327b] (on Tor 0.4.5.10 )
Jan 29 00:46:59.000 [warn] Bug:     /usr/bin/tor(router_rebuild_descriptor+0x13d) [0x5638b94f4e1d] (on Tor 0.4.5.10 )
Jan 29 00:46:59.000 [warn] Bug:     /usr/bin/tor(+0x21f163) [0x5638b9665163] (on Tor 0.4.5.10 )
Jan 29 00:46:59.000 [warn] Bug:     /usr/bin/tor(+0x83577) [0x5638b94c9577] (on Tor 0.4.5.10 )
Jan 29 00:46:59.000 [warn] Bug:     /lib/x86_64-linux-gnu/libevent-2.1.so.7(+0x239ef) [0x7f701bae49ef] (on Tor 0.4.5.10 )
Jan 29 00:46:59.000 [warn] Bug:     /lib/x86_64-linux-gnu/libevent-2.1.so.7(event_base_loop+0x52f) [0x7f701bae528f] (on Tor 0.4.5.10 )
Jan 29 00:46:59.000 [warn] Bug:     /usr/bin/tor(do_main_loop+0x101) [0x5638b94b1321] (on Tor 0.4.5.10 )
Jan 29 00:46:59.000 [warn] Bug:     /usr/bin/tor(tor_run_main+0x1d5) [0x5638b94acdd5] (on Tor 0.4.5.10 )
Jan 29 00:46:59.000 [warn] Bug:     /usr/bin/tor(tor_main+0x49) [0x5638b94a92e9] (on Tor 0.4.5.10 )
Jan 29 00:46:59.000 [warn] Bug:     /usr/bin/tor(main+0x19) [0x5638b94a8ec9] (on Tor 0.4.5.10 )
Jan 29 00:46:59.000 [warn] Bug:     /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea) [0x7f701b391d0a] (on Tor 0.4.5.10 )
Jan 29 00:46:59.000 [warn] Bug:     /usr/bin/tor(_start+0x2a) [0x5638b94a8f1a] (on Tor 0.4.5.10 )
Jan 29 00:46:59.000 [info] router_upload_dir_desc_to_dirservers(): Uploading relay descriptor to directory authorities
Jan 29 00:46:59.000 [info] directory_post_to_dirservers(): Uploading an extrainfo too (length 822)
Jan 29 00:46:59.000 [info] rep_hist_note_used_internal(): New port prediction added. Will continue predictive circ building for 3332 more seconds.
Jan 29 00:46:59.000 [info] connection_ap_make_link(): Making internal anonymized tunnel to [scrubbed]:9001 ...
Jan 29 00:46:59.000 [info] connection_ap_make_link(): ... application connection created and linked.
Jan 29 00:46:59.000 [info] check_onion_keys_expiry_time_callback(): Expiring old onion keys.

Stopping and restarting the rot1 instance keeps the same onion keys, and the first rotation does not hit the assertion failure:

# service tor@rot1 stop
# service tor@rot1 start
# md5sum /var/lib/tor-instances/rot*/keys/secret_onion_key*
fb2a8a8f9de56f061eccbb3fedd700c4  /var/lib/tor-instances/rot1/keys/secret_onion_key
2066ab7e01595adf42fc791ad36e1fc5  /var/lib/tor-instances/rot1/keys/secret_onion_key_ntor
md5sum: /var/lib/tor-instances/rot1/keys/secret_onion_key_ntor.old: Is a directory
md5sum: /var/lib/tor-instances/rot1/keys/secret_onion_key.old: Is a directory
fb8a5e8787141dba4e935267f818cc2a  /var/lib/tor-instances/rot2/keys/secret_onion_key
2c3f7d81e96641e2c04fb9c452296337  /var/lib/tor-instances/rot2/keys/secret_onion_key_ntor
2066ab7e01595adf42fc791ad36e1fc5  /var/lib/tor-instances/rot2/keys/secret_onion_key_ntor.old
fb2a8a8f9de56f061eccbb3fedd700c4  /var/lib/tor-instances/rot2/keys/secret_onion_key.old

/var/lib/tor-instances/rot1/onionrotate.info.log

Jan 29 02:06:13.000 [info] rotate_onion_key_callback(): Rotating onion key.
Jan 29 02:06:13.000 [warn] Couldn't rotate onion key.
Jan 29 02:06:13.000 [info] router_rebuild_descriptor(): Rebuilding relay descriptor (forced)
...
Jan 29 02:06:13.000 [info] check_onion_keys_expiry_time_callback(): Expiring old onion keys.
···



David,

It’s nice to see that the Snowflake daemon offers a native configuration option for LimitNOFile. I ran into a similar issue with my initial loadbalanced Tor Relay Nodes that was solved at the O/S level using ulimit. It would be nice if torrc had a similar option.

LimitNOFile is actually not a Snowflake thing, it’s a systemd thing. It’s the same as ulimit -n. See:

https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Process%20Properties

Ah… My mistake. In my cursory review of your “Draft installation guide” I only saw the snowflake-server file name and assumed it was a .conf file, when in actuality it is a .service file. I should have noticed the /etc/systemd path. Thank you for the correction.

From your documentation, it sounds like you’re running everything on the same machine? When expanding to additional machines, similar to the file limit issue, you’ll have to expand the usable ports as well.

I don’t think I understand your point. At 64K simultaneous connections, you run out of source ports for making connection 4-tuple unique, but I don’t see how the same or different hosts makes a difference, in that respect.

On many Linux distros, the default ip_local_port_range is 32768 to 61000.

cat /proc/sys/net/ipv4/ip_local_port_range

32768 61000

The Tor Project recommends increasing it.

echo 15000 64000 > /proc/sys/net/ipv4/ip_local_port_range
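
To make that change persist across reboots (a sketch; the file name is arbitrary):

echo "net.ipv4.ip_local_port_range = 15000 64000" > /etc/sysctl.d/99-tor-ports.conf
sysctl --system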

I found your HAProxy configuration in your “Draft installation guide.” It seems you’re using regular TCP streaming mode with the Snowflake instances vs transparent TCP streaming mode, which is a notable difference with the directly loadbalanced Tor Relay configuration.

I admit I did not understand your point about transparent proxying. If it’s about retaining the client’s source IP address for source IP address pinning, I don’t think that helps us.

In Transparent TCP Stream mode, the Loadbalancer clones the IP address of the connecting Tor Client/Relay for use on the internal interface with connections to the upstream Tor Relay Nodes, so the upstream Tor Relay Nodes believe they’re talking to the actual connecting Tor Client/Relay.

This is a bridge, not a relay, and the source IP address that haproxy sees is several steps removed from the client’s actual IP address. haproxy receives connections from a localhost web server (the server pluggable transport that receives WebSocket connections); the web server receives connections from Snowflake proxies (which can and do have different IP addresses during the lifetime of a client session); only the Snowflake proxies themselves receive direct traffic from the client’s own source IP address.

You are correct. It makes more sense now why HAProxy’s Regular TCP Streaming Mode works in this paradigm. I believe what was confusing was the naming convention of your Tor instances (i.e., snowflake#), which led me to believe that your Snowflake proxy instances were upstream and not downstream. However, correlating the IP address assignments between configurations confirms HAProxy is loadbalancing upstream to your Tor Nodes.

The client’s IP address is tunnelled all the way through to tor, for metrics purposes, but that uses the ExtORPort protocol and the load balancer isn’t going to understand that.

As long as HAProxy is configured to use TCP Streaming Mode, it doesn’t matter what protocol is used as it will be passed through encapsulated in TCP. That’s the beauty of TCP Streaming Mode.

I think that transparent proxying would only transparently proxy the localhost IP addresses from the web server, which doesn’t have any benefit, I don’t think.

Agreed.

You might test using a timeout value of 0s (to disable the timeout at the loadbalancer) and allow the Snowflake instances to perform state checking to ensure HAProxy isn’t throttling your bridge.

Thanks for that hint. So far, 10-minute timeouts seem not to be causing a problem. I don’t know this software too well, but I think it’s an idle timeout, not an absolute limit on connection lifetime.

It’s HAProxy’s passive health-check timeout. The reason I disabled this timeout (0s) is that I felt the Tor instances know their state thresholds better and, if they became overloaded, would tell the DirectoryAuthorities. One scenario where a lengthy HAProxy timeout might be of value is if a single instance was having issues and causing a reported overloaded state for the rest. However, this would more likely occur in a multi-physical/virtual-node environment. You’ll have to continue to update me with your thoughts on this subject as you continue your testing.

Any reason why you chose HAProxy over Nginx?

Shelikhoo drafted a configuration using Nginx, which for the time being you can see here:

https://bugs.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/40091#note_2768891

https://pad.riseup.net/p/pvKoxaIcejfiIbvVAV7j#L416

I don’t have a strong preference and I don’t have a lot of experience with either one. haproxy seemed to offer fewer opportunities for error, because the default Nginx installation expects to run a web server, which I would have to disable and ensure it did not fight with snowflake-server for port 443. It just seemed simpler to have one configuration file to edit and restart the daemon.

My Nginx configuration is actually smaller than my HAProxy configuration. All you really need from either the Nginx or HAProxy configuration are the global default settings (especially the file/connection limits) and your TCP streaming settings. As stated previously, I would recommend Nginx simply because it forks additional child processes as connections/demand increase, which is something I could never figure out how to do with HAProxy.
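
As an illustration only (not my actual config; addresses and ports are hypothetical), the TCP-streaming part of an Nginx loadbalancer can be as small as:

stream {
    upstream tor_nodes {
        server 10.0.0.11:9001;
        server 10.0.0.12:9001;
    }
    server {
        listen 9001;
        proxy_pass tor_nodes;
    }
}

plus whatever worker_processes and worker_rlimit_nofile settings you need at the top level.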

I did notice that you’re using the AssumeReachable 1 directive in your torrc files. Are you running into an issue where your Tor instances are failing the reachability test?

It’s because this bridge does not expose its ORPort, which is the recommended configuration for default bridges. The torrc has ORPort 127.0.0.1:auto, so the bridges will never be reachable over their ORPort, which is intentional. Bridges that want to be distributed by BridgeDB need to expose their ORPort, which is an unfortunate technical limitation that makes the bridges more detectable (https://bugs.torproject.org/tpo/core/tor/7349), but for default bridges it’s not necessary. To be honest, I’m not sure that AssumeReachable is even required anymore for this kind of configuration; it’s just something I remember having to do years ago for some reason. It may be superfluous now that we have BridgeDistribution none.

Interesting… This shows my lack of knowledge regarding bridges, as I have never run a bridge. Additionally, it highlights the major differences between running a loadbalanced Tor Bridge and a loadbalanced Tor Relay, and the necessity of using Transparent TCP Streaming Mode when the ORPort is exposed vs Regular TCP Streaming Mode when the ORPort is not exposed. My Nginx Loadbalancer sits on the border of my network, listens on ORPort 9001, and uses Transparent TCP Streaming to loadbalance connections upstream to my Tor Relay Nodes.

Do your Snowflake instances not have issues reporting to different DirectoryAuthorities?

Other than the possible metrics anomalies, I don’t know what kind of issue you mean. It could be that, being a bridge, it has fewer constraints than your relays. A bridge doesn’t have to be listed in the consensus, for example.

Yes… It’s issues with consensus that I run into if I don’t configure my Tor Relay Nodes to send updates to a single DirectoryAuthority. This appears to be another major difference between running a loadbalanced Tor Bridge and a loadbalanced Tor Relay.

With regard to loadbalanced Snowflake sessions, I’m curious to know what connections (i.e., inbound, outbound, directory, control, etc) are being displayed within nyx?

I’m not using nyx. I’m just looking at the bandwidth on the network interface.

If you have time, would you mind installing nyx to validate observed similarities/differences between our loadbalanced configurations?

Your Heartbeat logs continue to appear to be in good health. When keys are rotated,

We’re trying to avoid rotating keys at all. If the read-only files do not work, we’ll instead probably periodically rewrite the state file to push the rotation into the future.

I’m especially interested in this topic. Please keep me updated!

I worried a bit about the “0 with IPv6” in a previous comment. Looking at the bridge-stats files, I don’t think there’s a problem.

I’m glad to hear you feel the IPv6 reporting appears to be a false-negative. Does this mean there’s something wrong with IPv6 Heartbeat reporting?

I don’t know if it’s wrong, exactly. It’s reporting something different than what ExtORPort is providing. The proximate connections to tor are indeed all IPv4.

I see. Perhaps IPv6 connections are less prolific and require more time to ramp?

Are your existing 8 cpu’s only single cores? Is it too difficult to upgrade with your VPS provider?

Sure, there are plenty of ways to increase resources of the bridge, but I feel that’s a different topic.

After expanding my reading of your related “issues,” I see that your VPS provider only offers up to 8 cores. Is it possible to spin up another VPS environment, with the same provider, on a separate VLAN, allowing route/firewall access between the two VPS environments? This way you could test loadbalancing a Tor Bridge over a local network using multiple virtual environments. Perhaps the Tor Project might even assist you with such a short-term investment (I read the meeting notes). :wink:

Thanks for your comments.

Thank you for your responses.

Respectfully,

Gary

···

On Thursday, January 27, 2022, 1:03:25 AM MST, David Fifield david@bamsoftware.com wrote:

This Message Originated by the Sun.

iBigBlue 63W Solar Array (~12 Hour Charge)

  • 2 x Charmast 26800mAh Power Banks
    = iPhone XS Max 512GB (~2 Weeks Charged)

David,

Making secret_onion_key and secret_onion_key_ntor read-only does not quite work, because tor first renames them to secret_onion_key.old and secret_onion_key_ntor.old before writing new files. (Making the *.old files read-only does not work either, because the tor_rename function first unlinks the destination.)
https://gitweb.torproject.org/tor.git/tree/src/feature/relay/router.c?h=tor-0.4.6.9#n497

But a slight variation does work: make secret_onion_key.old and secret_onion_key_ntor.old directories, so that tor_rename cannot rename a file over them. It does result in an hourly BUG stack trace, but otherwise it seems effective.

Directories instead of read-only files. Nice Out-Of-The-Box Thinking!

Now the question becomes whether there are any adverse side effects with the DirectoryAuthorities from the secret_onion_keys not being updated over time.

Excellent Work!

Much Respect.

Gary

···


This Message Originated by the Sun.
iBigBlue 63W Solar Array (~12 Hour Charge)

  • 2 x Charmast 26800mAh Power Banks
    = iPhone XS Max 512GB (~2 Weeks Charged)

> > From your documentation, it sounds like you're running everything on the same machine? When expanding to additional machines, similar to the file limit issue, you'll have to expand the usable ports as well.

> I don't think I understand your point. At 64K simultaneous connections, you run out of source ports for making connection 4-tuple unique, but I don't see how the same or different hosts makes a difference, in that respect.

On many Linux distros, the default ip_local_port_range is 32768 to 61000.

The Tor Project recommends increasing it.

# echo 15000 64000 > /proc/sys/net/ipv4/ip_local_port_range

Thanks, that's a good tip. I added it to the installation guide.

> I'm not using nyx. I'm just looking at the bandwidth on the network interface.

If you have time, would you mind installing nyx to validate observed similarities/differences between our loadbalanced configurations?

I don't have plans to do that.

> > I'm glad to hear you feel the IPv6 reporting appears to be a false-negative. Does this mean there's something wrong with IPv6 Heartbeat reporting?

> I don't know if it's wrong, exactly. It's reporting something different than what ExtORPort is providing. The proximate connections to tor are indeed all IPv4.

I see. Perhaps IPv6 connections are less prolific and require more time to ramp?

No, it's not that. The bridge has plenty of connections from clients that use an IPv6 address, as the bridge-stats file shows:

bridge-ip-versions v4=15352,v6=1160

It's just that, unlike a direct TCP connection as is the case with a guard relay, the client connections pass through a chain of proxies and processes on the way to tor: client → Snowflake proxy → snowflake-server WebSocket server → extor-static-cookie adapter → tor. The last link in the chain is IPv4, and evidently that is what the heartbeat log reports. The client's actual IP address is tunnelled, for metrics purposes, through this chain of proxies and processes to tor, using a special protocol called ExtORPort (see USERADDR at https://gitweb.torproject.org/torspec.git/tree/proposals/196-transport-control-ports.txt). It looks like the bridge-stats descriptor pays attention to the USERADDR information and the heartbeat log does not, that's all.

After expanding my reading of your related "issues," I see that your VPS provider only offers up to 8 cores. Is it possible to spin-up another VPS environment, with the same provider, on a separate VLAN, allowing route/ firewall access between the two VPS environments? This way you could test loadbalancing a Tor Bridge over a local network using multiple virtual environments.

Yes, there are many other potential ways to further expand the deployment, but I do not have much interest in that topic right now. I started the thread for help with a non-obvious point, namely getting past the bottleneck of a single-core tor process. I think that we have collectively found a satisfactory solution for that. The steps after that for further scaling are relatively straightforward, I think. Running one instance of snowflake-server on one host and all the instances of tor on a nearby host is a logical next step.

···

On Sat, Jan 29, 2022 at 02:54:40AM +0000, Gary C. New via tor-relays wrote:

I did not follow the thread closely, but if you want a file or directory
contents unchangeable, and not allowed to rename/delete even by root, there's
the "immutable" attribute (chattr +i).

···

On Fri, 28 Jan 2022 19:58:49 -0700 David Fifield <david@bamsoftware.com> wrote:

> On the matter of onion key rotation, I had the idea of making the onion key files read-only. Roger did some source code investigation and said that it might work to prevent onion key rotation, with some minor side effects. I plan to give the idea a try on a different bridge. The possible side effects are that tor will continue trying and failing to rotate the onion key every hour, and "force a router descriptor rebuild, so it will try to publish a new descriptor each hour."

Making secret_onion_key and secret_onion_key_ntor read-only does not quite work, because tor first renames them to secret_onion_key.old and secret_onion_key_ntor.old before writing new files. (Making the *.old files read-only does not work either, because the `tor_rename` function first unlinks the destination.)
https://gitweb.torproject.org/tor.git/tree/src/feature/relay/router.c?h=tor-0.4.6.9#n497

But a slight variation does work: make secret_onion_key.old and secret_onion_key_ntor.old *directories*, so that tor_rename cannot rename a file over them. It does result in an hourly `BUG` stack trace, but otherwise it seems effective.

I did a test with two tor instances. The rot1 instance had the directory hack to prevent onion key rotation. The rot2 had nothing to prevent onion key rotation.

--
With respect,
Roman


I’m not using nyx. I’m just looking at the bandwidth on the network interface.

If you have time, would you mind installing nyx to validate observed similarities/differences between our loadbalanced configurations?

I don’t have plans to do that.

I appreciate you setting expectations.

I’m glad to hear you feel the IPv6 reporting appears to be a false-negative. Does this mean there’s something wrong with IPv6 Heartbeat reporting?

I don’t know if it’s wrong, exactly. It’s reporting something different than what ExtORPort is providing. The proximate connections to tor are indeed all IPv4.

I see. Perhaps IPv6 connections are less prolific and require more time to ramp?

No, it’s not that. The bridge has plenty of connections from clients that use an IPv6 address, as the bridge-stats file shows:

bridge-ip-versions v4=15352,v6=1160

It’s just that, unlike a direct TCP connection as is the case with a guard relay, the client connections pass through a chain of proxies and processes on the way to tor: client → Snowflake proxy → snowflake-server WebSocket server → extor-static-cookie adapter → tor. The last link in the chain is IPv4, and evidently that is what the heartbeat log reports. The client’s actual IP address is tunnelled, for metrics purposes, through this chain of proxies and processes to tor, using a special protocol called ExtORPort (see USERADDR at https://gitweb.torproject.org/torspec.git/tree/proposals/196-transport-control-ports.txt). It looks like the bridge-stats descriptor pays attention to the USERADDR information and the heartbeat log does not, that’s all.

Ah… Gotcha. Thank you for clarifying.

After expanding my reading of your related “issues,” I see that your VPS provider only offers up to 8 cores. Is it possible to spin-up another VPS environment, with the same provider, on a separate VLAN, allowing route/ firewall access between the two VPS environments? This way you could test loadbalancing a Tor Bridge over a local network using multiple virtual environments.

Yes, there are many other potential ways to further expand the deployment, but I do not have much interest in that topic right now. I started the thread for help with a non-obvious point, namely getting past the bottleneck of a single-core tor process. I think that we have collectively found a satisfactory solution for that. The steps after that for further scaling are relatively straightforward, I think. Running one instance of snowflake-server on one host and all the instances of tor on a nearby host is a logical next step.

Understand. I appreciate the work you have done and the opportunity to compare and contrast Loadbalanced Tor Bridges vs Loadbalanced Tor Relays.

Please update the tor-relays mailing-list with any new findings related to subversion of the onion keys rotation.

Excellent Work!

Respectfully,

Gary

···

On Saturday, January 29, 2022, 9:46:59 PM PST, David Fifield david@bamsoftware.com wrote:

This Message Originated by the Sun.
iBigBlue 63W Solar Array (~12 Hour Charge)

  • 2 x Charmast 26800mAh Power Banks
    = iPhone XS Max 512GB (~2 Weeks Charged)

I did not follow the thread closely, but if you want a file or directory
contents unchangeable, and not allowed to rename/delete even by root, there’s
the “immutable” attribute (chattr +i).

I like the immutable attribute approach. It can be applied to the original secret_onion_key and secret_onion_key_ntor files.
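
A sketch of how that might look on one of the instances from David’s test (untested here; chattr requires root and a filesystem that supports it, e.g. ext4):

# Make the key files immutable so tor cannot rename or overwrite them.
chattr +i /var/lib/tor-instances/rot1/keys/secret_onion_key \
          /var/lib/tor-instances/rot1/keys/secret_onion_key_ntor
# Confirm the attribute is set.
lsattr /var/lib/tor-instances/rot1/keys/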

Appreciate the input.

Respectfully,

Gary

···

On Sunday, January 30, 2022, 2:26:08 AM PST, Roman Mamedov rm@romanrm.net wrote:

On Fri, 28 Jan 2022 19:58:49 -0700 David Fifield <david@bamsoftware.com> wrote:

But a slight variation does work: make secret_onion_key.old and secret_onion_key_ntor.old directories, so that tor_rename cannot rename a file over them. It does result in an hourly BUG stack trace, but otherwise it seems effective.

I did a test with two tor instances. The rot1 instance had the directory hack to prevent onion key rotation. The rot2 had nothing to prevent onion key rotation.

I did not follow the thread closely, but if you want a file or directory contents unchangeable, and not allowed to rename/delete even by root, there’s the “immutable” attribute (chattr +i).

This Message Originated by the Sun.
iBigBlue 63W Solar Array (~12 Hour Charge)

  • 2 x Charmast 26800mAh Power Banks
    = iPhone XS Max 512GB (~2 Weeks Charged)

The load-balanced Snowflake bridge has been running in production since
2022-01-31. Thanks Roger, Gary, and Roman for your input.

Hopefully reproducible installation instructions:
  Snowflake Bridge Installation Guide · Wiki · The Tor Project / Anti-censorship / Team · GitLab
Observations since:
  Add load balancing to bridge (#40095) · Issues · The Tor Project / Anti-censorship / Pluggable Transports / Snowflake · GitLab

Metrics graphs are currently confused by multiple instances of tor
uploading descriptors under the same fingerprint. Particularly in the
interval between 2022-01-25 and 2022-02-03, when a production bridge and
staging bridge were running in parallel, with four instances being used
and another four being mostly unused.
  Relay Search
  Users – Tor Metrics
Since 2022-02-03, it appears that Metrics is showing only one of the
four running instances per day. Because all four instances are about
equally used (as if load balanced, go figure), the values on the graph
are 1/4 what they should be. The reported bandwidth of 5 MB/s is
actually 20 MB/s, and the 2500 clients are actually 10000. All the
necessary data are present in Collector, it's just a question of data
processing. I opened an issue for the Metrics graphs, where you can also
see some manually made graphs that are closer to the true values.
  Graphs for multiple relays that have the same fingerprint (#40022) · Issues · The Tor Project / Network Health / Metrics / Onionoo · GitLab

I started a thread on tor-dev about the issues of onion key rotation and
ExtORPort authentication.
  The tor-dev February 2022 Archive by thread

···

