Anyone experiencing problems with Snowflake proxy?

I run about 30 Snowflake proxies on various IP addresses. They are all Docker containers, and each proxy generally gets about 360-400 connections per hour and relays an average of 1 GB per hour. They had been running mostly smoothly until recently.

For the past few days, they have started to consume a huge amount of RAM and I’m seeing huge spikes in CPU. They consume all the RAM I give them, whether it’s 1 GB or 7 GB, and start crashing and restarting one by one. What’s interesting is that I’m now getting half as many connections per hour but still relaying about the same amount of traffic.

Is anyone here experiencing the same thing? Is Snowflake under some kind of attack, or is it perhaps caused by some change in the configuration of the brokers, or by something else?

I don’t think the problem is on my side because, as I said, they are running on different systems and different IP addresses, and 30 proxies can’t all develop the same problem at the same time.

Thank you.

2 Likes

Hi there. Have you updated them? There were changes merged recently. Have a look at the merge requests on the snowflake repo. There may be some clues there.

FWIW, I have one very long-running proxy which hasn’t exhibited any of the behaviour you’re seeing, although it’s on a 12-core machine with 32 GB of RAM and all it’s running is Snowflake, so there’s no performance hit. My proxies on lower-powered machines run up the RAM and CPU usage of the snowflake process itself, but don’t really affect overall memory pressure that much.

1 Like

Thank you for the response. I doubt those issues have anything to do with it. As for updating, I’m using the Docker image, and it’s the latest image, which is two months old. Nevertheless, I have pruned my images and pulled a fresh image a few times in the past couple of days, with no effect.

The problem is the huge amount of data being relayed, which is very unusual. Normally, with 400 connections an hour, I’d relay about 1.5 GB of data at the busiest times. Now it’s as high as 9 GB at times, with half as many connections. Here’s a sample of my logs:

~~~
2023/03/17 12:39:36 Proxy starting
2023/03/17 12:39:43 NAT type: unrestricted
2023/03/17 13:39:36 In the last 1h0m0s, there were 156 connections. Traffic Relayed ↑ 1964300 KB, ↓ 252717 KB.
2023/03/17 14:39:36 In the last 1h0m0s, there were 221 connections. Traffic Relayed ↑ 2671648 KB, ↓ 546554 KB.
2023/03/17 15:39:36 In the last 1h0m0s, there were 209 connections. Traffic Relayed ↑ 3924139 KB, ↓ 530694 KB.
2023/03/17 16:39:36 In the last 1h0m0s, there were 229 connections. Traffic Relayed ↑ 6827489 KB, ↓ 580653 KB.
2023/03/17 17:39:36 In the last 1h0m0s, there were 222 connections. Traffic Relayed ↑ 4747165 KB, ↓ 497418 KB.
~~~
1 Like

P.S.
This is a 6-core, 12-thread bare-metal server with 64 GB of RAM. I can increase or decrease the RAM for the container as needed, but no matter what I set it to, it’s maxed out within hours.
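
For reference, the container is started with a memory cap more or less like this (the image name and limits shown are illustrative, not necessarily what anyone else should use):

~~~
# Example of running a memory-capped Snowflake proxy container.
# Image name and limits are illustrative; adjust to your own setup.
# --network host is commonly used so the proxy gets an unrestricted NAT.
# --memory sets a hard RAM cap; setting --memory-swap equal to it disables swap.
docker run -d \
  --name snowflake-proxy \
  --restart unless-stopped \
  --network host \
  --memory 4g \
  --memory-swap 4g \
  thetorproject/snowflake-proxy
~~~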

The spikes are due to the higher-than-normal amount of data being processed and relayed. If this were a Tor relay, I would say it’s under a DDoS attack, but I don’t know the ins and outs of the Snowflake proxy well enough to make an accurate guess here.

I don’t see this issue with my Docker setup.

2 Snowflake instances, each with roughly 100-200 users, and RAM usage is at 3.5 GB.
Traffic also looks a bit below 1 GB/hour per instance.

~~~
2023/03/17 12:34:22 In the last 1h0m0s, there were 164 connections. Traffic Relayed ↑ 701163 KB, ↓ 304479 KB.
2023/03/17 13:34:22 In the last 1h0m0s, there were 156 connections. Traffic Relayed ↑ 730531 KB, ↓ 178194 KB.
2023/03/17 14:34:22 In the last 1h0m0s, there were 132 connections. Traffic Relayed ↑ 538262 KB, ↓ 111091 KB.
2023/03/17 15:34:22 In the last 1h0m0s, there were 150 connections. Traffic Relayed ↑ 664891 KB, ↓ 191459 KB.
2023/03/17 16:34:22 In the last 1h0m0s, there were 204 connections. Traffic Relayed ↑ 811295 KB, ↓ 232819 KB.
2023/03/17 17:34:22 In the last 1h0m0s, there were 120 connections. Traffic Relayed ↑ 239254 KB, ↓ 146834 KB.
2023/03/17 18:34:22 In the last 1h0m0s, there were 116 connections. Traffic Relayed ↑ 231836 KB, ↓ 245149 KB.
~~~

My guess is that after this fix, connections started to live longer.
Combined with a memory leak that correlates with the user/connection count, this results in an increase in RAM consumption.

Yes, I’m noticing the same data spikes and corresponding impact on RAM usage.

1 Like

I had to bail on Docker because, despite the ability to dial back RAM for the container, it was still a brutal performance hit on the laptop I was using at the time. Even without Docker there is still a noticeable impact, even on more powerful machines.

I know that’s kinda beside the point here since the issue isn’t just at the operator level. So thanks for opening this issue as it’s good feedback for the snowflake developers who’ve been trying to deal with the effect of the massive increase in usage. Would you consider opening an issue on the snowflake repo?

3 Likes

It most likely is snowflake!140 / snowflake#40262. That fix / deployment uncorked a big performance increase that enabled more use of bandwidth. See the snowflake-01 graph at Deploy snowflake-server for QueuePacketConn buffer reuse fix (#40260) (#40262) · Issues · The Tor Project / Anti-censorship / Pluggable Transports / Snowflake · GitLab

The reduced number of connections is probably because the bug that was fixed could cause spurious disconnection errors; see here.

1 Like

Thank you for the response.

Yes, thank you for the links; I’d read them. Just to clarify your conclusion: are you saying that in the past we were falsely reporting twice as many Snowflake users, because they would get disconnected and come back over and over again, and that the current numbers are the accurate ones?

Also, based on your conclusion, would I be correct to assume that the reason we have a spike in bandwidth usage is that, in the past, people were getting cut off and couldn’t use Snowflake as freely, and now that they have a stable connection, they are using 5-6 times the bandwidth? In other words, this is not a bug, it’s a feature.

Do we have any reports from users indicating a huge performance spike when using Snowflake?

Currently I don’t have a bandwidth limitation, as they are running on two bare-metal servers at two different providers. But if I did have a bandwidth limit, like the majority of people, at this rate I’d have to shut down 20-25 of my proxies and run only 5-10 to use the same bandwidth as before. Not to mention the increase in RAM usage.

I personally don’t have a problem with that if this is indeed a feature and benefits the users, but are we sure?

Thank you for your time.

No, estimating the number of users doesn’t work by counting connections; it works by counting directory requests, which Tor clients make every few hours. The number of distinct connections does not affect how often a Tor client makes a directory request, so the estimated number of users doesn’t change. See:

In the snowflake-01 relay search graphs, you can see that the increase in bandwidth since 2023-03-13 has not been matched by an increase in users, at least not yet. It looks like the same number of users is getting better bandwidth.

Yes, I think that’s right. I don’t know why it wasn’t detected as a larger problem before now. Because of the nature of the bug, it may be that it wasn’t a bigger problem when there were fewer overall users.

I don’t think it’s a 5–6× increase in bandwidth for all users or all proxies. More like 2× on average. Your proxies might be disproportionately affected if they have good connectivity.

I am not aware of any, but I can testify to it myself.

That’s fine. Shut down however many proxies you need to stay within your resource limits. I have little doubt that the snowflake#40262 deployment is the cause of traffic changes you are seeing. I restarted the process myself and watched the bandwidth use immediately increase as a result.

Thanks for running Snowflake proxies.

1 Like

Yes. I can also testify to it. Here’s one example:

2023/03/18 01:56:45 In the last 1h0m0s, there were 8 connections. Traffic Relayed ↑ 1182862 KB, ↓ 179716 KB.

That’s the middle of the night in Iran.

I’m also seeing hours with ~200 connections producing 8 GB+ of traffic, where prior to the snowflake#40262 deployment it would have been about 2 GB.

1 Like

Thanks for clarifying things. I have no doubt the tweak is responsible for the lower connection numbers and their accuracy, but to be honest with you, I don’t believe the tweak is responsible for the huge spike I’m experiencing in traffic.

Any effect that a tweak like this has would generally be proportionate. In that case, you’d see proportionately higher bandwidth across the board, as you observed. It shouldn’t increase the traffic 8-fold on some containers and have little effect on others.

As much as I like to avoid the use of the word “attack”, everything about this smells like it.

How likely is it for someone to use a Snowflake proxy as an entry point to attack a third party or the Tor network? Could someone point at a Snowflake proxy and flood it with data destined for a website or a guard relay? Wouldn’t the proxy do its best to relay as much of it as it can, to the point of crashing?

Some of my containers are maxing out 8 GB of RAM within 5 hours, running out of memory, crashing, and restarting.

1 Like

I hate to flood the forum with log files but I’m going to post a fraction of what I get in the log to show what’s happening.

~~~
runtime stack:
runtime.throw({0x927582?, 0x400000?})
        /usr/local/go/src/runtime/panic.go:992 +0x71
runtime.sysMap(0xc046000000, 0x40b969?, 0xc00003e548?)
        /usr/local/go/src/runtime/mem_linux.go:189 +0x11b
runtime.(*mheap).grow(0xcc7a00, 0xc00003e400?)
        /usr/local/go/src/runtime/mheap.go:1413 +0x225
runtime.(*mheap).allocSpan(0xcc7a00, 0x1, 0x0, 0x11)
        /usr/local/go/src/runtime/mheap.go:1178 +0x171
runtime.(*mheap).alloc.func1()
        /usr/local/go/src/runtime/mheap.go:920 +0x65
runtime.systemstack()
        /usr/local/go/src/runtime/asm_amd64.s:469 +0x49

goroutine 194426 [running]:
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:436 fp=0xc001d7bbd0 sp=0xc001d7bbc8 pc=0x460d60
runtime.(*mheap).alloc(0x0?, 0xc001d7bd30?, 0x80?)
        /usr/local/go/src/runtime/mheap.go:914 +0x65 fp=0xc001d7bc18 sp=0xc001d7bbd0 pc=0x426705
runtime.(*mcentral).grow(0x2000?)
        /usr/local/go/src/runtime/mcentral.go:244 +0x5b fp=0xc001d7bc60 sp=0xc001d7bc18 pc=0x41751b
runtime.(*mcentral).cacheSpan(0xcd8900)
        /usr/local/go/src/runtime/mcentral.go:164 +0x30f fp=0xc001d7bcb8 sp=0xc001d7bc60 pc=0x41734f
runtime.(*mcache).refill(0x7f2baac835b8, 0x11)
        /usr/local/go/src/runtime/mcache.go:162 +0xaf fp=0xc001d7bcf0 sp=0xc001d7bcb8 pc=0x4169cf
runtime.(*mcache).nextFree(0x7f2baac835b8, 0x11)
        /usr/local/go/src/runtime/malloc.go:886 +0x85 fp=0xc001d7bd38 sp=0xc001d7bcf0 pc=0x40ca65
runtime.mallocgc(0x60, 0x0, 0x1)
        /usr/local/go/src/runtime/malloc.go:1085 +0x4e5 fp=0xc001d7bdb0 sp=0xc001d7bd38 pc=0x40d0e5
runtime.makechan(0x0?, 0x0)
        /usr/local/go/src/runtime/chan.go:96 +0x11d fp=0xc001d7bdf0 sp=0xc001d7bdb0 pc=0x40563d
github.com/pion/sctp.(*ackTimer).start(0xc00143ca80)
        /go/pkg/mod/github.com/pion/sctp@v1.8.2/ack_timer.go:51 +0x9f fp=0xc001d7be40 sp=0xc001d7bdf0 pc=0x6c785f
github.com/pion/sctp.(*Association).handleChunkEnd(0xc000b16a80)
        /go/pkg/mod/github.com/pion/sctp@v1.8.2/association.go:2238 +0x9e fp=0xc001d7be80 sp=0xc001d7be40 pc=0x6d673e
github.com/pion/sctp.(*Association).handleInbound(0xc000b16a80, {0xc03c7f3700?, 0x2000?, 0xc001b7ff70?})
        /go/pkg/mod/github.com/pion/sctp@v1.8.2/association.go:608 +0x2ea fp=0xc001d7bf28 sp=0xc001d7be80 pc=0x6cb54a
github.com/pion/sctp.(*Association).readLoop(0xc000b16a80)
        /go/pkg/mod/github.com/pion/sctp@v1.8.2/association.go:521 +0x1cd fp=0xc001d7bfc8 sp=0xc001d7bf28 pc=0x6ca1ed
github.com/pion/sctp.(*Association).init.func2()
        /go/pkg/mod/github.com/pion/sctp@v1.8.2/association.go:339 +0x26 fp=0xc001d7bfe0 sp=0xc001d7bfc8 pc=0x6c9006
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc001d7bfe8 sp=0xc001d7bfe0 pc=0x462e41
created by github.com/pion/sctp.(*Association).init
        /go/pkg/mod/github.com/pion/sctp@v1.8.2/association.go:339 +0xd0

goroutine 107 [IO wait]:
internal/poll.runtime_pollWait(0x7f2b84015f68, 0x72)
        /usr/local/go/src/runtime/netpoll.go:302 +0x89
internal/poll.(*pollDesc).wait(0xc0001da480?, 0xc00024c000?, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:83 +0x32
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:88
internal/poll.(*FD).Read(0xc0001da480, {0xc00024c000, 0x13d6, 0x13d6})
        /usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc0001da480, {0xc00024c000?, 0xc00028b820?, 0xc00024c02a?})
        /usr/local/go/src/net/fd_posix.go:55 +0x29
net.(*conn).Read(0xc000112108, {0xc00024c000?, 0x1ffffffffffffff?, 0x39?})
        /usr/local/go/src/net/net.go:183 +0x45
crypto/tls.(*atLeastReader).Read(0xc04416ca08, {0xc00024c000?, 0x0?, 0x8?})
        /usr/local/go/src/crypto/tls/conn.go:785 +0x3d
bytes.(*Buffer).ReadFrom(0xc000155078, {0x9daf60, 0xc04416ca08})
        /usr/local/go/src/bytes/buffer.go:204 +0x98
crypto/tls.(*Conn).readFromUntil(0xc000154e00, {0x9db920?, 0xc000112108}, 0x13b1?)
        /usr/local/go/src/crypto/tls/conn.go:807 +0xe5
crypto/tls.(*Conn).readRecordOrCCS(0xc000154e00, 0x0)
        /usr/local/go/src/crypto/tls/conn.go:614 +0x116
crypto/tls.(*Conn).readRecord(...)
        /usr/local/go/src/crypto/tls/conn.go:582
crypto/tls.(*Conn).Read(0xc000154e00, {0xc0002c2000, 0x1000, 0x7f0360?})
        /usr/local/go/src/crypto/tls/conn.go:1285 +0x16f
bufio.(*Reader).Read(0xc0001b3c80, {0xc0001244a0, 0x9, 0x7fe722?})
        /usr/local/go/src/bufio/bufio.go:236 +0x1b4
io.ReadAtLeast({0x9dae00, 0xc0001b3c80}, {0xc0001244a0, 0x9, 0x9}, 0x9)
        /usr/local/go/src/io/io.go:331 +0x9a
io.ReadFull(...)
        /usr/local/go/src/io/io.go:350
net/http.http2readFrameHeader({0xc0001244a0?, 0x9?, 0xc0442860f0?}, {0x9dae00?, 0xc0001b3c80?})
        /usr/local/go/src/net/http/h2_bundle.go:1566 +0x6e
net/http.(*http2Framer).ReadFrame(0xc000124460)
        /usr/local/go/src/net/http/h2_bundle.go:1830 +0x95
net/http.(*http2clientConnReadLoop).run(0xc00004bf98)
        /usr/local/go/src/net/http/h2_bundle.go:8819 +0x130
net/http.(*http2ClientConn).readLoop(0xc00024a300)
        /usr/local/go/src/net/http/h2_bundle.go:8715 +0x6f
created by net/http.(*http2Transport).newClientConn
        /usr/local/go/src/net/http/h2_bundle.go:7443 +0xa65
~~~
This goes on for pages and pages in the log.
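
If it helps anyone compare, something like this should pull just the hourly summaries and the crash markers out of a container’s log (the container name and the grep pattern are only examples; adjust them to your setup):

~~~
# Filter the proxy log down to hourly traffic summaries and crash markers.
# Container name and pattern are examples; adjust to your setup.
docker logs snowflake-proxy 2>&1 \
  | grep -E 'Traffic Relayed|NAT type|fatal error|panic'
~~~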

I hope I can reassure you. Clients cannot cause a proxy to attack or even connect to an arbitrary web site or relay. There is no way for a client to directly control what endpoint the proxy connects to; the client can only give the broker a bridge fingerprint and then the broker maps the bridge fingerprint to a domain name using a local JSON database. The broker doesn’t permit arbitrary bridge fingerprints, only the ones it knows about (currently 2 of them). On top of that, proxies will not connect to just any domain name the broker gives them; proxies themselves enforce the rule that they will only connect to subdomains of snowflake.torproject.net (e.g. 01.snowflake.torproject.net, 02.snowflake.torproject.net). So even a compromise of the broker could not cause proxies to attack arbitrary targets.

I don’t see the logic behind your claim that an increase in bandwidth must be proportional across all proxies. Some proxies may have been on lower-speed residential connections, and been at their bandwidth maximum even before the fix on 2023-03-13 that let clients use more bandwidth. Those proxies that were maxed out will not have seen any increase in total bandwidth. Your proxies, on the other hand, have unusually good connectivity, which means that they have room to grow when clients start to use more bandwidth. Evidently the total increase in bandwidth use was about 2× on average; some proxies saw less than that, some more.

I’m sorry your proxies are crashing. That’s clearly a bug and should not happen, but there’s a clear reason why it should have begun happening a few days ago. I suggest that you try using the -capacity command-line option to limit the number of clients served concurrently. Start with -capacity 10, see how it goes, and increase from there if resources permit.
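
For example, with the standalone proxy binary, that would look something like the following; if you are running the Docker image, check how it passes command-line arguments through to the proxy, since the exact invocation below is only an illustration:

~~~
# Limit the proxy to at most 10 concurrent clients; raise the number
# later if CPU and RAM usage stay within your limits.
./proxy -capacity 10
~~~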

1 Like

Sounds good. Thank you for the time you’ve taken to explain and clarify things for me. It’s very much appreciated.

Yes, it all depends on how your provider implements their network. It’s one of the main reasons people in Iran have a much harder time accessing some servers than others, and why you hear some people say Snowflake or V2Ray proxies work in Iran while others say they don’t. But that’s a subject for another post.

To be honest, the main reason I started serving Snowflake proxies was hearing about what’s going on in Iran, and if a proxy can help 200 people as opposed to 10, I’ll take 200. I just have to figure out a way to make it work and hopefully figure out what the bug is. The resources are not an issue. The problem is that the proxy maxes out whatever amount of RAM you give it within hours, whether that’s 2 GB or 10.

Again I thank you for the time you’ve taken.

To do this, you need either to know how Snowflake in particular and programming in general work, or to have the social skills to convince developers to look into this problem.

If you have an unlimited amount of RAM, then what’s the problem?

Another way to help with bug hunting is to collect data.
You can collect data about RAM usage over time, test what I said about the correlation, and share the resulting information with other people in the corresponding bug report (a minimal sketch of such data collection is below).
If you don’t know what correlation means, I will try to explain it.
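
For example, something like this would record the container’s memory usage every few minutes, which you can later compare with the connection counts in the log (the container name is only an example):

~~~
# Sample RAM and CPU usage of the proxy container every 5 minutes.
# Adjust the container name and interval to your setup.
while true; do
  echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') $(docker stats --no-stream \
    --format '{{.MemUsage}} {{.CPUPerc}}' snowflake-proxy)" >> snowflake-ram.log
  sleep 300
done
~~~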

Looks like someone got up on the wrong side of the bed. :slight_smile: I say that because you generally seem very helpful in these forums, and I appreciate that.

I see that you answered your own question by quoting me answering that question. :grinning:

The subject of this post clearly explains my intentions. I’m asking whether others are experiencing this problem. If enough people are experiencing it, then it’s a bug. If it’s just me and a couple of other people, then it might be my particular setup, my OS, or a bunch of other things, and that would make it my problem to solve and not the subject of a bug report.

If this is indeed a bug, I will decide how much time I’d like to spend filing a bug report. I promise to ask you to teach me what correlation means if I decide to do so.

My guess is that your guess is right. By the way, that bug report is two months old with no clear response, and I doubt that has anything to do with a lack of social skills on the part of the participants.

Cheers.

Thank you for your feedback, @Enkidu-6 & @Quartermarsh.
It will be useful for those who are looking to understand and do not have your expertise with this type of incident.
Best regards

1 Like

Other people see the increase in RAM consumption over time as well.
However, the growth in consumption almost stops after several days of uptime.
For me, the "target" value is somewhere between 1 and 2 GB with 100-200 simultaneously connected users.
I saw growth to ~8 GB only once, and it has not reproduced since then.

If you show exactly how RAM consumption behaves for you, other people can compare.
If you test the "maxes out whatever amount" behaviour with a low -capacity, it may help narrow down the range of possible problems.

It looks like the developers have enough technical skill to solve such problems.
But something is still wrong with how things are going here.
It would be better either to fix the problems with how problems get solved, or at least to clarify what is wrong and why.
Tor is a useful project overall, but weeks-long lags in conversations make it too hard to make useful contributions, at least for me.
(However, I can’t promise to make a huge contribution.)

1 Like