[tor-project] Constructing a real-world dataset for studying website fingerprinting

Hello Tor friends,

We are planning to construct a real-world dataset for studying Tor website
fingerprinting that researchers and developers can use to evaluate potential
attacks and to design informed defenses that improve Tor’s resistance to such
attacks. We believe the dataset will help us make Tor safer, because it will
allow us to design defenses that can be shown to protect *real* Tor traffic
instead of *synthetic* traffic. This will help ground our evaluation of proposed
defenses in reality and help us more confidently decide which, if any, defense
is worth deploying in Tor.

We have submitted detailed technical plans for constructing a dataset to the Tor
Research Safety Board and after some iteration have arrived at a plan in which
we believe the benefits outweigh the risks. We are now sharing an overview of
our plan with the broader community to provide an opportunity for comment.

More details are below. Please let us know if you have comments.

Peace, love, and positivity,
Rob

P.S. Apologies for posting near the end of the work-week, but I wanted to get
this out in case people want to talk to me about it in Costa Rica.


===

BACKGROUND

Website fingerprinting attacks distill the traffic patterns observed between a
client and its Tor entry into a sequence of packet directions: -1 if a packet is
sent toward the destination, +1 if a packet is sent from the destination toward
the client. An attacker can collect a list of these directions and then train
machine learning classifiers to associate a website domain name or URL with the
particular list of directions observed when visiting that website. Once this
training is done, the attacker can use the trained model to predict which
website corresponds to any new list of directions it observes.

For example, suppose [-1,-1,+1,+1] is associated with website1 and [-1,+1,-1,+1]
is associated with website2. There are two steps in an attack:

Step 1:
In the first step the attacker itself visits website1 and website2 many times
and learns:
[-1,-1,+1,+1] -> website1
[-1,+1,-1,+1] -> website2
It trains a machine learning model to learn this association.

Step 2:
In the second step, with the trained model in hand, the attacker monitors a Tor
client (maybe the attacker is the client’s ISP, or some other entity in a
position to observe a client’s traffic) and when it observes the pattern:
[-1,-1,+1,+1]
the model will predict that the client went to website1. This example is
*extremely* simplified, but I hope it gives an idea of how the attack works.
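
To make the two steps concrete, here is a toy sketch in Python using
scikit-learn's nearest-neighbor classifier (my illustrative choice here; real
attacks use thousands of much longer traces and far stronger models):

# Step 1: the attacker visits each website many times and records the
# direction sequences (-1 toward the destination, +1 toward the client).
from sklearn.neighbors import KNeighborsClassifier

training_traces = [
    [-1, -1, +1, +1],  # observed while visiting website1
    [-1, -1, +1, +1],
    [-1, +1, -1, +1],  # observed while visiting website2
    [-1, +1, -1, +1],
]
training_labels = ["website1", "website1", "website2", "website2"]

model = KNeighborsClassifier(n_neighbors=1)
model.fit(training_traces, training_labels)

# Step 2: the attacker observes a new trace on the wire and asks the
# trained model which website it most likely corresponds to.
print(model.predict([[-1, -1, +1, +1]]))  # -> ['website1']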

PROBLEM

Because researchers don’t know which websites Tor users are visiting, it’s hard
to do a very good job creating a representative dataset that can be used to
accurately evaluate attacks or defenses (i.e., to emulate steps 1 and 2). The
standard technique has been to just select popular websites from top website
lists (e.g., Alexa or Tranco) and then set up a Tor webpage crawler to visit the
front-pages of those websites over and over and over again. Then they use that
data to write papers. This approach has several problems:

- Low traffic diversity: Tor users don’t only visit front-pages. For example,
they may conduct a web search and then click a link that brings them directly to
an internal page of a website. The patterns produced from front-page visits may
be simpler and unrepresentative of the patterns that would be observed from more
complicated internal pages.

- Low browser diversity: Research by Marc Juarez [0] and others has shown that
the webpage crawlers used by researchers lack diversity in important aspects,
causing us to overestimate the accuracy of WF attacks. For example, the browser
version, configuration choices, variation in behavior (e.g., using multiple
tabs at once), and network location of the client can all significantly affect
the observable traffic patterns in ways that a crawler methodology does not
capture.

- Data staleness: Researchers collect data over a short time-frame and then
evaluate the attacks assuming this static dataset. In the real world, websites
are being updated over time, and a model trained on an old version of a website
may not transfer to the new version.

In addition to the above problems in methodology, current research also has
incidental consequences for the Tor network:

- Network overhead: machine learning is a hot topic and several research groups
have crawled tens of thousands of websites over Tor many times each. While each
individual page load might be insignificant compared with the normal usage of
Tor, crawling does add additional load to the network and can contribute to
congestion and performance bottlenecks.

Researchers have been designing attacks that are shown to be extremely accurate
using the above synthetic crawling methodology. But because of the above
problems, we don’t properly understand the *true* threat of the attack against
the Tor network. It is possible that the simplicity of the crawling approach is
what makes the attacks work well, and that the attacks would not work as well if
evaluated with more realistic traffic and browser diversity.

PLAN

So our goal is to construct a real-world dataset for studying Tor website
fingerprinting that researchers and developers can use to evaluate potential
attacks and to design informed defenses that improve Tor’s resistance to such
attacks. This dataset would enable researchers to use a methodology that does
not have any of the above limitations. We believe that such a dataset will help
us make Tor safer, because it will allow us to design defenses that can be shown
to protect *real* Tor traffic instead of *synthetic* traffic. This would lead to
a better understanding of proposed defenses and enable us to more confidently
decide which, if any, defense is worth deploying in Tor.

The dataset will be constructed from a 13-week exit relay measurement that is
based on the measurement process established in recent work [1]. The primary
information being measured is the directionality of the first 5k cells sent on a
measurement circuit, and a keyed-HMAC of the first domain name requested on the
circuit. We also measure relative circuit and cell timestamps (relative to the
start of measurement). The measurement data is compressed, encrypted using a
public-key encryption scheme (the secret key is stored offline), and then
temporarily written to persistent storage before being securely retrieved from
the relay machine.
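
As an illustrative sketch of what a single measurement record might contain
(the field names and layout below are assumptions for exposition, not our
actual collection code):

import hashlib, hmac, json

HMAC_KEY = b"placeholder secret key"  # hypothetical; the real key is not stored with the data
MAX_CELLS = 5000                      # "first 5k cells" per the plan above

def make_record(first_domain, directions, cell_times, circuit_time):
    # The raw domain never appears in the data, only its keyed HMAC.
    domain_hmac = hmac.new(HMAC_KEY, first_domain.encode(), hashlib.sha256).hexdigest()
    return {
        "domain_hmac": domain_hmac,
        "directions": directions[:MAX_CELLS],  # -1/+1 per cell
        "cell_times": cell_times[:MAX_CELLS],  # relative to measurement start
        "circuit_time": circuit_time,          # relative to measurement start
    }

record = make_record("example.com", [-1, -1, +1, +1], [0.0, 0.01, 0.31, 0.32], 12.5)
print(json.dumps(record))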

We hope that this dataset can become a standard tool that website fingerprinting
researchers and developers can use to (1) accelerate their study of attacks and
defenses, and (2) produce evaluation and results that are more directly
applicable to the Tor network. We plan to share it, upon request, only with
other researchers who appear to come from verifiable research organizations,
such as students from well-known universities. We will require researchers with
whom we share the data to (1) keep the data private, and (2) direct others who
want a copy of the data to us, to mitigate unauthorized sharing.

[0] A Critical Evaluation of Website Fingerprinting Attacks. Juarez et al., CCS 2014. https://www1.icsi.berkeley.edu/~sadia/papers/ccs-webfp-final.pdf

[1] Online Website Fingerprinting: Evaluating Website Fingerprinting Attacks on Tor in the Real World. Cherubin et al., USENIX Security 2022. https://www.usenix.org/conference/usenixsecurity22/presentation/cherubin


I suppose this is kind of a non-question, since you wouldn’t be doing it otherwise, but I am surprised that associating the traffic patterns to a single key, that of the first domain name, is sufficient. Every page or query made to that domain (e.g. duckduckgo) will have the same key, with potentially a lot of entirely disparate traffic patterns.

Obviously this is limited by what you can technically achieve in this scenario: you have the plaintext DNS requests, and everything else is going to be TLS-encrypted. The alternative would be to instrument a tor client/browser and find volunteers to opt-in to their data collection.

-tom


The primary
information being measured is the directionality of the first 5k cells sent on a
measurement circuit, and a keyed-HMAC of the first domain name requested on the
circuit.

I suppose this is kind of a non-question, since you wouldn't be doing it otherwise, but I am surprised that associating the traffic patterns to a single key, that of the first domain name, is sufficient. Every page or query made to that domain (e.g. duckduckgo) will have the same key, with potentially a lot of entirely disparate traffic patterns.

You are absolutely correct! I think it’s worth exploring the extent to which those different traffic patterns from the different subpages, perhaps even loaded in different orders, can be combined to identify the site, and how we can protect traffic in this scenario. This really is web*site* fingerprinting more than web*page* fingerprinting.

Obviously this is limited by what you can technically achieve in this scenario: you have the plaintext DNS requests, and everything else is going to be TLS-encrypted. The alternative would be to instrument a tor client/browser and find volunteers to opt-in to their data collection.

Yes, the volunteer approach has its own limitations too, particularly in terms of potential bias, lower traffic/browser diversity, etc. It’s also an entirely different beast from a research perspective because it involves direct participation from human subjects. I think this approach could be useful, especially if the volunteer pool was very large. But I think the one we’ve proposed is easier to get started with.


Thanks for the comments, tom!

Peace, love, and positivity,
Rob


Hi Rob,

Your earlier work on online WF and this proposal are exciting. We can learn a lot from such a proposed dataset, so please collect it, but I think that there are some challenges around framing it as a dataset for assessing WF attacks and defenses.

To be brief: I think we can solve traffic diversity with BigEnough-style datasets that never overlap subpages [0], and data staleness and network overheads are minor problems. Capable attackers (the ones we should consider) can build closed worlds of ~100 websites [1], so there is no need for open-world data.

The biggest strength of the proposed dataset is also its biggest weakness: capturing real-world user diversity in traces. It is more user/client diversity than browser diversity (in the sense of being broader), because the dataset captures various browsers as well as different Tor client implementations, other Tor configurations like VPN mode, Tor running on routers, in strange VMs, headless browsers, etc., a wide range of network configurations, and probably more that I missed, all used by the very diverse Tor-network userbase. While capturing this diversity is super interesting and valuable in different ways, it also risks being too filled with junk to help assess WF attacks or defenses.

Do you have any approaches or thoughts around pruning the dataset or further refining labels beyond the first domain? Some worries:

- I fear that we will likely have to do much guesswork when interpreting results based on the dataset. Do we want to be assessing WF defenses and attacks based on random torified curl scrapers, python scripts, and who knows what? Right now, we are creeping closer and closer to Internet ≈ web, reflected in Tor traffic. Without any ground truth, how can we avoid most labels being junk? What does it mean to reach 50% accuracy when evaluating on the dataset? If a WF attack trained on TB "fails" to associate a website visit made using curl, that's probably a feature (not what the attacker was after anyway), not a bug. It would be fantastic if it were possible to have a sizeable subset of traces for some/most labels containing known configurations (or even just confirmed website visits with, say, vanilla TB; maybe some more inline domain fingerprinting to set a bool for key website domains of the Tranco top-100 requested on the circuit, or something?).

- Filled with an unknown rate of junk, the dataset alone will be insufficient to train WF attacks in noisy environments (you want ~1000+ samples per class of something coherent, at least with current state-of-the-art DL models pushed to their limits; as we move on to transformers, probably even more). Suppose you subscribe to the claim that data staleness is a factor. In that case, researchers cannot even go through the very time-consuming process of collecting adequate labeled training data in the same way to show success on the proposed dataset. The exit collection vantage point makes this worse.

- Using the dataset as a basis for simulating defenses, one would have to simulate the corresponding client and middle traces to feed into the simulated defenses based on exit traces. Kinda messy with poor signalling for Tor-network characteristics in the dataset. When collecting at the client, getting realistic client traces (for the particular configuration) is basically for free. At the same time, middles change for every circuit, so a half-assed approach seems to get you far (speaking from experience of half-assing things here!). I worry about half-assing both client and middle traces, however.

If we make this proposed dataset and its method the bar of doing "real-world WF", it might lead to too high of a bar. We want more real-world implementations and data collection with defenses, not less, I think.

Sorry if the above may come across as a bit negative, it's not my intent: I *want* the dataset you describe, we can learn a lot from it for sure. Wish I had a chance to chat in person in Costa Rica! Please don't feel obliged to reply, just food for thought.

Best,
Tobias

[0]: "SoK: A Critical Evaluation of Efficient Website Fingerprinting Defenses", https://www-users.cse.umn.edu/~hoppernj/sok_wf_def_sp23.pdf
[1]: "Website Fingerprinting with Website Oracles", https://petsymposium.org/2020/files/papers/issue1/popets-2020-0013.pdf


Hi Rob,

Hi Tobias!

Your earlier work on online WF and this proposal are exciting. We can learn a lot from such a proposed dataset, so please collect it, but I think that there are some challenges around framing it as a dataset for assessing WF attacks and defenses.

To be brief: I think we can solve traffic diversity with BigEnough-style datasets that never overlap subpages [0], and data staleness and network overheads are minor problems. Capable attackers (the ones we should consider) can build closed worlds of ~100 websites [1], so there is no need for open-world data.

The biggest strength of the proposed dataset is also its biggest weakness: capturing real-world user diversity in traces. It is more user/client diversity than browser diversity (in the sense of being broader), because the dataset captures various browsers as well as different Tor client implementations, other Tor configurations like VPN mode, Tor running on routers, in strange VMs, headless browsers, etc., a wide range of network configurations, and probably more that I missed, all used by the very diverse Tor-network userbase. While capturing this diversity is super interesting and valuable in different ways, it also risks being too filled with junk to help assess WF attacks or defenses.

Do you have any approaches or thoughts around pruning the dataset or further refining labels beyond the first domain?

I should note that we will include both the HMAC of the first domain and an HMAC of its shortest private suffix (computed using Mozilla’s public suffix list and libpsl (publicsuffix.org)).

IMO, the first domain is by far the most important. Other domains accessed on a typical TB circuit will primarily be due to fetching the objects embedded in pages during page loads. We’ll get a new circuit if the domain in the URL bar changes, and that circuit will have its own first domain.

I’m trying to collect the minimal thing here to balance privacy and utility; while not perfect from a ML “give me all the data” perspective, I think the dataset would allow us to make significant moves in the right direction.
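
For illustration, the two labels could be derived roughly as follows; this sketch uses the tldextract Python library as a stand-in for libpsl (both consume Mozilla’s public suffix list), and the key is a placeholder:

import hashlib, hmac
import tldextract  # downloads/caches Mozilla's public suffix list on first use

KEY = b"placeholder secret key"

def hmac_hex(value: str) -> str:
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

first_domain = "forums.example.co.uk"
private_suffix = tldextract.extract(first_domain).registered_domain  # "example.co.uk"

labels = {"domain_hmac": hmac_hex(first_domain), "suffix_hmac": hmac_hex(private_suffix)}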

Some worries:

- I fear that we will likely have to do much guesswork when interpreting results based on the dataset. Do we want to be assessing WF defenses and attacks based on random torified curl scrapers, python scripts, and who knows what? Right now, we are creeping closer and closer to Internet ≈ web, reflected in Tor traffic. Without any ground truth, how can we avoid most labels being junk? What does it mean to reach 50% accuracy when evaluating on the dataset? If a WF attack trained on TB "fails" to associate a website visit made using curl, that's probably a feature (not what the attacker was after anyway), not a bug. It would be fantastic if it were possible to have a sizeable subset of traces for some/most labels containing known configurations (or even just confirmed website visits with, say, vanilla TB; maybe some more inline domain fingerprinting to set a bool for key website domains of the Tranco top-100 requested on the circuit, or something?).

First, I don’t think “junk” will be anywhere near the problem you imagine. Yes, there will be some noise from non-TB activity. But I believe that the *overwhelming* majority of activity is people using TB to go to top websites (see our IMC 2018 paper [0]). Second, you will be able to count the number of circuits that exist for each domain key. While this may not exactly match up with Tranco, it will probably be pretty close [0]. Third, I’m very happy to introduce this noise into the WF evaluation process. The adversary has to deal with it at least during testing, so you should too :)

[0] https://www.robgjansen.com/publications/torusage-imc2018.pdf
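
Counting circuits per domain key (my second point above) would then be trivial; a hypothetical sketch over records shaped like the earlier example:

from collections import Counter

dataset = [{"domain_hmac": "aa11..."}, {"domain_hmac": "aa11..."}, {"domain_hmac": "bb22..."}]
counts = Counter(r["domain_hmac"] for r in dataset)
print(counts.most_common(2))  # a popularity ranking over hashed labels, Tranco-style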

- Filled with an unknown rate of junk, the dataset alone will be insufficient to train WF attacks in noisy environments (you want ~1000+ samples per class of something coherent, at least with current state-of-the-art DL models pushed to their limits; as we move on to transformers, probably even more). Suppose you subscribe to the claim that data staleness is a factor. In that case, researchers cannot even go through the very time-consuming process of collecting adequate labeled training data in the same way to show success on the proposed dataset. The exit collection vantage point makes this worse.

We’ll have enough data that concept drift can be studied within our set exclusively. If our set isn’t big enough, or if it becomes stale after a while and we’re wanting more realistic threat estimates with current data, then yeah that’s another TRSB submission and we’ll have to think through the risks and relative benefits again. As it is now, we have a very large step forward in terms of benefit because we are starting from zero.

If we determine with this first set that we really need a continuous stream of the latest data and lots more of it, and want to establish some permanent measurement setup, that of course would be useful but also has its own risks. Right now I would argue against a permanent measurement because we haven’t even done the more basic thing yet.

- Using the dataset as a basis for simulating defenses, one would have to simulate the corresponding client and middle traces to feed into the simulated defenses based on exit traces. Kinda messy with poor signalling for Tor-network characteristics in the dataset. When collecting at the client, getting realistic client traces (for the particular configuration) is basically for free. At the same time, middles change for every circuit, so a half-assed approach seems to get you far (speaking from experience of half-assing things here!). I worry about half-assing both client and middle traces, however.

This is solvable. You could construct a website simulator using the downstream cell streams, which we collect on the exit before Tor’s queuing plays a part in messing with the timing. Once you have that, use Shadow to fetch the simulated sites through a simulated Tor network, which should do a reasonable (and getting better) job of adding back in the performance effects on top of the cell streams. Not perfect, but maybe passes your half-ass bar?

If we make this proposed dataset and its method the bar of doing "real-world WF", it might lead to too high of a bar. We want more real-world implementations and data collection with defenses, not less, I think.

Yes, I absolutely want us to raise the bar for doing this type of research. If that means professors can write fewer papers, and the ones we get will be more informative, then I will have succeeded!

Sorry if the above may come across as a bit negative, it's not my intent: I *want* the dataset you describe, we can learn a lot from it for sure. Wish I had a chance to chat in person in Costa Rica! Please don't feel obliged to reply, just food for thought.

How could I resist responding? :)

I really do appreciate your thoughts, thanks for sharing!

Peace, love, and positivity,
Rob


Do you have any approaches or thoughts around pruning the dataset
or further refining labels beyond the first domain?

I should note that we will include both the HMAC of the first domain
and an HMAC of its shortest private suffix (computed using Mozilla’s
public suffix list and libpsl (publicsuffix.org)).

IMO, the first domain is by far the most important. Other domains
accessed on a typical TB circuit will primarily be due to fetching
the objects embedded in pages during page loads. We’ll get a new
circuit if the domain in the URL bar changes, and that circuit will
have its own first domain.

I’m trying to collect the minimal thing here to balance privacy and
utility; while not perfect from a ML “give me all the data”
perspective, I think the dataset would allow us to make significant
moves in the right direction.

Cool with the suffix addition! The public suffix list is a good source. Agree that the first domain is by far the most important. Some noisy function over the later domains could provide plenty of utility with few downsides though.

For example, a less cumbersome approach could be a bucketed count of the number of *distinct* domains looked up in each trace (0-5, 6-10, 11-15, 15+). That, together with the traces, would allow some more informative labels. It's already known that the difference between standard and safest TB (i.e., JavaScript enabled vs. disabled) is significant, and it'd be great to make the labeling heuristics more solid.
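
Something like the following Python sketch, say (bucket boundaries as above, purely illustrative):

def domain_bucket(domains):
    # Count *distinct* domains looked up in the trace and bucket the count.
    # Note: 15 falls in "11-15" here; "15+" means 16 or more.
    n = len(set(domains))
    if n <= 5:
        return "0-5"
    if n <= 10:
        return "6-10"
    if n <= 15:
        return "11-15"
    return "15+"

print(domain_bucket(["a.com", "b.com", "a.com"]))  # -> "0-5"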

First, I don’t think “junk” will be anywhere near the problem you
imagine. Yes, there will be some noise from non-TB activity. But I
believe that the *overwhelming* majority of activity is people using
TB to go to top websites (see our IMC 2018 paper [0]). Second, you
will be able to count the number of circuits that exist for each
domain key. While this may not exactly match up with Tranco, it will
probably be pretty close [0]. Third, I’m very happy to introduce this
noise into the WF evaluation process. The adversary has to deal with
it at least during testing, so you should too :)

[0] https://www.robgjansen.com/publications/torusage-imc2018.pdf

I'm familiar with your IMC paper, it's nice work! Are you inferring TB vs. non-TB usage from the observed torproject.org primary-domain spam, or does something else in the paper enable you to differentiate? The ratio of TB traffic in the proposed dataset would be fantastic to know; linked to traces, even better.

The issue isn't dealing with the junk traffic in testing (that's great); it's being able to deal with it properly in training without having to repeat the data collection. It's also a plus to be able to reason more about testing results if we can tell more about the traces.

What percentage of exit bandwidth are you planning to use for the 13-week collection? Even with a 0.1% exit probability over 13 weeks, extrapolating from your IMC paper, there should be 1000+ samples for the Tranco top-1k. This is great if it's mostly TB traffic.

- Filled with an unknown rate of junk, the dataset alone will be
insufficient to train WF attacks in noisy environments (you want
~1000+ samples per class of something coherent, at least with
current state-of-the-art DL models pushed to their limits; as we
move on to transformers, probably even more). Suppose you subscribe
to the claim that data staleness is a factor. In that case,
researchers cannot even go through the very time-consuming process
of collecting adequate labeled training data in the same way to
show success on the proposed dataset. The exit collection vantage
point makes this worse.

We’ll have enough data that concept drift can be studied within our
set exclusively. If our set isn’t big enough, or if it becomes stale
after a while and we’re wanting more realistic threat estimates with
current data, then yeah that’s another TRSB submission and we’ll have
to think through the risks and relative benefits again. As it is now,
we have a very large step forward in terms of benefit because we are
starting from zero.

Without adequate labeling I have doubts, at least with current attacks. Guess it'll trigger more research.

If we determine with this first set that we really need a continuous
stream of the latest data and lots more of it, and want to establish
some permanent measurement setup, that of course would be useful but
also has its own risks. Right now I would argue against a permanent
measurement because we haven’t even done the more basic thing yet.

The dataset you propose is already massive; it's not about time. More refined labeling please!

- Using the dataset as a basis for simulating defenses, one would
have to simulate the corresponding client and middle traces to feed
into the simulated defenses based on exit traces. Kinda messy with
poor signalling for Tor-network characteristics in the dataset.
When collecting at the client, getting realistic client traces (for
the particular configuration) is basically for free. At the same
time, middles change for every circuit, so a half-assed approach
seems to get you far (speaking from experience of half-assing
things here!). I worry about half-assing both client and middle
traces, however.

This is solvable. You could construct a website simulator using the
downstream cell streams, which we collect on the exit before Tor’s
queuing plays a part in messing with the timing. Once you have that,
use Shadow to fetch the simulated sites through a simulated Tor
network, which should do a reasonable (and getting better) job of
adding back in the performance effects on top of the cell streams.
Not perfect, but maybe passes your half-ass bar?

Cool, maybe it's good enough? A plus is that one could simulate many more traces at clients and middles. For most destinations/circuits, I guess the destination<->exit latency is significantly less than client<->exit, so it should be possible to simulate blocking defenses more accurately. It's not a replacement for implementations and real-defended datasets though; I hope we can agree on that?

If we make this proposed dataset and its method the bar of doing
"real-world WF", it might lead to too high of a bar. We want more
real-world implementations and data collection with defenses, not
less, I think.

Yes, I absolutely want us to raise the bar for doing this type of
research. If that means professors can write fewer papers, and the
ones we get will be more informative, then I will have succeeded!

Please think of the professors and PhD students! ;) Attacks are already closed-world perfect on unprotected data, so the dataset will surely help trigger more publishing on attacks. I'm mostly concerned about the bar for defenses, and the community emphasizing simulation more than implementation as a consequence. We want more implemented and ultimately deployed defenses, right?

Sorry if the above may come across as a bit negative, it's not my
intent: I *want* the dataset you describe, we can learn a lot from
it for sure. Wish I had a chance to chat in person in Costa Rica!
Please don't feel obliged to reply, just food for thought.

How could I resist responding? :)

I really do appreciate your thoughts, thanks for sharing!

Thanks, likewise. I was thinking the beach would be nicer! ;)

Best,
Tobias


More refined labeling please!

Understood. We’ll consider your feedback and see if we might be able to safely produce more informative labels.

It's not a replacement for implementations and real-defended datasets though; I hope we can agree on that?

[snip]

We want more implemented and ultimately deployed defenses, right?

I absolutely agree!

Our dataset might be more immediately useful for evaluating attacks, but I don’t see how it hurts our ability to evaluate defenses. Defenses should still be implemented and evaluated in network-wide tests as before [0]. The best way I know how to do that is to first use Shadow for full network testing of a variety of candidate defenses before moving the best to the live Tor network. And I think our dataset could help make the Shadow part more realistic.

Peace, love, and positivity,
Rob

[0] "Padding-only Defenses Add Delay in Tor”
