[tor-project] Onion Service SRE Kickstart notes and meeting minutes

Hi all :slight_smile:

Here go the initial notes and first meeting minutes about the Onion
Services SRE work, which is part of the Sponsor 123 project.

# Onion Services Site Reliability Engineering - Kickstart

## About

The Onion Services SRE work focuses on procedures for the automated setup and
maintenance of high-availability Onion Services sites.

### Objectives and key results (OKRs)

Objective: this project is part of the "Increase the adoption of Onion Services"
priority for 2022 (tracked on a Miro board), for which we can select the
following goals:

0. Easy to set up and maintain. How to measure this?

1. Sane defaults. How to measure this?

2. Configurable and extensible. How to measure this?

## Initial plan

0. Meeting with dgoulet, hiro and anarcat to get advice on kickstarting the project:
   what/where to look at for specs, tools, goals, security checklists, limits etc.
   (meeting minutes below).

1. Research on all relevant deployment technologies: build a first matrix.

2. Then meet with the media organizations: inventory, compliance checks etc.

3. Build the second matrix (use cases).

## Kickstart meeting agenda

### Dimensions

Split the discussion into two dimensions:

0. What are the possible architectures to build a load-balanced Onion Service?

1. What are the available stacks/tools to implement those architectures?

### Initial considerations

While brainstorming about this project, the following considerations were
sketched:

0. Software suite: the Sponsor 123 project includes provisioning/monitoring onion
   services as deliverables, but the effort could be used to create a generic
   product (a "suite") which would include an Onionbalance deployer.

1. Key generation: such a suite could generate all .onion keys locally (on the
   sysadmin's box), encrypting and pushing them to a private repository (at
   least the frontend onion keypair for each site/service). Then other
   sysadmins could clone that internal repository and start managing the
   sites/machines (see the first sketch after this list).

2. Disposability: depending on design choices, the frontend .onion address
   could be the only persistent data, and everything else could be
   disposable/recycled/recreated in case of failure or a major
   infrastructure/design revamp.

   That, of course, depends on whether Onionbalance supports backend rotation.

   Consequence: the initial effort could be focused on a good frontend
   implementation (Onionbalance instance etc.), while backends and other nodes
   could be reworked later if time is limited right now.

3. Elasticity: that leads to the following requirement: shall this system be
   built so that backend nodes can be added and removed at will, in a way that
   enables the system to be elastic in the future, adding and removing nodes
   according to average load or on sysadmin command? Or would that require the
   Onionbalance instance to be restarted (resulting in unwanted downtime), or
   even break it?

   Currently Onionbalance supports only up to 8 backends: see the Onionbalance
   issue "v3: Support distinct descriptor mode" (#7).

   The initial proposal for the Sponsor 123 project would be to use a fixed
   number of 8 backends per .onion site (see the second sketch after this
   list), but experiments could be made to test whether a site could have a
   dynamic number of backends.

4. Uniformity with flexibility: it looks like most (if not all) sites can have
   the same "CDN" fronting setup, while their "last mile"/endpoints might all
   be different. Given that, the "first half" of the solution could be based on
   the same software suite and workflow, which could be flexible enough to
   accept distinct endpoint configurations.

5. External instance(s): for the Sponsor 123 contract, a single instance of this
   "CDN" solution could be used to manage all sites, instead of having to
   manage many instances (and dashboards) in parallel.

   Future contracts with other third parties could either be managed using that
   same instance or have their own instances (isolation).

6. Internal instance: another, internal instance could be set up to manage all
   sites listed at https://onion.torproject.org, if TPA likes the solution and
   decides to adopt it :slight_smile:

7. Migration support: the previous point would depend on built-in support to
   migrate existing onion services into the CDN instance.

8. Other considerations: see rhatto's skill-test research.
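
To make the key generation idea (item 1) more concrete, below is a minimal
sketch in Python of generating a v3 .onion keypair locally and deriving its
address, following rend-spec-v3 and using the `cryptography` library. It's
illustrative only: the encrypt-and-push step (e.g. with OpenPGP or age) is
left out, and file names are hypothetical.

```python
# Illustrative sketch only: generate a v3 onion service keypair locally
# (on the sysadmin's box) and derive its .onion address per rend-spec-v3.
import base64
import hashlib

from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def onion_address(public_key: bytes) -> str:
    """Derive the address: base32(pubkey || checksum || version) + ".onion"."""
    version = b"\x03"
    checksum = hashlib.sha3_256(
        b".onion checksum" + public_key + version
    ).digest()[:2]
    return (
        base64.b32encode(public_key + checksum + version).decode().lower()
        + ".onion"
    )


private_key = Ed25519PrivateKey.generate()
public_bytes = private_key.public_key().public_bytes(
    encoding=serialization.Encoding.Raw,
    format=serialization.PublicFormat.Raw,
)
print(onion_address(public_bytes))
# Note: writing Tor's on-disk hs_ed25519_secret_key file requires the
# *expanded* ed25519 secret key format, not just the seed held here.
```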
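For the fixed 8-backend proposal (item 3), here is a sketch of what generating
an Onionbalance-style config could look like. The schema is meant to mirror
the Onionbalance v3 `config.yaml` layout (one frontend key plus a list of
backend instance addresses), but it should be double-checked against the
Onionbalance documentation; all names and addresses below are placeholders.

```python
# Illustrative sketch only: emit an Onionbalance-style config.yaml with
# a fixed set of 8 backends, per the initial Sponsor 123 proposal.
import yaml  # PyYAML

NUM_BACKENDS = 8  # current Onionbalance upper bound (see issue #7 above)

# Placeholders; real values are the backends' 56-character v3 addresses.
backends = [f"<backend-{i}-address>.onion" for i in range(NUM_BACKENDS)]

config = {
    "services": [
        {
            "key": "frontend.key",  # the persistent frontend keypair
            "instances": [{"address": address} for address in backends],
        }
    ]
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```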

### Questions

General:

0. If you were the Onion Services SRE, how would you implement this project?

1. Which existing solutions should we look at, and which should we avoid?

2. What limits should we expect from the current technology, and how could we
   work around them?

Architecture:

0. What do people think about the architecture proposed by rhatto during his
   skill test (without paying attention to the improvised implementation he
   coded)?

1. The Tor daemon is a single process, with no threads. How does it scale under
   load for Onion Services, and with a varying number of Onion Services?

2. Which other limits should be considered in the scope of this project, like
   the current upper bound of 8 Onionbalance backend servers?

Implementation:

0. What are the dimensions for the comparison matrix of existing DevOps solutions
   such as Puppet, Ansible, Terraform and Salt (and their specific
   modules/recipes/cookbooks/roles)?

1. Shall this suite be tested using Chutney or via the Shadow simulator (in
   GitLab CI)? Does that make sense?

2. Which other tests should be considered?

3. How does TPA manage passphrases and secrets for existing systems and keys?

4. What (if any) TPA (or other) security policies should be observed in this
   project?

5. Which solutions are in use to manage the sites listed at
   https://onion.torproject.org/?

Management:

0. The Sponsor 123 Project Plan timeline predicts setup of the first .onion
   sites in M1 and M2, with 2-5 business days to set up a single .onion site.
   But coding a solution could take longer. How to proceed, then?
   The suggested approach is to have a detailed discovery phase while coding
   the initial solution in parallel. Some rework might be needed, but we can
   save time overall.

## Possible next tasks

0. Gather all relevant docs on onion services.

1. Build a comprehensive Onion Service checklist/documentation, including
   stuff like:
  * Basic:
    * Relay security checklist (if one exists).

    * Best practices and references: see existing and legacy docs like the
      OperationalSecurity page on the legacy Trac wiki.

    * Making sure the system clock is synchronized.

    * Set up the Onion-Location header (see the sketch at the end of this
      section).

    * Encrypted backup of .onion keys.

  * Optional:
    * Vanity address generation (using `mkp224o`)?

    * Set up HTTPS with valid X.509 certificates (and automatic HTTP -> HTTPS
      connection upgrade).

    * Set up Onion Names (the HTTPS Everywhere patch, or whatever is in its place).

    * Onion v3 auth (currently unsupported by Onionbalance; see the
      Onionbalance issue "v3: Support client authorization" (#5)).

2. Create a repository (suggested name: Oniongroove - gives groove to Onionbalance!).

3. Write an initial spec proposal for the system after both matrices are ready
   and other requirements are defined, dealing with architecture, implementation
   and UX.

4. Write a threat model for the current implementation, considering issues such
   as the lack of support for offline Onion Services keys, which makes the
   protection of the frontend keys a critical issue.

5. Create tickets for these and other tasks.
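
For the Onion-Location item in the checklist above, here is a minimal sketch
of setting the header at the application layer, with Flask standing in for
whatever app server a site runs (in practice this would more likely be a
one-liner in the web server configuration); the onion address is a placeholder.

```python
# Illustrative sketch only: advertise the onion mirror to Tor Browser
# users visiting the clearnet site via the Onion-Location header.
# (Tor Browser only honors this header on pages served over HTTPS.)
from flask import Flask, request

app = Flask(__name__)
ONION_ADDRESS = "<frontend-address>.onion"  # placeholder

@app.after_request
def add_onion_location(response):
    # Point at the equivalent page on the onion service.
    response.headers["Onion-Location"] = f"http://{ONION_ADDRESS}{request.path}"
    return response

@app.route("/")
def index():
    return "hello onion"
```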

## Meeting Minutes - 2022-02-08

### Participants

* Anarcat
* David
* Hiro
* Rhatto

### Discussion

(Free note taking; doesn't necessarily/precisely represent what people said)

Rhatto:

* Short intro, summarizing stuff above.

Hiro:

* When talking last year with the NYT: something that would help on the
  community side: someone who does a course (bootcamp) on devops, applying
  stuff like Terraform, Ansible etc.

* What would be easier for rhatto to do (script, Ansible)?

Anarcat:

* Puppet/agent:
  * TPA is a big Puppet shop. But doesn't think it's the right tool for the
    job: too centralized, like having a central Puppet server.
  * Also, Ansible is more popular.
  * Ansible has few requirements; it's easier to deploy and to reuse.
  * Not sure about Terraform; there were issues provisioning to Hetzner or
    Ganeti.

Rhatto:

* Maybe something to deploy node instances, and on top of that use stuff like
  Ansible to provision the services?
* How does TPA provision nodes at Hetzner and on Ganeti?
* Shall we look at Kubernetes?

Anarcat:

* Before joining Tor: it was kind of a mess of shell scripts.
* Wrote a kind of Debian installer with Python + Fabric.
* The installer configures a machine up to the point where it can be added to
  LDAP/Puppet.
* Maybe an MVP that uses Ansible (services setup) and then another using
  Terraform (node setup).

Hiro:

* Docker Swarm using Terraform.
* Likes Ansible (because it only requires Python + SSH).
* About Kubernetes: same issue as with Puppet: you have to run a centralized
  set of control nodes.
* Ansible: lots of recipes available to harden the machine.
* Puppet is complicated, I think, because it works for your own infrastructure.
* It works for companies because it is tailored to providers.

David:

* There are lots of recipes and blog posts about ansible for Tor.

Anarcat:

* Docker: does provide some standard environment.
* Like what rhatto did in his skill test.
* Question with Docker: what to use? Swarm, Kubernetes, Compose? The irony
  with Docker is that it's not obvious how to use it in production.
* Docker might be interesting for us to produce Docker containers.
* Part of the job is to do that evaluation.

Rhatto:

* Could do all this research.

Anarcat:

* About stopping using NGINX: having troubles with the blog; upstream was
  charging a lot for the traffic.
* NGINX: generic webserver; had heard lots of good things about it.
* Set up 2 VMs caching the blog, but then retired them as the caching was not
  working out.
* NGINX is open-core, which is especially tricky when you want to do
  monitoring.
* OpenResty is very interesting.

Hiro:

* OpenResty: similar open-core model to NGINX.

Rhatto:

* How to connect the solution and the endpoint.
* Questions:
  * Is local .onion keypair generation a good approach?
  * Could offline .onion key support be on the roadmap?
  * Are backend keys disposable?

David:

* Offline keys: very unlikely to have them until the Rust rewrite.
  Would not bet on that coming in time for this project.

* Local key generation and deployment: there will be a need for this.

* Would not bet that we could rotate the Onionbalance keys.

### Next

See "Possible next tasks" section above :slight_smile:


--
Silvio Rhatto
pronouns he/him
