[tor-project] minutes from the sysadmin meeting

Roll call: who’s there and emergencies

No emergencies: we have an upcoming maintenance on chi-san-01 which
will require a server shutdown at the end of the meeting.

Present: anarcat, gaba, kez, lavamind

Storage brainstorm

The idea was just to throw around ideas for this ticket:

anarcat went over the broad strokes of the current storage problems
(lack of space, performance issues) and the solutions we’re looking
for (specific to some service, but possibly also applicable everywhere,
without creating new tools to learn).

We specifically focused on the storage problems on gitlab-02,
naturally, since that’s where the problem is most manifest.

lavamind suggested that there were basically two things we could do:

  1. go through each project one at a time to see how changing certain
    options would affect retention (e.g. “keep latest artifacts”)

  2. delete all artifacts older than 30 or 60 days, regardless of the
    retention policy (e.g. keep latest), which may or may not include
    job logs (see the sketch after this list)
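
As a rough illustration of option two, a script along these lines could
walk a project’s CI jobs and drop the artifacts of anything past a
cutoff, through the regular GitLab REST API. This is only a sketch: the
project id, the token handling and the 30-day cutoff are placeholders,
not something we decided at the meeting.

    #!/usr/bin/env python3
    """Sketch for option two: delete CI artifacts older than a cutoff.

    Project id, token and cutoff are placeholders."""
    import os
    from datetime import datetime, timedelta, timezone

    import requests

    GITLAB_URL = "https://gitlab.torproject.org"
    PROJECT_ID = 42                     # hypothetical project id
    TOKEN = os.environ["GITLAB_TOKEN"]  # needs maintainer access on the project
    CUTOFF = datetime.now(timezone.utc) - timedelta(days=30)

    session = requests.Session()
    session.headers["PRIVATE-TOKEN"] = TOKEN

    page = 1
    while True:
        # list the project's CI jobs, 100 at a time
        resp = session.get(
            f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/jobs",
            params={"per_page": 100, "page": page},
        )
        resp.raise_for_status()
        jobs = resp.json()
        if not jobs:
            break
        for job in jobs:
            created = datetime.fromisoformat(
                job["created_at"].replace("Z", "+00:00"))
            # only touch jobs that still hold an artifacts archive
            # and are older than the cutoff
            has_archive = any(a.get("file_type") == "archive"
                              for a in job.get("artifacts", []))
            if has_archive and created < CUTOFF:
                print(f"deleting artifacts of job {job['id']}")
                session.delete(
                    f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}"
                    f"/jobs/{job['id']}/artifacts"
                ).raise_for_status()
        page += 1

The “may or may not include job logs” part would likely mean using the
job erase endpoint instead, which wipes the log along with the
artifacts.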

Other things we need to do:

  • encourage people: “please delete stale branches if you do have
    that box checked”
  • talk with jim and mike about the 45 GB of old artifacts
  • draft a new RFC on artifact retention, covering the deletion of
    old artifacts and old jobs (option two above)

We also considered unchecking the “keep latest artifacts” box at the
admin level, but this would disable the feature in all projects with
no option to opt-in, so it’s not really an option.
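
For option one, a first pass could be a simple audit of where the
artifact space actually goes, for example with python-gitlab as
sketched below. It lists each project’s artifact usage next to its
“keep latest artifacts” setting; this assumes an admin-level token,
and keep_latest_artifact as the API name for that checkbox is an
assumption to double-check.

    #!/usr/bin/env python3
    """Sketch for option one: list artifact usage and the per-project
    "keep latest artifacts" setting. Assumes python-gitlab and an
    admin token in GITLAB_TOKEN."""
    import os

    import gitlab

    gl = gitlab.Gitlab("https://gitlab.torproject.org",
                       private_token=os.environ["GITLAB_TOKEN"])

    # statistics=True exposes job_artifacts_size in the project listing
    for project in gl.projects.list(iterator=True, statistics=True):
        artifacts = project.statistics["job_artifacts_size"]
        if not artifacts:
            continue  # skip projects with no artifact storage at all
        keep_latest = getattr(project, "keep_latest_artifact", None)
        print(f"{project.path_with_namespace}: "
              f"{artifacts / 1024**3:.1f} GiB of artifacts, "
              f"keep_latest_artifact={keep_latest}")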

We considered the following technologies for the broader problem:

  • S3 object storage for gitlab
  • ceph block storage for ganeti
  • filesystem snapshots for gitlab / metrics servers backups

We’ll look at setting up a VM with minio for testing. We could first
test the service with the CI runners image/cache storage backends,
which can easily be rebuilt/migrated if we want to drop that test.
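
As a starting point for that test, a small boto3 smoke test against
the MinIO VM could look like this; the endpoint, credentials and
bucket name are invented here. If it pans out, the runners would then
point their cache at the same bucket through the [runners.cache]
section of their config.toml.

    #!/usr/bin/env python3
    """Smoke test for a MinIO test VM: create the would-be runner cache
    bucket, write an object and read it back. Endpoint, credentials and
    bucket name are placeholders."""
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client(
        "s3",
        endpoint_url="https://minio-01.torproject.org:9000",  # hypothetical VM
        aws_access_key_id="runner-cache",
        aws_secret_access_key="CHANGEME",
    )

    bucket = "runner-cache"
    try:
        s3.create_bucket(Bucket=bucket)
    except ClientError:
        pass  # bucket already exists

    s3.put_object(Bucket=bucket, Key="smoke-test", Body=b"hello from tpa")
    print(s3.get_object(Bucket=bucket, Key="smoke-test")["Body"].read())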

This would disregard the block storage problem, but we could pretend
this would be solved at the service level eventually (e.g. redesign
the metrics storage, split up the gitlab server). Anyway, migrating
away from DRBD to Ceph is a major undertaking that would require a lot
of work. It would also be part of the larger “trusted high performance
cluster” work that we recently de-prioritized.

Other discussions

We should process the pending TPA-RFCs, particularly TPA-RFC-16, about
the i18n lektor plugin rewrite.

Next meeting

Our regular schedule would bring us to March 7th, 18:00 UTC.

Metrics of the month

  • hosts in Puppet: 88, LDAP: 88, Prometheus exporters: 143
  • number of Apache servers monitored: 25, hits per second: 253
  • number of self-hosted nameservers: 6, mail servers: 8
  • pending upgrades: 0, reboots: 0
  • average load: 2.10, memory available: 3.98 TiB/5.07 TiB, running processes: 722
  • disk free/total: 35.81 TiB/83.21 TiB
  • bytes sent: 296.17 MB/s, received: 182.11 MB/s
  • planned bullseye upgrades completion date: 2024-12-01
  • GitLab tickets: 166 tickets including…
    • open: 1
    • icebox: 149
    • needs information: 2
    • backlog: 7
    • next: 5
    • doing: 2
    • (closed: 2613)

Upgrade prediction graph lives at

Number of the month

-3 months. Since the last report, our bullseye upgrade completion date
moved backwards by three months, from 2024-09-07 to 2024-12-01. That’s
because we haven’t started yet, but it’s interesting that it seems
to be moving back faster than time itself… We’ll look at deploying a
perpetual motion time machine on top of this contraption in the next
meeting.
