Services
We offer SRE and reliability consulting for teams that want to ship faster without sacrificing stability. We take on a small number of engagements at a time. If you're dealing with something on this list, get in touch - we'd love to hear what you're working on.
What we help with
- Observability: instrumentation, tracing, metrics, logging, and making sure you actually understand what your systems are doing before they stop doing it. We can also help with cost management - o11y spend can sneak up on you!
- CI/CD: pipeline design, test suite speedups, build optimization, deployment automation, and getting your delivery cycle time down from "after the CAB meeting" to continuous, on demand.
- Performance tuning: profiling, load testing, bottleneck analysis, and the kind of deep-dive work that turns "it's slow" into "it's fixed". We understand the tail at scale, and we love taming p99 spikes.
- Infrastructure migration: moving between cloud providers, containerization, Kubernetes and all the fun that comes with it. We strongly believe in zero-downtime migrations, but also know when to move faster.
- Incident management: on-call design, runbooks, post-incident review, and building a culture where incidents make you stronger.
How we work
Our main priority at the start of an engagement is to figure out how best to work together.
Regardless of the engagement model, we start with a discovery phase - we spend 1-2 days on-site with your engineering leaders and engineers to understand the current state of your systems and reliability, and determine what the highest-impact work would be. From this, we deliver an initial statement of work with recommendations.
From there, we offer several engagement models:
- Delivery/enablement: we embed with your engineering teams, on-site. We work towards a prioritized list of goals, pair with engineers, write code and documentation, and provide continuous feedback on other areas where we see opportunities for improvement. Think of it as staff-engineer-as-a-service.
- Advisory on retainer: we are available to answer general questions (with a turnaround of 24-48 hours) through email, Slack or equivalent, with additional deliverables and video calls as needed.
- Ad hoc: we scope and deliver specific projects at an hourly rate.
- Workshops: developed on demand, covering topics like incident management, observability, and XP practices.
Track record
Most recently, we helped a major Canadian fintech reduce their delivery cycle time from months to minutes by introducing CI and CD.
Previously, we've worked on reliability at scale at Discord, Shopify, Lightstep, Wealthsimple, Pivotal, and Monzo.
hello@whatsdown.dog - we'd love to hear from you.