reliable | catherine jue

Kernel is a cloud infrastructure provider. As we've grown, I've realized that customers expect us to solve "scale problems" from day one that other companies only face later. Reliability is one of those problems, and it's something I'm spending a lot of time thinking about.

In the world of cloud infrastructure, reliability is a prerequisite for developers to use a product. Developers choose to outsource a part of their stack because (1) the vendor's ratio of features to cost beats what an internal team could build themselves, (2) that part of the stack is not core to their business, and (3) they have confidence that the vendor works reliably.

For an early stage company, reliability is hard. Unlike consumer applications, which can gradually optimize for reliability as they grow (if Twitter doesn't work, just reload the page and try again), infrastructure providers must prove reliability in their first, second, tenth, thousandth, and millionth response. This is difficult enough in one's early days when code is immature, but the problem only grows as the company does. Not only must infrastructure companies ship net new features (which introduce risk and unproven code paths), they must also scale to deliver the same quality in their billionth response.

What makes something reliable? Oxford defines reliable as "consistently good in quality or performance; able to be trusted." ChatGPT describes reliable as "consistently performs as expected under the same conditions." The latter more closely mirrors the traditional definition of site reliability engineering, but I believe the former is a better framing for an early stage start-up.

The phrase "able to be trusted" is the key: traditional site reliability engineering focuses on uptime, error rates, and latency. Excelling on those metrics might be difficult when a company is pre-scale, but one can build trust in other ways. Here's how we think about it at Kernel:

Reliable error messages. Error messages are a developer's first interaction with your systems when something goes wrong. Immature code is to be expected in early stage start-ups, but nothing erodes trust faster than useless error messages.

Reliable error messages are actionable, consistent, and include enough context for a developer to resolve issues themselves. This is more than telling the user that something went wrong: the best error messages provide a path forward, such as {"message": "Timeout exceeded. Did you mean to use asynchronous invocations? [...].com/apps/invoke#asynchronous"} instead of {"message": "Timeout exceeded."}[1].

Actionable error messages tell the developer, "These people have anticipated this failure mode and cared enough to build systems that I can debug." Great error messages communicate respect for a developer's time, and that builds trust.
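To make the contrast concrete, here is a minimal sketch of how an API might assemble an actionable error payload. It is an illustration, not Kernel's actual implementation: the `ApiError` class, its `code`, `hint`, and `docs_url` fields, the `timeout_error` helper, and the `docs.example.com` base URL are all hypothetical names chosen for this example.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical placeholder: the real docs domain is not given in this post.
DOCS_BASE = "https://docs.example.com"


@dataclass
class ApiError:
    """A hypothetical error shape shared by every endpoint."""
    code: str                        # stable, machine-readable identifier
    message: str                     # what went wrong
    hint: Optional[str] = None       # what the developer can try next
    docs_url: Optional[str] = None   # where to read more

    def to_json(self) -> str:
        # Omit unset fields so the payload stays compact.
        payload = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(payload)


def timeout_error(limit_seconds: int) -> ApiError:
    # Pair the failure with a likely fix and a pointer to the relevant docs.
    return ApiError(
        code="timeout_exceeded",
        message=f"Timeout exceeded after {limit_seconds}s.",
        hint="Did you mean to use asynchronous invocations?",
        docs_url=f"{DOCS_BASE}/apps/invoke#asynchronous",
    )


print(timeout_error(60).to_json())
```

Routing every error through one shared shape is one way to get the "consistent" part: a developer who has handled one error from the API already knows where to look in the next one.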

Reliable documentation. Documentation reliability means information accuracy, consistent and thorough examples, and surfacing the right information at the right time (with and without the assistance of LLMs). When a developer can't find answers or source code, products often feel unreliable even if they're technically working as designed.

Reliable documentation also means acknowledging product limitations. Acknowledging that a feature is in beta or has a specific limitation builds trust through transparency. When a developer knows you default to the truth, they're more likely to trust everything else you have to say about your systems.

Reliable customer engineering. In infrastructure, support interactions are high-stakes moments. When developers' production systems depend on your API, how quickly you handle their questions, debug their issues, and communicate unplanned downtime defines your reliability.

This is where early stage companies can outperform larger competitors. You can create a Slack group with every paying customer (we do). Your engineers can jump on a support call within an hour. You can ship feature requests quickly. These advantages are easy to leverage when you're small and difficult to sustain when you're large.

Reliable velocity. Early stage companies all face the same challenge: customers usually need features you don't have yet. The question becomes whether they're willing to bet on your trajectory rather than your current capabilities. This is where reliable velocity becomes a competitive advantage.

Reliable velocity means consistently shipping product improvements at an impressive cadence. When customers see you repeatedly deliver on your commitments, they develop confidence that the features they need will actually materialize.

This creates compounding effects over time: each delivered commitment builds credibility for the next one. Customers become willing to choose your platform based on where you're going rather than where you are today, and that becomes a bridge to exponential acceleration[2].

Actually, this framework for reliability extends well beyond software engineering. One of the qualities of great leadership is being reliable: doing what you say you'll do (and stating what you won't), showing up dependably, in the same way, and applying consistent pressure to shape excellence. It's also a mark of exceptional hospitality.

Like great hosts, great products anticipate needs, communicate clearly about what to expect, and handle problems gracefully when they arise. And who builds great products? Great companies made up of great people.

[1] Stripe was known for this in their early days.
[2] Mintlify consistently impresses me with this.

august 22, 2025