Inference Engineering, Open Models & Shipping AI At Scale
An interview with Philip Kiely, Head of Developer Relations at Baseten. 🧠
👋 Howdy to the 2,383 new legends who joined this week! You are now part of a 264,323 strong tribe outperforming the competition together.
LATEST POSTS 📚
If you’re new, not yet a subscriber, or just plain missed it, here are some of our recent editions.
🧱 LinkedIn Just Voluntarily Tanked Their Algorithm (+ How To Win). We get down and dirty on the tactics to build your brand.
🥄 Neighbors Feeding Neighbors, At Scale. An interview with Joey Grassia, Co-Founder & CEO at Shef.
💨 Tracksuit’s Bootstrap To Series B. A company that is growing really, really, really fast.

PARTNERS 💫
Framer is the design-first, no-code website builder that lets anyone ship a production-ready site in minutes. Whether you're starting with a template or a blank canvas, Framer gives you total creative control with no coding required.
Add animations, localise with one click, and collaborate in real-time with your whole team. You can even A/B test and track clicks with built-in analytics.
This is the kind of talent you get with Athyna Intelligence: Research Engineer with deep expertise in PyTorch, deep learning, and LLM workflows. Built to ship models that hold up under real-world conditions.
Applied ML at scale
Fast iteration on real-world constraints
Production-first mindset
40–60% cost savings
Part of our vetted LATAM PhD and Masters network, working in U.S.-aligned time zones.
Interested in sponsoring these emails? See our partnership options here.

HOUSEKEEPING 📨
I am lucky enough to be an investor at Blackbird, based out of Australia, and just as I was preparing today’s piece I received news that we (Blackbird) had invested in the Baseten Series E. And today we are featuring Philip from Baseten. Incredible timing. On the same day, I learned that a Blackbird portfolio company, co-founded by a buddy of mine, Charlie, had been acquired by Hims & Hers for $1.15 billion.

A great result for the Australian startup ecosystem, and my first meaningful exit as a VC investor. I only somewhat believe in the investment returns of venture capital; for me, it’s more that it’s just super cool to be part of incredible results like this. Now onto today’s piece with Phil from Baseten.

INTERVIEW 🎙️
Philip Kiely, Head of Developer Relations at Baseten
Philip Kiely is a software engineer and leading developer relations expert at Baseten, an AI infrastructure company that helps organizations deploy and scale large-scale generative AI models in production. At Baseten, Kiely focuses on helping teams adopt open-source and custom models, optimize inference performance, and build robust, low-latency systems that power real-world AI products. He also writes and speaks widely on topics like inference engineering, AI model serving, and the practical trade-offs of deploying generative models at scale.
Before joining Baseten in 2022, Kiely worked across software engineering and technical writing at a range of startups, developing a strong foundation in both product development and clear technical communication. He is the author of Inference Engineering and has appeared on industry podcasts and at major conferences to discuss scalable AI systems, model orchestration, and the evolving role of open models in production. Outside of work, he’s a lifelong martial artist, avid reader, and fan of Bay Area sports.
What problem are you working on right now?
I’m working on inference, specifically inference for both open models and custom models. That includes models like Llama, DeepSeek, Flux, Whisper, Orpheus, and others across different modalities, as well as models that companies train themselves and want to build products around.
To explain why inference matters, it helps to start with what inference actually is. If you think about the lifecycle of a generative AI model, there are two phases. First, the model is trained and the weights are learned from data. After that comes inference, which is when you actually run that finished model in production. For people less familiar with AI, I usually explain it like this: if you’ve used ChatGPT, inference is everything that happens between asking a question and getting an answer back.
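To make the two phases concrete, here’s a minimal sketch of the inference side using Hugging Face’s transformers library. The checkpoint is a small, arbitrary example; any open text-generation model works the same way.

```python
# Minimal sketch of the inference phase: the weights are already
# trained, so all that's left is loading them and running the model.
# The checkpoint here is a small example; any open text-generation
# model on Hugging Face works the same way.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Inference is everything between this prompt going in and the
# completed text coming back.
result = generator("What is inference?", max_new_tokens=40)
print(result[0]["generated_text"])
```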
We work with many of the best companies in the industry, from AI-native startups to growing enterprises that are rapidly adopting these technologies. A few years ago, when ChatGPT and Claude were first announced, there were only a handful of companies training frontier models and doing serious inference engineering. In an alternate version of the world, that could have stayed the case, with a few companies owning intelligence and everyone else renting it token by token.

Source: Baseten.
That’s not the world I want to live in, and fortunately, it’s not the world we live in. Today, there are over two million open-source models on Hugging Face that developers can use to actually own the intelligence inside their products. But for that to be possible, those models need infrastructure that allows them to be fast, less expensive, reliable, and easy to scale. That’s what we do at Baseten.
We provide runtime performance optimizations and multi-cloud infrastructure to run inference at scale with state-of-the-art latency, for any open model or any custom model a company creates. We’ve also expanded into training and post-training workflows, pairing fine-tuning and reinforcement learning with inference so teams can build high-quality custom models and gain independence from closed models.
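To give a feel for what this looks like from the application side, here’s a hedged sketch of calling a deployed open model over HTTP. The endpoint URL, header name, and payload shape are invented placeholders for illustration, not Baseten’s actual API.

```python
# Hedged sketch of calling a deployed model over HTTP from an app.
# The URL, header, and payload shape below are invented placeholders,
# not a real API spec.
import os
import requests

ENDPOINT = "https://your-model-endpoint.example/predict"  # placeholder

response = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Api-Key {os.environ['MODEL_API_KEY']}"},
    json={"prompt": "Summarize this support ticket...", "max_tokens": 256},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```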
What’s underestimated when moving from a demo to something customers rely on?
A few years ago, I talked to a lot of people who believed inference wasn’t the hard part. Some very big names in the industry even said they thought training would be the real challenge, because once you had the model, you could just put it on an inference engine, run it on GPUs, and be done. It turns out that’s not how it works.
There’s a huge amount of complexity in running inference beyond just getting the model to work. One major challenge is latency. Many AI applications are real time, and users have expectations that translate into SLAs measured in hundreds of milliseconds. Running inference on a trillion-parameter model sharded across eight or more GPUs while hitting a time-to-first-token in that range is an extremely difficult technical problem. That requires deep runtime work, highly optimized inference engines, custom kernels, careful configuration, and techniques like post-training quantization, speculative decoding, and KV cache reuse. Some of those improve tokens per second, others reduce time to first token, but they all matter. Even if you manage to make it fast, that’s only half the problem. Once something is fast and useful, people want to use it at scale.
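To make those latency targets concrete, here’s a minimal sketch of measuring time-to-first-token and streaming throughput against any OpenAI-compatible endpoint. The base_url, API key, and model name are placeholders.

```python
# Sketch: measuring time-to-first-token (TTFT) and streaming
# throughput against any OpenAI-compatible endpoint. The base_url,
# API key, and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="https://your-endpoint.example/v1", api_key="YOUR_KEY")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token lands here
        chunks += 1

total = time.perf_counter() - start
if first_token_at is not None:
    ttft = first_token_at - start
    print(f"TTFT: {ttft * 1000:.0f} ms")
    if total > ttft:
        print(f"~{chunks / (total - ttft):.1f} chunks/s after first token")
```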

Source: Baseten.
That introduces an entirely different challenge: replicating this setup across multiple regions and multiple cloud providers to meet demand. The biggest AI-native products today are doing trillions of tokens per month. For companies operating at that scale, a single region or even a single cloud provider isn’t enough. You need both performance engineering and distributed infrastructure to work together seamlessly to deliver consistent, low-latency inference in production.
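As a toy illustration of the routing side of that problem, here’s a sketch that picks the lowest-latency healthy region from a set of hypothetical regional endpoints. The region URLs are invented; a real multi-region platform folds this, plus failover and capacity management, behind a single endpoint.

```python
# Toy sketch of latency-aware routing across regional endpoints.
# The region URLs are invented for illustration.
import time
import requests

REGIONS = {
    "us-east": "https://us-east.example/health",
    "eu-west": "https://eu-west.example/health",
    "ap-southeast": "https://ap-southeast.example/health",
}

def fastest_healthy_region() -> str | None:
    best, best_latency = None, float("inf")
    for name, url in REGIONS.items():
        try:
            started = time.perf_counter()
            requests.get(url, timeout=2).raise_for_status()
            latency = time.perf_counter() - started
        except requests.RequestException:
            continue  # skip regions that are down or too slow
        if latency < best_latency:
            best, best_latency = name, latency
    return best

print(fastest_healthy_region())
```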
What does your day-to-day look like?
Honestly, I’m still figuring out what my job is. I’ve been at Baseten for four years, and I’m one of the longest-tenured people at the company, founders included. When I joined, we had four customers and basically no revenue. Today, we’re a team of over 150 and a multi-billion-dollar company. Early on, my role was very product-focused because we were just trying to figure out what to build. I spent a lot of time on documentation, explaining how things worked. As we grew and hit our first real revenue milestones, that naturally evolved into a developer relations role. Instead of documenting features, I started documenting problems and real-world use cases that customers were facing.
Today, that role has expanded alongside a much larger marketing organization. There’s demand for conference talks, interviews, technical writing, demos, and content of all kinds. Recently, a big part of my focus has been writing a book called Inference Engineering, which I’m publishing soon.
What’s been consistent throughout is that I’ve always tried to explain what’s happening right now. In an industry that moves this fast, having a clear understanding of the present is incredibly valuable. Only recently have I had the space to think a bit further ahead, maybe weeks or months out, and help position both myself and the company for what’s coming next.


How has moving between building and explaining shaped your products?
I’ve seen a lot of structural changes over the years, which anyone who’s worked at a hypergrowth company will recognize. I’ve seen teams that are now incredibly strategic and staffed with some of the best engineers I’ve ever met start out as hackathon projects. There’s constant movement and reshaping. My role has always been to first tell things the way they are and clearly explain what’s happening right now. In a market that moves this fast, simply understanding the present is extremely valuable, both internally and externally.

Source: Baseten.
Only more recently have I had the opportunity to think a bit about what’s coming next. Now that we’ve caught up with the present, I can start looking ahead. This is an industry where vision matters, but it’s not one where three- or five-year roadmaps make much sense. I’ve started thinking in shorter horizons: a week, a month, a quarter, and occasionally a year out. The focus over the last four years has really been on staying grounded in what’s happening today, how things have changed recently, and what people need to know to do their jobs effectively, whether they’re inference engineers, AI engineers, or people looking into the space from the outside.
How do you see the future of AI right now?
One thing I’m really excited about is the continued momentum around open-source models. A couple of years ago, the common belief was that open models just weren’t good enough compared to closed models, so they weren’t very useful. Then models like DeepSeek v3 and DeepSeek R1 came out and closed that gap.
After that, the conversation shifted to whether open models would keep up, fall behind, or jump into the lead. That’s not the most productive way to look at it though. The real utility of open models isn’t determined by benchmark scores. It’s determined by the capability thresholds they’re able to cross.
If you look at the past, you can see how this plays out. As closed models improved, they crossed invisible thresholds that unlocked entirely new kinds of products. Early models that powered things like customer support agents or code editors weren’t very good. They were slow, expensive, and unreliable, but they crossed just enough of a capability threshold for people to start building on them. When open models crossed those same thresholds weeks or months later, builders were able to switch and benefit from lower costs, better performance, and more control. Looking ahead, I can see frontier applications of closed models today that, in a few months, we’ll be working to make cheaper and faster in the open ecosystem.
Right now, that includes things like reasoning agents, where closed models still have an edge, as well as speech-to-speech systems, video generation, and world modeling. I’m particularly interested in what happens when these systems become twice as fast and ten times cheaper. I also see a big shift toward reinforcement learning. The best teams today are taking high-quality open models and doing deep fine-tuning to push them across capability thresholds faster. Right now, only the leading teams are doing this, but I think within a year it will be standard practice across the industry.
How do you use AI in your day-to-day?
AI has had a much bigger impact on my work life than my personal life. I’m able to ship substantially more today than I could one, two, or three years ago because I’ve really dialed in the tools I use. As an engineer, I use Cursor constantly. It allows me to build one-off or custom software that wouldn’t otherwise be worth the effort. Sometimes I’ll build an entire application during my commute home.
I also use Descript a lot for video and audio editing. I’ve used traditional tools like Final Cut Pro, so I know how long even basic editing can take. With Descript, I can film, script, edit, and publish a video about a new model within a couple of hours because the editing process is so fast. We’ve also adopted a lot of Notion’s new AI features internally, which have been great. I use AI extensively in writing and proofreading, especially while working on my book. I’m also using AI-generated voice tools to produce the audiobook version.
One tool that’s especially important for me is Whisper. I don’t type particularly fast, so being able to speak and have text appear accurately is incredibly helpful for someone whose job revolves around creating content. At this point, AI tools are so ingrained in my workflow that I often forget I’m using them until I stop and list everything out.
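For anyone curious what that workflow can look like locally, here’s a minimal sketch using the open-source openai-whisper package (pip install openai-whisper). The model size and audio file name are placeholders.

```python
# Minimal sketch of local speech-to-text with the open-source
# openai-whisper package. The model size and audio file are
# placeholders.
import whisper

model = whisper.load_model("base")          # small, fast checkpoint
result = model.transcribe("dictation.mp3")  # your recorded audio
print(result["text"])
```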
Is there anything I should have asked you?
The main thing I’d emphasize is just how early we still are in inference engineering. There’s going to be a massive increase in demand for inference engineers over the next year, and this is where I’ve personally bet the foundation of my career.
As more companies adopt open models, they’ll all need people who understand how to serve and operate these systems. Even companies that don’t build inference platforms in-house still need teams that know how to run and manage them effectively. There’s growing demand not just for engineers, but also for executives who understand inference, people who know how to sell it, and people who know how to manage it. That’s why I wrote Inference Engineering. It’s a 300-page book that captures everything I’ve learned over the last four years at Baseten.
The goal is to give people a clear roadmap to becoming experts in inference, which I believe is one of the most valuable areas in AI right now.
Extra reading / learning
Arctic Vaults, AI Agents & The Future Of Dev Tools - June, 2025
Launching Athyna Intelligence - January, 2026
Why AI Is Easy To Demo But Hard To Trust - February, 2025

BRAIN FOOD 🧠

TOOLS WE RECOMMEND 🛠️
Every week, we highlight tools we like and those we actually use inside our business and give them an honest review. Today, we are highlighting Thesys*—a Generative UI company that helps anyone build AI apps and agents that respond with interactive UI like charts, forms, cards, tables, and more, instead of walls of text.
See the full set of tools we use inside of Athyna & Open Source CEO here.

HOW I CAN HELP 🥳
P.S. Want to work together?
Hiring global talent: If you’re hiring tech, business or ops talent and want to do it for 80% less, check out my startup, Athyna. 🌏
See my tech stack: Find our suite of tools & resources for both this newsletter and Athyna here. 🧰
Reach an audience of tech leaders: Advertise with us if you want to get in front of founders, investors and leaders in tech. 👀