The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering

By: Fexingo

Lucas and Luna cut through the noise around site reliability engineering to examine how real-world SRE teams balance uptime, incident response, and production change. Each episode takes a single concept (error budgets, toil automation, postmortem culture, capacity planning) and grounds it in a specific case: how a major streaming service reduced paging noise, how a payments platform rebuilt its incident command structure, or how a cloud provider manages multi-region failover. Lucas brings the numbers (latency percentiles, MTTR trends, SLO burn rates), while Luna pushes on the human and organizational trade-offs: What does a junior SRE need to know about on-call? How do you measure reliability without crushing innovation? Why do some blameless postmortems actually work? Together they treat SRE not as a certification topic but as a living practice, citing real outages, open-source tools, and engineering blogs.

This show is for engineers, ops leads, and platform teams who already know the basics and want to debate the hard edges: Is 99.999% uptime always worth the cost? When should you deliberately degrade service to improve reliability? How do you design for resilience when your system is already in production? Lucas and Luna don't pretend to have final answers; they build the conversation so you can draw your own conclusions. If you've ever argued about whether a page was necessary or whether an SLO should be tightened, this is your show.

#SiteReliabilityEngineering #SRE #Uptime #ProductionEngineering #IncidentResponse #ErrorBudgets #SLOs #Postmortem #ToilAutomation #CapacityPlanning #Observability #DevOps #PlatformEngineering #Resilience #OnCall #FexingoBusiness #BusinessPodcast #Technology

Keep every episode free: buymeacoffee.com/fexingo

© 2026 Fexingo. All rights reserved.

Economics

Episodes

How SRE Teams Use Incident Command Systems to Coordinate Response

Jun 7 2026

In this episode of The Site Reliability Podcast, Lucas and Luna dive into the incident command system (ICS) model that large-scale SRE teams borrow from emergency services to manage complex outages. They walk through a real example: a major payment processing incident at a fintech company where a database migration triggered a cascading failure affecting three million users. Lucas explains the four key roles in an SRE incident command structure (incident commander, operations lead, communications lead, and scribe) and how each prevents the chaos of engineers stepping on each other during a crisis. Luna challenges whether ICS slows down response time for smaller incidents, and Lucas shares how teams use tiered response models to scale the approach. They also discuss the one mistake teams make most often: failing to formally hand off the incident commander role during long-running incidents. The episode closes with a practical tip for any team looking to adopt ICS without formal training: start by assigning a scribe for the next on-call rotation.

#IncidentCommandSystem #SRE #SiteReliabilityEngineering #IncidentResponse #OnCall #CascadingFailure #Fintech #DatabaseMigration #IncidentCommander #OperationsLead #CommunicationsLead #Scribe #TieredResponse #Handoff #ProductionEngineering #Technology #FexingoBusiness #BusinessPodcast

Keep every episode free: buymeacoffee.com/fexingo

10 mins

How SRE Teams Use Blameless Postmortems to Build Better Systems

Jun 6 2026

In this episode of The Site Reliability Podcast, Lucas and Luna explore how blameless postmortems go beyond simple incident analysis to drive real systemic improvements. Using the example of a major payment processor incident in early 2026, they break down the anatomy of an effective blameless postmortem: separating human error from system design flaws, writing actionable recommendations, and tracking follow-ups. They discuss common pitfalls like blame drift and incomplete data, and share how one SRE team at a mid-size SaaS company reduced repeat incidents by 40 percent after adopting a structured blameless process. If you're looking to turn outages into learning opportunities, this episode offers a practical playbook.

#BlamelessPostmortems #SRE #SiteReliabilityEngineering #IncidentManagement #ProductionEngineering #Uptime #RootCauseAnalysis #DevOps #Reliability #LearningFromFailure #BlamelessCulture #IncidentResponse #SaaSSRE #TechOps #Technology #FexingoBusiness #BusinessPodcast #TheSiteReliabilityPodcast

Keep every episode free: buymeacoffee.com/fexingo

9 mins

How SRE Teams Use Postmortems That Actually Change Behavior

Jun 6 2026

In this episode of The Site Reliability Podcast, Lucas and Luna dig into the one incident-documentation practice most teams get wrong: the postmortem. Most postmortems are filed and forgotten. Lucas walks through how Google's SRE team shifted from blame-free to action-oriented postmortems, using a concrete example from their own 2017 Gmail outage. He breaks down the difference between a cause and a contributing factor, and explains why the 'action items' list is usually the weakest part. Luna pushes back on the idea that postmortems should always be public, and they discuss how psychological safety changes whether people actually report the truth. The episode closes with a practical takeaway: if your postmortem doesn't change how you deploy, monitor, or alert, it's a report, not a postmortem.

#SRE #SiteReliabilityEngineering #Postmortems #IncidentResponse #BlamelessCulture #GoogleSRE #GmailOutage #ActionItems #PsychologicalSafety #IncidentAnalysis #ReliabilityEngineering #DevOps #FexingoBusiness #BusinessPodcast #Technology #LearningFromFailure #ContinuousImprovement #RootCauseAnalysis

Keep every episode free: buymeacoffee.com/fexingo

8 mins
