The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering

Episodes

How SRE Teams Use Auto-Remediation to Resolve Incidents Without Humans

Jun 7 2026

In this episode of The Site Reliability Podcast with Fexingo, Lucas and Luna explore how SRE teams are using auto-remediation to automatically resolve incidents without human intervention. They break down the anatomy of an auto-remediation pipeline — from monitoring alerts to automated runbook execution — using real-world examples like a major streaming service that reduced pager fatigue by 40 percent. Lucas explains the critical distinction between deterministic remediation (simple if-then rules) and AI-driven remediation (pattern-matching across past incidents). The hosts also discuss where auto-remediation fails: novel incidents, complex multi-service failures, and scenarios requiring human judgment. They emphasize that auto-remediation isn't about replacing SREs but about freeing them to focus on higher-value work. Practical tips include starting with high-frequency, low-complexity alerts and gradually expanding scope. No fluff, just a focused look at a key SRE practice. Tune in for a concrete example you can apply to your own incident response. #AutoRemediation #SiteReliabilityEngineering #IncidentResponse #RunbookAutomation #PagerFatigue #DeterministicRemediation #AIDrivenRemediation #StreamingServiceCaseStudy #SRE #Uptime #ProductionEngineering #FexingoBusiness #BusinessPodcast #TechnologyPodcast #LucasAndLuna #IncidentManagement #OnCall #Observability Keep every episode free: buymeacoffee.com/fexingo
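
The "simple if-then rules" form of deterministic remediation described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline discussed in the episode: the alert names, the `Alert` type, and the two placeholder actions are all hypothetical, and a real system would call an orchestrator or runbook tool instead of returning strings.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Alert:
    name: str     # hypothetical alert identifier
    service: str  # service that raised the alert


def restart_service(alert: Alert) -> str:
    # Placeholder action; a real pipeline would trigger a restart here.
    return f"restarted {alert.service}"


def clear_tmp_files(alert: Alert) -> str:
    # Placeholder action for a disk-pressure alert.
    return f"cleared temp files on {alert.service}"


# Deterministic rules: each known alert name maps to exactly one action.
RULES: dict[str, Callable[[Alert], str]] = {
    "process_down": restart_service,
    "disk_almost_full": clear_tmp_files,
}


def remediate(alert: Alert) -> Optional[str]:
    """Run the matching action, or return None to escalate to a human."""
    action = RULES.get(alert.name)
    if action is None:
        return None  # novel alert: hand off to the on-call engineer
    return action(alert)
```

Returning `None` for unknown alerts keeps the rule table limited to high-frequency, low-complexity cases, which matches the episode's advice to start small and expand scope gradually.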

12 mins

How SRE Teams Use Incident Command Systems to Coordinate Response

Jun 7 2026

In this episode of The Site Reliability Podcast, Lucas and Luna dive into the incident command system (ICS) model that large-scale SRE teams borrow from emergency services to manage complex outages. They walk through a real example: a major payment processing incident at a fintech company where a database migration triggered a cascading failure affecting three million users. Lucas explains the four key roles in an SRE incident command structure — incident commander, operations lead, communications lead, and scribe — and how each prevents the chaos of engineers stepping on each other during a crisis. Luna challenges whether ICS slows down response time for smaller incidents, and Lucas shares how teams use tiered response models to scale the approach. They also discuss the one mistake teams make most often: failing to formally hand off the incident commander role during long-running incidents. The episode closes with a practical tip for any team looking to adopt ICS without formal training: start by assigning a scribe for the next on-call rotation. #IncidentCommandSystem #SRE #SiteReliabilityEngineering #IncidentResponse #OnCall #CascadingFailure #Fintech #DatabaseMigration #IncidentCommander #OperationsLead #CommunicationsLead #Scribe #TieredResponse #Handoff #ProductionEngineering #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

10 mins

How SRE Teams Use Blameless Postmortems to Build Better Systems

Jun 6 2026

In this episode of The Site Reliability Podcast, Lucas and Luna explore how blameless postmortems go beyond simple incident analysis to drive real systemic improvements. Using the example of a major payment processor incident in early 2026, they break down the anatomy of an effective blameless postmortem: separating human error from system design flaws, writing actionable recommendations, and tracking follow-ups. They discuss common pitfalls like blame drift and incomplete data, and share how one SRE team at a mid-size SaaS company reduced repeat incidents by 40 percent after adopting a structured blameless process. If you're looking to turn outages into learning opportunities, this episode offers a practical playbook. #BlamelessPostmortems #SRE #SiteReliabilityEngineering #IncidentManagement #ProductionEngineering #Uptime #RootCauseAnalysis #DevOps #Reliability #LearningFromFailure #BlamelessCulture #IncidentResponse #SaaSSRE #TechOps #Technology #FexingoBusiness #BusinessPodcast #TheSiteReliabilityPodcast Keep every episode free: buymeacoffee.com/fexingo

9 mins

How SRE Teams Use Postmortems That Actually Change Behavior

Jun 6 2026

In this episode of The Site Reliability Podcast, Lucas and Luna dig into the one incident-documentation practice most teams get wrong: the postmortem. Most postmortems are filed and forgotten. Lucas walks through how Google's SRE team shifted from blame-free to action-oriented postmortems, using a concrete example from their own 2017 Gmail outage. He breaks down the difference between a cause and a contributing factor, and explains why the 'action items' list is usually the weakest part. Luna pushes back on the idea that postmortems should always be public, and they discuss how psychological safety changes whether people actually report the truth. The episode closes with a practical takeaway: if your postmortem doesn't change how you deploy, monitor, or alert, it's a report, not a postmortem. #SRE #SiteReliabilityEngineering #Postmortems #IncidentResponse #BlamelessCulture #GoogleSRE #GmailOutage #ActionItems #PsychologicalSafety #IncidentAnalysis #ReliabilityEngineering #DevOps #FexingoBusiness #BusinessPodcast #Technology #LearningFromFailure #ContinuousImprovement #RootCauseAnalysis Keep every episode free: buymeacoffee.com/fexingo

8 mins

How SRE Teams Use Runbook Automation to Reduce Human Error

Jun 5 2026

In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practical side of runbook automation — moving beyond static documentation to executable, automated responses. They explore how companies like Google and Netflix use runbook automation to reduce mean time to repair by up to 60%, and discuss the common pitfalls: over-automation, stale runbooks, and the tension between speed and safety. Lucas shares a concrete example from a major e-commerce platform where automated runbooks cut incident response time from 45 minutes to under 5. Luna challenges whether automation can replace human judgment in complex outages. The conversation also touches on tools like Rundeck, PagerDuty Automation, and custom Slack bots. By the end, listeners will understand the key principles for building runbooks that actually get followed in the heat of an incident. #SiteReliabilityEngineering #RunbookAutomation #SRE #IncidentResponse #DevOps #Automation #GoogleSRE #Netflix #PagerDuty #Rundeck #MeanTimeToRepair #Technology #ProductionEngineering #Uptime #FexingoBusiness #BusinessPodcast #TechOps #OnCall Keep every episode free: buymeacoffee.com/fexingo

8 mins

How SRE Teams Use Cost Optimization to Balance Performance and Budget

Jun 5 2026

In this episode of The Site Reliability Podcast with Fexingo, Lucas and Luna dive into the often-overlooked intersection of site reliability engineering and cloud cost optimization. They explore how SRE teams at companies like Uber and Airbnb use techniques such as right-sizing instances, leveraging spot instances, and implementing autoscaling policies to reduce infrastructure spend without sacrificing reliability. Specific metrics like cost per transaction and cost per request are discussed as key indicators. The hosts also examine the trade-offs between reserved and on-demand instances, the role of FinOps in SRE, and how to set cost-aware SLOs. A concrete example from a mid-sized SaaS company shows how they saved 35% on AWS costs by shifting to a well-architected framework. This episode offers practical strategies for SREs and platform engineers looking to optimize both uptime and cloud bills. #SiteReliabilityEngineering #CloudCostOptimization #FinOps #SRE #CostOptimization #Uber #Airbnb #AWS #Autoscaling #SpotInstances #ReservedInstances #SLOs #Technology #Podcast #FexingoBusiness #BusinessPodcast #CloudComputing #Infrastructure Keep every episode free: buymeacoffee.com/fexingo
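
The cost-per-request metric mentioned above is plain arithmetic: spend divided by requests served over the same period. The sketch below is only an illustration; the function names are made up for this example and nothing here comes from the episode's data.

```python
def cost_per_request(total_cost_usd: float, request_count: int) -> float:
    """Infrastructure spend divided by requests served in the same period."""
    if request_count <= 0:
        raise ValueError("request_count must be positive")
    return total_cost_usd / request_count


def savings_percent(before_usd: float, after_usd: float) -> float:
    """Relative reduction in spend, as a percentage of the earlier figure."""
    if before_usd <= 0:
        raise ValueError("before_usd must be positive")
    return (before_usd - after_usd) / before_usd * 100
```

Tracking the first value per service over time makes it easier to tell whether a bill grew because traffic grew or because each request became more expensive.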

7 mins

How SRE Teams Use Load Shedding to Survive Traffic Spikes

Jun 4 2026

When a massive traffic spike hits, every millisecond of latency can cost thousands of dollars. In this episode, Lucas and Luna explore load shedding — the SRE technique of intentionally dropping non-critical requests to keep core systems running. They walk through how Google SREs used load shedding during the 2020 YouTube outage, how Stripe applies graceful degradation during payment surges, and why Netflix deliberately kills low-priority traffic during peak hours. They also break down the mental shift required: treating load shedding as a feature, not a failure. If you're an SRE, platform engineer, or just someone who wonders why services fail gracefully sometimes and fall over completely other times, this one's for you. #SiteReliabilityEngineering #LoadShedding #TrafficSpikes #GoogleSRE #Stripe #Netflix #GracefulDegradation #CapacityPlanning #IncidentResponse #SREBestPractices #Observability #PriorityBasedShedding #FexingoBusiness #BusinessPodcast #Technology #Podcast #SRE #Uptime Keep every episode free: buymeacoffee.com/fexingo
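
Priority-based load shedding, as described above, can be sketched as a single admission check. The priority scale and the utilization thresholds below are assumptions chosen for illustration, not values used by any of the companies named in the episode.

```python
def should_shed(priority: int, utilization: float) -> bool:
    """Return True if a request should be dropped.

    priority: 0 = critical; larger numbers are less important (assumed scale).
    utilization: current load as a fraction of capacity, e.g. 0.85.
    """
    # Assumed thresholds: the busier the system, the more tiers are shed.
    if utilization >= 0.95:
        return priority >= 1  # keep only critical traffic
    if utilization >= 0.80:
        return priority >= 2  # drop best-effort traffic
    return False  # normal load: serve everything
```

Because the check is explicit and tiered, dropping a request is a designed outcome rather than an accident, which is the "feature, not a failure" framing the hosts describe.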

10 mins

How SRE Teams Use Feature Flags to Reduce Incident Risk

Jun 4 2026

Feature flags are a powerful tool for SREs, but they come with their own operational risks. In this episode, Lucas and Luna explore how companies like Etsy, Netflix, and LaunchDarkly use feature flags to decouple deployment from release, enabling canary rollouts, instant kill switches, and safer experimentation. They break down the difference between boolean flags, multivariate flags, and experiment flags, and discuss the hidden costs: flag debt, stale flags, and the risk of configuration cascades. Lucas shares a specific incident where a misconfigured flag caused a cascading failure at a major e-commerce platform, and how the team rebuilt their flag management system. Luna asks the hard questions about observability and testing: how do you know a flag is safe to flip? And when do you remove an old flag? The episode closes with a forward-looking question about the future of progressive delivery and whether SRE teams should treat flags as infrastructure code. #FeatureFlags #SRE #SiteReliabilityEngineering #LaunchDarkly #Etsy #Netflix #ProgressiveDelivery #CanaryDeployments #KillSwitch #FlagDebt #ConfigurationManagement #Observability #IncidentResponse #DevOps #Technology #FexingoBusiness #BusinessPodcast #ProductionEngineering Keep every episode free: buymeacoffee.com/fexingo
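
A kill switch combined with a percentage rollout, two of the uses described above, might look roughly like this. It is a sketch under stated assumptions: the flag is a plain dictionary with hypothetical keys (`name`, `killed`, `rollout_percent`), and it does not reflect the API of LaunchDarkly or any other flag service.

```python
import hashlib


def bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: dict, user_id: str) -> bool:
    """Decide whether a user sees the flagged code path."""
    # The kill switch overrides the rollout percentage entirely.
    if flag.get("killed", False):
        return False
    return bucket(user_id, flag["name"]) < flag["rollout_percent"]
```

Hashing the flag name together with the user ID keeps each user's result stable across requests while giving different flags independent rollout populations.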

11 mins

