How Data Scientists Use Distributed Computing for Massive Datasets

When your dataset outgrows a single machine, what do you do? In this episode, Lucas and Luna explore how data scientists use distributed computing frameworks like Apache Spark and Dask to process terabytes of data without crashing their laptops. They break down the key concept of data partitioning, explain why MapReduce is still relevant, and walk through a real example of how a mid-sized e-commerce company reorganized its log-processing pipeline to cut runtime from 14 hours to 47 minutes. Lucas shares a cautionary tale about shuffling bottlenecks that can ruin a cluster's performance, and Luna asks the practical question every team faces: when does it make sense to move from a single-node pandas workflow to a distributed system? They also discuss managed services like Databricks and AWS EMR versus rolling your own cluster. No prior distributed systems experience required, just a curiosity about what happens when data gets too big for a spreadsheet.

#DataScience #DistributedComputing #ApacheSpark #Dask #MapReduce #BigData #DataEngineering #DataPartitioning #Shuffling #Databricks #AWSEmr #Pandas #Tech #Technology #FexingoBusiness #BusinessPodcast #Podcast #DataPodcast

Keep every episode free: buymeacoffee.com/fexingo
