How Data Scientists Use Distributed Computing for Massive Datasets

When your dataset outgrows a single machine, what do you do? In this episode, Lucas and Luna explore how data scientists use distributed computing frameworks like Apache Spark and Dask to process terabytes of data without crashing their laptops. They break down the key concept of data partitioning, explain why MapReduce is still relevant, and walk through a real example of how a mid-sized e-commerce company reorganized its log-processing pipeline to cut runtime from 14 hours to 47 minutes.

Lucas shares a cautionary tale about shuffling bottlenecks that can ruin a cluster's performance, and Luna asks the practical question every team faces: when does it make sense to move from a single-node pandas workflow to a distributed system? They also discuss managed services like Databricks and AWS EMR versus rolling your own cluster.

No prior distributed systems experience required, just a curiosity about what happens when data gets too big for a spreadsheet.

#DataScience #DistributedComputing #ApacheSpark #Dask #MapReduce #BigData #DataEngineering #DataPartitioning #Shuffling #Databricks #AWSEmr #Pandas #Tech #Technology #FexingoBusiness #BusinessPodcast #Podcast #DataPodcast

Keep every episode free: buymeacoffee.com/fexingo