Advanced LLM Optimization techniques
Failed to add items
Sorry, we are unable to add the item because your shopping cart is already at capacity.
Add to basket failed.
Please try again later
Add to wishlist failed.
Please try again later
Remove from wishlist failed.
Please try again later
Adding to library failed
Please try again
Follow podcast failed
Unfollow podcast failed
-
Narrated by:
-
By:
Welcome to another Data Architecture Elevator podcast! Today's discussion is hosted by Paolo Platter supported by our experts Antonino Ingargiola and Irene Donato.
In this episode, we explore effective strategies for optimizing large language models (LLMs) for inference tasks with multimodal data like audio, text, images, and video.
We discuss the shift from online APIs to hosted models, choosing smaller, task-specific models, and leveraging fine-tuning, distillation, quantization, and tensor fusion techniques. We also highlight the role of specialized inference servers such as Triton and Dynamo, and how Kubernetes helps manage horizontal scaling.
Don't forget to follow us on LinkedIn! Enjoy!
adbl_web_anon_alc_button_suppression_t1
No reviews yet