Text-to-Speech Data Preparation and Fine-tuning Workshop - Ronan McGovern

Introduction to the Workshop 00:00
Ronan McGovern introduces the workshop, which focuses on fine-tuning Sesame's CSM 1B text-to-speech model.
Materials for the workshop are available on GitHub under the AI Worldsfare-2025 repository.
Understanding Token-Based Text-to-Speech Models 00:36
Participants will learn to create a voice dataset from a selected YouTube video for fine-tuning.
Explanation of how token-based models work, including how audio tokens are predicted from text tokens.
Discussion of the importance of using a transformer model that can handle both text and audio inputs (a minimal token-sequence illustration follows below).
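As a rough mental model (not code from the workshop), a token-based TTS transformer sees one flat sequence of text tokens followed by audio codec tokens and learns to predict the next audio token; every token ID below is invented purely for illustration.

```python
# Illustration only: invented token IDs showing how text tokens and audio
# (codec) tokens can be flattened into one sequence for an autoregressive model.
text_tokens = [101, 7592, 2088, 102]      # e.g. a tokenized sentence
audio_tokens = [5012, 4873, 5120, 4999]   # e.g. audio codec codebook indices

sequence = text_tokens + audio_tokens

# Autoregressive objective: given all tokens so far, predict the next audio token.
for i in range(len(text_tokens), len(sequence)):
    context, target = sequence[:i], sequence[i]
    print(f"given {len(context)} tokens -> predict audio token {target}")
```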
Data Preparation for Fine-Tuning 06:01
The workshop demonstrates data generation using Whisper for transcribing audio from a YouTube video.
A dataset is created with audio snippets and their corresponding transcriptions, with a recommendation to use around 30-second clips.
Emphasis on ensuring data quality by correcting transcription errors and combining segments into longer audio clips (a sketch of this preparation step follows below).
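A minimal sketch of this preparation step, assuming the YouTube audio has already been extracted to source.wav (for example with yt-dlp) and that the openai-whisper and pydub packages are installed; the file names, Whisper model size, and merging logic are illustrative choices, not the workshop's exact code.

```python
import json
import whisper
from pydub import AudioSegment

TARGET_SECONDS = 30  # the roughly 30-second clip length recommended in the workshop

model = whisper.load_model("small")       # model size is an assumption
result = model.transcribe("source.wav")   # yields segments with start/end/text

audio = AudioSegment.from_file("source.wav")
clips, start, texts = [], None, []

for seg in result["segments"]:
    if start is None:
        start = seg["start"]
    texts.append(seg["text"].strip())
    # Merge consecutive Whisper segments until the clip reaches ~30 seconds.
    if seg["end"] - start >= TARGET_SECONDS:
        clip_path = f"clip_{len(clips):03d}.wav"
        audio[int(start * 1000):int(seg["end"] * 1000)].export(clip_path, format="wav")
        clips.append({"audio": clip_path, "text": " ".join(texts)})
        start, texts = None, []

# Proofread and correct the transcripts by hand before training.
with open("dataset.json", "w") as f:
    json.dump(clips, f, indent=2)
```

Any tail shorter than the target length is simply dropped here; in practice each transcript should also be reviewed and corrected by hand, as the workshop recommends.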
Fine-Tuning the Model 13:51
Instructions for installing the Unsloth library and loading the model through it for fine-tuning.
The workshop prepares to fine-tune the CSM 1B model on the generated dataset, training only a subset of parameters to save memory and improve speed (a hedged loading sketch follows below).
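A minimal loading sketch, assuming Unsloth's FastModel API and an unsloth/csm-1b checkpoint name; the identifier, LoRA settings, and target modules are assumptions rather than the workshop's exact configuration.

```python
from unsloth import FastModel

# Load the base CSM 1B checkpoint (model name assumed).
model, processor = FastModel.from_pretrained(
    model_name="unsloth/csm-1b",
    max_seq_length=2048,
    load_in_4bit=False,
)

# Attach LoRA adapters so only a small subset of parameters is trained,
# which is what saves memory and speeds up fine-tuning.
model = FastModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed modules
)
```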
Inference and Comparison 17:05
Initial inference is run to evaluate the base model's output before fine-tuning.
Voice cloning is introduced as a way to personalize the output: a sample audio clip is passed in as context so the model imitates that speaker (a generation sketch follows below).
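A minimal generation sketch based on the Hugging Face Transformers example for CSM; the workshop itself runs inference through Unsloth, so the exact calls may differ, and the speaker-prefix convention shown here follows the model card.

```python
import torch
from transformers import AutoProcessor, CsmForConditionalGeneration

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# "[0]" selects a speaker id, following the model card's prompt format.
text = "[0]Hello from the fine-tuning workshop."
inputs = processor(text, add_special_tokens=True).to(device)

# output_audio=True returns a decoded waveform rather than raw audio tokens.
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "base_model_output.wav")
```

For voice cloning, the same pattern is extended by providing a reference audio clip and its transcript as context ahead of the text to be spoken.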
Training the Fine-Tuned Model 24:05
Detailed steps for setting up the training run, including batch size, learning rate, and monitoring the loss (a hedged trainer configuration follows below).
Training runs show improved performance and a steady reduction in loss as the model adapts to the new voice data.
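A hedged configuration sketch using the standard Hugging Face Trainer pattern; every hyperparameter value here is an assumption rather than the workshop's exact setting, and train_dataset refers to the dataset prepared earlier.

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="csm-1b-finetuned",
    per_device_train_batch_size=1,     # small batches are typical for audio data
    gradient_accumulation_steps=4,
    learning_rate=2e-4,                # a common LoRA fine-tuning rate (assumed)
    num_train_epochs=1,
    warmup_steps=5,
    logging_steps=1,                   # log every step to watch the loss fall
    report_to="none",
)

trainer = Trainer(
    model=model,                 # the LoRA-wrapped model from the earlier sketch
    args=training_args,
    train_dataset=train_dataset, # prepared audio/text dataset (assumed variable)
    # A data collator suited to the audio/text features may also be required.
)
trainer.train()
```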
Evaluating Fine-Tuned Performance 31:03
The results of the fine-tuned model are compared against the base model, showing notable improvements in voice output quality and personalization.
Suggestions for further data collection to enhance model performance and quality.
Conclusion and Next Steps 33:02
The workshop concludes with a recap of the fine-tuning process and the effectiveness of combining voice cloning with fine-tuning.
Encouragement to explore further resources on GitHub and to provide feedback or questions in the comments.