Text-to-Speech Data Preparation and Fine-tuning Workshop - Ronan McGovern

Introduction to the Workshop 00:00

  • Ronan McGovern introduces the workshop, which focuses on fine-tuning Sesame's CSM 1B text-to-speech model.
  • Materials for the workshop are available on GitHub under the AI Worldsfare-2025 repository.

Understanding Token-Based Text-to-Speech Models 00:36

  • Participants will learn to create a voice dataset from a selected YouTube video for fine-tuning.
  • Explanation of how token-based models work, including the autoregressive prediction of audio tokens from text tokens (illustrated in the sketch after this list).
  • Discussion on the importance of using a transformer model that can handle both text and audio inputs.
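The snippet below is a purely conceptual sketch of this idea, not CSM's implementation: text tokens condition an autoregressive loop that emits discrete audio-codec tokens, and a neural codec decoder would then turn those tokens back into a waveform. The codebook size, frame rate, and function names are illustrative placeholders.

```python
# Conceptual sketch (not the CSM implementation): a token-based TTS model
# maps text tokens to discrete audio-codec tokens autoregressively, and a
# codec decoder turns those tokens into a waveform. All names are illustrative.
import numpy as np

VOCAB_AUDIO = 1024    # size of each audio codebook (illustrative)
FRAME_RATE_HZ = 12.5  # audio-token frames per second (illustrative)

def predict_next_audio_token(text_tokens, audio_tokens_so_far):
    """Stand-in for the transformer: returns one audio-codec token.
    A real model conditions on both the text and the audio history."""
    rng = np.random.default_rng(len(audio_tokens_so_far))
    return int(rng.integers(0, VOCAB_AUDIO))

def generate_audio_tokens(text_tokens, seconds=2.0):
    audio_tokens = []
    for _ in range(int(seconds * FRAME_RATE_HZ)):
        audio_tokens.append(predict_next_audio_token(text_tokens, audio_tokens))
    return audio_tokens

# A codec decoder (the model's neural codec) would then convert these
# discrete tokens back into a waveform.
tokens = generate_audio_tokens(text_tokens=[101, 2023, 2003, 102])
print(len(tokens), "audio tokens for roughly 2 seconds of speech")
```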

Data Preparation for Fine-Tuning 06:01

  • The workshop demonstrates data generation using Whisper to transcribe audio from a YouTube video (a rough sketch follows this list).
  • A dataset is created with audio snippets and their corresponding transcriptions, with a recommendation to use around 30-second clips.
  • Emphasis on ensuring data quality by correcting transcription errors and combining segments to create longer audio clips.
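A rough sketch of this data-generation step is below, assuming yt-dlp, openai-whisper, and pydub as the tooling; the workshop notebook may use different libraries, clip lengths, or a Hugging Face dataset format, and the video URL is a placeholder.

```python
# Sketch: download audio from a YouTube video, transcribe it with Whisper,
# and merge consecutive segments into ~30-second clips with matching text.
import json
import subprocess
import whisper
from pydub import AudioSegment

URL = "https://www.youtube.com/watch?v=YOUR_VIDEO_ID"  # placeholder

# 1. Download the audio track as a WAV file.
subprocess.run(
    ["yt-dlp", "-x", "--audio-format", "wav", "-o", "source.%(ext)s", URL],
    check=True,
)

# 2. Transcribe with Whisper; each segment carries start/end times and text.
asr = whisper.load_model("small")
result = asr.transcribe("source.wav")

# 3. Merge segments until each clip reaches ~30 seconds
#    (any final partial clip is dropped for simplicity).
audio = AudioSegment.from_wav("source.wav")
clips, start, text = [], None, ""
for seg in result["segments"]:
    if start is None:
        start = seg["start"]
    text += " " + seg["text"].strip()
    if seg["end"] - start >= 30:
        clips.append({"audio": audio[int(start * 1000):int(seg["end"] * 1000)],
                      "text": text.strip()})
        start, text = None, ""

# 4. Export the clips and a metadata file pairing each clip with its text.
with open("metadata.jsonl", "w") as f:
    for i, clip in enumerate(clips):
        name = f"clip_{i:03d}.wav"
        clip["audio"].export(name, format="wav")
        f.write(json.dumps({"file": name, "text": clip["text"]}) + "\n")
print(f"Wrote {len(clips)} clips of roughly 30 seconds each")
```

After exporting, the transcriptions should still be reviewed by hand and obvious Whisper errors corrected before training, as noted above.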

Fine-Tuning the Model 13:51

  • Instructions for installing the Unsloth library and loading the CSM 1B model for fine-tuning.
  • The workshop prepares to fine-tune CSM 1B on the generated dataset, training only a small subset of parameters to save memory and improve speed (see the sketch after this list).
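A minimal sketch of that setup is below, assuming Unsloth's FastModel interface and an "unsloth/csm-1b" checkpoint name as used in its TTS notebooks; the exact argument names and LoRA hyperparameters are assumptions and may differ from the workshop's.

```python
# Sketch: load CSM 1B with Unsloth and attach LoRA adapters so that only a
# small set of low-rank matrices is trained. Checkpoint name and arguments
# are assumptions; check the workshop notebook and your Unsloth version.
from unsloth import FastModel

model, processor = FastModel.from_pretrained(
    model_name="unsloth/csm-1b",  # assumed checkpoint name
    max_seq_length=2048,
    load_in_4bit=False,
)

# LoRA: train low-rank adapter matrices instead of the full weights,
# which cuts memory use and speeds up fine-tuning.
model = FastModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```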

Inference and Comparison 17:05

  • Initial inference is conducted to evaluate the base model's performance before fine-tuning.
  • Voice cloning is introduced as a method to personalize the voice output by passing in a sample audio clip as context (sketched below).
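The sketch below shows what voice cloning with the base model can look like, assuming the Transformers CSM integration and its chat-template interface; the speaker role format and the save_audio helper follow that integration's documentation and are assumptions here, not the workshop's exact code.

```python
# Sketch: clone a voice by giving the model a reference clip and its transcript
# as context, then generating new text in that voice.
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "sesame/csm-1b"
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map="cuda")

conversation = [
    # Context turn: sample audio of the target voice plus its transcript.
    {"role": "0", "content": [
        {"type": "text", "text": "Transcript of the reference clip."},
        {"type": "audio", "path": "clip_000.wav"},
    ]},
    # Generation turn: the text we want spoken in that voice.
    {"role": "0", "content": [
        {"type": "text", "text": "Hello from the cloned voice."},
    ]},
]

inputs = processor.apply_chat_template(
    conversation, tokenize=True, return_dict=True
).to(model.device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "cloned_sample.wav")  # assumed helper from the docs
```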

Training the Fine-Tuned Model 24:05

  • Detailed steps for setting up the training run, including batch size, learning rate, and monitoring the loss during training (a sketch follows this list).
  • The training run shows the loss decreasing steadily, indicating that the model is adapting to the target voice.
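A sketch of the training configuration is below, assuming a Hugging Face Trainer-style loop like Unsloth's notebooks use; the batch size, learning rate, and step counts are illustrative rather than the workshop's exact values, and model and train_dataset refer to objects prepared in the earlier steps.

```python
# Sketch: Trainer setup for the LoRA-wrapped CSM model. Hyperparameters are
# illustrative; logging every step makes it easy to watch the loss fall.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="csm-1b-finetuned",
    per_device_train_batch_size=2,   # small batches to fit a single GPU
    gradient_accumulation_steps=4,   # effective batch size of 8
    learning_rate=2e-4,              # a common starting point for LoRA
    num_train_epochs=1,
    warmup_steps=5,
    logging_steps=1,                 # log the loss at every step
    report_to="none",
)

trainer = Trainer(
    model=model,                  # the LoRA-wrapped CSM model from earlier
    args=args,
    train_dataset=train_dataset,  # the ~30-second clips with transcriptions
)
trainer.train()
```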

Evaluating Fine-Tuned Performance 31:03

  • The results of the fine-tuned model are compared against the base model, showing notable improvements in voice output quality and personalization.
  • Suggestions for further data collection to enhance model performance and quality.

Conclusion and Next Steps 33:02

  • The workshop concludes with a recap of the fine-tuning process and the effectiveness of combining cloning with fine-tuning.
  • Encouragement to explore further resources on GitHub and to provide feedback or questions in the comments.