Many popular multimodal AI models like Claude, Grok, and Llama are not genuinely multimodal; they rely on late fusion techniques that handle text and images separately during training.
Models such as Gemini and GPT-4 may be exceptions, but the late fusion approach connects separately trained components rather than integrating them from the start. Its appeal is practical: it lets builders reuse pre-trained components that are easier to train because large single-modality datasets already exist. A minimal sketch of the idea appears below.
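To make the distinction concrete, here is a minimal late-fusion sketch in PyTorch. The module names, dimensions, and the toy "vision encoder" are illustrative assumptions, not any particular model's architecture; the point is only that two separately built components are bridged by a small learned projection.

```python
# Minimal sketch of late fusion, assuming hypothetical pre-trained components;
# all names and dimensions are illustrative.
import torch
import torch.nn as nn


class LateFusionModel(nn.Module):
    """Connects a separately trained vision encoder to a language model."""

    def __init__(self, vision_dim=256, text_dim=512, vocab_size=1000):
        super().__init__()
        # Stand-ins for pre-trained components (in practice, loaded checkpoints).
        self.vision_encoder = nn.Linear(3 * 32 * 32, vision_dim)  # e.g. a frozen ViT
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # The "fusion" is just a projection bridging the two embedding spaces.
        self.projector = nn.Linear(vision_dim, text_dim)

    def forward(self, image, text_ids):
        # Encode the image on its own, then project into the LLM's token space.
        img_feat = self.vision_encoder(image.flatten(1)).unsqueeze(1)  # (B, 1, vision_dim)
        img_tok = self.projector(img_feat)                             # (B, 1, text_dim)
        txt_tok = self.text_embed(text_ids)                            # (B, T, text_dim)
        # Image features meet the text embeddings only at this late stage.
        return self.language_model(torch.cat([img_tok, txt_tok], dim=1))


model = LateFusionModel()
out = model(torch.randn(2, 3, 32, 32), torch.randint(0, 1000, (2, 16)))
print(out.shape)  # (2, 17, 512)
```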
Meta's research group, FAIR, has shown promise with early fusion models, which process text and visual data together from the start using a unified transformer architecture.
Early fusion represents images as discrete tokens in the same sequence as text, so the two modalities can interact throughout the network rather than only at a final connection point.
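By contrast, an early-fusion sketch might look like the following. The image tokenizer here is a hypothetical stand-in for a learned quantizer (a VQ-style codebook, for example), and all vocabulary sizes and dimensions are illustrative; the key property is that one transformer consumes image and text tokens in a single shared sequence from the first layer on.

```python
# Minimal sketch of early fusion, assuming a hypothetical discrete image
# tokenizer; names and sizes are illustrative.
import torch
import torch.nn as nn

TEXT_VOCAB = 1000   # hypothetical text vocabulary size
IMAGE_VOCAB = 512   # hypothetical image codebook size
D_MODEL = 256


class EarlyFusionTransformer(nn.Module):
    """A single transformer trained on text and image tokens from the start."""

    def __init__(self):
        super().__init__()
        # One shared vocabulary: text tokens and discrete image tokens side by side.
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, D_MODEL)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.lm_head = nn.Linear(D_MODEL, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, token_ids):
        # Text and image tokens flow through the same layers; no separate encoder.
        return self.lm_head(self.transformer(self.embed(token_ids)))


def tokenize_image(image):
    # Placeholder for a learned quantizer mapping 8x8 patches to codebook indices.
    patches = image.unfold(2, 8, 8).unfold(3, 8, 8)   # (B, C, 4, 4, 8, 8)
    values = patches.mean(dim=(1, 4, 5)).flatten(1)   # one scalar per patch
    return TEXT_VOCAB + (values * IMAGE_VOCAB).long().clamp(0, IMAGE_VOCAB - 1)


text_ids = torch.randint(0, TEXT_VOCAB, (2, 16))
image_ids = tokenize_image(torch.rand(2, 3, 32, 32))
model = EarlyFusionTransformer()
logits = model(torch.cat([image_ids, text_ids], dim=1))  # one interleaved sequence
print(logits.shape)  # (2, 32, 1512)
```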
Training early fusion models is harder, because competing signals from the two modalities can make optimization unstable.
A study involving 457 models found that early fusion matches late fusion performance while being more efficient.
Early fusion models can also specialize spontaneously, with parts of the network devoting themselves to text or image processing without human guidance.
Increasing image resolution enhances early fusion performance, contradicting the belief that specialized encoders are needed for high-resolution images.