Importance and Challenges of Computer Vision 00:00
Computer vision is essential for systems interacting with the real world, as vision is foundational in built environments.
There is a significant gap between human vision and the current abilities of computer vision models, larger than the gap in speech processing.
Unique challenges in computer vision include the need for low latency and edge processing, as decisions must happen in real time and can't rely on centralized computation.
Existing benchmarks like ImageNet and COCO focus largely on pattern matching rather than deeper visual intelligence, leading to saturated evaluations that don't require real visual understanding.
Differences Between Vision and Language Model Pre-training 01:39
Vision models do not benefit from large-scale pre-training to the same extent as language models, lacking powerful, generalizable embeddings.
No current vision model can leverage large-scale pre-trained embeddings as effectively as language models do, partly because existing pre-trained vision models are of limited quality.
Example of LLMs' visual shortcomings: Chatbots like Claude 3.5 and Claude 4 fail at simple visual tasks such as telling the time on a watch, indicating a lack of genuine visual understanding.
The MMVP dataset exposes the inability of vision-language models to see subtle visual differences, often leading to incorrect and hallucinated responses.
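To make this kind of failure easy to reproduce, here is a minimal probe sketch (not from the talk): it sends a watch photo to a vision-language model through the Anthropic Python SDK and asks for the time. The model id and image path are placeholders.

```python
# Sketch (not from the talk): probe a vision-language model with a simple
# visual task -- reading the time from a watch photo.
# The model id and image path are placeholders.
import base64

import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

with open("watch.jpg", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model id
    max_tokens=100,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_b64}},
            {"type": "text", "text": "What time does this watch show?"},
        ],
    }],
)
print(response.content[0].text)
```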
Vision-Only Pre-training and Feature Discovery 05:00
CLIP, a prominent vision-language model, is limited in discriminating between visually similar images if textual captions do not sufficiently differentiate them.
Pure vision models like DINOv2, trained solely on images in a self-supervised manner, better differentiate fine-grained visual details and discover analogies in object parts across categories.
A significant open problem is effectively aligning vision features with language for improved visual fidelity in vision-language models.
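To make the CLIP-versus-DINOv2 contrast concrete, the following is a minimal sketch (not from the talk) that embeds two visually similar images with both models via Hugging Face transformers and compares their cosine similarities; the image paths are placeholders.

```python
# Sketch: compare how CLIP and DINOv2 image embeddings separate two visually
# similar images. Checkpoints are standard Hugging Face ones; image paths are
# placeholders.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

images = [Image.open(p).convert("RGB") for p in ("dog_left.jpg", "dog_right.jpg")]

# CLIP image tower (text-supervised pre-training)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    clip_emb = clip.get_image_features(**clip_proc(images=images, return_tensors="pt"))

# DINOv2 (image-only, self-supervised pre-training); take the CLS token
dino = AutoModel.from_pretrained("facebook/dinov2-base")
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
with torch.no_grad():
    dino_emb = dino(**dino_proc(images=images, return_tensors="pt")).last_hidden_state[:, 0]

for name, emb in [("CLIP", clip_emb), ("DINOv2", dino_emb)]:
    sim = F.cosine_similarity(emb[0], emb[1], dim=0).item()
    print(f"{name} cosine similarity: {sim:.3f}")  # lower = images better separated
```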
Advances in Object Detection: Transformers vs. CNNs 07:06
In object detection, transformer-based models (like LW-DETR) benefit much more from large-scale pre-training than convolutional models (like YOLOv8), showing substantial performance gains.
Pre-training on datasets like Objects365 (1.6 million images) is considered large in vision but tiny compared to language model pre-training.
There is a growing trend toward transformer-based vision models to better utilize pre-training and improve downstream task performance.
Roboflow introduces RF-DETR, a model combining the LW-DETR transformer architecture with a DINOv2 pre-trained backbone for improved real-time object detection (see the inference sketch at the end of this section).
On the traditional COCO benchmark, RF-DETR achieves strong (though not leading) results; however, the gains are more evident on Roboflow's new RF100-VL dataset.
RF100-VL is a new benchmark comprising 100 curated object detection datasets from diverse domains, focusing on challenging scenarios like varied camera angles, imaging types (e.g., microscopy, X-ray), and rare classes.
The new dataset aims to better measure the intelligence, adaptability, and generalization of vision models compared to COCO, which focuses on familiar, easy classes.
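As a usage sketch, and assuming Roboflow's rfdetr pip package exposes an RFDETRBase model with a predict() method (check the repository for the exact released API), running the pretrained checkpoint on a single image looks roughly like this:

```python
# Rough sketch of running RF-DETR for real-time object detection, assuming the
# `rfdetr` pip package provides an RFDETRBase model whose predict() returns
# supervision-style Detections (verify against the released API).
from PIL import Image
from rfdetr import RFDETRBase

model = RFDETRBase()  # loads the pretrained (DINOv2-backbone) checkpoint

image = Image.open("example.jpg")  # placeholder path
detections = model.predict(image, threshold=0.5)

for class_id, conf, box in zip(
    detections.class_id, detections.confidence, detections.xyxy
):
    print(class_id, f"{conf:.2f}", box)
```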
Benchmarking and Performance of Models on RF100-VL 11:24
RF100-VL serves as both a visual and vision-language benchmark, with contextualized class names and instructions to evaluate deeper understanding.
Specialized detectors (like fine-tuned Grounding DINO) outperform large vision-language models on RF100-VL, especially when given few-shot (e.g., 10-shot) examples per class.
Vision-language models still struggle to generalize in the visual domain, even though they perform well in the linguistic domain, highlighting a key research gap.
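These comparisons come down to standard COCO-style mAP; a minimal per-dataset evaluation sketch, assuming ground truth and predictions are stored as COCO-format JSON (the file names below are placeholders), looks like this:

```python
# Sketch: COCO-style mAP evaluation of a detector on one RF100-VL dataset,
# assuming COCO-format ground truth and predictions (placeholder file names).
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("annotations.json")            # ground-truth boxes
preds = gt.loadRes("predictions.json")   # detector output in COCO results format

evaluator = COCOeval(gt, preds, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()                    # prints AP / AP50 / AP75, etc.

map_50_95 = evaluator.stats[0]           # mAP averaged over IoU 0.5:0.95
print(f"mAP: {map_50_95:.3f}")
```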
Community Data Sharing and Benchmark Accessibility 14:25
Roboflow's platform is freely available for researchers; in exchange, users contribute their labeled data back to the open-source community.
Many datasets in RF100-VL come from research publications, including medical and biological imagery, fostering diverse and real-world evaluation scenarios.
The RF100-VL dataset and associated resources are publicly available at rf100vl.org and on platforms like Hugging Face.
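For programmatic access, a minimal download sketch using the roboflow pip package (the workspace and project identifiers below are placeholders, not real RF100-VL dataset IDs) might look like this:

```python
# Sketch: pull one dataset from Roboflow in COCO format via the `roboflow`
# pip package. Workspace/project names are placeholders; the actual RF100-VL
# dataset identifiers are listed at rf100vl.org.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("example-workspace").project("example-project")
dataset = project.version(1).download("coco")  # writes images + COCO JSON locally

print(dataset.location)  # local directory containing the downloaded split
```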
Current Limitations and Future Directions in Vision AI 15:14
The few-shot (10-shot) evaluation track in RF100-VL tests a model's ability to learn from limited examples using class names, visual examples, and instructions (see the sampling sketch at the end of this section).
No current vision-language model can fully leverage all provided information in the 10-shot setup better than specialized detectors.
Fine-tuning specialist models remains more effective than using current generalist vision-language models, but future research should aim to bridge this gap.
The dataset encourages research towards generalist vision models that can effectively integrate multi-modal instructions and visual examples for object detection tasks.
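To illustrate what the 10-shot track gives a model to work with, here is a small sketch (my own construction, not the official split logic) that samples up to ten annotated instances per class from a COCO-format dataset:

```python
# Sketch: build a 10-shot support set from a COCO-format dataset by keeping up
# to ten annotated instances per class (illustrative only; the official
# RF100-VL few-shot splits ship with the benchmark).
import json
import random
from collections import defaultdict

with open("annotations.json") as f:   # placeholder path
    coco = json.load(f)

random.seed(0)
per_class = defaultdict(list)
for ann in coco["annotations"]:
    per_class[ann["category_id"]].append(ann)

support = []
for cat_id, anns in per_class.items():
    support.extend(random.sample(anns, min(10, len(anns))))

image_ids = {ann["image_id"] for ann in support}
print(f"{len(support)} support annotations over {len(image_ids)} images")
```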