Written by @Christopher Norman
New Vision-Language models with multimodal inputs have been released. The question is whether we can use one of these models to perform the matching task in the Dobble game scenario, for which we previously trained a convolutional neural network. The Vision-Language model will take two inputs: an image and a text prompt. The output will be text, which should be parsed into a schema containing the word (and possibly the relative positions) of the matching symbol.

This experiment should take no more than three days, utilising open-source models or APIs to speed up the initial experimentation. We will not use edge devices in the first pass, and no training will be involved: our goal is to determine whether a pre-trained Vision-Language model can perform this task without post-training or fine-tuning. We will briefly analyse how well these models might run on a Jetson Nano or similar edge device, and consider the computational cost of running them on images sampled from a continuous image stream.
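As a concrete illustration of the output side, a minimal sketch of the schema and parsing step could look like the following. The `MatchResult` model, the prompt wording, and the `parse_response` helper are illustrative assumptions for this plan, not settled design decisions:

```python
from typing import Optional, Tuple
from pydantic import BaseModel, ValidationError


class MatchResult(BaseModel):
    """Expected structure of the parsed Vision-Language model reply."""
    symbol: str  # name of the matching symbol, e.g. "anchor"
    position_card_a: Optional[Tuple[float, float]] = None  # relative (x, y) in [0, 1], if requested
    position_card_b: Optional[Tuple[float, float]] = None


PROMPT = (
    "The image shows two Dobble cards. Exactly one symbol appears on both cards. "
    'Reply with JSON only: {"symbol": "<name>", '
    '"position_card_a": [x, y], "position_card_b": [x, y]}, '
    "with positions given as fractions of the image width and height."
)


def parse_response(raw_text: str) -> Optional[MatchResult]:
    """Validate the model's free-text reply against the schema; return None if it does not conform."""
    try:
        return MatchResult.model_validate_json(raw_text)
    except ValidationError:
        return None
```

Counting how often `parse_response` returns None would also give us a first, cheap measure of how reliably a given model follows the requested output format.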
Note that these models may fail at the task; if they do, that failure itself helps us explain the motivation for using small, highly specific models instead of large, generic ones. If the experiment does work, we may explore deploying the model onto Jetson Nano hardware.
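To ground the stream-cost question raised above, a rough back-of-the-envelope calculation can be scripted up front. The sampling rate, latency, and price figures below are placeholder assumptions, not measurements:

```python
def stream_cost_estimate(frames_per_second: float,
                         seconds_per_request: float,
                         price_per_request_usd: float) -> dict:
    """Rough feasibility numbers for pushing a sampled image stream through a hosted VLM."""
    requests_per_hour = frames_per_second * 3600
    # A single synchronous worker only keeps up if each request finishes
    # before the next sampled frame arrives.
    max_sync_fps = 1.0 / seconds_per_request
    return {
        "requests_per_hour": requests_per_hour,
        "usd_per_hour": requests_per_hour * price_per_request_usd,
        "max_sync_fps": max_sync_fps,
        "keeps_up_synchronously": frames_per_second <= max_sync_fps,
    }


# Placeholder numbers: 1 frame/s sampled, ~2 s per request, ~$0.01 per request.
print(stream_cost_estimate(frames_per_second=1.0,
                           seconds_per_request=2.0,
                           price_per_request_usd=0.01))
```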
The main focus of this project should be failing fast: we should make whatever cuts are necessary to minimise the time needed to prove or disprove the hypothesis that these models can complete the task without fine-tuning. We should try the “best” and largest models first and work our way down the sizes; if a paid API is the cheapest and fastest option, we should try it first.
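If we do start with a paid API, the call itself is straightforward. The sketch below assumes an OpenAI-compatible vision endpoint and uses "gpt-4o" purely as a placeholder model name; any provider that accepts image inputs in this message format would do:

```python
import base64

from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY from the environment

client = OpenAI()


def ask_vlm(image_path: str, prompt: str, model: str = "gpt-4o") -> str:
    """Send one Dobble card-pair image plus the matching prompt; return the raw text reply."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

The raw reply string can then be handed straight to the `parse_response` helper sketched earlier, giving a complete image-in, schema-out loop with almost no glue code.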
This project provides an opportunity to explore the landscape of Vision-Language models, compare their strengths and weaknesses, and understand how multi-modal models function. We will also be able to compare models of different sizes and from various providers.
The project will result in a blog post summarising our findings, a LinkedIn post highlighting key insights, and internal documentation explaining how multi-modal models work and how they differ from text-only models or CNNs.
Candidate open-source model: https://github.com/deepseek-ai/DeepSeek-VL2