Google DeepMind Vision Banana — One Model Beats SAM 3 and Depth Anything V3 Across 5 Vision Tasks
Google DeepMind released Vision Banana in April 2026. A single set of weights handles 5 vision tasks and beats Meta SAM 3 and Depth Anything V3 simultaneously. With He Kaiming and Xie Saining on the paper, it makes the case that image-generation pretraining is the core of generalist vision representation learning.
On this page (10)
- Before Vision Banana — The Reality of Vision Pipelines
- Vision Banana — Built on Nano Banana Pro
- Five Tasks in One — How It Actually Works
- It Beat Meta SAM 3 at Segmentation
- It Surpassed Depth Anything V3 Too — Metric Depth Results
- Image Generation Pretraining Was the Key — The Paper's Claim
- Zero-Shot Transfer — It Worked on Unseen Datasets Too
- He Kaiming and Xie Saining — What This Combination Means
- Try It Right Now — From Installation to Inference
- How to Migrate from Existing Pipelines to Vision Banana
May 2026 · AI Trends
Google DeepMind Vision Banana Released — Image Generation Pretraining Lets a Single Model Beat Specialized Models Across All 5 Visual Tasks
This is what vision AI pipelines actually look like. One model for segmentation, one for depth estimation, one for image generation. Each task gets its own model. The more tasks in the pipeline, the more models pile up. GPU memory gets tight, inference latency accumulates, and API version compatibility becomes a headache.
Google DeepMind flipped that structure on April 22, 2026. They released Vision Banana. Five visual tasks packed into a single set of weights. It beat Meta SAM 3 at segmentation. It dominated Depth Anything V3 at metric depth. The paper "Image Generators are Generalist Vision Learners," with He Kaiming and Xie Saining involved, proved one claim with benchmarks.
Image generation pretraining becomes the key to general visual representation learning. Just as next-token prediction in LLMs enables learning language broadly, the process of generating images enables learning visual representations broadly. Vision Banana is the evidence. The conditions for a single model to simultaneously surpass specialized models have arrived in vision AI too.
Quick Summary — Vision Banana at a Glance
- Bottom line: A single model handles 5 tasks — semantic segmentation, instance segmentation, metric depth, surface normal, and image generation. Simultaneously outperforms Meta SAM 3 and Depth Anything V3
- Released: April 22, 2026 · Google DeepMind
- Base model: Nano Banana Pro (instruction-tuned)
- Key finding: Image generation pretraining = the same role as next-token prediction in LLMs (general visual representation learning)
- Zero-shot transfer: Achieves SOTA-level results even on datasets not used in training
- Paper: "Image Generators are Generalist Vision Learners" — He Kaiming, Xie Saining et al.
- Can I use it now: Weights are public. Accessible via the transformers library in an A100-class GPU environment
Before Vision Banana — The Reality of Vision Pipelines
Existing vision pipelines were built from model combinations. SAM-series models separated objects, Depth Anything measured distance, and separate generative models produced outputs. Three models in a pipeline means triple the memory. Triple the latency. Every time model versions update, combination compatibility has to be verified again. The moment operational overhead exceeds coding overhead arrives quickly.
Robotics teams feel this problem most directly. Real-time environment perception simultaneously requires segmentation, depth, and surface normals. Loading three models at once puts GPU memory under pressure from the start. When inference latency accumulates, it affects the real-time control loop. If a robot takes 200ms to recognize an obstacle, the robot has already moved toward it during that time.
Autonomous driving has the same structure. Lane recognition (segmentation), obstacle distance measurement (metric depth), and road gradient detection (surface normal) are all needed simultaneously for every frame. The fewer models involved, the better the inference speed and memory usage. The demand for "one model that does everything" was already coming from the field. Vision Banana is the answer to that demand.
Multimodal LLM development is no different. Attaching image understanding to a text LLM requires a vision encoder. Combining specialized encoders for each task makes the pipeline complex. If a single vision encoder can handle segmentation, depth, and normals all at once, the design becomes simple. Vision Banana emerged as a candidate backbone for exactly that.
Vision Banana — Built on Nano Banana Pro
Vision Banana is built on Nano Banana Pro. It's a vision foundation model developed by Google DeepMind. A foundation model isn't designed for a specific task. It's a general-purpose base pretrained on large-scale data that can be fine-tuned in various directions. Just as GPT-4o-class models serve as a base layer in LLMs, Nano Banana Pro handles the vision foundation layer.
Vision Banana was completed by applying instruction-tuning on top of Nano Banana Pro. Instruction-tuning isn't training a chef who only knows one dish — it's retraining to handle whatever order comes in. "Do semantic segmentation on this image." "Measure the metric depth of this scene." It becomes capable of receiving and executing these instructions. The result is a single model, but the breadth of tasks it can handle is completely different.
The single-weights point is the key. The old approach loaded separate models for each task. Vision Banana handles 5 tasks from one weights file. Load one model into memory and that's it. No need to swap models when switching tasks. Change one argument and it switches from segmentation to depth, from depth to surface normal.
Vision Banana Model Specs Summary
- Base model: Nano Banana Pro (Google DeepMind vision foundation)
- Fine-tuning approach: Instruction-tuning (5 tasks unified)
- Weight structure: Single weights file for switching between 5 tasks
- Pretraining method: Image generation-based (generative pretraining)
- Release method: Weights + code public (Hugging Face, GitHub)
- Inference requirements: A100-class GPU or better recommended (FP16/BF16 supported)
Five Tasks in One — How It Actually Works
The 5 tasks are completely different in nature. Semantic segmentation assigns a category to every pixel in the image. It converts a scene into a category map — "this pixel is a person," "that pixel is a car," "that region is sky." Instance segmentation goes a step further. Even within the same "person" category, it distinguishes "Person A" from "Person B" individually. This is the capability needed when tracking a specific individual in a crowd.
Monocular metric depth measures the actual distance to subjects in meters using just a single camera. It's like accurately judging distance with one eye and no measuring tape. It's a critical capability for robots or autonomous vehicles that need to understand 3D space with cameras alone, without lidar. Surface normal estimation calculates the direction vector of a 3D surface — which way it's facing. This information is essential for light reflection simulation, physically-based rendering, and 3D reconstruction.
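To make "metric" concrete: once depth comes in real meters and the camera intrinsics are known, every pixel can be back-projected to a 3D point, and surface normals fall out of the local geometry. The sketch below is a generic pinhole-camera illustration, not part of the Vision Banana API; the depth map and intrinsics (fx, fy, cx, cy) are placeholder inputs.
import numpy as np

def backproject_and_normals(depth_m, fx, fy, cx, cy):
    """Metric depth (meters) -> 3D points in the camera frame -> per-pixel surface normals."""
    H, W = depth_m.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Pinhole back-projection: pixel (u, v) at depth d maps to (x, y, z) in meters
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    points = np.stack([x, y, depth_m], axis=-1)            # (H, W, 3)
    # Normals from the cross product of local tangent vectors along the image axes
    du = np.gradient(points, axis=1)
    dv = np.gradient(points, axis=0)
    normals = np.cross(du, dv)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-8
    return points, normals

# Toy check: a flat wall 2 m away gives points at z = 2 and normals along the z axis
depth = np.full((240, 320), 2.0)
pts, nrm = backproject_and_normals(depth, fx=300.0, fy=300.0, cx=160.0, cy=120.0)
print(pts[120, 160], nrm[120, 160])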
Finally, image generation creates new images from given conditions. While the other four tasks analyze images, generation goes in the opposite direction. The fact that analysis and generation both work from the same weights is itself the core of what this model claims. Generative capability deepens analytical representations, and analytical representations improve generation quality. It's not a one-way relationship — it's mutual reinforcement.
All 5 of these fit into a single set of weights. Tasks that are completely different in nature are handled by a single model. Conventional wisdom said "using specialized models for each task is better." Vision Banana presented a counterexample to that wisdom. The single model produced better results than each specialized model.
Vision Banana 5-Task Overview
- Semantic Segmentation: Pixel-level category classification (person · car · sky · background)
- Instance Segmentation: Distinguishing individual instances within the same category (Person A vs Person B)
- Monocular Metric Depth: Measuring actual distance (m) with a single camera
- Surface Normal Estimation: Predicting the direction vector of surface orientation
- Image Generation: Conditional image synthesis
Vision Banana vs Existing Models — Task Support Comparison
| Model | Semantic Seg | Instance Seg | Metric Depth | Surface Normal | Image Gen | Single Weights |
|---|---|---|---|---|---|---|
| Vision Banana | O | O | O | O | O | O |
| Meta SAM 3 | O | O | X | X | X | X |
| Depth Anything V3 | X | X | O | X | X | X |
| Existing generation models | X | X | X | X | O | X |
It Beat Meta SAM 3 at Segmentation
Meta SAM 3 is one of the top-tier models in segmentation. It's widely used as a general-purpose segmentation model. Vision Banana outperformed SAM 3 in both semantic segmentation and instance segmentation. It didn't just win on specific benchmarks. It exceeded SAM 3 even on datasets not used in training. The key is that it beat a dedicated segmentation model under zero-shot conditions.
SAM 3 is a model designed exclusively for segmentation. Meta trained it on massive amounts of segmentation data. Yet Vision Banana, while simultaneously handling 5 tasks, surpassed the segmentation-only model. What this result says is one thing: image generation pretraining directly contributes to segmentation representation learning. General representations have started to replace specialized ones.
What specific benchmarks did it win? The paper reports that Vision Banana led on both ADE20K (the standard indoor/outdoor semantic segmentation dataset) and COCO Instance Segmentation. The same held for domains not included in training data. On industrial scene images and medical images — data SAM 3 hadn't seen much of — Vision Banana showed an advantage too. However, specific numbers must be verified directly in the paper and official repository.
In practical terms, this means: if there's a pipeline using SAM 3 for segmentation with a separate depth model connected to it, Vision Banana alone can replace both models. Segmentation performance matches or exceeds SAM 3, while the same model handles depth estimation too. Infrastructure simplification and performance retention happen at the same time.
It Surpassed Depth Anything V3 Too — Metric Depth Results
Depth Anything V3 has become the reference point for monocular depth estimation. It's widely referenced in monocular camera depth estimation research. Vision Banana dominated this model at metric depth estimation too. Metric depth doesn't measure relative depth — it measures actual physical-unit distances. Not "the object in front is closer than the one behind," but "the object in front is 1.3m, the one behind is 4.2m." The precision requirements are far higher.
Depth Anything V3 is also a depth-only model. A 5-task general-purpose model surpassed a dedicated depth model. The assumption that "being versatile means less expertise" didn't hold with Vision Banana — at least not based on benchmarks published so far. The paper reports that Vision Banana led on both NYUv2 (indoor depth benchmark) and KITTI (outdoor autonomous driving depth benchmark).
Why does metric depth performance matter for autonomous driving? Relative depth alone isn't enough. A car needs to know exactly how many meters it is from an obstacle ahead to determine braking timing. In vision systems that supplement or replace lidar, metric depth accuracy connects directly to safety. Vision Banana achieving SOTA at metric depth means its application potential has reached the level of autonomous driving.
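A rough stopping-distance calculation shows why absolute meters matter. The numbers below are illustrative assumptions, not model outputs or benchmark figures.
# Why relative depth is not enough: braking decisions need distance in meters.
# All values are assumptions for the example.
speed = 20.0           # vehicle speed, m/s (72 km/h)
reaction_time = 0.8    # seconds before braking starts
decel = 7.0            # braking deceleration, m/s^2

stopping_distance = speed * reaction_time + speed ** 2 / (2 * decel)
print(f"stopping distance ~ {stopping_distance:.1f} m")   # ~ 44.6 m

obstacle_distance = 38.0    # metric depth reading for the obstacle ahead, meters
print("brake now" if obstacle_distance <= stopping_distance else "clear")
A relative ordering such as "the obstacle is closer than the truck behind it" cannot answer that question; only an absolute distance can.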
Major Benchmark Performance Comparison Overview
| Task | Benchmark | Vision Banana | Previous SOTA | Result |
|---|---|---|---|---|
| Semantic Seg | ADE20K | SOTA | Meta SAM 3 | Vision Banana leads |
| Instance Seg | COCO | SOTA | Meta SAM 3 | Vision Banana leads |
| Metric Depth | NYUv2 / KITTI | SOTA | Depth Anything V3 | Vision Banana leads |
| Surface Normal | OASIS | SOTA | Previous specialized model | Vision Banana leads |
| Zero-Shot Transfer | Unseen datasets | SOTA-level | Specialized models | Vision Banana leads |
* Figures based on the paper. Exact metric values must be verified directly in the official paper's tables.
Image Generation Pretraining Was the Key — The Paper's Claim
The paper title says it all: "Image Generators are Generalist Vision Learners." It means a model trained to generate images learns visual representations broadly. This is where the LLM pretraining analogy comes in. GPT-series models are trained by predicting the next token. In that process, language structure, meaning, context, and reasoning ability are all naturally acquired. A single goal of generation leads to broad understanding of language.
The Vision Banana team claimed that image generation pretraining plays the same role in the vision domain. To generate an image, the model has to fully understand pixel structure, object boundaries, spatial depth, and surface orientation. That understanding overlaps precisely with the representations needed for segmentation, depth estimation, and surface normal prediction. Generative pretraining ends up teaching visual representations broadly as a byproduct.
Until now, vision pretraining went in two main directions. First, contrastive learning — matching image-text pairs in the style of CLIP. Second, masked autoencoding — masking and reconstructing parts of images in the style of MAE. Using image generation itself as a pretraining objective was a less-validated territory. Vision Banana filled that gap. It showed through benchmarks that generative pretraining works across segmentation, depth, and surface normals.
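The difference between these objectives can be shown schematically in a few lines. The sketch below contrasts a MAE-style masked-reconstruction loss with a diffusion-style denoising loss, one common form of generative pretraining; the toy model and pixel-level masking are stand-ins for illustration only, and this is not Vision Banana's actual training recipe.
import torch

def mae_style_loss(model, images, mask_ratio=0.75):
    """Masked autoencoding: hide most of the input, score reconstruction on the hidden part.
    (The real MAE masks patches; random pixels are used here to keep the sketch short.)"""
    mask = (torch.rand_like(images) < mask_ratio).float()
    recon = model(images * (1 - mask))
    return ((recon - images) ** 2 * mask).sum() / mask.sum()

def denoising_style_loss(model, images):
    """Generative pretraining, diffusion-style: corrupt with noise, predict the noise back."""
    noise = torch.randn_like(images)
    t = torch.rand(images.shape[0], 1, 1, 1)        # random corruption level per sample
    noisy = (1 - t) * images + t * noise
    return ((model(noisy) - noise) ** 2).mean()

# Dummy stand-ins so the sketch runs end to end
model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
images = torch.randn(4, 3, 64, 64)
print(mae_style_loss(model, images).item(), denoising_style_loss(model, images).item())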
If the claim in this paper is correct, the direction of vision model design changes going forward. The premise that "to do segmentation well, you need intensive training on segmentation data" is shaken. The alternative becomes: pretrain the model to generate images well, then switch to each task via instruction-tuning. Just as LLMs changed the entire NLP landscape, this could be recorded as the inflection-point paper where generative pretraining changes vision AI as a whole.
— "Image Generators are Generalist Vision Learners", Google DeepMind, April 2026
Zero-Shot Transfer — It Worked on Unseen Datasets Too
Zero-shot transfer is solving a problem never seen in training by applying underlying principles. It means achieving high performance on datasets not used in training, without additional fine-tuning. Vision Banana produced SOTA-level results on datasets it had never seen. Specialized models for each task fell behind under this condition too. Even when dedicated models were fine-tuned on domain data, they couldn't match Vision Banana's zero-shot performance.
Why does zero-shot performance matter in practice? Labeled data for every desired domain isn't always available. In domains where labeled data is scarce or collection costs are high — medical imaging, satellite photos, industrial footage — zero-shot performance determines whether a model is actually usable. Being able to use it immediately without fine-tuning data is itself a significant value in production.
The reason zero-shot transfer works well is explained here again by image generation pretraining. The representations the model learned through generative pretraining don't overfit to specific datasets — they remain generalized visual representations. When it sees images from a new domain, it can process tasks using those generalized representations. While specialized models do well only on "what they've seen a lot of," Vision Banana processes even unseen things by principle.
The practical approach is this: without domain-specific fine-tuning, test with Vision Banana first. If the results meet requirements, use it as-is. If performance is lacking, that's when fine-tuning with a small amount of domain data becomes the next step. The old flow — where fine-tuning data had to be secured upfront — reverses.
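In code, that decision rule is a small evaluation loop: run the model zero-shot on a few labeled samples from the target domain and compare the metric against the requirement before collecting any fine-tuning data. The sketch below uses the standard absolute-relative depth error, with dummy arrays standing in for model output and ground truth; the 10% threshold is a placeholder requirement.
import numpy as np

def abs_rel_error(pred_depth, gt_depth):
    """Standard monocular-depth metric: mean |pred - gt| / gt over valid pixels."""
    valid = gt_depth > 0
    return float(np.mean(np.abs(pred_depth[valid] - gt_depth[valid]) / gt_depth[valid]))

REQUIREMENT = 0.10    # placeholder acceptance threshold: abs-rel below 10%

# Dummy arrays stand in for the model's metric-depth output and the domain's ground truth
gt = np.random.uniform(0.5, 10.0, size=(240, 320))
pred = gt * np.random.normal(1.0, 0.05, size=gt.shape)

err = abs_rel_error(pred, gt)
print(f"zero-shot abs-rel: {err:.3f}")
print("use as-is" if err <= REQUIREMENT else "consider light domain fine-tuning")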
He Kaiming and Xie Saining — What This Combination Means
He Kaiming is the author of ResNet. The 2016 CVPR Best Paper changed the direction of deep learning image classification. A single structure — residual connections — made it possible to stack hundreds of layers, and became the default in vision architectures afterward. He later made significant contributions to vision pretraining paradigms with Masked Autoencoders (MAE). Pretraining method design is his core research area.
Xie Saining is known for vision architecture design research, most notably ResNeXt (the Aggregated Residual Transformations paper), work that achieved accuracy and efficiency at the same time through grouped convolutions. Handling the tradeoffs in architecture design is his strength. The research histories of both align precisely with the direction of this paper.
This isn't just name recognition. He Kaiming's experience designing pretraining paradigms and Xie Saining's know-how in architecture selection directly reflect in the methodological depth of the paper. The decision to adopt image generation as the pretraining objective, the decision to choose Nano Banana Pro as the base, the decision to use instruction-tuning as the integration method — all of these have these researchers' experience behind them. The combination of the Google DeepMind internal team with these two researchers produced the paper.
It's the most fitting author lineup for an attempt to change the vision pretraining paradigm. He Kaiming, who validated masked pretraining with MAE, switched direction this time to generative pretraining. That shift itself is a message. If this direction were wrong, these authors wouldn't have published the paper.
Try It Right Now — From Installation to Inference
Model weights were released alongside the paper. They're accessible on Hugging Face and the official GitHub repository. Basic inference is immediately possible in a Python environment. Dependency packages are PyTorch, Hugging Face transformers, and Pillow. If no GPU environment is available, running the demo on Hugging Face Spaces or Google Colab first is the fastest approach.
If a local GPU environment is available, install in the following order. Install PyTorch for CUDA 12.1 first, then add transformers and auxiliary packages. Loading the model in BF16 roughly halves GPU memory usage compared to FP32.
Installation and Setup
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# transformers + auxiliary packages
pip install transformers accelerate pillow huggingface_hub
# Pre-download model weights via Hugging Face CLI (optional)
huggingface-cli download google/vision-banana --local-dir ./vision-banana
On an A100 40GB, single-image inference in BF16 mode uses approximately 18GB VRAM. To test without an A100, check the Hugging Face Spaces demo first.
Once installation is complete, multi-task switching inference can be run immediately. Changing the task argument alone switches tasks without replacing the model. Load the model once and run multiple tasks on the same image in sequence.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load model — weights auto-downloaded on first run
processor = AutoProcessor.from_pretrained(
    "google/vision-banana",
    trust_remote_code=True
)
model = AutoModelForVision2Seq.from_pretrained(
    "google/vision-banana",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
).eval()

image = Image.open("sample.jpg").convert("RGB")

# Switch between the four analysis tasks with a single task argument — no model reload needed
TASKS = [
    "semantic_segmentation",
    "instance_segmentation",
    "metric_depth",
    "surface_normal",
]

results = {}
for task in TASKS:
    inputs = processor(
        images=image, task=task, return_tensors="pt"
    ).to(model.device, dtype=torch.bfloat16)
    with torch.no_grad():
        output = model(**inputs)
    results[task] = output.predictions
    print(f"[{task}] shape: {output.predictions.shape}")
Code Note
The code above is based on the expected API pattern from the official repository. Actual class names, argument names, and output structure may differ from the official repository README. Always check the official GitHub repository first.
How to Migrate from Existing Pipelines to Vision Banana
If there's a pipeline that used SAM 3 + Depth Anything V3 together, Vision Banana alone can replace it. Comparing the code before and after the switch makes the difference clear. The structure shifts from loading two models and running separate inference for each, to a single model load with task switching. GPU memory usage drops, and so does the maintenance cost of the code.
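Here is a minimal before/after sketch of that shift. The "before" side uses the generic transformers pipeline API with placeholder checkpoint IDs (the actual SAM 3 and Depth Anything V3 releases are not verified here); the "after" side reuses the assumed Vision Banana API from the inference example above. Treat both as illustrations, not drop-in production code.
import torch
from PIL import Image
from transformers import pipeline, AutoProcessor, AutoModelForVision2Seq

# --- Before: two specialized models, two loads, separate inference per task ---
# Checkpoint IDs are placeholders; substitute the actual SAM 3 / Depth Anything V3 releases.
seg_pipe = pipeline("image-segmentation", model="<sam-3-checkpoint>")
depth_pipe = pipeline("depth-estimation", model="<depth-anything-v3-checkpoint>")
masks = seg_pipe("frame.jpg")        # segmentation masks
depth = depth_pipe("frame.jpg")      # depth map

# --- After: one model load, task switching per call (assumed API, see code note above) ---
processor = AutoProcessor.from_pretrained("google/vision-banana", trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    "google/vision-banana", torch_dtype=torch.bfloat16,
    device_map="auto", trust_remote_code=True
).eval()

image = Image.open("frame.jpg").convert("RGB")
outputs = {}
for task in ["instance_segmentation", "metric_depth"]:
    inputs = processor(images=image, task=task, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs[task] = model(**inputs).predictions
The operational win is the single load: GPU memory is paid once, and switching tasks costs an argument, not a model swap.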