Google DeepMind Vision Banana — One Model Beats SAM 3 and Depth Anything V3 Across 5 Vision Tasks
Google DeepMind released Vision Banana in April 2026. A single set of weights handles 5 vision tasks and beats Meta SAM 3 and Depth Anything V3 simultaneously. With He Kaiming and Xie Saining on the paper, it makes the case that image-generation pretraining is the core of generalist vision representation learning.
On this page (10)
- Before Vision Banana — The Reality of Vision Pipelines
- Vision Banana — Built on Nano Banana Pro
- Five Tasks in One — How It Actually Works
- It Beat Meta SAM 3 at Segmentation
- It Surpassed Depth Anything V3 Too — Metric Depth Results
- Image Generation Pretraining Was the Key — The Paper's Claim
- Zero-Shot Transfer — It Worked on Unseen Datasets Too
- He Kaiming and Xie Saining — What This Combination Means
- Try It Right Now — From Installation to Inference
- How to Migrate from Existing Pipelines to Vision Banana
May 2026 · AI Trends
Google DeepMind Vision Banana Released — Image Generation Pretraining Lets a Single Model Beat Specialized Models Across All 5 Visual Tasks
This is what vision AI pipelines actually look like. One model for segmentation, one for depth estimation, one for image generation. Each task gets its own model. The more tasks in the pipeline, the more models pile up. GPU memory gets tight, inference latency accumulates, and API version compatibility becomes a headache.
Google DeepMind flipped that structure on April 22, 2026. They released Vision Banana. Five visual tasks packed into a single set of weights. It beat Meta SAM 3 at segmentation. It dominated Depth Anything V3 at metric depth. The paper "Image Generators are Generalist Vision Learners," with He Kaiming and Xie Saining involved, proved one claim with benchmarks.
Image generation pretraining becomes the key to general visual representation learning. Just as next-token prediction in LLMs enables learning language broadly, the process of generating images enables learning visual representations broadly. Vision Banana is the evidence. The conditions for a single model to simultaneously surpass specialized models have arrived in vision AI too.
Quick Summary — Vision Banana at a Glance
- Bottom line: A single model handles 5 tasks — semantic segmentation, instance segmentation, metric depth, surface normal, and image generation. Simultaneously outperforms Meta SAM 3 and Depth Anything V3
- Released: April 22, 2026 · Google DeepMind
- Base model: Nano Banana Pro (instruction-tuned)
- Key finding: Image generation pretraining = the same role as next-token prediction in LLMs (general visual representation learning)
- Zero-shot transfer: Achieves SOTA-level results even on datasets not used in training
- Paper: "Image Generators are Generalist Vision Learners" — He Kaiming, Xie Saining et al.
- Can I use it now: Weights are public. Accessible via the transformers library in an A100-class GPU environment
Before Vision Banana — The Reality of Vision Pipelines
Existing vision pipelines were built from model combinations. SAM-series models separated objects, Depth Anything measured distance, and separate generative models produced outputs. Three models in a pipeline means triple the memory. Triple the latency. Every time model versions update, combination compatibility has to be verified again. The moment operational overhead exceeds coding overhead arrives quickly.
Robotics teams feel this problem most directly. Real-time environment perception simultaneously requires segmentation, depth, and surface normals. Loading three models at once puts GPU memory under pressure from the start. When inference latency accumulates, it affects the real-time control loop. If a robot takes 200ms to recognize an obstacle, the robot has already moved toward it during that time.
Autonomous driving has the same structure. Lane recognition (segmentation), obstacle distance measurement (metric depth), and road gradient detection (surface normal) are all needed simultaneously for every frame. The fewer models involved, the better the inference speed and memory usage. The demand for "one model that does everything" was already coming from the field. Vision Banana is the answer to that demand.
Multimodal LLM development is no different. Attaching image understanding to a text LLM requires a vision encoder. Combining specialized encoders for each task makes the pipeline complex. If a single vision encoder can handle segmentation, depth, and normals all at once, the design becomes simple. Vision Banana emerged as a candidate backbone for exactly that.
Vision Banana — Built on Nano Banana Pro
Vision Banana is built on Nano Banana Pro. It's a vision foundation model developed by Google DeepMind. A foundation model isn't designed for a specific task. It's a general-purpose base pretrained on large-scale data that can be fine-tuned in various directions. Just as GPT-4o-class models serve as a base layer in LLMs, Nano Banana Pro handles the vision foundation layer.
Vision Banana was completed by applying instruction-tuning on top of Nano Banana Pro. Instruction-tuning isn't training a chef who only knows one dish — it's retraining to handle whatever order comes in. "Do semantic segmentation on this image." "Measure the metric depth of this scene." It becomes capable of receiving and executing these instructions. The result is a single model, but the breadth of tasks it can handle is completely different.
The single-weights point is the key. The old approach loaded separate models for each task. Vision Banana handles 5 tasks from one weights file. Load one model into memory and that's it. No need to swap models when switching tasks. Change one argument and it switches from segmentation to depth, from depth to surface normal.
Vision Banana Model Specs Summary
- Base model: Nano Banana Pro (Google DeepMind vision foundation)
- Fine-tuning approach: Instruction-tuning (5 tasks unified)
- Weight structure: Single weights file for switching between 5 tasks
- Pretraining method: Image generation-based (generative pretraining)
- Release method: Weights + code public (Hugging Face, GitHub)
- Inference requirements: A100-class GPU or better recommended (FP16/BF16 supported)
Five Tasks in One — How It Actually Works
The 5 tasks are completely different in nature. Semantic segmentation assigns a category to every pixel in the image. It converts a scene into a category map — "this pixel is a person," "that pixel is a car," "that region is sky." Instance segmentation goes a step further. Even within the same "person" category, it distinguishes "Person A" from "Person B" individually. This is the capability needed when tracking a specific individual in a crowd.
Monocular metric depth measures the actual distance to subjects in meters using just a single camera. It's like accurately judging distance with one eye and no measuring tape. It's a critical capability for robots or autonomous vehicles that need to understand 3D space with cameras alone, without lidar. Surface normal estimation calculates the direction vector of a 3D surface — which way it's facing. This information is essential for light reflection simulation, physically-based rendering, and 3D reconstruction.
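To make "metric" concrete: once depth comes in real meters and the camera intrinsics are known, every pixel can be back-projected to a 3D point, and surface normals fall out of the local geometry. The sketch below is a generic pinhole-camera illustration, not part of the Vision Banana API; the depth map and intrinsics (fx, fy, cx, cy) are placeholder inputs.
import numpy as np

def backproject_and_normals(depth_m, fx, fy, cx, cy):
    """Metric depth (meters) -> 3D points in the camera frame -> per-pixel surface normals."""
    H, W = depth_m.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Pinhole back-projection: pixel (u, v) at depth d maps to (x, y, z) in meters
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    points = np.stack([x, y, depth_m], axis=-1)            # (H, W, 3)
    # Normals from the cross product of local tangent vectors along the image axes
    du = np.gradient(points, axis=1)
    dv = np.gradient(points, axis=0)
    normals = np.cross(du, dv)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-8
    return points, normals

# Toy check: a flat wall 2 m away gives points at z = 2 and normals along the z axis
depth = np.full((240, 320), 2.0)
pts, nrm = backproject_and_normals(depth, fx=300.0, fy=300.0, cx=160.0, cy=120.0)
print(pts[120, 160], nrm[120, 160])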
Finally, image generation creates new images from given conditions. While the other four tasks analyze images, generation goes in the opposite direction. The fact that analysis and generation both work from the same weights is itself the core of what this model claims. Generative capability deepens analytical representations, and analytical representations improve generation quality. It's not a one-way relationship — it's mutual reinforcement.
All 5 of these fit into a single set of weights. Tasks that are completely different in nature are handled by a single model. Conventional wisdom said "using specialized models for each task is better." Vision Banana presented a counterexample to that wisdom. The single model produced better results than each specialized model.
Vision Banana 5-Task Overview
- Semantic Segmentation: Pixel-level category classification (person · car · sky · background)
- Instance Segmentation: Distinguishing individual instances within the same category (Person A vs Person B)
- Monocular Metric Depth: Measuring actual distance (m) with a single camera
- Surface Normal Estimation: Predicting the direction vector of surface orientation
- Image Generation: Conditional image synthesis
Vision Banana vs Existing Models — Task Support Comparison
| Model | Semantic Seg | Instance Seg | Metric Depth | Surface Normal | Image Gen | Single Weights |
|---|---|---|---|---|---|---|
| Vision Banana | O | O | O | O | O | O |
| Meta SAM 3 | O | O | X | X | X | X |
| Depth Anything V3 | X | X | O | X | X | X |
| Existing generation models | X | X | X | X | O | X |
It Beat Meta SAM 3 at Segmentation
Meta SAM 3 is one of the top-tier models in segmentation. It's widely used as a general-purpose segmentation model. Vision Banana outperformed SAM 3 in both semantic segmentation and instance segmentation. It didn't just win on specific benchmarks. It exceeded SAM 3 even on datasets not used in training. The key is that it beat a dedicated segmentation model under zero-shot conditions.
SAM 3 is a model designed exclusively for segmentation. Meta trained it on massive amounts of segmentation data. Yet Vision Banana, while simultaneously handling 5 tasks, surpassed the segmentation-only model. What this result says is one thing: image generation pretraining directly contributes to segmentation representation learning. General representations have started to replace specialized ones.
What specific benchmarks did it win? The paper reports that Vision Banana led on both ADE20K (the standard indoor/outdoor semantic segmentation dataset) and COCO Instance Segmentation. The same held for domains not included in training data. On industrial scene images and medical images — data SAM 3 hadn't seen much of — Vision Banana showed an advantage too. However, specific numbers must be verified directly in the paper and official repository.
In practical terms, this means: if there's a pipeline using SAM 3 for segmentation with a separate depth model connected to it, Vision Banana alone can replace both models. Segmentation performance matches or exceeds SAM 3, while the same model handles depth estimation too. Infrastructure simplification and performance retention happen at the same time.
It Surpassed Depth Anything V3 Too — Metric Depth Results
Depth Anything V3 has become the reference point for monocular depth estimation. It's widely referenced in monocular camera depth estimation research. Vision Banana dominated this model at metric depth estimation too. Metric depth doesn't measure relative depth — it measures actual physical-unit distances. Not "the object in front is closer than the one behind," but "the object in front is 1.3m, the one behind is 4.2m." The precision requirements are far higher.
Depth Anything V3 is also a depth-only model. A 5-task general-purpose model surpassed a dedicated depth model. The assumption that "being versatile means less expertise" didn't hold with Vision Banana — at least not based on benchmarks published so far. The paper reports that Vision Banana led on both NYUv2 (indoor depth benchmark) and KITTI (outdoor autonomous driving depth benchmark).
Why does metric depth performance matter for autonomous driving? Relative depth alone isn't enough. A car needs to know exactly how many meters it is from an obstacle ahead to determine braking timing. In vision systems that supplement or replace lidar, metric depth accuracy connects directly to safety. Vision Banana achieving SOTA at metric depth means its application potential has reached the level of autonomous driving.
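A rough stopping-distance calculation shows why absolute meters matter. The numbers below are illustrative assumptions, not model outputs or benchmark figures.
# Why relative depth is not enough: braking decisions need distance in meters.
# All values are assumptions for the example.
speed = 20.0           # vehicle speed, m/s (72 km/h)
reaction_time = 0.8    # seconds before braking starts
decel = 7.0            # braking deceleration, m/s^2

stopping_distance = speed * reaction_time + speed ** 2 / (2 * decel)
print(f"stopping distance ~ {stopping_distance:.1f} m")   # ~ 44.6 m

obstacle_distance = 38.0    # metric depth reading for the obstacle ahead, meters
print("brake now" if obstacle_distance <= stopping_distance else "clear")
A relative ordering such as "the obstacle is closer than the truck behind it" cannot answer that question; only an absolute distance can.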
Major Benchmark Performance Comparison Overview
| Task | Benchmark | Vision Banana | Previous SOTA | Result |
|---|---|---|---|---|
| Semantic Seg | ADE20K | SOTA | Meta SAM 3 | Vision Banana leads |
| Instance Seg | COCO | SOTA | Meta SAM 3 | Vision Banana leads |
| Metric Depth | NYUv2 / KITTI | SOTA | Depth Anything V3 | Vision Banana leads |
| Surface Normal | OASIS | SOTA | Previous specialized model | Vision Banana leads |
| Zero-Shot Transfer | Unseen datasets | SOTA-level | Specialized models | Vision Banana leads |
* Figures based on the paper. Exact metric values must be verified directly in the official paper's tables.
Image Generation Pretraining Was the Key — The Paper's Claim
The paper title says it all: "Image Generators are Generalist Vision Learners." It means a model trained to generate images learns visual representations broadly. This is where the LLM pretraining analogy comes in. GPT-series models are trained by predicting the next token. In that process, language structure, meaning, context, and reasoning ability are all naturally acquired. A single goal of generation leads to broad understanding of language.
The Vision Banana team claimed that image generation pretraining plays the same role in the vision domain. To generate an image, the model has to fully understand pixel structure, object boundaries, spatial depth, and surface orientation. That understanding overlaps precisely with the representations needed for segmentation, depth estimation, and surface normal prediction. Generative pretraining ends up teaching visual representations broadly as a byproduct.
Until now, vision pretraining went in two main directions. First, contrastive learning — matching image-text pairs in the style of CLIP. Second, masked autoencoding — masking and reconstructing parts of images in the style of MAE. Using image generation itself as a pretraining objective was a less-validated territory. Vision Banana filled that gap. It showed through benchmarks that generative pretraining works across segmentation, depth, and surface normals.
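The difference between these objectives can be shown schematically in a few lines. The sketch below contrasts a MAE-style masked-reconstruction loss with a diffusion-style denoising loss, one common form of generative pretraining; the toy model and pixel-level masking are stand-ins for illustration only, and this is not Vision Banana's actual training recipe.
import torch

def mae_style_loss(model, images, mask_ratio=0.75):
    """Masked autoencoding: hide most of the input, score reconstruction on the hidden part.
    (The real MAE masks patches; random pixels are used here to keep the sketch short.)"""
    mask = (torch.rand_like(images) < mask_ratio).float()
    recon = model(images * (1 - mask))
    return ((recon - images) ** 2 * mask).sum() / mask.sum()

def denoising_style_loss(model, images):
    """Generative pretraining, diffusion-style: corrupt with noise, predict the noise back."""
    noise = torch.randn_like(images)
    t = torch.rand(images.shape[0], 1, 1, 1)        # random corruption level per sample
    noisy = (1 - t) * images + t * noise
    return ((model(noisy) - noise) ** 2).mean()

# Dummy stand-ins so the sketch runs end to end
model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
images = torch.randn(4, 3, 64, 64)
print(mae_style_loss(model, images).item(), denoising_style_loss(model, images).item())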
If the claim in this paper is correct, the direction of vision model design changes going forward. The premise that "to do segmentation well, you need intensive training on segmentation data" is shaken. The alternative becomes: pretrain the model to generate images well, then switch to each task via instruction-tuning. Just as LLMs changed the entire NLP landscape, this could be recorded as the inflection-point paper where generative pretraining changes vision AI as a whole.
— "Image Generators are Generalist Vision Learners", Google DeepMind, April 2026
Zero-Shot Transfer — It Worked on Unseen Datasets Too
Zero-shot transfer is solving a problem never seen in training by applying underlying principles. It means achieving high performance on datasets not used in training, without additional fine-tuning. Vision Banana produced SOTA-level results on datasets it had never seen. Specialized models for each task fell behind under this condition too. Even when dedicated models were fine-tuned on domain data, they couldn't match Vision Banana's zero-shot performance.
Why does zero-shot performance matter in practice? Labeled data for every desired domain isn't always available. In domains where labeled data is scarce or collection costs are high — medical imaging, satellite photos, industrial footage — zero-shot performance determines whether a model is actually usable. Being able to use it immediately without fine-tuning data is itself a significant value in production.
The reason zero-shot transfer works well is explained here again by image generation pretraining. The representations the model learned through generative pretraining don't overfit to specific datasets — they remain generalized visual representations. When it sees images from a new domain, it can process tasks using those generalized representations. While specialized models do well only on "what they've seen a lot of," Vision Banana processes even unseen things by principle.
The practical approach is this: without domain-specific fine-tuning, test with Vision Banana first. If the results meet requirements, use it as-is. If performance is lacking, that's when fine-tuning with a small amount of domain data becomes the next step. The old flow — where fine-tuning data had to be secured upfront — reverses.
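In code, that decision rule is a small evaluation loop: run the model zero-shot on a few labeled samples from the target domain and compare the metric against the requirement before collecting any fine-tuning data. The sketch below uses the standard absolute-relative depth error, with dummy arrays standing in for model output and ground truth; the 10% threshold is a placeholder requirement.
import numpy as np

def abs_rel_error(pred_depth, gt_depth):
    """Standard monocular-depth metric: mean |pred - gt| / gt over valid pixels."""
    valid = gt_depth > 0
    return float(np.mean(np.abs(pred_depth[valid] - gt_depth[valid]) / gt_depth[valid]))

REQUIREMENT = 0.10    # placeholder acceptance threshold: abs-rel below 10%

# Dummy arrays stand in for the model's metric-depth output and the domain's ground truth
gt = np.random.uniform(0.5, 10.0, size=(240, 320))
pred = gt * np.random.normal(1.0, 0.05, size=gt.shape)

err = abs_rel_error(pred, gt)
print(f"zero-shot abs-rel: {err:.3f}")
print("use as-is" if err <= REQUIREMENT else "consider light domain fine-tuning")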
He Kaiming and Xie Saining — What This Combination Means
He Kaiming is the author of ResNet. The 2016 CVPR Best Paper changed the direction of deep learning image classification. A single structure — residual connections — made it possible to stack hundreds of layers, and became the default in vision architectures afterward. He later made significant contributions to vision pretraining paradigms with Masked Autoencoders (MAE). Pretraining method design is his core research area.
Xie Saining is known for vision architecture design research, most notably ResNeXt (the Aggregated Residual Transformations paper), work that achieved accuracy and efficiency at the same time through grouped convolutions. Handling the tradeoffs in architecture design is his strength. The research histories of both align precisely with the direction of this paper.
This isn't just name recognition. He Kaiming's experience designing pretraining paradigms and Xie Saining's know-how in architecture selection directly reflect in the methodological depth of the paper. The decision to adopt image generation as the pretraining objective, the decision to choose Nano Banana Pro as the base, the decision to use instruction-tuning as the integration method — all of these have these researchers' experience behind them. The combination of the Google DeepMind internal team with these two researchers produced the paper.
It's the most fitting author lineup for an attempt to change the vision pretraining paradigm. He Kaiming, who validated masked pretraining with MAE, switched direction this time to generative pretraining. That shift itself is a message. If this direction were wrong, these authors wouldn't have published the paper.
Try It Right Now — From Installation to Inference
Model weights were released alongside the paper. They're accessible on Hugging Face and the official GitHub repository. Basic inference is immediately possible in a Python environment. Dependency packages are PyTorch, Hugging Face transformers, and Pillow. If no GPU environment is available, running the demo on Hugging Face Spaces or Google Colab first is the fastest approach.
If a local GPU environment is available, install in the following order. Install PyTorch for CUDA 12.1 first, then add transformers and auxiliary packages. Loading the model in BF16 roughly halves GPU memory usage compared to FP32.
Installation and Setup
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# transformers + auxiliary packages
pip install transformers accelerate pillow huggingface_hub
# Pre-download model weights via Hugging Face CLI (optional)
huggingface-cli download google/vision-banana --local-dir ./vision-banana
On an A100 40GB, single-image inference in BF16 mode uses approximately 18GB VRAM. To test without an A100, check the Hugging Face Spaces demo first.
Once installation is complete, multi-task switching inference can be run immediately. Changing the task argument alone switches tasks without replacing the model. Load the model once and run multiple tasks on the same image in sequence.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load model — weights auto-downloaded on first run
processor = AutoProcessor.from_pretrained(
    "google/vision-banana",
    trust_remote_code=True
)
model = AutoModelForVision2Seq.from_pretrained(
    "google/vision-banana",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
).eval()

image = Image.open("sample.jpg").convert("RGB")

# Switch between the four analysis tasks with a single task argument — no model reload needed
TASKS = [
    "semantic_segmentation",
    "instance_segmentation",
    "metric_depth",
    "surface_normal",
]

results = {}
for task in TASKS:
    inputs = processor(
        images=image, task=task, return_tensors="pt"
    ).to(model.device, dtype=torch.bfloat16)
    with torch.no_grad():
        output = model(**inputs)
    results[task] = output.predictions
    print(f"[{task}] shape: {output.predictions.shape}")
Code Note
The code above is based on the expected API pattern from the official repository. Actual class names, argument names, and output structure may differ from the official repository README. Always check the official GitHub repository first.
How to Migrate from Existing Pipelines to Vision Banana
If there's a pipeline that used SAM 3 + Depth Anything V3 together, Vision Banana alone can replace it. Comparing the code before and after the switch makes the difference clear. The structure shifts from loading two models and running separate inference for each, to a single model load with task switching. GPU memory usage drops, and so does the maintenance cost of the code.
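Here is a minimal before/after sketch of that shift. The "before" side uses the generic transformers pipeline API with placeholder checkpoint IDs (the actual SAM 3 and Depth Anything V3 releases are not verified here); the "after" side reuses the assumed Vision Banana API from the inference example above. Treat both as illustrations, not drop-in production code.
import torch
from PIL import Image
from transformers import pipeline, AutoProcessor, AutoModelForVision2Seq

# --- Before: two specialized models, two loads, separate inference per task ---
# Checkpoint IDs are placeholders; substitute the actual SAM 3 / Depth Anything V3 releases.
seg_pipe = pipeline("image-segmentation", model="<sam-3-checkpoint>")
depth_pipe = pipeline("depth-estimation", model="<depth-anything-v3-checkpoint>")
masks = seg_pipe("frame.jpg")        # segmentation masks
depth = depth_pipe("frame.jpg")      # depth map

# --- After: one model load, task switching per call (assumed API, see code note above) ---
processor = AutoProcessor.from_pretrained("google/vision-banana", trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    "google/vision-banana", torch_dtype=torch.bfloat16,
    device_map="auto", trust_remote_code=True
).eval()

image = Image.open("frame.jpg").convert("RGB")
outputs = {}
for task in ["instance_segmentation", "metric_depth"]:
    inputs = processor(images=image, task=task, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs[task] = model(**inputs).predictions
The operational win is the single load: GPU memory is paid once, and switching tasks costs an argument, not a model swap.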