FWIW, there are many more alternatives with better licenses today. Here is a good meta repo for object detection with different model variants:
https://github.com/LibreYOLO/libreyolo
It’s a big improvement if you’re already paying them but, given their aggressive approach to licensing, I can’t imagine why anyone would choose to use an Ultralytics model on a new project in 2026. You’re just asking to be shaken down and have to pay off a large bill down the line.
RF-DETR is both faster and more accurate and truly open source with an Apache 2.0 license: https://github.com/roboflow/rf-detr
Full disclosure: I’m one of the co-founders of Roboflow (we made RF-DETR, wrote this blog post, and are a sub-licensor of Ultralytics’ models.)
“RF-DETR is both faster and more accurate and truly open source with an Apache 2.0 license”
Misleading marketing statement.
The catch is that for image resolutions >= 700x700 pixels (most production use cases), the Roboflow license is actually PML 1.0 instead of Apache 2.0: https://github.com/roboflow/rf-detr#license
That may be true for legacy CNNs but very few production use-cases require such a large resolution with DETRs. The latency scales quadratically with the resolution.
Regardless, you can do whatever resolution you want with the Apache 2.0 model. Just change the config at runtime; it was trained to be resolution agnostic.
You are correct that we also released larger models with a larger backbone under a different, non open-source license.
> The catch is that for image resolutions >= 700x700 pixels (most production use cases)
Citation needed? 2XL looks like you go up to 800x800 pixel inputs. This isn't the dealbreaker you say it is - all pipelines benefit from thoughtful crop and rescaling before going to inference.
See the url in my comment (search for the term rfdetr-2xlarge). 2XL does indeed go up to 800x800 and has PML1.0 license instead of apache 2.0.
Rescaling is fine for some purposes but not for all. For many domain-specific (often less common and odd dimensioned) objects, downscaling will severely reduce recall. There is a reason that Roboflow slaps a license that is not open source on those specific architectures.
In some cases tiled inference (for example with https://github.com/obss/sahi) might do the job.
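To make the tiling idea concrete, here is a minimal sketch in plain Python. It is not SAHI's API; the function names (`tile_origins`, `tiles`, `to_full_image`) and the 640-pixel tile with 128-pixel overlap are assumptions chosen for illustration. It only shows how overlapping tiles can be laid out and how a box detected inside a tile maps back to full-image coordinates.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def tile_origins(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets of tiles that cover [0, length) with the given overlap."""
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the far edge
    return starts


def tiles(width: int, height: int, tile: int = 640,
          overlap: int = 128) -> List[Tuple[int, int, int, int]]:
    """Tile rectangles (x1, y1, x2, y2) in full-image pixel coordinates."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]


def to_full_image(box: Box, tile_xy: Tuple[int, int]) -> Box:
    """Shift a box from tile-local coordinates into full-image coordinates."""
    ox, oy = tile_xy
    x1, y1, x2, y2 = box
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)


if __name__ == "__main__":
    for rect in tiles(1920, 1080):
        print(rect)
```

Each tile would be cropped and run through the detector at its native resolution, so small objects are not downscaled away. Objects in the overlap regions get detected more than once, so the shifted boxes still need a merge step such as non-maximum suppression over the full image.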
> See the url in my comment (search for the term rfdetr-2xlarge). 2XL does indeed go up to 800x800 and has PML1.0 license instead of apache 2.0.
All of the models, including the Apache 2.0 ones, can be configured to go higher than 800x800. The difference between the ones with the PML license and the Apache 2.0 ones is the backbone, not the resolution.
I'd suggest you read the ICLR paper[1] which shows clearly the difference between the backbones at various latencies in Figure 1.
> For many domain-specific (often less common and odd dimensioned) objects, downscaling will severely reduce recall.
We released an entire paper[2] at NeurIPS about the long-tail transferability of models across a multitude of domains and benchmarked RF-DETR against that benchmark. The Apache 2.0 model is Pareto optimal over the larger PML model at latencies less than the XL size.
(I'm one of the co-founders of Roboflow and worked on RF-DETR and RF100-VL.)
[1] https://arxiv.org/abs/2511.09554
[2] https://arxiv.org/abs/2505.20612
We've been running YOLO for a number of years (since v5) on soccer videos. None of the recent iterations have been significantly better, with v26 scoring worse than v9 and v11 on our tasks. Makes me wonder why this version is being pushed by Roboflow and Ultralytics.
When I was working with YOLO models it did seem like there was little practical improvement between all of the spinoff models. It seemed people were pushing new models for personal recognition since the original creator stopped working on it.
That said, many of the claimed improvements in this model are efficiency related.
Can't speak for 26, but a year ago I worked on a project that migrated from v5 to 11 because of improved image segmentation capabilities. My understanding is that the newer versions don't necessarily have better precision/recall, but they tend to be faster for equivalent results, and have increased capabilities.
What I find cool is not the model in itself, but the architectures / training methods found that make the model better. They open up new possibilities for other fields of AI. (Notably if you want to fine-tune other CV models.)
The original YOLO author quit long ago for ethical reasons.
Despite having a very memorable paper on the topic I believe they now work at Ai2.
Was evaluating YOLO26 within the last month for its on-device (iPhone 16 Pro) segmentation capabilities. It's decent, but its biggest limitation is that it's only trained on 80 COCO classes (meaning pre-labeled images). If whatever is in your images isn't in the 80 classes, it's invisible to YOLO26.
Conversely I have SAM2 running on-device and it's my current workhorse. The biggest benefit with SAM2 for me is that it does fine-grained segmentation masks but isn't trained on labeled images. This was a specific requirement for the app I'm building. SAM2 isn't anywhere near as speedy as the native Vision framework APIs, but it is more capable across a vastly wider array of potential image targets.
I would prefer GroundingDINO, which is a sort of SAM and DINO combo that does open vocabulary.
Doesn't work for my use-case. GroundingDINO is a text-to-bounding-box model. SAM2 supports coordinate-based masks (user taps or clicks somewhere in an image), which is what my research app needs.
My buddy has some vision impairments, and I remember training a much older YOLO model to detect objects/enemies in Terraria for him. It worked very well.
I then tried training it on a lot of sample images from a 3D point & shoot game, and was quite disappointed in how it performed.
Has anyone else experimented with it recently? How well does it work as a base model for training custom classifiers? And with hardware growth in the last ~5 years, is it suitable to run in parallel with games which are graphically intensive?
Is the license for this AGPL? Can someone please confirm?
Yes, AGPL-3.0.
I found that while CLIPSeg is slower than YOLOn, it is still pretty fast and it gave me much, much better results without training.
If you want to detect objects and speed is important so you can’t use an LLM architecture, you can give it a try too.
Reminder that Ultralytics is pushing AGPL in a very overreaching way with their models, which is why they are not available in Frigate:
https://github.com/blakeblackshear/frigate/pull/10717
Wow I'm old, I still remember working with YOLOv2.
One thing I don’t get is why the article is credited to ‘Contributing Author’.
Meanwhile their very own Peter Skalski already does a super job with write-ups and examples for all sorts of YOLO models and is well respected.
I'm sure the model is capable, but I find it funny that the sample image that contains three bears gets detected as two elephants.
It’s an accurate representation of the model capabilities in my experience.
With some previous versions of YOLO I've found pages that run it in real time locally in your browser, analyzing the webcam.
Is there a demo like that available for YOLO26?
https://github.com/computer-vision-with-marco/realtime-detec...
Thanks. Is the result deployed anywhere?
I've used YOLO26 in one of my projects. It was very easy to train on our custom dataset and also very easy to deploy, even on Rust with AVX2 support. This model is indeed fast and can be used for almost real-time inference.
Just a reminder that RF-DETR is better than YOLO26.
Can it measure the speed of a car in a video?
Same question, same answer: In pixels/second? Sure!
What are you trying to accomplish by those questions? Are you genuinely asking, or just baiting? If the former, didn't the answers to your previous question make it clear that your question makes less sense than you might assume?
It's a serious question. I took a few hours to try this range of models to do this task. Online guides recommend calibrating the size of the image with a "horizon" line and its real physical size. That's quite complicated.
I wish new models coupled with an LLM would be capable of estimating the size of features on the map, e.g. the size of the car in meters, to be able to derive the speed with a world understanding. But I have found no resource doing this.
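For what it's worth, the simplest version of the calibration is just a scale factor. Below is a rough sketch, assuming a side-on camera where the car moves parallel to the image plane, so one meters-per-pixel value holds along the whole track. The function names and the 4.5 m reference car length are made up for illustration; a real setup with perspective would need the horizon-line or homography calibration the guides describe.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) box centre in pixels


def meters_per_pixel(reference_length_m: float,
                     reference_length_px: float) -> float:
    """Scale factor from an object of known real size measured in the image."""
    if reference_length_px <= 0:
        raise ValueError("reference_length_px must be positive")
    return reference_length_m / reference_length_px


def speed_kmh(track: List[Point], fps: float, m_per_px: float) -> float:
    """Average speed over a track of box centres from consecutive frames."""
    if len(track) < 2:
        raise ValueError("need at least two positions")
    # Sum the pixel distance travelled between consecutive frames.
    dist_px = sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(track, track[1:])
    )
    seconds = (len(track) - 1) / fps
    return dist_px * m_per_px / seconds * 3.6  # m/s -> km/h


if __name__ == "__main__":
    # Hypothetical numbers: a 4.5 m car spans 90 px in the frame.
    scale = meters_per_pixel(4.5, 90.0)
    centres = [(100.0, 200.0), (110.0, 200.0), (120.0, 200.0), (130.0, 200.0)]
    print(round(speed_kmh(centres, fps=30.0, m_per_px=scale), 1))
```

The detector only supplies the box centres per frame; everything that turns pixels into meters comes from the reference measurement, which is why the answer without calibration is "pixels/second".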
Global-shutter cameras are fast and expensive, while Doppler radar modules are robust and under $30 these days.
Running machine-vision outside in the Sun or Weather can get tricky. There is also a limited supply of BS a firm can shovel before some bystander ends up dead. =3
https://www.bbc.co.uk/news/articles/c07yp02mxyjo
I am curious why there is no desire to produce a paper showcasing key details.
Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models:
https://arxiv.org/abs/2606.03748