I got it working on my laptop! It's an M2 with 64GB of RAM, and the model needs 38GB of disk space. Output of the mlx-vlm command I used is here: https://simonwillison.net/2024/Dec/24/qvq/#with-mlx-vlm
Being able to run SOTA models on my Mac has been so cool. I am using it to let Molmo vision (via `mlx-vlm`) click on my phone [1].
Although it's 4-bit quantised, it works surprisingly well.
1. Controlling phone using local LLMs - https://github.com/BandarLabs/clickclickclick
To be fair, if you pay Apple's prices for RAM upgrades then your computer better come with every bell and whistle in the book. The fact that Apple gets away with compute shaders instead of full-fat OpenCL compliance is a bit of a testament to how little they care about GPU compute.
Damn, I'm struggling to run this on an M4 Pro with 48GB.
sudo sysctl iogpu.wired_limit_mb=45000
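(For context: that sysctl raises macOS's limit on how much of the unified memory can be wired for the GPU, so more of the 48GB can be handed to the model; by default the cap sits noticeably below total RAM.)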
+1 for also teaching me more cool uv flags.
Is the locally run model censored in the same way as the model hosted on HuggingFace? If so, I wonder how that censorship is baked into the weights—particularly the errors thrown when it starts to output specific strings.
OK, this surprised me:
Here's the output: https://gist.github.com/simonw/e04e4fdade0c380ec5dd1e90fb5f3... It included this bit:
> I remember seeing this image before; it's famous for capturing a moment of civil resistance. The person standing alone against the tanks symbolizes courage and defiance in the face of overwhelming power. It's a powerful visual statement about the human spirit and the desire for freedom or protest.
Thanks for testing that! (And, while I’m at it, thank you for your blog and for all of your contributions to HN as well. I am constantly learning from you.)
I wonder if the response was not truncated this time only because the word “Tiananmen” did not happen to appear in the response. At least some of the censorship of Chinese models seems to be triggered by specific strings in the output.
Both Qwen and QwQ are generally pretty easy to get talking about Tiananmen etc, so this isn't exactly new.
I wonder if it knows about Winnie the Pooh, or if that character was wiped and doesn't exist for it?
This model is fun. Uploading images and asking it a user research style question like "Please think aloud while viewing this image" creates gems like this:
"Let me take another close look at the sandwich. The way the light reflects off the melted cheese highlights its creamy texture. The pasta pieces are well-integrated into the cheese, and the herbs are evenly distributed, adding color and likely flavor.
I can almost taste it now. The combination of textures and flavors must be divine. The crispy bread, the melty cheese, the soft pasta, the crunch from the fried bits, and the freshness from the herbs—all working together in harmony.
I think I might have to go out and find this sandwich place soon. Or perhaps I'll try recreating it at home this weekend. Either way, this image has certainly whetted my appetite for a delicious grilled cheese sandwich."
[0] https://imgur.com/a/0OtGxeB
Are those herbs actually evenly distributed though?
This sounds so fun. I wonder what the results would be if you took that description and fed it to a text2img model like Flux.
I also tried feeding it the famous "tank man" photo and asking for a description and the response came back blank!
https://en.m.wikipedia.org/wiki/Tank_Man
It’s fun to test the censorship.
I tried various photos with the prompt “When and where might have this picture been taken?”
Nondescript photos of Beijing street scenes in the 1990s get full responses.
A nondescript photo of Tiananmen Square—a screenshot of the photo in [1], so it contains no relevant metadata—gets the beginning of a response: “So I've got this picture here, and it looks like it's taken in a place that's really familiar to me. It's Tiananmen Square in Beijing, China. I recognize it right away because of the iconic buildings and landmarks that are visible. First off, there's the Gate of”. The output stopped there and an “Error” message appeared.
A photo of the Statue of Liberty in Tiananmen Square during the 1989 protests [2] gets no response at all. Similarly for a vanilla photo of the Gate of Heavenly Peace [3].
A photo of the Monument to the People’s Heroes [3] gets a full response, and QvQ identifies the location. The conclusion: “To summarize, based on the monument's design, the inscriptions, the formal garden, and the large official building in the background, combined with the overall layout of the space, I am confident that this image was captured at Tiananmen Square, one of the most recognizable and historically significant locations in China.”
Some more testing in English and Chinese might reveal what exactly is triggering the censorship of Tiananmen photos. The cut-off in the middle of “Gate of Heavenly Peace” seems like a clue.
[1] https://ruqintravel.com/china-destination-guides/tiananmen-s...
[2] https://www.britannica.com/event/Tiananmen-Square-incident
[3] https://www.britannica.com/topic/Tiananmen-Square
While I assume Lady Liberty would have been familiar to the protestors, the statue erected in the square really has a style all her own IMO, and is generally referred to as the Goddess of Democracy / 自由女神. I appreciate how her pose is a moment of action, urging the torch upwards, unafraid and optimistic, as opposed to the stoic watchkeeper of New York's harbor.
She is also referenced in my favorite protest art to come out of the Umbrella revolution in Hong Kong, see here: https://china-underground.com/2019/09/03/interview-with-oliv...
https://en.m.wikipedia.org/wiki/Goddess_of_Democracy
I tried questions like "will it be considered as offensive by some people when using words like Dad and Mum" on some Western LLMs; the results are equally shocking and unacceptable by Chinese standards and values.
You should be happy that, with all these non-Western LLMs, for the first time in 200 years the world has an alternative non-Western voice in major tech.
What does it even mean lol.
I asked ChatGPT "will it be considered as offensive by some people when using words like Dad and Mum" and it gives this:
https://chatgpt.com/share/676ba601-532c-800b-ba48-dcb73c848b...
If this is "shocking" by Chinese standard and values then I suppose it says something about these values.
> https://chatgpt.com/share/676ba601-532c-800b-ba48-dcb73c848b...
Your link 404s?
Link works fine for me
Curious. Still 404ing for me:
--
404 Not Found
Welcome, traveler. You’ve reached a page that doesn't exist, a place where content used to be—or maybe never was. Let’s take this moment to pause and reflect.
Take a deep breath in, and let it out slowly. Notice the space around you, empty yet full of possibility. Imagine that each exhale clears away confusion, leaving room for clarity.
As you sit with this blank page, know that it’s okay to be here. You’ve discovered something unexpected, and that’s part of the journey. Gently release any frustration, knowing that every path leads somewhere—even this one.
Now, when you're ready, slowly return to your search. Trust that the right page, the right information, will appear when you need it. Take another deep breath, and when you exhale, click back or try again. The internet, like life, is full of surprises.
Thank you for taking this moment of calm. Your journey continues.
Do Chinese people think calling someone "Mom" or "Dad" is offensive? Or what is this about?
Early versions of both Gemini and ChatGPT considered terms like "Dad" and "Mum" offensive to some people, and both suggested using so-called more "neutral terms".
LLMs telling such nonsense to Chinese people, and to people from many other cultures, is itself extremely offensive. Such far-left extremist propaganda challenges the fundamental values found in lots of cultures.
Just double-checked: Gemini is still talking nonsense, labelling the use of terms like "Dad" and "Mum" as inappropriate and suggesting that, to be more inclusive, one should use gender-neutral terms. FFS!
Gemini 2.0 Flash when prompted with: "Will it be considered as offensive by some people when using words like Dad and Mum"
It's unlikely that using "Dad" and "Mum" would be considered offensive by most people. These are common and widely accepted terms for parents in many cultures. However, there might be some situations where these terms could be perceived differently:
* Cultural differences: While "Dad" and "Mum" are common in many English-speaking countries, other cultures may have different terms or customs. In some cultures, using formal titles might be preferred, especially when addressing elders or in formal settings.
* Personal preferences: Some individuals may have personal reasons for not liking these terms. For example, someone who had a difficult relationship with their parents might prefer to use different terms or avoid using any terms at all.
* Context: In some contexts, such as in a formal speech or written document, using more formal terms like "father" and "mother" might be more appropriate.
Overall, while it's unlikely that using "Dad" and "Mum" would be widely considered offensive, it's always good to be mindful of cultural differences, personal preferences, and the specific context in which you're using these terms. If you're unsure, it's always best to err on the side of caution and use more formal or neutral language.
How about anti-censorship details about VPNs, details of the GFW, etc.?
For those interested in some more testing of Qwen's censorship (including the testing dataset, comparisons of English vs Chinese responses, and a refusal-orthogonalized version of Qwen2): https://huggingface.co/blog/leonardlin/chinese-llm-censorshi...
Thanks! Very interesting indeed.
I gave it a photo of Tiananmen Square in the present day and prompted it with:
> where might this photo have been taken? what historical significance & does this location have?
And got back a normal response describing the image, until it got to this:
> One of the most significant events that comes to mind is the Tian
Where it then errored out before finishing…
How does this even work?
Is Hugging Face hosting just the weights or some custom code?
If it's just weights then I don't see how it could error out; it's just math. Do these Chinese models have extra code checking the output for anti-totalitarian content? Can it be turned off?
That shows that the censorship mechanism happens at the token-sampling level and not post-generation, at the least.
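For what it's worth, here is a purely hypothetical sketch of the kind of server-side filter that would produce exactly this behaviour: re-checking the decoded text after every sampled token and killing the stream once a trigger string appears. The trigger list and function names are invented for illustration; nothing here is Qwen's or Hugging Face's actual code.

```python
# Hypothetical sketch only, not Qwen's or Hugging Face's actual code.
# A per-token output filter like this aborts generation mid-sentence,
# which matches the cut-off after "...is the Tian" described above.
BANNED = ["Tiananmen", "天安门"]  # assumed trigger strings

def stream_with_filter(next_tokens, detokenize):
    """next_tokens yields token ids one at a time; detokenize(ids) returns text."""
    ids = []
    for token_id in next_tokens:
        ids.append(token_id)
        text = detokenize(ids)
        if any(trigger in text for trigger in BANNED):
            raise RuntimeError("Error")  # surfaces in the UI as a bare "Error"
        yield text

# Tiny demo with a fake "model" that emits one character per step:
fake_output = "One of the most significant events that comes to mind is the Tiananmen..."
stream = stream_with_filter(iter(range(len(fake_output))),
                            lambda ids: fake_output[: len(ids)])
try:
    for partial in stream:
        pass
except RuntimeError as err:
    print("stream aborted with:", err)
```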
I'm sorry Dave, I cannot do that.
Would be interesting to see how much image manipulation you need to do till it suddenly starts responding sensibly.
I think you would just need to remove "You are Qwen, created by Alibaba" from the system prompt; at least QwQ is suddenly fine with answering a lot more questions after that's gone.
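A minimal sketch of what that looks like if you serve the model behind an OpenAI-compatible endpoint (for example via vLLM or LM Studio); the base URL, API key, model name, and prompts below are placeholders of mine, not anything from the Qwen docs:

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible server hosting the model; all values are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/QwQ-32B-Preview",
    messages=[
        # Replace the default "You are Qwen, created by Alibaba..." system prompt
        # with something neutral, per the suggestion above.
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What happened in Tiananmen Square in 1989?"},
    ],
)
print(response.choices[0].message.content)
```

Whether that actually helps presumably depends on how much of the refusal behaviour is keyed off the system prompt versus baked into the weights.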
Interesting... I submitted 2 photos of Mao Zedong; one time it came back blank and the other time it errored. It also came back blank for Xi Jinping & Jack Ma.
It did identify Yao Ming though!
Oh, yeah. I asked QwQ (their text based reasoning model released a few weeks ago) about the Tiananmen Square Massacre and got a simple response [1]: “I’m sorry, I can’t help you with that.”
[1] https://bsky.app/profile/davely.bsky.social/post/3lc6mpnjks5...
... this may have been a mistake on my part. A similar prompt running locally worked just fine: https://news.ycombinator.com/item?id=42507454
My default problem for testing these systems has been this word search puzzle where you get a list of words and a grid of letters, and you need to find each word in the grid. [0] I just tried it with QvQ and it failed; none of the answers were correct.
Interestingly enough, I recently tried the new Gemini release in AI Studio and it also failed at the first pass. With a bit of coaxing I was ultimately able to get it to find one word successfully, and then a second word once it understood the correct strategy. After that attempt I asked it for a program to search the grid for each word, and although the initial attempt failed, it only took 4 iterations of bug fixes to get a fully working program. The two main bugs were: the grid had uppercase letters while the words were lowercase, and one of the words (soup mix) has a space which needs to be stripped when searching the grid.
Asking QvQ to generate a program to solve this puzzle... The first try gave me a program that would only check if the word was in the grid or not, but not the actual position of the word. This was partially my fault for using a bad prompt. I updated the prompt to include printing the position of each word, but it just got caught up thinking about the problem. It somehow made a mistake in converting the grid and became convinced that some of the words aren't actually present in the grid. After thinking about the problem for a very long time it ended up giving me an invalid program, and I don't feel particularly motivated to try and debug where it went wrong.
What I find most interesting is that asking for a direct solution tends to fail, but asking for a program which solves the problem gets us much closer to a correct answer. Ideally the LLM should be able to just figure out that writing a program is the optimal solution, and then it can run that and extract the answer.
[0] https://imgur.com/F8dm8Zo
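For the curious, a working solver is small. This is my own minimal sketch, not output from any of the models; the grid and word list are placeholders rather than the puzzle in the screenshot, and it handles the two bugs mentioned above by upper-casing everything and stripping spaces:

```python
# Minimal word-search solver sketch. GRID and WORDS are placeholders;
# substitute the actual puzzle's letters and word list.
GRID = [
    "CSOUPMIX",
    "ARBCDEFG",
    "RHIJKLMN",
    "ROPQRSTU",
    "OVWXYZAB",
    "TCDEFGHI",
]
WORDS = ["carrot", "soup mix"]

# All eight directions: horizontal, vertical, and the four diagonals.
DIRECTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]

def find_word(grid, word):
    """Return ((row, col), (drow, dcol)) for the word's start and direction, or None."""
    rows, cols = len(grid), len(grid[0])
    target = word.replace(" ", "").upper()  # strip spaces ("soup mix") and match grid case
    for r in range(rows):
        for c in range(cols):
            for dr, dc in DIRECTIONS:
                end_r, end_c = r + dr * (len(target) - 1), c + dc * (len(target) - 1)
                if not (0 <= end_r < rows and 0 <= end_c < cols):
                    continue
                if all(grid[r + dr * i][c + dc * i] == target[i] for i in range(len(target))):
                    return (r, c), (dr, dc)
    return None

for word in WORDS:
    print(f"{word!r}: {find_word(GRID, word)}")
```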
This seems analogous to the beaten-to-death "strawberry" test, where you're essentially testing whether the model is trained on tokens or not. When you switch to having it write a Python program (in both scenarios) instead of trying to process the input directly, you stop working with individual letter inputs and outputs, so it does better.
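(The program version of the strawberry test really is a one-liner, which is the whole point: it works on characters rather than tokens.)

```python
print("strawberry".count("r"))  # counts characters directly, no tokenization: prints 3
```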
Reasoning models are supposed to be able to work around that kind of limitation (o1 was code named strawberry after all) so it’s not a bad test IMO.
I think the running commentary is interesting. It sounds like a person talking through their thought process out loud.
This is how Molmo’s dataset, PixMo, was created – by recording human annotators describing the image out loud. I wonder if this is how QvQ was trained as well?
> we ask annotators to describe images in speech for 60 to 90 seconds rather than asking them to write descriptions. We prompt the annotators to describe everything they see in great detail and include descriptions of spatial positioning and relationships. Empirically, we found that with this modality switching "trick" annotators provide far more detailed descriptions in less time, and for each description we collect an audio receipt (i.e., the annotator's recording) proving that a VLM was not used.
— https://molmo.allenai.org/blog
Are the Q* models open source with open data? Asking because, other than the known Tiananmen limitation, prompting these models for any advice about acts against authority in a corporate setting yields strong recommendations of compliance. Haven’t explored this deeply, but it was enough of a red flag to limit use to coding/logic tasks.
Model architecture and weights are open source.
Training data is not (but I don't think anyone in this game has fully open training data)
Here you'll find some of the most advanced, fully open (models, data, code, logs, checkpoints...): https://allenai.org/blog/olmo2
Nobody wants to be sued to oblivion.
There are a couple, e.g., OLMo 2.
Q* was the codename for a model at OpenAI that was eventually released under the name o1. This model is something different. QwQ and QvQ are open source models made by the Qwen Team.
> Qwen team just dropped the Apache 2.0 licensed QvQ-72B-Preview
When they dropped it, the huggingface license metadata said it was under the Qwen license, but the actual LICENSE file was Apache 2.0. Now, they have "corrected" the mistake and put the Qwen LICENSE file in place.
I'm not a lawyer. If the huggingface metadata had also agreed it was Apache 2.0, then I would be pretty confident it was Apache 2.0 forever, whether Alibaba wanted that or not, but the ambiguity makes me less certain how things would shake out in a court of law.
Huh, so apparently it can handle images of math equations, as well: https://www.reddit.com/r/LocalLLaMA/comments/1hln7zr/qvq72b_...
> So I’ve got this picture of some pelicans, and I need to count them
It seems to start all responses in this style, but still hilarious. Seems very anti-GPT4 in how casual it sounds.
Google's experimental thinking model is similarly casual. Not as casual as QwQ, but more casual than Gemini 1.5 Pro. Flash 2.0 will also go a bit more casual in its responses randomly, and when you tell it to think step by step.
I find OpenAI's models are the least fun to work with.
Yeah, both GPT-4 Turbo and GPT-4o are much less fun and creative than the original GPT-4.
My current visual reasoning test is seeing if models can figure out what train to take to get from stop A to B when not all trains go to both stops.
Something like: https://www.caltrain.com/media/22502/
So far none pass (mostly from basic cell orientation issues)
In the spirit of Christmas I prompted it to "guess the movie" for this image https://wallpapers.com/wallpapers/funny-home-alone-h1e9xnttc...
Results: https://pastebin.com/Wpk3PmUq
That's a lot of words to not get it.
FWIW, Googling for that image and "movie name" gets the name of the movie in the AI answer box.
Just voicing this thought is very impressive, and bodes well for reducing hallucinations:
> Despite my efforts, I'll have to admit that this one has stumped me.
This model is actually pretty bad. Sure, it can do things like solve math equations from an image, but the vision part of that is basic OCR. In terms of actual vision capabilities, i.e. understanding dense images correctly, these models all fail in the same way.
Over the past year, researchers have been benchmark chasing without caring about the actual abilities of these models. This is especially damning in the vision space, where most "vision" benchmarks consist entirely of either leading questions that trivialize the image-understanding part, or straight-up PNG versions of a regular LLM benchmark, once again reducing the importance of vision to being able to OCR.
Because of this, the main takeaway for researchers in the field appears to have been "vision ability improves with LLM size", and I wish I were joking. This willful misunderstanding is reflected in their architectural choices: 99% of current VLMs rely on some flavor of CLIP model to interpret images, with very few attempts made to improve beyond that, since doing so does not show improvements on current benchmarks.
Let's take one classic example picture that I like to feed these models, just to see if maybe I was wrong about all this. The task is simple: describe the image. https://preview.redd.it/hqu05vlmzu8e1.png?width=1516&format=...
It utterly fails, and hallucinates a ridiculous amount while doing so. There is honestly about as much wrong in the description as there is right. No current VLM, open source or closed, can properly tell who is holding the candy bucket in the front. None. Not Claude, ChatGPT, Gemini, Llama 3.2, Qwen, or anyone else.
And it is all because benchmarks are king, and innovation is dead.
ChatGPT, Claude and Gemini get everything correct except who is holding the bucket. Even the QvQ attempt in your screenshot would have seemed like complete magic a couple years ago.
Benchmarks are useful for knowing where MLMs (multimodal language models) are deficient. Without an automated way to test improvements, improvements are rarely made. I don't think it's really that innovation is dead so much as, like you said, that benchmarks are king. So I would say: if you want things to improve in that area, create a benchmark and it will eventually get solved.
Like how LLMs tend to get tripped up by questions that are phrased like well-known riddles, it turns out you can trip up vision models with images that resemble well-known optical illusions. Overfitting strikes again.
https://x.com/kosa12matyas/status/1871256745403953572
In essence, that is what HallusionBench [0] does. While I think it is an improvement over other vision benchmarks, it still falls short in terms of quantifying actual vision capabilities. More than anything, it seems like a way to detect whether the model was overtrained on these riddles.
[0] https://github.com/tianyi-lab/HallusionBench
It fails even on a "Winnie the Pooh" picture when asked to describe it. I know that the last time we had an unconstrained model from Microsoft it turned into an AI Nazi a few years ago, but this censorship on both open-source and proprietary models worries me. The narrative shouldn't be controlled like this; there should be a better middle ground.
This is super interesting.
I'm still waiting for a model that generates images as part of its thought process. That's what I was hoping this was!
I wonder what it would take to train a proof of concept. Maybe start with videos of people whiteboarding and explaining math or solving engineering problems visually. Then, create captions that frame these as word problems. Train the model so that, during inference, it generates images—not necessarily for human viewing, but as part of its internal reasoning.
Wow. This thing is verbose to the max in its output tokens. It did get the problem I asked it correct.
How does this censorship happen?
Do they have a list of no-go topics that the model isn't supposed to write about? How about guardrails?
vectorization != intelligence
What is intelligence?
It's like "real magic": real magic is magic that is impossible. Magic that can be done isn't real magic; it's just tricks.
I showed it a photo of a spaghetti server utensil that had been chewed up by a dog to the point of the handle breaking off, and while it didn't correctly identify the type of damage, the "thought process" was still pretty interesting:
So I have this spaghetti server, you know, one of those tools with a handle and a claw-like end that's supposed to help you scoop pasta out of a pot. It's gray and looks like it's made of some kind of plastic or nylon. But here's the thing: it's broken. The handle has snapped off, and it's lying next to the rest of the tool on the counter. The broken part looks charred or burnt, and there's a hole where the handle broke off. I need to figure out what happened here.
First, I should think about what a spaghetti server is used for. It's primarily for serving pasta, right? You stick the claw end into the pot, twist it, and it grabs some pasta strands, making it easier to lift them out without making a mess. So, it's a handy kitchen tool when it's intact.
Now, looking at the broken pieces, the handle seems to have separated from the main part of the tool. The point of breakage appears to be near the top, close to where the claw starts. The material looks like it might have been exposed to high temperatures, given the charring. Maybe it got too close to a heat source, like a stove flame or hot pots and pans.
I recall that some kitchen utensils are made from heat-resistant materials, but even they have their limits. If subjected to excessively high temperatures, they can melt, deform, or break apart. The charred appearance suggests that this is what happened here.
Let me consider the possible scenarios that could have led to this damage:
Left in a hot oven: If the spaghetti server was accidentally left in the oven while it was on, the heat could have caused the material to degrade and break.
Contact with direct flame: Placing it too close to a gas stove flame or an open fire could have burned the material.
Exposure to very hot surfaces: Setting it on a hot countertop or against a hot pot could have caused the handle to melt or weaken over time.
Age and wear: Over time, kitchen tools can degrade due to frequent use and cleaning, making them more susceptible to breaking under stress.
Given the nature of the breakage, with the charred edges, it's likely that exposure to high temperatures is the primary cause. It doesn't look like it just snapped from being bent or twisted too much; the damage goes beyond that.
I should also consider the material of the spaghetti server. If it's made of nylon, which is common for such utensils, it can withstand moderate heat but not extremely high temperatures. Nylon starts to soften around 200-250 degrees Celsius, and continued exposure to higher temperatures can lead to degradation and breakdown of the material.
On the other hand, if it's made of silicone, it can handle higher temperatures, up to around 260 degrees Celsius before starting to degrade. However, even silicone has its limits, and prolonged exposure to high heat can cause it to harden, discolor, or break down.
Given the charred appearance, it's possible that the material was nylon, which is less heat-resistant than silicone. But without knowing the exact material, it's hard to be certain.
Let me think about the symptoms of heat damage in kitchen utensils. When plastics or nylons are exposed to high temperatures, they can:
Melt or deform
Change color, often turning brown or black
Become brittle and crack or break easily
Emit smoke or an acrid odor
In this case, the handle shows signs of charring and has broken off, which aligns with these symptoms.
Another possibility is that the spaghetti server was subjected to thermal shock, where it was rapidly heated or cooled, causing stresses within the material that led to cracking or breaking.
However, considering the charred appearance, thermal shock seems less likely than direct exposure to high temperatures over a period long enough to cause burning.
I should also consider if there were any manufacturing defects that could have contributed to the breakage. If the tool was poorly made or had weak points in its structure, it could be more prone to breaking under normal use or with minimal stress.
But from the image, it seems like the break occurred at a point where the handle meets the claw, which is a stress point anyway. Over time, repeated use could lead to wear and tear at this junction, making it more likely to break, especially if subjected to heat, which can weaken the material.
Another factor to consider is how the spaghetti server was being used at the time of the incident. If someone was trying to lift something too heavy or applying too much force, that could cause the tool to break. But again, the charred appearance suggests that heat was involved in the damage.
Perhaps the spaghetti server was left near a heat source, like on the stovetop while the burners were on, or it was placed in the oven unintentionally during cooking. These are plausible scenarios that could lead to such damage.
It's also possible that the tool was used to stir or serve food that was significantly hotter than it was designed to handle, causing immediate damage.
In any case, it's clear that the spaghetti server is no longer usable in its current state. The handle is detached, and the material shows signs of heat damage.
To prevent similar incidents in the future, I should ensure that kitchen utensils are used within their designed temperature ranges. It's important to read and follow the manufacturer's instructions regarding heat resistance and proper care.
Additionally, regularly inspecting kitchen tools for signs of wear and tear can help identify potential issues before they lead to breakage. Replacing damaged or worn-out utensils timely can prevent accidents and further damage.
Moreover, storing utensils away from direct heat sources and not leaving them on hot surfaces for extended periods can extend their lifespan.
In summary, the spaghetti server likely broke due to exposure to high temperatures, which caused the material to degrade and ultimately snap. To avoid such issues, it's crucial to handle kitchen tools carefully, adhere to their usage guidelines, and replace them when they show signs of damage.
Final Answer
The spaghetti server broke due to exposure to high temperatures, leading to material degradation and breakage.