296 comments

  • endymion-light 17 hours ago ago

    There's an incredibly serious lack of education about how LLMs & carb-counting work. This entire article would be better suited to astrology.com than Hacker News.

    When I opened it up, I assumed the author would have at least attempted a calculation service, maybe even fed something like the size of the meal into an actual model, integrating pre-existing tools that are (slightly more) accurate. Hell - most packaged food is literally required to carry calorie information, and you can query open-source data for the rest!

    But the author just took pictures of food & expected a realistic response? Is this genuinely what amounts to a study in AI?

    This is akin to the Instagram reels where people talk to ChatGPT and ask it to time how long their run is. Except those are treated as funny jokes rather than being turned into studies.

    I'd like to see this study done with some kind of actual grounding knowledge, seeing what mistakes AI makes when attempting to query ground truth from picture analysis - there would at least be an interesting methodology and result in that.
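
    To be concrete about what I mean by grounding: once you have weighed ground truth for each photo, scoring the models is a few lines. Everything below (the function, the run values, the 45 g "truth") is made up purely for illustration:

```python
# Hypothetical sketch: scoring model carb estimates against weighed
# ground truth. Run values and the 45 g truth figure are invented.

def error_stats(estimates, truth):
    """Return (mean absolute percentage error, mean signed bias), both in %."""
    errors = [(est - truth) / truth for est in estimates]
    mape = sum(abs(e) for e in errors) / len(errors) * 100
    bias = sum(errors) / len(errors) * 100
    return round(mape, 1), round(bias, 1)

# e.g. five model runs on one plate weighed at 45 g of carbs
runs_g = [30, 52, 41, 68, 38]
mape, bias = error_stats(runs_g, truth=45)
```

    A near-zero bias with a large MAPE would tell you the model is unbiased but far too noisy for a single-shot estimate, which is exactly the kind of distinction a grounded study could surface.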

    • kalleboo 17 hours ago ago

      > But the author just took pictures of food & expected a realistic response?

      There are very popular apps on the App Store right now that are going viral among non-techie people that do exactly this, and they have no concept of how AI works. My wife was talking about one and I had to give her a reality check that the AI had no idea what ingredients were used to make the food. And she's a licensed nutritionalist.

      Studies like this create something to point at for people who are confused and serve as a springboard for a conversation in the media.

      • endymion-light 16 hours ago ago

        That's true - I suppose I'm just disappointed that this study doesn't seem to include those in any analysis. Being able to point out that the top 100 calorie-counting apps on the App Store return similar results to simple frontier models would be of interest.

        I think I'm just disappointed that this study doesn't go deep enough, and stays at a surface-level statistical analysis of frontier models.

        • dpark 13 hours ago ago

          I think it’s a very useful study specifically to debunk the apps that support this flow.

          None of those apps have magic. They cannot do better than the frontier models.

      • asdfasgasdgasdg 15 hours ago ago

        To be fair these kinds of apps also existed before LLMs. They just used OpenCV or similar instead of the LLM APIs.

      • inerte 14 hours ago ago

        To be fair, my expectation is that those apps have done the prompt engineering, and schema, and tools (to query a nutrition database), etc., and although they're not 100% consistent, the margin of error should be narrow to the point that it barely matters, and they should do a bit better than a random ChatGPT chat session.

        • Centigonal 13 hours ago ago

          the problem isn't one that can be solved with prompts. If I gave a panel of food and nutrition experts (human or machine) a bunch of pictures of food, they still wouldn't be able to tell if, e.g. a slice of cake was made with whole milk or skim.

          The "pic of packaged food --> LLM --> nutrition DB call" pipeline is workable, but many users of these apps are using them for fresh prepared foods, which is just an unworkable problem without either an understanding of the preparation process or a bomb calorimeter.
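
          The DB half of that pipeline really is trivial once the product is identified - which is why the hard part is identification, not lookup. A minimal sketch; the record mimics an Open Food Facts-style per-100g schema, but treat field names and numbers as illustrative:

```python
# Sketch of the nutrition-DB half of the pipeline: once an LLM (or a
# barcode scan) has identified a packaged product, the macros come from
# a database record rather than the model's guess. The record below
# imitates an Open Food Facts-style per-100g layout; field names and
# numbers are illustrative.

def carbs_for_portion(product: dict, grams: float) -> float:
    """Scale the per-100g carbohydrate figure to a weighed portion."""
    per_100g = product["nutriments"]["carbohydrates_100g"]
    return per_100g * grams / 100.0

product = {
    "product_name": "Tortilla chips",
    "nutriments": {"carbohydrates_100g": 62.0, "energy-kcal_100g": 497},
}

carbs = carbs_for_portion(product, grams=40)  # a weighed 40 g serving
```

          For a slice of cake off a plate there is no such record to look up, which is the parent's point.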

      • xnx 13 hours ago ago

        Even simpler examples make the limitations obvious. From an image alone you can't distinguish Diet Coke from regular Coke.

      • senordevnyc 16 hours ago ago

        licensed nutritionalist

        Nutritionist?

        • kalleboo 4 hours ago ago

          Haha oops. English is hard...

    • furyofantares 17 hours ago ago

      From the text of the article I believe the author is implying there are apps doing exactly this, and so this is why it was studied that way.

      Had the author written the article themselves rather than an LLM their motivation probably would have been clearer.

      • Brendinooo 17 hours ago ago

        > there are apps doing exactly this

        Yeah, for sure there are. And people will just ask ChatGPT as well.

        The funny thing is that for people who are just trying to lose weight without managing any health issues precisely, this type of extreme variance doesn't really matter, because the mere act of consciously quantifying food consumption is, based on my experience counting calories, the single biggest factor in success with weight loss.

        • criley2 17 hours ago ago

          I actually think "just asking ChatGPT" is fine, because A) the data in these apps is suspect at best and B) the data behind calories is also pretty suspect (but we all play along because we can adjust other variables to make it all "work" well enough).

          Once or twice a year I spend a few weeks meticulously measuring ingredients/cooked foods and recording calories and on complex recipes apps are next to useless at getting accurate data. You're trying to input five or ten relevant ingredients, and then weighing your cooked outcome to try and divide the ingredients by proportion. Frankly it's a mess and most people aren't doing it for home cooked meals, and are getting very lossy outcomes (weighing cooked chicken and marking it as raw chicken, etc)

          With reasoning and tool calling (combined with me meticulously weighing before and after), it's producing fine data for my purposes.
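
          The before-and-after weighing boils down to proportional arithmetic: total the raw ingredients, then apportion by cooked weight (cooking changes weight via water loss, not calories). A sketch, with all ingredient numbers illustrative:

```python
# Sketch of the weigh-everything workflow for a home-cooked recipe.
# All weights and kcal/100g figures are invented for illustration.

def kcal_per_serving(ingredients, cooked_weight_g, serving_g):
    """ingredients: list of (raw_grams, kcal_per_100g_raw) tuples."""
    total_kcal = sum(g * kcal100 / 100 for g, kcal100 in ingredients)
    return total_kcal * serving_g / cooked_weight_g

recipe = [
    (500, 110),  # 500 g raw chicken thigh @ 110 kcal/100 g
    (200, 360),  # 200 g dry rice @ 360 kcal/100 g
    (30, 884),   # 30 g olive oil @ 884 kcal/100 g
]
# Pot weighed 1,100 g after cooking; you eat a 275 g plate.
plate_kcal = kcal_per_serving(recipe, cooked_weight_g=1100, serving_g=275)
```

          The common mistake mentioned above - logging cooked chicken against a raw-chicken database entry - is exactly a failure to do this raw-in, cooked-out bookkeeping.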

          • ijk 15 hours ago ago

            I was complaining about AI generated clothes being misleading marketing, deceiving customers as to whether the garment even exists.

            And then I learned that the pre-AI norms weren't any less fictional: they made an exemplar garment and did photoshoots, sure, but then they sent the pictures and patterns to the lowest-bidder factories with permission to make whatever edits were necessary to make it cheap and manufacturable. The whole thing was already a simulacrum.

          • smoe 15 hours ago ago

            I honestly think that, given the sorry state of the pre-GenAI internet, with all the SEO optimization nonsense, clickbait, and supplement peddling everywhere, LLMs are for now actually better than Google for “doing your own research” on many things.

            At least at the entry level. Once you want to go in depth, the outcome in my experience is the same as with LLM use on any topic: it depends heavily on the domain knowledge of the prompter and their ability to steer it.

      • ozgung 16 hours ago ago

        The author uses the prompts and method from an open-source app that connects to an insulin pump, a medical device. I think AI food identification is an experimental feature in the app.

        > The prompt was adapted from the one used in the iAPS open-source automated insulin delivery system — it’s a real production prompt, not a toy example.

        https://github.com/Artificial-Pancreas/iAPS

        I think these are the prompts in the app: https://github.com/Artificial-Pancreas/iAPS/tree/5eabe22e7e2...

        • sjhatfield 15 hours ago ago

          Exactly. This is not paid software. We assume full responsibility for outcomes when using it. There's a reason it's not on any app store. I'm glad features like this are being experimented with. Not how I would use AI to estimate carbs...

        • Ancapistani 14 hours ago ago

          True, but I'm working on a product that's "adjacent to" this sort of thing, and we also have a "food recognition" feature that's marked as experimental. Our users are using it, and now I plan to push fairly hard on at least measuring the accuracy and hopefully exposing those results to our users regardless of how well it performs.

    • andrewvc 16 hours ago ago

      One of the biggest gaps is that people don't understand that food labels are allowed by the FDA to be off by up to 20% in terms of the number of actual calories!

      In the real world you need to calibrate your behavior with the results. Are you gaining weight? You'll need to eat less if you want to lose any. You can do all the math with nutrition labels and macros you want but that's all theoretical.

      See this study below for the 20% figure, as well as their experimental results on real food items (some even exceeded this threshold though most were within it). https://pmc.ncbi.nlm.nih.gov/articles/PMC3605747/?st_source=...
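
      To make the 20% figure concrete, here's the band it puts around a logged day - a tiny sketch, numbers illustrative:

```python
# The 20% labeling tolerance in practice: a day that logs to exactly
# 2,000 kcal from labels could really sit anywhere in a 1,600-2,400 kcal
# band, which can swallow a typical 500 kcal/day deficit target.

def label_bounds(logged_kcal: float, tolerance: float = 0.20):
    """Return the (low, high) range consistent with the label tolerance."""
    return (logged_kcal * (1 - tolerance), logged_kcal * (1 + tolerance))

low, high = label_bounds(2000)
```

      Which is why calibrating against the scale, as above, beats trusting the arithmetic.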

      • bonoboTP 16 hours ago ago

        A large part of the effectiveness in counting calories is that you pay more attention and make more conscious decisions and are less likely to "cheat" if you have to enter it in your food log.

        It's indeed like astrology. Simply thinking about personality traits and thinking through your life and your desires and goals and current situation is already beneficial to take charge and navigate your life.

        • smallmancontrov 15 hours ago ago

          I'd take the opposite point of view: just thinking about life, desires, and goals is how people wind up paying $10 to Whole Foods for a slice of "healthy pizza" that is nutritionally identical to any other pizza but comes out of a cute stone oven and is displayed on a wooden platform next to green leafy plants. Vibes astrology is notoriously easy to exploit, both by your own sugar/salt/fat-seeking instincts and by unscrupulous commercial forces, let alone the two working together. The unique thing about calorie counting is that it cannot be exploited like vibes astrology. Not even with a 20% error margin, which is (probably not coincidentally) the caloric deficit targeted by standard dieting advice.

        • ijk 15 hours ago ago

          That's an interesting bit, where reducing friction too much can eliminate the side effect that is actually driving the desired results.

          Do you want to count calories, or do you want to lose weight? Sounds like it's possible to hyper-optimize calorie counting to the point that it becomes counter-productive...

      • johsole 15 hours ago ago

        Thank you for linking the study.

        Some good news from it. If you weigh the food instead of depending on the package size then the labels become much more accurate!!

        "Serving size, by weight, exceeded label statements by 1.2% [median] (25th percentile −1.4, 75th percentile 4.3, p=0.10). When differences in serving size were accounted for, metabolizable calories were 6.8 kcal (0.5, 23.5, p=0.0003) or 4.3% (0.2, 13.7, p=0.001) higher than the label statement."

        If you look at the table "Deviation of metabolizable calories from label calories" [https://pmc.ncbi.nlm.nih.gov/articles/PMC3605747/figure/F1/] you'll see that most labels, even for serving size, are pretty good, and there are some that are really bad.

        If you look at one of the worst offenders, Tostitos, the label says "Tostitos Tortilla Chips - serving size 24 chips", but chips vary a lot in size, so you could have a huge variance in weight. If instead you weighed them, which I do with my chips, I bet the calories are much closer to the label.
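
        Concretely, weighing turns the label into a simple ratio. A sketch with invented chip-style label numbers:

```python
# Weighing instead of counting pieces: scale the label's per-serving
# calories by weighed grams, not by chip count. Label numbers invented.

def kcal_by_weight(weighed_g, serving_g, kcal_per_serving):
    """Calories for a weighed portion, scaled from the label serving."""
    return kcal_per_serving * weighed_g / serving_g

# Label: 28 g serving ("about 12 chips") at 150 kcal.
# Twelve large chips might actually weigh 40 g:
actual = kcal_by_weight(weighed_g=40, serving_g=28, kcal_per_serving=150)
# ~214 kcal, where counting chips would have logged 150.
```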

        Body composition comes down to routine. I've found I love to eat, but I pretty much eat the same meals week over week, and that makes it extremely easy for me to lose or gain weight depending on my goals.

        • adrian_b 14 hours ago ago

          Moreover, the calorie numbers for raw ingredients are much more accurate than for snack foods, where the amount of each ingredient may vary from nominal, even when the total mass is nominal.

          So when you cook yourself and you weigh the ingredients used for cooking, you can know the real calorie content with far more accuracy than when buying ready-to-eat food.

      • CGMthrowaway 12 hours ago ago

        The errors either cancel out (if error in both directions) or they work in one direction ("bad" foods like junk systematically underestimate calories, "good" foods like protein powder systematically overestimate calories).

        Either way, if you count calories and compare to your weight gain/loss over a few weeks and adjust your calorie target as warranted, assuming the types of food you are eating do not change drastically (e.g. you calibrated on regular diet and now have started an elimination diet), the error bars can be basically ignored.
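
        That adjust-as-warranted loop is simple arithmetic. A sketch using the rough ~3,500 kcal/lb heuristic (a common approximation, not a precise constant), with invented numbers:

```python
# Calibrating a calorie target against the scale: whatever the label
# error is, the weekly weight trend reveals your true energy balance.
# Uses the rough ~3,500 kcal per pound heuristic; numbers invented.

def adjusted_target(current_target_kcal, observed_lb_change_per_week,
                    desired_lb_change_per_week):
    """Shift the daily target by the gap between observed and desired trend."""
    gap_lb = desired_lb_change_per_week - observed_lb_change_per_week
    return current_target_kcal + gap_lb * 3500 / 7  # kcal/day adjustment

# Eating to 2,200 kcal/day, aiming to lose 1 lb/week but only losing
# 0.5 lb/week -> trim ~250 kcal/day:
new_target = adjusted_target(2200, observed_lb_change_per_week=-0.5,
                             desired_lb_change_per_week=-1.0)
```

        Note the label error drops out entirely: the loop only needs the logging to be *consistent*, not accurate.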

        • andrewvc 12 hours ago ago

          Exactly! My point was that despite the precision of calorie figures, we should really think of them as ballpark estimates.

      • strken 15 hours ago ago

        Realistically the labels are going to be much closer for staples like long-grain Jasmine rice or olive oil, if they're measured by weight.

        It's just not that easy to change the nutritional content of a kilogram of a known cultivar of dry rice when it's passed all the standard checks for moisture content, protein content, etc.

      • ndisn 15 hours ago ago

        20% doesn’t seem so bad?

        And the more natural a food is, the more inaccurate the results will be, because of natural fluctuations - think of how much the fat content of a chicken can vary. So making this tolerance stricter would only benefit foods that are all chemicals.

        The usual ultra-processed slop made of cheap ingredients allows for very low tolerances. Mayo has some lab-grade soy oil, lab-grade yolk, and perhaps some lab-grade starch as a thickener. Yippee, we have a tolerance of 0.1% in calories. But how do you reach that level of accuracy with a roast chicken with nothing added?

      • ludicrousdispla 15 hours ago ago

        how many calories are in a strawberry?

      • nekusar 14 hours ago ago

        It's even weirder.

        What has more calories: 1 lb of peanuts, OR 1 lb of peanuts ground into peanut butter?

        I can't find the study, but the peanut butter has more calories, since it's pre-ground and more bioavailable. Peanuts get chomped up, but larger pieces still remain and are not captured by the body.

      • stogot 15 hours ago ago

        Does this only apply to calories or to other categories too?

    • throwaw12 17 hours ago ago

      I feel like you didn't understand the goal of this study

      > The DTN-UK stated earlier this year that generic LLMs must never be used as autonomous advisory calculators for insulin delivery. This data is the quantitative evidence base for that statement.

      This study is to prove that you should not rely on LLMs

      • lukeschlather 15 hours ago ago

        The thing is, it doesn't really prove LLMs can't do this; it proves that no existing frontier LLM can.

        The part where they talk about sampling multiple runs is interesting - it suggests to me that in the next few years as the reasoning process is improved the models may be able to do that autonomously.

        My mind really goes to using dedicated object-detection models fine-tuned with nutrition information, but I don't think there's a fundamental reason LLMs can't eventually manage this use case, except perhaps the size of the needed weights being prohibitively large.
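
        The multiple-runs idea can be quantified with a simple dispersion statistic. A sketch over made-up estimates:

```python
# Sketch of the repeated-sampling check: ask the same model the same
# question many times and measure how much the answers spread. A high
# coefficient of variation means no single run can be trusted.
# The sampled values are invented.

from statistics import mean, stdev

def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return stdev(samples) / mean(samples)

runs_g = [30, 52, 41, 68, 38]  # carb estimates (g) from 5 identical calls
cv = coefficient_of_variation(runs_g)  # ~0.32 here: far too wide to dose on
```

        A model that could drive this loop itself - sample, notice the spread, and abstain - is roughly what "doing it autonomously" would look like.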

        • tsimionescu 15 hours ago ago

          Per some people, LLMs of the future can do literally anything that's possible to do. They could create quantum computers powered by fusion power.

          That has nothing to do with the question being asked, can you rely on an LLM today to help you track carbs as a diabetic?

          This is very explicitly what the article is all about. Potential future LLMs are entirely irrelevant.

          • lukeschlather 11 hours ago ago

            This isn't something so fanciful as fusion power, this is reasonably something that might be within the capabilities of object detection transformers. Whether a different prompt/finetuning with a good dataset could make this work is very relevant here.

      • The_Blade 16 hours ago ago

        that is good to know. presented this way i find LLM behavior to be a feature, not a bug. then again i think everything is value add over pen and paper / notepad / spreadsheet and maybe a friend or doctor (or specialized equipment if you need more than calorie in, calorie out). just go exercise and don't be a lard lad

        • devilbunny 15 hours ago ago

          > just go exercise and don't be a lard lad

          You can out-exercise almost any diet, but it takes 3-4 hours a day of a hard workout.

          If calories in, calories out was useful advice rather than a banal statement of physics, nobody would be fat.

          • genewitch 9 hours ago ago

            > If calories in, calories out was useful advice rather than a banal statement of physics

            it's also wrong, or at least imprecise. fat and bone and muscle all weigh different amounts, at the same volume.

            the only way i've been able to explain the science to laypersons is thus:

            if you and a friend both weigh 200lbs, but you once weighed 250lbs and your friend has never weighed more than 200lbs; all else equal: you must ingest fewer calories than your friend to maintain a 200lb body weight.

            your body will try to "outlast the famine" if you had to lose a lot of weight (or lost a lot of weight for any reason).

            That absolutely does not comport with "calories in, calories out". It's also why the people who were never fat have no problem "just eating a donut."

            no, i won't cite; this has been published many times over the last decade or so.

            • devilbunny 9 hours ago ago

              Eh, it totally does comport with calories in, calories out. We just don't hook people up to metabolic carts in day-to-day life, but that's really the only way to measure the calories out part.

              The physics of CICO are undeniable. Humans don't photosynthesize. The biology of CICO is useless.

              - former fat guy. I'm deeply, personally aware of how useless CICO is as dietary advice.

        • tsimionescu 15 hours ago ago

          > just go exercise and don't be a lard lad

          This is about people suffering from diabetes tracking their insulin needs. You can outrun any diet, but not insulin shots.

      • fabian2k 17 hours ago ago

        The paper itself is a lot clearer about the purpose. The blog post reads very clickbaity and doesn't really explain the context well.

        • Aurornis 17 hours ago ago

          I disagree, it clearly explains that AI carb counting apps are a problem and shouldn’t be used.

          They’re writing in a neutral way that reaches their audience without lecturing or being condescending. They lead the reader to the conclusion rather than shoving it at them. I think that’s why it’s triggering so many angry comments on HN, but it’s effective for the audience they’re writing for (non technical people who may need convincing but don’t like being preached at)

      • snapcaster 17 hours ago ago

        But it's stupid. If i smack myself in the head with a hammer is that proof hammers shouldn't be relied on?

        • fc417fc802 17 hours ago ago

          If you smack yourself in the head with a hammer and it injures you that's evidence that smacking people in the head with hammers is bad and shouldn't be done, right?

        • jkestner 17 hours ago ago

          Here we’re at the origin of the tool and get to watch how many people hit themselves in the head before we learn this collective wisdom.

          There’s a gap between what the tool will allow you to tell it to do, and what it’s good at. The feedback mechanism to tell the difference is deficient compared to a hammer.

        • coldtea 17 hours ago ago

          No, but it would be proof you didn't get the point of the paper.

        • jmye 17 hours ago ago

          Are there start-ups led by idiots suggesting that smacking yourself in the head with a hammer will help treat your diabetes?

          If not, then perhaps there's a problem in your analogy.

    • Aurornis 17 hours ago ago

      > But the author just took pictures of food & expected a realistic response? Is this genuinely what amounts to a study in AI?

      The article explains this: There are apps targeting people with diabetes that claim to count your carbs with AI.

      > If you’re using AI carb counting in a diabetes app

      Before you dismiss a study, try to understand where it’s coming from.

      The authors of the study weren’t stupid. They knew the LLMs would provide poor results. They ran the study to quantify it and create a resource to spread the information in response to the rise of AI carb counting apps.

      • ijk 15 hours ago ago

        > The authors of the study weren’t stupid. They knew the LLMs would provide poor results. They ran the study to quantify it and create a resource to spread the information in response to the rise of AI carb counting apps.

        Yeah. I think it is under-appreciated that much of science is intended for debugging purposes. Sure, you and I know that X is positive, but what's its actual value? Can we find the causes that make it that way? Et cetera.

      • endymion-light 17 hours ago ago

        I don't believe the authors of this study are stupid.

        If there are apps targeting people with diabetes that claim to count your carbs with AI, why haven't those been analysed? That would be a far more effective claim.

        I based that off of the clickbait article they wrote about the study - I'll read through the study to see whether they analyse that, but it would be far more effective to see whether the 'carb-counting' AI apps return results similar to the frontier models. That's an interesting result that could actually move the discussion forward.

        • Aurornis 17 hours ago ago

          > If there are apps targeting people with diabetes that claims to count your carbs with AI, why haven't those been analysed? That would be a far more effective claim.

          Because the apps aren’t going to let you submit 29,000 automated requests for statistical analysis.

          And if you did, the authors of those apps would just release an update saying they changed models and try to dismiss the study.

          The vitriol against this article on HN is sad. Commenters who agree with the article and its conclusions are grasping for reasons to be angry about it anyway

          • endymion-light 16 hours ago ago

            You can perform statistical analysis on frontier models and still use commercial applications as an identifier & comparison.

            Criticism is not vitriol. It's possible to make a wider point about being taken aback by the lack of education around AI, to the point that a critical mass of people are using LLMs for calorie counting; but there are many studies on the effects of LLMs on psychology etc. that are far more effective.

            But for me - this is like creating a study showing that performing algebra & calculus with LLMs is inaccurate. That should be common knowledge.

            • hrimfaxi 16 hours ago ago

              It is not uncommon to study things that are considered common knowledge.

            • YeGoblynQueenne 15 hours ago ago

              Well, for me, the comments that insist we don't need to study X because everybody knows LLMs can't do that are a very good justification to study exactly X.

              Not to mention that this is now a standard thought-terminating cliché, where someone points out a use case where LLMs don't work at all well and irate responses protest that LLMs aren't meant to be used in that way. Says who? If you ask an LLM a question and it answers it - then that's an LLM use case. If you can ask the same question many times and evaluate the results, then that's an evaluation that is perfectly fine to make.

              • endymion-light 15 hours ago ago

                Yes - my original claim is not that it shouldn't be studied, but that it should be studied more deeply than at surface level, based on what I've read from the linked site.

        • tsimionescu 14 hours ago ago

          The linked "click bait" article explains this very clearly as well. It clearly explains the methodology: they took the prompt sent to an LLM by a popular open source carb counting iOS app and sent it, together with five different pictures of food that a typical person might take, to all of the frontier models, and checked the responses. They also explain the purpose: to check the possible accuracy of this approach taken by a real app that real people use.

          The fact that you somehow perceived this as an attack on LLMs as a technology is a failure entirely on your part. There is nothing in the article that suggests that people shouldn't use LLMs for other purposes - just a statistical verification of the fact that they shouldn't be used for this one particular thing.

          • endymion-light 11 hours ago ago

            I didn't take anything as an attack on LLMs. I took it as a severe misunderstanding of how technology works. I specifically outline that I would like to see the margin of error even when integrating actual apps that claim to achieve results, rather than using tools that don't.

            None of my claim perceives anything as an attack on LLMs, which shows a mischaracterisation on your part of my entire point.

      • ilivethere 17 hours ago ago

        Typical case of the "curse of knowledge". We deal with AI on a daily basis on the technical level, so it's very easy to forget that the "common" folk really still believe that AI can replace dieticians, gym coaches, etc

    • coldtea 17 hours ago ago

      >But the author just took pictures of food & expected a realistic response? Is this genuinely what amounts to a study in AI?

      If there are commercial services where you take pictures of food and are promised a realistic (paid for) response, then yes. And there are.

      • dahart 16 hours ago ago

        And what’s the variance & accuracy of their responses? Isn’t comparing the models’ variance to baseline human variance what matters here? It seems like they didn’t do that, and I agree with parent’s call for that kind of baseline.

        Having counted calories for years, I don’t think I could reliably estimate the calories or carbs in the example picture of a cheese sandwich. I can make assumptions about the bread and the cheese, but I might easily be off by 2-3x. Calorie counting apps that use text descriptions also have huge variance for the same thing. The problem might be the belief that a picture or description is enough, regardless of who or what is guessing…?

        Edit: Ah, I see from sibling thread you meant commercial services are LLMs, I thought you meant there were human-backed services to compare to. Anyway, I totally agree there’s a problem if people rely on AI for safety, but I’m not sure LLMs are the core issue here, it seems like using vague information and guessing is the core issue.

        • swiftcoder 16 hours ago ago

          > Isn’t comparing the models’ variance to baseline human variance what matters here?

          You seem to be missing the context that this isn't just about diet apps - this is about apps claiming to be able to track carbs sufficiently accurately to be used in a medical context to dose insulin (a substance which can be lethal if incorrectly dosed)

          • dahart 15 hours ago ago

            No I understand apps are making dubious claims and implications; obviously claiming LLMs can accurately estimate carbs from a photo is just wrong. But that doesn’t necessarily change my question. Should people use photos to estimate carbs? Can people looking at photos do any better?

            The presence of variance in the LLM output doesn’t actually prove anything, in fact I would expect and hope for variance when confidence is less than 1.0. I’m more curious about accuracy of the mean of guesses for different models, for example.

            But should any diabetic expect photos to be reliable, regardless of whether it’s an app or an LLM or a human? I know some diabetics, and the people I know do not rely on photos for their safety. They don’t even rely on food labels either (which are far more accurate than photos), they measure their insulin.

            It’s probably useful to raise awareness, and useful to scare app makers away from making bogus medical claims - products and scams that make bogus medical claims is of course a practice as old as history. But we can still hold the studies and PR around this up to high standards, right? Even assuming this article & the paper behind it are right, there are reasonable questions here about how to demonstrate the problem and what the baselines are.

            It’s worth keeping in mind that trying to prove the bogus apps wrong with a flawed methodology or questionable reasoning or just an overly heavy handed style can cause backlash and do damage to the cause. We’re already seeing that effect play out with respect to vaccinations.

      • endymion-light 16 hours ago ago

        But I don't see them using those commercial services in this study - instead, they're using frontier model companies? Is Gemini advertising that you get a realistic calorie count from a picture? Maybe so - in which case I'd take it back!

        • notahacker 16 hours ago ago

          The commercial services likely also have frontier model dependencies...

          The opening to the actual paper is quite explicit that (i) other studies have already tested commercial apps with unimpressive results and (ii) a popular open-source app for carb counting directly relies on API calls to these frontier models, and this research batch-tested the images using the exact same models and prompts as that app.

          • azakai 14 hours ago ago

            A carb counting app might use API calls to these frontier models and then do some kind of analysis on top. It could check whether different models agree, or make multiple calls and measure the variance.

            So it would be more accurate to test the apps rather than the APIs, unless the goal is to warn people who just open ChatGPT and ask there.
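
            That kind of aggregation layer is cheap to sketch - model names and numbers below are invented:

```python
# Sketch of the aggregation an app could layer on top of raw API calls:
# query several models, take the median, and refuse to answer when they
# disagree too much. Model names and values are invented.

from statistics import median

def ensemble_estimate(estimates_g, max_spread_ratio=0.3):
    """Return (median grams, reliable flag) for a set of model answers."""
    mid = median(estimates_g)
    spread = (max(estimates_g) - min(estimates_g)) / mid
    return mid, spread <= max_spread_ratio

answers = {"model_a": 42, "model_b": 47, "model_c": 44}
carbs, ok = ensemble_estimate(list(answers.values()))  # agree -> usable

carbs2, ok2 = ensemble_estimate([30, 52, 68])          # disagree -> flag it
```

            Of course, if all the models share the same blind spot (skim vs. whole milk in a cake, say), agreement doesn't imply accuracy - which is the study's underlying point.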

            • notahacker 13 hours ago ago

              The open source app could in theory do that, but the paper's authors would be able to determine whether it did or not by reading its code, which they evidently did to replicate the API calls it made with their own script.

              (And of course it would also be far more tedious to submit each picture 500 times manually using an app and manually log the response than using a script which is designed to collect the data automatically as fast as API rate limits permit)

        • coldtea 16 hours ago ago

          Are commercial services anything more than just UI facades on top of frontier model APIs?

          • endymion-light 16 hours ago ago

            Great point - and I'd love a study to address that. If the study pointed out that X services sit perfectly within the analysis found, I think that would be a fantastic study that would be enlightening & useful to show.

            • swiftcoder 16 hours ago ago

              The app the study is based on is open-source, so you yourself can verify that it does indeed just call a frontier model with the same prompts used in the study

              • endymion-light 15 hours ago ago

                That's not really the same thing as what I'm saying - which is to investigate the applications specifically advertising AI calorie counting capabilities

                • notahacker 12 hours ago ago

                  They investigated an open source application specifically advertising carb counting capabilities, replicated its prompts and API calls in a way optimised to collect data from 26000 queries (which is a lot to do using a GUI!). They also note other people have already done [necessarily] smaller scale studies of the commercial AI carb counting apps and been similarly unimpressed by the responses.

                  This is all in the first few paragraphs of a preprint paper describing the research in considerably more detail which is linked at the bottom of TFA

                  Meta: enjoying nearly half this HN thread being arguments that surely people who care about what's in their food wouldn't ask ChatGPT for comment instead of looking it up properly, and most of the rest of it being people who apparently care what's in a research paper asking HN for comment instead of looking it up :)

    • swalsh 17 hours ago ago

      It amazes me how often people try to build AI systems relying on nothing more than the model's knowledge. I suspect a great deal of the "failed" AI experiments we keep reading about come from people having no idea how to use AI for what it's good at.

    • nextlevelwizard 17 hours ago ago

      As someone who used to do this: OpenAI models refuse to look up calories unless you explicitly tell them to, and even then it's hit and miss even when you tell them exactly what the product is. The easiest way to get a good calculation is to just take a photo of the nutrition label or feed that info in by hand.

      Funny thing is 4o did look up calories but I guess it was too good for this world

      • the_duke 17 hours ago ago

        I exclusively use thinking mode, which is slower but much more likely to double-check things with web search etc.

        • nextlevelwizard 17 hours ago ago

          Maybe. I stopped using OpenAI a while ago. But taking pictures of the nutrition labels was good enough

    • datsci_est_2015 16 hours ago ago

      The obvious meme to invoke here is:

        - AI will solve all of our problems 
        - No not like that!
      
      Are the trillion dollars sloshing around the AI economy well-invested if the refrain is always “you’re holding it wrong”?

      So we’re trying to define, through trial and error, what problems “AI” will actually solve, and this paper is one of the many cobblestones on that road.

      • endymion-light 16 hours ago ago

        i mean it's more like

        "AI can solve this one problem, but it needs X, Y, Z, because it's not an omnipotent god entity"

        "I tried it without any of those things and it didn't work - this is worthless tech!"

        I don't know if more accurate calorie counting using AI exists - but it's like being upset that the screwdriver isn't gluing wood. AI is far more than frontier LLMs.

        • datsci_est_2015 15 hours ago ago

          There are plenty of positions on the spectrum between “omnipotent God entity” and “Casio SL-300SV”. What does the current valuation of LLMaaS companies represent, though?

          LLMs are certainly not worthless, that’s a strawman in the same way my statement “AI will solve all of our problems” is a strawman. The question of their worth is being explored.

          “AI is a black box that can solve problems”. Which problems? How consistently? At what cost? How quickly?

        • muwtyhg 14 hours ago ago

          In this case, what is the 'X, Y, Z' that the apps are providing that the model is not?

          • endymion-light 11 hours ago ago

            Theoretically:

            - A queryable large vector database containing calorie counts for specific meals.

            - A vision model specifically trained on food images with labelled data containing approximate calorie counts.

            - OCR model allowing reading of barcodes + calorie information.

            - A model trained to ask for additional context & information (e.g. for pasta: please provide a photo of the original sauce tin, etc.; please approximate the weight of X meat)

            I don't know how accurate integrating all of those aspects would be - and you could argue the end user would probably be incredibly annoyed and it wouldn't be a good app - but I'd argue you'd at least need that if you're developing an app for diabetes management.

        • sleepybrett 15 hours ago ago

          > "AI can solve this one problem, but it needs X, Y, Z, because it's not a omnipotent god entity"

          0 advertisements from openai or anthropic say this. They all sell you an omnipotent god entity.

    • giancarlostoro 17 hours ago ago

      > But the author just took pictures of food & expected a realistic response? Is this genuinely what amounts to a study in AI?

      Reminds me of that one YouTube video (I forget whose, so I have no idea how to pull it up) where he turns his phone's camera on for ChatGPT and asks it what everything it sees weighs, then puts each item on a scale, and ChatGPT was never right, ever. Which makes sense: I couldn't tell you what most things weigh on sight alone either, but ChatGPT was often dramatically off. I got the feeling he thought this made it terrible AI, but I don't think a model looking at an image of something and trying to guess its weight / calories / etc. is a reason to call the model bad...

      • wat10000 16 hours ago ago

        People do not understand this stuff at all. So many times I've seen people post the output from one of these services with full confidence that it has to be correct because it came from a computer.

        • Izkata 16 hours ago ago

          Between the marketing and the hype, plenty of people believe we already achieved true sci-fi-style superhuman general AI a year or two ago.

    • jvanderbot 16 hours ago ago

      I did this too! For months (almost a year) I used descriptions, pictures, and measurements of food to get rough calorie counts. My diet is pretty simple and repetitive.

      I would occasionally check the estimates, maybe once every few days for meals I wasn't already pretty sure of, and it was generally accurate. Where it was extremely inaccurate was on portions, and anyone who has dealt with computer vision could tell you, you can't get scale from a picture. So I'd have to weigh some meals or ingredients, which would generally make things more accurate again.

      So, I think it's possible, but you need multimodal data, grounded with regular checks.

      • thechao 16 hours ago ago

        About once a week I ask ChatGPT to give me a reasonable diet for recomp with weight loss. It consistently insists I have at least 7 meals of at least 30g of protein each, but that the protein source can't be whey or casein. When I ask "why", it cites a bunch of studies ... but most of those "studies" are N=1 of a college or Olympic-level athlete. If, instead, I hand it a large-scale analysis, it says "3 meals" with about half the protein.

        It'll defend both sides (mutually contradictory) to the death. NOTHING will budge it from its initial stance.

        • ifwinterco 16 hours ago ago

          To be fair this is a reflection of the general state of nutritional science and the actual answer seems to be "it depends on your genes".

          Some people do well on 6 small meals, others do well on no breakfast and two large ones. Studies can't tell you anything useful about that, you have to experiment and find out what works best for you

          • machomaster 15 hours ago ago

            The answer does not really depend on genes. There are personal preferences, there are sex differences (women prefer more carbs), and the biggest component is where you are and which direction you want to go.

            But in terms of physiology the answer is quite clear:

            1. The protein is the most important macro to get, no matter if bulking or cutting. It is the building block.

            2. Whatever the amount (0.8-1.8 g/kg of bodyweight; it depends a bit on the situation and your willingness to lose some marginal potential gains), try to divide your daily protein somewhat evenly between meals.

            3. Pareto principle, you get the most benefits by having 3 meals. 4 if you really care about small differences and want to optimize. 5 meals give negligible additional benefits, for professional athletes who want to be anal.

            4. So basically eat at least 3 meals and up to whatever works for you practically speaking.

            It's not that difficult or ambiguous.
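
            The arithmetic behind points 2-4 is simple enough to sketch (hypothetical bodyweight and target, not a recommendation):

```python
def protein_per_meal(bodyweight_kg, g_per_kg, meals):
    """Split a daily protein target (bodyweight * g/kg) evenly across meals."""
    return bodyweight_kg * g_per_kg / meals

# Hypothetical 80 kg person at 1.6 g/kg, split over 3 vs 4 meals:
print(round(protein_per_meal(80, 1.6, 3), 1))  # 42.7 g per meal
print(protein_per_meal(80, 1.6, 4))            # 32.0 g per meal
```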

            • ifwinterco 9 hours ago ago

              Optimum meal timing in particular I believe is heavily influenced by genes - I have friends who never eat breakfast, survive on black coffee until 1PM, then eat a lot in the evening and feel good doing it. If I do that I feel terrible.

              So yes eat 2g/kg protein but the best way to time that in terms of meals, best specific foods to eat etc is definitely influenced by your genes

    • layer8 16 hours ago ago

      The author is doing what an unsophisticated user would do, or would want to be able to do, and estimating calories from a photo has been an often-cited potential or promised AI use case in recent years. It makes a lot of sense to test current general-purpose AI’s performance on it as a reality check.

      It also exemplifies how current AI offerings are still quite limited in their capabilities, because one would expect them to do the intelligent thing on their own, instead of the user having to come up with a working methodology.

    • InsideOutSanta 17 hours ago ago

      There are apps in the app store right now that pretend to do this kind of thing, so having somebody actually show that it doesn't work is valuable, even if we already knew the outcome ahead of time.

      • endymion-light 17 hours ago ago

        I suppose I'd much rather a study analyse the apps in the app store that are attempting and claiming to do that kind of thing, rather than the base model they might be using.

    • something765478 17 hours ago ago

      > The prompt I used asks each model to return a confidence score (0 to 1) for every food item it identifies. All four models dutifully returned confidence scores for 100% of items. Surely we can use those to filter out bad estimates?

      This is a problem with the companies selling the AI models, not the customers. It is their responsibility to inform consumers about the limits of their services, and to train the models to say "I don't know, there is not enough information".

    • nclin_ 12 hours ago ago

      He was directly addressing apps that claim to do so, and proving that they can't, with laypeople who might have diabetes as the target audience.

      It's more a data-driven pub test, I think it explains itself well.

      The question is - are those apps actually so simplistic? Or is this a strawman.

      • endymion-light 11 hours ago ago

        I guess that's partially my original point summarised; I'd like to see this criticism levelled at the apps advertising it, rather than at something that doesn't.

    • toasty228 15 hours ago ago

      > But the author just took pictures of food & expected a realistic response?

      There are dozens of ios/android apps with 100-300k+ ratings and god knows how many millions of installs which do exactly this

      "Cal AI - Food Calorie Tracker: Just snap a photo and our smart AI calorie tracker analyzes your meal instantly."

      308k ratings on ios, 264k ratings on androids easily 5-10m installs across both platforms.

    • zipy124 17 hours ago ago

      Honestly it's scary how misunderstood this is by the general public, the media and EVEN scientists.

      There is a shocking number of computer vision tasks where scientists claim you can get X info from a picture of Y when, even with ML/AI, you can't extract data where there isn't any. The fact that I can add an arbitrary amount of high-calorie fat to a meal without changing its appearance shows by definition that it's pointless. A 1000-calorie and a 100-calorie milkshake can look identical, and you'd have no way of working that out from an image even with a super-intelligent system.

      Similarly, I see serious research papers on things like extracting the material of an object from an image of it, which for the same reason cannot be done: how an object looks has very little to do with what it's made of, else painting and other art would clearly be impossible. The information is just not there in the data.

      • mortenjorck 17 hours ago ago

        It’s like CSI “enhance!” AI image upscaling. People will do it, see it fabricated details, and then draw the wrong lesson from it, that “AI fabricates things!” when that is exactly what they asked the model to do and there is no magic math that would extract ground truth that was never in the image to begin with.

    • whstl 15 hours ago ago

      > But the author just took pictures of food & expected a realistic response?

      You say this (and I agree), but I know of quite a few companies in this area, including a couple accelerated by YCombinator, and that's pretty much 100% of what they do in their backend.

    • chromacity 16 hours ago ago

      > There's an incredibly serious lack of education with how LLMs & carb-counting works

      Oh! Do the vendors offer trainings to make sure the users understand how LLMs work? If not, surely, the LLM itself is trained to know its limitations and politely decline in situations like that?...

      The #1 use case for this tech is "here's a problem I don't feel like solving, let's have a computer do magic". It's how it's advertised on TV, and it's how it's promoted in the software I already use. Food preparation? Travel planning? Shopping? Tutoring your children? You can do anything now!

      I just talked to a realtor who will make a killing on a real estate transaction. Instead of offering human insights, they sent me "AI reviews" of several properties. The AI has never been to any of these properties and has no idea what they actually look like. But I guess that's how we operate now as a society.

      If you go to eBay, every other listing description for used items is AI-generated. This is an official platform feature for sellers. The AI doesn't know the condition of the item or what's included or missing. Doesn't matter, it's magic. It's AGI, it will figure it out.

      Most of the uses of AI I encounter as a consumer are like that, and the companies selling this tech are 100% complicit.

    • joelthelion 13 hours ago ago

      In my opinion there is also a deficiency in the models, which should be able to say "I don't know" when asked for something unreasonable.

    • mathgradthrow 14 hours ago ago

      a realistic response? What's a realistic response to "how many calories are in an avocado?"

      If you are counting calories, you don't want the answer to "how many calories are in the average avocado?", you want to know how many calories are in this avocado. Remember that bodyweight is roughly linear with BMR, so a 10% error in calorie counting is an extra 10% of bodyweight.

    • macleginn 17 hours ago ago

      It doesn't really matter if the model cannot make a good educated guess about calories in the food if it cannot give a consistent response given the same input.

    • winddude 16 hours ago ago

      As a type 1 diabetic, this is exactly what we do for nearly every meal we eat, especially in restaurants: look at it and try to estimate the number of carbs.

    • SirMaster 15 hours ago ago

      How about instead of blaming the user for not understanding how AI works, the AI makers stop letting their chatbots answer questions so confidently that they clearly can't answer...

      If I ask the AI about some health issue, it says something along the lines of a warning that it's not a doctor, etc. So if I show it a picture and ask it to tell me the carbs, how about a warning telling me it can try, but that it probably won't be very accurate?

    • ilivethere 17 hours ago ago

      > But the author just took pictures of food & expected a realistic response?

      Outside our tech-enabled bubble, there are folks who have been sold the idea that ChatGPT et al. are miracle workers capable of replacing dieticians, gym coaches, psychologists, etc.

      So it's VERY plausible to believe that there are folks out there snapping pics of their meals and asking GPT to spit out nutritional values.

      • endymion-light 17 hours ago ago

        That's a good point - and I think there's a wider core lack of knowledge outside of the bubble.

        I suppose I just expected this study to be a little less 'water is wet', which left me disappointed, but that may be coming at it from a more technical perspective.

    • larodi 17 hours ago ago

      > This entire article would be better suited to astrology.com than hackernews.

      I laughed, but you nailed it. Sadly, so many people lack even a basic understanding of LLMs and the ViT tower that makes one a vision LLM that I expect a whole industry, similar to fortune telling, to emerge out of it.

    • black6 16 hours ago ago

      > There's an incredibly serious lack of education with how LLMs & carb-counting works

      The public's education comes from the incessant marketing from AI companies that their models are the panacea for everything.

    • cyanydeez 14 hours ago ago

      So, do you think his methodology is closer to a computer scientist's, or to someone on Instagram's?

    • jrm4 14 hours ago ago

      This strikes me as a good "meta" article, though. As in, yes, people here probably don't need this. But perhaps a lot of other people do.

    • slumpt_ 15 hours ago ago

      This is how a lot of regular people are engaging with AI, whether you consider it silly or not.

    • sleepybrett 15 hours ago ago

      > There's an incredibly serious lack of education with how LLMs & carb-counting works. This entire article would be better suited to astrology.com than hackernews.

      This is because the people who promote these technologies, and the companies that sell these technologies, engage in a massive amount of puffery (aka hyperbolizing aka just straight telling lies).

      These technologies are painted as the magical solution to whatever problem you have (all it costs you is a few tens of thousands of tokens, aka your water supply). There is literally nothing they CAN'T do, if you will just let us build these gigantic small-town-destroying, noise-polluting, water- and electricity-hungry 'AI data-centers', so that we can use those datacenters to sell you more tokens to put into their slot machines.

    • YeGoblynQueenne 15 hours ago ago

      >> But the author just took pictures of food & expected a realistic response? Is this genuinely what amounts to a study in AI?

      The aim of the study was to understand the variation in results returned by models and how that could cause risks for patients using those models. The main result was measuring within-model variation.

      From the pre-print (https://www.diabettech.com/wp-content/uploads/2026/04/diabet...):

      We aimed to characterise the within-image reproducibility of carbohydrate estimates from four large language model (LLM) vision APIs and to quantify the clinical risk for insulin dosing, stratifying accuracy by reference value quality.

      Methods

      Thirteen food photographs were each submitted 495–561 times to four LLM vision APIs (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro Preview) using an identical structured prompt adapted from the iAPS automated insulin delivery system (26,904 total queries, temperature 0.01). The primary outcome was within-image variation (coefficient of variation [CV], range, distributional normality). Secondary outcomes included accuracy against reference values for nine images, stratified by quality tier (packet label, weighed/measured, portioned, or visual estimate). Clinical risk was translated at an insulin-to-carbohydrate ratio of 1:10.

      >> I'd like to see this study done using any kind of actual grounding knowledge, seeing what mistakes AI makes when attempting to query ground truth from picture analysis - there would at least be an interesting result methodology in that

      The ground truth was established by the author. There's an appendix in the pre-print (Appendix I) that describes the methodology. Methods are described in page 4 of the pre-print:

      Reference values for accuracy analysis

      For nine of the thirteen images, the author estimated the carbohydrate content using methods described in Appendix 1. Reference quality was categorised into four tiers:

      Tier 1 (packet label): Carbohydrate values derived from manufacturer nutrition labelling. Two images (cheese sandwich, soup with bread) used bread with labelled carbohydrate content of 20 g per slice.

      Tier 2 (weighed/measured): Portions directly weighed and cross-referenced with established composition data. Three images (Bakewell tart, bakery cookie, breakfast burrito).

      Tier 3 (portioned): Portions estimated by the author (not weighed) and combined with USDA composition data. Three images (roast dinner, chilli con carne with rice, stuffed pork loin).

      Tier 4 (visual estimate): Portions and composition estimated from visual inspection. One image (churros).

      For the four restaurant dishes (pizza capricciosa, eggs benedict, crema catalana, paella), no reference value was established. These images were used for the primary reproducibility analysis only.

      Carbohydrate values follow the EU convention with dietary fibre excluded.
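
      To make the 1:10 insulin-to-carbohydrate translation concrete: every 10 g of carbohydrate-estimate error becomes one unit of insulin-dose error. A minimal sketch of the CV and dose-risk arithmetic, using made-up estimates rather than the study's data:

```python
# Sketch of the reproducibility/risk arithmetic at an insulin-to-carb
# ratio (ICR) of 1:10. The numbers below are illustrative, not study data.
import statistics

ICR = 10  # grams of carbohydrate covered per unit of insulin
estimates_g = [35, 52, 48, 70, 41, 60, 55, 45]  # hypothetical repeated queries
reference_g = 50                                 # hypothetical ground truth

mean_g = statistics.mean(estimates_g)
cv_pct = statistics.stdev(estimates_g) / mean_g * 100  # coefficient of variation

# Each estimate's dose error relative to the reference, in insulin units:
dose_errors_u = [(e - reference_g) / ICR for e in estimates_g]

print(f"mean {mean_g:.1f} g, CV {cv_pct:.1f}%")
print(f"worst overdose {max(dose_errors_u):+.1f} U, "
      f"worst underdose {min(dose_errors_u):+.1f} U")
```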

    • jmyeet 16 hours ago ago

      > But the author just took pictures of food & expected a realistic response?

      If someone sent me a picture of a meal and asked me what the macros were or how many carbs this is, I would say "I can't tell from a photo. Nobody can". The problem is that current LLM chatbots don't seem to have a concept of telling you "I don't know", "you can't do that" or even "you're wrong".

      You can say that somebody shouldn't trust an LLM for this, but it's going to be a problem that LLMs give nonsensical answers. What I find particularly amusing is that there are still technical people (generally, not anyone specifically) who seem unable to acknowledge that LLMs hallucinate and lie.

      There was a post on here recently that I couldn't find with some quick searching but the premise basically was that chatbots were trained like neurotypical people: A lot of affirmation and basically lying. Separately someone else characterized this NT style of communication as "tone poems" [1]. I keep thinking about that because to me that's so accurate.

      Dunning-Kruger is a common refrain on HN, for good reason. Another way to put this is how often people are confidently wrong. I really wonder if this is an inevitable consequence of NT communication because most neurodivergent ("ND") people I know are incredibly intentional in what they say and mean.

      [1]: https://news.ycombinator.com/item?id=47832952

  • harperlee 17 hours ago ago

    There is a lot of hate in the comments but there is some merit to the post existing:

      1. Even if the task is unreasonable, it is good to showcase that the LLM will perform poorly - warning not to be used for diabetes.
    
      2. As it is a probabilistic model, the approach was to execute it multiple times and look at the distribution. They also tried to minimize variance: "All at the lowest randomness setting these models offer.", the post mentions. Yet the variance of the responses is surprising.
    
      3. A multimodal LLM should be in general able to discriminate between crema catalana and a cheese sandwich, and provide a textual, uncalculated range of how much calories the item has (internet is full with tables for calorie counting and things such as this https://fitia.app/calories-nutritional-information/cheese-sandwich-1205647).
    
      4. It is not clear whether the "exposé" surprised/outraged style is just a communication vehicle or whether the author really thought that e.g. LLMs could hypothetically provide meaningful confidence estimates.
    • bcjdjsndon 16 hours ago ago

      Re: 2... I think it's interesting that they add arbitrary randomness to the algorithm. Without that randomness, the problem of wildly varying outputs for the same input wouldn't exist in the first place.

      • gblargg 15 hours ago ago

        Sounds like a kind of dithering to spread out errors and avoid getting stuck.

    • tantalor 14 hours ago ago

      > LLM will perform poorly

      We've been seeing examples of this constantly since 2022. How many more do we need?

  • jaccola 17 hours ago ago

    It’s just an impossible problem. Photons don’t provide sufficient information to determine calories (at least not in any way they could practically be captured). Inside that sandwich could be drenched with olive oil or it could be hollow cheese with lettuce. It’s impossible to tell.

    • 2ndorderthought 17 hours ago ago

      The average person has no idea this is true, and the average person cannot tell when it is the case. So we have a bunch of people making their way through school and then, when they get stuck, relying on AI. The future is gonna be wild.

      • lordleft 17 hours ago ago

        Yep. And it doesn't help that the people selling AI products act as if they're going to build God. Going, "well AI can't do that" isn't going to fly when you are lax about communicating its limitations!

        • 2ndorderthought 17 hours ago ago

          It also doesn't help when the messaging is linked to how "there will be no jobs where you use your brain anymore, everything will be automated". What motivation does the average 16-year-old have to try hard and learn anything beyond what they immediately need?

          No jobs, AI Jesus is coming, and if you use AI it will use all of the world's compute power to try to convince you it's correct even when it's not.

      • renticulous 17 hours ago ago

        Here's the technical literacy of the population on display. I love these prank examples, which show the true education of the populace.

        https://www.youtube.com/shorts/B7c9qJcRnVk

        • dredmorbius 14 hours ago ago

          A more robust measurement might be the (former) US Department of Education's "Adult Literacy in the United States" survey, most recently conducted in 2019. The results of this are sobering enough:

          <https://nces.ed.gov/pubs2019/2019179/index.asp>.

          There's a related study of adult technical literacy conducted in 33 OECD nations:

          <https://www.oecd-ilibrary.org/education/skills-matter_978926...>.

          Both show that only a small fraction (5-10%) of adults operate at high levels of literacy (whether of text, numeracy, or technology), and that a large fraction (roughly 50%) operate at a minimal or below-minimal level.

        • fcarraldo 17 hours ago ago

          True education? What idiot would say yes to this?

          Even if you _know_ the debit card transaction is safe, there’s no reason to risk it when a weirdo is filming you with some wild contraption.

        • rcxdude 16 hours ago ago

          Anything like this is going to have a very heavy selection bias, don't take any of this kind of content as a reflection of the average person.

        • WarmWash 16 hours ago ago

          Many of us witnessed the technical literacy of the general population when we ran to show them ChatGPT 3.5 and they just kind of shrugged, like "So? What are you showing me?"

      • engineer_22 17 hours ago ago

        I am asking a lot here, but schools need to be teaching people what AI is, what its weaknesses are, and how to use it... My school taught me to use a calculator. It also taught me how to check my work when I relied on the calculator.

        AI is a very complicated calculator - you give it an input, magic happens, it gives you an output. Really no different, to a layman.

        • jaccola 17 hours ago ago

          To be fair, this should probably be covered by basic physics/maybe cooking classes. “You can’t determine the calories in food by looking at it” isn’t really ML specific.

          • 2ndorderthought 17 hours ago ago

            Won't help much if kids are ai'ing their way through physics then ten years later need to go on a diet having not applied the knowledge possibly ever or exercised their critical thinking skills

        • garciasn 17 hours ago ago

          Considering the lack of basic math skills I encounter each and every day, I don't think schools did enough; they certainly aren't going to do enough w/LLMs.

          • Ekaros 17 hours ago ago

            Knowing the lack of understanding of basic chemistry and physics like fundamental thermodynamics... I have little hope any population can be trained to understand LLMs sufficiently...

        • 2ndorderthought 17 hours ago ago

          It's more complicated than a calculator. Even researchers who have dedicated their lives to the field don't know all of the limitations of any given model. That fact alone isn't helpful when a model is 80% correct in one area but 2% in another.

          • pirates 16 hours ago ago

            If even experts in the field don’t know all of the limitations then it’s even more important to stress that relying on the output of an LLM is a poor choice without additional checking and verification.

            Even with calculators, I was taught that you should double check by hand sometimes to make sure you got it right.

      • lesuorac 17 hours ago ago

        If you're looking for a citation about this, the 1999 Dunning-Kruger paper "Unskilled and Unaware" [1] is about exactly this.

        People who are unskilled at a task are unaware of what that task performed correctly looks like. So somebody who can't count calories is unable to tell that the AI can't perform the task correctly either.

        [1]: https://pubmed.ncbi.nlm.nih.gov/10626367/

        • hombre_fatal 16 hours ago ago

          Fwiw invoking Dunning-Kruger is beyond trite at this point.

          Which is a good thing because it means we can talk like normal humans ("people don't know that it's unreliable") instead of acting like we're making such a profound claim that it needs a citation and psychological dissection.

    • ozgung 16 hours ago ago

      As a human, in the photo of that sandwich I see 4 slices of bread and 4 slices of cheese (distributed unevenly). I have no idea about the weight of the bread, flour type or its sugar content. I don't know the type of the cheese, dimensions of the slices or total amount of cheese inside the bread. I don't know if there is butter or anything else inside. I can guess the size of the plate as a size reference but I can't be sure. Human or AI, it's an ill-posed problem. There can be widely different estimates which can be equally plausible.

      • bcjdjsndon 16 hours ago ago

        But why would the same llm give you wildly different answers EACH TIME you ask?

        • pkaye 15 hours ago ago

          There is a parameter in LLMs called temperature that controls creativity/randomness. If you set it to 0 it makes the model deterministic. I think some LLMs expose this as a tunable parameter.
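
          (Mechanically, temperature rescales the logits before the softmax at the sampling step; at 0 you fall back to a plain argmax. A toy sketch of the idea, not any vendor's actual API:)

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Scale logits by 1/temperature, softmax, then draw one token index.

    As temperature -> 0 the distribution collapses onto the argmax, which
    is why low temperature reduces output variation (though real serving
    stacks are not guaranteed to be bit-for-bit deterministic even then).
    """
    if temperature <= 0:
        # Greedy decoding: deterministic argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(weights)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r < cumulative:
            return i
    return len(weights) - 1

print(sample_with_temperature([2.0, 1.0, 0.1], 0))  # always 0 (the argmax)
```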

          • muwtyhg 14 hours ago ago

            The study used a temperature of 0.01.

            > "Thirteen food photographs were each submitted 495–561 times to four LLM vision APIs (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro Preview) using an identical structured prompt adapted from the iAPS automated insulin delivery system (26,904 total queries, temperature 0.01)"

          • jihadjihad 13 hours ago ago

            > If you set it to 0 it makes the model deterministic.

            No, it doesn't. It can help make the model more deterministic, but it does not guarantee it.

            • azakai 12 hours ago ago

              The hardware can also add nondeterminism. GPUs reorder operations, leading to different results.

              Vendors might also be running A/B testing or who knows what, even when you ask for a temperature of 0.

              But, if you run a fixed model with temperature 0 on your local CPU, it will be deterministic (unless there are bugs).

        • zdragnar 16 hours ago ago

          Because that's how they work? They aren't knowledge machines, they are random generators.

          • bcjdjsndon 16 hours ago ago

            They're next word predictors. They explicitly add randomness at the sampling stage, otherwise it'd be too obvious it's not actually intelligent and just a next word predictor

    • drrotmos 16 hours ago ago

      It is and it isn't. If you ask a human how many calories (or carbs) are in that sandwich, they can give you a qualified guess based on how a sandwich like that is typically constructed. They may not know the calories for a slice of bread or a slice of cheese by heart, but if you give them a food database, they can look it up.

      They absolutely won't be 100% correct (bread sizes e.g. are going to be an estimate), but unless it's a trick sandwich drenched in olive oil or with hollow cheese, they're probably going to be in the right ballpark.

      I don't think it's outside the realm of possibility for an LLM to be in the right ballpark as well, but that doesn't seem to be where we're at now.

      • muwtyhg 14 hours ago ago

        And furthermore, once the person has "determined" how many calories the sandwich contains, they are likely to give you the same answer next time you ask instead of randomly changing their minds.

    • YeGoblynQueenne 15 hours ago ago

      It's not impossible to tell. Diabetics and others with dietary restrictions have to do that sort of thing every day to decide what they can and can't eat. If you pick up a loaf of bread at the baker's, the baker usually has no idea how much carbs, salt, or sugar is in that loaf of bread. Try it. Ask the baker: "how much carbs in this loaf of bread?". They'll just stare at you. They can tell you whether the loaf has salt or sugar in it but can't tell you how much, because they don't calculate the amounts per loaf. So if you have dietary restrictions you have to know what you can and can't eat, and that requires the ability to judge the contents of a piece of food from the way it looks.

      Photons don't carry that information? Sure. But you don't just have photons to go by. You can rely on a large database of prior knowledge about how food is usually made and with what ingredients.

      Other people who have to rely on their imperfect human senses to decide what they can and can't eat: people with allergies, people with heart problems, hypertensives, kidney patients, etc. etc.

    • bryanlarsen 17 hours ago ago

      The question isn't about calories, it's about carbs. Drenching that sandwich in olive oil won't change its carb count. From the picture it's a thin cheese sandwich -- we can see cheese and we can see it's thin enough that there's little else. Might be no butter, might be lots of butter, but that won't affect carb count. If there's lettuce in the sandwich there's likely a negligible amount. Hand it to a knowledgeable human and you're going to get a very consistent carb reading -- 30g, the value of two slices of wonder bread.

      It could be much different -- it could be one of those breads with weird macros, or fake cheese, or it could be hollowed out and packed full of hidden vegetables. But a human is going to give you the answer for two slices of plain white bread.

    • beached_whale 17 hours ago ago

      From personal experience, one can get close enough in practice that the error isn't going to be significant compared to the errors in insulin-to-carb ratios/sensitivity factors/...

      I am pretty good at this and the cheese sandwich example threw me, I would have estimated around 10-15g of carb for each slice. So the 28g is fairly consistent with that, not 40g. The only real way would be to weigh it and use the labeling. Another thing that often gets people is the labeling often has a serving size of say 2 slices and a weight that does not reflect the actual weight of 2 slices.

      Luckily with good tools the significance is reduced, people using closed loop insulin pumps will automatically correct for that. Lots more room to wiggle.

    • Ekaros 17 hours ago ago

      Then it should refuse to answer 100% of time.

      • falcor84 17 hours ago ago

        I don't think refusal is the right approach. I would much prefer that it respond with something like:

        > There is not enough information to make an accurate estimate, but if you'd like, I can take a stab at it. If so, how much effort to put into it?

        > Yes, go ahead and spend up to 5mins and $1 to analyze it.

        > Done, I've had 100 subagents analyze the image and have arrived at a 95% confidence interval of the portion containing ...

        • muwtyhg 14 hours ago ago

          I know this is just an example but my eyes kind of bugged out thinking about paying $1 every time I want to estimate the calories in my sandwich.

      • jaccola 17 hours ago ago

        Indeed, I think any reasonable human might say “A few hundred calories but without measuring the ingredients I might be way off”. I think LLMs could get there, I don’t see anything stopping that. Though they have been notoriously bad at this so far.

    • dredmorbius 13 hours ago ago

      If the problem is so evidently impossible then the LLM itself should recognise this, state that the problem isn't solvable, *not* provide what's certain to be an inaccurate result, and suggest better approaches to arriving at a reasonable answer.

      That said, it's notable that diabetes education materials often suggest estimating glycemic loads by rough portion size / plate ratios. Which is to say that absent accurate weight measurements (themselves subject to variations in ingredients, moisture levels, etc.) current clinical recommendations are themselves pretty rough.

    • Aurornis 17 hours ago ago

      That’s exactly the point of this article.

      Many of the comments here assume the authors are stupid and were surprised by the result, but the point of the article is to inform readers that AI carb counting apps don’t work. That’s why they did the study.

    • jeroenhd 17 hours ago ago

      It's not even impossible from a technical point of view.

      Your cheese sandwich may contain a lot more or a lot less calories, even if you take the numbers from the packaging and calculate the correct ratios by weight. The calories on the label are based on an average and individual packages may contain more or less of any listed nutrient to some margin. Of course, counting calories is meaningless if not done on a long-term scale anyway, but on a long-term scale the LLM doesn't need to guess the correct amount either.

    • unsupp0rted 17 hours ago ago

      And what if that guy in the surveillance video is just 2 kids in a trench coat? There's no way for AI to be sure from the photons: we should scrap it.

    • ge96 17 hours ago ago

      I was thinking that at least with an advanced phone with lidar, like iPhones, you could get volume, but yeah, the hidden/inner mass is a problem, plus the oil as mentioned

    • tsimionescu 17 hours ago ago

      This is a bad take. If LLMs are supposed to work as general purpose assistants, as they are being sold as by both the companies making them and by the majority of AI believers, then it is very much a solvable problem. The LLM could give a high level estimate (a sandwich is not going to be 0 Cal, and it's not going to be 5000 Cal, so you can give some kind of range), and then ask for the type of information needed to make a more accurate estimate.

    • p-e-w 17 hours ago ago

      Then the correct answer is “I can’t tell.”

      Not “Here’s a random guess that I just pulled out of my ass.”

      LLMs have picked up, from scientists, the bad habit of trying to give an answer when no answer can be given; overall, scientists don't say "I don't know" nearly as often as they should.

      • jeroenhd 17 hours ago ago

        I tried asking LLMs about food before. They all say "I can't tell for certain, but this is an estimate based on the ingredients I can spot/infer/guess".

        You need to write a specific prompt to avoid any warnings.

        Of course a lot of people don't know what limitations LLMs have, so there's some value to a blog post about it, but it's not as black-and-white as the article might suggest with its graphs.

        The prompt (documented here: https://www.diabettech.com/wp-content/uploads/2026/04/Supple...) lists specific instructions and a specific output format that doesn't allow the LLM any room for explanation or warning in processable data (only in notes fields). In fact, the prompt explicitly tells the LLM to ignore visual inferencing for some statistics and to rely on a nutrition authority instead.

        Even in that intentionally restricted format, the English language output uses words like "roughly" and "estimated" in the LLMs I've tested.

        Sure, if you take the numeric values and plot them in graphs, you get wildly inconsistent results, but that research method intentionally restricts the usefulness and reliability of the LLMs being researched.

        What's much more troubling is this line from the preprint:

        > The open-source iAPS automated insulin delivery (AID) system now offers food analysis through APIs from OpenAI, Anthropic and Google [8]

        The linked app does seem to have a disclaimer, though:

        > "AI nutritional estimates are approximations only. Always consult with your healthcare provider for medical decisions. Verify nutritional information whenever possible. Use at your own risk."

      • Ukv 17 hours ago ago

        > Then the correct answer is “I can’t tell.”

        From the paper they're using structured JSON schema mode opposed to freeform answers, so it can't. Models do typically caveat their answer for questions like this, in my experience.
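            For context, "structured JSON schema mode" means the API constrains the output to match a schema, something along these lines (field names here are my guess at the study's format, not the paper's actual prompt):

```python
# A guessed-at schema in the spirit of the study's structured output;
# the field names are illustrative, not the paper's actual prompt.
carb_schema = {
    "type": "object",
    "properties": {
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "carbs_g": {"type": "number"},  # must be a number: no room for "I can't tell"
        "notes": {"type": "string"},
    },
    "required": ["confidence", "carbs_g", "notes"],
    "additionalProperties": False,
}

def fits(schema, obj):
    """Minimal validity check: required keys present, numbers where demanded."""
    if not isinstance(obj, dict):
        return False
    if any(key not in obj for key in schema["required"]):
        return False
    return all(
        isinstance(obj[k], (int, float))
        for k, spec in schema["properties"].items()
        if k in obj and spec["type"] == "number"
    )

ok = fits(carb_schema, {"confidence": 0.9, "carbs_g": 40, "notes": "rough guess"})
bad = fits(carb_schema, {"confidence": 0.9, "carbs_g": "can't tell", "notes": ""})
```

            A refusal simply isn't representable; the only escape hatch is the free-text notes field.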

        • professoretc 16 hours ago ago

          They'll qualify their answers in English but as the article mentions, if your prompt asks for a confidence score, that "uncertainty" doesn't translate into low numerical confidence.

          • Ukv 16 hours ago ago

            Quantifying their own confidence is also something they're not good at, and which the format would prevent them from refusing to do or preceding with a caveat if that's what you'd want of them. Particularly since the response format seems backwards - giving confidence, then carbs estimate, then observations/notes, rather than being able to base carbs estimate off of observations/notes and then confidence estimate off of both of those.

            > They'll qualify their answers in English but [...]

            That the default user-facing chat, as a normal user would use it, gives a warning is the key part IMO. I don't think expectations of there being no "wrong way" to use the model can necessarily extend to API usage with a long custom system prompt and restricted output format.

      • agentultra 17 hours ago ago

        LLMs have no agency to choose such a course of action.

        They’re algorithms and they were designed this way.

    • badgersnake 15 hours ago ago

      Correct. But why doesn’t the AI just say that.

    • therobots927 17 hours ago ago

      Why is the AI answering questions without answers then?

      • pohl 17 hours ago ago

        could be because, at the end of the day, it's just predicting the next likely token

        • SirMaster 15 hours ago ago

          The next likely tokens for a response to a question that can't possibly be answered from the context should be "I" followed by "don't", followed by "know".

        • therobots927 16 hours ago ago

          The paper millionaires did NOT like your comment.

  • Aurornis 17 hours ago ago

    This will surprise nobody here, but it’s important to communicate to audiences that are new to LLMs.

    This is targeted at people with diabetes because there are AI carb counting apps appearing in app stores

    > If you’re using AI carb counting in a diabetes app

    These apps are probably not even using the mainstream models used in the study because they would be too expensive for cheap or free apps, and they’re probably forcing structured output to get a response without any of the warnings that an LLM might include if you ask it directly.

    • tantalor 14 hours ago ago

      Who in Al Gore's Internet is new to LLMs in 2026?

      • Aurornis 11 hours ago ago

        Most of the world.

  • rsynnott 17 hours ago ago

    I am... unsure why anyone would think LLMs would be able to do this. They are not magic oracles. Like I think even most humans would be extremely bad at this.

    Like, are people actually using LLMs for this? Please do not, it won't work.

    • kioleanu 17 hours ago ago

      Yes, people are using LLMs for this because that is how they've been marketed: on one hand as a personal assistant able to handle everyday tasks, on the other as researchers able to solve old problems that humans couldn't crack.

      Does the model say it can't do that when asked? No, it answers confidently.

      Also it's easy to trust it if you don't know how it works

      • drtz 17 hours ago ago

        Would people really trust their personal assistant to tell them how many calories are in a sandwich just by glancing at it on a plate? I'm doubtful, and I would also expect a diabetic to be even more skeptical.

    • jihadjihad 17 hours ago ago

      > They are not magic oracles.

      I came across a LinkedIn post a couple days ago where someone had asked ChatGPT, "What are the top things you get asked about $NICHE_INDUSTRY_THING_I_AM_SELLING?"

      As if there is introspection like that at the meta level, where ChatGPT could actually provide hard numbers around its own usage and request patterns.

      The fact that these products work with natural language beguiles people into thinking they are, indeed, magic oracles.

      • Ekaros 17 hours ago ago

        This is the weird intersection where I think that data might exist and an LLM might be able to query it. But no company would ever give it out. So the bot would not have access to it.

    • pjc50 17 hours ago ago

      > They are not magic oracles.

      Anthropic's trillion dollar valuation hinges on the idea that it is just that, a magic oracle that can replace any worker for any type of task. Any programmer, any author, any musician, any kind of clerical work. All we've asked here is "sudo evaluate me a sandwich", the sort of estimation task that humans with internet resources might reasonably be expected to do, and it's given up?

      (It would be fun to compare this to sending the picture out on Mechanical Turk and asking humans to eyeball the calorie count of said sandwich...)

    • Nicook 17 hours ago ago

      You are severely overestimating the average, or even above-average, understanding of LLMs.

      • bluefirebrand 17 hours ago ago

        Not to mention the fact that LLM marketing is trying to convince us that they can do anything

    • acchow 17 hours ago ago

      Most people are convinced LLMs can do this.

      Cal AI, which claims to generate a nutritional breakdown based off a photo, has $30 million in annual recurring revenue.

      • rsynnott 17 hours ago ago

        ... Bloody hell. I mean that's basically fraud, surely. It is _not possible to do this even vaguely accurately_.

    • faangguyindia 17 hours ago ago

      It’s because AI can debug a programme, so people start thinking it can do fitness and health stuff too. But the thing is, there is no “instant-reacting compiler” for health or fitness. Things change over a long time; by then the AI will have run out of context or lost the data from its cache, or the user may have got bored and deleted their account.

    • PUSH_AX 17 hours ago ago

      It’s worse, I bet there are apps in the App Store that do this, the users just have no idea on the accuracy

    • jeroenhd 17 hours ago ago

      https://xkcd.com/1425/ strikes again.

      As far as consumers know, LLMs can identify the towns where pictures were taken (without metadata), can summarize entire movies, generate clips of your kid flying a rocket to the moon, and can translate images from any language imaginable, but somehow they cannot estimate the calories in a cheese sandwich.

      The supposed professional posting about an LLM deleting their prod database for their non-existent company asked the AI to explain itself. That's the level of LLM knowledge you should expect from most people that actually work with these tools.

    • heysoup 17 hours ago ago

      They sold the idea that LLMs "have" information. That the LLM "is" intelligent.

      Truth is the LLM is good at making intelligent decisions. But in order to make intelligent decision, you need context.

      If you give proper context -> ask the LLM -> get almost perfect result every time.

      Anything else is rolling dice, a very special type of dice, but dice anyhow. Not magic.

    • tarkin2 17 hours ago ago

      OpenAI etc. are, however, advertising them like they are magical oracles, on the verge of lifting humanity to the next phase of civilisation. The idea that the majority of users know what nondeterministic even means is a massive, massive ask.

    • ambicapter 17 hours ago ago

      If the LLM can correctly identify a food item some high percentage of the time, why would it be magic for it to guess the amount of calories in an object? It's perhaps a lookup and some simple math as an extra step.
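      The "simple math" step really is trivial, if (big if) the identification and portion sizes were reliable. A toy sketch with made-up per-100g values (a real app would query a food database like USDA FoodData Central instead of hard-coding them):

```python
# Made-up per-100 g carb values for illustration; a real app would pull
# these from a food database rather than hard-coding them.
CARBS_PER_100G = {
    "white bread": 49.0,
    "cheddar cheese": 1.3,
}

def estimate_carbs(portions):
    """portions: (food_name, grams) pairs, as the vision step would output them."""
    return sum(CARBS_PER_100G[name] * grams / 100 for name, grams in portions)

# Two ~40 g slices of bread plus ~40 g of cheese:
total = estimate_carbs([("white bread", 80), ("cheddar cheese", 40)])  # ~39.7 g
```

      The arithmetic is the easy part; the grams are exactly what a photo doesn't give you.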

    • kdheiwns 17 hours ago ago

      They're marketed as AI. AI has a long standing image built up by movies and other media of being some omniscient computer capable of analyzing the world. These AI companies are very aware of this and leverage it.

      And a person with sufficient knowledge could easily give a rough estimate of the calories. A slice of store bought sandwich bread of a given thickness generally has calories within a certain range. So do cheese slices. It's elementary school health class material. We all learn how to calculate calories in a meal. Packaging on food also always has calories, so clearly people know how to estimate it fairly accurately.

      If a fifth grader can calculate it but an AI can't, that says a lot about how bad these AIs are. We'll get another series of paid and bought articles saying "AI analyzed IMPOSSIBLE math problem beyond human comprehension and solved it with FACTS and LOGIC", while at the same time being told "bro no you can't expect an ai to calculate calories in a sandwich bro that's impossible bro if you even try that then you're insane for even thinking ai should be used that way bro". These companies need to decide: is AI smart enough to solve hard questions, or is it too useless to calculate something any kid could do by googling calories in a slice of bread and doing some basic arithmetic?

      • rsynnott 17 hours ago ago

        > Packaging on food also always has calories, so clearly people know how to estimate it fairly accurately.

        That's not done by looking at it and guessing (or at least it _shouldn't_ be; manufacturers have been known to do this but it's bad practice and may cause them regulatory problems). In an ideal world it's done with one of these: https://en.wikipedia.org/wiki/Calorimeter ; less ideally it can be estimated based on the ingredients.

    • throwaway260124 17 hours ago ago

      But nothing prevents LLMs from being RLed to do this, right?

      But does training LLMs to be better at this improve their world model, or does it only make changes at the surface?

      • vidarh 17 hours ago ago

        Yes, something prevents llms from being RLed to do this: You can't see through something opaque to determine whether there's something high calorie or low calorie out of sight.

        The problem itself is unsolvable given the data provided.

        You could conceivably make it better at making guesses, but they will inherently always be guesses that will sometimes be wildly off.

        • pjc50 17 hours ago ago

          > You can't see through something opaque to determine whether there's something high calorie or low calorie out of sight

          https://www-users.york.ac.uk/~ss44/joke/3.htm "There is at least one field, containing at least one sheep, of which at least one side is black."

      • ben_w 17 hours ago ago

        Estimate the calorie count of this door handle: https://m.youtube.com/watch?v=VDSzY52Mkrw&pp=0gcJCVACo7VqN5t...

        Extreme example perhaps, but no, you can't just turn pixels into calories. Right now I'd be impressed if we could reliably estimate volume to within 30% from a photo, but even with that correct the contents of the food can easily be way off without visible sign.

      • rsynnott 17 hours ago ago

        Okay, so take the sandwich. There is no way to know what is in it by looking at it. No amount of optimisation will fix this.

        I'm sure one could produce a CV model that was a lot better at guessing here than these LLMs are, but fundamentally it is still guessing.

    • AndrewKemendo 17 hours ago ago

      The vast majority of people using LLMs in my experience use them as though they are Oracles

      They are surprised and upset when the Oracle is not perfect

      Go ahead and search around on hacker news you’ll see precisely the same pattern with people who are ostensibly engineers and hackers

      It’s actually pretty mind boggling but then again humans never fail to surprise and disappoint

    • sjsdaiuasgdia 17 hours ago ago

      Some people are asking LLMs what's on the menu of restaurants they are actively sitting in, possibly with a menu on the table in front of them.

      Some people have a very poor understanding of what LLMs are good for. Some people do see them as magic oracles.

    • hansmayer 17 hours ago ago

      I mean, people will shamelessly paste you a wall of text from an LLM while chatting with you to prove a point, probably thinking how they've outsmarted you now...

    • Jtarii 17 hours ago ago

      >I am... unsure why anyone would think LLMs would be able to do this.

      Well, firstly, the average IQ is 100. And also because people market products to consumers that claim to be able to count carbs from images. If you don't know the limitations of LLMs then there would be little reason to doubt it, for an uninformed or below-average-intelligence person, of which there are hundreds of millions.

  • axlee 17 hours ago ago

    "Crema catalana: Three of four models called it “creme brulee” 100% of the time. Only Gemini 3.1 Pro got “crema catalana” — in 3.4% of queries."

    ----

    Wikipedia for Crema catalana:

    Crema catalana (Catalan for 'Catalan cream'), or crema cremada ('burnt cream'), is a Catalan dessert consisting of a custard topped with a layer of caramelized sugar.[1] It is "virtually identical"[2] to the French crème brûlée. It is made from milk, egg yolks, and sugar. Crema catalana and crème brûlée are made in the same way.

    ---

    Oh no, my AI can't detect that an obscure clone of a famous dish is indeed the obscure clone, and not the commonly known version.

    • comes 13 hours ago ago

      The difference between them is that crème brûlée is made with heavy cream instead of milk and it tastes better. But my Catalan friends would kill me for this blasphemy… so you didn’t hear that from me.

      They are both covered by burned sugar and therefore indistinguishable(!) visually.

    • tdeck 17 hours ago ago

      In high school my Spanish teacher told us that Crema Catalana was the Spanish name for Creme Brulee.

  • ozbonus 17 hours ago ago

    Before the next galaxy brain shows us all how smart and witty they are by adding the nth sarcastic comment about how obvious this result is, I hope they'll take a moment to consider a few things.

    Yes, people are using LLMs for this kind of thing. Lots of people. All the time. I've met plenty of them and there are loads of apps that offer this kind of "service". The authors are well aware that people are doing this and probably anticipated the result.

    Why do the study at all? Because it's important to demonstrate and measure things, even obvious ones. Because it's not obvious to everyone, like the people who are already consulting LLMs for dietary information to manage their health. Because it's easier to enact official policies when there's hard evidence.

    • wrqvrwvq 14 hours ago ago

      Bizarre thread so far. Some threads seem to attract a certain type of boosterism or opinion management. People all rush in to say similar things, without reading the prior comments. Seems designed to wash thoughts into a stream. Maybe coordinated PR or reputation management. It could even be organic but it doesn't seem that way.

  • nextlevelwizard 17 hours ago ago

    I used LLMs to count calories, but not based on photos, I mean I also did that, but primarily I fed in my exact ingredients and then used weights to get calorie estimates.

    Was it always correct? Certainly not. But it helped me lose 30kg of weight since keeping even some track of calories was so much easier with LLM than any app I had used before.

    Also of course it didn’t matter if I was exactly on point since it wasn’t about any kind of medicine

    • edu 17 hours ago ago

      Curious, why it was easier to use an LLM vs a non-AI app with a DB of foods?

      Seems that in this case a traditional approach would be more precise and more environmentally efficient to get to the same results.

      • nextlevelwizard 17 hours ago ago

        Any app I have used before has asked me to look up the foods and add them manually and usually there has been ads or subscriptions involved.

        Much easier for me to take pictures of the packets while making the food, then weigh the final bulk product, and then when I eat just weigh the plate and say “500g of casserole” and the LLM spits out the calories and keeps track of the daily consumption

        • jerkstate 16 hours ago ago

          Nice. I vibe coded a similar kind of system: you can dump a recipe into the chat window and it will use tool-calling to look up macros for any foods it doesn't have in the DB and put them in, estimate raw -> cooked changes in nutrition and weight (if needed), and estimate the total weight of the cooked product and macros per gram (e.g. it writes a 100 gram serving to the DB, and you can scale it up and down and it scales the macros linearly). Similar to you, I have used this app to alter my macro mix from high-fat to high-carb (for workout performance) and cut my sodium from ~4g/day to ~2.4g/day by interrogating the DB about what foods I should eat more and less of. I found some surprising wins in my habitual diet that were easy to change to hit my health targets, and looking up and logging these things by hand without LLM assistance would have been too tedious and time-consuming for me to keep doing for as long as I have been (maybe 3 months now)
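          That linear scaling step is the one genuinely reliable part of the pipeline; assuming per-100g macro entries in the DB, it's just:

```python
def scale_serving(per_100g, grams):
    """Scale a stored per-100 g macro entry to an arbitrary serving weight."""
    return {macro: value * grams / 100 for macro, value in per_100g.items()}

# A hypothetical DB row for a casserole, scaled to a 250 g serving:
row = {"carbs_g": 12.0, "protein_g": 8.0, "fat_g": 5.0}
serving = scale_serving(row, 250)  # {'carbs_g': 30.0, 'protein_g': 20.0, 'fat_g': 12.5}
```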

          Curious, what model are you using? I have found Qwen Flash to be really great for this - tool calling works well, it's smart enough, and very cheap.

        • tcoff91 17 hours ago ago

          Are you giving the LLM the weights of the ingredients as you go? Sounds like a great system.

      • tcoff91 17 hours ago ago

        The data entry is a pain in the ass with those apps when cooking food from scratch. It’s much much easier with LLMs and natural language and voice mode and pictures of a food scale and things like that.

  • mattnewport 16 hours ago ago

    Ironic that they used an LLM to write the article:

    > 42.9 units of insulin from a single photo. That’s not a rounding error. That’s a potential fatality.

    • thedanbob 15 hours ago ago

      The vast majority of "AI screwed up" posts I've seen on HN have been written with AI.

  • zamadatix 16 hours ago ago

    The title seems to be clickbait (the answer ranges for the 13 foods in the paper weren't even wide enough for such a title to be possible) but the results/paper are much more on point.

    It'd be really interesting if it evaluated humans on the exact same image sets. The correct answer is just to feed in more data, such as the exact food itself, but the post makes it sound like using a model is the only risk in this approach to counting carbs.

  • jasonkester 17 hours ago ago

    LLMs seem really bad at reading numbers and reporting them back. I'm building a game, and to see how well its docs were being indexed, I tried asking simple questions to ChatGPT, Gemini, whatever Microsoft's thing is, etc:

    “What is the armour value for the Leather Shirt in the game Stravaeger?”

    It confidently got it wrong.

    “You can find the game at https://stravaeger.com”

    Different confident answers, also wrong.

    “You’ll find it in a table on this page: https://stravaeger.com/docs.html?inventory_item=LEATHER_SHIR...

    Oh, sorry. I was inferring from other similar games. Here is a different confidently wrong number.

    “It’s also in the .json file linked on that page”

    And another wrong value. Random numbers should have got it right by now, but no. And the confident, authoritative tone never changed. Every model I tried was the same story.

  • amazingamazing 17 hours ago ago

    With mass information you could infer much more from pictures. With some sort of standard cube in the picture as well as taking a picture at an angle that emphasizes all three dimensions you could also better estimate the relative volume.

    It’s tractable I think, but not from a pic alone.
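    The standard-cube idea is just pinhole geometry; a toy sketch (assuming the cube and food sit at roughly the same distance from the camera):

```python
def px_per_cm(cube_px, cube_cm=2.0):
    """Calibrate image scale from a reference cube of known size."""
    return cube_px / cube_cm

def bounding_volume_cm3(w_px, d_px, h_px, scale):
    """Crude bounding-box volume from measured pixel extents."""
    return (w_px / scale) * (d_px / scale) * (h_px / scale)

scale = px_per_cm(100)                           # a 2 cm cube spanning 100 px -> 50 px/cm
vol = bounding_volume_cm3(500, 400, 150, scale)  # 10 x 8 x 3 cm = 240 cm^3
```

    Even with the volume right, though, density and hidden ingredients (the olive-oil problem above) still decouple volume from carbs and calories.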

    • jaccola 17 hours ago ago

      Yes one could potentially increase accuracy greatly. One big problem would be occlusion.

      There is already a solution to this that would be very hard to beat (and one can choose to use or not use an LLM to assist): prepare food yourself and use the information provided by the manufacturer.

      • amazingamazing 17 hours ago ago

        If you consider time at all, what you suggest is hardly a solution. It is the most accurate, but even 50% accuracy at orders of magnitude faster to calculate would be more useful for the main use case, which is losing weight.

        However for diabetes accuracy is likely preferred and I’m not sure any computer vision would be palatable.

    • Centigonal 13 hours ago ago

      maybe, but not always. I could make two identical-looking sandwiches with very different calorie content by changing the type and quantity of sauce on the inside of the bread. I could give you two "pasta with creamy sauce" dishes that look similar on camera but have different macros by partially swapping Greek yogurt for heavy cream. Dropping a couple tbsp olive oil into my marinara sauce does wonders for flavor but barely affects appearance when plated. Same with lard in my refried beans.

  • gus_massa 16 hours ago ago

    Let's start with the wrong title:

    > I Asked AI to Count My Carbs 27,000 Times. It Couldn’t Give Me the Same Answer Twice.

    If you look at the image https://www.diabettech.com/i-asked-ai-to-count-my-carbs-2700... it clearly shows some repeated values. I guess AIs like multiples of 5 or 10 or something. It would be nice to look at the raw tables.

    > A cheese sandwich on a plate. Here’s one that should be easy. Two slices of thick white bread (carbs on the packet: 20g per slice) plus cheddar cheese (negligible carbs). Reference value: 40g. Simple, unambiguous, packet-label accuracy.

    Real cheese or fake cheese that is actually flour paste with gum and colorant? Does it have mayo? I like mayo! Real mayo or fake mayo that is actually flour paste with less gum and another colorant? Does it have a slice of ham that is totally covered by the bread? Real ham or illegal fake ham that is actually some ground pork with flour paste, more gum, and yet another colorant?

    > The models don’t always know what they’re looking at. [...] Crema catalana: Three of four models called it “creme brulee” 100% of the time. Only Gemini 3.1 Pro got “crema catalana” — in 3.4% of queries.

    Can someone from Europe tell me the difference? I like it (at least one of them), and I eat it from time to time (like once a year, in a restaurant), but looking at the Wikipedia page of both I can't tell the difference.

    • comes 11 hours ago ago

      The difference between them is that crème brûlée is made with heavy cream instead of milk, and it tastes better. But my Catalan friends would kill me for this blasphemy… so you didn’t hear that from me.

      They are both covered by burned sugar and therefore indistinguishable(!) visually.

  • arjie 16 hours ago ago

    This is pretty interesting. Not the content, but the technique. I suspect this was an entirely automated pipeline with Claude Code or Codex and that the author then just unleashed one of the commercial harnesses on the entire flow of querying the APIs and writing the post, including the headline. We've clearly reached the point in AI writing where a small set of inputs can create content that humans enjoy participating in discussion of. Good show.

  • recursivedoubts 17 hours ago ago

    > You’d expect the same answer each time. It’s the same photo, the same model, the same question. But you won’t get the same answer. Not even close — and the differences are large enough to cause a hypoglycaemic emergency.

    No you wouldn't, not if you have a basic understanding of how LLMs work and what "temperature" is. They are stochastic algorithms picking the next token based on a highly structured (and often very useful) coin flip.
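
    For the curious, the "coin flip" is just softmax sampling over the model's logits. A toy sketch (not any vendor's actual decoder; the logits here are made up for illustration):

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Pick the next token from a dict of token -> logit.

    With temperature > 0 the pick is a weighted coin flip, so
    repeated calls on identical inputs can return different tokens.
    """
    if temperature == 0:
        # Greedy decoding: always the single most likely token.
        return max(logits, key=logits.get)
    # Softmax with temperature: higher T flattens the distribution,
    # making unlikely tokens more probable. Subtract the max logit
    # for numerical stability.
    m = max(logits.values())
    weights = {t: math.exp((l - m) / temperature) for t, l in logits.items()}
    total = sum(weights.values())
    r = random.random() * total
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # guard against floating-point leftovers

# Hypothetical logits for the next "grams of carbs" token.
logits = {"40": 2.0, "35": 1.5, "50": 1.2, "28": 0.8}
samples = [sample_next_token(logits, temperature=0.8) for _ in range(20)]
```

    At temperature 0 this collapses to greedy decoding. Even then, hosted APIs may not be bit-reproducible in practice (batching and GPU kernel scheduling introduce their own floating-point variation), which is part of why the study's near-zero temperature runs can still differ.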

    • sluck 16 hours ago ago

      just came here to read this thanks faith in discussions restored

  • Centigonal 15 hours ago ago

    Context: there are a lot of very popular apps (e.g. Macrofactor) that are being promoted on social media and downloaded for exactly this feature (calculating nutrition based on pictures of food). The users don't understand that this is an impossible task. This is a scam that affects people's well-being, and it's good that there's data proving it.

  • fabian2k 17 hours ago ago

    It does sound like a pretty terrible idea to try to count carbohydrates from an image. There just isn't enough information there to reliably do that. At best you could identify the object in the image and then show reference information on typical nutrition values. But if you need anything more accurate than that, you probably have to read the labels on the ingredients and calculate.

  • a7fort 17 hours ago ago

    Finally we have a simple way to get machines to generate a truly random number

  • raymondgh 16 hours ago ago

    To the defense of the models, the experiment was run with temperature set at 0.01 which is very low; setting this can lead to weird responses. My find-on-page also found no mention of “thinking” or “reasoning” in the paper. Not trying to discount the whole thing but very curious how changing the parameters might affect results

  • DontchaKnowit 17 hours ago ago

    Not remotely surprising to anyone who's ever counted calories or carbs

  • techcode 16 hours ago ago

    I've seen/noticed this simply from being on a low carb (aka KETO) diet.

    Besides the AI grossly over/under-estimating values even when you give it a photo of the packaging with the nutritional table and tell it the weight you used.

    The other thing that surprised me, at least until I read up on how LLMs actually work, was how confidently it would BS you on your daily total.

    Even when the chat/messages are just "Ate ABC with XYZ values, what's my daily total?"

    While I guess a new chat for each day, or some MCP for storing and retrieving records/meals, would've helped with those daily totals.

    The total would still be wrong - unless you explicitly specified each of the values you need to track (e.g. carbs, fat, protein, kcal) to be put into records.

    At which point, of course, you're not really using an AI/LLM but basically a CRUD application.

  • sarusso 17 hours ago ago

    For context: a LOT of people, maybe naively, are now using AI to help them count carbs, and some of these features are already in beta, if not shipping.

    That is why I believe this piece from Tim is remarkable: it shows the limitations in a language the diabetes community can understand, and this is why I posted it.

  • embedding-shape 17 hours ago ago

    > You’d expect the same answer each time. It’s the same photo, the same model, the same question. But you won’t get the same answer. Not even close — and the differences are large enough to cause a hypoglycaemic emergency.

    Already the first paragraph highlights the issue; unless you set temperature=0.0 and the model can actually do reproducible inference, none of the "answers" you get are deterministic!

    But it's a very common misconception that "same question gets same answer" holds, when it's almost by accident that you get the same answer to the same question. The problem is that people expect this, as most platforms are not built to provide that experience. Of course you get different responses; it's on purpose!

    • monegator 17 hours ago ago

      This is one of the reasons i never used LLMs for anything related to coding. And i never intend to do. If i tell the thing to generate, will it generate the same thing, every time? will it change stuff that is working because the random number generator will conjure a slightly different answer?

      i'd be ok with it if i was generating a picture of X, or some word salad about Y, but not for code. Never for code.

      • embedding-shape 17 hours ago ago

        You'll learn to work around it, just like ML practitioners got used to imprecise math with floats. That LLMs use imprecise math and don't have 100% reproducible output doesn't make them impossible to work with, just a bit harder.

        But, if what you're doing right now works for you, do continue as-is if you so wish, I have no stake in if people use LLMs or not, just hope people make choices based on good information :)

  • nunez 13 hours ago ago

    I don't know why people are using AI to meal track and count calories. MyFitnessPal is dumb easy to use already, and it has, by far, the most robust nutrition facts database out there (they've been in the game for 20 years now; I've been using it since 2008).

    Any nutrition facts these models might use are vectoring either to data from this database or FatSecret. Anything custom, like estimating meals at most restaurants, is going to involve adding and multiplying stuff, and we know how great LLMs are at that.

  • sjhatfield 16 hours ago ago

    This post made the rounds in the open source DIY looping community on Facebook. In my opinion this isn't a good way to use AI to estimate carbs. Using AI to estimate carbs is just one of a large list of tools at our disposal, including nutritional info, company websites, weighing with a scale, etc. Just taking a photo of a food with no other input isn't going to give good results. Taking a photo along with a description including a brand name, an idea of size, a recipe URL, etc. will do much better. My opinions as a parent of a type 1 child.

  • umvi 16 hours ago ago

    Food companies try every trick to make carb counting difficult. Companies will tout "zero sugar" in the label even though the first ingredient is maltodextrin or maltitol or some other thing that quickly turns into sugar the moment you ingest it. The only way to get good at it is to wear a CGM and then see how your body reacts to things and then keep a mental list after that. A company may claim some product only has 2 net carbs, but I've found those claims to be false a lot of the time, with bigger companies being the biggest offenders.

  • philipphutterer 17 hours ago ago

    I agree with others that the intent of this study could be stated more clearly, but honestly, doesn't this show exactly one thing to people in the tech world? We need better education and communication for people without technical knowledge about what to use which AI models for and what NOT to do with them. Quite often I try to give quick help and set expectations about what an LLM can do with a given input whenever someone non-technical close to me runs into unexpected output. AI just seems so simple and non-complex to most people, it's shocking.

  • cj 16 hours ago ago

    The mistake the article makes is providing a photo with zero context. That's why it's mistaking crema catalana for creme brulee. You'll get much more consistent responses if you share a text description along with a picture.

    I use AI to estimate calories / macros multiple times per week. I always ask both ChatGPT and Gemini, and then I use my brain to decide what I actually want to log in my calorie tracking app.

    About 80% of the time, ChatGPT and Gemini give estimates that are very close to one another.

  • Ekaros 17 hours ago ago

    It also makes one question the tasks we think AI can do. If the variance in output is that large, what does it tell us about the failure rate in other tasks? Or about reliability in general across use cases?

    In the real world, the acceptable failure rates in many cases are a lot lower than what we accept now. One in a thousand could be too high if you run the process, say, a thousand times. So in reality a good-enough error rate should be one in a million or rarer...

  • nvahalik 16 hours ago ago

    Man. I built an AI food system at my previous company and it was tough: we ended up just using it as a way to look up foods in a real DB. We allowed guesstimation, but ultimately the win was "I don't have to search for everything in this place": we surfaced the _what_ and then let the user enter the real weights.

    And that... really was (and will be) the only way for this to ever work.

  • rao-v 15 hours ago ago

    I messed with this a bunch (still have a prototype floating around somewhere). Add a food-weight signal from a Bluetooth scale and you'll get much, much more grounded answers. Standardize the output format, soft-match against nutritional databases, run the result through the model for confirmation, and it does even better.
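
    The soft-match step might look something like this (a toy sketch with a made-up three-item table and rough carb values; a real pipeline would query USDA FoodData Central or a similar database):

```python
import difflib

# Toy nutrition table (grams of carbs per 100g); the values are
# illustrative only, not reference data.
DB = {"white bread": 49, "cheddar cheese": 1, "paella": 21}

def soft_match(label, cutoff=0.6):
    """Map a free-text model label onto the closest DB entry, if any."""
    hits = difflib.get_close_matches(label.lower(), DB, n=1, cutoff=cutoff)
    return (hits[0], DB[hits[0]]) if hits else None

def carbs(label, grams):
    """Carb estimate grounded in the DB, or None to fall back to the user."""
    match = soft_match(label)
    if match is None:
        return None
    _, per_100g = match
    return grams * per_100g / 100
```

    The point of grounding in a database is that the model only supplies a label and a portion size (ideally from the scale); the carb density comes from curated data instead of the model's guess.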

  • emadda 17 hours ago ago

    Related: I created an app to track the molecules in your foods:

    https://kg.enzom.dev/

    You specify your foods in grams with plaintext (no pictures).

    I never liked the "take a picture to measure calories" approach, as you could have 10 tablespoons of olive oil which would drastically change the calories but would not show in a picture.

    • jerkstate 16 hours ago ago

      This is really cool. I vibe coded almost exactly the same app, mine also tracks saturated fat, cholesterol, sodium, and fiber (I'm getting older so these macros are pretty important to me). One of the really cool things you can do if you have tool calling hooked up is to have the LLM analyze your diet and tell you what you can do better to hit your targets - swap pizza for pasta, decrease the amount of cheese you put on your sandwich, if you're gonna have fast food don't get the fries and eat low fat/low sodium the rest of the day, etc. What model are you using? I have found Qwen Flash to be really good for my app - smart enough, tool calling works really well, and very cheap.

      • emadda 10 hours ago ago

        Thanks.

        It is using Gemini Flash 3 at the moment.

        There is a lot to learn in nutrition. The glycemic load metric is quite revealing for pizza vs pasta (slow digesting carbs are supposed to be better). Al-dente cooked pasta is also slower digesting than well cooked pasta.

        Another interesting thing is how each plant food has unique molecules that can be health promoting in humans. That was one aspect I wanted to reveal/compare for the foods I ate.

        Dr Weil / Perfect Health Diet / Mark's Daily Apple are three sources I like to check for information on nutrition.

  • 827a 17 hours ago ago

    To be fair, if you ask 10 people to eat visually identical food 10 times each, then magically measure the calories consumed by each individual, you'd probably get ~70 different values. The internal density of food is extremely difficult to reason about from the outside. The personal variance is also difficult to reason about.

  • NiloCK 17 hours ago ago

    I think the headline oversells this a little?

    The reported variance in Sonnet 4.6's estimates here is actually quite low, and in general terms, not so bad across models. Damn paella.

    This does seem like a task well suited to a for-purpose training run against a bunch of labelled data. Is there any reason they wouldn't improve at it?

  • voidUpdate 17 hours ago ago

    > "The prompt was adapted from the one used in the iAPS open-source automated insulin delivery system — it’s a real production prompt, not a toy example."

    This idea is seriously being implemented in a production app? And people are using that app to make health choices? Oh god...

  • gcanyon 15 hours ago ago

    I'd be super-curious to see how many estimates you have to take to bring down the std dev to a reasonable level. (And of course if the mean isn't too far off) If it's 2-5 samples then an app could salvage this.
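
    If per-photo errors were independent and unbiased (a big if, given the systematic misidentifications the paper reports), the classic sigma/sqrt(n) rule answers this. A toy simulation with made-up noise numbers:

```python
import random
import statistics

random.seed(0)

def carb_estimate():
    # Hypothetical stand-in for one model query: true value 40g of
    # carbs, per-query noise with a std dev of ~10g. Real errors may
    # be biased and non-Gaussian, in which case averaging helps less.
    return random.gauss(40, 10)

def mean_of_n(n):
    """Average n repeated queries on the same photo."""
    return statistics.mean(carb_estimate() for _ in range(n))

# For independent errors the spread of the averaged estimate shrinks
# like sigma / sqrt(n): 4 samples halve it, 16 quarter it.
spread = {n: statistics.stdev(mean_of_n(n) for _ in range(2000))
          for n in (1, 4, 16)}
```

    Note this only reduces variance, not bias: if the model consistently misreads the dish, no amount of resampling fixes the mean.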

  • mottiden 17 hours ago ago

    I am surprised that people believe that calories can be counted correctly from a single photo

    • boelboel 17 hours ago ago

      It's like the 'enhance' bs they do in crime shows. All of a sudden the computer can make up a sharp image out of nowhere

    • edu 17 hours ago ago

      The issue is there are many apps claiming they can do exactly that, and for many people they are “magic”.

      We should not allow companies to lie blatantly to the customers.

      Edit: r/blame/lie/

      • tcoff91 17 hours ago ago

        These calorie counting picture apps should be sued for false advertising.

  • newshackr 16 hours ago ago

    Maybe not great for the intended use case, but guessing 28g of carbs against a 40g reference for the sandwich seems pretty close to me, particularly without knowing the dimensions of the bread, etc.

  • amelius 15 hours ago ago

    There's a lot of unknowns, even in an image. That cheese sandwich could have a sauce on it.

    Maybe they should ask: what are the worst case and best case numbers for this lunch?

  • larodi 17 hours ago ago

    Time to ask it 20k times which is more harmful - alcohol or weed. Curiously in my attempts alcohol always tops the harm ratings miles before all else, including some class 1 drugs.

  • smusamashah 14 hours ago ago

    Everytime I see Claude doing better than the rest in charts, it reminds of https://www.reddit.com/r/TopRightMessi/

  • alexdns 17 hours ago ago

    Non-deterministic AI returns non-deterministic results. Who could've guessed?

    • sumtechguy 17 hours ago ago

      I wanted an accountant I got a poet.

  • a-dub 17 hours ago ago

    i've found that multiple queries with the same prompt that requests a short answer is an excellent way to gain a confidence style measure that actually works.
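
    A minimal sketch of that idea (toy code; it assumes the short answers have already been normalized to comparable strings):

```python
from collections import Counter

def consistency_score(answers):
    """Return (modal answer, fraction of queries that agree with it).

    A crude self-consistency measure: if 9 of 10 identical
    short-answer prompts return "40", treat that as high confidence;
    a flat spread suggests the model is guessing.
    """
    counts = Counter(answers)
    top, n_top = counts.most_common(1)[0]
    return top, n_top / len(answers)

answer, confidence = consistency_score(["40", "40", "35", "40", "50"])
# → ("40", 0.6)
```

    The extra queries cost tokens, but for numeric answers you can also report the spread itself as an error bar rather than pretending the modal value is exact.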

  • gyosko 17 hours ago ago

    I always love AI discussion. Using AI like they fucking sell it to us? You're doing it wrong!!LLMs can't do that!!

    No shit sherlock, but the AI gurus are just telling people that this fucking parrot CAN DO EVERY FUCKING THING.

    Why wouldn't an ordinary guy just ask these questions to an AI when everybody is telling him that AI is intelligent enough to answer accurately?

  • sathish316 17 hours ago ago

    Feel the AGI of next-word or next-number carbs prediction

  • NiloCK 17 hours ago ago

    Another more general comment:

    There is general interest across a variety of disciplines in kicking the tires of LLMs with respect to their competence in DOMAIN_X. This is good in general terms, but, especially with larger studies, they tend to be out of date by the time of publication, and super out of date by the time they hit the media circuit. Out of date here meaning tested against models 1, 2, or more generations behind SOTA.

    The DOMAIN_X experts do have a lot to offer in terms of defining success criteria across domain tasks, but the studies (snapshots in time) could be much more impactful if they were instead packaged as benchmarks (that could track model progress over time, and even steer it).

    AI community / industry could probably do some outreach work to streamline or standardize methods for general researchers to produce reusable benchmarks.

  • Waterluvian 17 hours ago ago

    It's funny how with AI this comic is basically reversed: https://xkcd.com/1425/

  • tim-tday 16 hours ago ago

    LLMs can’t count. This is well known. Give them a calculator or allow them to write code to do it.

  • jan_Sate 17 hours ago ago

    Oh. I read "crabs" and I was confused until I clicked into the article. Guess I need coffee.

  • jchw 17 hours ago ago

    > 42.9 units of insulin from a single photo. That’s not a rounding error. That’s a potential fatality.

    Shit like this is why you shouldn't involve AI output in your writing process. It's especially ironic in an article about LLMs being unreliable... but it's pointless when the pre-print seems just fine at least to my eyes.

  • juancn 16 hours ago ago

    Does this surprise anyone?

    I mean these models are inherently probabilistic.

    If you run enough samples you'll get results matching the learned probability distribution, the more you sample the higher the chances that you'll land on an unlikely response.

  • mbesto 16 hours ago ago

    probabilistic != deterministic

  • FrustratedMonky 17 hours ago ago

    Is this about AI?

    1. If I feed the exact same image in, it does not deterministically give me the exact same result every time.

    2. Or is this about calories? Because even if a package label says "200 Calories", if you were to measure every package, each one would be different: 198, 199, 200, 201, 202. Plus/minus a pretty big range.

    >>> answered own question. " It’s the same photo, the same model, the same question. But you won’t get the same answer"

  • engineer_22 17 hours ago ago

    To me, someone without a full understanding of the AI systems, it seems like the problem is most strongly influenced by image classification. The next logical step in this research is to remove image classification from the loop, since it's a confounding factor.

  • fHr 17 hours ago ago

    ASI/AGI reached kap

  • bethekidyouwant 17 hours ago ago

    I asked an AI to guess how much a picture of a rock weighed 500 times… But it does propose an interesting idea. Which is burn after labelling. (maybe it could be really good at this)

  • algoth1 17 hours ago ago

    I’ll save you a click: ‘Llms can’t perform direct calorimetry through a photo of a meal. Llms can’t even perform basic atomic spectroscopy’ in other news…

  • Marciplan 17 hours ago ago

    skill issue

  • feverzsj 17 hours ago ago

    Bullshit machine can't even do bullshit job?

  • christkv 17 hours ago ago

    LLMs going to llm.

  • monooso 17 hours ago ago

    Tomorrow on HN, "water is wet."

  • tom1337 17 hours ago ago

    random number generator returns random numbers on each call. more news at 11

    • falcor84 17 hours ago ago

      Even more news at randint(1, 12)

  • ori_b 17 hours ago ago

    Slop

  • dyauspitr 17 hours ago ago

    What a dumb article. The picture of the sandwich is essentially just a picture of bread. You can’t see what’s inside. A human wouldn’t be able to tell you. These are essentially AI hit pieces.

    • tsimionescu 17 hours ago ago

      This is based on a real app that someone is selling to real people. It's not a hit against AI, but it is very much a legitimate hit against uses of LLMs for this purpose.

      Also, if LLMs worked as they are often advertised, they should have easily been able to answer "there isn't enough information in this picture to give you an accurate estimate. Try taking a picture of the label, or at least of the inside of the sandwich, or list the ingredients used".

      • sjhatfield 15 hours ago ago

        iAPS is not software you pay for. This is open source software to dose insulin where you assume full responsibility for the outcomes.

        • tsimionescu 15 hours ago ago

          That's irrelevant to my point. Feel free to replace "selling" with "distributing" - my point stands: using an LLM to estimate carbs from a picture is a real use case that real software developers are suggesting to real people is a real solution. It's not some contrived idea that the researcher concocted to make AI look bad.

          The GP, and much of the thread, is basically acting as if it should be obvious to anyone who is not an idiot that this is not possible to do precisely; and that the researcher just made up some use case that LLMs can't do and wrote a paper about it to disparage AI.

    • nextlevelwizard 17 hours ago ago

      And it can be made better easily. Take a picture of the nutrition label of the bread and cheese first and then feed in this picture and you should get way better results

      • sjhatfield 15 hours ago ago

        The sandwich example is silly because you almost certainly know the full nutritional info from the packaging, so just use that...

    • lolc 17 hours ago ago

      As a diabetic I have done this exact exercise: Look at photos and guess carbs. Two slices of bread is easy mode.

      Why assume trick ingredients?