Hacker News

He asked AI to count carbs 27000 times. It couldn't give the same answer twice

199 points by sarusso today at 12:38 PM | 254 comments

Comments

endymion-light today at 12:59 PM

There's an incredibly serious lack of education about how LLMs & carb counting work. This entire article would be better suited to astrology.com than Hacker News.

When I opened it up, I assumed the author would have at least attempted a calculation service, maybe even fed something like the size of the meal into an actual model, using the integration of pre-existing tools that are (slightly more) accurate. Hell - most food is literally required to have calorie information, and you can query open source data for the rest!

But the author just took pictures of food & expected a realistic response? Is this genuinely what amounts to a study in AI?

This is akin to the Instagram reels where people talk to ChatGPT and ask it to time how long their run is. Except those are treated as funny jokes rather than being turned into studies.

I'd like to see this study done with some kind of actual grounding knowledge, seeing what mistakes AI makes when attempting to query ground truth from picture analysis; there would at least be an interesting methodology in that.

harperlee today at 1:01 PM

There is a lot of hate in the comments, but there is some merit to the post existing:

  1. Even if the task is unreasonable, it is good to showcase that the LLM will perform poorly: a warning that it should not be used for diabetes management.

  2. As it is a probabilistic model, the approach was to execute it multiple times and look at the distribution. They also tried to minimize variance: "All at the lowest randomness setting these models offer," the post mentions. Yet the variance of the responses is surprising.

  3. A multimodal LLM should in general be able to discriminate between crema catalana and a cheese sandwich, and provide a textual, uncalculated range of how many calories the item has (the internet is full of tables for calorie counting, and things such as this https://fitia.app/calories-nutritional-information/cheese-sandwich-1205647).

  4. It is not clear whether the "exposé" surprised/outraged style is just a communication vehicle or whether the author really thought that, e.g., LLMs could hypothetically be able to provide confidence estimates.
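The distribution approach in point 2 can be sketched in a few lines. The answers below are made-up illustrative numbers, not figures from the post:

```python
import statistics

def summarize_estimates(estimates):
    """Summarize repeated carb estimates (in grams) for one photo:
    mean, sample standard deviation, and coefficient of variation."""
    mean = statistics.mean(estimates)
    stdev = statistics.stdev(estimates)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}

# Hypothetical run: eight answers for the same cheese-sandwich photo
# (the post's reference value for that sandwich is 40 g).
answers = [28, 35, 40, 52, 30, 45, 38, 60]
spread = summarize_estimates(answers)
```

A coefficient of variation well above 0.2 on the exact same photo is the kind of spread the post is complaining about.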
jaccola today at 12:49 PM

It’s just an impossible problem. Photons don’t provide sufficient information to determine calories (at least not in any way they could practically be captured). The inside of that sandwich could be drenched with olive oil, or it could be hollow cheese with lettuce. It’s impossible to tell.

Aurornis today at 1:09 PM

This will surprise nobody here, but it’s important to communicate to audiences that are new to LLMs.

This is targeted at people with diabetes because there are AI carb counting apps appearing in app stores:

> If you’re using AI carb counting in a diabetes app

These apps are probably not even using the mainstream models used in the study because they would be too expensive for cheap or free apps, and they’re probably forcing structured output to get a response without any of the warnings that an LLM might include if you ask it directly.

rsynnott today at 12:47 PM

I am... unsure why anyone would think LLMs would be able to do this. They are not magic oracles. Like I think even most humans would be extremely bad at this.

Like, are people actually using LLMs for this? Please do not, it won't work.

axlee today at 1:10 PM

"Crema catalana: Three of four models called it “creme brulee” 100% of the time. Only Gemini 3.1 Pro got “crema catalana” — in 3.4% of queries."

----

Wikipedia for Crema catalana:

Crema catalana (Catalan for 'Catalan cream'), or crema cremada ('burnt cream'), is a Catalan dessert consisting of a custard topped with a layer of caramelized sugar.[1] It is "virtually identical"[2] to the French crème brûlée. It is made from milk, egg yolks, and sugar. Crema catalana and crème brûlée are made in the same way.

---

Oh no, my AI can't detect that an obscure clone of a famous dish is indeed the obscure clone, and not the commonly known version.

zamadatix today at 2:38 PM

The title seems to be clickbait (with the 13 foods in the paper, the ranges needed for such a title weren't even possible), but the results/paper are much more on point.

It'd be really interesting if it evaluated humans on the exact same image sets. The correct answer is just to feed in more data, such as the exact food itself, but the post makes it sound like the model itself is the only risk in this approach to counting carbs.

ozbonus today at 1:14 PM

Before the next galaxy brain shows us all how smart and witty they are by adding the nth sarcastic comment about how obvious this result is, I hope they'll take a moment to consider a few things.

Yes, people are using LLMs for this kind of thing. Lots of people. All the time. I've met plenty of them, and there are loads of apps that offer this kind of "service". The authors are well aware that people are doing this and probably anticipated the result.

Why do the study at all? Because it's important to demonstrate and measure things, even obvious ones. Because it's not obvious to everyone, like the people who are already consulting LLMs for dietary information to manage their health. Because it's easier to enact official policies when there's hard evidence.

mattnewport today at 2:25 PM

Ironic that they used an LLM to write the article:

> 42.9 units of insulin from a single photo. That’s not a rounding error. That’s a potential fatality.

gus_massa today at 2:34 PM

Let's start with the wrong title:

> I Asked AI to Count My Carbs 27,000 Times. It Couldn’t Give Me the Same Answer Twice.

If you look at the image https://www.diabettech.com/i-asked-ai-to-count-my-carbs-2700... it clearly shows some repeated values. I guess AI likes multiples of 5 or 10 or something. It would be nice to look at the raw tables.

> A cheese sandwich on a plate. Here’s one that should be easy. Two slices of thick white bread (carbs on the packet: 20g per slice) plus cheddar cheese (negligible carbs). Reference value: 40g. Simple, unambiguous, packet-label accuracy.

Real cheese or fake cheese that is actually flour paste with gum and colorant? Does it have mayo? I like mayo! Real mayo or fake mayo that is actually flour paste with less gum and another colorant? Does it have a slice of ham that is totally covered by the bread? Real ham or illegal fake ham that is actually some ground pork with flour paste, more gum, and yet another colorant?

> The models don’t always know what they’re looking at. [...] Crema catalana: Three of four models called it “creme brulee” 100% of the time. Only Gemini 3.1 Pro got “crema catalana” — in 3.4% of queries.

Can someone from Europe tell me the difference? I like it (at least one of them), and I eat it from time to time (like once a year, in a restaurant), but looking at the Wikipedia page of both I can't tell the difference.

Centigonal today at 2:47 PM

Context: there are a lot of very popular apps (e.g. Macrofactor) that are being promoted on youtube channels and downloaded for exactly this feature. The users don't understand that this is an impossible problem. It's a scam that affects people's well-being, and it's good that there's data proving it.

nextlevelwizard today at 12:55 PM

I used LLMs to count calories, but not based on photos. I mean, I also did that, but primarily I fed in my exact ingredients and their weights to get calorie estimates.

Was it always correct? Certainly not. But it helped me lose 30 kg, since keeping even some track of calories was so much easier with an LLM than with any app I had used before.

Also, of course, it didn’t matter whether I was exactly on point, since it wasn’t about any kind of medicine.

jasonkester today at 1:18 PM

LLMs seem really bad at reading numbers and reporting them back. I’m building a game, and to see how well its docs were being indexed, I tried asking simple questions to ChatGPT, Gemini, whatever Microsoft’s thing is, etc.:

“What is the armour value for the Leather Shirt in the game Stravaeger?”

It confidently got it wrong.

“You can find the game at https://stravaeger.com”

Different confident answers, also wrong.

“You’ll find it in a table on this page: https://stravaeger.com/docs.html?inventory_item=LEATHER_SHIR...

Oh, sorry. I was inferring from other similar games. Here is a different confidently wrong number.

“It’s also in the .json file linked on that page”

And another wrong value. Random numbers should have got it right by now, but no. And the confident, authoritative tone never changed. Every model I tried was the same story.

rao-v today at 3:03 PM

I messed with this a bunch (still have a prototype floating around somewhere). Add a food-weight signal with a Bluetooth scale and you’ll get much, much more grounded answers. Standardize the output format, soft-match against nutritional databases, and run it through the model for confirmation, and it does even better.
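A minimal sketch of the soft-match-plus-scale idea, assuming a toy nutrition table (the food names, carb values, and the 0.6 cutoff below are all illustrative, not from any real database):

```python
import difflib

# Toy table: grams of carbohydrate per 100 g (illustrative values only).
NUTRITION_DB = {
    "white bread": 49.0,
    "cheddar cheese": 1.3,
    "crema catalana": 17.0,
    "paella": 21.0,
}

def carbs_from_scale(model_label, weight_g):
    """Soft-match the model's free-text food label against the table,
    then ground the estimate with the weight read from the scale."""
    match = difflib.get_close_matches(
        model_label.lower(), NUTRITION_DB, n=1, cutoff=0.6
    )
    if not match:
        return None  # unknown food: better to say so than to guess
    return NUTRITION_DB[match[0]] * weight_g / 100.0

# Model says "White Bread slice", the Bluetooth scale reads 80 g.
estimate = carbs_from_scale("White Bread slice", 80)
```

The point of the weight signal is that it removes the one variable a photo genuinely cannot convey: portion mass.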

arjie today at 1:48 PM

This is pretty interesting. Not the content, but the technique. I suspect this was an entirely automated pipeline with Claude Code or Codex and that the author then just unleashed one of the commercial harnesses on the entire flow of querying the APIs and writing the post, including the headline. We've clearly reached the point in AI writing where a small set of inputs can create content that humans enjoy participating in discussion of. Good show.

amazingamazing today at 12:54 PM

With mass information you could infer much more from pictures. With some sort of standard cube in the picture, as well as a picture taken at an angle that shows all three dimensions, you could also better estimate the relative volume.

It’s tractable I think, but not from a pic alone.

sjhatfield today at 2:35 PM

This post made the rounds in the open-source DIY looping community on Facebook. In my opinion, this isn't a good way to use AI to estimate carbs. Using AI to estimate carbs is just one of a large list of tools at our disposal, including nutritional info, company websites, weighing with a scale, etc. Just taking a photo of a food with no other input isn't going to give good results. Taking a photo along with a description including a brand name, an idea of size, a recipe URL, etc. will do much better. My opinion as a parent of a type 1 child.

techcode today at 2:13 PM

I've seen/noticed this simply from being on a low-carb (aka keto) diet.

Besides AI grossly over/underestimating values even when you give it a photo of the packaging with the nutritional table and tell it the weight you used, the other thing that surprised me, at least until I read up on how LLMs actually work, was how confidently it would BS you on your daily total.

Even when the chat/messages are just "Ate ABC with XYZ values, what's my daily total?"

I guess a new chat for each day, or some MCP for storing and retrieving records/meals, would have helped with those daily totals. But the total would still be wrong unless you explicitly specified each of the values you need to track (e.g. carbs, fat, protein, kcal) to be put into records.

At which point, of course, you're not really using an AI/LLM but basically a CRUD application.

umvi today at 2:22 PM

Food companies try every trick to make carb counting difficult. Companies will tout "zero sugar" in the label even though the first ingredient is maltodextrin or maltitol or some other thing that quickly turns into sugar the moment you ingest it. The only way to get good at it is to wear a CGM and then see how your body reacts to things and then keep a mental list after that. A company may claim some product only has 2 net carbs, but I've found those claims to be false a lot of the time, with bigger companies being the biggest offenders.

gcanyon today at 2:55 PM

I'd be super curious to see how many estimates you have to take to bring the std dev down to a reasonable level (and, of course, whether the mean is too far off). If it's 2-5 samples, then an app could salvage this.
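Under the usual independence assumption, the standard error of an averaged estimate falls as stdev/sqrt(n), so the sample count for a target precision is easy to back out (the 15 g per-query spread below is just an illustrative number, not one from the paper):

```python
import math

def samples_needed(per_query_stdev_g, target_sem_g):
    """Number of repeated queries to average so that the standard error
    of the mean (stdev / sqrt(n)) drops to target_sem_g or below."""
    return math.ceil((per_query_stdev_g / target_sem_g) ** 2)

# Illustrative: 15 g per-query stdev, want the average pinned to +/- 5 g.
n = samples_needed(15, 5)  # 9
```

Note the caveat: averaging shrinks the spread but does nothing for a biased mean, so whether 2-5 samples would suffice depends entirely on the per-food numbers the paper measured.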

recursivedoubts today at 1:00 PM

> You’d expect the same answer each time. It’s the same photo, the same model, the same question. But you won’t get the same answer. Not even close — and the differences are large enough to cause a hypoglycaemic emergency.

No you wouldn't, not if you have a basic understanding of how LLMs work and what "temperature" is. They are stochastic algorithms picking the next token based on a highly structured (and often very useful) coin flip.
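For the curious, that "structured coin flip" is just temperature-scaled softmax sampling; a minimal sketch with toy logits (not anything a real model produces):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Divide logits by T, softmax, then draw one index. As T -> 0 this
    approaches argmax (greedy decoding); at T = 1 it samples the model's
    raw next-token distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    cum = 0.0
    for i, e in enumerate(exps):
        cum += e / total
        if r < cum:
            return i
    return len(logits) - 1

# Toy logits for three candidate carb answers; at T = 1 all three get
# sampled sometimes, at T = 0.01 you essentially always get index 0.
logits = [2.0, 1.0, 0.5]
draws = [sample_with_temperature(logits, 1.0) for _ in range(100)]
```

Even greedy decoding only makes a single serving stack reproducible in principle; batching and floating-point nondeterminism in real deployments can still vary the output.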

fabian2k today at 1:06 PM

It does sound like a pretty terrible idea to try to count carbohydrates from an image. There just isn't enough information there to reliably do that. At best you could identify the object in the image and then show reference information on typical nutrition values. But if you need anything more accurate than that, you probably have to read the labels on the ingredients and calculate.

cj today at 2:07 PM

The mistake the article makes is providing a photo with zero context. That's why it mistakes crema catalana for creme brulee. You'll get much more consistent responses if you share a text description along with the picture.

I use AI to estimate calories / macros multiple times per week. I always ask both ChatGPT and Gemini, and then I use my brain to decide what I actually want to log in my calorie tracking app.

About 80% of the time, ChatGPT and Gemini give estimates that are very close to one another.

nvahalik today at 2:23 PM

Man. I built an AI food system at my previous company, and it was tough. We ended up just using it as a way to look up foods in a real DB and allowed guesstimation, but ultimately the win was "I don't have to search for everything in this place": we surfaced the _what_ and then let the user enter the real weights.

And this really was (and will be) the only way for this to ever work.

a7fort today at 1:00 PM

Finally we have a simple way to get machines to generate a truly random number

amelius today at 2:58 PM

There's a lot of unknowns, even in an image. That cheese sandwich could have a sauce on it.

Maybe they should ask: what are the worst case and best case numbers for this lunch?

sarusso today at 1:10 PM

For context: a LOT of people, maybe naively, are now using AI to help them count carbs, and some of these features are already in beta, if not shipping.

That is why I believe this piece from Tim is remarkable: it shows the limitations in a language the diabetes community can understand, and this is why I posted it.

DontchaKnowit today at 12:47 PM

Not remotely surprising to anyone who's ever counted calories or carbs.

embedding-shape today at 1:08 PM

> You’d expect the same answer each time. It’s the same photo, the same model, the same question. But you won’t get the same answer. Not even close — and the differences are large enough to cause a hypoglycaemic emergency.

Already the first paragraph highlights the issue; unless you set temperature=0.0 and the model can actually do reproducible inference, none of the "answers" you get are deterministic!

But it's a very common misconception that "same question gets same answer" would be true, when it's almost by accident that you get the same answer for the same question. That people expect this is the problem, as most platforms are not built to provide that experience. Of course you get different responses; it's on purpose!

philipphutterer today at 1:29 PM

I agree with others that the intent of this study could be communicated more clearly, but honestly, doesn't this show exactly one thing to people in the tech world? We need better education and communication for people without technical knowledge about which AI models to use for what, and what NOT to do with them. Quite often I try to give quick help and set expectations about what an LLM will do with a given input whenever someone non-technical close to me runs into unexpected output. AI just seems so simple and non-complex to most people; it's shocking.

raymondgh today at 1:53 PM

In defense of the models, the experiment was run with the temperature set at 0.01, which is very low; setting this can lead to weird responses. My find-on-page also found no mention of “thinking” or “reasoning” in the paper. Not trying to discount the whole thing, but I'm very curious how changing the parameters might affect the results.

Ekaros today at 1:18 PM

Also makes one question the tasks we think AI can do. If the variance in the output is that large, what does it tell us about the failure rate in other tasks? Or about reliability in general for other use cases?

In the real world, the acceptable failure rates in many cases are a lot lower than what we now accept. One in a thousand could be too high if you run the process, say, a thousand times. So in reality a good-enough error rate should be one in a million or a lot rarer...

newshackr today at 2:25 PM

Maybe not great for the intended use case, but guessing 28g of carbs for a 40g-carb sandwich seems pretty close to me, particularly without knowing the dimensions of the bread, etc.

emadda today at 1:11 PM

Related: I created an app to track the molecules in your foods:

https://kg.enzom.dev/

You specify your foods in grams with plaintext (no pictures).

I never liked the "take a picture to measure calories" approach, as you could have 10 tablespoons of olive oil, which would drastically change the calories but would not show in a picture.

827a today at 1:12 PM

To be fair, if you ask 10 people to eat visually identical food 10 times each, then magically measure the calories consumed by each individual, you'd probably get ~70 different values. The internal density of food is extremely difficult to reason about from the outside. The personal variance is also difficult to reason about.

NiloCK today at 1:10 PM

I think the headline oversells this a little?

The reported variance in Sonnet 4.6's estimates here is actually quite low, and, in general terms, not so bad across models. Damn paella.

This does seem like a task well suited to a purpose-built training run against a bunch of labelled data. Is there any reason they wouldn't improve at it?

voidUpdate today at 12:54 PM

> "The prompt was adapted from the one used in the iAPS open-source automated insulin delivery system — it’s a real production prompt, not a toy example."

This idea is seriously being implemented in a production app? And people are using that app to make health choices? Oh god...

larodi today at 1:38 PM

Time to ask it 20k times which is more harmful: alcohol or weed. Curiously, in my attempts alcohol always tops the harm ratings, miles ahead of everything else, including some class 1 drugs.

mottidentoday at 1:00 PM

I am surprised that people believe that calories can be counted correctly from a single photo

tim-tday today at 2:06 PM

LLMs can’t count. This is well known. Give them a calculator or allow them to write code to do it.

alexdns today at 12:47 PM

Non-deterministic AI returns non-deterministic results. Who could've guessed?

gyosko today at 1:21 PM

I always love AI discussions. Using AI like they fucking sell it to us? You're doing it wrong!! LLMs can't do that!!

No shit sherlock, but the AI gurus are just telling people that this fucking parrot CAN DO EVERY FUCKING THING.

Why wouldn't an ordinary guy just ask these question to an AI when everybody is telling him that AI is intelligent enough to answer accurately?

NiloCK today at 1:20 PM

Another more general comment:

There's general interest across a variety of disciplines in kicking the tires of LLMs with respect to their competence in DOMAIN_X. This is good in general terms but, especially with larger studies, they tend to be out of date by the time of publication, and super out of date by the time they hit the media circuit. Out of date here means testing against models 1 or 2 or more generations behind SOTA.

The DOMAIN_X experts do have a lot to offer in terms of defining success criteria across domain tasks, but the studies (snapshots in time) could be much more impactful if they were instead packaged as benchmarks (that could track model progress over time, and even steer it).

AI community / industry could probably do some outreach work to streamline or standardize methods for general researchers to produce reusable benchmarks.

a-dub today at 12:58 PM

i've found that multiple queries with the same prompt requesting a short answer are an excellent way to gain a confidence-style measure that actually works.
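One way to turn that into a number is agreement rate: take the modal short answer's frequency across repeats as a rough confidence score (a heuristic, not a calibrated probability):

```python
from collections import Counter

def answer_confidence(answers):
    """Return the modal answer and the fraction of repeated samples
    that agree with it, for the same short-answer prompt."""
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)

# Five hypothetical samples of the same carb question.
best, conf = answer_confidence(["40g", "40g", "35g", "40g", "45g"])
```

Low agreement across repeats is exactly the signal the article's 27,000-query experiment was measuring, food by food.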

sathish316 today at 12:52 PM

Feel the AGI of next-word or next-number carbs prediction

juancn today at 1:59 PM

Does this surprise anyone?

I mean these models are inherently probabilistic.

If you run enough samples, you'll get results matching the learned probability distribution; the more you sample, the higher the chance that you'll land on an unlikely response.

jan_Sate today at 12:53 PM

Oh. I read "crabs" and I was confused until I clicked into the article. Guess I need coffee.

mbesto today at 1:53 PM

probabilistic != deterministic
