New AI tutor achieves 0.71-1.30 SD effect size in Dartmouth course [pdf]

108 points • by jonahbard • today at 6:47 PM • 75 comments

Comments

I am somewhat skeptical of this.

First, the headline result of a 0.7*sigma improvement is the output of a statistical model based on the lessons/reviews students engaged with and their mid-term score, with that shift being for "full engagement". Based on their tables, something like ~16 students (11% of the group) actually reached that level of engagement.

Second, trying to incorporate past grades into their modelling is not a substitute for a randomized trial.

Third, the headline engagement number of 90% is for "engaging with the platform, via Module Review or Lesson Quizzes, at least once". I don't know why much of that couldn't just be attributed to novelty. Or even partly a professor with all sorts of enthusiasm for the platform.

Fourth, the "full dosage" effectiveness is measured based on the final exam scores. Were these exam questions produced independently from the "Phosphor" materials? (e.g. by blinding?) Were they checked for direct overlap with those materials? The 0.7 sigma shift is 3 points on a 24 point exam; if even a few of the questions on that exam were very similar to those materials, it could account for almost all of it. This is not clear to me from the manuscript.

If this was the case, then it's less a question of "is AI effective" than of "did the students look at the materials". You could still argue that the AI platform got them to read, but that is a somewhat different statement than saying the AI helped them learn.
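
To make the arithmetic in the fourth point concrete, here is a rough sketch. It uses only the two figures quoted above (a 0.7 SD shift that equals about 3 points on the exam). The function names are hypothetical, and treating overlap as whole "free" points is an assumption for illustration, not something stated in the paper.

```python
# Back-of-the-envelope check using only the numbers quoted in the comment.
# Assumption (not from the paper): overlap with the practice materials
# translates directly into extra exam points.

SHIFT_POINTS = 3.0  # reported shift, in exam points
SHIFT_SD = 0.7      # the same shift, in standard deviations


def implied_sd(shift_points: float, shift_in_sd: float) -> float:
    """Exam standard deviation implied by a shift given in both units."""
    return shift_points / shift_in_sd


sd_points = implied_sd(SHIFT_POINTS, SHIFT_SD)


def residual_effect_sd(overlapping_points: float) -> float:
    """Effect size left over if some points came purely from overlap."""
    return (SHIFT_POINTS - overlapping_points) / sd_points


print(f"implied exam SD: {sd_points:.2f} points")
for k in range(4):
    print(f"{k} point(s) from overlap: residual effect = {residual_effect_sd(k):.2f} SD")
```

Under these assumptions each point attributable to overlap removes roughly a third of the headline effect, which is why the independence of the exam questions matters so much.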

KaiserPro • today at 8:16 PM

I'm not an expert, but how much of this is down to novelty, i.e. https://en.wikipedia.org/wiki/Hawthorne_effect ?

(i.e. changing the environment can lead to short-term productivity gains because either participants are aware they are being watched, or it breaks up the monotony and makes people work a bit harder.)

baq • today at 7:35 PM

I'm on record saying that a system like this with some extra hardware (i.e. a way for the LLM to have live understanding of the student's paper notebook or handout which are being written in with a plain old pencil) combines the best of both worlds - individual tutoring with approximately zero screen time which scales linearly with the number of students. The role of the teacher or professor then becomes a manager of the student - agentic tutor pairs, a referee when the student and model disagree, etc. and most importantly still being the human teacher you can just talk to in the human education process.

I'm convinced this is the future of education - models are there, we need the classroom tech to catch up. The alternative is obvious and quantified in the paper - students just use models to do their work for them and learn nothing.

usernametaken29 • today at 11:19 PM

Wow, putting effort into training material, thoughtfully designing it, and relating the material to the final exam, will increase performance on said exam. So much AI so much wow. Like seriously. Most university statistics courses suck big time, so literally any effort put into them will improve the field. I’m happy the authors want to improve education but they don’t seem to understand that preparing questionnaire-style material is a confounding factor which could very well explain the better performance too… instead of cramming AI into the next thing. I’m generally opposed to AI on basic textbooks. You don’t want hallucinations imprinted on students who have no idea and can’t judge the quality of the generated text. Some things require effort, reading intro to statistics is one of them, and it’s for a reason: the effort IS the learning.

delis-thumbs-7e • today at 11:11 PM

I currently study Multivariate Calculus by using a very new and modern method: I read the textbook, while solving the examples of the general formulas, or try to come up with my own. Then I do a s$ht-ton of exercises. I only use LLMs to quickly clarify confusing topics or notation, but not really much else. I cancelled my Claude subscription. Now I use just Mistral and local Vibethinker-3B, but they work just fine.

Earlier I used Claude by giving it the course material and asking it to generate me exercises (our course work went way over my head) and yeah I learned to differentiate a gradient or Jacobian, but it was very shallow - I knew the formulas, but not what they meant or how to apply them correctly. After I just filled glaring holes I had in Univariate Calculus by reading and doing, I actually started to understand something.

Long story short, in my experience learning with LLMs is ok with very unfamiliar material that is not too complex (there are obvious problems of LLMs themselves being pretty ghastly with maths sometimes), but at least it is not better than the traditional method of just putting your nose to the grindstone.

wxw • today at 8:36 PM

The title is misleading. This isn't an AI tutor so much as a practice quiz platform with an AI autograder.

> constructed-response questions (CRQ) are graded by Claude Sonnet 4.6 against instructor-defined, question-specific rubric criteria

> Crucially, LLMs make it feasible to grade formative CRQ against rubric criteria at scale, a capability that appears pedagogically significant rather than merely convenient.

They specifically call out that the "RAG chat assistant" part of Phosphor (the platform) wasn't used much.

I commend the effort here, but I don't think these results are particularly noteworthy. The conclusion is essentially that people who do practice quizzes will do better on exams.

rictic • today at 8:12 PM

Yes! Very exciting to see this.

Bloom's Two Sigma Opportunity suggests that there's another SD improvement available: https://en.wikipedia.org/wiki/Bloom%27s_2_sigma_problem

rusbus • today at 7:28 PM

This is exciting because the effect size is so large. But as the authors acknowledged, selection bias is nearly impossible to control for in this non-randomized study:

> and lacks randomized controls. Self-selection is the central threat: students who complete more quizzes may be more motivated or higher-performing generally

But this is still a strong result. I'm excited to see more in this space.
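
For intuition on how strong the self-selection threat can be, here is a toy simulation. Everything in it is an assumption for illustration (unit-variance "motivation", the noise levels, the engagement threshold, the function names); none of it comes from the paper. Quizzes have zero causal effect by construction, yet the engaged group still scores higher.

```python
import random
import statistics


def simulate(n: int = 2000, seed: int = 0):
    """Return (engaged_scores, other_scores) for a made-up cohort.

    Motivation drives both engagement and exam score; engagement itself
    has NO causal effect on the score in this toy model.
    """
    rng = random.Random(seed)
    engaged, others = [], []
    for _ in range(n):
        motivation = rng.gauss(0, 1)
        is_engaged = motivation + rng.gauss(0, 1) > 0  # self-selection
        score = motivation + rng.gauss(0, 1)           # no quiz effect
        (engaged if is_engaged else others).append(score)
    return engaged, others


def cohens_d(a, b) -> float:
    """Standardized mean difference using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5


engaged, others = simulate()
print(f"spurious effect size d = {cohens_d(engaged, others):.2f}")
```

The point is only qualitative: a large standardized difference between engaged and non-engaged students can appear without any treatment effect, which is why a randomized design would be more convincing.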

bobajeff • today at 10:48 PM

Even if the research is flawed I'm happy they are trying this. They are taking advantage of LLMs to have less rigid tests and also give feedback.

I think there are more potential applications in combining LLMs with reference/text books. Like how about an assistant that points you to the correct book/chapter/paragraph for the concept you need to understand better for a project you are working on? Or clarifies any confusion you are having?

Like a human tutor but infinitely patient and non-judgy + search engine.

mmarian • today at 7:43 PM

Conflicted about this study. On one hand, LLMs have been incredible for my personal learnings of new concepts.

On the other, I'm sceptical that it'll have "strong benefits" at scale; I'd be more in favor if the wording was "some"/"moderate". I reckon self-selection plays a huge part, as mentioned in the "Limitations" section of the paper.

I'd also caution against attaching the tool to grading. That means students have to put more effort into the course, which increases the chances that they will use LLMs to save time rather than make the investment.

or_am_i • today at 9:09 PM

The article explicitly calls out selection bias (this is entirely based on the 90% who opted into using the tutor; there was no control group). I wish the headline did as well. "Engaged students score 0.71 - 1.30 SD better in tests" sounds like a much simpler explanation.

zerobees • today at 9:11 PM

While there's some skepticism in the thread, I'm not particularly surprised if this is true. Children who can get human tutoring do a lot better. An LLM that can answer questions and patiently explain likely offers some benefit.

What creeps me out about bringing LLMs into early education is that it's a period where kids learn to socialize and cope with problems, and I do worry about them forming substitute relationships with chatbots that are engineered for sycophancy / enablement. But I guess that's a problem either way, because almost every student will try an LLM at some point.

NeutralForest • today at 9:12 PM

Interesting article, wonder where we're going with this though, I find it's very difficult to keep LLMs on track and critical enough to be useful.

Just want to say that:

>In our deployment, student-reported reading completion baselines for MATH 010 were approximately 15%, with instructors estimating 10%. Individual student reports of reading compliance ranged from "literally no one does that" to "is this being recorded?"

is hilarious

boulos • today at 7:10 PM

Do you have a larger study planned for the Fall? It definitely seems promising.

I'm curious how well you feel this worked because the subject was Statistics (objective grading) versus something more subjective like Civics or Literature.

PS - I'd say this qualifies for Show HN, too!

RA_Fisher • today at 8:58 PM

This is super, but students will have access to AI during the test in real life, so it's ironically less realistic to remove it (thinking of the "... GPT-4 actually harmed subsequent performance by 17% when the tool was removed ..." part).

I'm more curious how students perform on the test with vs. without AI.

ilaksh • today at 7:40 PM

Shocking that a well executed AI tutor improves outcomes.

Hasn't computer assisted interactive learning already been proven for years? Why does there seem to be so much skepticism about enhancing it with AI?

Is this just something like, astoundingly slow adoption or poor execution? Being held back by paper textbook makers? Teachers unions dragging their feet?

How can interactive AI driven individually paced learning _not_ be obviously dramatically more effective?

constantius • today at 7:30 PM

Interesting, congrats.

Are you planning on opening access to Phosphor?

glenstein • today at 9:06 PM

In mice!

Jk, but the skepticism is inevitable. I think we can be dubious about how AI mobilizes global capital while also appreciating tutoring as one of its best targeted use cases.

klustregrif • today at 9:35 PM

A lot of pessimism in the comments, but I am just happy that we are seeing some work towards bridging the 2 Sigma gap for regular education vs. elite private tutoring. I can't imagine that people assume it's the physical presence of the tutor that is making the difference, it has to come down to the personalisation and expertise which is exactly what AI can provide in a form. And yea it might not be "there" yet. But if we don't start trying and studying then it'll never get there.

Rperry2174 • today at 7:25 PM

Honestly whether or not this was effective seems less important to me than the adoption numbers.

Textbook reading in this course was 10-15% at baseline ... but this AI thing got 90% voluntary usage ungraded.

Even if it's worse per-hour than a textbook, you're now teaching 6x as many students _something_ instead of teaching a small minority everything.

So really it just becomes an optimization problem at that point because most students are at least in the funnel/in the running to learn something.

The paper kind of proves this itself ... they tweaked the quiz formats mid-semester and were able to iterate, which you can't do on a textbook that nobody opens in the first place.
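
A quick sketch of where the "6x" comes from, using only the rates quoted in this thread (90% platform usage against a 10-15% reading baseline). The function names are hypothetical, and the break-even figure assumes, purely for illustration, that total learning is reach multiplied by per-student benefit.

```python
def reach_multiplier(new_rate: float, baseline_rate: float) -> float:
    """How many times more students are reached at the new rate."""
    return new_rate / baseline_rate


def break_even_fraction(new_rate: float, baseline_rate: float) -> float:
    """Per-student benefit (relative to the textbook) at which total
    learning is equal, assuming total = reach * per-student benefit."""
    return baseline_rate / new_rate


USAGE = 0.90
for label, baseline in [("student-reported", 0.15), ("instructor-estimated", 0.10)]:
    mult = reach_multiplier(USAGE, baseline)
    frac = break_even_fraction(USAGE, baseline)
    print(f"{label} baseline {baseline:.0%}: {mult:.1f}x reach, break-even at {frac:.0%} of textbook benefit")
```

So the 6x figure corresponds to the 15% student-reported baseline; with the instructors' 10% estimate the same calculation gives a larger multiplier.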

kubb • today at 7:21 PM

Too bad the educational use case doesn't make any money. Good LLMs are a game changer for people motivated to learn.

albinahlback • today at 7:15 PM

Very nicely typeset.

MoneyBurning • today at 7:53 PM

Curious how this holds up across different learning styles. SD effect sizes look impressive, but I'd want to see retention data at 30/90 days before drawing conclusions.
