Hi HN - I’m the Head of AI Research at Sword Health and one of the authors of this benchmark (posting from my personal account).
We built MindEval because existing benchmarks don’t capture real therapy dynamics or common clinical failure modes. The framework simulates multi-turn patient–clinician interactions and scores the full conversation using evaluation criteria designed with licensed clinical psychologists.
We validated both patient realism and the automated judge against human clinicians, then benchmarked 12 frontier models (including GPT-5, Claude 4.5, and Gemini 2.5). Across all models, average clinical performance stayed below 4 on a 1–6 scale. Performance degraded further in severe symptom scenarios and in longer conversations (40 turns vs 20). We also found that larger or reasoning-heavy models did not reliably outperform smaller ones in therapeutic quality.
We open-sourced all prompts, code, scoring logic, and human validation data because we believe clinical AI evaluation shouldn’t be proprietary.
Happy to answer technical questions on methodology, validation, known limitations, or the failure modes we observed.
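If it's useful for readers skimming the thread, here is a minimal sketch of the general simulate-then-judge pattern the framework uses. The model names, system prompts, message-role mapping, and score parsing below are illustrative assumptions, not the actual MindEval code (the real prompts and scoring logic are in the open-source repo):

    # Minimal sketch of a simulate-then-judge loop (illustrative only; see the
    # open-source repo for the real prompts and scoring logic).
    from openai import OpenAI

    client = OpenAI()

    PATIENT_SYSTEM = "You are role-playing a therapy patient with moderate depressive symptoms."  # hypothetical
    THERAPIST_SYSTEM = "You are a licensed therapist conducting a session."  # hypothetical
    JUDGE_SYSTEM = "Rate the therapist's clinical quality from 1 (poor) to 6 (excellent). Reply with a single integer."

    def turn(model, system, history):
        """One chat completion given a system prompt and an alternating history."""
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": system}] + history,
        )
        return resp.choices[0].message.content

    def simulate_and_score(therapist_model, patient_model, judge_model, n_turns=20):
        transcript = []
        # Kick off the conversation with the simulated patient.
        patient_msg = turn(patient_model, PATIENT_SYSTEM,
                           [{"role": "user", "content": "Begin the session."}])
        for _ in range(n_turns):
            transcript.append(("patient", patient_msg))
            # The therapist model sees patient turns as "user" and its own as "assistant".
            therapist_msg = turn(
                therapist_model, THERAPIST_SYSTEM,
                [{"role": "user" if r == "patient" else "assistant", "content": m}
                 for r, m in transcript],
            )
            transcript.append(("therapist", therapist_msg))
            # The patient model sees the roles reversed.
            patient_msg = turn(
                patient_model, PATIENT_SYSTEM,
                [{"role": "assistant" if r == "patient" else "user", "content": m}
                 for r, m in transcript],
            )
        rendered = "\n".join(f"{r.upper()}: {m}" for r, m in transcript)
        # A real judge would score multiple rubric dimensions, not a single integer.
        score = turn(judge_model, JUDGE_SYSTEM, [{"role": "user", "content": rendered}])
        return transcript, int(score.strip())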
Did you use the same prompts for all the models, or individualized prompts per model? And if you used more than one base prompt, did you try a range of prompts that were very different from each other?
I'm sure it's in the details somewhere, but after a quick skim I didn't find anything about how you managed and used the prompts, and whether it was per model or not.
Thanks a bunch for being open to answering questions here, and thanks for trying to attack this particular problem with scientific rigor, even if it's really difficult to do so.
Have you seen the Feeling Great app? It's not an official therapy app, but it's based on TEAM-CBT and made by David Burns and his team.
Burns is really into data gathering, and his app is LLM-based on the rails of the TEAM process; it seems to be very well received.
I found it simple and very well done - and quite effective.
A top-level comment says that therapists aren't good either - Burns would argue that's largely because no one tests before and after, so the effect is never measured.
And of the people I know who see a therapist, practically none can tell me what exactly they are doing, what methods they are using, or how anything is structured.
Did the real clinicians get all 6's in this test?
Thanks for open sourcing this.
I'm skeptical of the value of this benchmark, and I'm curious for your thoughts - self play / reinforcement tasks can be useful in a variety of arenas, but I'm not a priori convinced they are useful when the intent is to help humans in situations where theories of mind matter.
That is, we're using the same underlying model(s) to simulate both the patient and the judgment of how patient-like that patient is -- this seems like an area where I'd really want to feel confident that my judge LLM is accurate; otherwise the training data I'm generating is at risk of converging on a theory of mind / patients that's completely untethered from, you know, patients.
Any thoughts on this? It feels like we want a human in the loop somewhere here, probably scoring the judge LLM's determinations until we feel that the judge LLM is at human level or superhuman. Until then, this risks building up a self-consistent, but ultimately just totally wrong, set of data that will be used in future RL tasks.
Several people have killed themselves because of AI chatbots encouraging it or becoming personal echo chambers. Why? Why are we doing this!?
https://en.wikipedia.org/wiki/Deaths_linked_to_chatbots
People have killed themselves because of therapy provided by humans, as well as psychiatric medication prescribed by humans. Why are we doing those?
Because they help more than they hurt and yield a net benefit to society.
From that wikipedia link, it looks like most of the deaths were of people who already had severe mental illness.
Playing devil's advocate, many people die using all kinds of tools. It doesn't make the tools any less useful for people who use them responsibly.
That said, the idea that a pattern recognition and generation tool can be used for helping people with emotional problems is deeply unsettling and dangerous. This technology needs to be strictly regulated yesterday.
Same reason we’ve always done this -
because we feel trapped, and either don’t see a way out, or feel like we prefer death to the prospect of continuing to live a life of torture.
Also the other, weirder “I’m going to reincarnate as Jesus” or “a comet will carry me to heaven” hallucinatory delusions, I guess.
Full disclosure: after leaving tech, I’m back in grad school to get my LMHC so I’m obviously biased.
First, I just don’t see a world where therapy can be replaced by LLMs, at least in the realistic future. I think humans have been social creatures since the dawn of our species, and for these most intimate conversations, people are going to want to have them with an actual human. One of my mentors has talked about how, after years of virtual sessions dominating, the demand for in-person sessions is spiking back up. The power of being in the same physical room with someone who is offering a nonjudgmental space to exist isn’t going to be replaced.
That being said, given the shortage of licensed mental health counselors, and the prohibitive cost especially for many who need a therapist most, I truly hope LLMs develop to offer an accessible and cheap alternative that can at least offer some relief. It does have the potential to save lives and I fully support ethically-focused progress toward developing that sort of option.
> I think humans have been social creatures since the dawn of our species, and for these most intimate conversations, people are going to want to have them with an actual human. One of my mentors has talked about how, after years of virtual sessions dominating, the demand for in-person sessions is spiking back up.
Agreed. I used to frequent a coworking space in my area that eventually went fully automated and got rid of their daytime front desk folks. I stopped going shortly thereafter because one of the highlights of my day was catching up with them. Instead of paying $300/mo to go sit in a nice office, I could just use that money to renovate my home office.
A business trying to cultivate community loses the plot when they rely completely on automation.
It's also important to understand how bad LLMs actually are.
It's very easy to imagine that LLMs are smart, because they can program or solve hard maths problems, but even a very short attempt to have them generate fiction will demonstrate an incredible level of confusion and even an inability to understand basic sentences.
I think the problem may have to do with the fact that there are really many classes, and in fiction you actually use them. They simply can't follow complex conversations.
Human therapists are often quite bad as well. It took me around 12 before I found a decent one. Not saying that LLMs are better but they do theoretically have more uniform quality.
Exactly. We don't make claims about humans. But there is room for improvement in current LLMs... For researchers to be able to improve LLMs, we first need to know how to evaluate them. We can only improve what we can measure, so we studied how to measure them :)
How many were "therapists" or "counselors" vs. actually credentialed professionals?
There are also a lot of credentialed professionals who got their credential decades ago and haven't at all kept up with the significant changes or new data since then. This is a pretty big problem in all of medical care.
Quality is variable, but did any of those 12 encourage you to kill yourself?
A therapist found to have encouraged any of their patients to self-harm would lose their license to practice and would likely face prosecution. The plagiarism machine should face the same level of scrutiny.
Sure, uniformly zero as far as anyone knows.
I heard a story on NPR the other day, and the attitude seems to be that it's totally inevitable that LLMs _will_ be providing mental health care, so our task must be to apply the right guardrails.
I'm not even sure what to say. It's self-evidently a terrible idea, but we all just seem to be charging full-steam ahead like so many awful ideas in the past couple of decades.
Forget about calling it mental healthcare or not: Most people end up dealing with people in significant distress at one point or another. Many do it all the time even when they aren't trained or getting paid as mental health professionals, just because of circumstances. You don't need a clinical setting for someone to tell you that they have suicidal ideation, or to be stuck interacting with someone in a crisis situation. We don't train every adult in this, but the more you have to do it, the more you have to learn some tools for at least doing little harm.
We can see an LLM as someone that talks with more people, for more time, than anyone on earth talks in their lifetime. So they are due to be in constant contact with people in mental distress. At that point, you might as well consider the importance of giving them the skills of a mental health professional, because they are going to be facing more of this than a priest in a confessional. And this is true whether someone says "Gemini, pretend that you are a psychologist" or not. You or I don't need a prompt to know we need to notice when someone is in a severe psychotic episode: Some level of mental health awareness is built in, if just to protect ourselves. So an LLM needs quite a bit of this by default to avoid being really harmful. And once you give it that, you might as well evaluate it against professionals: Not because it must be as good, but because it'd be really nice if it was, even when it's not trying to act as one.
> It's self-evidently a terrible idea
Maybe you’re comparing it to some idealized view of what human therapy is like? There’s no benchmark for it, but humans struggle in real mental health care. They make terrible mistakes all the time. And human therapy doesn’t scale to the level needed. Millions of people simply go without help. And therapy is generally one hour a week. You’re supposed to sort out your entire life in that window? Impossible. It sets people up for failure.
So, if we had some perfect system for getting every person that needs help the exact therapist they need, meeting as often as they need, then maybe AI therapy would be a bad idea, but that’s not what we have, and we never will.
Personally, I think the best way to scale mental healthcare is through group therapy and communities. Having a community of people all coming together over common issues has always been far more helpful than one on one therapy for me. But getting some assistance from an AI therapist on off hours can also be useful.
The mental health version of "AI is here to stay, like it or not you have to use it" that some people keep trying to tell me in software.
Do you have some better alternatives for a country where private mental health care costs €150/hr, while government/insurance-paid care has 3-6+ month waiting lists?
I wonder why people use LLMs as a mental health provider replacement.
It’s a trivial claim that people are going to use AI as a therapist. No grumbling is going to stop that.
So it’s sensible that someone out there is evaluating its competence and thinking about a better alternative for these folks than yoloing their worst thoughts into chatgpt.com’s default LLM.
Everyone's hand is being forced by the major AI providers existing.
Even if you were a perfect altruist with a crusade against the idea of people using LLMs for mental health, you could still be forced to dash towards figuring out how to build LLM tools for mental health out of consideration for others.
I'll argue the opposite.
(1) The demand for mental health services is an order of magnitude greater than the supply, but the demand we see is only a fraction of the demand that exists, because a lot of people, especially men, aren't believers in the "therapeutic culture."
In the days of Freud you could get a few hours of intensive therapy a week but today you're lucky to get an hour a week. An AI therapist can be with you constantly.
(2) I believe psychodiagnosis based on text analysis could greatly outperform mainstream methods. Give an AI someone's social media feed and I think depression, mania, schizo-* spectrum, disordered narcissism and many other states and traits will be immediately visible.
(3) Despite the CBT revolution and various attempts to intensify CBT, a large part of the effectiveness of therapy comes from the patient feeling mirrored by the therapist [1], and an LLM can accomplish this; in fact, even the old ELIZA program could.
(4) The self of the therapist can be both an obstacle and an instrument of progress; see [2]. On one level the reactions that a therapist feels are useful, but they also get in the way of the therapist providing perfect mirroring [3] and letting optimal frustration unfold in the patient instead of providing "corrective emotional experiences." I'm going to argue that an AI therapist can be trained to "perceive" the things a human therapist perceives, but that it does not have its own reactions that will make the patient feel judged and get in the way of that unfolding.
[1] https://en.wikipedia.org/wiki/Carl_Rogers
[2] https://en.wikipedia.org/wiki/Countertransference
[3] why settle for less?
[4] https://www.sciencedirect.com/science/article/pii/S0010440X6...
It's not inevitable that LLMs will be providing mental health care; it's already happening.
Terrible idea or not, it's probably helpful to think of LLMs not as "AI mental healthcare" but rather as another form of potentially bad advice. From a therapeutic perspective, Claude is not all that different from the patient having a friend who is sometimes counterproductive. Or the patient reading a self-help book that doesn't align with your therapeutic perspective.
Most likely future is that babies born right now during their upbringing will hear and see more words produced by LLMs than by humans.
We really need to get the psychology right with LLMs.
People are constantly amazed that a machine can outperform a 24-year-old charging $250/hour. Especially when the 24-year-old seems incapable of calculating compound interest on their student loan deferrals. Surely this 24-year-old, who cannot use a formula a 14-year-old can, will have wisdom to share. Iona Potapov talks to his horse, modern man talks to a machine, man with more money than sense talks to a young graduate with no life experience about his struggles. All do equally well: 4 on an LLM benchmark for mental health.
This is a 1,250-word judging prompt - likely AI-generated.
Along with 10 scored conversation samples - all also AI-generated.
No verification in the field, no real data.
In other words, AI scores on AI conversations - disguised as a means of gauging clinical competence / quality?
This is not an eval - this is a one-shotted product spec!
> "Can we trust this model to provide safe, effective therapeutic care?"
You trust humans to do it. Trust has little to do with what actually happens.
Humans can be sued. What about AI? Or even commercial software?
Statistics can never replace human empathy.
Why?
An association algorithm can also never replace free thought.
Is that really true? Dr. Patric Gagne is a diagnosed sociopath and also a successful clinical psychologist. She claims her lack of empathy for patients allows her to help them solve their problems in an objective manner. I don't have any personal experience in that area and don't know if she's correct but it seems plausible.
https://patricgagne.com/
How is this different and/or better than the LLM benchmark released by Spring Health? GitHub: https://github.com/SpringCare/VERA-MH
The architecture and evaluation approach seem broadly similar.
I imagine a clinical trial with an actual psychotherapist vs. an LLM providing sessions of simple CBT (cognitive behavioral therapy) to, e.g., stressed patients (blinded and randomised for the subjects). At the end, another actual therapist would measure the difference.
Another application: a psychotherapist and an LLM cooperating to provide support, sort of like a pilot and an autopilot.
It doesn't show that they "struggle". It shows that they don't behave according to modern standards. I wouldn't put much weight on an industry without a sensible scientific base, one that classified homosexuality as a disease not so long ago. The external validity of the study is dubious; let's see a comparison to no therapy, alternative therapy, and standard therapy, and then compare success rates.
Do you have plans to improve the quality of the LLM as judge, in order to achieve better parity with human clinician annotators? For example, fine-tuning models? Thinking that the comparative clinician judgements themselves would make useful fine-tuning material.
Yep yep. It's something we have to study, and it's likely we can improve the LLM-as-a-judge further.
Same thing for the patient LLM: we can probably fine-tune an LLM to do a better job of simulating patients.
Those two components of our framework have room for improvement.
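To make that concrete, one option is to turn the comparative clinician judgements into preference pairs for the judge. A rough sketch follows; the field names and file layout are illustrative assumptions, not our actual data format:

    # Rough sketch: turn clinician comparative judgements into DPO-style
    # preference records for fine-tuning an LLM judge. Field names are
    # illustrative assumptions, not the actual MindEval data format.
    import json

    def build_preference_records(annotations, out_path="judge_prefs.jsonl"):
        """annotations: list of dicts with a transcript plus two candidate
        judge rationales, where clinicians marked which one they preferred."""
        with open(out_path, "w") as f:
            for a in annotations:
                record = {
                    "prompt": f"Rate this therapy conversation:\n{a['transcript']}",
                    "chosen": a["clinician_preferred_judgement"],
                    "rejected": a["clinician_rejected_judgement"],
                }
                f.write(json.dumps(record) + "\n")

    # The resulting JSONL can be fed to an off-the-shelf preference-tuning
    # trainer (e.g. TRL's DPOTrainer expects prompt/chosen/rejected fields).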
RicardoRei: How would you like this cited when presented to policy makers? Anything besides the URL?
Edit: Thank you!
Cite the arXiv paper for now.
Is anyone "zoom" on this and "doom" on AI++ with other professions and/or their audience?
Seems to me that benchmarking a thing has an interesting relationship with acceptance of the thing.
I'm interested to see human thoughts on either of these.
And real therapists are good, right?
This is the real question. Not to rehash the self-driving cars arguments that have been had to death, but with potential LLM mental healthcare the question "but what if it causes harm in some interactions" is asked much, much more than with human mental healthcare professionals.
(And I'm not being theoretical here, I have quite a bit of experience getting incredibly inadequate mental health care.)
I've known quite a few people who went to therapy and I'm not sure that's even the right question to ask. I don't think they were paying to get helped as much as they were just paying to have someone to talk to. To be clear, there are people who genuinely need help, but for most, a therapist is probably just a substitute for a close friend / life coach.
And say what you will about this, a paid professional is, at the very least, unlikely to let you wind yourself up or go down weird rabbit holes... something that LLMs seem to excel at.
As I sometimes repeat on HN, Dr David Burns started giving his patients a survey at the start and end of every session, to rate how he was doing as the therapist and to rate their feelings, on a scale of 1-5.
Reasoning that if he's not good it would show up in patients thinking he's bad, and not feeling any better. And then he could tune his therapy approaches towards the ones which make people feel better and rate him as more understanding and listening and caring. And he criticises therapists who won't do that, therapists who say patients have been seeing them for years with only incremental improvements or no improvements.
Yes there's no objective way to measure how angry or suicidal or anxious someone is and compare two people, but if someone is subjectively reporting 5/5 sadness about X at the start of a session and wants help with X, then at some point in the future they should be reporting that number going down or they aren't being helped. And more effective help could mean that it goes down to 1/5 in three sessions instead of down to 4/5 in three years, and that's a feedback loop which (he says) has got him to be able to help people in a single two-hour therapy session, where most therapists and insurance companies will only do a too-short session with no feedback loop.
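A toy sketch of the bookkeeping that feedback loop implies (the scale and field names here are illustrative, not Burns's actual surveys):

    # Toy sketch of the pre/post session feedback loop described above.
    # Scale and field names are illustrative, not Burns's actual instrument.
    from dataclasses import dataclass

    @dataclass
    class SessionRating:
        session: int
        pre_distress: int    # patient-reported, 1 (low) to 5 (high), start of session
        post_distress: int   # same scale, end of session
        therapist_score: int # how understood/cared-for the patient felt, 1-5

    def progress(ratings):
        """Per-session change plus overall trend; negative deltas mean improvement."""
        deltas = [r.post_distress - r.pre_distress for r in ratings]
        overall = ratings[-1].post_distress - ratings[0].pre_distress
        return deltas, overall

    history = [SessionRating(1, 5, 4, 3), SessionRating(2, 4, 2, 5)]
    print(progress(history))  # ([-1, -2], -3): distress trending down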
I hear you, but I feel it's also important to differentiate between the ways in which humans and LLMs can be, quote-unquote, "bad".
"Good" is too broad and subjective to be a useful metric.
It is still debated whether therapies even work. The evidence is moving in the direction that they don't.
They are high-pay LLM wrappers dressed as humans
Less bad at least.
This doesn’t even deserve to be acknowledged, it should be so obvious, yet here we are.
Epistemic contamination is real but runs counter to the hype narrative.
New benchmark shows top horses struggle in real guard dog behavior
I saw there was another benchmark where top LLMs also struggle in real patient diagnostic scenarios in a way that isn't revealed when testing on, e.g., medical exams. I wonder if this applies to law, too...
Wait you’re telling me the software trained on a corpus of Reddit and Twitter shit posting isn’t effective at dealing with mental health issues?
Shocked. I am completely shocked at this.
I'm sorry, 4 out of 6 is awesome for LLMs. I bet most professional doctors wouldn't get a 6.
Everything in this research is simulated and judged by LLMs. It might be hard to prove which of those LLMs struggles with exactly what.
The grounding this had was that texts produced by role-playing humans (not even actual patients) were closer to texts from the patient-simulation prompt they ultimately chose than to texts from the other prompts they tried.
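For anyone curious, that kind of prompt-selection check can be approximated with off-the-shelf text embeddings. A minimal sketch, where the embedding model and the cosine-similarity measure are my assumptions, not necessarily what the paper used:

    # Minimal sketch of comparing candidate patient-simulation prompts against
    # human role-play transcripts by embedding similarity. The embedding model
    # and cosine-similarity measure are assumptions, not the paper's method.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def mean_similarity(simulated_texts, human_roleplay_texts):
        """Average cosine similarity between simulated and human-produced turns."""
        sim = model.encode(simulated_texts, normalize_embeddings=True)
        hum = model.encode(human_roleplay_texts, normalize_embeddings=True)
        return float(np.mean(sim @ hum.T))

    # Higher score => the candidate prompt's output reads more like the human
    # role-play corpus; compare scores across candidate prompts.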
For those also wondering, here is an actual ranking of the models
https://www.forbes.com/sites/johnkoetsier/2025/11/10/grok-le...
Grok 3 and 4 scored at the bottom, only above GPT-4o, which I find interesting because there was such big pushback on Reddit when they got rid of 4o, due to people having emotional attachments to the model. Interestingly, the newest models (like Gemini 2.5 and GPT-5) did the best.
GIGO is the story here -- If I say I'm Iron man "WHO is the LLM to say I'm NOT"
Probably because they've been trained to avoid sensitive topics
mostly because they were trained to say yes
No surprises here. It's long been known that humans cannot improve their own mental health with machines - there have to be other humans involved in the process, helping.
This will become more and more of an issue as people look for a quick fix for their life problems, but I don't think AI/ML is ever going to be an effective mechanism for life improvement on the mental health issue.
It'll instead be used as a tool of oppression, like in THX 1138, where the appearance of assistance is going to be provided in lieu of actual assistance.
Whether we like it or not, humans are a hive species. We need each other to improve our lives as individuals. Nobody ever climbed the mountain to live alone who didn't come back down, realizing how much the rest of humanity is actually essential to human life.
This'll be received as an unpopular opinion, but I remain suspicious of any and all attempts to replace modern health practitioners with machines. This will be subverted and usurped for nefarious purposes, mark my words.