"Can AI do math for us" is the canonical wrong question. People want self-driving cars so they can drink and watch TV. We should crave tools that enhance our abilities, as tools have done since prehistoric times.
I'm a research mathematician. In the 1980s I'd ask everyone I knew a question, and flip through the hardbound library volumes of Mathematical Reviews, hoping to recognize something. If I was lucky, I'd get a hit in three weeks.
Internet search has shortened this turnaround. One instead needs to guess what someone else might call an idea. "Broken circuits?" Score! Still, time-consuming.
I went all in on ChatGPT after hearing that Terry Tao had learned the Lean 4 proof assistant in a matter of weeks, relying heavily on AI advice. It's clumsy, but a very fast way to get suggestions.
Now, one can hold involved conversations with ChatGPT or Claude, exploring mathematical ideas. AI is often wrong, never knows when it's wrong, but people are like this too. Have you read how insurance incident rates for self-driving taxis are well below human rates? Talking to fellow mathematicians can be frustrating, and so is talking with AI, but AI conversations go faster and can take place in the middle of the night.
I don't want AI to prove theorems for me, those theorems will be as boring as most of the dreck published by humans. I want AI to inspire bursts of creativity in humans.
> AI is often wrong, never knows when it's wrong, but people are like this too.
When talking with various models of ChatGPT about research math, my biggest gripe is that it's either confidently right (10% of my work) or confidently wrong (90%). A human researcher would be right 15% of the time, unsure 50% of the time, and give helpful ideas that are right/helpful (25%) or wrong/a red herring (10%). And only 5% of the time would a good researcher be confidently wrong in a way that ChatGPT is often.
In other words, ChatGPT completely lacks the meta-layer of "having a feeling/knowing how confident it is", which is so useful in research.
These numbers are just your perception. How you ask the question will very much influence the output, some topics more than others. I get much better results when I share my certainty levels in my questions and say things like "if at all", "if any" etc.
I agree with this approach and use it myself, but these confidence markers can also skew output in undesirable ways. All of these heuristics are especially fragile when the subject matter touches the frontiers of what is known.
In any case my best experiences with LLMs for pure math research have been for exploring the problem space and ideation -- queries along the line of "Here's a problem I'm working on ... . Do any other fields have a version of this problem, but framed differently?" or "Give me some totally left field methods, even if they are from different fields or unlikely to work. Assume I've exhausted all the 'obvious' approaches from field X"
Yeah, blame the users for "using it wrong" (phrase of the week I would say after the o3 discussions), and then sell the solution as almost-AGI.
PS: I'm starting to see a lot of plausible deniability in some comments about LLM capabilities. When LLMs do great => "cool, we are scaling AI". When LLMs do something wrong => "user problem", "skill issues", "don't judge a fish for its ability to fly".
I think it is every sci-fi dreamer's dream to teach a robot to love.
I don't think AI will think conventionally. It isn't thinking to begin with. It is weighing options. Those options permutate and that is why every response is different.
I agree. I think it comes down to the motivation behind why one does mathematics (or any other field for that matter). If it's a means to an end, then sure have the AI do the work and get rid of the researchers. However, that's not why everyone does math. For many it's more akin to why an artist paints. People still paint today even though a camera can produce much more realistic images. It was probably the case (I'm guessing!) that there was a significant drop in jobs for artists-for-hire, for whom painting was just a means to an end (e.g. creating a portrait), but the artists who were doing it for the sake of art survived and were presumably made better by the ability to see photos of other places they want to paint or art from other artists due to the invention of the camera.
> People want self-driving cars so they can drink and watch TV. We should crave tools that enhance our abilities, as tools have done since prehistoric times.
Improved tooling and techniques have given humans the free time and resources needed for arts, culture, philosophy, sports, and spending time to enjoy life! Fancy telecom technologies have allowed me to work from home and I love it :)
> Talking to fellow <humans> can be frustrating, and so is talking with AI, but AI conversations go faster and can take place in the middle of the night.
I made a slight change to generalise your statement, I think you have summarised the actual marketing opportunity.
I think I'm missing your point? You still want to enjoy doing math yourself? Is that what you are saying? So you equate "Can AI do math in my place?" with "Can AI drink and watch TV in my place?"
Ingredients for a top HN comment on AI: some nominal expert explaining why actually labor won't be replaced and it will be a collaborative process so you don't need to worry, sprinkled with a little bit of 'the status quo will stay still even though this tech only appeared in the last 2 years'.
AI will not do math for us, but maybe eventually it will lead to another mainstream tool for mathematicians. Along with R, Matlab, Sage, GAP, Magma, ...
It would be interesting if in the future mathematicians are just as fluent in some (possibly AI-powered) proof verifying tool, as they are with LaTeX today.
Can AI solve “toy” math problems that computers have not been able to do? Yes. Can AI produce novel math research? No, it hasn’t yet. So “AI will not do math for us” is only factually wrong if you take the weaker definition of “doing math for us”. The stronger definition is not factually wrong yet.
More problematic with that statement is that a timeline isn’t specified. 1 year? Probably not. 10 years? Probably. 20 years? Very likely. 100 years? None of us here will be alive to be proven wrong but I’ll venture that that’s a certainty.
This is a pretty strong position to take in the comments of a post where a mathematician declared the 5 problems he'd seen to be PhD level, and speculated that the real difficulty with switching from numerical answers to proofs will be finding humans qualified to judge the AI's answers.
I will agree that it's likely none of us here will be alive to be proven wrong, but that's in the 1 to 10 year range.
Your optimism should be tempered with the downside of progress: AI in the near future may not only inspire creativity in humans, it may replace human creativity altogether.
Why do I need to hire an artist for my movie/video game/advertisement when AI can replicate all the creativity I need?
There was a little more information in that reddit thread. Of the three difficulty tiers, 25% are T1 (easiest) and 50% are T2. Of the five public problems that the author looked at, two were T1 and two were T2. Glazer on reddit described T1 as "IMO/undergraduate problems", but the article author says that they don't consider them to be undergraduate problems. So the LLM is already doing what the author says they would be surprised about.
Also Glazer seemed to regret calling T1 "IMO/undergraduate", and not only because of the disparity between IMO and typical undergraduate. He said that "We bump problems down a tier if we feel the difficulty comes too heavily from applying a major result, even in an advanced field, as a black box, since that makes a problem vulnerable to naive attacks from models"
The reddit thread is ... interesting (direct link[1]). It seems to be a debate among mathematicians, some of whom do have access to the secret set. But they're debating publicly and so naturally avoiding any concrete examples that would give the set away, so they wind up with fuzzy, fiddly language for the qualities of the problem tiers.
The "reality" of keeping this stuff secret 'cause someone would train on it is itself bizarre and certainly shouldn't be above questioning.
It's not about training directly on the test set, it's about people discussing questions in the test set online (e.g., in forums), and then this data is swept up into the training set. That's what makes test set contamination so difficult to avoid.
That is the "reality" - that because companies can train their models on the whole Internet, companies will train their (base) models on the entire Internet.
And in this situation, "having heard the problem" actually serves as a barrier to understanding these harder problems, since any variation of a known problem will receive a standard "half-assed guesstimate".
And these companies "can't not" use these base models since they're resigned to the "bitter lesson" (better the "bitter lesson viewpoint" imo) that they need large scale heuristics for the start of their process and only then can they start symbolic/reasoning manipulations.
But hold up! Why couldn't an organization freeze their training set and their problems and release both to the public? That would give us an idea where the research stands. Ah, the answer comes out: 'cause they don't own the training set, and the model they want to train is a commercial product that needs every drop of data to be the best. As Yann LeCun has said, this isn't research, this is product development.
>> It's not about training directly on the test set, it's about people discussing questions in the test set online
Don't kid yourself. There are tens of billions of dollars going into AI. Some of the humans involved would happily cheat on comparative tests to boost investment.
The incentives are definitely there, but even CEOs and VCs know that if they cheat the tests just to get more investment, they're only cheating themselves. No one is liquidating within the next 5 years so either they end up getting caught and lose everything or they spent all this energy trying to cheat while having a subpar model which results in them losing to competitors who actually invested in good technology.
Having a higher valuation could help with attracting better talent or more funding to invest in GPUs and actual model improvements but I don't think that outweighs the risks unless you're a tiny startup with nothing to show (but then you wouldn't have the money to bribe anyone).
> So the LLM is already doing what the author says they would be surprised about.
That's if you unconditionally believe the result without any proofreading, confirmation, or reproducibility, and with barely any details (we are given only one slide).
I just spent a few days trying to figure out some linear algebra with the help of ChatGPT. It's very useful for finding conceptual information from literature (which for a not-professional-mathematician at least can be really hard to find and decipher). But in the actual math it constantly makes very silly errors. E.g. indexing a vector beyond its dimension, trying to do matrix decomposition for scalars and insisting on multiplying matrices with mismatching dimensions.
o1 is a lot better at spotting its errors than 4o, but it too still makes a lot of really stupid mistakes. It seems to be quite far from producing results consistently by itself, without at least a somewhat clueful human doing hand-holding.
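One cheap habit I've picked up (my own, not something the model suggests): paste the proposed shapes into numpy, which refuses a mismatched product immediately. A minimal sketch, with made-up shapes:

    import numpy as np

    # Shapes from a hypothetical LLM-suggested product A @ B
    A = np.random.rand(3, 4)
    B = np.random.rand(2, 5)

    try:
        A @ B  # inner dimensions 4 and 2 disagree
    except ValueError as err:
        print("dimension mismatch caught:", err)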
LLMs have been very useful for me in explorations of linear algebra, because I can have an idea and say "what's this operation called?" or "how do I go from this thing to that thing?", and it'll give me the mechanism and an explanation, and then I can go read actual human-written literature or documentation on the subject.
It often gets the actual math wrong, but it is good enough at connecting the dots between my layman's intuition and the "right answer" that I can get myself over humps that I'd previously have been hopelessly stuck on.
It does make those mistakes you're talking about very frequently, but once I'm told that the thing I'm trying to do is achievable with the Gram-Schmidt process, I can go self-educate on that further.
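For anyone curious, here's a minimal numpy sketch of classical Gram-Schmidt, the kind of thing I then go read up on properly (my own illustration, not the model's output):

    import numpy as np

    def gram_schmidt(vectors, tol=1e-12):
        """Orthonormalize the rows of `vectors` via classical Gram-Schmidt."""
        basis = []
        for v in vectors:
            # Subtract the projections onto the basis built so far
            w = v - sum(np.dot(v, b) * b for b in basis)
            if np.linalg.norm(w) > tol:
                basis.append(w / np.linalg.norm(w))
        return np.array(basis)

    Q = gram_schmidt(np.array([[3.0, 1.0], [2.0, 2.0]]))
    print(np.round(Q @ Q.T, 6))  # should be close to the identity matrix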
The big thing I've had to watch out for is that it'll usually agree that my approach is a good or valid one, even when it turns out not to be. I've learned to ask my questions in the shape of "how do I", rather than "what if I..." or "is it a good idea to...", because most of the time it'll twist itself into shapes to affirm the direction I'm taking rather than challenging and refining it.
It also reliably fails basic real analysis proofs, but I think this is not too surprising since those require a mix of logic and computation that is likely hard to infer from the statistical likelihood of tokens.
When you give it a large math problem and the answer is "seven point one three five ... ", and it shows a plot of the result v some randomly selected domain, well there could be more I'd like to know.
You can unlock a full derivation of the solution, for cases where you say "Solve" or "Simplify", but what I (and I suspect GP) might want, is to know why a few of the key steps might work.
It's a fantastic tool that helped get me through my (engineering) grad work, but ultimately the breakthrough inequalities that helped me write some of my best stuff were out of a book I bought in desperation that basically cataloged linear algebra known inequalities and simplifications.
When I try that kind of thing with the best LLM I can use (as of a few months ago, albeit), the results can get incorrect pretty quickly.
> [...], but what I (and I suspect GP) might want, is to know why a few of the key steps might work.
It's been some time since I've used the step-by-step explainer, and it was for calculus or intro physics problems at best, but IIRC the pro subscription will at least mention the method used to solve each step and link to reference materials (e.g., a clickable tag labeled "integration by parts").
Doesn't exactly explain why but does provide useful keywords in a sequence that can be used to derive the why.
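As an aside, you can get something similar outside Wolfram with sympy, assuming I'm remembering its manualintegrate module correctly: `integrate` gives the answer, and `integral_steps` names the rule applied at each step (integration by parts in the sketch below), which at least hands you the keywords for the "why".

    import sympy as sp
    from sympy.integrals.manualintegrate import integral_steps

    x = sp.symbols('x')
    f = x * sp.exp(x)

    print(sp.integrate(f, x))    # (x - 1)*exp(x): the answer, but not the why
    print(integral_steps(f, x))  # nested rule object naming the steps (parts rule here);
                                 # exact output format depends on the sympy version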
Its understanding of problems was very bad last time I used it. Meaning it was difficult to communicate what you wanted it to do. Usually I try to write in the Mathematica language, but even that is not foolproof.
Hopefully they have incorporated a more modern LLM since then, but it hasn't been that long.
Wolfram Alpha's "smartness" is often Clippy level enraging. E.g. it makes assumptions of symbols based on their names (e.g. a is assumed to be a constant, derivatives are taken w.r.t. x). Even with Mathematica syntax it tends to make such assumptions and refuses to lift them even when explicitly directed. Quite often one has to change the variable symbols used to try to make Alpha to do what's meant.
What's surprising to me is that this would surely be in OpenAI's interests, too -- free RLHF!
Of course there would be the risk of adversaries giving bogus feedback, but my gut says it's relatively straightforward to filter out most of this muck.
Wolfram Alpha can solve equations well, but it is terrible at understanding natural language.
For example I asked Wolfram Alpha "How heavy a rocket has to be to launch 5 tons to LEO with a specific impulse of 400s", which is a straightforward application of the Tsiolkovsky rocket equation. Wolfram Alpha gave me some nonsense about particle physics (result: 95 MeV/c^2), GPT-4o did it right (result: 53.45 tons).
Wolfram Alpha knows about the Tsiolkovsky rocket equation, it knows about LEO (low earth orbit), but I found no way to get a delta-v out of it; again, more nonsense. It tells me about Delta airlines, mentions satellites that it knows are not in LEO. The "natural language" part is a joke. It is more like an advanced calculator, and for that, it is great.
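For reference, once the question is parsed the calculation itself is a few lines. A sketch, assuming a delta-v to LEO of roughly 9.3 km/s and treating the 5 tons as the final mass (both are my assumptions, not stated in the prompt):

    import math

    payload_t = 5.0      # final mass in tons (payload + dry mass, assumed)
    isp_s = 400.0        # specific impulse in seconds
    g0 = 9.81            # m/s^2
    delta_v = 9300.0     # m/s, rough delta-v to reach LEO (assumption)

    v_e = isp_s * g0                      # effective exhaust velocity
    mass_ratio = math.exp(delta_v / v_e)  # m0/mf from delta_v = v_e * ln(m0/mf)
    print(payload_t * mass_ratio)         # ~53.5 t, in the same ballpark as GPT-4o's 53.45 t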
You know, "You're using it wrong" is usually meant to carry an ironic or sarcastic tone, right?
It dates back to Steve Jobs blaming an iPhone 4 user for "holding it wrong" rather than acknowledging a flawed antenna design that was causing dropped calls. The closest Apple ever came to admitting that it was their problem was when they subsequently ran an employment ad to hire a new antenna engineering lead. Maybe it's time for Wolfram to hire a new language-model lead.
No, “holding it wrong” is the sarcastic version. “You’re using it wrong” is a super common way to tell people they are literally using something wrong.
The problem has always been that you only get good answers if you happen to stumble on a specific question that it can handle. Combining Alpha with an LLM could actually be pretty awesome, but I'm sure it's easier said than done.
Before LLMs exploded nobody really expected WA to perform well at natural language comprehension. The expectations were at the level of "an ELIZA that knows math".
Wolfram Alpha is mostly for "trivia" type problems. Or giving solutions to equations.
I was figuring out some mode decomposition methods such as ESPRIT and Prony and how to potentially extend/customize them. Wolfram Alpha doesn't seem to have a clue about such.
Probably mostly not. The errors tend to be logical/conceptual. E.g. mixing up scalars and matrices is unlikely to be from tokenization. Especially if using spaces between the variables and operators, as AFAIK GPTs don't form tokens over spaces (although tokens may start or end with them).
The only thing I've consistently had issues with while using AI is graphs. If I ask it to plot some simple function, it produces a really weird image that has nothing to do with the graph I want. It will be a weird swirl of lines and words, and it never corrects itself no matter what I say to it.
Has anyone had any luck with this? It seems like the only thing that it just can't do.
And it works very well - it made me a nice general "draw successively accurate Fourier series approximations given this lambda for coefficients and this lambda for the constant term". PNG output, no real programming errors (I wouldn't remember if it had some stupid error, I'm a Python programmer). Even TikZ in LaTeX isn't hopeless (although I did end up reading the TikZ manual).
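In case it's useful, here's a stripped-down version of that kind of request (my own reconstruction, not the model's output): one lambda for the coefficients, one for the constant term, and a plot of successive partial sums.

    import numpy as np
    import matplotlib.pyplot as plt

    # Square-wave example: constant term 0, sine coefficients 4/(pi*n) for odd n
    const = lambda: 0.0
    coeff = lambda n: 0.0 if n % 2 == 0 else 4.0 / (np.pi * n)

    x = np.linspace(-np.pi, np.pi, 1000)
    for terms in (1, 3, 9, 27):
        partial = const() + sum(coeff(n) * np.sin(n * x) for n in range(1, terms + 1))
        plt.plot(x, partial, label=f"{terms} terms")

    plt.legend()
    plt.savefig("fourier_partial_sums.png")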
Ask it to plot the graph with python plotting utilities. Not using its image generator. I think you need a ChatGPT subscription though for it to be able to run python code.
You seem to get 2(?) free Python program runs per week(?) as part of the o1 preview.
When you visit chatgpt on the free account it automatically gives you the best model and then disables it after some amount of work and says to come back later or upgrade.
It was, for a while. I think this is an area where there may have been some regression. It can still write code to solve problems that are a poor fit for the language model, but you may need to ask it to do that explicitly.
The agentic reasoning models should be able to fix this if they have the ability to run code instead of giving each task to itself. "I need to make a graph" "LLMs have difficulty graphing novel functions" "Call python instead" is a line of reasoning I would expect after seeing what O1 has come up with on other problems.
Giving AI the ability to execute code is the safety people's nightmare though; I wonder if we'll hear anything from them, as this is surely coming.
Abstract: In the coming decades, developments in automated reasoning will likely transform the way that research mathematics is conceptualized and carried out. I will discuss some ways we might think about this. The talk will not be about current or potential abilities of computers to do mathematics—rather I will look at topics such as the history of automation and mathematics, and related philosophical questions.
That was wonderful, thank you for linking it. For the benefit of anyone who doesn't have time to watch the whole thing, here are a few really nice quotes that convey some main points.
"We might put the axioms into a reasoning apparatus like the logical machinery of Stanley Jevons, and see all geometry come out of it. That process of reasoning are replaced by symbols and formulas... may seem artificial and puerile; and it is needless to point out how disastrous it would be in teaching and how hurtful to the mental development; how deadening it would be for investigators, whose originality it would nip in the bud. But as used by Professor Hilbert, it explains and justifies itself if one remembers the end pursued." Poincare on the value of reasoning machines, but the analogy to mathematics once we have theorem-proving AI is clear (that the tools and the lie direct outputs are not the ends. Human understanding is).
"Even if such a machine produced largely incomprehensible proofs, I would imagine that we would place much less value on proofs as a goal of math. I don't think humans will stop doing mathematics... I'm not saying there will be jobs for them, but I don't think we'll stop doing math."
"Mathematics is the study of reproducible mental objects." This definition is human ("mental") and social (it implies reproducing among individuals). "Maybe in this world, mathematics would involve a broader range of inquiry... We need to renegotiate the basic goals and values of the discipline." And he gives some examples of deep questions we may tackle beyond just proving theorems.
As someone who has an 18 yo son who wants to study math, this has me (and him)
... worried ... about becoming obsolete?
But I'm wondering what other people think of this analogy.
I used to be a bench scientist (molecular genetics).
There were world class researchers who were more creative than I was. I even had a Nobel Laureate once tell me that my research was simply "dotting 'i's and crossing 't's".
Nevertheless, I still moved the field forward in my own small ways. I still did respectable work.
So, will these LLMs make us completely obsolete? Or will there still be room for those of us who can dot the "i"?--if only for the fact that LLMs don't have infinite time/resources to solve "everything."
I don't know. Maybe I'm whistling past the graveyard.
I was just thinking about this. I already posted a comment here, but I will say, as a mathematician (PhD in number theory), that for me AI significantly takes away the beauty of doing mathematics within a realm in which AI is used.
The best part of math (again, just for me) is that it was a journey that was done by hand with only the human intellect that computers didn't understand. The beauty of the subject was precisely that it was a journey of human intellect.
As I said elsewhere, my friends used to ask me why something was true and it was fun to explain it to them, or ask them and have them explain it to me. Now most will just use some AI.
Soulless, in my opinion. Pure mathematics should be about the art of the thing, not producing results on an assembly line like it will be with AI. Of course, the best mathematicians are going into this because it helps their current careers, not because it helps the future of the subject. Math done with AI will be a lot like Olympic running done with performance-enhancing drugs.
Yes, we will get a few more results, faster. But the results will be entirely boring.
There are many similarities in your comment to how grandmasters discuss engines. I have a hunch the arc of AI in math will be very similar to the arc of engines in chess.
I agree with that, in the sense that math will become more about who can use AI the fastest to generate the most theories, which sort of side-steps the whole point of math.
As a chess aficionado and a former tournament player, who didn’t get very far, I can see pros & cons. They helped me train and get significantly better than I would’ve gotten without them. On the other hand, so did the competition. :) The average level of the game is so much higher than when I was a kid (30+ years ago) and new ways of playing that were unthinkable before are possible now. On the other hand cheating (online anyway) is rampant and all the memorization required to begin to be competitive can be daunting, and that sucks.
Hey I play chess too. Not a very good player though. But to be honest, I enjoy playing with people who are not serious because I do think an overabundance of knowledge makes the game too mechanical. Just my personal experience, but I think the risk of cheaters who use programs and the overmechanization of chess is not worth becoming a better player. (And in fact, I think MOST people can gain satisfaction by improving just by studying books and playing. But I do think that a few who don't have access to opponents benefit from a chess-playing computer).
I agree wholeheartedly about the beauty of doing mathematics. I will add though that the author of this article, Kevin Buzzard, doesn't need to do this for his career and from what I know of him is somebody who very much cares about mathematics and the future of the subject. The fact that a mathematician of that calibre is interested in this makes me more interested.
We also seem to suffer these automation delusions right now.
I could see how AI could assist me with learning pure math but the idea AI is going to do pure math for me is just absurd.
Not only would I not know how to start, more importantly I have no interest in pure math. There will still be a huge time investment to get up to speed with doing anything with AI and pure math.
You have to know what questions to ask. People with domain knowledge seem to really be selling themselves short. I am not going to randomly stumble on a pure math problem prompt when I have no idea what I am doing.
Presumably people who get into math going forward will feel differently.
For myself, chasing lemmas was always boring — and there’s little interest in doing the busywork of fleshing out a theory. For me, LLMs are a great way to do the fun parts (conceptual architecture) without the boring parts.
And I expect we’ll such much the same change as with physics: computers increase the complexity of the objects we study, which tend to be rather simple when done by hand — eg, people don’t investigate patterns in the diagrams of group(oids) because drawing million element diagrams isn’t tractable by hand. And you only notice the patterns in them when you see examples of the diagrams at scale.
Just a counterpoint, but I wonder how much you'll really understand if you can't even prove the whole thing yourself. Personally, I learn by proving but I guess everyone is different.
My hunch is it won't be much different, even when we can simply ask a machine that doesn't have a cached proof, "prove riemann hypothesis" and it thinks for ten seconds and spits out a fully correct proof.
As Erdos(I think?) said, great math is not about the answers, it's about the questions. Or maybe it was someone else, and maybe "great mathematicians" rather than "great math". But, gist is the same.
"What happens when you invent a thing that makes a function continuous (aka limit point)"? "What happens when you split the area under a curve into infinitesimal pieces and sum them up"? "What happens when you take the middle third out of an interval recursively"? "Can we define a set of axioms that underlie all mathematics"? "Is the graph of how many repetitions it takes for a complex number to diverge interesting"? I have a hard time imagining computers would ever have a strong enough understanding of the human experience with mathematics to even begin pondering such questions unprompted, let alone answer them and grok the implications.
Ultimately the truths of mathematics, the answers, soon to be proved primarily by computers, already exist. Proving a truth does not create the truth; the truth exists independent of whether it has been proved or not. So fundamentally math is closer to archeology than it may appear. As such, AI is just a tool to help us dig with greater efficiency. But it should not be considered or feared as a replacement for mathematicians. AI can never take away the enlightenment of discovering something new, even if it does all the hard work itself.
> I have a hard time imagining computers would ever have a strong enough understanding of the human experience with mathematics to even begin pondering such questions unprompted, let alone answer them and grok the implications.
The key is that the good questions however come from hard-won experience, not lazily questioning an AI.
Even current people will feel differently. I don't bemoan the fact that Lean/Mathlib has `simp` and `linarith` to automate trivial computations. A "copilot for Lean" that can turn "by induction, X" or "evidently Y" into a formal proof sounds great.
The trick is teaching the thing how high-powered a theorem to use, or whether to factor out details, depending on the user's level of understanding. We'll have to find a pedagogical balance (e.g. you don't give `linarith` to someone practicing basic proofs), but I'm sure it will be a great tool to aid human understanding.
A tool to help translate natural language to formal propositions/types also sounds great, and could help more people to use more formal methods, which could make for more robust software.
I think it will become apparent how bad they are at it. They're algorithms and not sentient beings. They do not think of themselves, their place in the world, and do not fathom the contents of the minds of others. They do not care what others think of them.
Whatever they write only happens to contain some truth by virtue of the model and the training data. An algorithm doesn’t know what truth is or why we value it. It’s a bullshitter of the highest calibre.
Then comes the question: will they write proofs that we will consider beautiful and elegant, that we will remember and pass down?
Or will they generate what they’ve been asked to and nothing less? That would be utterly boring to read.
If you looked at how the average accountant spent their time before the arrival of the digital spreadsheet, you might have predicted that automated calculation would make the profession obsolete. But it didn't.
This time could be different, of course. But I'll need a lot more evidence before I start telling people to base their major life decisions on projected technological change.
That's before we even consider that only a very slim minority of the people who study math (or physics or statistics or biology or literature or...) go on to work in the field of math (or physics or statistics or biology or literature or...). AI could completely take over math research and still have next to no impact on the value of the skills one acquires from studying math.
Or if you want to be more fatalistic about it: if AI is going to put everyone out of work then it doesn't really matter what you do now to prepare for it. Might as well follow your interests in the meantime.
It's important to base life decisions on very real technological change. We don't know what the change will be, but it's coming. At the very least, that suggests more diverse skills.
We're all usually (but not always) better off, with more productivity, eventually, but in the meantime, jobs do disappear. Robotics did not fully displace machinists and factory workers, but single-skilled people in Detroit did not do well. The loom, the steam engine... all of them displaced often highly-trained often low-skilled artisans.
If AI reaches this level, the socioeconomic impact is going to be so immense that choosing what subject you study will have no bearing on your outcome, no matter what it is, so it's a pointless consideration.
Another PhD in maths here and I would say not to worry. It's the process of doing and understanding mathematics, and thinking mathematically that is ultimately important.
There's never been the equivalent of the 'bench scientist' in mathematics and there aren't many direct careers in mathematics, or pure mathematics at least - so very few people ultimately become researchers. Instead, I think you take your way of thinking and apply it to whatever else you do (and it certainly doesn't do any harm to understand various mathematical concepts incredibly well).
What LLMs can do is limited. They are superior to wetware at some tasks, like finding and matching patterns in higher-dimensional space, but outside of that pattern finding and matching they are still fundamentally limited to a tiny class of problems.
LLMs will be tools for some math needs, and even if we ever get quantum computers, they will be limited in what they can do.
LLMs, without pattern matching, can only do up to about integer division, and while they can calculate parity, they can't use it in their calculations.
There are several groups sitting on what are known limitations of LLMs, waiting to take advantage of those who don't understand the fundamental limitations, simplicity bias etc...
The hype will meet reality soon and we will figure out where they work and where they are problematic over the next few years.
But even the most celebrated achievements like proof finding with Lean, heavily depends on smart people producing hints that machines can use.
Basically lots of the fundamental hints of the limits of computation still hold.
Modal logic may be an accessible way to approach the limits of statistical inference, if you want to explore one path yourself.
A lot of what is in this article relates to some of the known fundamental limitations.
Remember that for all the amazing progress, Pitts, one of the creators of the original artificial neuron model, drank himself to death after it was shown that such models were insufficient to accurately model biological neurons.
Optimism is high, but reality will hit soon.
So think of it as new tools that will be available to your child, not a replacement.
"LLMs, without pattern matching, can only do up to about integer division, and while they can calculate parity, they can't use it in their calculations." - what do you mean by this? Counting the number of 1's in a bitstring and determining if it's even or odd?
The point being that the ability to use parity gates is different from being able to calculate parity, which is where combining the RAM-machine class DLOGTIME with the circuit-complexity class uniform TC0 comes into play.
PARITY, MAJ, AND, and OR are all symmetric, and are in TC0, but PARITY is not in DLOGTIME-uniform TC0, which is first-order logic with Majority quantifiers.
Another path: if you think about semantic properties and Rice's theorem, this may make sense, especially as PAC learning even depth-2 nets is equivalent to the approximate SVP.
PAC-learning even depth-2 threshold circuits is NP-hard.
For me, it helped to think about how ZFC was structured so we can keep the niceties of the law of the excluded middle, and how statistics pretty much depends on it for the central limit theorem, the law of large numbers, IID assumptions, etc.
But that path runs the risk of reliving the Brouwer–Hilbert controversy.
Most likely AI will be good at some things and not others, and mathematicians will just move to whatever AI isn't good at.
Alternatively, if AI is able to do all math at a level above PhDs, then it's going to be a brave new world and basically the singularity. Everything will change so much that speculating about it will probably be useless.
Let's put it this way, from another mathematician, and I'm sure I'll probably be shot for this one.
Every LLM release moves half of the remaining way to the minimum viable goal of replacing a third class undergrad. If your business or research initiative is fine with that level of competence then you will find utility.
The problem is that I don't know anyone who would find that useful. Nor does it fit within any existing working methodology we have. And on top of that the verification of any output can take considerably longer than just doing it yourself in the first place, particularly where it goes off the rails, which it does all the time. I mean it was 3 months ago I was arguing with a model over it not understanding place-value systems properly, something we teach 7 year olds here?
But the abstract problem is at a higher level. If it doesn't become a general utility for people outside of mathematics, which is very very evident at the moment by the poor overall adoption and very public criticism of the poor result quality, then the funding will dry up. Models cost lots of money to train and if you don't have customers it's not happening and no one is going to lend you the money any more. And then it's moot.
Well said. As someone with only a math undergrad and as a math RLHF’er, this speaks to my experience the most.
That craving for understanding an elegant proof is nowhere to be found when verifying an LLM's proof.
Like sure, you could put together a car by first building an airplane, disassembling all of it minus the two front seats, and having zero elegance and still get a car at the end. But if you do all that and don’t provide novelty in results or useful techniques, there’s no business.
Hell, I can’t even get a model to calculate compound interest for me (save for the technicality of prompt engineering a python function to do it). What do I expect?
This is a great point that nobody will shoot you over :)
But the main question is still: assuming you replace an undergrad with a model, who checks the work? If you have a good process around that already, and find utility as an augmented system, then you'll get value - but I still think it's better for the undergrad to still have the job and be at the wheel, doing things faster and better by leveraging a powerful tool.
Shot already for criticising the shiny thing (happened with crypto and blockchain already...)
Well to be fair no one checks what the graduates do properly, even if we hired KPMG in. That is until we get sued. But at least we have someone to blame then. What we don't want is something for the graduate to blame. The buck stops at someone corporeal because that's what the customers want and the regulators require.
That's the reality and it's not quite as shiny and happy as the tech industry loves to promote itself.
My main point, probably cleared up with a simple point: no one gives a shit about this either way.
I used to do bench top work too; and was blessed with “the golden hands” in that I could almost always get protocols working. To me this always felt more like intuition than deductive reasoning. And it made me a terrible TA. My advice to students in lab was always something along the lines of “just mess around with it, and see how it works.” Not very helpful for the stressed and struggling student -_-
Digression aside, my point is that I don’t think we know exactly what makes or defines “the golden hands”. And if that is the case, can we optimize for it?
Another point is that scalable fine tuning only works for verifiable stuff. Think a priori knowledge. To me that seems to be at the opposite end of the spectrum from “mess with it and see what happens”.
The mathematicians of the future will still have to figure out the right questions, even if llms can give them the answers. And "prompt engineering" will require mathematical skills, at the very least.
Evaluating the output of llms will also require mathematical skills.
But I'd go further, if your son enjoys mathematics and has some ability in the area, it's wonderful for your inner life. Anyone who becomes sufficiently interested in anything will rediscover mathematics lurking at the bottom.
What part do you think is going to become obsolete? Because Math isn't about "working out the math", it's about finding the relations between seemingly unrelated things to bust open a problem. Short of AGI, there is no amount of neural net that's going to realize that a seemingly impossible probabilistic problem is actually equivalent to a projection of an easy to work with 4D geometry. "Doing the math" is what we have computers for, and the better they get, the easier the tedious parts of the job become, but "doing math" is still very much a human game.
> What part do you think is going to become obsolete?
Thank you for the question.
I guess what I'm saying is:
Will LLMs (or whatever comes after them) be _so_ good and _so_ pervasive that we will simply be able to say, "Hey ChatGPT-9000, I'd like to see if the xyz conjecture is correct." And then ChatGPT-9000 just does the work without us contributing beyond asking a question.
Or will the technology be limited/bound in some way such that we will still be able to use ChatGPT-9000 as a tool of our own intellectual augmentation and/or we could still contribute to research even without it.
Hopefully, my comment clarifies my original post.
Also, writing this stuff has helped me think about it more. I don't have any grand insight, but the more I write, the more I lean toward the outcome that these machines will allow us to augment our research.
As amazing as they may seem, they're still just autocompletes; it's inherent to what an LLM is. So unless we come up with a completely new kind of technology, I don't see "test this conjecture for me" becoming more real than the computer-assisted proof tooling we already have.
I was referring to Linus's harmful and evil promotion of Vitamin C as the cure for everything and cancer. I don't think Linus was attaching that garbage to any particular Nobel prize. But people did say to their doctors: "Are you a Nobel winner, doctor?". Don't think they cared about particular prize either.
Which is "harmful and evil" thanks to your afterknowledge. He had based his books on the research that failed to replicate. But given low toxicity of vitamin C it's not that "evil" to recommend treatment even if probabilistic estimation of positive effects is not that high.
Sloppy, but not exceptionally bad. At least it was instrumental in teaching me to not expect marvels coming from dietary research.
If Pauling's eugenics policies were bad, then the laws against incest that are currently on the books in many states (which are also eugenics policies that use the same mechanism) are also bad. There are different forms of eugenics policies, and Pauling's proposal to restrict the mating choices of people carrying certain recessive genes so their children don't suffer is ethically different from Hitler exterminating people with certain genes and also ethically different from other governments sterilizing people with certain genes. He later supported voluntary abortion with genetic testing, which is now standard practice in the US today, though no longer in a few states with ethically questionable laws restricting abortion. This again is ethically different from forced abortion.
FWIW my understanding is that the policies against incest you mention actually have much less to do with controlling genetic reproduction and are more directed at combating familial rape/grooming/etc.
Not a fun thing to discuss, but apparently a significant issue, which I guess should be unsurprising given some of the laws allowing underage marriage if the family signs off.
Mentioning only to draw attention to the fact that theoretical policy is often undeniable in a vacuum, but runs aground when faced with real world conditions.
This is mentioned in my link: "According to Pauling, carriers should have an obvious mark, (i.e. a tattoo on the forehead) denoting their disease, which would allow carriers to identify others with the same affliction and avoid marrying them."
The goal wasn't to mark people for ostracism but to make it easier for people carrying these genes to find mates that won't result in suffering for their offspring.
Eventually we may produce a collection of problems exhaustive enough that these tools can solve almost any problem that isn't novel in practice, but I doubt that they will ever become general problem solvers capable of what we consider to be reasoning in humans.
Historically, the claim that neural nets were actual models of the human brain and human thinking was always epistemically dubious. It still is. Even as the practical problems of producing better and better algorithms, architectures, and output have been solved, there is no reason to believe a connection between the mechanical model and what happens in organisms has been established. The most important point, in my view, is that all of the representation and interpretation still has to happen outside the computational units. Without human interpreters, none of the AI outputs have any meaning. Unless you believe in determinism and an overseeing god, the story for human beings is much different. AI will not be capable of reason until, like humans, it can develop socio-rational collectivities of meaning that are independent of the human being.
Researchers seemed to have a decent grasp on this in the 90s, but today, everyone seems all too ready to make the same ridiculous leaps as the original creators of neural nets. They did not show, as they claimed, that thinking is reducible to computation. All they showed was that a neural net can realize a boolean function—which is not even logic, since, again, the entire semantic interpretive side of the logic is ignored.
> Unless you believe in determinism and an overseeing god
Or perhaps, determinism and mechanistic materialism - which in STEM-adjacent circles has a relatively prevalent adherence.
Worldviews which strip a human being of agency in the sense you invoke crop up quite a lot today in such spaces. If you start of adopting a view like this, you have a deflationary sword which can cut down most any notion that's not mechanistic in terms of mechanistic parts. "Meaning? Well that's just an emergent phenomenon of the influence of such and such causal factors in the unrolling of a deterministic physical system."
Similar for reasoning, etc.
Now obviously large swathes of people don't really subscribe to this - but it is prevalent and ties in well with utopian progress stories. If something is amenable to mechanistic dissection, possibly it's amenable to mechanistic control. And that's what our education is really good at teaching us. So such stories end up having intoxicating "hype" effects and drive fundraising, and so we get where we are.
For one, I wish people were just excited about making computers do things they couldn't do before, without needing to dress it up as something more than it is. "This model can prove a set of theorems in this format with such and such limits and efficiency"
Agreed. If someone believes the world is purely mechanistic, then it follows that a sufficiently large computing machine can model the world---like Leibniz's Ratiocinator. The intoxication may stem from the potential for predictability and control.
The irony is: why would someone want control if they don't have true choice? Unfortunately, such a question rarely pierces the intoxicated mind when this mind is preoccupied with pass the class, get an A, get a job, buy a house, raise funds, sell the product, win clients, gain status, eat right, exercise, check insta, watch the game, binge the show, post on Reddit, etc.
> Agreed. If someone believes the world is purely mechanistic, then it follows that a sufficiently large computing machine can model the world---like Leibniz's Ratiocinator.
I don’t think it does. Taking computers as an analogy… if you have a computer with 1GB memory, then you can’t simulate a computer with more than 1GB memory inside of it.
> If someone believes the world is purely mechanistic, then it follows that a sufficiently large computing machine can model the world
Is this controversial in some way? The problem is that to simulate a universe you need a bigger universe -- which doesn't exist (or is certainly out of reach due to information theoretical limits)
> ---like Leibniz's Ratiocinator. The intoxication may stem from the potential for predictability and control.
I really don't understand the 'control' angle here. It seems pretty obvious that even in a purely mechanistic view of the universe, information theory forbids using the universe to simulate itself. Limited simulations, sure... but that leaves lots of gaps wherein you lose determinism (and control, whatever that means).
People wish to feel safe. One path to safety is controlling or managing the environment. Lack of sufficient control produces anxiety. But control is only possible if the environment is predictable, i.e., relatively certain knowledge that if I do X then the environment responds with Y. Humans use models for prediction. Loosely speaking, if the universe is truly mechanistic/deterministic, then the goal of modeling is to get the correct model (though notions of "goals" are problematic in determinism without real counterfactuals). However, if we can't know whether the universe is truly deterministic, then modeling is a pragmatic exercise in control (or management).
My comments are not about simulating the universe on a real machine. They're about the validity and value of math/computational modeling in a universe where determinism is scientifically indeterminable.
Choice is overrated. This gets to an issue I've long had with Nozick's experience machine. Not only would I happily spend my days in such a machine, I'm pretty sure most other people would too. Maybe they say they wouldn't, but if you let them try it out and then offered them the question again, I think they'd say yes. The real conclusion of the experience machine is that the unknown is scary.
> there is no reason to believe a connection between the mechanical model and what happens in organisms has been established
The universal approximation theorem. And that's basically it. The rest is empirical.
No matter which physical processes happen inside the human brain, a sufficiently large neural network can approximate them. Barring unknowns like super-Turing computational processes in the brain.
The universal approximation theorem is set in a precise mathematical context; I encourage you to limit its applicability to that context despite the marketing label "universal" (which it isn't). Consider your concession about empiricism. There's no empirical way to prove (i.e. there's no experiment that can demonstrate beyond doubt) that all brain or other organic processes are deterministic and can be represented completely as functions.
Function is the most general way of describing relations. Non-deterministic processes can be represented as functions with a probability distribution codomain. Physics seems to require only continuous functions.
Sorry, but there's not much evidence that can support human exceptionalism.
I don't understand your point here. A (logical) relation is, by definition, a more general way of describing relations than a function, and it is telling that we still suck at using and developing truly relational models that are not univalent (i.e. functions). Only a few old logicians really took the calculus of relations proper seriously (Peirce, for one). We use functions precisely because they are less general, they are rigid, and simpler to work with. I do not think anyone is working under the impression that a function is a high fidelity means to model the world as it is experienced and actually exists. It is necessarily reductionistic (and abstract). Any truth we achieve through functional models is necessarily a general, abstracted, truth, which in many ways proves to be extremely useful but in others (e.g. when an essential piece of information in the particular is not accounted for in the general reductive model) can be disastrous.
I'm not a big fan of philosophy. The epistemology you are talking about is another abstraction on top of the physical world. But the evolution of the physical world as far as we know can be described as a function of time (at least, in a weak gravitational field when energies involved are well below the grand unification energy level, that is for the objects like brains).
The brain is a physical system, so whatever it does (including philosophy) can be replicated by modelling (a (vastly) simplified version of) underlying physics.
Anyway, I am not especially interested in discussing possible impossibility of an LLM-based AGI. It might be resolved empirically soon enough.
Some differential equations that model physics admit singularities and multiple solutions. Therefore, functions are not the most general way of describing relations. Functions are a subset of relations.
Although "non-deterministic" and "stochastic" are often used interchangeably, they are not equivalent. Probability is applied analysis whose objects are distributions. Analysis is a form of deductive, i.e. mechanical, reasoning. Therefore, it's more accurate (philosophically) to identify mathematical probability with determinism. Probability is a model for our experience. That doesn't mean our experience is truly probabilistic.
Humans aren't exceptional. Math modeling and reasoning are human activities.
For example, the Euler equations model compressible flow with discontinuities (shocks in the flow field variables) and rarefaction waves. These theories are accepted and used routinely.
Great. A useful approximation of what really happens in the fluid. But I'm sure there are no shocks and rarefactions in physicists' neurons while they are thinking about it.
Switching into a less facetious mode...
Do you understand that in the context of this dialogue it's not enough to show some examples of discontinuous functions, or of functions otherwise unrepresentable by NNs? You need at least to give a hint as to why such functions cannot be avoided while approximating the functionality of the human brain.
Many things are possible, but I'm not going to keep my mind open to a possibility of a teal Russell's teapot before I get a hint at its existence, so to speak.
That's not useful by itself, because "anything can model anything else" doesn't put any upper bound on emulation cost, which for one small task could be larger than the total energy available in the entire Universe.
Either the brain violates the physical Church-Turing thesis or it doesn't.
If it does, well, it will take more time to incorporate those physical mechanisms into computers to get them on par with the brain.
I leave the possibility that it's "magic"[1] aside. It's just impossible to predict, because it will violate everything we know about our physical world.
[1] One example of "magic": we live in a simulation and the brain is not fully simulated by the physics engine, but creators of the simulation for some reason gave it access to computational resources that are impossible to harness using the standard physics of the simulated world. Another example: interactionistic soul.
Quantum computing actually isn't super-Turing, it "just" computes some things faster. (Strictly speaking it's somewhere between a standard Turing machine and a nondeterministic Turing machine in speed, and the first can emulate the second.)
If we're nitpicking: quantum computing algorithms could (if implemented) compute certain things faster than the best classical algorithms we know. We don't know any quantum algorithms that are provably faster than all possible classical algorithms.
I'm with you. Interpreting a problem as a problem requires a human (1) to recognize the problem and (2) to convince other humans that it's a problem worth solving. Both involve value, and value has no computational or mechanistic description (other than "given" or "illusion"). Once humans have identified a problem, they might employ a tool to find the solution. The tool has no sense that the problem is important or even hard; such values are imposed by the tool's users.
It's worth considering why "everyone seems all too ready to make ... leaps ..." "Neural", "intelligence", "learning", and others are metaphors that have performed very well as marketing slogans. Behind the marketing slogans are deep-pocketed, platformed corporate and government (i.e. socio-rational collective) interests. Educational institutions (another socio-rational collective) and their leaders have on the whole postured as trainers and preparers for the "real world" (i.e. a job), which means they accept, support, and promote the corporate narratives about techno-utopia. Which institutions are left to check the narratives? Who has time to ask questions given the need to learn all the technobabble (by paying hundreds of thousands for 120 university credits) to become a competitive job candidate?
I've found there are many voices speaking against the hype---indeed, even (rightly) questioning the epistemic underpinnings of AI. But they're ignored and out-shouted by tech marketing, fundraising politicians, and engagement-driven media.
I hear these arguments a lot from law and philosophy students, never from those trained in mathematics. It seems to me "literary" people will still be discussing these theoretical hypotheticals while those building the technology pass them by.
I straddle both worlds. Consider that using the lens of mathematical reasoning to understand everything is a bit like trying to use a single mathematical theory (eg that of groups) to comprehend mathematics as a whole. You will almost always benefit and enrich your own understanding by daring to incorporate outside perspectives.
Consider also that even as digital technology and the ratiomathematical understanding of the world have advanced, the world is still rife with dynamics and problems that require a humanistic approach. In particular, a mathematical conception cannot resolve teleological problems, which require the establishment of consensus and the actual determination of what we, as a species, want the world to look like. Climate change and general economic imbalance are already evidence of the kind of disasters that mount when you limit yourself to a reductionistic, overly mathematical and technological understanding of life and existence. Being is not a solely technical problem.
I don't disagree, I just don't think it is done well, or at least as seriously as it used to be. In modern philosophy there are many mathematically specious arguments that just make clear how large the mathematical gap has become, e.g. improper application of Gödel's incompleteness theorems. Yet Gödel was a philosopher himself, and he would disagree with their current hand-wavy usage.
The late 19th/early 20th century was a golden era of philosophy, with a coherent and rigorous mathematical lens to apply alongside other lenses: Russell, Turing, Gödel, etc. However, this just doesn't exist anymore.
While I agree that these are titans of 20th c. philosophy, particularly of the philosophy of mathematics and logic, the overarching school they belonged to (logical positivism) has been thoroughly and rightly criticized, and it is informative to read these criticisms to understand why a view of life that is overly mathematical is in many ways inadequate. Your comment still argues from a very limited perspective. There is no reason that correct application of Gödel's theorem should be any indication of the richness of someone's philosophical views unless you are already a staunchly committed reductionist who values mathematical arguments above all else (why? can maths help you explain and understand the phenomenon of love in a way that will actually help you experience love? this is just one example domain where it does not make much sense), or unless they are specifically attempting a philosophy of mathematics. The question of whether or not we can effectively model cognition and human mental function using mathematical models is not a question of mathematical philosophy, but rather one of epistemology.
If you really want to read a spurious argument, read McCulloch and Pitts. They essentially present an argument of two premises: the brain is finite, and we can create a machine of formal "neurons" (which are not even complete models of real neurons) that computes a boolean function. They then conclude that they must have a model of cognition, that cognition must be nothing more than computation, and that the brain must basically be a Turing machine.
The relevance of mathematics to the cognitive problem must be decided outside of mathematics. As another poster said, even if you buy the theorems, it is still an empirical question as to whether or not they really model what they claim to model, and whether or not that model is of a fidelity we find acceptable for a definition of general intelligence. Often, people reach claims of adequacy today not by producing really fantastic models but by lowering the bar enormously. They claim that these models approximate humans by severely reducing the idea of what it means to be an intelligent human to the specific talents their tech happens to excel at (e.g. apparently being a language parrot is all that intelligence is, ignoring all the very nuanced views and definitions of intelligence we have come up with over the course of history). A machine that is not embodied in a skeletal structure and cannot even experience, let alone solve, the vast number of physical, anatomical problems we contend with on a daily basis is, in my view, still very far from anything I would call general intelligence.
I don't have much to opine from an advanced maths perspective, but I'd like to point out a couple examples of where ChatGPT made basic errors in questions I asked it as an undergrad CS student.
1. I asked it to show me the derivation of a formula for the efficiency of Stop-and-Wait ARQ and it seemed to do it, but a day later I realised that in one of the steps it just made a term vanish to get to the next step (the standard derivation is sketched after this comment for reference). Obviously, I should have verified more carefully, but when I asked it to spot the mistake in that step, it did the same thing twice more, with bs explanations of how the term is absorbed.
2. I asked it to provide me with syllogisms that I could practice proving. An overwhelming number of the syllogisms it gave me were inconsistent and did not hold (a brute-force validity check is sketched after this comment). This surprised me more because syllogisms are about the most structured arguments you can find, having been formalized centuries ago and discussed extensively since then. In this case, asking it to walk through the steps actually fixed the issue.
Both of these were done on the free plan of ChatGPT, but I can't remember if it was 4o or 4.
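For reference, here is the textbook derivation the first anecdote is about, in its simplest form (standard notation, ignoring ACK transmission time and processing delays; this is my sketch, not the chat transcript). With frame transmission time T_t, one-way propagation delay T_p and a = T_p / T_t, the sender is busy for T_t and then idles for roughly 2*T_p waiting for the frame to arrive and the ACK to return, so the utilisation is

    U = \frac{T_t}{T_t + 2T_p} = \frac{1}{1 + 2a}

and if each frame is lost or corrupted independently with probability P, the expected number of transmissions per frame is 1/(1-P), giving U = (1-P)/(1+2a). A step in which a T_p or (1-P) term silently "vanishes" is exactly the kind of error described above.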
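As for the syllogism anecdote: checking a categorical syllogism is mechanical enough to brute-force, since only the inhabited regions of the three-set Venn diagram matter. A minimal sketch in Python (my illustration, using the modern reading in which universal statements carry no existential import):

    from itertools import product

    TERMS = ('S', 'M', 'P')
    REGIONS = list(product([False, True], repeat=3))  # (in S, in M, in P)

    def truth(stmt, world):
        """world = set of inhabited Venn regions; stmt = (form, A, B)."""
        form, A, B = stmt
        a, b = TERMS.index(A), TERMS.index(B)
        ab  = any(r[a] and r[b]     for r in world)   # some A are B
        anb = any(r[a] and not r[b] for r in world)   # some A are not B
        return {'all': not anb, 'no': not ab,
                'some': ab, 'some_not': anb}[form]

    def valid(premises, conclusion):
        """Valid iff every world satisfying the premises satisfies the conclusion."""
        for bits in product([False, True], repeat=len(REGIONS)):
            world = {r for r, inhabited in zip(REGIONS, bits) if inhabited}
            if all(truth(p, world) for p in premises) and not truth(conclusion, world):
                return False  # found a countermodel
        return True

    # Barbara: All M are P, All S are M => All S are P  (valid)
    print(valid([('all', 'M', 'P'), ('all', 'S', 'M')], ('all', 'S', 'P')))  # True
    # Undistributed middle: All P are M, All S are M => All S are P  (invalid)
    print(valid([('all', 'P', 'M'), ('all', 'S', 'M')], ('all', 'S', 'P')))  # False

This is the kind of mechanical check one could run over a model's proposed exercises before trusting them.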
The first question is always: which model? Which fortunately you at least addressed:
>free plan of ChatGPT, but I can't remember if it was 4o or 4.
Since chatgpt-4o, there has been o1-preview, and o1 (full) is out. They just announced that o3 got 25% on FrontierMath, which is what this article is a reaction to. So any tests on 4o are at least two (or three) AI releases behind the newest capabilities.
I didn't see anyone else ask this but.. isn't the FrontierMath dataset compromised now? At the very least OpenAI now knows the questions if not the answers. I would expect that the next iteration will "magically" get over 80% on the FrontierMath test. I imagine that experiment was pretty closely monitored.
I figured their model was independently evaluated against the questions/answers. That's not to say it's not compromised by "Here's a bag of money" type methods, but I don't even think it'd be a reasonable test if they just handed over the dataset.
I'm sure it was independently evaluated, but I'm sure the folks running the test were not given an on-prem installation of ChatGPT to mess with. It was still done via API calls, presumably through the chat interface UI.
That means the questions went over the fence to OpenAI.
I'm quite certain they are aware of that, and it would be pretty foolish not to take advantage of at least knowing what the questions are.
Sure, but given the resourcing at OpenAI, it would not be hard to clean[1] the inputs. I'm just trying to be realistic here, there are plenty of ways around contractual obligations and a significant incentive to do so.
Insightful comment. The thing that's extremely frustrating is looking at all the energy poured into this conversation around benchmarks. There is a fundamental assumption of honesty and integrity in the benchmarking process, at least by some people. But when the dataset is compromised and generation N+1 has miraculous performance gains, how can we see this as anything other than a ploy to pump up valuations? Some people have millions of dollars at stake here, and they don't care about the naysayers in the peanut gallery like us.
It's sadly inevitable that when billions in funding and industry hype are tied to performance on a handful of benchmarks, scores will somehow, magically, continue to go up.
Needless to say, it doesn't bring us any closer to AGI.
The only solution I see here is people crafting their own, private benchmarks that the big players don't care about enough to train on. That, at least, gives you a clearer view of the field.
Not sure why your comment was downvoted, but it certainly shows the pressure going against people who point out fundamental flaws. This is pushing us towards "AVI" rather than AGI-- "Artificially Valued Intelligence". The optimization function here is around the market.
I'm being completely serious. You are correct, despite the downvotes, that this could not be pushing us towards AGI because if the dataset is leaked you can't claim the G-- generalizability.
The point of the benchmark is to lead us to believe that this is a substantial breakthrough. But a reasonable person would be forced to conclude that the results are misleading due to optimizing around the training data.
It's fascinating that this has run into the exact same problem as the quantum research. I.e., in the quantum research, to demonstrate any valuable forward progress you must compute something that is impossible to do with a traditional computer. If you can't do it with a traditional computer, it suddenly becomes difficult to verify correctness (i.e., you can't just check that it matches the traditional computer's answer).
In the same way, ChatGPT scores 25% on this and the question is "How close were those 25% to questions in the training set?" Or to put it another way, we want to answer the question "Is ChatGPT getting better at applying its reasoning to out-of-set problems, or is it pulling more data into its training set?" Or "Is the test leaking into the training?"
Maybe the whole question is academic and it doesn't matter, we solve the entire problem by pulling all human knowledge into the training set and that's a massive benefit. But maybe it implies a limit to how far it can push human knowledge forward.
>in the quantum research to demonstrate any valuable forward progress you must compute something that is impossible to do with a traditional computer
This is factually wrong. The most interesting problems motivating the quantum computing research are hard to solve, but easy to verify on classical computers. The factorization problem is the most classical example.
The problem is that existing quantum computers are not powerful enough to solve the interesting problems, so researchers have to invent semi-artificial problems to demonstrate "quantum advantage" to keep the funding flowing.
There is a plethora of opportunities for LLMs to show their worth. For example, finding interesting links between different areas of research or being a proof assistant in a math/programming formal verification system. There is a lot of ongoing work in this area, but at the moment signal-to-noise ratio of such tools is too low for them to be practical.
No, it is factually right, at least if Scott Aaronson is to be believed:
> Having said that, the biggest caveat to the “10^25 years” result is one to which I fear Google drew insufficient attention. Namely, for the exact same reason why (as far as anyone knows) this quantum computation would take ~10^25 years for a classical computer to simulate, it would also take ~10^25 years for a classical computer to directly verify the quantum computer’s results!! (For example, by computing the “Linear Cross-Entropy” score of the outputs.) For this reason, all validation of Google’s new supremacy experiment is indirect, based on extrapolations from smaller circuits, ones for which a classical computer can feasibly check the results. To be clear, I personally see no reason to doubt those extrapolations. But for anyone who wonders why I’ve been obsessing for years about the need to design efficiently verifiable near-term quantum supremacy experiments: well, this is why! We’re now deeply into the unverifiable regime that I warned about.
It's a property of the "semi-artificial" problem chosen by Google. If anything, it means that we should heavily discount this claim of "quantum advantage", especially in light of the inherent probabilistic nature of quantum computations.
Note that the OP wrote "you MUST compute something that is impossible to do with a traditional computer". I demonstrated a simple counter-example to this statement: you CAN demonstrate forward progress by factorizing big numbers, but the problem is that no one can do it despite billions of investments.
If they can't, then is it really quantum supremacy?
They claimed it last time in 2019 with Sycamore, which could perform in 200 seconds a calculation that Google claimed would take a classical supercomputer 10,000 years.
That was debunked when a team of scientists replicated the same thing on an ordinary computer in 15 hours with a large number of GPUs. Scott Aaronson said that on a supercomputer, the same technique would have solved the problem in seconds.[1]
So if they now come up with another problem which they say cannot even be verified by a classical computer and uses it to claim quantum advantage, then it is right to be suspicious of that claim.
> If they can't, then is it really quantum supremacy?
Yes, quantum supremacy on an artificial problem is quantum supremacy (even if it's "this quantum computer can simulate itself faster than a classical computer"). Quantum supremacy on problems that are easy to verify would of course be nicer, but unfortunately not all problems happen to have an easy verification.
That applies specifically to this artificial problem Google created to be hard for classical computers, and in fact in the end it turned out it was not so hard: IBM came up with a method to do what Google said would take 10,000 years on a classical computer in just 2 days. I would not be surprised if a similar reduction happened to their second attempt if anyone were motivated enough to look at it.
In general we have thousands of optimisation problems that are hard to solve but immediate to verify.
What's factually wrong about it? OP said "you must compute something that is impossible to do with a traditional computer" which is true, regardless of the output produced. Verifying an output is very different from verifying the proper execution of a program. The difference between testing a program and seeing its code.
What is being computed is fundamentally different from classical computers, therefore the verification methods of proper adherence to instructions becomes increasingly complex.
They left out the key part which was incorrect and the sentence right after "If you can't do it with a traditional computer, it suddenly becomes difficult to verify correctness"
The point stands that for actually interesting problems, verifying correctness of the results is trivial. I don't know if "adherence to instructions" translates at all to quantum computing.
> This is factually wrong. The most interesting problems motivating the quantum computing research are hard to solve, but easy to verify on classical computers.
Your parent did not talk about quantum computers. I guess he rather had predictions of novel quantum-field theories or theories of quantum gravity in the back of his mind.
I agree with the issue of ”is the test dataset leaking into the training dataset” being an issue with interpreting LLM capabilities in novel contexts, but not sure I follow what you mean on the quantum computing front.
My understanding is that many problems have solutions that are easier to verify than to solve using classical computing. e.g. prime factorization
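To make that solve-versus-verify asymmetry concrete, here is a toy illustration (mine, not the article's): checking a claimed factorisation is a single multiplication, while finding the factors by the naive method scales hopelessly at cryptographic sizes.

    # Verifying a claimed factorisation is trivial...
    def verify_factorisation(n, factors):
        prod = 1
        for f in factors:
            prod *= f
        return prod == n and all(f > 1 for f in factors)

    # ...while *finding* the factors by trial division is hopeless for the
    # ~2048-bit moduli used in practice (fine for this tiny example).
    def trial_division(n):
        d, factors = 2, []
        while d * d <= n:
            while n % d == 0:
                factors.append(d)
                n //= d
            d += 1
        if n > 1:
            factors.append(n)
        return factors

    n = 1009 * 1013
    print(trial_division(n))                    # [1009, 1013]
    print(verify_factorisation(n, [1009, 1013]))  # True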
Oh it's a totally different issue on the quantum side that leads to the same issue with difficulty verifying. There, the algorithms that Google for example is using today, aren't like prime factorization, they're not easy to directly verify with traditional computers, so as far as I'm aware they kind of check the result for a suitably small run, and then do the performance metrics on a large run that they hope gave a correct answer but aren't able to directly verify.
How much of this could be resolved if its training set were reduced? Conceivably, most of the training serves only to confuse the model when only aiming to solve a math equation.
Depends on your understanding of human knowledge I guess? People talk about the frontier of human knowledge and if your view of knowledge is like that of a unique human genius pushing forward the frontier then yes - it'd be stuck. But if you think of knowledge as more complex than that you could have areas that are kind of within our frontier of knowledge (that we could reasonably know, but don't actually know) - taking concepts that we already know in one field and applying them to some other field. Today the reason that doesn't happen is because genius A in physics doesn't know about the existence of genius B in mathematics (let alone understand their research), but if it's all imbibed by "The Model" then it's trivial to make that discovery.
Reasoning is essentially the creation of new knowledge from existing knowledge. The better the model can reason the less constrained it is to existing knowledge.
The challenge is how to figure out if a model is genuinely reasoning
Reasoning is a very minor (but essential) part of knowledge creation.
Knowledge creation comes from collecting data from the real world, and cleaning it up somehow, and brainstorming creative models to explain it.
NN/LLM's version of model building is frustrating because it is quite good, but not highly "explainable". Human models have higher explainability, while machine models have high predictive value on test examples due to an impenetrable mountain of algebra.
There are likely lots of connections that could be made that no individual has made because no individual has all of existing human knowledge at their immediate disposal.
I don't think many expect AI to push knowledge forward? A thing that basically just regurgitates consensus historic knowledge seems badly suited to that
I think current "AI" (i.e. LLMs) is unable to push human knowledge forward, but not because it's constrained by existing human knowledge. It's more like peeking into a very large magic-8 ball, new answers everytime you shake it. Some useful.
It may be able to push human knowledge forward to an extent.
In the past, there was quite a bit of low hanging fruit such that you could have polymaths able to contribute to a wide variety of fields, such as Newton.
But in the past 100 years or so, the problem is there is so much known, it is impossible for any single person to have deep knowledge of everything. e.g. its rare to find a really good mathematician who also has a deep knowledge (beyond intro courses) about say, chemistry.
Would a sufficiently powerful AI / ML model be able to come up with this synthesis across fields?
That's not a strong reason. Yes, that means ChatGPT isn't good at wholly independently pushing knowledge forward, but a good brainstormer that is right even 10% of the time is an incredible fount of knowledge.
No comment on the article it's just always interesting to get hit with intense jargon from a field I know very little about.
I understood the statements of all five questions. I could do the third one relatively quickly (I had seen the trick before that the function mapping a natural n to alpha^n was p-adically continuous in n iff the p-adic valuation of alpha-1 was positive)
> I am dreading the inevitable onslaught in a year or two of language model “proofs” of the Riemann hypothesis which will just contain claims which are vague or inaccurate in the middle of 10 pages of correct mathematics which the human will have to wade through to find the line which doesn’t hold up.
I wonder what the response of working mathematicians will be to this. If the proofs look credible it might be too tempting to try and validate them, but if there's a deluge that could be a huge time sink. Imagine if Wiles or Perelman had produced a thousand different proofs for their respective problems.
Maybe the coming onslaught of AI slop "proofs" will give a little bump to proof assistants like Coq. Of course, it would still take a human mathematician some time to verify theorem definitions.
Honestly I think it won’t be that different from today, where there is no shortage of cranks producing “proofs” of the Riemann Hypothesis and submitting them to prestigious journals.
> As an academic mathematician who spent their entire life collaborating openly on research problems and sharing my ideas with other people, it frustrates me [that] I am not even [able] to give you a coherent description of some basic facts about this dataset, for example, its size. However there is a good reason for the secrecy. Language models train on large databases of knowledge, so [the] moment you make a database of maths questions public, the language models will train on it.
Well, yes and no. This is only true because we are talking about closed models from closed companies like so-called "OpenAI".
But if all models were truly open, then we could simply verify what they had been trained on, and make experiments with models that we could be sure had never seen the dataset.
Decades ago Microsoft (in the words of Ballmer and Gates) famously accused open source of being a "cancer" because of the cascading nature of the GPL.
But it's the opposite. In software, and in knowledge in general, the true disease is secrecy.
> But if all models were truly open, then we could simply verify what they had been trained on
How do you verify what a particular open model was trained on if you haven’t trained it yourself? Typically, for open models, you only get the architecture and the trained weights. How can you reliably verify what the model was trained on from this?
Even if they provide the training set (which is not typically the case), you still have to take their word for it—that’s not really "verification."
> Even if they provide the training set (which is not typically the case), you still have to take their word for it—that’s not really "verification."
If they've done it right, you can re-run the training and get the same weights. And maybe you could spot-check parts of it without running the full training (e.g. if there are glitch tokens in the weights, you'd look for where they came from in the training data, and if they weren't there at all that would be a red flag). Is it possible to release the wrong training set (or the wrong instructions) and hope you don't get caught? Sure, but demanding that it be published and available to check raises the bar and makes it much more risky to cheat.
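A minimal sketch of the kind of spot-check being described, assuming (hypothetically) that the corpus ships as text shards under corpus/ together with a manifest.json of published SHA-256 hashes; the file names and manifest format are my invention:

    import hashlib
    import json
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # manifest.json is assumed to map shard filename -> published SHA-256.
    manifest = json.loads(Path("manifest.json").read_text())
    for name, published_hash in manifest.items():
        ok = sha256_of(Path("corpus") / name) == published_hash
        print(f"{name}: {'ok' if ok else 'MISMATCH'}")

    # A cruder check: does a suspicious string (e.g. a known glitch token)
    # occur anywhere in the released shards at all?
    needle = b" SolidGoldMagikarp"
    hits = sum(needle in shard.read_bytes() for shard in Path("corpus").glob("*.txt"))
    print(f"shards containing the glitch token: {hits}")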
The OP said "truly open" not "open model" or any of the other BS out there. If you are truly open you share the training corpora as well or at least a comprehensive description of what it is and where to get it.
Lots of AI researchers have shown that you can both credit and discredit "open models" when you are given the dataset and training steps.
Many lauded papers drew the ire of Reddit ML or Twitter when people couldn't reproduce the model or results.
If you are given the training set, the weights, the steps required, and enough compute, you can do it.
Having enough compute and people releasing the steps is the main impediment.
For my research I always release all of my code, and the order of execution steps, and of course the training set. I also give confidence intervals based on my runs so people can reproduce and see if we get similar intervals.
After playing with and using AI for almost two years now, I find it is not getting better when you weigh performance against cost.
So the higher the cost, the better the performance. While models and hardware can be improved, the curve is still steep.
The big question is: what are people using it for? Well, they are using lightweight, simplistic models to do targeted tasks, i.e. many smaller and easier-to-process tasks.
Most of the news on AI is just there to promote a product to earn more cash.
I am fairly optimistic about LLMs as a human math -> theorem-prover translator, and as a fan of Idris I am glad that the AI community is investing in Lean. As the author shows, the answer to "Can AI be useful for automated mathematical work?" is clearly "yes."
But I am confident the answer to the question in the headline is "no, not for several decades." It's not just the underwhelming benchmark results discussed in the post, or the general concern about hard undergraduate math using different skillsets than ordinary research math. IMO the deeper problem still seems to be a basic gap where LLMs can seemingly do formal math at the level of a smart graduate student but fail at quantitative/geometric reasoning problems designed for fish. I suspect this holds for O3, based on one of the ARC problems it wasn't able to solve: https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr... (via https://www.interconnects.ai/p/openais-o3-the-2024-finale-of...) ANNs are simply not able to form abstractions, they can only imitate them via enormous amounts of data and compute. I would say there has been zero progress on "common sense" math in computers since the invention of Lisp: we are still faking it with expert systems, even if LLM expert systems are easier to build at scale with raw data.
It is the same old problem where an ANN can attain superhuman performance on level 1 of Breakout, but it has to be retrained for level 2. I am not convinced it makes sense to say AI can do math if AI doesn't understand what "four" means with the same depth as a rat, even if it can solve sophisticated modular arithmetic problems. In human terms, does it make sense to say a straightedge-and-compass AI understands Euclidean geometry if it's not capable of understanding the physical intuition behind Euclid's axioms? It makes more sense to say it's a brainless tool that helps with the tedium and drudgery of actually proving things in mathematics.
To give a sense of scale: it's not that o3 failed to solve that red/blue rectangle problem once. o3 spent thousands of GPU hours putting out text about that problem, creating by my math about a million pages of text, and did not find the answer anywhere in those pages. For other problems it did find the answer around the million-page mark, as at the ~$3000-per-problem spend setting the score was still slowly creeping up.
If the trajectory of the past two years is any guide, things that can be done at great compute expense now will rapidly become possible for a fraction of the cost.
it can take my math and point out a step I missed and then show me the correct procedure but still get the wrong result because it can't reliably multiply 2-digit numbers
it's a "language" model (LLM), not a "math" model. when it is generating your answer, predicting and outputing a word after word it is _not_ multiplying your numbers internally.
Yes, I know. It's just kind of interesting how it can make inferences about complicated things but not get multiplications correct that would almost definitely have been in its training set many times (two digit by two digit)
Which is actually a problem I have with ARC (and IQ tests more generally): it is computationally cheaper to go from ARC transformation rule -> ARC problem than it is the other way around. But this means it’s pretty easy to generate ARC problems with non-unique solutions.
One thing I know is that there wouldn’t be machines entering IMO 2025. The concept of “marker” does not exist in IMO - scores are decided by negotiations between team leaders of each country and the juries. It is important to get each team leader involved for grading the work of students for their country, for accountability as well as acknowledging cultural differences. And the hundreds of people are not going to stay longer to grade AI work.
So here's what I'm perplexed about. There are statements in Presburger arithmetic that take time doubly exponential (or worse) in the size of the statement to reach via any path of the formal system whatsoever. These are arithmetic truths about the natural numbers. Can these statements be reached faster in ZFC? Possibly—it's well-known that there exist shorter proofs of true statements in more powerful consistent systems.
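(For reference, the classical result behind the doubly exponential claim is Fischer and Rabin's lower bound: as far as I recall it, there is a constant c > 0 such that any decision procedure for Presburger arithmetic needs at least

    2^{2^{cn}}

steps on some true statements of length n, and a similar doubly exponential bound is usually cited for proof length in any reasonable axiomatisation.)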
But the problem then is that one can suppose there are also true short statements in ZFC which likewise require doubly exponential time to reach via any path. Presburger Arithmetic is decidable whereas ZFC is not, so these statements would require the additional axioms of ZFC for shorter proofs, but I think it's safe to assume such statements exist.
Now let's suppose an AI model can resolve the truth of these short statements quickly. That means one of three things:
1) The AI model can discover doubly exponential length proof paths within the framework of ZFC.
2) There are certain short statements in the formal language of ZFC that the AI model cannot discover the truth of.
3) The AI model operates outside of ZFC to find the truth of statements in the framework of some other, potentially unknown formal system (and for arithmetical statements, the system must necessarily be sound).
How likely are each of these outcomes?
1) is not possible within any coherent, human-scale timeframe.
2) IMO is the most likely outcome, but then this means there are some really interesting things in mathematics that AI cannot discover. Perhaps the same set of things that humans find interesting. Once we have exhausted the theorems with short proofs in ZFC, there will still be an infinite number of short and interesting statements that we cannot resolve.
3) This would be the most bizarre outcome of all. If AI operates in a consistent way outside the framework of ZFC, then that would be equivalent to solving the halting problem for certain (infinite) sets of Turing machine configurations that ZFC cannot solve. That in itself isn't too strange (e.g., it might turn out that ZFC lacks an axiom necessary to prove something as simple as the Collatz conjecture), but what would be strange is that it could find these new formal systems efficiently. In other words, it would have discovered an algorithmic way to procure new axioms that lead to efficient proofs of true arithmetic statements. One could also view that as an efficient algorithm for computing BB(n), which obviously we think isn't possible. See Levin's papers on the feasibility of extending PA in a way that leads to quickly discovering more of the halting sequence.
> and for arithmetical statements, the system must necessarily be sound
Why do you say this? The AI doesn't know or care about soundness. Probably it has mathematical intuition that makes unsound assumptions, like human mathematicians do.
> How likely are each of these outcomes?
I think they'll all be true to a certain extent, just as they are for human mathematicians. There will probably be certain classes of extremely long proofs that the AI has no trouble discovering (because they have some kind of structure, just not structure that can be expressed in ZFC), certain truths that the AI makes an intuitive leap to despite not being able to prove them in ZFC (just as human mathematicians do), and certain short statements that the AI cannot prove one way or another (like Goldbach or twin primes or what have you, again, just as human mathematicians can't).
ZFC is way worse than Presburger arithmetic -- since it is undecidable, we know that the length of the minimal proof of a statement cannot be bounded by a computable function of the length of the statement.
This has little to do with the usefulness of LLMs for research-level mathematics though. I do not think that anyone is hoping to get a decision procedure out of it, but rather something that would imitate human reasoning, which is heavily based on analogies ("we want to solve this problem, which shares some similarities with that other solved problem, can we apply the same proof strategy? if not, can we generalise the strategy so that it becomes applicable?").
2 is definitely true. 3 is much more interesting and likely true but even saying it takes us into deep philosophical waters.
If every true theorem had a proof in a computationally bounded length the halting problem would be solvable. So the AI can't find some of those proofs.
The reason I say 3 is deep is that ultimately our foundational reasons to assume ZFC plus the bits we need for logic come from philosophical groundings, and not everyone accepts the same ones. Ultrafinitists and large-cardinal theorists are both kinds of people I've met.
My understanding is that no model-dependent theorem of ZFC or its extensions (e.g., ZFC+CH, ZFC+¬CH) provides any insight into the behavior of Turing machines. If our goal is to invent an algorithm that finds better algorithms, then the philosophical angle is irrelevant. For computational purposes, we would only care about new axioms independent of ZFC if they allow us to prove additional Turing machine configurations as non-halting.
> There are statements in Presburger arithmetic that take time doubly exponential (or worse) in the size of the statement to reach via any path of the formal system whatsoever.
This is a correct statement about the worst case runtime. What is interesting for practical applications is whether such statements are among those that you are practically interested in.
I would certainly think so. The statements mathematicians seem to be interested in tend to be at a "higher level" than simple but true statements like 2+3=5. And they necessarily have a short description in the formal language of ZFC, otherwise we couldn't write them down (e.g., Fermat's last theorem).
If the truth of these higher level statements instantly unlocks many other truths, then it makes sense to think of them in the same way that knowing BB(5) allows one to instantly classify any Turing machine configuration on the computation graph of all n ≤ 5 state Turing machines (on empty tape input) as halting/non-halting.
Every profession seems to have a pessimistic view of AI as soon as it starts to make progress in their domain. Denial, Anger, Bargaining, Depression, and Acceptance. Artists seem to be in the depression state, many programmers are still in the denial phase. Pretty solid denial here from a mathematician. o3 was a proof of concept, like every other domain AI enters, it's going to keep getting better.
Society is CLEARLY not ready for what AI's impact is going to be. We've been through change before, but never at this scale and speed. I think Musk/Vivek's DOGE thing is important, our government has gotten quite large and bureaucratic. But the clock has started on AI, and this is a social structural issue we've gotta figure out. Putting it off means we probably become subjects to a default set of rulers if not the shoggoth itself.
The reason why this is so disruptive is that it will affect hundreds of fields simultaneously.
Previously workers in a field disrupted by automation would retrain to a different part of the economy.
If AI pans out to the point that there are mass layoffs in hundreds of sectors of the economy at once, then i’m not sure the process we have haphazardly set up now will work. People will have no idea where to go beyond manual labor. (But this will be difficult due to the obesity crisis - but maybe it will save lives in a weird way).
Well it hasn’t happened yet at least (unemployment is near historic lows). How much better does AI need to get? And do we actually expect it to happen? Improving on random benchmarks is not necessarily evidence of being able to do a specific job.
If there are 'mass layoffs in hundreds of sectors of the economy at once', then the economy immediately goes into Great Depression 2.0 or worse. Consumer spending is two-thirds of the US economy, when everyone loses their jobs and stops having disposable income that's literally what a depression is
This will create a prisoner’s dilemma for corporations then, the government will have to step in to provide incentives for insanely profitable corporations to keep the proper number of people employed or limit the rate of layoffs.
I think it's a little of both. Maybe generative AI algorithms won't overcome their initial limitations. But maybe we don't need to overcome them to transform society in a very significant way.
It's because we then go check it out, and see how useless it is when applied to the domain.
> programmers are still in the denial phase
I am doing a startup and would jump on any way to make the development or process more efficient. But the only thing LLMs are really good for are investor pitches.
My favourite moments of being a graduate student in math was showing my friends (and sometimes professors) proofs of propositions and theorems that we discussed together. To be the first to put together a coherent piece of reasoning that would convince them of the truth was immensely exciting. Those were great bonding moments amongst colleagues. The very fact that we needed each other to figure out the basics of the subject was part of what made the journey so great.
Now, all of that will be done by AI.
Reminds of the time when I finally enabled invincibility in Goldeneye 007. Rather boring.
I think we've stopped appreciating the human struggle and experience and have placed all the value on the end product, and that's why we're developing AI so much.
Yeah, there is the possibility of working with an AI but at that point, what is the point? Seems rather pointless to me in an art like mathematics.
No "AI" of any description is doing novel proofs at the moment. Not o3, or anything else.
LLMs are good for chatting about basic intuition with, up to and including complex subjects, if and only if there are publically available data on the topic which have been fed to the LLM during its training. They're good at doing summaries and overviews of specific things (if you push them around and insist they don't waffle and ignore garbage carefully and keep your critical thinking hat on, etc etc).
It's like having a magnifying glass that focuses in on the small little maths question you might have, without you having to sift through ten blogs or videos or whatever.
That's hardly going to replace graduate students doing proofs with professors, though, at least not with the methods being employed thus far!
I think that’s provably incorrect for the current approach to LLMs. They all have a horizon over which they correlate tokens in the input stream.
So, for any LLM, if you intersperse more than that number of ‘X’ tokens between each useful token, they won’t be able to do anything resembling intelligence.
The current LLMs are a bit like n-gram databases that do not use letters, but larger units.
Naturally, humans couldn’t do it, even though they could edit the input to remove the X’s, but shouldn’t we evaluate the ability (even intelligent ability) of LLM’s on what they can generally do rather than amplify their weakness?
Why is that unfair in reply to the claim “At this stage I assume everything having a sequencial pattern can and will be automated by LLM AIs.”?
I am not claiming LLMs aren’t or cannot be intelligent, not even that they cannot do magical things; I just rebuked a statement about the lack of limits of LLMs.
> Naturally, humans couldn’t do it, even though they could edit the input to remove the X’s
So, what are you claiming: that they cannot or that they can? I think most people can and many would. Confronted with a file containing millions of X’s, many humans will wonder whether there’s something else than X’s in the file, do a ‘replace all’, discover the question hidden in that sea of X’s, and answer it.
There even are simple files where most humans would easily spot things without having to think of removing those X's. Consider a file
How X X X X X X
many X X X X X X
days X X X X X X
are X X X X X X
there X X X X X X
in X X X X X X
a X X X X X X
week? X X X X X X
with a million X’s on the end of each line. Spotting the question in that is easy for humans, but impossible for the current bunch of LLMs
This is only easy because the software does line wrapping for you, mechanistically transforming the hard pattern of millions of symbols into another that happens to be easy for your visual system to match. Do the same for any visually capable model and it will get that easily too. Conversely, make it a single line (like the one the transformer sees) and you will struggle much more than the transformer, because you'll have to scan millions of symbols sequentially looking for patterns.
Humans have weak attention compared to it, this is a poor example.
If you have a million Xs on the end of each line, when a human is looking at that file, he's not looking at the entirety of it, but only at the part that is actually visible on-screen, so the equivalent task for an LLM would be to feed it the same subset as input. In which case they can all answer this question just fine.
The follow-up question is "Does it require a paradigm shift to solve it?". And the answer could be "No". Episodic memory, hierarchical learnable tokenization, online learning or whatever works well on GPUs.
> FrontierMath is a secret dataset of “hundreds” of hard maths questions, curated by Epoch AI, and announced last month.
The database stopped being secret when it was fed to proprietary LLMs running in the cloud. If anyone is not thinking that OpenAI has trained and tuned O3 on the "secret" problems people fed to GPT-4o, I have a bridge to sell you.
It's perfectly possible for OpenAI to run the model (or provide others the means to run it) without storing queries/outputs for future use. I expect Epoch AI would insist on this. Perhaps OpenAI would lie about it, but that would open them up to serious charges.
What evidence do we need that AI companies are exploiting every bit of information they can use to get ahead in the benchmarks to generate more hype? Ignoring terms/agreements, violating copyright, and otherwise exploiting information for personal gain is the foundation of that entire industry for crying out loud.
Some people are also forgetting who is the CEO of OpenAI.
Sam Altman has long talked about believing in the "move fast and break things" way of doing business. Which is just a nicer way of saying do whatever dodgy things you can get away with.
> How much longer this will go on for nobody knows, but there are lots of people pouring lots of money into this game so it would be a fool who bets on progress slowing down any time soon.
Money cannot solve the issues faced by the industry which mainly revolves around lack of training data.
They already used the entirety of the internet, all available video, audio and books and they are now dealing with the fact that most content online is now generated by these models, thus making it useless as training data.
When did we decide that AI == LLM? Oh don't answer. I know, The VC world noticed CNNs and LLMs about 10 years ago and it's the only thing anyone's talked about ever since.
Seems to me the answer to 'Can AI do maths yet?' depends on what you call AI and what you call maths. Our old departmental VAX, running at a handful of megahertz, could do some very clever symbol manipulation on binomials, and if you gave it a few seconds it could even do something like theorem proving via proto-Prolog. Neither is anywhere close to the glorious AGI future we hope to sell to industry and government, but it seems worth considering how they're different, why they worked, and whether there's room for some hybrid approach. Do LLMs need to know how to do math if they know how to write Prolog or Coq statements that can do interesting things?
I've heard people say they want to build software that emulates (simulates?) how humans do arithmetic, but ask a human to add anything bigger than two digit numbers and the first thing they do is reach for a calculator.
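On that hybrid approach, the division of labour already works today: the LLM only has to emit a few lines for a computer algebra system rather than manipulate the symbols itself. A small sketch with SymPy (my example, nothing to do with the VAX-era software):

    from sympy import symbols, expand, factor, solve

    x, y = symbols('x y')

    print(expand((x + y)**4))        # x**4 + 4*x**3*y + 6*x**2*y**2 + 4*x*y**3 + y**4
    print(factor(x**4 - 1))          # (x - 1)*(x + 1)*(x**2 + 1)
    print(solve(x**2 - 5*x + 6, x))  # [2, 3]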
No it can't, and there's no such thing as AI. How is a thing that predicts the next-most-likely word going to do novel math? It can't even do existing math reliably because logical operations and statistical approximation are fundamentally different. It is fun watching grifters put lipstick on this thing and shop it around as a magic pig though.
openai and epochai (frontier math) are startups with a strong incentive to push such narratives. the real test will be in actual adoption in real world use cases.
the management class has a strong incentive to believe in this narrative, since it helps them reduce labor cost. so they are investing in it.
eventually, the emperor will be seen to have no clothes at least in some usecases for which it is being peddled right now.
"once" the training data can do it, LLMs will be able to do it. and AI will be able to do math once it comes to check out the lights of our day and night. until then it'll probably wonder continuously and contiguously: "wtf! permanence! why?! how?! by my guts, it actually fucking works! why?! how?!"
AWS announced, 2 or 3 weeks ago, a way of formulating rules in a formal language.
AI doesn't need to learn everything; our LLM models already contain EVERYTHING, including ways of finding a solution step by step.
Which means you can tell an LLM to translate whatever you want into a logical language and use an external logic verifier. The only thing an LLM or AI needs to 'understand' at this point is how to make sure that the statistical translation from one form to the other is accurate enough.
Your brain doesn't just do logic out of the box; you conclude things and then formulate them.
And plenty of companies work on this. It's the same with programming: if you are able to write code and execute it, you execute it until the compiler errors are gone, and now your LLM can write valid code out of the box. Let the LLM write unit tests, and now it can verify itself.
Claude for example offers you, out of the box, to write a validation script. You can give claude back the output of the script claude suggested to you.
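A minimal sketch of the translate-then-verify idea above, assuming the LLM has already translated the classic Socrates syllogism into constraints for the Z3 solver (z3-solver Python package; the encoding is my illustration, not something from this thread):

    # pip install z3-solver
    from z3 import (DeclareSort, Function, BoolSort, Const, ForAll,
                    Implies, Not, Solver, unsat)

    Thing = DeclareSort('Thing')
    Man = Function('Man', Thing, BoolSort())
    Mortal = Function('Mortal', Thing, BoolSort())
    socrates = Const('socrates', Thing)
    x = Const('x', Thing)

    s = Solver()
    s.add(ForAll([x], Implies(Man(x), Mortal(x))))  # premise 1 (the LLM's translation)
    s.add(Man(socrates))                            # premise 2
    s.add(Not(Mortal(socrates)))                    # negated conclusion

    # If premises plus the negated conclusion are unsatisfiable, the
    # conclusion follows, and the check happens outside the LLM.
    print("conclusion proved" if s.check() == unsat else "not proved")

The LLM's only job here is the translation; the soundness of the final answer rests on the solver.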
I tried. I don't have the time to formulate and scrutinise adequate arguments, though.
Do you? Anything anywhere you could point me to?
The algorithms live entirely off the training data. They consistently fail to "abduct" (inference) beyond any language-in/of-the-training-specific information.
It is a gradual thing. Presumably the models are inferring things at runtime that were not part of their training data.
Anyhow, philosophically speaking you are also only exposed to what your senses pick up, but presumably you are able to infer things?
As written: this is a dogma that stems from a limited understanding of what algorithmic processes are and the insistence that emergence can not happen from algorithmic systems.
There can be more than one problem. The history of computing (or even just the history of AI) is full of things that worked better and better right until they hit a wall. We get diminishing returns adding more and more training data. It’s really not hard to imagine a series of breakthroughs bringing us way ahead of LLMs.
It still has to know what to code in that environment. And based on my years of math as a wee little undergrad, the actual arithmetic was the least interesting part. LLMs are horrible at basic arithmetic, but they can use Python as the calculator. But Python won't help them write the correct equations or even solve for the right thing (Wolfram Alpha can do a bit of that though).
I’ve yet to encounter an equation that 4o couldn’t answer in 1-2 prompts unless it timed out. Even then it can provide the solution in a Jupyter notebook that can be run locally.
Never really pushed it. I have no reason to believe it wouldn't get most of that stuff correct. Math is very much like programming, and I'm sure it can output really good Python for its notebook to execute.
That's the equivalent of what we are asking the model to do. If you give the model a calculator it will get 100%. If you give it pen and paper (e.g. let it show its working) then it will get near 100%.
> That's the equivalent to what we are asking the model to do.
Why?
What does it mean to give a model a calculator?
What do you mean “let it show its working”? If I ask an LLM to do a calculation, I never said it can’t express the answer to me in long-form text or with intermediate steps.
If I ask a human to do a calculation that they can’t reliably do in their head, they are intelligent enough to know that they should use a pen and paper without needing my preemptive permission.
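For what it's worth, "giving the model a calculator" in practice just means exposing a deterministic tool the model can call with an expression string; the tool-calling wiring around it is assumed and omitted in this minimal sketch:

    import ast
    import operator

    # Whitelisted operators for a deterministic arithmetic tool.
    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv,
           ast.Pow: operator.pow, ast.USub: operator.neg}

    def calculator(expression: str):
        """Safely evaluate a pure arithmetic expression such as '37 * 89'."""
        def walk(node):
            if isinstance(node, ast.Expression):
                return walk(node.body)
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
                return OPS[type(node.op)](walk(node.operand))
            raise ValueError("only plain arithmetic is allowed")
        return walk(ast.parse(expression, mode='eval'))

    print(calculator("37 * 89"))        # 3293 (the kind of product LLMs fumble)
    print(calculator("(17 + 5) ** 2"))  # 484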
AI has an interior world model, thus it can do math if a chain of proof walks without uncertainty from room to room. The problem is its inability to reflect on its own uncertainty and then override that uncertainty, should a new room-entrance method be self-similar to a previous entrance.
Unfortunately, the scientists who study actual brains have all sorts of interesting models but ultimately very little clue how these actual brains work at the level of problem solving. I mean, there's all sorts of "this area is associated with that kind of process" and "here's evidence this area does this algorithm" stuff, but it's all at the level you'd imagine steam engine engineers trying to understand a warp drive.
The "open worm project" was an effort years ago to get computer scientists involved in trying to understand what "software" a very small actual brain could run. I believe progress here has been very slow, which gives some idea of how ignorant we are about what much larger brains involve.
I haven't checked in a while, but last I checked, ChatGPT struggled on very basic things like: how many Fs are in this word? Not sure if they've managed to fix that, but since then I have lost hope in getting it to do any sort of math.
I may be wrong, but I think it a silly question. AI is basically auto-complete. It can do math to the extent you can find a solution via auto-complete based on an existing corpus of text.
Humans can autocomplete sentences too because we understand what's going on. Prediction is a necessary criterion for intelligence, not an irrelevant one.
I understand the appeal of having a machine helping us with maths and expanding the frontier of knowledge. They can assist researchers and make them more productive, just like they can already make programmers more productive.
But maths is also a fun and fulfilling activity. Very often, when we learn a math theory, it's because we want to understand and gain intuition on the concepts, or we want to solve a puzzle (for which we can already look up the solution). Maybe it's similar to chess. We didn't develop chess engines to replace human players and make them play each other, but they helped us become better chess players and understand the game better.
So the recent progress is impressive, but I still don't see how we'll use this tech practically and what impacts it can have and in which fields.
"Can AI do math for us" is the canonical wrong question. People want self-driving cars so they can drink and watch TV. We should crave tools that enhance our abilities, as tools have done since prehistoric times.
I'm a research mathematician. In the 1980's I'd ask everyone I knew a question, and flip through the hard bound library volumes of Mathematical Reviews, hoping to recognize something. If I was lucky, I'd get a hit in three weeks.
Internet search has shortened this turnaround. One instead needs to guess what someone else might call an idea. "Broken circuits?" Score! Still, time consuming.
I went all in on ChatGPT after hearing that Terry Tao had learned the Lean 4 proof assistant in a matter of weeks, relying heavily on AI advice. It's clumsy, but a very fast way to get suggestions.
Now, one can hold involved conversations with ChatGPT or Claude, exploring mathematical ideas. AI is often wrong, never knows when it's wrong, but people are like this too. Read how the insurance incidents for self-driving taxis are well below the human incident rates? Talking to fellow mathematicians can be frustrating, and so is talking with AI, but AI conversations go faster and can take place in the middle of the night.
I don't want AI to prove theorems for me, those theorems will be as boring as most of the dreck published by humans. I want AI to inspire bursts of creativity in humans.
> AI is often wrong, never knows when it's wrong, but people are like this too.
When talking with various models of ChatGPT about research math, my biggest gripe is that it's either confidently right (10% of my work) or confidently wrong (90%). A human researcher would be right 15% of the time, unsure 50% of the time, and give helpful ideas that are right/helpful (25%) or wrong/a red herring (10%). And only 5% of the time would a good researcher be confidently wrong in a way that ChatGPT is often.
In other words, ChatGPT completely lacks the meta-layer of "having a feeling/knowing how confident it is", which is so useful in research.
these numbers are just your perception. The way you ask the question will very much influence the output and certain topics more than others. I get much better results when I share my certainty levels in my questions and say things like "if at all", "if any" etc.
I agree with this approach and use it myself, but these confidence markers can also skew output in undesirable ways. All of these heuristics are especially fragile when the subject matter touches the frontiers of what is known.
In any case my best experiences with LLMs for pure math research have been for exploring the problem space and ideation -- queries along the line of "Here's a problem I'm working on ... . Do any other fields have a version of this problem, but framed differently?" or "Give me some totally left field methods, even if they are from different fields or unlikely to work. Assume I've exhausted all the 'obvious' approaches from field X"
Yeah, blame the users for "using it wrong" (phrase of the week I would say after the o3 discussions), and then sell the solution as almost-AGI.
PS: I'm starting to see a lot of plausible deniability in some comments about LLMs capabilites. When LLMs do great => "cool, we are scaling AI". when LLMs do something wrong => "user problem", "skill issues", "don't judge a fish for its ability to fly".
A human researcher who is basically right 40-95% of the time would probably be an Einstein-level genius.
Just assume that the LLM is wrong and test its claims - math is one of the few disciplines where you can do that easily.
It's pretty easy to test when it makes coding mistakes as well. It's also really good at "Hey that didn't work, here's my error message."
Do you think there’s potential for AI to develop a kind of probabilistic reasoning?
I agree. I think it comes down to the motivation behind why one does mathematics (or any other field for that matter). If it's a means to an end, then sure have the AI do the work and get rid of the researchers. However, that's not why everyone does math. For many it's more akin to why an artist paints. People still paint today even though a camera can produce much more realistic images. It was probably the case (I'm guessing!) that there was a significant drop in jobs for artists-for-hire, for whom painting was just a means to an end (e.g. creating a portrait), but the artists who were doing it for the sake of art survived and were presumably made better by the ability to see photos of other places they want to paint or art from other artists due to the invention of the camera.
> People want self-driving cars so they can drink and watch TV. We should crave tools that enhance our abilities, as tools have done since prehistoric times.
Improved tooling and techniques have given humans the free time and resources needed for arts, culture, philosophy, sports, and spending time to enjoy life! Fancy telecom technologies have allowed me to work from home and i love it :)
> Talking to fellow <humans> can be frustrating, and so is talking with AI, but AI conversations go faster and can take place in the middle of the night.
I made a slight change to generalise your statement, I think you have summarised the actual marketing opportunity.
I think I'm missing your point? You still want to enjoy doing math yourself? Is that what you are saying? So you equate "Can AI do math in my place?" with "Can AI drink and watch TV in my place?"
Ingredients to a top HN comment on AI include some nominal expert explaining why actually labor won’t be replaced and it will be a collaborative process so you don’t need to worry sprinkled with a little bit of ‘the status quo will stay still even though this tech only appeared in the last 2 years’
It didn't appear in the last two years. We have had deep learning based autoregressive language models (like Word2Vec) for at least 10 years.
Early computer networks appeared in the 1960s and the public internet as we know it in the 1990s.
We are still early in AI.
In a way, AI is part of the process, but it's a collaborative process. It doesn't do all the work.
AI will not do math for us, but maybe eventually it will lead to another mainstream tool for mathematicians. Along with R, Matlab, Sage, GAP, Magma, ...
It would be interesting if in the future mathematicians are just as fluent in some (possibly AI-powered) proof verifying tool, as they are with LaTeX today.
AI can already do a bunch of math. So "AI will not do math for us" is just factually wrong.
Can AI solve “toy” math problems that computers have not been able to do? Yes. Can AI produce novel math research? No, it hasn’t yet. So “AI will not do math for us” is only factually wrong if you take the weaker definition of “doing math for us”. The stronger definition is not factually wrong yet.
More problematic is that the statement doesn't specify a timeline. 1 year? Probably not. 10 years? Probably. 20 years? Very likely. 100 years? None of us here will be alive to be proven wrong, but I'll venture that that's a certainty.
This is a pretty strong position to take in the comments of a post where a mathematician declared the 5 problems he'd seen to be PhD level, and speculated that the real difficulty with switching from numerical answers to proofs will be finding humans qualified to judge the AI's answers.
I will agree that it's likely none of us here will be alive to be proven wrong, but that's in the 1 to 10 year range.
Your idea of ‘do math’ is a bit different from this context.
Here it means doing math research or, better, finding new math.
The analogy with self-driving cars is spot on
Your optimism should be tempered by the downside of progress: AI in the near future may not only inspire creativity in humans, but replace human creativity altogether.
Why do I need to hire an artist for my movie/video game/advertisement when AI can replicate all the creativity I need?
There is research on AI limiting creative output in competitive arenas. Essentially it breaks expectancy and therefore deteriorates iteration.
https://direct.mit.edu/rest/article-abstract/102/3/583/96779...
This was about mathematics.
There was a little more information in that reddit thread. Of the three difficulty tiers, 25% are T1 (easiest) and 50% are T2. Of the five public problems that the author looked at, two were T1 and two were T2. Glazer on reddit described T1 as "IMO/undergraduate problems", but the article author says that they don't consider them to be undergraduate problems. So the LLM is already doing what the author says they would be surprised about.
Also Glazer seemed to regret calling T1 "IMO/undergraduate", and not only because of the disparity between IMO and typical undergraduate. He said that "We bump problems down a tier if we feel the difficulty comes too heavily from applying a major result, even in an advanced field, as a black box, since that makes a problem vulnerable to naive attacks from models"
Also, all of the problems shown to Tao were T3.
The reddit thread is ... interesting (direct link[1]). It seems to be a debate among mathematicians, some of whom do have access to the secret set. But they're debating publicly, and so naturally avoiding any concrete examples that would give the set away, so they wind up with fuzzy, fiddly language for the qualities of the problem tiers.
The "reality" of keeping this stuff secret 'cause someone would train on it is itself bizarre and certainly shouldn't be above questioning.
https://www.reddit.com/r/OpenAI/comments/1hiq4yv/comment/m30...
It's not about training directly on the test set, it's about people discussing questions in the test set online (e.g., in forums), and then this data is swept up into the training set. That's what makes test set contamination so difficult to avoid.
Yes, that is the "reality": because companies can train their models on the whole Internet, companies will train their (base) models on the entire Internet.
And in this situation, "having heard the problem" actually serves as a barrier to understanding these harder problems, since any variation of a known problem will receive a standard half-assed guesstimate.
And these companies "can't not" use these base models, since they're resigned to the "bitter lesson" (better, the "bitter lesson viewpoint", imo) that they need large-scale heuristics for the start of their process, and only then can they start symbolic/reasoning manipulations.
But hold up! Why couldn't an organization freeze their training set and their problems and release both to the public? That would give us an idea where the research stands. Ah, the answer comes out: because they don't own the training set, and the thing they want to train is a commercial product that needs every drop of data to be the best. As Yann LeCun has said, this isn't research, this is product development.
>> It's not about training directly on the test set, it's about people discussing questions in the test set online
Don't kid yourself. There are tens of billions of dollars going into AI. Some of the humans involved would happily cheat on comparative tests to boost investment.
The incentives are definitely there, but even CEOs and VCs know that if they cheat the tests just to get more investment, they're only cheating themselves. No one is liquidating within the next 5 years, so either they end up getting caught and lose everything, or they've spent all this energy trying to cheat while having a subpar model, which results in them losing to competitors who actually invested in good technology.
Having a higher valuation could help with attracting better talent or more funding to invest in GPUs and actual model improvements but I don't think that outweighs the risks unless you're a tiny startup with nothing to show (but then you wouldn't have the money to bribe anyone).
People like to cheat. See the VW case: the company was big and established and still cheated.
It depends a lot on the individuals making up the company's command chain and their values.
Why is this any different from say, Theranos?
CEOs and VCs will happily lie because they are convinced they are smarter than everyone else and will solve the problem before they get caught.
Not having access to the dataset really makes the whole thing seem incredibly shady. Totally valid questions you are raising
It's a key aspect of the entire project. We have gone through many cycles of evals where the dataset is public.
> So the LLM is already doing what the author says they would be surprised about.
That's if you unconditionally believe the result without any proofreading, confirmation, or reproducibility, and with barely any details (we are given only one slide).
I just spent a few days trying to figure out some linear algebra with the help of ChatGPT. It's very useful for finding conceptual information from literature (which for a not-professional-mathematician at least can be really hard to find and decipher). But in the actual math it constantly makes very silly errors. E.g. indexing a vector beyond its dimension, trying to do matrix decomposition for scalars and insisting on multiplying matrices with mismatching dimensions.
o1 is a lot better at spotting its errors than 4o, but it too still makes a lot of really stupid mistakes. It seems to be quite far from producing results consistently without at least a somewhat clueful human doing the hand-holding.
LLMs have been very useful for me in explorations of linear algebra, because I can have an idea and say "what's this operation called?" or "how do I go from this thing to that thing?", and it'll give me the mechanism and an explanation, and then I can go read actual human-written literature or documentation on the subject.
It often gets the actual math wrong, but it is good enough at connecting the dots between my layman's intuition and the "right answer" that I can get myself over humps that I'd previously have been hopelessly stuck on.
It does make those mistakes you're talking about very frequently, but once I'm told that the thing I'm trying to do is achievable with the Gram-Schmidt process, I can go self-educate on that further.
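For anyone following along, the Gram-Schmidt process itself is only a few lines once you know its name; here is a minimal numpy sketch (the function name and tolerance are my own illustrative choices, not anything a model produced):

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a list of vectors via classical Gram-Schmidt."""
    basis = []
    for v in vectors:
        # subtract the components of v lying along the basis built so far
        w = v - sum((np.dot(v, b) * b for b in basis), np.zeros_like(v))
        norm = np.linalg.norm(w)
        if norm > 1e-12:  # skip vectors that are (numerically) linearly dependent
            basis.append(w / norm)
    return np.array(basis)

# two vectors in R^3 -> an orthonormal pair spanning the same plane
print(gram_schmidt([np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])]))
```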
The big thing I've had to watch out for is that it'll usually agree that my approach is a good or valid one, even when it turns out not to be. I've learned to ask my questions in the shape of "how do I", rather than "what if I..." or "is it a good idea to...", because most of the time it'll twist itself into shapes to affirm the direction I'm taking rather than challenging and refining it.
It also reliably fails basic real analysis proofs, but I think this is not too surprising, since those require a mix of logic and computation that is likely hard to infer just from the statistical likelihood of tokens.
Isn't Wolfram Alpha a better "ChatGPT of Math"?
Wolfram Alpha is better at actually doing math, but far worse at explaining what it’s doing, and why.
What’s worse about it?
It never tells you the wrong thing, at the very least.
When you give it a large math problem and the answer is "seven point one three five ...", and it shows a plot of the result vs. some randomly selected domain, well, there could be more I'd like to know.
You can unlock a full derivation of the solution, for cases where you say "Solve" or "Simplify", but what I (and I suspect GP) might want, is to know why a few of the key steps might work.
It's a fantastic tool that helped get me through my (engineering) grad work, but ultimately the breakthrough inequalities that helped me write some of my best stuff came out of a book I bought in desperation that basically cataloged known linear algebra inequalities and simplifications.
When I try that kind of thing with the best LLM I can use (as of a few months ago, admittedly), the results can get incorrect pretty quickly.
> [...], but what I (and I suspect GP) might want, is to know why a few of the key steps might work.
It's been some time since I've used the step-by-step explainer, and it was for calculus or intro physics problems at best, but IIRC the pro subscription will at least mention the method used to solve each step and link to reference materials (e.g., a clickable tag labeled "integration by parts"). Doesn't exactly explain why but does provide useful keywords in a sequence that can be used to derive the why.
What book was it that you found helpful?
I'm reviewing linear algebra now and would also love to know that book!
Its understanding of problems was very bad the last time I used it, meaning it was difficult to communicate what you wanted it to do. Usually I try to write in the Mathematica language, but even that is not foolproof.
Hopefully they have incorporated a more modern LLM since then, but it hasn't been that long.
Wolfram Alpha's "smartness" is often Clippy-level enraging. E.g. it makes assumptions about symbols based on their names (e.g. a is assumed to be a constant, derivatives are taken w.r.t. x). Even with Mathematica syntax it tends to make such assumptions, and it refuses to lift them even when explicitly directed. Quite often one has to change the variable symbols used to try to make Alpha do what's meant.
I wish there were a way to tell ChatGPT where it has made a mistake, with a single mouse click.
What's surprising to me is that this would surely be in OpenAI's interests, too -- free RLHF!
Of course there would be the risk of adversaries giving bogus feedback, but my gut says it's relatively straightforward to filter out most of this muck.
Is the explanation a pro feature? At the very end it says "step by step? Pay here"
Wolfram Alpha can solve equations well, but it is terrible at understanding natural language.
For example I asked Wolfram Alpha "How heavy a rocket has to be to launch 5 tons to LEO with a specific impulse of 400s", which is a straightforward application of the Tsiolkovsky rocket equation. Wolfram Alpha gave me some nonsense about particle physics (result: 95 MeV/c^2), GPT-4o did it right (result: 53.45 tons).
Wolfram Alpha knows about the Tsiolkovsky rocket equation, and it knows about LEO (low Earth orbit), but I found no way to get a delta-v out of it; again, more nonsense. It tells me about Delta airlines, and mentions satellites that it knows are not in LEO. The "natural language" part is a joke. It is more like an advanced calculator, and for that, it is great.
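For reference, the calculation itself is a one-liner; here is a minimal sketch (the delta-v budget of roughly 9.3 km/s to LEO is an assumption on my part, chosen because it reproduces the 53-ton figure):

```python
import math

# Tsiolkovsky rocket equation: delta_v = Isp * g0 * ln(m0 / m_final)
# Solve for the initial mass m0 given the final (payload) mass, Isp, and delta-v.
g0 = 9.81        # m/s^2, standard gravity
isp = 400        # s, specific impulse from the question
delta_v = 9300   # m/s, assumed surface-to-LEO budget
m_final = 5      # tons delivered to LEO

m0 = m_final * math.exp(delta_v / (isp * g0))
print(f"initial mass ~ {m0:.1f} tons")  # ~53.5 tons, close to GPT-4o's 53.45
```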
You're using it wrong, you can use natural language in your equation, but afaik it's not supposed to be able to do what you're asking of it.
You know, "You're using it wrong" is usually meant to carry an ironic or sarcastic tone, right?
It dates back to Steve Jobs blaming an iPhone 4 user for "holding it wrong" rather than acknowledging a flawed antenna design that was causing dropped calls. The closest Apple ever came to admitting that it was their problem was when they subsequently ran an employment ad to hire a new antenna engineering lead. Maybe it's time for Wolfram to hire a new language-model lead.
No, “holding it wrong” is the sarcastic version. “You’re using it wrong” is a super common way to tell people they are literally using something wrong.
But they're not using it wrong. They are using it as advertised by Wolfram themselves (read: himself).
The GP's rocket equation question is exactly the sort of use case for which Alpha has been touted for years.
It's not an LLM. You're simply asking too much of it. It doesn't work the way you want it to, sorry.
Correct, so it isn't a "ChatGPT of Math", which was the point.
Tell Wolfram. They're the ones who've been advertising it for years, well before LLMs were a thing, using English-language prompts like these examples: https://www.pcmag.com/news/23-cool-non-math-things-you-can-d...
The problem has always been that you only get good answers if you happen to stumble on a specific question that it can handle. Combining Alpha with an LLM could actually be pretty awesome, but I'm sure it's easier said than done.
Before LLMs exploded nobody really expected WA to perform well at natural language comprehension. The expectations were at the level of "an ELIZA that knows math".
Wolfram Alpha is mostly for "trivia" type problems. Or giving solutions to equations.
I was figuring out some mode decomposition methods such as ESPRIT and Prony and how to potentially extend/customize them. Wolfram Alpha doesn't seem to have a clue about such.
No. Wolfram Alpha can't solve anything that isn't a function evaluation or equation. And it can't do modular arithmetic to save its unlife.
WolframOne/Mathematica is better, but that requires the user (or ChatGPT!) to write complicated code, not natural language queries.
I wonder if these are tokenization issues? I really am curious about Meta's byte tokenization scheme...
Probably mostly not. The errors tend to be logical/conceptual. E.g. mixing up scalars and matrices is unlikely to be from tokenization. Especially if using spaces between the variables and operators, as AFAIK GPTs don't form tokens over spaces (although tokens may start or end with them).
The only thing I've consistently had issues with while using AI is graphs. If I ask it to plot some simple function, it produces a really weird image that has nothing to do with the graph I want. It will be a weird swirl of lines and words, and it never corrects itself no matter what I say to it.
Has anyone had any luck with this? It seems like the only thing that it just can't do.
You're doing it wrong. It can't produce proper graphs with its diffusion-style image generation.
Ask it to produce graphs with python and matplotlib. That will work.
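Something along these lines is what you want it to emit (a minimal sketch; the function and filename are arbitrary placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

# plot a simple function as an actual line chart rather than a generated image
x = np.linspace(-2 * np.pi, 2 * np.pi, 500)
plt.plot(x, np.sin(x))
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.title("y = sin(x)")
plt.savefig("sine.png")  # or plt.show() when running locally
```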
And it works very well - it made me a nice general "draw successively accurate Fourier series approximations given this lambda for the coefficients and this lambda for the constant term". PNG output, no real programming errors (I wouldn't remember if it had some stupid error; I'm a Python programmer). Even TikZ in LaTeX isn't hopeless (although I did end up reading the TikZ manual).
Ask it to plot the graph with python plotting utilities. Not using its image generator. I think you need a ChatGPT subscription though for it to be able to run python code.
You seem to get 2(?) free Python program runs per week(?) as part of the o1 preview.
When you visit chatgpt on the free account it automatically gives you the best model and then disables it after some amount of work and says to come back later or upgrade.
Just install Python locally, and copy paste the code.
Shouldn’t ChatGPT be smart enough to know to do this automatically, based on context?
It was, for a while. I think this is an area where there may have been some regression. It can still write code to solve problems that are a poor fit for the language model, but you may need to ask it to do that explicitly.
The agentic reasoning models should be able to fix this if they have the ability to run code instead of handling every task themselves. "I need to make a graph" - "LLMs have difficulty graphing novel functions" - "Call python instead" is a line of reasoning I would expect after seeing what o1 has come up with on other problems.
Giving AI the ability to execute code is the safety people's nightmare, though; I wonder if we'll hear anything from them, as this is surely coming.
Don't most mathematical papers contain at least one such error?
Where is this data from?
It's a question, and to be fair to the AI, it should actually refer to papers before review.
Yes, it's a question, but you haven't answered what you read that makes you suspect so.
Yesterday, I saw a thought-provoking talk about the future of "math jobs", assuming automated theorem proving becomes more prevalent in the future.
[ (Re)imagining mathematics in a world of reasoning machines by Akshay Venkatesh]
https://www.youtube.com/watch?v=vYCT7cw0ycw [54min]
Abstract: In the coming decades, developments in automated reasoning will likely transform the way that research mathematics is conceptualized and carried out. I will discuss some ways we might think about this. The talk will not be about current or potential abilities of computers to do mathematics—rather I will look at topics such as the history of automation and mathematics, and related philosophical questions.
See discussion at https://news.ycombinator.com/item?id=42465907
That was wonderful, thank you for linking it. For the benefit of anyone who doesn't have time to watch the whole thing, here are a few really nice quotes that convey some main points.
"We might put the axioms into a reasoning apparatus like the logical machinery of Stanley Jevons, and see all geometry come out of it. That process of reasoning are replaced by symbols and formulas... may seem artificial and puerile; and it is needless to point out how disastrous it would be in teaching and how hurtful to the mental development; how deadening it would be for investigators, whose originality it would nip in the bud. But as used by Professor Hilbert, it explains and justifies itself if one remembers the end pursued." Poincare on the value of reasoning machines, but the analogy to mathematics once we have theorem-proving AI is clear (that the tools and the lie direct outputs are not the ends. Human understanding is).
"Even if such a machine produced largely incomprehensible proofs, I would imagine that we would place much less value on proofs as a goal of math. I don't think humans will stop doing mathematics... I'm not saying there will be jobs for them, but I don't think we'll stop doing math."
"Mathematics is the study of reproducible mental objects." This definition is human ("mental") and social (it implies reproducing among individuals). "Maybe in this world, mathematics would involve a broader range of inquiry... We need to renegotiate the basic goals and values of the discipline." And he gives some examples of deep questions we may tackle beyond just proving theorems.
As someone who has an 18-year-old son who wants to study math, this has me (and him) ... worried ... about becoming obsolete?
But I'm wondering what other people think of this analogy.
I used to be a bench scientist (molecular genetics).
There were world class researchers who were more creative than I was. I even had a Nobel Laureate once tell me that my research was simply "dotting 'i's and crossing 't's".
Nevertheless, I still moved the field forward in my own small ways. I still did respectable work.
So, will these LLMs make us completely obsolete? Or will there still be room for those of us who can dot the "i"?--if only for the fact that LLMs don't have infinite time/resources to solve "everything."
I don't know. Maybe I'm whistling past the graveyard.
I was just thinking about this. I already posted a comment here, but I will say that as a mathematician (PhD in number theory), for me AI significantly takes away the beauty of doing mathematics within a realm in which AI is used.
The best part of math (again, just for me) is that it was a journey that was done by hand with only the human intellect that computers didn't understand. The beauty of the subject was precisely that it was a journey of human intellect.
As I said elsewhere, my friends used to ask me why something was true and it was fun to explain it to them, or ask them and have them explain it to me. Now most will just use some AI.
Soulless, in my opinion. Pure mathematics should be about the art of the thing, not producing results on an assembly line like it will be with AI. Of course, the best mathematicians are going into this because it helps their current careers, not because it helps the future of the subject. Math done with AI will be a lot like Olympic running done with performance-enhancing drugs.
Yes, we will get a few more results, faster. But the results will be entirely boring.
There are many similarities in your comment to how grandmasters discuss engines. I have a hunch the arc of AI in math will be very similar to the arc of engines in chess.
https://www.wired.com/story/defeated-chess-champ-garry-kaspa...
I agree with that, in the sense that math will become more about who can use AI the fastest to generate the most theories, which sort of side-steps the whole point of math.
As a chess aficionado and a former tournament player, who didn’t get very far, I can see pros & cons. They helped me train and get significantly better than I would’ve gotten without them. On the other hand, so did the competition. :) The average level of the game is so much higher than when I was a kid (30+ years ago) and new ways of playing that were unthinkable before are possible now. On the other hand cheating (online anyway) is rampant and all the memorization required to begin to be competitive can be daunting, and that sucks.
Hey I play chess too. Not a very good player though. But to be honest, I enjoy playing with people who are not serious because I do think an overabundance of knowledge makes the game too mechanical. Just my personal experience, but I think the risk of cheaters who use programs and the overmechanization of chess is not worth becoming a better player. (And in fact, I think MOST people can gain satisfaction by improving just by studying books and playing. But I do think that a few who don't have access to opponents benefit from a chess-playing computer).
I agree wholeheartedly about the beauty of doing mathematics. I will add though that the author of this article, Kevin Buzzard, doesn't need to do this for his career and from what I know of him is somebody who very much cares about mathematics and the future of the subject. The fact that a mathematician of that calibre is interested in this makes me more interested.
If you think the purpose of pure math is to provide employment and entertainment to mathematicians, this is a dark day.
If you believe the purpose of pure math is to shed light on patterns in nature, pave the way for the sciences, etc., this is fantastic news.
Well, 99% of pure math will never leave the domain of pure math so I'm really not sure what you are talking about.
We also seem to suffer these automation delusions right now.
I could see how AI could assist me with learning pure math but the idea AI is going to do pure math for me is just absurd.
Not only would I not know how to start, more importantly I have no interest in pure math. There will still be a huge time investment to get up to speed with doing anything with AI and pure math.
You have to know what questions to ask. People with domain knowledge seem to really be selling themselves short. I am not going to randomly stumble on a pure math problem prompt when I have no idea what I am doing.
> Now most will just use some AI.
Do people with PhD in math really ask AI to explain math concepts to them?
They will, when it becomes good enough to prove tricky things.
Presumably people who get into math going forward will feel differently.
For myself, chasing lemmas was always boring — and there’s little interest in doing the busywork of fleshing out a theory. For me, LLMs are a great way to do the fun parts (conceptual architecture) without the boring parts.
And I expect we'll see much the same change as with physics: computers increase the complexity of the objects we study, which tend to be rather simple when done by hand — eg, people don't investigate patterns in the diagrams of group(oids) because drawing million-element diagrams isn't tractable by hand. And you only notice the patterns in them when you see examples of the diagrams at scale.
Just a counterpoint, but I wonder how much you'll really understand if you can't even prove the whole thing yourself. Personally, I learn by proving but I guess everyone is different.
My hunch is it won't be much different, even when we can simply ask a machine that doesn't have a cached proof, "prove riemann hypothesis" and it thinks for ten seconds and spits out a fully correct proof.
As Erdős (I think?) said, great math is not about the answers, it's about the questions. Or maybe it was someone else, and maybe "great mathematicians" rather than "great math". But the gist is the same.
"What happens when you invent a thing that makes a function continuous (aka limit point)"? "What happens when you split the area under a curve into infinitesimal pieces and sum them up"? "What happens when you take the middle third out of an interval recursively"? "Can we define a set of axioms that underlie all mathematics"? "Is the graph of how many repetitions it takes for a complex number to diverge interesting"? I have a hard time imagining computers would ever have a strong enough understanding of the human experience with mathematics to even begin pondering such questions unprompted, let alone answer them and grok the implications.
Ultimately the truths of mathematics, the answers, soon to be proved primarily by computers, already exist. Proving a truth does not create the truth; the truth exists independent of whether it has been proved or not. So fundamentally math is closer to archeology than it may appear. As such, AI is just a tool to help us dig with greater efficiency. But it should not be considered or feared as a replacement for mathematicians. AI can never take away the enlightenment of discovering something new, even if it does all the hard work itself.
> I have a hard time imagining computers would ever have a strong enough understanding of the human experience with mathematics to even begin pondering such questions unprompted, let alone answer them and grok the implications.
The key is that the good questions however come from hard-won experience, not lazily questioning an AI.
Even current people will feel differently. I don't bemoan the fact that Lean/Mathlib has `simp` and `linarith` to automate trivial computations. A "copilot for Lean" that can turn "by induction, X" or "evidently Y" into a formal proof sounds great.
The trick is teaching the thing how high-powered the theorems it uses should be, or how much detail to factor out, depending on the user's level of understanding. We'll have to find a pedagogical balance (e.g. you don't give `linarith` to someone practicing basic proofs), but I'm sure it will be a great tool to aid human understanding.
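For concreteness, here is the sort of "trivial" goal that `simp` and `linarith` already discharge today (a minimal Lean 4 sketch, assuming Mathlib is available):

```lean
import Mathlib.Tactic

-- a routine simplification, closed by the simplifier
example (n : ℕ) : n + 0 = n := by simp

-- a linear-arithmetic fact over the reals, closed by linarith
example (a b : ℝ) (h : a < b) : a + 1 < b + 1 := by linarith
```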
A tool to help translate natural language to formal propositions/types also sounds great, and could help more people to use more formal methods, which could make for more robust software.
I think it will become apparent how bad they are at it. They're algorithms, not sentient beings. They do not think of themselves or their place in the world, and do not fathom the contents of the minds of others. They do not care what others think of them.
Whatever they write only happens to contain some truth by virtue of the model and the training data. An algorithm doesn’t know what truth is or why we value it. It’s a bullshitter of the highest calibre.
Then comes the question: will they write proofs that we will consider beautiful and elegant, that we will remember and pass down?
Or will they generate what they’ve been asked to and nothing less? That would be utterly boring to read.
If you looked at how the average accountant spent their time before the arrival of the digital spreadsheet, you might have predicted that automated calculation would make the profession obsolete. But it didn't.
This time could be different, of course. But I'll need a lot more evidence before I start telling people to base their major life decisions on projected technological change.
That's before we even consider that only a very slim minority of the people who study math (or physics or statistics or biology or literature or...) go on to work in the field of math (or physics or statistics or biology or literature or...). AI could completely take over math research and still have next to impact on the value of the skills one acquires from studying math.
Or if you want to be more fatalistic about it: if AI is going to put everyone out of work then it doesn't really matter what you do now to prepare for it. Might as well follow your interests in the meantime.
It's important to base life decisions on very real technological change. We don't know what the change will be, but it's coming. At the very least, that suggests more diverse skills.
We're all usually (but not always) better off, with more productivity, eventually, but in the meantime, jobs do disappear. Robotics did not fully displace machinists and factory workers, but single-skilled people in Detroit did not do well. The loom, the steam engine... all of them displaced often highly-trained often low-skilled artisans.
If AI reaches this level socioeconomic impact is going to be so immense, that choosing what subject you study will have no impact on your outcome - no matter what it is - so it's a pointless consideration.
Another PhD in maths here and I would say not to worry. It's the process of doing and understanding mathematics, and thinking mathematically that is ultimately important.
There's never been the equivalent of the 'bench scientist' in mathematics and there aren't many direct careers in mathematics, or pure mathematics at least - so very few people ultimately become researchers. Instead, I think you take your way of thinking and apply it to whatever else you do (and it certainly doesn't do any harm to understand various mathematical concepts incredibly well).
What LLMs can do is limited. They are superior to wetware in some tasks, like finding and matching patterns in higher-dimensional space, but they are still fundamentally limited to a tiny class of problems outside of that pattern finding and matching.
LLMs will be tools for some math needs, and even if we ever get quantum computers, they will be limited in what they can do.
LLMs, without pattern matching, can only do up to about integer division, and while they can calculate parity, they can't use it in their calculations.
There are several groups sitting on what are known limitations of LLMs, waiting to take advantage of those who don't understand the fundamental limitations, simplicity bias etc...
The hype will meet reality soon and we will figure out where they work and where they are problematic over the next few years.
But even the most celebrated achievements like proof finding with Lean, heavily depends on smart people producing hints that machines can use.
Basically lots of the fundamental hints of the limits of computation still hold.
Modal logic may be an accessible way to approach the limits of statistical inference if you want to explore one path yourself.
A lot of what is in this article relates to some the known fundamental limitations.
Remember that for all the amazing progress, one of the core founders of the perceptron, Pitts, drank himself to death in the 50s because it was shown that such models were insufficient to accurately model biological neurons.
Optimism is high, but reality will hit soon.
So think of it as new tools that will be available to your child, not a replacement.
"LLMs, without pattern matching, can only do up to about integer division, and while they can calculate parity, they can't use it in their calculations." - what do you mean by this? Counting the number of 1's in a bitstring and determining if it's even or odd?
Yes, in this case PARITY means determining whether the number of 1s in a binary input is odd or even.
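Concretely, the function in question is trivial to state (a throwaway Python sketch), which is part of why its difficulty for these models is interesting:

```python
def parity(bits: str) -> int:
    """1 if the binary string contains an odd number of 1s, else 0."""
    return bits.count("1") % 2

assert parity("1011") == 1  # three 1s -> odd
assert parity("1001") == 0  # two 1s  -> even
```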
This limitation is an effect of the (complex-to-unpack) descriptive complexity class DLOGTIME-uniform TC0, which has AND, OR, and MAJORITY gates.
http://arxiv.org/abs/2409.13629
The point being that the ability to use parity gates is different from being able to calculate parity, which is where the union of the (typically RAM-machine) DLOGTIME uniformity condition with the circuit complexity class TC0 comes into play.
PARITY, MAJ, AND, and OR are all symmetric and are in TC0, but PARITY is not in DLOGTIME-uniform TC0, which is first-order logic with majority quantifiers.
Another path: if you think about semantic properties and Rice's theorem, this may make sense, especially as PAC learning even depth-2 nets is equivalent to the approximate SVP.
PAC-learning even depth-2 threshold circuits is NP-hard.
https://www.cs.utexas.edu/~klivans/crypto-hs.pdf
For me, it's thinking about how ZFC was structured so we can keep the niceties of the law of the excluded middle, and how statistics pretty much depends on it for the central limit theorem, the law of large numbers, IID, etc...
But that path runs the risk of reliving the Brouwer–Hilbert controversy.
I doubt it.
Most likely AI will be good at some things and not others, and mathematicians will just move to whatever AI isn't good at.
Alternatively, if AI is able to do all math at a level above PhDs, then its going to be a brave new world and basically the singularity. Everything will change so much that speculating about it will probably be useless.
Let's put it this way, from another mathematician, and I'm sure I'll probably be shot for this one.
Every LLM release moves half of the remaining way to the minimum viable goal of replacing a third class undergrad. If your business or research initiative is fine with that level of competence then you will find utility.
The problem is that I don't know anyone who would find that useful. Nor does it fit within any existing working methodology we have. And on top of that, the verification of any output can take considerably longer than just doing it yourself in the first place, particularly where it goes off the rails, which it does all the time. I mean, it was 3 months ago that I was arguing with a model over it not understanding place-value systems properly, something we teach 7-year-olds here?
But the abstract problem is at a higher level. If it doesn't become a general utility for people outside of mathematics, which is very very evident at the moment by the poor overall adoption and very public criticism of the poor result quality, then the funding will dry up. Models cost lots of money to train and if you don't have customers it's not happening and no one is going to lend you the money any more. And then it's moot.
Well said. As someone with only a math undergrad and as a math RLHF’er, this speaks to my experience the most.
That craving for understanding an elegant proof is nowhere to be found when verifying an LLM's proof.
Like sure, you could put together a car by first building an airplane, disassembling all of it minus the two front seats, and having zero elegance and still get a car at the end. But if you do all that and don’t provide novelty in results or useful techniques, there’s no business.
Hell, I can’t even get a model to calculate compound interest for me (save for the technicality of prompt engineering a python function to do it). What do I expect?
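To be clear about what "the technicality" looks like: the prompt-engineered workaround is a function of roughly this shape (a minimal sketch; the numbers are placeholders):

```python
def compound(principal: float, rate: float, years: float, n: int = 12) -> float:
    """Future value with interest compounded n times per year."""
    return principal * (1 + rate / n) ** (n * years)

# e.g. $1,000 at 5% APR, compounded monthly for 10 years
print(round(compound(1000, 0.05, 10), 2))  # ~1647.01
```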
This is a great point that nobody will shoot you over :)
But the main question is still: assuming you replace an undergrad with a model, who checks the work? If you have a good process around that already, and find utility as an augmented system, then you'll get value - but I still think it's better for the undergrad to still have the job and be at the wheel, doing things faster and better by leveraging a powerful tool.
Shot already for criticising the shiny thing (happened with crypto and blockchain already...)
Well to be fair no one checks what the graduates do properly, even if we hired KPMG in. That is until we get sued. But at least we have someone to blame then. What we don't want is something for the graduate to blame. The buck stops at someone corporeal because that's what the customers want and the regulators require.
That's the reality and it's not quite as shiny and happy as the tech industry loves to promote itself.
My main point, probably cleared up with a simple point: no one gives a shit about this either way.
I used to do bench top work too; and was blessed with “the golden hands” in that I could almost always get protocols working. To me this always felt more like intuition than deductive reasoning. And it made me a terrible TA. My advice to students in lab was always something along the lines of “just mess around with it, and see how it works.” Not very helpful for the stressed and struggling student -_-
Digression aside, my point is that I don’t think we know exactly what makes or defines “the golden hands”. And if that is the case, can we optimize for it?
Another point is that scalable fine tuning only works for verifiable stuff. Think a priori knowledge. To me that seems to be at the opposite end of the spectrum from “mess with it and see what happens”.
> blessed with “the golden hands” in that I could almost always get protocols working.
Very funny. My friends and I never used the phrase "golden hands" but we used to say something similar: "so-and-so has 'great hands'".
But it meant the same thing.
I, myself, did not have great hands, but my comment was more about the intellectual process of conducting research.
I guess my point was that:
* I've already dealt with more talented researchers, but I still contributed meaningfully.
* Hopefully, the "AI" will simply add another layer of talent, but the rest of us lesser mortals will still be able to contribute.
But I don't know if I'm correct.
> I even had a Nobel Laureate once tell me that my research was simply "dotting 'i's and crossing 't's".
(。•́︿•̀。)
The mathematicians of the future will still have to figure out the right questions, even if LLMs can give them the answers. And "prompt engineering" will require mathematical skills, at the very least.
Evaluating the output of LLMs will also require mathematical skills.
But I'd go further, if your son enjoys mathematics and has some ability in the area, it's wonderful for your inner life. Anyone who becomes sufficiently interested in anything will rediscover mathematics lurking at the bottom.
What part do you think is going to become obsolete? Because Math isn't about "working out the math", it's about finding the relations between seemingly unrelated things to bust open a problem. Short of AGI, there is no amount of neural net that's going to realize that a seemingly impossible probabilistic problem is actually equivalent to a projection of an easy to work with 4D geometry. "Doing the math" is what we have computers for, and the better they get, the easier the tedious parts of the job become, but "doing math" is still very much a human game.
> What part do you think is going to become obsolete?
Thank you for the question.
I guess what I'm saying is:
Will LLMs (or whatever comes after them) be _so_ good and _so_ pervasive that we will simply be able to say, "Hey ChatGPT-9000, I'd like to see if the xyz conjecture is correct." And then ChatGPT-9000 just does the work without us contributing beyond asking a question.
Or will the technology be limited/bound in some way such that we will still be able to use ChatGPT-9000 as a tool of our own intellectual augmentation and/or we could still contribute to research even without it.
Hopefully, my comment clarifies my original post.
Also, writing this stuff has helped me think about it more. I don't have any grand insight, but the more I write, the more I lean toward the outcome that these machines will allow us to augment our research.
As amazing as they may seem, they're still just autocompletes; it's inherent to what an LLM is. So unless we come up with a completely new kind of technology, I don't see "test this conjecture for me" becoming more real than the computer-assisted proof tooling we already have.
By the way, don't trust Nobel laureates or even winners. E.g. Linus Pauling was talking absolute garbage, harmful and evil, after winning the Nobel.
> don't trust Nobel laureates or even winners
Nobel laureate and winner are the same thing.
> Linus Pauling was talking absolute garbage, harmful and evil, after winning the Nobel.
Can you be more specific, what garbage? And which Nobel prize do you mean – Pauling got two, one for chemistry and one for peace.
Thank you, my bad.
I was referring to Linus's harmful and evil promotion of Vitamin C as the cure for everything, including cancer. I don't think Linus was attaching that garbage to any particular Nobel prize. But people did say to their doctors: "Are you a Nobel winner, doctor?" I don't think they cared about the particular prize either.
> Linus's harmful and evil promotion of Vitamin C
Which is "harmful and evil" thanks to your hindsight. He had based his books on research that failed to replicate. But given the low toxicity of vitamin C, it's not that "evil" to recommend a treatment even if the probabilistic estimation of positive effects is not that high.
Sloppy, but not exceptionally bad. At least it was instrumental in teaching me to not expect marvels coming from dietary research.
Eugenics and vitamin C as a cure all.
If Pauling's eugenics policies were bad, then the laws against incest that are currently on the books in many states (which are also eugenics policies that use the same mechanism) are also bad. There are different forms of eugenics policies, and Pauling's proposal to restrict the mating choices of people carrying certain recessive genes so their children don't suffer is ethically different from Hitler exterminating people with certain genes and also ethically different from other governments sterilizing people with certain genes. He later supported voluntary abortion with genetic testing, which is now standard practice in the US today, though no longer in a few states with ethically questionable laws restricting abortion. This again is ethically different from forced abortion.
https://scarc.library.oregonstate.edu/coll/pauling/blood/nar...
FWIW my understanding is that the policies against incest you mention actually have much less to do with controlling genetic reproduction and are more directed at combating familial rape/grooming/etc.
Not a fun thing to discuss, but apparently a significant issue, which I guess should be unsurprising given some of the laws allowing underage marriage if the family signs off.
Mentioning only to draw attention to the fact that theoretical policy is often undeniable in a vacuum, but runs aground when faced with real world conditions.
From what I remember, he wanted to mark people with tattoos or something.
This is mentioned in my link: "According to Pauling, carriers should have an obvious mark, (i.e. a tattoo on the forehead) denoting their disease, which would allow carriers to identify others with the same affliction and avoid marrying them."
The goal wasn't to mark people for ostracism but to make it easier for people carrying these genes to find mates that won't result in suffering for their offspring.
Eventually we may produce a collection of problems exhaustive enough that these tools can solve almost any problem that isn't novel in practice, but I doubt that they will ever become general problem solvers capable of what we consider to be reasoning in humans.
Historically, the claim that neural nets were actual models of the human brain and human thinking was always epistemically dubious. It still is. Even as the practical problems of producing better and better algorithms, architectures, and output have been solved, there is no reason to believe a connection between the mechanical model and what happens in organisms has been established. The most important point, in my view, is that all of the representation and interpretation still has to happen outside the computational units. Without human interpreters, none of the AI outputs have any meaning. Unless you believe in determinism and an overseeing god, the story for human beings is much different. AI will not be capable of reason until, like humans, it can develop socio-rational collectivities of meaning that are independent of the human being.
Researchers seemed to have a decent grasp on this in the 90s, but today, everyone seems all too ready to make the same ridiculous leaps as the original creators of neural nets. They did not show, as they claimed, that thinking is reducible to computation. All they showed was that a neural net can realize a boolean function—which is not even logic, since, again, the entire semantic interpretive side of the logic is ignored.
> Unless you believe in determinism and an overseeing god
Or perhaps, determinism and mechanistic materialism - which in STEM-adjacent circles has a relatively prevalent adherence.
Worldviews which strip a human being of agency in the sense you invoke crop up quite a lot today in such spaces. If you start of adopting a view like this, you have a deflationary sword which can cut down most any notion that's not mechanistic in terms of mechanistic parts. "Meaning? Well that's just an emergent phenomenon of the influence of such and such causal factors in the unrolling of a deterministic physical system."
Similar for reasoning, etc.
Now obviously large swathes of people don't really subscribe to this - but it is prevalent and ties in well with utopian progress stories. If something is amenable to mechanistic dissection, possibly it's amenable to mechanistic control. And that's what our education is really good at teaching us. So such stories end up having intoxicating "hype" effects and drive fundraising, and so we get where we are.
For one, I wish people were just excited about making computers do things they couldn't do before, without needing to dress it up as something more than it is. "This model can prove a set of theorems in this format with such and such limits and efficiency"
Agreed. If someone believes the world is purely mechanistic, then it follows that a sufficiently large computing machine can model the world---like Leibniz's Ratiocinator. The intoxication may stem from the potential for predictability and control.
The irony is: why would someone want control if they don't have true choice? Unfortunately, such a question rarely pierces the intoxicated mind when this mind is preoccupied with pass the class, get an A, get a job, buy a house, raise funds, sell the product, win clients, gain status, eat right, exercise, check insta, watch the game, binge the show, post on Reddit, etc.
> Agreed. If someone believes the world is purely mechanistic, then it follows that a sufficiently large computing machine can model the world---like Leibniz's Ratiocinator.
I don’t think it does. Taking computers as an analogy… if you have a computer with 1GB memory, then you can’t simulate a computer with more than 1GB memory inside of it.
"sufficiently large machine" ... It's a thought experiment. Leibniz didn't have a computer, but he still imagined it.
> If someone believes the world is purely mechanistic, then it follows that a sufficiently large computing machine can model the world
Is this controversial in some way? The problem is that to simulate a universe you need a bigger universe -- which doesn't exist (or is certainly out of reach due to information theoretical limits)
> ---like Leibniz's Ratiocinator. The intoxication may stem from the potential for predictability and control.
I really don't understand the 'control' angle here. It seems pretty obvious that even in a purely mechanistic view of the universe, information theory forbids using the universe to simulate itself. Limited simulations, sure... but that leaves lots of gaps wherein you lose determinism (and control, whatever that means).
People wish to feel safe. One path to safety is controlling or managing the environment. Lack of sufficient control produces anxiety. But control is only possible if the environment is predictable, i.e., relatively certain knowledge that if I do X then the environment responds with Y. Humans use models for prediction. Loosely speaking, if the universe is truly mechanistic/deterministic, then the goal of modeling is to get the correct model (though notions of "goals" are problematic in determinism without real counterfactuals). However, if we can't know whether the universe is truly deterministic, then modeling is a pragmatic exercise in control (or management).
My comments are not about simulating the universe on a real machine. They're about the validity and value of math/computational modeling in a universe where determinism is scientifically indeterminable.
> Is this controversial in some way?
It’s not “controversial”, it’s just not a given that the universe is to be thought a deterministic machine. Not to everyone, at least.
Choice is overrated. This gets to an issue I've long had with Nozick's experience machine. Not only would I happily spend my days in such a machine, I'm pretty sure most other people would too. Maybe they say they wouldn't, but if you let them try it out and then offered them the question again, I think they'd say yes. The real conclusion of the experience machine is that the unknown is scary.
> there is no reason to believe a connection between the mechanical model and what happens in organisms has been established
The universal approximation theorem. And that's basically it. The rest is empirical.
No matter which physical processes happen inside the human brain, a sufficiently large neural network can approximate them. Barring unknowns like super-Turing computational processes in the brain.
The universal approximation theorem is set in a precise mathematical context; I encourage you to limit its applicability to that context despite the marketing label "universal" (which it isn't). Consider your concession about empiricism. There's no empirical way to prove (i.e. there's no experiment that can demonstrate beyond doubt) that all brain or other organic processes are deterministic and can be represented completely as functions.
Function is the most general way of describing relations. Non-deterministic processes can be represented as functions with a probability distribution codomain. Physics seems to require only continuous functions.
Sorry, but there's not much evidence that can support human exceptionalism.
I don't understand your point here. A (logical) relation is, by definition, more general than a function, and it is telling that we still struggle to use and develop truly relational models that are not univalent (i.e. functions). Only a few old logicians really took the calculus of relations proper seriously (Peirce, for one). We use functions precisely because they are less general; they are rigid and simpler to work with. I do not think anyone is working under the impression that a function is a high-fidelity means to model the world as it is experienced and actually exists. It is necessarily reductionistic (and abstract). Any truth we achieve through functional models is necessarily a general, abstracted truth, which in many ways proves to be extremely useful but in others (e.g. when an essential piece of information in the particular is not accounted for in the general reductive model) can be disastrous.
I'm not a big fan of philosophy. The epistemology you are talking about is another abstraction on top of the physical world. But the evolution of the physical world as far as we know can be described as a function of time (at least, in a weak gravitational field when energies involved are well below the grand unification energy level, that is for the objects like brains).
The brain is a physical system, so whatever it does (including philosophy) can be replicated by modelling (a (vastly) simplified version of) underlying physics.
Anyway, I am not especially interested in discussing possible impossibility of an LLM-based AGI. It might be resolved empirically soon enough.
Some differential equations that model physics admit singularities and multiple solutions. Therefore, functions are not the most general way of describing relations. Functions are a subset of relations.
Although "non-deterministic" and "stochastic" are often used interchangeably, they are not equivalent. Probability is applied analysis whose objects are distributions. Analysis is a form of deductive, i.e. mechanical, reasoning. Therefore, it's more accurate (philosophically) to identify mathematical probability with determinism. Probability is a model for our experience. That doesn't mean our experience is truly probabilistic.
Humans aren't exceptional. Math modeling and reasoning are human activities.
> Some differential equations that model physics admit singularities and multiple solutions.
And physicists regard those as unphysical: the theory breaks down, we need better one.
For example, the Euler equations model compressible flow with discontinuities (shocks in the flow field variables) and rarefaction waves. These theories are accepted and used routinely.
Great. A useful approximation of what really happens in the fluid. But I'm sure there are no shocks and rarefactions in physicists' neurons while they are thinking about it.
Switching into a less facetious mode...
Do you understand that in the context of this dialogue it's not enough to show some examples of discontinuous or otherwise unrepresentable-by-NNs functions? You need at least to give a hint as to why such functions cannot be avoided when approximating the functionality of the human brain.
Many things are possible, but I'm not going to keep my mind open to a possibility of a teal Russell's teapot before I get a hint at its existence, so to speak.
That's not useful by itself, because "anything can model anything else" doesn't put any upper bound on emulation cost, which for one small task could be larger than the total energy available in the entire Universe.
Either the brain violates the physical Church-Turing thesis or it doesn't.
If it does, well, it will take more time to incorporate those physical mechanisms into computers to get them on par with the brain.
I leave the possibility that it's "magic"[1] aside. It's just impossible to predict, because it will violate everything we know about our physical world.
[1] One example of "magic": we live in a simulation and the brain is not fully simulated by the physics engine, but creators of the simulation for some reason gave it access to computational resources that are impossible to harness using the standard physics of the simulated world. Another example: interactionistic soul.
I mean, that is why they mention super-Turing processes like quantum-based computing.
Quantum computing actually isn't super-Turing, it "just" computes some things faster. (Strictly speaking it's somewhere between a standard Turing machine and a nondeterministic Turing machine in speed, and the first can emulate the second.)
If we're nitpicking: quantum computing algorithms could (if implemented) compute certain things faster than the best classical algorithms we know. We don't know any quantum algorithms that are provably faster than all possible classical algorithms.
I'm with you. Interpreting a problem as a problem requires a human (1) to recognize the problem and (2) to convince other humans that it's a problem worth solving. Both involve value, and value has no computational or mechanistic description (other than "given" or "illusion"). Once humans have identified a problem, they might employ a tool to find the solution. The tool has no sense that the problem is important or even hard; such values are imposed by the tool's users.
It's worth considering why "everyone seems all too ready to make ... leaps ..." "Neural", "intelligence", "learning", and others are metaphors that have performed very well as marketing slogans. Behind the marketing slogans are deep-pocketed, platformed corporate and government (i.e. socio-rational collective) interests. Educational institutions (another socio-rational collective) and their leaders have on the whole postured as trainers and preparers for the "real world" (i.e. a job), which means they accept, support, and promote the corporate narratives about techno-utopia. Which institutions are left to check the narratives? Who has time to ask questions given the need to learn all the technobabble (by paying hundreds of thousands for 120 university credits) to become a competitive job candidate?
I've found there are many voices speaking against the hype---indeed, even (rightly) questioning the epistemic underpinnings of AI. But they're ignored and out-shouted by tech marketing, fundraising politicians, and engagement-driven media.
I hear these arguments a lot from law and philosophy students, never from those trained in mathematics. It seems to me "literary" people will still be discussing these theoretical hypotheticals while the people building the technology pass them by.
I straddle both worlds. Consider that using the lens of mathematical reasoning to understand everything is a bit like trying to use a single mathematical theory (eg that of groups) to comprehend mathematics as a whole. You will almost always benefit and enrich your own understanding by daring to incorporate outside perspectives.
Consider also that even as digital technology and the ratiomathematical understanding of the world have advanced, the world is still rife with dynamics and problems that require a humanistic approach. In particular, a mathematical conception cannot resolve teleological problems, which require the establishment of consensus and the actual determination of what we, as a species, want the world to look like. Climate change and general economic imbalance are already evidence of the kind of disasters that mount when you limit yourself to a reductionistic, overly mathematical and technological understanding of life and existence. Being is not a solely technical problem.
I don't disagree, I just don't think it is done well, or at least as seriously as it used to be. In modern philosophy there are many mathematically specious arguments that just make clear how large the mathematical gap has become, e.g. improper application of Gödel's incompleteness theorems. Yet Gödel was a philosopher himself, and would disagree with their current hand-wavy usage.
The 19th/20th century was a golden era of philosophy, with a coherent and rigorous mathematical lens to apply alongside other lenses: Russell, Turing, Gödel, etc. However, this just doesn't exist anymore.
While I agree that these are titans of 20th c. philosophy, particularly of the philosophy of mathematics and logic, the overarching school they belonged to (logical positivism) has been thoroughly and rightly criticized, and it is informative to read these criticisms to understand why a view of life that is overly mathematical is in many ways inadequate. Your comment still argues from a very limited perspective. There is no reason that correct application of Gödel's theorem should be any indication of the richness of someone's philosophical views unless you are already a staunchly committed reductionist who values mathematical arguments above all else (why? can maths help you explain and understand the phenomenon of love in a way that will actually help you experience love? this is just one example domain where it does not make much sense), or unless they are specifically attempting a philosophy of mathematics.

The question of whether or not we can effectively model cognition and human mental function using mathematical models is not a question of mathematical philosophy, but rather one of epistemology. If you really want to hear a spurious argument, read McCulloch and Pitts. They essentially present an argument with two premises: the brain is finite, and we can create a machine of formal "neurons" (which are not even complete models of real neurons) that computes a boolean function. They then conclude that they must have a model of cognition, that cognition must be nothing more than computation, and that the brain must basically be a Turing machine.
The relevance of mathematics to the cognitive problem must be decided outside of mathematics. As another poster said, even if you buy the theorems, it is still an empirical question whether or not they really model what they claim to model, and whether or not that model is of a fidelity that we find acceptable for a definition of general intelligence. Often, people reach claims of adequacy today not by producing really fantastic models but by lowering the bar enormously. They claim that these models approximate humans by severely reducing the idea of what it means to be an intelligent human to the specific talents their tech happens to excel at (e.g. apparently being a language parrot is all that intelligence is, ignoring all the very nuanced views and definitions of intelligence we have come up with over the course of history). A machine that is not embodied in a skeletal structure and cannot even experience, let alone solve, the vast number of physical, anatomical problems we contend with on a daily basis is, in my view, still very far from anything I would call general intelligence.
Can you define what you mean by novel here?
I don't have much to opine from an advanced maths perspective, but I'd like to point out a couple examples of where ChatGPT made basic errors in questions I asked it as an undergrad CS student.
1. I asked it to show me the derivation of a formula for the efficiency of Stop-and-Wait ARQ and it seemed to do it, but a day later, I realised that in one of the steps, it just made a term vanish to get to the next step. Obviously, I should have verified more carefully, but when I asked it to spot the mistake in that step, it did the same thing twice more with bs explanations of how the term is absorbed.
2. I asked it to provide me syllogisms that I could practice proving. An overwhelming number of the syllogisms it gave me were inconsistent and did not hold. This surprised me more because syllogisms are about the most structured arguments you can find, having been formalized centuries ago and discussed extensively since then. In this case, asking it to walk step-by-step actually fixed the issue.
Both of these were done on the free plan of ChatGPT, but I can't remember if it was 4o or 4.
The first question is always: which model? Which fortunately you at least addressed: > free plan of ChatGPT, but I can't remember if it was 4o or 4.
Since ChatGPT-4o, there has been o1-preview, and o1 (full) is out. They just announced that o3 got 25% on FrontierMath, which is what this article is a reaction to. So any tests on 4o are at least two (or three) releases behind the current capabilities.
I didn't see anyone else ask this but.. isn't the FrontierMath dataset compromised now? At the very least OpenAI now knows the questions if not the answers. I would expect that the next iteration will "magically" get over 80% on the FrontierMath test. I imagine that experiment was pretty closely monitored.
I figured their model was independently evaluated against the questions/answers. That's not to say it's not compromised by "Here's a bag of money" type methods, but I don't even think it'd be a reasonable test if they just handed over the dataset.
I'm sure it was independently evaluated, but I'm sure the folks running the test were not given an on-prem installation of ChatGPT to mess with. It was still done via API calls, presumably through the chat interface UI.
That means the questions went over the fence to OpenAI.
I'm quite certain they are aware of that, and it would be pretty foolish not to take advantage of at least knowing what the questions are.
Depending on the plan the researchers used they may have contractual protections against OpenAI training on their inputs.
Sure, but given the resourcing at OpenAI, it would not be hard to clean[1] the inputs. I'm just trying to be realistic here, there are plenty of ways around contractual obligations and a significant incentive to do so.
[1]: https://en.wikipedia.org/wiki/Clean-room_design
Now that you put it that way, it is laughably easy.
This was my first thought when I saw the results:
https://news.ycombinator.com/item?id=42473470
Insightful comment. The thing that's extremely frustrating is looking at all the energy poured into this conversation around benchmarks. There is a fundamental assumption of honesty and integrity in the benchmarking process by at least some people. But when the dataset is compromised and generation N+1 has miraculous performance gains, how can we see this as anything other than a ploy to pump up valuations? Some people have millions of dollars at stake here, and they don't care about the naysayers in the peanut gallery like us.
It's sadly inevitable that when billions in funding and industry hype are tied to performance on a handful of benchmarks, scores will somehow, magically, continue to go up.
Needless to say, it doesn't bring us any closer to AGI.
The only solution I see here is people crafting their own, private benchmarks that the big players don't care about enough to train on. That, at least, gives you a clearer view of the field.
Not sure why your comment was downvoted, but it certainly shows the pressure going against people who point out fundamental flaws. This is pushing us towards "AVI" rather than AGI-- "Artificially Valued Intelligence". The optimization function here is around the market.
I'm being completely serious. You are correct, despite the downvotes, that this could not be pushing us towards AGI because if the dataset is leaked you can't claim the G-- generalizability.
The point of the benchmark is to lead us to believe that this is a substantial breakthrough. But a reasonable person would be forced to conclude that the results are misleading due to optimizing around the training data.
It's fascinating that this has run into the exact same problem as quantum research. I.e., in quantum research, to demonstrate any valuable forward progress you must compute something that is impossible to do with a traditional computer. If you can't do it with a traditional computer, it suddenly becomes difficult to verify correctness (i.e., you can't just check that it matches the traditional computer's answer).
In the same way, ChatGPT scores 25% on this and the question is "How close were those 25% to questions in the training set?" Or to put it another way, we want to answer the question "Is ChatGPT getting better at applying its reasoning to out-of-set problems, or is it pulling more data into its training set?" Or "Is the test leaking into the training?"
Maybe the whole question is academic and it doesn't matter, we solve the entire problem by pulling all human knowledge into the training set and that's a massive benefit. But maybe it implies a limit to how far it can push human knowledge forward.
>in the quantum research to demonstrate any valuable forward progress you must compute something that is impossible to do with a traditional computer
This is factually wrong. The most interesting problems motivating the quantum computing research are hard to solve, but easy to verify on classical computers. The factorization problem is the most classical example.
The problem is that existing quantum computers are not powerful enough to solve the interesting problems, so researchers have to invent semi-artificial problems to demonstrate "quantum advantage" to keep the funding flowing.
There is a plethora of opportunities for LLMs to show their worth. For example, finding interesting links between different areas of research or being a proof assistant in a math/programming formal verification system. There is a lot of ongoing work in this area, but at the moment signal-to-noise ratio of such tools is too low for them to be practical.
No, it is factually right, at least if Scott Aaronson is to be believed:
> Having said that, the biggest caveat to the “10^25 years” result is one to which I fear Google drew insufficient attention. Namely, for the exact same reason why (as far as anyone knows) this quantum computation would take ~10^25 years for a classical computer to simulate, it would also take ~10^25 years for a classical computer to directly verify the quantum computer’s results!! (For example, by computing the “Linear Cross-Entropy” score of the outputs.) For this reason, all validation of Google’s new supremacy experiment is indirect, based on extrapolations from smaller circuits, ones for which a classical computer can feasibly check the results. To be clear, I personally see no reason to doubt those extrapolations. But for anyone who wonders why I’ve been obsessing for years about the need to design efficiently verifiable near-term quantum supremacy experiments: well, this is why! We’re now deeply into the unverifiable regime that I warned about.
https://scottaaronson.blog/?p=8525
It's a property of the "semi-artificial" problem chosen by Google. If anything, it means that we should heavily discount this claim of "quantum advantage", especially in the light of inherent probabilistic nature of quantum computations.
Note that the OP wrote "you MUST compute something that is impossible to do with a traditional computer". I demonstrated a simple counter-example to this statement: you CAN demonstrate forward progress by factorizing big numbers, but the problem is that no one can do it despite billions of investments.
Apparently they can't, right now, as you admit. Anyway this is turning into a stupid semantic argument, have a nice day.
If they can't, then is it really quantum supremacy?
They claimed it last time in 2019 with Sycamore, which could perform in 200 seconds a calculation that Google claimed would take a classical supercomputer 10,000 years.
That was debunked when a team of scientists replicated the same thing on an ordinary computer in 15 hours with a large number of GPUs. Scott Aaronson said that on a supercomputer, the same technique would have solved the problem in seconds.[1]
So if they now come up with another problem which they say cannot even be verified by a classical computer and uses it to claim quantum advantage, then it is right to be suspicious of that claim.
1. https://www.science.org/content/article/ordinary-computers-c...
> If they can't, then is it really quantum supremacy?
Yes, quantum supremacy on an artificial problem is quantum supremacy (even if it's "this quantum computer can simulate itself faster than a classical computer"). Quantum supremacy on problems that are easy to verify would of course be nicer, but unfortunately not all problems happen to have an easy verification.
That applies specifically to this artificial problem Google created to be hard for classical computers, and in fact in the end it turned out it was not so hard after all. IBM came up with a method to do what Google said would take 10,000 years on a classical computer in just 2 days. I would not be surprised if a similar reduction happened to their second attempt if anyone was motivated enough to look at it.
In general we have thousands of optimisation problems that are hard to solve but immediate to verify.
the unverifiable regime is a great way to extract funding.
> This is factually wrong.
What's factually wrong about it? OP said "you must compute something that is impossible to do with a traditional computer" which is true, regardless of the output produced. Verifying an output is very different from verifying the proper execution of a program. The difference between testing a program and seeing its code.
What is being computed is fundamentally different from classical computers, therefore the verification methods of proper adherence to instructions becomes increasingly complex.
They left out the key part which was incorrect and the sentence right after "If you can't do it with a traditional computer, it suddenly becomes difficult to verify correctness"
The point stands that for actually interesting problems verifying correctness of the results is trivial. I don't know if "adherence to instructions" translates at all to quantum computing.
> This is factually wrong. The most interesting problems motivating the quantum computing research are hard to solve, but easy to verify on classical computers.
Your parent did not talk about quantum computers. I guess they rather had predictions of novel quantum-field theories or theories of quantum gravity in the back of their mind.
Then his comment makes even less sense.
I agree that "is the test dataset leaking into the training dataset?" is an issue when interpreting LLM capabilities in novel contexts, but I'm not sure I follow what you mean on the quantum computing front.
My understanding is that many problems have solutions that are easier to verify than to solve using classical computing. e.g. prime factorization
Oh, it's a totally different issue on the quantum side that leads to the same difficulty with verification. There, the algorithms that Google, for example, is using today aren't like prime factorization: they're not easy to directly verify with traditional computers. So, as far as I'm aware, they check the result for a suitably small run, and then do the performance metrics on a large run that they hope gave a correct answer but aren't able to directly verify.
How much of this could be resolved if its training set were reduced? Conceivably, most of the training serves only to confuse the model when only aiming to solve a math equation.
If constrained by existing human knowledge to come up with an answer, won’t it fundamentally be unable to push human knowledge forward?
Depends on your understanding of human knowledge I guess? People talk about the frontier of human knowledge and if your view of knowledge is like that of a unique human genius pushing forward the frontier then yes - it'd be stuck. But if you think of knowledge as more complex than that you could have areas that are kind of within our frontier of knowledge (that we could reasonably know, but don't actually know) - taking concepts that we already know in one field and applying them to some other field. Today the reason that doesn't happen is because genius A in physics doesn't know about the existence of genius B in mathematics (let alone understand their research), but if it's all imbibed by "The Model" then it's trivial to make that discovery.
I was referring specifically to the parent comments statements around current AI systems.
Reasoning is essentially the creation of new knowledge from existing knowledge. The better the model can reason the less constrained it is to existing knowledge.
The challenge is how to figure out if a model is genuinely reasoning
Reasoning is a very minor (but essential) part of knowledge creation.
Knowledge creation comes from collecting data from the real world, and cleaning it up somehow, and brainstorming creative models to explain it.
NN/LLM's version of model building is frustrating because it is quite good, but not highly "explainable". Human models have higher explainability, while machine models have high predictive value on test examples due to an impenetrable mountain of algebra.
There are likely lots of connections that could be made that no individual has made because no individual has all of existing human knowledge at their immediate disposal.
I don't think many expect AI to push knowledge forward? A thing that basically just regurgitates consensus historic knowledge seems badly suited to that
But apparently these new frontier models can 'reason' - so with that logic, they should be able to generate new knowledge?
O1 was able to find the math problem in a recently published paper, so yes.
Then much of human research and development is also fundamentally impossible.
Only if you think current "AI" is on the same level as human creativity and intelligence, which it clearly is not.
I think current "AI" (i.e. LLMs) is unable to push human knowledge forward, but not because it's constrained by existing human knowledge. It's more like peeking into a very large magic-8 ball, new answers everytime you shake it. Some useful.
It may be able to push human knowledge forward to an extent.
In the past, there was quite a bit of low hanging fruit such that you could have polymaths able to contribute to a wide variety of fields, such as Newton.
But in the past 100 years or so, the problem is there is so much known, it is impossible for any single person to have deep knowledge of everything. e.g. its rare to find a really good mathematician who also has a deep knowledge (beyond intro courses) about say, chemistry.
Would a sufficiently powerful AI / ML model be able to come up with this synthesis across fields?
That's not a strong reason. Yes, that means ChatGPT isn't good at wholly independently pushing knowledge forward, but a good brainstormer that is right even 10% of the time is an incredible fount of knowledge.
No comment on the article it's just always interesting to get hit with intense jargon from a field I know very little about.
I understood the statements of all five questions. I could do the third one relatively quickly (I had seen the trick before that the function mapping a natural n to alpha^n is p-adically continuous in n iff the p-adic valuation of alpha-1 is positive).
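For reference, here is my paraphrase of that trick in symbols (the exact wording of the competition problem is different):

```latex
% Paraphrase of the trick mentioned above, for a fixed prime p:
% the map n -> alpha^n is p-adically continuous in n
% precisely when alpha is congruent to 1 modulo p.
\[
  n \longmapsto \alpha^{n} \ \text{is $p$-adically continuous in } n
  \quad\Longleftrightarrow\quad
  v_p(\alpha - 1) > 0 .
\]
```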
> I am dreading the inevitable onslaught in a year or two of language model “proofs” of the Riemann hypothesis which will just contain claims which are vague or inaccurate in the middle of 10 pages of correct mathematics which the human will have to wade through to find the line which doesn’t hold up.
I wonder what the response of working mathematicians will be to this. If the proofs look credible it might be too tempting to try and validate them, but if there's a deluge that could be a huge time sink. Imagine if Wiles or Perelman had produced a thousand different proofs for their respective problems.
Maybe the coming onslaught of AI slop "proofs" will give a little bump to proof assistants like Coq. Of course, it would still take a human mathematician some time to verify theorem definitions.
Don't waste time on looking at it unless a formal proof checker can verify it.
Honestly I think it won’t be that different from today, where there is no shortage of cranks producing “proofs” of the Riemann Hypothesis and submitting them to prestigious journals.
> As an academic mathematician who spent their entire life collaborating openly on research problems and sharing my ideas with other people, it frustrates me [that] I am not even [able] to give you a coherent description of some basic facts about this dataset, for example, its size. However there is a good reason for the secrecy. Language models train on large databases of knowledge, so [the] moment you make a database of maths questions public, the language models will train on it.
Well, yes and no. This is only true because we are talking about closed models from closed companies like so-called "OpenAI".
But if all models were truly open, then we could simply verify what they had been trained on, and make experiments with models that we could be sure had never seen the dataset.
Decades ago Microsoft (in the words of Ballmer and Gates) famously accused open source of being a "cancer" because of the cascading nature of the GPL.
But it's the opposite. In software, and in knowledge in general, the true disease is secrecy.
> But if all models were truly open, then we could simply verify what they had been trained on
How do you verify what a particular open model was trained on if you haven’t trained it yourself? Typically, for open models, you only get the architecture and the trained weights. How can you reliably verify what the model was trained on from this?
Even if they provide the training set (which is not typically the case), you still have to take their word for it—that’s not really "verification."
> Even if they provide the training set (which is not typically the case), you still have to take their word for it—that’s not really "verification."
If they've done it right, you can re-run the training and get the same weights. And maybe you could spot-check parts of it without running the full training (e.g. if there are glitch tokens in the weights, you'd look for where they came from in the training data, and if they weren't there at all that would be a red flag). Is it possible to release the wrong training set (or the wrong instructions) and hope you don't get caught? Sure, but demanding that it be published and available to check raises the bar and makes it much more risky to cheat.
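As a toy illustration of the "re-run and compare" idea (the filenames and the published checksum here are placeholders, not any real release):

```python
# A minimal sketch: hash a retrained weights file and compare it against a
# checksum the lab claims to have published. This only works if training is
# made bit-for-bit deterministic, which is itself a nontrivial assumption.
import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

published_checksum = "..."  # placeholder for the lab's published value
reproduced_checksum = sha256_of("my_retrained_weights.bin")  # placeholder path
print("match" if reproduced_checksum == published_checksum
      else "mismatch: the release is not reproducible as claimed")
```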
If they provide the training set it's reproducible and therefore verifiable.
If not, it's not really "open", it's bs-open.
The OP said "truly open" not "open model" or any of the other BS out there. If you are truly open you share the training corpora as well or at least a comprehensive description of what it is and where to get it.
It seems like you skipped the second paragraph of my comment?
Because it is mostly hogwash.
Lots of AI researchers have shown that you can both credit and discredit "open models" when you are given a dataset and training steps.
Many lauded papers drew Reddit ML or Twitter ire when people couldn't reproduce the model or results.
If you are given the training set, the weights, the steps required, and enough compute, you can do it.
Having enough compute and people releasing the steps is the main impediment.
For my research I always release all of my code, and the order of execution steps, and of course the training set. I also give confidence intervals based on my runs so people can reproduce and see if we get similar intervals.
After playing with and using AI for almost two years now, it is not getting better from either a cost or a performance perspective.
The higher the cost, the better the performance; and while models and hardware can be improved, the curve is still steep.
The big question is: what are people using it for? Well, they are using lightweight, simplistic models to do targeted tasks. To do many smaller and easier-to-process tasks.
Most of the news on AI is just there to promote a product to earn more cash.
I am fairly optimistic about LLMs as a human math -> theorem-prover translator, and as a fan of Idris I am glad that the AI community is investing in Lean. As the author shows, the answer to "Can AI be useful for automated mathematical work?" is clearly "yes."
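As a tiny illustration of the kind of target such a translator would produce (my own toy example, not from the article), a Lean 4 statement plus proof:

```lean
-- A toy "human math → Lean" translation target: the statement that natural
-- number addition is commutative, discharged by the library lemma.
theorem add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```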
But I am confident the answer to the question in the headline is "no, not for several decades." It's not just the underwhelming benchmark results discussed in the post, or the general concern about hard undergraduate math using different skillsets than ordinary research math. IMO the deeper problem still seems to be a basic gap where LLMs can seemingly do formal math at the level of a smart graduate student but fail at quantitative/geometric reasoning problems designed for fish. I suspect this holds for O3, based on one of the ARC problems it wasn't able to solve: https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr... (via https://www.interconnects.ai/p/openais-o3-the-2024-finale-of...) ANNs are simply not able to form abstractions, they can only imitate them via enormous amounts of data and compute. I would say there has been zero progress on "common sense" math in computers since the invention of Lisp: we are still faking it with expert systems, even if LLM expert systems are easier to build at scale with raw data.
It is the same old problem where an ANN can attain superhuman performance on level 1 of Breakout, but it has to be retrained for level 2. I am not convinced it makes sense to say AI can do math if AI doesn't understand what "four" means with the same depth as a rat, even if it can solve sophisticated modular arithmetic problems. In human terms, does it make sense to say a straightedge-and-compass AI understands Euclidean geometry if it's not capable of understanding the physical intuition behind Euclid's axioms? It makes more sense to say it's a brainless tool that helps with the tedium and drudgery of actually proving things in mathematics.
To give a sense of scale: it's not that o3 failed to solve that red/blue rectangle problem once; o3 spent thousands of GPU hours putting out text about that problem, creating by my math about a million pages of text, and did not find the answer anywhere in those pages. For other problems it did find the answer around the million-page mark, as at the ~$3000-per-problem spend setting the score was still slowly creeping up.
If the trajectory of the past two years is any guide, things that can be done at great compute expense now will rapidly become possible for a fraction of the cost.
The trajectory is not a guide, unless you count the recent plateauing.
it can take my math and point out a step I missed and then show me the correct procedure but still get the wrong result because it can't reliably multiply 2-digit numbers
It's a "language" model (LLM), not a "math" model. When it is generating your answer, predicting and outputting word after word, it is _not_ multiplying your numbers internally.
Yes, I know. It's just kind of interesting how it can make inferences about complicated things but not get multiplications correct that would almost definitely have been in its training set many times (two digit by two digit)
Better than an average human then.
Different than an average human.
Just a comment: the example o1 got wrong was actually underspecified: https://anokas.substack.com/p/o3-and-arc-agi-the-unsolved-ta...
Which is actually a problem I have with ARC (and IQ tests more generally): it is computationally cheaper to go from ARC transformation rule -> ARC problem than it is the other way around. But this means it’s pretty easy to generate ARC problems with non-unique solutions.
One thing I know is that there won't be machines entering IMO 2025. The concept of a "marker" does not exist in the IMO - scores are decided by negotiations between the team leaders of each country and the juries. It is important to get each team leader involved in grading the work of students from their country, for accountability as well as acknowledging cultural differences. And those hundreds of people are not going to stay longer to grade AI work.
So here's what I'm perplexed about. There are statements in Presburger arithmetic that take time doubly exponential (or worse) in the size of the statement to reach via any path of the formal system whatsoever. These are arithmetic truths about the natural numbers. Can these statements be reached faster in ZFC? Possibly—it's well-known that there exist shorter proofs of true statements in more powerful consistent systems.
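(For reference, the classical lower bound I have in mind here is Fischer-Rabin: there is a constant c > 0 such that any decision procedure for Presburger arithmetic needs at least doubly exponential time on some statements.)

```latex
% Fischer–Rabin (1974): for some constant c > 0, every decision procedure A
% for Presburger arithmetic takes time at least 2^(2^(c n)) on infinitely
% many statements phi of length n = |phi|.
\[
  \exists\, c > 0 \;:\; \forall A \;\;
  \exists \text{ infinitely many } \varphi \;:\quad
  \mathrm{time}_A(\varphi) \;\ge\; 2^{\,2^{\,c\,|\varphi|}} .
\]
```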
But the problem then is that one can suppose there are also true short statements in ZFC which likewise require doubly exponential time to reach via any path. Presburger Arithmetic is decidable whereas ZFC is not, so these statements would require the additional axioms of ZFC for shorter proofs, but I think it's safe to assume such statements exist.
Now let's suppose an AI model can resolve the truth of these short statements quickly. That means one of three things:
1) The AI model can discover doubly exponential length proof paths within the framework of ZFC.
2) There are certain short statements in the formal language of ZFC that the AI model cannot discover the truth of.
3) The AI model operates outside of ZFC to find the truth of statements in the framework of some other, potentially unknown formal system (and for arithmetical statements, the system must necessarily be sound).
How likely are each of these outcomes?
1) is not possible within any coherent, human-scale timeframe.
2) IMO is the most likely outcome, but then this means there are some really interesting things in mathematics that AI cannot discover. Perhaps the same set of things that humans find interesting. Once we have exhausted the theorems with short proofs in ZFC, there will still be an infinite number of short and interesting statements that we cannot resolve.
3) This would be the most bizarre outcome of all. If AI operates in a consistent way outside the framework of ZFC, then that would be equivalent to solving the halting problem for certain (infinite) sets of Turing machine configurations that ZFC cannot solve. That in itself isn't too strange (e.g., it might turn out that ZFC lacks an axiom necessary to prove something as simple as the Collatz conjecture), but what would be strange is that it could find these new formal systems efficiently. In other words, it would have discovered an algorithmic way to procure new axioms that lead to efficient proofs of true arithmetic statements. One could also view that as an efficient algorithm for computing BB(n), which obviously we think isn't possible. See Levin's papers on the feasibility of extending PA in a way that leads to quickly discovering more of the halting sequence.
> and for arithmetical statements, the system must necessarily be sound
Why do you say this? The AI doesn't know or care about soundness. Probably it has mathematical intuition that makes unsound assumptions, like human mathematicians do.
> How likely are each of these outcomes?
I think they'll all be true to a certain extent, just as they are for human mathematicians. There will probably be certain classes of extremely long proofs that the AI has no trouble discovering (because they have some kind of structure, just not structure that can be expressed in ZFC), certain truths that the AI makes an intuitive leap to despite not being able to prove them in ZFC (just as human mathematicians do), and certain short statements that the AI cannot prove one way or another (like Goldbach or twin primes or what have you, again, just as human mathematicians can't).
ZFC is way worse than Presburger arithmetic -- since it is undecidable, we know that the length of the minimal proof of a statement cannot be bounded by a computable function of the length of the statement.
This has little to do with the usefulness of LLMs for research-level mathematics though. I do not think that anyone is hoping to get a decision procedure out of it, but rather something that would imitate human reasoning, which is heavily based on analogies ("we want to solve this problem, which shares some similarities with that other solved problem, can we apply the same proof strategy? if not, can we generalise the strategy so that it becomes applicable?").
2 is definitely true. 3 is much more interesting and likely true but even saying it takes us into deep philosophical waters.
If every true theorem had a proof of computably bounded length, the halting problem would be solvable. So the AI can't find some of those proofs.
The reason I say 3 is deep is that ultimately our foundational reasons to assume ZFC plus the bits we need for logic come from philosophical groundings, and not everyone accepts the same ones. Ultrafinitists and large cardinal theorists are both kinds of people I've met.
My understanding is that no model-dependent theorem of ZFC or its extensions (e.g., ZFC+CH, ZFC+¬CH) provides any insight into the behavior of Turing machines. If our goal is to invent an algorithm that finds better algorithms, then the philosophical angle is irrelevant. For computational purposes, we would only care about new axioms independent of ZFC if they allow us to prove additional Turing machine configurations as non-halting.
> There are statements in Presburger arithmetic that take time doubly exponential (or worse) in the size of the statement to reach via any path of the formal system whatsoever.
This is a correct statement about the worst case runtime. What is interesting for practical applications is whether such statements are among those that you are practically interested in.
I would certainly think so. The statements mathematicians seem to be interested in tend to be at a "higher level" than simple but true statements like 2+3=5. And they necessarily have a short description in the formal language of ZFC, otherwise we couldn't write them down (e.g., Fermat's last theorem).
If the truth of these higher level statements instantly unlocks many other truths, then it makes sense to think of them in the same way that knowing BB(5) allows one to instantly classify any Turing machine configuration on the computation graph of all n ≤ 5 state Turing machines (on empty tape input) as halting/non-halting.
I think this is a silly question; you could find AIs doing very simple maths back in the 1960s-1970s.
It's just the worrisome linguistic confusion between AI and LLMs.
Every profession seems to have a pessimistic view of AI as soon as it starts to make progress in their domain. Denial, Anger, Bargaining, Depression, and Acceptance. Artists seem to be in the depression state, many programmers are still in the denial phase. Pretty solid denial here from a mathematician. o3 was a proof of concept, like every other domain AI enters, it's going to keep getting better.
Society is CLEARLY not ready for what AI's impact is going to be. We've been through change before, but never at this scale and speed. I think Musk/Vivek's DOGE thing is important; our government has gotten quite large and bureaucratic. But the clock has started on AI, and this is a social structural issue we've got to figure out. Putting it off means we probably become subjects to a default set of rulers, if not the shoggoth itself.
Or is it just white collar workers experiencing what blue collar workers have been experiencing for decades?
So will that make society shift to the left, in demand of stronger safety nets, or to the right, in search of a strongman to rescue them?
Depends on the individual, do they think “look after us” or do they think “look after ME”?
The reason why this is so disruptive is that it will affect hundreds of fields simultaneously.
Previously workers in a field disrupted by automation would retrain to a different part of the economy.
If AI pans out to the point that there are mass layoffs in hundreds of sectors of the economy at once, then I'm not sure the process we have haphazardly set up now will work. People will have no idea where to go beyond manual labor. (But this will be difficult due to the obesity crisis - though maybe it will save lives in a weird way.)
Well it hasn’t happened yet at least (unemployment is near historic lows). How much better does AI need to get? And do we actually expect it to happen? Improving on random benchmarks is not necessarily evidence of being able to do a specific job.
If there are 'mass layoffs in hundreds of sectors of the economy at once', then the economy immediately goes into Great Depression 2.0 or worse. Consumer spending is two-thirds of the US economy, when everyone loses their jobs and stops having disposable income that's literally what a depression is
This will create a prisoner’s dilemma for corporations then, the government will have to step in to provide incentives for insanely profitable corporations to keep the proper number of people employed or limit the rate of layoffs.
I think it's a little of both. Maybe generative AI algorithms won't overcome their initial limitations. But maybe we don't need to overcome them to transform society in a very significant way.
It's because we then go check it out, and see how useless it is when applied to the domain.
> programmers are still in the denial phase
I am doing a startup and would jump on any way to make the development or process more efficient. But the only thing LLMs are really good for are investor pitches.
My favourite moments of being a graduate student in math was showing my friends (and sometimes professors) proofs of propositions and theorems that we discussed together. To be the first to put together a coherent piece of reasoning that would convince them of the truth was immensely exciting. Those were great bonding moments amongst colleagues. The very fact that we needed each other to figure out the basics of the subject was part of what made the journey so great.
Now, all of that will be done by AI.
Reminds of the time when I finally enabled invincibility in Goldeneye 007. Rather boring.
I think we've stopped appreciating the human struggle and experience and have placed all the value on the end product, and that's why we're developing AI so much.
Yeah, there is the possibility of working with an AI but at that point, what is the point? Seems rather pointless to me in an art like mathematics.
> Now, all of that will be done by AI.
No "AI" of any description is doing novel proofs at the moment. Not o3, or anything else.
LLMs are good for chatting about basic intuition, up to and including complex subjects, if and only if there is publicly available data on the topic which has been fed to the LLM during its training. They're good at doing summaries and overviews of specific things (if you push them around and insist they don't waffle, ignore garbage carefully, keep your critical thinking hat on, etc. etc.).
It's like having a magnifying glass that focuses in on the small little maths question you might have, without you having to sift through ten blogs or videos or whatever.
That's hardly going to replace graduate students doing proofs with professors, though, at least not with the methods being employed thus far!
I am talking about in 20-30 years.
I was not refuted sufficiently a couple of years ago. I claimed "training is an open boundary" etc.
Like a few years ago, I'll just boringly add again "you need modeling" to close it.
Who is the author?
Kevin Buzzard
At this stage I assume everything having a sequential pattern can and will be automated by LLM AIs.
I think that’s provably incorrect for the current approach to LLMs. They all have a horizon over which they correlate tokens in the input stream.
So, for any LLM, if you intersperse more than that number of ‘X’ tokens between each useful token, they won’t be able to do anything resembling intelligence.
The current LLMs are a bit like n-gram databases that do not use letters, but larger units.
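A rough sketch of the stress test I have in mind (the word counts and the question are arbitrary; the actual model call is omitted):

```python
# Dilute a short question with filler tokens until no fixed-size attention
# window can see two useful tokens together. Feeding `prompt` to a
# fixed-context LLM is the part left out here (an assumption of this sketch).
question = "What is the capital of France?"
filler = " " + "X " * 50_000          # tens of thousands of junk tokens per gap
prompt = filler.join(question.split(" "))
print(prompt.count("X"), "filler tokens around", len(question.split()), "useful words")
```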
It’s that a bit of an unfair sabotage?
Naturally, humans couldn’t do it, even though they could edit the input to remove the X’s, but shouldn’t we evaluate the ability (even intelligent ability) of LLM’s on what they can generally do rather than amplify their weakness?
Why is that unfair in reply to the claim “At this stage I assume everything having a sequencial pattern can and will be automated by LLM AIs.”?
I am not claiming LLMs aren't or cannot be intelligent, not even that they cannot do magical things; I just rebutted a statement about the lack of limits of LLMs.
> Naturally, humans couldn’t do it, even though they could edit the input to remove the X’s
So, what are you claiming: that they cannot or that they can? I think most people can and many would. Confronted with a file containing millions of X’s, many humans will wonder whether there’s something else than X’s in the file, do a ‘replace all’, discover the question hidden in that sea of X’s, and answer it.
There even are simple files where most humans would easily spot things without having to think of removing those X's. Consider a file
with a million X's on the end of each line. Spotting the question in that is easy for humans, but impossible for the current bunch of LLMs.

This is only easy because the software does line wrapping for you, mechanistically transforming the hard pattern of millions of symbols into another that happens to be easy for your visual system to match. Do the same for any visually capable model and it will get that easily too. Conversely, make that a single line (like the one the transformer sees) and you will struggle much more than the transformer, because you'll have to scan millions of symbols sequentially looking for patterns.
Humans have weak attention compared to it, this is a poor example.
If you have a million Xs on the end of each line, when a human is looking at that file, he's not looking at the entirety of it, but only at the part that is actually visible on-screen, so the equivalent task for an LLM would be to feed it the same subset as input. In which case they can all answer this question just fine.
The follow-up question is "Does it require a paradigm shift to solve it?". And the answer could be "No". Episodic memory, hierarchical learnable tokenization, online learning or whatever works well on GPUs.
At this stage I hope everything that needs to be reliable won't be automated by LLM AIs.
> FrontierMath is a secret dataset of “hundreds” of hard maths questions, curated by Epoch AI, and announced last month.
The database stopped being secret when it was fed to proprietary LLMs running in the cloud. If anyone is not thinking that OpenAI has trained and tuned O3 on the "secret" problems people fed to GPT-4o, I have a bridge to sell you.
It's perfectly possible for OpenAI to run the model (or provide others the means to run it) without storing queries/outputs for future use. I expect Epoch AI would insist on this. Perhaps OpenAI would lie about it, but that would open them up to serious charges.
This level of conspiracy thinking requires evidence to be useful.
Edit: I do see from your profile that you are a real person though, so I say this with more respect.
What evidence do we need that AI companies are exploiting every bit of information they can use to get ahead in the benchmarks to generate more hype? Ignoring terms/agreements, violating copyright, and otherwise exploiting information for personal gain is the foundation of that entire industry for crying out loud.
Some people are also forgetting who is the CEO of OpenAI.
Sam Altman has long talked about believing in the "move fast and break things" way of doing business. Which is just a nicer way of saying do whatever dodgy things you can get away with.
> How much longer this will go on for nobody knows, but there are lots of people pouring lots of money into this game so it would be a fool who bets on progress slowing down any time soon.
Money cannot solve the issues faced by the industry which mainly revolves around lack of training data.
They already used the entirety of the internet, all available video, audio and books and they are now dealing with the fact that most content online is now generated by these models, thus making it useless as training data.
Considering that they have Terence Tao himself working on the problem, betting against it would be unwise.
How to train an AI strapped to a formal solver.
In other news, we've discovered life (our bacteria) on Mars. Just joking.
When did we decide that AI == LLM? Oh don't answer. I know, The VC world noticed CNNs and LLMs about 10 years ago and it's the only thing anyone's talked about ever since.
Seems to me the answer to 'Can AI do maths yet?' depends on what you call AI and what you call maths. Our old departmental VAX running at a handful of megahertz could do some very clever symbol manipulation on binomials, and if you gave it a few seconds, it could even do something like theorem proving via proto-Prolog. Neither is anywhere close to the glorious AGI future we hope to sell to industry and government, but it seems worth considering how they're different, why they worked, and whether there's room for some hybrid approach. Do LLMs need to know how to do math if they know how to write Prolog or Coq statements that can do interesting things?
I've heard people say they want to build software that emulates (simulates?) how humans do arithmetic, but ask a human to add anything bigger than two digit numbers and the first thing they do is reach for a calculator.
No it can't, and there's no such thing as AI. How is a thing that predicts the next-most-likely word going to do novel math? It can't even do existing math reliably because logical operations and statistical approximation are fundamentally different. It is fun watching grifters put lipstick on this thing and shop it around as a magic pig though.
OpenAI and Epoch AI (FrontierMath) are startups with a strong incentive to push such narratives. The real test will be in actual adoption in real-world use cases.
The management class has a strong incentive to believe in this narrative, since it helps them reduce labor cost, so they are investing in it.
Eventually, the emperor will be seen to have no clothes, at least in some use cases for which it is being peddled right now.
Epoch is a non-profit research institute, not a startup.
Betteridge's Law applies.
"once" the training data can do it, LLMs will be able to do it. and AI will be able to do math once it comes to check out the lights of our day and night. until then it'll probably wonder continuously and contiguously: "wtf! permanence! why?! how?! by my guts, it actually fucking works! why?! how?!"
AWS announced, 2 or 3 weeks ago, a way of formulating rules in a formal language.
AI doesn't need to learn everything; our LLMs already contain EVERYTHING, including ways of finding a solution step by step.
Which means you can tell an LLM to translate whatever you want into a logical language and use an external logic verifier. The only thing an LLM or AI needs to 'understand' at this point is how to make sure that the statistical translation from one form to the other is good enough.
Your brain doesn't just do logic out of the box; you conclude things and then formulate them.
And plenty of companies work on this. It's the same with programming: if you are able to write code and execute it, you iterate until the compiler errors are gone, and now your LLM can write valid code out of the box. Let the LLM write unit tests, and now it can verify itself.
Claude for example offers you, out of the box, to write a validation script. You can give claude back the output of the script claude suggested to you.
Don't underestimate LLMs
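A minimal sketch of that "translate, then verify externally" loop, with the LLM's output hard-coded and the Z3 SMT solver (the `z3-solver` package) standing in for the external logic verifier:

```python
# The "LLM" part is faked: pretend it translated the request
# "find integers x, y with x + 2y = 7 and x > y" into these constraints.
# Z3, not the model, then decides whether the formal statement holds.
from z3 import Ints, Solver, sat

x, y = Ints("x y")
solver = Solver()
solver.add(x + 2 * y == 7, x > y)

if solver.check() == sat:
    print("verified satisfiable:", solver.model())
else:
    print("no solution under these constraints")
```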
Is this the AWS thing you referenced? https://aws.amazon.com/what-is/automated-reasoning/
I do think it is time to start questioning whether the utility of AI can be reduced solely to the quality of the training data.
This might be a dogma that needs to die.
I tried. I don't have the time to formulate and scrutinise adequate arguments, though.
Do you? Anything anywhere you could point me to?
The algorithms live entirely off the training data. They consistently fail to "abduct" (infer) anything beyond information specific to the language in, and of, the training data.
The best way to predict the next word is to accurately model the underlying system that is being described.
It is a gradual thing. Presumably the models are inferring things at runtime that were not part of their training data.
Anyhow, philosophically speaking you are also only exposed to what your senses pick up, but presumably you are able to infer things?
As written: this is a dogma that stems from a limited understanding of what algorithmic processes are and the insistence that emergence can not happen from algorithmic systems.
If not, bad training data shouldn't be a problem.
There can be more than one problem. The history of computing (or even just the history of AI) is full of things that worked better and better right until they hit a wall. We get diminishing returns adding more and more training data. It’s really not hard to imagine a series of breakthroughs bringing us way ahead of LLMs.
As far as ChatGPT goes, you may as well be asking: Can AI use a calculator?
The answer is yes, it can utilize a stateful python environment and solve complex mathematical equations with ease.
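For example, this is roughly the kind of code the Python tool ends up running for a quadratic (a toy equation of my choosing, not from the linked chats):

```python
# Solve x^2 - 5x + 6 = 0 symbolically; the exact arithmetic is done by
# sympy rather than by next-token prediction.
from sympy import symbols, Eq, solve

x = symbols("x")
print(solve(Eq(x**2 - 5 * x + 6, 0), x))  # [2, 3]
```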
There is a difference between correctly stating that 2 + 2 = 4 within a set of logical rules and proving that 2 + 2 = 4 must be true given the rules.
I think you misunderstood, ChatGPT can utilize Python to solve a mathematical equation and provide proof.
https://chatgpt.com/share/676980cb-d77c-8011-b469-4853647f98...
More advanced solutions:
https://chatgpt.com/share/6769895d-7ef8-8011-8171-6e84f33103...
Awful lot of shy downvotes.. Why not say something if you disagree?
It still has to know what to code in that environment. And based on my years of math as a wee little undergrad, the actual arithmetic was the least interesting part. LLMs are horrible at basic arithmetic, but they can use Python as the calculator. But Python won't help them write the correct equations or even solve for the right thing (Wolfram Alpha can do a bit of that, though).
You’ll have to show me what you mean.
I’ve yet to encounter an equation that 4o couldn’t answer in 1-2 prompts unless it timed out. Even then it can provide the solution in a Jupyter notebook that can be run locally.
Never really pushed it. I have no reason to believe it wouldn't get most of that stuff correct. Math is very much like programming, and I'm sure it can output really good Python for its notebook to execute.
No: https://github.com/0xnurl/gpts-cant-count
I can't reliably multiply four digit numbers in my head either, what's your point?
Nobody said you have to do it in your head.
That's equivalent to what we are asking the model to do. If you give the model a calculator it will get 100%. If you give it pen and paper (i.e. let it show its working) then it will get near 100%.
Citation needed.
Which bit do you need a citation for? I can run the experiment in 10 mins.
> That's the equivalent to what we are asking the model to do.
Why?
What does it mean to give a model a calculator?
What do you mean “let it show its working”? If I ask an LLM to do a calculation, I never said it can’t express the answer to me in long-form text or with intermediate steps.
If I ask a human to do a calculation that they can’t reliably do in their head, they are intelligent enough to know that they should use a pen and paper without needing my preemptive permission.
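To make the calculator/pen-and-paper distinction above concrete (the numbers are arbitrary):

```python
# (1) "Give it a calculator": exact arithmetic, no prediction involved.
a, b = 4721, 3689
print(a * b)  # 17415769

# (2) "Pen and paper": the written-out partial products a model could be
# asked to show instead of guessing the product in one step.
partials = [int(d) * a * 10**i for i, d in enumerate(str(b)[::-1])]
print(partials, sum(partials))  # the partial products sum to a * b
```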
AI has an interior world model, thus it can do math if a chain of proof walks without uncertainty from room to room. The problem is its inability to reflect on its own uncertainty and then override that uncertainty, should a new room-entrance method be self-similar to a previous entrance.
I wish scientists who do psychology and cognition of actual brains could approach those AI things and talk about it, and maybe make suggestions.
I really really wish AI would make some breakthrough and be really useful, but I am so skeptical and negative about it.
Unfortunately, the scientists who study actual brains have all sorts of interesting models but ultimately very little clue how these brains work at the level of problem solving. I mean, there's all sorts of "this area is associated with that kind of process" and "here's evidence this area does this algorithm" stuff, but it's all at the level you'd imagine steam-engine engineers trying to understand a warp drive.
The "open worm project" was an effort years ago to get computer scientists involved in trying to understand what "software" a very small actual brain could run. I believe progress here has been very slow, and that gives an idea of how much greater the ignorance is around much larger brains.
https://en.wikipedia.org/wiki/OpenWorm
If you can't find useful things for LLMs or AI at this point, you must just lack imagination.
I haven't checked in a while, but last I checked, ChatGPT struggled on very basic things like: how many Fs are in this word? Not sure if they've managed to fix that, but since then I had lost hope in getting it to do any sort of math.
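The same check is of course trivial once it is actual code rather than next-token prediction over sub-word tokens (word chosen arbitrarily):

```python
word = "giraffe"
print(word.count("f"))  # 2
```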
I may be wrong, but I think it a silly question. AI is basically auto-complete. It can do math to the extent you can find a solution via auto-complete based on an existing corpus of text.
You're underestimating the emergent behaviour of these LLMs. See for example what Terence Tao thinks about o1:
https://mathstodon.xyz/@tao/113132502735585408
I'm always just so pleased that the most famous mathematician alive today is also an extremely kind human being. That has often not been the case.
Pretty sure this is out of date now
> AI is basically
Very many things, conventionally so labelled since the '50s.
You are speaking of LLMs.
Yes - I mean only to say "AI" as the term is commonly used today.
Humans can autocomplete sentences too because we understand what's going on. Prediction is a necessary criterion for intelligence, not an irrelevant one.
I understand the appeal of having a machine helping us with maths and expanding the frontier of knowledge. They can assist researchers and make them more productive, just like they can already make programmers more productive.
But maths is also a fun and fulfilling activity. Very often, when we learn a mathematical theory, it's because we want to understand and gain intuition about the concepts, or we want to solve a puzzle (for which we could already look up the solution). Maybe it's similar to chess: we didn't develop chess engines to replace human players and make them play each other, but they helped us become better chess players and understand the game better.
So the recent progress is impressive, but I still don't see how we'll use this tech practically and what impacts it can have and in which fields.