27 comments

  • mtokarski 7 hours ago

    Interesting work, but I think the interpretation may be a bit overstated. The authors claim that injecting too much factual "knowledge" during pretraining causes models to collapse — performance drops below the baseline once knowledge frequency crosses a threshold.

    The problem is how they inject it. Their “knowledge” isn’t natural language; it’s templated Wikidata triples like "X is the capital of Y." That’s a super low-entropy, highly repetitive distribution. When you cram enough of that into a fixed token budget, you’re not really teaching the model more facts — you’re just destroying linguistic diversity and skewing the token statistics.

    In real pretraining or domain adaptation scenarios, “knowledge” tends to appear in richer, more varied contexts. The practical takeaway isn’t "don’t add too much domain data," but rather "don’t overrepresent any single format or narrow syntactic pattern." The issue seems more about representation homogeneity than about factual density itself.
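
    To make the contrast concrete, here is a toy sketch (mine, not the paper's actual setup) of single-template injection vs. the kind of surface variety you'd see in real text:

      import random

      triple = ("Canberra", "capital_of", "Australia")

      # Single fixed template: every fact lands as the same token pattern,
      # which is roughly what the paper's injection does, per my reading.
      fixed = "{0} is the capital of {2}.".format(*triple)

      # Several surface forms for the same fact: the diversity that
      # templated injection throws away.
      templates = [
          "{0} is the capital of {2}.",
          "The capital city of {2} is {0}.",
          "{2}'s seat of government is {0}.",
          "If you want {2}'s capital, that's {0}.",
      ]
      varied = [t.format(*triple) for t in templates]

      print(fixed)
      print(random.choice(varied))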

    • magicalhippo 6 hours ago

      I'm sure there's other work, but I came across this in the Physics of Language Models paper [1] on knowledge extraction.

      Essentially they found that by presenting the knowledge in a single, fixed way, the model is trained to reproduce that exact sequence of tokens, rather than "internalizing" the knowledge.

      By varying the sentences, the model instead manages to separate out the knowledge, so to speak. This in turn drastically improves how well they can extract that knowledge later.

      [1]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5250633

      • ijk 4 hours ago

        That's consistent with other research I've seen, where varied presentation of the data is key to effective knowledge injection [1].

        My assumption, based on the research, is that training on different prompts but the same answer gives you more robust Q&A behavior; training on variations of how to express the same concept generalizes. Training on the same prompt and different answers gives you creative diversity [2].
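
        A rough toy illustration of the two regimes (made-up pairs, not taken from either paper):

          # Different prompts, same answer: pushes toward robust recall.
          recall_pairs = [
              ("What is the boiling point of water at sea level?", "100 degrees Celsius."),
              ("At sea level, water boils at what temperature?", "100 degrees Celsius."),
              ("Water boils at what temp at 1 atm?", "100 degrees Celsius."),
          ]

          # Same prompt, different answers: pushes toward creative diversity.
          diversity_pairs = [
              ("Write an opening line for a mystery novel.", "The lighthouse went dark at exactly 9:14."),
              ("Write an opening line for a mystery novel.", "Nobody noticed the second set of footprints."),
              ("Write an opening line for a mystery novel.", "It was the cat who found the key."),
          ]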

        [1] https://arxiv.org/abs/2404.00213 [2] https://arxiv.org/abs/2503.17126

      • dotancohen 2 hours ago

        It's the same for humans. This is the main argument against rote memorization.

    • spankalee 6 hours ago

      Doesn't this then support the claim that LLMs aren't building world models - where even linguistically simple factual statements should help expand and refine that model - and reinforce the idea that they are still just next token predictors?

      • dotancohen 2 hours ago

        Not unlike humans. Don't believe me? Go ask somebody these questions in quick succession:

          What colour is a tomato?
          What colour is a ruby?
          What colour are lips?
          What colour is a strawberry?
          What colour is blood?
          What colour traffic light do you drive on?

        • bryzaguy an hour ago

          What a cool demonstration. My automatic response was “red” for the traffic light, although a different part of my brain re-evaluated given the context. The question in my mind now is: is the automatic response a building block for the latter, or is that orchestration a fully separate system?

      • simsla 6 hours ago

        There's no inductive bias for a world model in multiheaded attention. LLMs are incentivized to learn the most straightforward interpretation/representation of the data you present.

        If the data you present is low entropy, it'll memorize. You need to make the task sufficiently complex so that memorisation stops being the easiest solution.
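
        A crude way to see the gap (toy corpora, unigram entropy only, so take the exact numbers loosely):

          import math
          from collections import Counter

          def unigram_entropy(tokens):
              counts = Counter(tokens)
              total = sum(counts.values())
              return -sum(c / total * math.log2(c / total) for c in counts.values())

          templated = ("Paris is the capital of France . "
                       "Berlin is the capital of Germany . "
                       "Tokyo is the capital of Japan .").split()

          varied = ("Paris is the capital of France . "
                    "Germany 's seat of government is Berlin . "
                    "Japan is governed from Tokyo .").split()

          print(unigram_entropy(templated))  # lower: the template dominates the token statistics
          print(unigram_entropy(varied))     # higher: same facts, more varied distribution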

      • andrewflnr 6 hours ago

        My read is that token prediction requires a more general model to predict more varied tokens, which makes it something closer to a world model. After all, in principle, there's a point where the optimal "token predictor" really is backed by a world model. (Now, is that model feasible to find? Unclear!)

  • itissid 3 hours ago

    What if we used the structured prompts from coding sessions on large projects to fine-tune models, especially the ones that draw on architecture design documents, domain knowledge (UML, statecharts, what have you), and which team member to ask about X? These could all be made into tool calls for instruction following.

    Right now it seems teams manage a reasonably sophisticated LLM layer, while MCPs and instruction following are dependent on one-shot context window management.

  • adsharma 11 hours ago

    I wish the authors had plotted model size (number of params) vs. the number of triples it can hold before the memory collapse happens.

    It's hard to map the frequency of knowledge injection to a real-world understanding of how much knowledge a 4B-param model can hold.

    • bconsta 9 hours ago

      There is a study that gives a rule of thumb of ~2 bits per param for a model's memorization capacity: https://arxiv.org/abs/2404.05405
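
      Back of the envelope for the 4B-param model asked about upthread (the bits-per-fact number is a pure guess on my part):

        params = 4e9           # 4B-parameter model
        bits_per_param = 2.0   # the rule of thumb from the paper
        bits_per_fact = 60     # guess at what a Wikidata-style triple costs to store

        capacity_bits = params * bits_per_param
        print(capacity_bits / 8 / 1e9)              # ~1 GB of raw capacity
        print(capacity_bits / bits_per_fact / 1e6)  # ~133 million facts, very roughly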

      • dart_pink 7 hours ago

        Seems they have replicated Gardner's work, "Maximum Storage Capacity in Neural Networks" (1987), without mentioning it. That paper established that the storage capacity of a neural network is about 2N (2 bits per parameter).

        • bconsta 5 hours ago

          I had no idea about this. Thanks for sharing

        • selimthegrim 7 hours ago

          Elizabeth Gardner for those looking.

      • adsharma 4 hours ago

        Recent: 3.6 bits per param

        https://arxiv.org/abs/2505.24832

        • dart_pink 3 hours ago

          You're both right. The classical capacity measure (Gardner's capacity limit) is defined as the maximum number of patterns that can be remembered with zero errors. This remains 2 bits per parameter, proven mathematically.

          The capacity definition in this recent paper is completely different: it is based on the Kolmogorov complexity of predicting a memorized sequence, or in layman's terms, how easy it is to compress known sequences. This allows for some bit "errors", i.e. some symbols with a poor compression ratio; only the total compression ratio of the sequence is measured.

          This is somewhat parallel to classical ECC limits (strict Hamming-distance constraints) vs modern probabilistic ECC limits.

          TL;DR: when you allow a small number of errors, the measured capacity increases from 2 bits to 3.6 bits per parameter.

      • adsharma 4 hours ago

        2 bits out of FP8 would be 25%; 2 bits out of FP16 would be 12.5%.

        I've seen recent work that claimed 70% of the params are used for memorization.

  • daft_pink 8 hours ago

    I’m really curious how much it costs to inject information like this into an LLM. People say training an LLM is very expensive, so if you want a domain-specific LLM, how much does the additional training cost?

    • simonw 8 hours ago

      It sounds like you're talking about fine-tuning an existing model. That's not what this paper did - they studied the effect of training small models entirely from scratch with varying amounts of domain knowledge.

      I still haven't seen strong evidence that fine-tuning to add extra knowledge is effective, but I'd be delighted to learn otherwise.

      • ijk 4 hours ago

        Adding knowledge works, depending on how you define "knowledge" and "works"; given sufficient data you can teach an LLM new things [1].

        However, the frontier models keep improving at a quick enough rate that it's often more effective just to wait for the general solution to catch up with your task than to spend months training a model yourself. Unless you need a particularly tightly controlled behavior, or a smaller, faster model, or what have you. Training new knowledge in can get weird [2].

        And in-context learning takes literal seconds-to-minutes of time if your information fits in the context window, so it's a lot faster to go that route if you can.

        [1] https://arxiv.org/abs/2404.00213

        [2] https://openreview.net/forum?id=NGKQoaqLpo

      • hollerith 7 hours ago

        Are there any effective ways to add extra knowledge to an LLM, ways that are more than just demos or proofs of concept?

        For example, could there be a site like HN with ten thousand contributors where the contributions are changes to an LLM rather than posts and comments?

        One issue is that if contribution A contradicts contribution B, then on HN the contradiction presents no problem (i.e., two HN comments can and often do contradict each other just fine) whereas AFAICT the LLM will need to resolve the contradiction somehow to give coherent answers on the subject matter of the contributions A and B. Then again I suppose the LLM's answer could take the form, "opinions on [subject] vary, with some maintaining that . . . whereas others claim that . . ."

        • econ 3 hours ago

          One mistake people make is preferring to close questions immediately. One should instead leave them all open until a situation arises where your actions should [unavoidably] depend on "knowing" the answer.

          Let's say, just in time for Jesus to save you.

        • simonw 7 hours ago

          This is a solved problem. The answer is to add extra relevant information to the context as part of answering the user's prompt.

          This is sometimes called RAG, for Retrieval Augmented Generation.

          These days the most convincing way to do this is via tool calls.

          Provide your LLM harness with a tool for running searches, and tell it to use that tool any time it needs additional information.

          A good "reasoning" LLM like GPT-5 or Claude 4 can even handle contradictory pieces of information - they can run additional searches if they get back confusing results and work towards a resolution, or present "both sides" to the user if they were unable to figure it out themselves.
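
          A minimal sketch of that loop, assuming the OpenAI-style tool-calling API; search_documents() is a stand-in for whatever retrieval you actually have (vector store, grep, internal API):

            import json
            from openai import OpenAI

            client = OpenAI()

            tools = [{
                "type": "function",
                "function": {
                    "name": "search_documents",
                    "description": "Search the knowledge base and return relevant passages.",
                    "parameters": {
                        "type": "object",
                        "properties": {"query": {"type": "string"}},
                        "required": ["query"],
                    },
                },
            }]

            def search_documents(query):
                # Placeholder retrieval; swap in your own search backend.
                return "No results found for: " + query

            messages = [{"role": "user", "content": "What changed in our Q3 pricing policy?"}]
            while True:
                msg = client.chat.completions.create(
                    model="gpt-4.1",  # any tool-capable model works here
                    messages=messages,
                    tools=tools,
                ).choices[0].message
                if not msg.tool_calls:
                    break  # the model answered directly instead of searching again
                messages.append(msg)
                for call in msg.tool_calls:
                    query = json.loads(call.function.arguments)["query"]
                    messages.append({
                        "role": "tool",
                        "tool_call_id": call.id,
                        "content": search_documents(query),
                    })

            print(msg.content)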

  • gdiamos 11 hours ago

    I wonder if this depends on what is inside the domain specific data.

    I’m happy to see ML papers on hacker news.