Show HN: Experiments in AI-generation of crosswords

(abstractnonsense.com)

32 points | by abstractbill a day ago

19 comments

  • vunderba 21 hours ago

    Not bad.

    As someone who has dabbled in AI-generated crosswords, I found that providing samples of "good crossword clues" (which I curated from historical NYT Monday puzzles) as part of the LLM context helped tremendously in generating better clues.
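
    The few-shot idea above can be sketched as a simple prompt builder. The example (answer, clue) pairs and the prompt wording here are purely illustrative placeholders, not an actual curated list:

```python
# Sketch of few-shot clue prompting: seed the context with curated
# (answer, clue) pairs, then ask for a clue for a new answer.
# The example pairs below are hypothetical placeholders.

CURATED_EXAMPLES = [
    ("OREO", "Cookie often taken apart before eating"),
    ("ERA", "Notable stretch of time"),
    ("SCALE", "It may be musical or fishy"),
]

def build_clue_prompt(answer: str) -> str:
    lines = ["Write a crossword clue in the style of these examples:", ""]
    for ans, clue in CURATED_EXAMPLES:
        lines.append(f"Answer: {ans}\nClue: {clue}\n")
    lines.append(f"Answer: {answer}\nClue:")
    return "\n".join(lines)
```

    The returned string would then go to whatever LLM API you're using, with the model completing the final "Clue:" line.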

    There was also a Show HN for a generative AI crossword puzzle system a few months ago so I'll include what I mentioned there:

    Part of the deep satisfaction in solving a crossword puzzle is the specificity of the answer. It's far more gratifying to answer a question with something like "Hawking" than to answer with "scientist", or to answer with "Mandelbrot" versus "shape".

    So ideally, you want to lean towards "specificity" wherever possible, and use "generics" as filler.

    Link:

    https://news.ycombinator.com/item?id=41879754

    • abstractbill 21 hours ago

      Thanks. Yes, specificity of solutions seems like a good metric to optimize for.

      In some of my crosswords I get clues that are specific in clever ways. For example, one of them has "Extreme, not camping", which I thought was really strange until I found the answer "intense". I was very impressed by that level of wordplay from an LLM!

  • korymath 21 hours ago

    Great post.

    Funny, I just posted this to X:

    2025 GenAI challenge

    Create a 5x5 crossword puzzle with two distinct solutions. Each clue must work for both solutions. Do not use the same word in both solutions. No black squares.

    I try with each new model that lands. Still can’t get it.
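
    For anyone attempting this, here's a minimal checker for a candidate answer pair (my own sketch, not part of the original challenge). It verifies the mechanical constraints; judging whether each clue genuinely works for both solutions still takes a human:

```python
# Checks a candidate pair of solutions for the double-solution
# crossword challenge: every across and down entry must be a real
# word, and no word may appear in both solutions.
# Grids are lists of equal-length strings (the challenge uses 5x5).

def entries(grid):
    """All across (row) and down (column) entries of a square grid."""
    rows = list(grid)
    cols = ["".join(row[i] for row in grid) for i in range(len(grid))]
    return rows + cols

def valid_pair(grid_a, grid_b, wordlist):
    ea, eb = entries(grid_a), entries(grid_b)
    all_real = all(w in wordlist for w in ea + eb)
    disjoint = not (set(ea) & set(eb))  # no word shared across solutions
    return all_real and disjoint
```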

    • alberto_balsam 20 hours ago

      Do you know if there is a solution to this by humans? I'd be interested in seeing it.

      • korymath 19 hours ago

        I've not found a solution at any NxN size made by human or machine.

        • quuxplusone 17 hours ago

          You might get a little closer by tweaking the prompt — you're asking the LLM to "figure out" that the first step is to create two 5x5 word squares with no repeated words, and then the second step is to solve ten requests of the form, "Give me a crossword-style clue that could be reasonably solved by either the word OPERA or the word TENET" (for each of the ten word-pairs in your square-pairs). However, LLMs are based on tokens, and thus fundamentally don't "understand" that words are made out of letters — that's why we have memes about their inability to count the number of "r"s in "strawberry" and so on. So we shouldn't expect an LLM to be able to perform Step 1 at all, really. And Step 2 requires wordplay and/or lateral thinking, which LLMs are again bad at. (They can easily do "Give me a crossword-style clue that could be solved by the word OPERA," because there are databases of such things on the web which form part of every LLM's dataset. But there's no such database for double-solution clues.)

          Generating a 5x5 word square (with different words across and down, so not of the "Sator Arepo" variety) is already really hard for a human. I plugged the Wordle target word list into https://github.com/Quuxplusone/xword/blob/master/src/xword-f... to get a bunch of plausible squares like this:

              SCALD
              POLAR
              ARTSY
              CEASE
              ERROR
          
          But you want two word squares that can plausibly be clued together, which is (not impossible, but) difficult if matching entries aren't the same part of speech. For example, cluing "POLAR" together with "ARTSY" (both adjectives) seems likely more doable than cluing "POLAR" together with "LASSO" (noun or verb).
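
          For the curious, a double word square like the one above can be found by backtracking with prefix pruning. This is a simplified sketch in that spirit, not the linked tool's actual algorithm -- brute force, so only practical on small word lists:

```python
# Backtracking search for an n x n double word square (different words
# across and down). Prunes any candidate row whose partial columns
# cannot be extended into a word from the list.

def word_square(words, n):
    words = [w for w in words if len(w) == n]
    prefixes = {w[:i] for w in words for i in range(n + 1)}

    def extend(rows):
        if len(rows) == n:
            cols = ["".join(r[i] for r in rows) for i in range(n)]
            # columns must be real words, all distinct from the rows
            if all(c in words for c in cols) and not set(rows) & set(cols):
                return rows
            return None
        for w in words:
            if w in rows:
                continue
            cols = ["".join(r[i] for r in rows + [w]) for i in range(n)]
            if all(c in prefixes for c in cols):
                result = extend(rows + [w])
                if result:
                    return result
        return None

    return extend([])
```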

          Anyway, here's my attempt at a human solution, using the grid above — and another grid, which I'll challenge you to find from these clues. Hint: All but two of the ten pairs match, part-of-speech-wise.

              1A. Remove the outer layer of, perhaps  
              2A. Region on a globe  
              3A. Like some movie theaters  
              4A. Command to a lawbreaker  
              5A. Rhyme for Tom Lehrer?  
              1D. ____yard (sometime sci-fi setting)  
              2D. It goes something like this: Ꮎ  
              3D. Feature of liturgy, often  
              4D. It's vacuous, in a sense  
              5D. Fino, vis-a-vis Pedro Ximénez
    • abstractbill 18 hours ago

      Thanks!

      That's a wonderfully hard problem; I'd love to see it solved.

    • echelon 20 hours ago

      That's algorithmically hard.

      Ask the LLM to generate a program to solve the problem.

      • korymath 19 hours ago

        I've tried that, as recently as today with latest Gemini, Claude, and o1 ... none have been successful.

  • corlinpalmer 17 hours ago

    Awesome! I have also dabbled in AI-generated crosswords, but I was more fascinated with the concept of generating the most efficient layout of an X-by-X grid from a given word set. It's a surprisingly difficult optimization problem because the combinatorics are insane. Here's an example output trying to find the most efficient layout of common Linux terminal commands:

        W   P     G   
        H I S T O R Y 
        E         O   
        R   T   Y U M 
      L E S S     P   
        I   O   C A T 
      U S E R A D D   
      L     T R   D C 
    
    Of course this is a pretty small grid and it gets more difficult with size. I've thought about making a competition from this sort of challenge. Would anyone be interested?
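
    One plausible objective for "most efficient" is letter density within the layout's bounding box -- a guess at a metric for the competition above, not necessarily the one actually used:

```python
# Scores a crossword layout by how much of its bounding box is filled.
# A layout is a dict mapping (row, col) -> letter for each placed cell.

def density(placements):
    rows = [r for r, _ in placements]
    cols = [c for _, c in placements]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return len(placements) / (height * width)
```

    An optimizer would then search over placements (e.g. with simulated annealing) to maximize this score, subject to the words intersecting legally.
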
    • abstractbill 17 hours ago

      Yes! That's a really fun problem too -- it feels like it should be tractable but it's insanely hard. If you do start some kind of competition around it, let me know; I'd be interested.

  • furyofantares 21 hours ago

    I've tried to get o1 to generate Xordle puzzles.

    Warning: post contains a spoiler for a recent Xordle.

    Xordle is Wordle with two target words that share no letters. Additionally, there is a "free clue" given at the start, and all three words are thematically linked. It's not always a straightforward link; for example, a recent puzzle had the starter word 'grief' and targets 'empty' and 'chair'. All puzzles today are selected from user submissions.
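
    The hard constraint on a Xordle target pair is easy to check mechanically, even though choosing a satisfying theme isn't:

```python
# Validity check for a Xordle target pair: two 5-letter words
# sharing no letters.

def valid_xordle_pair(a: str, b: str) -> bool:
    return len(a) == len(b) == 5 and not set(a) & set(b)
```

    For the recent puzzle mentioned above, valid_xordle_pair("empty", "chair") holds, since the two words have disjoint letter sets.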

    o1 is the first model that's been able to solve Xordles reliably, or to generate valid puzzles at all. It's well-known that these things are massively handicapped for this type of task due to tokenization.

    But since o1 can in fact achieve it, I wanted to see if I could get it to make puzzles that are at all satisfying. Instead it makes very bland puzzles, with straightforward connections and extremely broad themes.

    Prompting can swing the pendulum too far in the other direction, to puzzles where the connection is contrived and impossible to see even after it's solved. As I've often experienced with LLMs, being able to hit either side of a target with prompting does not necessarily mean you can get it to land in the middle, and in fact I have had no success in doing so with this task.

    This is one of the most basic examples I know of an LLM's lack of creativity or "taste". It is a little hard for a human to generate two 5-letter words with no overlapping letters, but it is extremely easy for a human to look at a thematic connection among 2-3 words and say whether it's satisfying. Yet so far I've been totally unable to get an LLM to make satisfying puzzles.

    edit: Nothin' like making a claim about LLMs to get one up off one's ass and try to prove it wrong immediately. I'm getting some much better results with better examples now.

    • IanCal 21 hours ago

      Have you tried using an LLM to judge whether the puzzles are good or not?

    • abstractbill 21 hours ago

      Great observation. Yeah, I've had very similar experiences with prompting, exactly as you said -- one direction gives very bland, literal clues, and the opposite direction gives clues that are a stretch even when you know the answer!

  • gowld 21 hours ago

    The "American" grids aren't American. An American grid almost always has two answers (one in each direction) per square.

    • abstractbill 21 hours ago

      Oh, that's really interesting, thanks! That would actually be an easy constraint to add, too.

      • quuxplusone 16 hours ago

        American-style crossword construction has a number of constraints, some bendable, some not.

        - Every cell must be "keyed," i.e., part of a word Across and a word Down. Unkeyed cells are strictly forbidden.

        - No word may be less than 3 letters. Two-letter words are strictly forbidden.

        - The grid must be rotationally symmetric. (But this rule can be broken for fun. Bilaterally symmetric grids are relatively common these days. Totally asymmetric grids are very rare and always in service of some kind of fun — see https://www.xwordinfo.com/Thumbs?select=symmetry )

        - No more than one-sixth of the squares can be black. (But this rule can be broken, usually either to make the puzzle less challenging by shortening the average word length, or to make the creator's life easier in order to achieve some other feat.)

        - If a single black square is bordered on two adjoining sides by other black squares, then it could be turned white without destroying the other properties of the grid. Such black squares are called "cheaters" and are frowned upon. (Though they might serve a purpose, e.g. to fit a specific theme entry's length.)
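
        The strict constraints above (full keying, minimum entry length 3, rotational symmetry) can be checked mechanically. A rough sketch of my own, assuming a square grid given as strings with '#' for black squares:

```python
# Checks the strict American-grid rules: every white run (across and
# down) has length >= 3 -- which also guarantees every cell is keyed
# in both directions -- plus 180-degree rotational symmetry of the
# black squares. Assumes a square grid of equal-length strings.

def white_runs(line):
    """Lengths of the maximal runs of non-'#' cells in one row/column."""
    return [len(seg) for seg in line.split("#") if seg]

def check_grid(grid):
    n = len(grid)
    cols = ["".join(row[c] for row in grid) for c in range(n)]
    lengths = [length for line in list(grid) + cols
               for length in white_runs(line)]
    keyed = all(length >= 3 for length in lengths)
    symmetric = all((grid[r][c] == "#") == (grid[n-1-r][n-1-c] == "#")
                    for r in range(n) for c in range(n))
    return keyed and symmetric
```

        The softer rules (black-square budget, cheater squares) are better treated as warnings than hard failures.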

  • dgreensp 20 hours ago

    I found this article a bit disappointing.

    The link at the bottom doesn’t work.

    The grids shown do not follow the well-known rules of (American) crosswords: every square is part of two words of three or more letters each.

    Coming up with a pattern of black squares, and writing good clues, are two parts of making a crossword puzzle that are IMO fun, benefit from a human touch, and are not overly difficult. There are also databases of past clues used in crossword puzzles (e.g. every NY Times clue ever, and various crossword dictionaries) for reference and possible training. If you don’t care about originality (or copyright) and want quality clues, you can just pull clues from these. If you do care about those things, you can surface the list of clues used in the past to the human constructor and let them write the final clue. Or you can try to perfect LLM clue-writing.

    In my experience, LLMs are terrible at clues. For example, if I give one feedback about a clue, it will often just work the feedback into the clue itself. It’s a little hard to describe without an example, but basically it doesn’t seem to understand the requirements of a clue, or the process of a solver looking at a clue and trying to come up with an answer.

    Coming up with an interlocking set of fun, high-quality words and phrases is the hard part. I agree that LLM wordlist curation is a great idea, and I started playing around with that once.

    Beyond that, I don’t think LLMs can help with grid construction, which is a more classic combinatorial problem.

    • abstractbill 17 hours ago

      > The link at the bottom doesn’t work.

      Can you clarify which link is broken and how? What browser and OS?

      > In my experience, LLMs are terrible at clues.

      That hasn't been my experience. Without good prompting they give you clues that are too bland and literal, but it is quite possible to get them to give you clues with interesting and creative wordplay. I wish it were easier to get clues like that consistently, but it's certainly doable. I still believe that within a year it'll be easy.