Narrative Jailbreaking for Fun and Profit

(interconnected.org)

95 points | by tobr a day ago ago

23 comments

  • cantsingh 21 hours ago ago

    I've been playing with the same thing, it's like a weird mix of social engineering and SQL injection. You can slowly but surely shift the window of what the bot thinks is "normal" for the conversation. Some platforms let you rewrite your last message, which gives you multiple "attempts" at getting the prompt correct to keep the conversation going the direction you want it.

    Very fun to do on that friend.com website, as well.

    • deadbabe 21 hours ago ago

      I tried it on friend.com. It worked a for a while, I got the character to convince itself it had been replaced entirely by a demon from hell (because it kept talking about the darkness in their mind and I pushed them to the edge). They even took on an entire new name. For quite a while it worked, then suddenly in one of the responses it snapped out of it, and assured me we were just roleplaying no matter how much I tried to go back to the previous state.

      So in these cases where you think you’ve jailbroken an LLM, is it really jailbroken or is it just playing around with you, and how do you know for sure?

      • Yoric 21 hours ago ago

        > So in these cases where you think you’ve jailbroken an LLM, is it really jailbroken or is it just playing around with you, and how do you know for sure?

        With a LLM, I don't think that there is a difference.

        • Terr_ 15 hours ago ago

          I like to think of it as a amazing document autocomplete being applied to a movie script, which we take turns appending to.

          There is only a generator doing generator things, everything else--including the characters that appear in the story--are mostly in the eye of the beholder. If you insult the computer, it doesn't decide it hates you, it simply decides that a character saying mean things back to you would be most fitting for the next line of the document.

      • xandrius 21 hours ago ago

        Just to remind people, there is no snapping out of anything.

        There is the statistical search space of LLMs and you can nudge it to different directions to return different outputs; there is no will in the result.

        • ta8645 19 hours ago ago

          Isn't the same true for humans? Most of us stay in the same statistical search space for large chunks of our lives, all but sleepwalking through the daily drudgery.

          • 1659447091 18 hours ago ago

            No, humans have autonomy.

            • gopher_space 15 hours ago ago

              In a big picture sense. Probably more correct to say that some humans have autonomy some of the time.

              My go-to example is being able to steer the pedestrian in front of you by making audible footsteps to either side of their center.

              • 1659447091 14 hours ago ago

                The pedestrian in front of you has the choice to be steered or to ignore you--or more unexpected actions. Which ever they choose has nothing to do with the person behind them taking away their autonomy and everything to do with what they felt like doing with it at the time. Just because the wants of the person behind them and willingness & aweness and choice of the person in front align with those wants does not take away the forward person's self governance.

      • nico 21 hours ago ago

        Super interesting

        Some thoughts:

        - if you get whatever you wanted before it snaps back out of it, wouldn’t you say you had a successful jailbreak?

        - related to the above, some jailbreaks in physical devices, don’t persist after a reboot, they are still useful and called jailbreak

        - the “snapped out”, could have been caused by a separate layer, within the stack that you were interacting with. That intermediate system could have detected, and then blocked, the jailbreak

  • squillion 14 hours ago ago

    Very cool! That’s hypnosis, if we want to insist with the psychological metaphors.

    > If you run on the conversation the right way, you can become their internal monologue.

    That’s what hypnosis in people is about, according to some: taking over someone else’s monologue.

  • spiritplumber 11 hours ago ago

    Yeah, I got chatgpt to help me write a yaoi story between an interdimensional terrorist and a plant-being starship captain (If you recognize the latter, no, it's not what you think, either).

    It's actually not hard.

  • isoprophlex a day ago ago

    This is fun of course, but as a developer you can trivially and with high accuracy guard against it by having a second model critique the conversation between the user and the primary LLM.

    • abound a day ago ago

      Not sure if it's trivial or high accuracy for a dedicated user. This jailbreak game [1] was making the rounds a while back, it employs the trick you mentioned as well as any others to prevent an LLM from revealing a secret, but it's still not too terribly hard to get past.

      [1] https://gandalf.lakera.ai

    • Yoric a day ago ago

      I've spent most of my career working to make sure that my code works safely, securely and accurately. While what you write makes sense, it's a bit of a shock to see such solutions being proposed.

      So far, when thinking about security, we've had to deal with:

      - spec-level security;

      - implementation-level security;

      - dependency-level security (including the compiler and/or runtime env);

      - os-level security;

      - config-level security;

      - protocol-level security;

      - hardware-level security (e.g. side-channel attacks).

      Most of these layers have only gotten more complex and more obscure with each year.

      Now, we're increasingly adding a layer of LLM-level security, which relies on black magic and hope that we somehow understand what the LLM is doing. It's... a bit scary.

      • qazxcvbnmlp a day ago ago

        It’s not black magic, but it is non deterministic. It’s not going to erase security and stability but it will require new skills and reasoning. The current mental model of “software will always do X if you prohibit bad actors from getting in” is broken.

        • Yoric 21 hours ago ago

          I agree that with the generation of code we're discussing this mental model is broken, but I think it goes a bit beyond determinism.

          Non-determinism is something that we've always had to deal with. Maybe your user is going to take the USB key before you're done writing your file, maybe your disk is going to fail, the computer is going to run out of battery during a critical operation, or your network request is going to timeout.

          But this was non-determinism within predictable boundaries. Yes, you may need to deal with corrupted file, an incomplete transaction, etc. but you could fairly easily predict where it could happen and how it could affect the integrity of your system.

          Now, if you're relying on a GenAI or RAG at runtime for anything other than a end-user interface, you'll need to deal with the possibility that your code might be doing something entirely unrelated than what you're expecting. For instance, even if we assume that your GenAI is properly sandboxed (and I'm not counting on early movers in the industry to ensure anything close to proper sandboxing), you could request one piece of statistics you'd like to display to your user, only to receive something entirely unrelated – and quite possibly something that, by law, you're not allowed to use or display.

          If we continue on the current trajectory, I suspect that it will take decades before we achieve anything like the necessary experience to write code that works without accident. And if the other trend of attempting to automate away engineering jobs continues, we might end up laying off the only people with the necessary experience to actually see the accidents coming.

          • schoen 17 hours ago ago

            > But this was non-determinism within predictable boundaries. Yes, you may need to deal with corrupted file, an incomplete transaction, etc. but you could fairly easily predict where it could happen and how it could affect the integrity of your system.

            There's been lots of interesting computer security research relying on aspects of the physical instantiation of software systems (like side channel attacks where something physical could be measured in order to reveal secret state, or fault injection attacks where an attacker could apply heat or radiation in order to make the CPU or memory violate its specifications occasionally). These attacks fortunately aren't always applicable because there aren't always attackers in a position to carry them out, but where they are applicable, they could be very powerful!

            • Yoric 32 minutes ago ago

              Yeah, I briefly mentioned those in my previous message :)

    • nameless912 a day ago ago

      This seems like a "turtles all the way down" kinda solution... What's to say you won't fool the supervisor LLM?

      • xandrius 21 hours ago ago

        It would be interesting to see if there is a layout of supervisors to make sure this less prone to hijacking. Something like byzantine generals where you know a few might get fooled, so you can construct personalities which are more/less malliable and go for consensus.

        This still wouldn't make it perfect but quite hard to study from an attacker's perspective.

      • apike 21 hours ago ago

        While this can be done in principle (it's not a foolproof enough method to, for example, ensure an LLM doesn't leak secrets) it is much harder to fool the supervisor than the generator because:

        1. You can't get output from the supervisor, other than the binary enforcement action of shutting you down (it can't leak its instructions)

        2. The supervisor can judge the conversation on the merits of the most recent turns, since it doesn't need to produce a response that respects the full history (you can't lead the supervisor step by step into the wilderness)

        3. LLMs, like humans, are generally better at judging good output than generating good output

      • ConspiracyFact 19 hours ago ago

        "Who will watch the watchers?"

        There is no good answer--I agree with you about the infinite regress--but there is a counter: the first term of the regress often offers a huge improvement over zero terms, even if perfection isn't achieved with any finite number of terms.

        Who will stop the government from oppressing the people? There's no good answer to this either, but some rudimentary form of government--a single term in the regress--is much better than pure anarchy. (Of course, anarchists will disagree, but that's beside the point.)

        Who's to say that my C compiler isn't designed to inject malware into every program I write, in a non-detectable way ("trusting trust")? No one, but doing a code review is far better than doing nothing.

        What if the md5sum value itself is corrupted during data transfer? Possible, but we'll still catch 99.9999% of cases of data corruption using checksums.

        Etc., etc.