Use "\A \z", not "^ $" with Python regular expressions

(sethmlarson.dev)

40 points | by todsacerdoti a day ago

20 comments

  • flufluflufluffy 17 hours ago

    The vast majority of the time I use ^/$, I actually want the behavior of matching the start/end of lines. If I had some multi-line text and only wanted to update or do something with the actual beginning or end of the entire text, I’d typically just do it manually.

    • theamk 16 hours ago

      A lot of the time I want to check for a valid identifier:

          if not re.match('^[a-z0-9_]+$', user):
              raise SomeException("invalid username")
      
      As written, the code above is incorrect - it will happily accept "john\n", which can cause all sorts of havoc down the line.
      • extraduder_ire 12 hours ago

        Shouldn't you use the match returned from the string? Or use .fullmatch() (added in 3.4) to match the whole string.

        • theamk 7 hours ago

          In general no, you should not use the match from the string. If you are getting input from a user, you want more complex processing (like stripping all whitespace), and if you are getting input from API calls, you want to either use the specified name as-is, or fail.

          Yes, fullmatch() will help, and so will \Z. It's just that it is so easy to forget...
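
The behaviour discussed in this subthread can be checked with a short sketch. The helper names (`is_valid_loose`, `is_valid_strict`, `is_valid_fullmatch`) are made up for illustration; only the pattern and the "john\n" input come from the comments above.

```python
import re

# Pattern from the comment above: ^ and $ anchors, no flags.
LOOSE = re.compile(r'^[a-z0-9_]+$')
# Same character class, but anchored with \Z (end of string only).
STRICT = re.compile(r'[a-z0-9_]+\Z')


def is_valid_loose(user: str) -> bool:
    return LOOSE.match(user) is not None


def is_valid_strict(user: str) -> bool:
    return STRICT.match(user) is not None


def is_valid_fullmatch(user: str) -> bool:
    # fullmatch() requires the whole string to match, so no anchors are needed.
    return re.fullmatch(r'[a-z0-9_]+', user) is not None


print(is_valid_loose("john\n"))      # True: $ also matches just before a trailing newline
print(is_valid_strict("john\n"))     # False
print(is_valid_fullmatch("john\n"))  # False
```

Either of the last two forms rejects the trailing newline; which one reads better is a matter of taste.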

  • Joker_vD a day ago

    Regular expressions as we basically know them today were made for ed. In that context, '$' absolutely had to match the terminating newline or it would've been completely useless.

  • seanwilson 16 hours ago

    I wish one of those regex libraries that replaces the regex symbols with human readable words would become standard. Or they don't work well?

    Regex is one of those things where I have to look up to remind myself what the symbols are, and by the time I need this info again I've forgotten it all.

    I can't think of anywhere else in general programming where we have something so terse and symbol heavy.

    • db48x 16 hours ago

      It’s been done. Emacs, for example, has rx notation. From the manual:

          35.3.3 The ‘rx’ Structured Regexp Notation
          ------------------------------------------
          
          As an alternative to the string-based syntax, Emacs provides the
          structured ‘rx’ notation based on Lisp S-expressions.  This notation is
          usually easier to read, write and maintain than regexp strings, and can
          be indented and commented freely.  It requires a conversion into string
          form since that is what regexp functions expect, but that conversion
          typically takes place during byte-compilation rather than when the Lisp
          code using the regexp is run.
          
             Here is an ‘rx’ regexp(1) that matches a block comment in the C
          programming language:
          
               (rx "/*"                    ; Initial /*
                   (zero-or-more
                    (or (not "*")          ;  Either non-*,
                        (seq "*"           ;  or * followed by
                             (not "/"))))  ;     non-/
                   (one-or-more "*")       ; At least one star,
                   "/")                    ; and the final /
          
          or, using shorter synonyms and written more compactly,
          
               (rx "/*"
                   (* (| (not "*")
                         (: "*" (not "/"))))
                   (+ "*") "/")
          
          In conventional string syntax, it would be written
          
               "/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/"
      
      Of course, it does have one disadvantage. As the manual says:

             The ‘rx’ notation is mainly useful in Lisp code; it cannot be used in
          most interactive situations where a regexp is requested, such as when
          running ‘query-replace-regexp’ or in variable customization.
      
      Raku also has advanced the state of the art considerably.
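
For comparison with the rx example, Python's standard `re` module offers the `re.VERBOSE` flag, which keeps the symbols but allows whitespace and comments inside the pattern. The following is only a sketch: the translation of the manual's block-comment regexp into Python syntax and the name `C_BLOCK_COMMENT` are assumptions, not taken from the thread.

```python
import re

# Assumed Python translation of the manual's C block-comment regexp.
C_BLOCK_COMMENT = re.compile(r"""
    /\*             # initial /*
    (?:
        [^*]        # either a non-*,
      | \*[^/]      # or a * followed by a non-/
    )*
    \*+             # at least one star,
    /               # and the final /
""", re.VERBOSE)

m = C_BLOCK_COMMENT.search("int x; /* a * b */ int y;")
print(m.group(0))  # /* a * b */
```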
  • zahlman 9 hours ago

    For this to matter, it seems that I would have to be in the situation of:

    * running a regex not in multi-line mode

    * on input that was presumably split from multiple lines, or within a line of multi-line input

    * wherein I care whether the line in question is the last line of input without a trailing newline

    * but I didn't check, or `.strip()` or anything

    I can't say I recall ever being bitten by this.

    And there is also nothing here to justify \A over ^.

  • eviks a day ago

    so why \A instead of ^?
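
One difference that can be shown directly, as a small sketch with a made-up input string: in Python, `^` and `\A` behave the same without flags and only diverge once `re.MULTILINE` is set.

```python
import re

text = "first\nsecond"

# Without flags, ^ and \A both match only at the start of the string.
print(re.findall(r'^\w+', text))                 # ['first']
print(re.findall(r'\A\w+', text))                # ['first']

# With re.MULTILINE, ^ also matches after every newline; \A does not change.
print(re.findall(r'^\w+', text, re.MULTILINE))   # ['first', 'second']
print(re.findall(r'\A\w+', text, re.MULTILINE))  # ['first']
```

So `\A` keeps its meaning even if someone later adds the multi-line flag to the pattern.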

  • svilen_dobrev 18 hours ago

    It's in the spec. Since forever, like v1.3? Don't remember.

    And it is the same in Perl: from `man perlre`:

       ^   Match the beginning of the string  (or line, if /m is used)
  • autoexec a day ago

    I've said it before and I'll say it again, I'd like Python a lot more if it abandoned re and handled regex like perl did.

    • edflsafoiewq 15 hours ago

      I've never used perl. What's the difference?

      • autoexec 8 hours ago

        It doesn't need an import at all. It's just a normal part of the language's syntax and can be used just about anywhere:

            $foo =~ /regex/
            $result = $foo =~ /regex/
            if ($foo =~ /regex/) {whatever;}
            while (/regex/) {whatever;}
        
        The captures ($1, $2, etc.) are global and usable wherever you need them.

        In this particular case the default is that $ matches the end of a string without a newline but you can include it anytime you need to:

           $foo =~ /regex$/ # end of string without newline
           $foo =~ /regex$/m # end of string with newline
  • instig007 15 hours ago

    ABC: Always. Build on. Parser Combinators.

    Python ecosystem has several options, for instance: https://parsy.readthedocs.io/en/latest/tutorial.html

  • az09mugen a day ago

    They could simply advise using word boundaries '\b' instead.

    • notpushkin 13 hours ago

      Which would also match whitespace in addition to the \n they’re trying to avoid matching?
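
A quick sketch of what `\b` does with the username check from earlier in the thread. The inputs are illustrative; `\b` only asserts a word/non-word boundary, so anything may follow the matched word.

```python
import re

pattern = re.compile(r'\b[a-z0-9_]+\b')

# \b is satisfied between "n" and "\n", so the trailing newline is accepted.
print(pattern.match("john\n") is not None)      # True
# Only "john" is matched; the rest of the string is ignored.
print(pattern.match("john smith") is not None)  # True
# Requiring the whole string to match rejects the newline.
print(pattern.fullmatch("john\n") is not None)  # False
```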