Python's splitlines does more than just newlines

(yossarian.net)

113 points | by Bogdanp 5 days ago

44 comments

  • dleeftink 4 days ago

    For more controlled splitting, I really like Unicode named character classes[0], which allow more precise splitting and matching.

    [0]: https://en.wikipedia.org/wiki/Unicode_character_property#Gen...
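A minimal sketch of category-based splitting, using only the standard library's unicodedata module. The helper name split_on_categories and the choice of the Zl (line separator) and Zp (paragraph separator) categories are assumptions made for illustration, not something from the comment above.

```python
import unicodedata


def split_on_categories(text, categories=("Zl", "Zp")):
    """Split text at every character whose Unicode general category is listed."""
    pieces, current = [], []
    for ch in text:
        if unicodedata.category(ch) in categories:
            # Separator found: close the current piece and start a new one.
            pieces.append("".join(current))
            current = []
        else:
            current.append(ch)
    pieces.append("".join(current))
    return pieces


# U+2028 is category Zl and U+2029 is Zp; U+0085 (NEL) is a control character (Cc).
sample = "alpha\u2028beta\u2029gamma\x85delta"
parts = split_on_categories(sample)
print(parts)
```

With these categories the NEL character is left alone, whereas splitlines() would split on it as well.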

  • mixmastamyk 4 days ago

    Splitlines is generally not needed. for line in file: is more idiomatic.

    • tiltowait 4 days ago

      Splitlines additionally strips the newline character, functionality which is often (maybe even usually?) desired.

      • masklinn 4 days ago

        This has been controlled via a boolean parameter since at least 2.0, which as far as I can tell is when this method was added to `str`.
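For reference, a small sketch of that boolean parameter, which is called keepends in Python 3. The sample text is made up.

```python
text = "first\r\nsecond\nthird"

# Default: line endings are stripped from each piece.
stripped = text.splitlines()

# keepends=True: each piece keeps whatever line ending terminated it.
kept = text.splitlines(keepends=True)

print(stripped)
print(kept)
```

With keepends=True the pieces can be joined back into the original string, which is handy when the exact line endings matter.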

    • fulafel 4 days ago

      It has similar (but not identical) behaviour though:

        >>> for line in StringIO("foo\x85bar\vquux\u2028zoot"): print(line)
        ... 
        foo
        bar
         quux zoot
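Comparing repr() output may help separate what the iterator yields from what the terminal draws. In this sketch (variable names are illustrative), a default io.StringIO treats only \n as a line terminator, so iteration yields the whole string as one line; the visual breaks in a printed line would then come from the terminal rendering \x85, \v and \u2028.

```python
import io

text = "foo\x85bar\vquux\u2028zoot"

# Iterating a StringIO with default settings: only "\n" ends a line.
from_iteration = list(io.StringIO(text))

# str.splitlines() recognises a wider set of line boundaries.
from_splitlines = text.splitlines()

print(repr(from_iteration))
print(repr(from_splitlines))
```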
      • amelius 4 days ago

        I would expect it to have identical behavior.

    • rangerelf 4 days ago

      What if the text is already in a [string] buffer?

      • mixmastamyk 4 days ago

        StringIO can help; for the sibling comment, use .rstrip().

    • paulddraper 4 days ago

      If it's reading from a file, you wouldn't be using splitlines() anyway; you'd use readlines().

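      For a string, you'd need to:

        import io

        text = "first\nsecond\n"  # any string already in memory (not the builtin str)
        for line in io.StringIO(text):
            pass  # each line keeps its trailing newline

    • drdrey 4 days ago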

      not every line is read from a file

      • mixmastamyk 4 days ago

        That's where the "generally" fits in.

        • crazygringo 4 days ago

          No, because that still assumes files are the general usage.

          In my experience, they're not. It's strings.

          • mixmastamyk 4 days ago

            And where do you get these input strings? Big enough that .split() is not sufficient? Files, and yes sockets support the interface as well with a method call.
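A rough sketch of the socket case, assuming the method call meant here is socket.makefile(). The socketpair() setup and the helper name lines_from_socket are only there to make the example self-contained; a real program would use a connected socket.

```python
import socket


def lines_from_socket(sock):
    """Yield text lines from a socket via a file-like wrapper."""
    # newline="\n" means only "\n" terminates a line, with no translation.
    with sock.makefile("r", encoding="utf-8", newline="\n") as reader:
        for line in reader:
            yield line


# A local pair of connected sockets stands in for a real connection.
left, right = socket.socketpair()
left.sendall("one\ntwo\x85still two\n".encode("utf-8"))
left.close()  # signals end-of-stream to the reader

received = list(lines_from_socket(right))
right.close()
print(received)
```

As with regular text files opened this way, NEL (\x85) does not end a line here, unlike with splitlines().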

            • crazygringo 4 days ago

              > And where do you get these input strings?

              From database fields, API calls, JSON values, HTML tag content, function inputs generally, you know -- the normal places.

              In my experience, most people aren't dealing directly with files (or streams) most of the time.

              • mixmastamyk 3 days ago

                Your examples ultimately come from files or sockets, as I mentioned. Especially if big enough to use splitlines on them.

                I also used the word generally, so your insistence on quantifying the proportion is a complete waste of time.

                • crazygringo 2 days ago

                  What a nonsensical thing to say. You said to call "for line in file" -- you can't do that on a string, even if it originally came from part of a file. Or are you suggesting one should...?

                  And I said your "generally" was wrong. You were providing general advice, I'm saying it's wrong in general. Do you see me giving numerical quantities anywhere?

            • gnulinux 4 days ago

              They might be programmatically generated, for example.

              There are countless sources one can get a string from. Surely you don't think filesystems are the only source of strings?

              • mixmastamyk 3 days ago

                Input is very rarely auto-generated, though output is.

                Surely you haven't misread my comments above to such an extent? Perhaps not familiar with sockets.

                • gnulinux 3 days ago

                  No, I didn't misread, input can be self-generated of course. If you're writing a system that's designed like UserInput -> [BlackBox] -> Output, clearly user input won't be auto-generated. But if you factor [BlackBox] into a system like A -> B -> C, A -> D -> C, C -> Output, then each of those arrows will represent an input into the next system that was generated by something our devs wrote. This could be a bunch of jsonlines (related to this thread) interpreted as a string, a database, some in-memory structure, whatever.
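To make the jsonlines case concrete, here is a hedged sketch with made-up records. It assumes the producer wrote the JSON with ensure_ascii=False, so a U+2028 inside a string value is emitted unescaped; the helper count_bad is hypothetical.

```python
import json

records = [
    {"id": 1, "msg": "plain"},
    {"id": 2, "msg": "has\u2028separator"},  # U+2028 inside a string value
]

# One JSON document per line, joined with "\n" only.
payload = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

by_newline = payload.split("\n")
by_splitlines = payload.splitlines()  # also splits on the embedded U+2028


def count_bad(lines):
    """Count the pieces that fail to parse as JSON."""
    bad = 0
    for line in lines:
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad += 1
    return bad


decoded = [json.loads(line) for line in by_newline]
print(len(by_newline), count_bad(by_newline))
print(len(by_splitlines), count_bad(by_splitlines))
```

Splitting on "\n" round-trips both records, while splitlines() cuts the second record in two and neither half parses.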

  • cuckoos-jicamas 4 days ago

    The str.split() function does the same:

    >>> s = "line1\nline2\rline3\r\nline4\vline5\x1dhello"
    >>> s.split()
    ['line1', 'line2', 'line3', 'line4', 'line5', 'hello']
    >>> s.splitlines()
    ['line1', 'line2', 'line3', 'line4', 'line5', 'hello']

    But split() has a sep argument that defines the delimiter to split on, in which case it does what you expected to happen:

    >>> s.split('\n')
    ['line1', 'line2\rline3\r', 'line4\x0bline5\x1dhello']

    In general you want this:

    >>> import re
    >>> linesep_splitter = re.compile(r'\n|\r\n?')
    >>> linesep_splitter.split(s)
    ['line1', 'line2', 'line3', 'line4\x0bline5\x1dhello']

    • roelschroeven 4 days ago

      In that example str.split() has the same result as str.splitlines(), but it's not in general the same, even without a custom delimiter.

      str.split() splits on any type of whitespace, including tabs and spaces, which splitlines() doesn't do.

          >>> 'one two'.split()
          ['one', 'two']
          >>> 'one two'.splitlines()
          ['one two']
      
      split() without custom delimiter also splits on runs of whitespace, which splitline() also doesn't do (except for \r\n because that combination counts as one line ending):

          >>> 'one\n\ntwo'.split()
          ['one', 'two']
          >>> 'one\n\ntwo'.splitlines()
          ['one', '', 'two']

    • gertlex 4 days ago

      splitlines() is sometimes nice for ad hoc parsing (of well-behaved stuff...) because it throws out whitespace-only lines from the resulting list of strings.

      #1 use-case of that for me is probably just avoiding the cases where there's a trailing newline character in the output of a command I ran by subprocess.
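A quick check of the trailing-newline case, using a made-up string in place of real subprocess output. In this sketch splitlines() avoids the empty last element that split("\n") produces, but blank and whitespace-only lines in the middle are kept; dropping those takes split() with no arguments or an explicit filter.

```python
# Stand-in for command output captured as text, ending in a newline.
output = "first\n\n   \nlast\n"

with_splitlines = output.splitlines()
with_split_newline = output.split("\n")
with_plain_split = output.split()

# Explicit filter: keep only lines containing something other than whitespace.
non_blank = [line for line in output.splitlines() if line.strip()]

print(with_splitlines)
print(with_split_newline)
print(with_plain_split)
print(non_blank)
```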

  • meken 4 days ago

    TIL: Python has a splitlines function

  • zb3 4 days ago

    Useful to know for security purposes; surprises like that might cause vulnerabilities.

  • RainyDayTmrw 3 days ago

    Is there a parser ambiguity/confusion vector here?

  • wvbdmp 4 days ago

    What, no <br\s*\/?>?

  • zzzeek 4 days ago

    In the same theme, NTLAIL strip(), rstrip(), and lstrip() can strip other kinds of characters besides whitespace.

    • masklinn 4 days ago

      One thing to note, though, is that they take character sets: as long as they encounter characters in the specified set, they will keep stripping. Lots of people think that if you give it a string, it will remove that string.

      That feature was added in 3.9 with the addition of `removeprefix` and `removesuffix`.

      Sadly,

      1. unlike Rust's version they provide no way of knowing whether they stripped things out

      2. unlike startswith/endswith they do not take tuples of prefixes/suffixes
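A short sketch of the difference, plus one possible workaround for the two gaps listed above. It needs Python 3.9+ for removesuffix; the helper strip_any_prefix is hypothetical, not a standard-library function.

```python
# rstrip() takes a set of characters, so it keeps going past ".txt".
char_set_result = "test.txt".rstrip(".txt")

# removesuffix() removes the exact string, at most once (Python 3.9+).
exact_result = "test.txt".removesuffix(".txt")


def strip_any_prefix(text, prefixes):
    """Remove the first matching prefix and report whether one was removed."""
    for prefix in prefixes:
        if prefix and text.startswith(prefix):
            return text[len(prefix):], True
    return text, False


schemes = ("http://", "https://")
hit = strip_any_prefix("https://example.org", schemes)
miss = strip_any_prefix("example.org", schemes)

print(char_set_result, exact_result)
print(hit, miss)
```

Returning a (result, stripped) pair is just one design choice; comparing the result's length with the original's works too.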

  • 7bit 5 days ago

    This article provides no additional value to the splitlines() docs.

    • woodruffw 4 days ago

      The "article" is my TIL mini-blog. What were you expecting besides a "today I learned"?

      • kstrauser 4 days ago

        I already knew this information, more or less, but I like reading TIL posts like this. It's fun seeing someone learn new things, and sometimes I pick up something myself, or at least look at it in a new way.

      • cap11235 4 days ago

        Yeah, don't listen to parent. I like these sorts of articles a lot; it's only useless if you assume that everyone interested has also memorized the Python docs fully (which I imagine is zero people). Fun technical tangents are quite fun indeed.

      • zahlman 4 days ago

        What is "yossarian", BTW? I'd gotten confused thinking it was someone else's blog, because I naturally parse that as a surname.

        • woodruffw 4 days ago

          John Yossarian is the protagonist of Joseph Heller’s Catch-22[1], which was my favorite book in high school. Like a lot of people, my handle is a slightly embarrassing memorialization of my younger self :-)

          [1]: https://en.wikipedia.org/wiki/Catch-22

          • di 4 days ago

            Don't be embarrassed, it's a good book (and was my favorite too).

          • zahlman 4 days ago

            > Like a lot of people, my handle is a slightly embarrassing memorialization of my younger self :-)

            ... Guilty, actually.

    • rsyring 5 days ago

      Sometimes value is measured by awareness. I benefited from becoming aware of the behavior because of the article. Yes, it's in the docs, but the docs are not something I would have gone looking to read today.

    • diath 5 days ago

      The value of this article, to me, is that I'd never read the splitlines documentation, so this is a little detail that I just learned thanks to it being linked here.

    • happytoexplain 4 days ago

      I've been working with Python for a year or so now, and never knew this. I'm grateful to the author.

    • felipelemos 4 days ago

      For all of us who don't read all the documentation for every single method, tool, function or similar, it is, for the awareness alone, very useful.