49 comments

  • heydenberk 11 hours ago

    Jim Woolsey, a hippie and early-ish computer hacker from New Hope, Pennsylvania, was an important and early force in the digitization of the Tibetan language. This interview[0] with him from 1993 is a fascinating time capsule, and interesting in its own right. He was a family friend and I always admired his singular commitment to this important and underappreciated work.

    [0] https://www.mcall.com/1993/10/08/new-hope-man-computer-guru-...

  • stahorn 2 hours ago

    "... relatively short paragraphs (possibly up to a few pages)"

    I love things like this that show me how much I view the world from a certain perspective. I don't think I've ever had a paragraph fill even one page! The closest I know of is some writer, whose name I've forgotten, who had several pages of stream of consciousness that I think was without paragraphs or punctuation.

  • skybrian 13 hours ago

    Apparently "documents have reasonably short paragraphs" should be added to "falsehoods programmers believe about text."

    • khaled 3 hours ago

      In some countries, legal documents are required to not have any paragraph breaks, so you can have a document with one paragraph spanning hundreds of pages. OpenOffice has a hard limit of 65534 characters per paragraph, and it took LibreOffice quite some work to lift it: https://bugs.documentfoundation.org/show_bug.cgi?id=30668

    • pbronez 12 hours ago

      I never thought about this element of cross language structure before. Text direction, diacritics, punctuation, sure - but I always assumed that chunking was universal. Turns out no:

      “the typographical notion of the paragraph does not really exist in a Tibetan text the way it does in European languages. As a result, Tibetan texts often need to be processed as a long stream of uninterrupted text with no forced line breaks, sometimes over hundreds or thousands of pages.”

      • teractiveodular 10 hours ago

        The same applies to old Chinese, and in fact most ancient languages. Latin and Greek were originally written in scriptio continua, meaning no punctuation or spacing:

        https://en.wikipedia.org/wiki/Scriptio_continua

      • crazygringo 10 hours ago

        Tens of pages, sure.

        But hundreds? Thousands?

        Do they not have the concepts of headers? Sections? Chapters?

        Both in non-fiction and fiction, there are a lot more means of content separation than just paragraphs.

        • noisy_boy 3 hours ago

          Stream processing before it was cool.

        • lazide 6 hours ago

          ‘I don’t have time to make a short letter, so I made a long one instead’ -Mark Twain [https://www.goodreads.com/quotes/21422-i-didn-t-have-time-to...]

          That kind of organization takes time, editing, multiple revisions to get right, etc.

          And a mindset that it is useful. In many cases (like if you’re a religious caste), having a giant wall of text that requires skill to identify elements from, is a plus.

          Do you want your ciphertexts formatted into paragraphs too?

          • mmooss 4 hours ago

            Are you really saying that Tibetan doesn't use spaces and paragraphs because the creators or perhaps all Tibetans don't want to spend time editing or can't organize their writing well enough?

            • lazide 3 hours ago

              Not saying can’t. Arguably, there isn’t anything anyone can’t do. At least until they go broke.

              I’m saying that different societies have different priorities/expectations/motivations, and there is clearly a reason they don’t do it, or it wouldn’t be so consistent eh?

              It’s not like white space isn’t the default on a writing surface.

              Do you have any alternative theories?

              I could also imagine scrolls being expensive, so ‘fluff’ like white space is discouraged, and not easily re-used or overwritten based on the inks, so re-editing or the like doesn’t actually work.

              But I’m just speculating here.

              Edit: the scriptio continua link above had a good reference to Greek/Roman examples, which were transcriptions by slaves of spoken monologues. They didn’t have paragraphs or the like because people don’t speak in paragraphs.

              They also don’t edit their words when they speak, and rarely do ‘chapters’.

              Footnotes, bibliographic references, etc. also don’t really make sense in the way we might think if it’s ’a record of spoken words’ vs ‘words representing ideas on their own’.

              So writing was used more for transcriptions of famous speeches or lectures, and less for standalone, independent works.

              And I assume every culture has the equivalent of 2 hour long speeches that could have been a one page email.

              “Before and after the advent of the codex, Latin and Greek script was written on scrolls by slave scribes. The role of the scribes was to simply record everything they heard to create documentation. Because speech is continuous, there was no need to add spaces.”

        • DougMerritt 10 hours ago

          "Continued on next scroll"

    • AlienRobot 9 hours ago

      Somewhere, a programmer created a 4096 character buffer and sought the next '\n' only to be defeated by Tibetan.
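
      For illustration only, here is a minimal sketch of the alternative: scan for '\n' in fixed-size chunks while letting the pending record grow as long as it needs to. The function name `iter_lines` and the 4096-byte default are hypothetical choices for this example, not taken from any real codebase.

```python
import io
from typing import BinaryIO, Iterator


def iter_lines(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield newline-separated records of any length from a binary stream."""
    pending = bytearray()
    scanned = 0  # prefix of `pending` already known to contain no newline
    while chunk := stream.read(chunk_size):
        pending += chunk
        # Resume the search where the last one stopped, so a very long
        # record is not rescanned from its start on every chunk.
        while (end := pending.find(b"\n", scanned)) != -1:
            yield bytes(pending[:end])
            del pending[: end + 1]
            scanned = 0
        scanned = len(pending)
    if pending:  # final record without a trailing newline
        yield bytes(pending)


if __name__ == "__main__":
    paragraph = ("བོད་ཡིག་" * 5000).encode("utf-8")
    records = list(iter_lines(io.BytesIO(paragraph + b"\nshort\n")))
    print([len(r) for r in records])
```

      The chunk size only bounds how much is read at once; the record itself is never truncated, so a paragraph far larger than the read buffer comes out whole.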

  • InDubioProRubio an hour ago

    Languages are very interesting beasts. Either they are easy to learn and marshal communication across large swaths of the world, or they are hard to master and allow very complex constructs and ideas to be built and then transported from one speaker to another. Where on that spectrum does the Tibetan language fall?

  • hyperhello 15 hours ago

    This has been in the works for a while. There is an old HyperCard stack to teach Tibetan pronunciation (with 16-bit sound) you can try: https://hcsimulator.com/Learn-Tibetan

    • fsckboy 12 hours ago

      the only vowel is AH ?

      • shanekandy 11 hours ago

        In text, the singular vowels are built on the ah syllable with modifying marks.

      • cosignal 12 hours ago

        The site seems incomplete. Tibetan does have 5 vowels, and it looks like the non-intrinsic vowels are written in the bottom section of the view, but I can't get them to work. I assume the intention is that you click one of the other vowels to toggle it, but it no worky.

        • hyperhello 12 hours ago

          I don't know who created it, or if it was part of a larger proto-Duolingo language product.

  • java-man 15 hours ago

    I want to know the details how they achieved it (the support for super-long paragraphs, or rather, the absence thereof).

    Does anyone know?

    • l1n 15 hours ago

      https://gerrit.libreoffice.org/c/core/+/172801

      Pretty short change for reducing O(n^2) impact with a cache.

      This change includes the following scalability improvements for documents containing extremely large paragraphs:

      - Reduces the size of layout contexts to account for LF control chars.

      - Due to typical access patterns while laying out paragraphs, VCL was making O(n^2) calls to vcl::ScriptRun::next(). VCL now uses an existing global LRU cache for script runs, avoiding much of this overhead.
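
      To make the second bullet concrete, here is a toy sketch of the general idea, not LibreOffice's actual code (VCL is C++ and its cache is its own): if a layout loop asks for a paragraph's script runs once per line, each request is O(n) and the total is O(n^2) unless the result is cached. Everything below (`script_of`, `script_runs`, `layout`, the three-way script classifier, the fixed-width line breaking) is a hypothetical simplification; only the use of an LRU cache for script runs comes from the commit message.

```python
from functools import lru_cache
from typing import List, Optional, Tuple

Run = Tuple[int, int, str]  # (start, end, script tag)


def script_of(ch: str) -> str:
    """Toy classifier: Tibetan block, ASCII letters, or 'Zyyy' (common)."""
    if "\u0f00" <= ch <= "\u0fff":
        return "Tibt"
    if ch.isascii() and ch.isalpha():
        return "Latn"
    return "Zyyy"


@lru_cache(maxsize=64)
def script_runs(text: str) -> Tuple[Run, ...]:
    """Split text into maximal single-script runs in one O(n) pass."""
    runs: List[Run] = []
    start = 0
    current: Optional[str] = None
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script == "Zyyy":  # neutral characters join the current run
            continue
        if current is None:
            current = script
        elif script != current:
            runs.append((start, i, current))
            start, current = i, script
    if text:
        runs.append((start, len(text), current or "Zyyy"))
    return tuple(runs)


def layout(text: str, line_width: int) -> List[List[Run]]:
    """Break text into fixed-width lines, looking up runs for every line."""
    lines = []
    for pos in range(0, len(text), line_width):
        end = min(pos + line_width, len(text))
        runs = script_runs(text)  # cache hit on every line after the first
        lines.append([(max(s, pos), min(e, end), tag)
                      for s, e, tag in runs if s < end and e > pos])
    return lines
```

      Calling `script_runs.cache_info()` after `layout(...)` shows one miss followed by a hit per remaining line; without the decorator, every line would trigger a full pass over the paragraph.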

      • buovjaga 3 hours ago

      • java-man 15 hours ago

        Thank you. Also https://bugs.documentfoundation.org/show_bug.cgi?id=92064

        I lack the context - are they still laying out the widths of characters when wrapping?

        • the_mitsuhiko 4 hours ago

          Probably shows a bit how little that software is used with Tibetan text if this bug was able to stay open for almost 10 years for what ultimately was a 5-line fix.

          • khaled 3 hours ago

            The fix looks like a 5-line fix because it is the last step in a very long process of optimizing LibreOffice text layout that started years ago. This 5-line fix would not have been possible 10 years ago simply because the code it is fixing didn't exist back then.

          • mmooss 4 hours ago

            > Probably shows a bit how little that software is used with Tibetan text

            ... by the LibreOffice devs in Indo-European-speaking countries.

            • the_mitsuhiko 4 hours ago

              Apparently by anyone if the bug description is accurate. Seemingly one cannot open sufficiently long documents let alone write into them.

              • mmooss 3 hours ago

                Perhaps. From the article:

                > So long as LibreOffice could not handle long paragraphs there was essentially no free tool to publish Tibetan.

  • wslh 15 hours ago

    With all due respect, the innovation side of Tibetans is also appreciated in "The Nine Billion Names of God" [1].

    [1] <https://en.wikipedia.org/wiki/The_Nine_Billion_Names_of_God>

    • dymk 14 hours ago

      Unsong takes inspiration from this as well -

      https://unsongbook.com/

    • asimovfan 14 hours ago

      I don't know how it is phrased in the book itself, but in Tibetan Buddhism there is no god. And their innovation goes far beyond this book (at least judging by the plot summary on Wikipedia).

      • wslh 13 hours ago

        If I were a Tibetan Buddhist, I might say we were just having some fun with Arthur C. Clarke's imagination.

    • sol2070 13 hours ago

      Classic!

  • einpoklum 11 hours ago

    Hey everyone, I'm Eyal, a LibreOffice project volunteer who does a lot of QA regarding Right-to-Left and Complex-Text-Layout scripts (= written languages). I want to thank thunderbong3 for posting a link to that post - and heartily thank Jonathan Clark, the new RTL-CTL-CJK-focused developer at The Document Foundation, who implemented the performance improvement for Tibetan.

    Most bugs we encounter and report in LibreOffice are more general and aren't script-specific (e.g. code which forgets that the content may be right-to-left, resulting in wrong behavior in those cases); and a lot of the script-specific bugs are about the most popular script, which is Arabic (which is also used for Farsi, Urdu, Javanese etc.).

    But we do have some issues regarding less-commonly-used scripts, like Tibetan or Mongolian. Here:

    https://bugs.documentfoundation.org/show_bug.cgi?id=115607

    is the meta-bug which tracks issues with: Mongolian, Tibetan, Uyghur, Zhuang, Kazak, Xibo, Dai, Yi, Miao, Jingpo, Lisu, Lahu, Wa, etc.

    We don't know if there are really very few issues specific to those languages (which is quite possible), or whether it's just that they're not used so much and the users aren't motivated enough to file bugs.

    Still, as Jonathan's recent fix demonstrates, there is certainly the interest to address them when developer-time-resources become available.

    I would like to encourage everyone who cares about these scripts, and "document editing fairness" across countries and cultures, to consider:

    1. Try using LibreOffice with any such language that you know at least a little of - and if you find any bugs, file them at our Bugzilla: https://bugs.documentfoundation.org/

    2. Consider supporting The Document Foundation, which manages the LibreOffice project, financially:

    https://www.libreoffice.org/donate/

    We are one of the larger FOSS projects in the world, with tens of millions of regular users (if not > 100 million) and a board of trustees with members from dozens of countries; but we don't have large corporations investing money or time in the project. While a few commercial companies do contribute to LibreOffice (like Collabora and Allotropia), many fundamental issues are not close enough to their customers' needs, which is why it was decided to hire Jonathan directly to give RTL-CTL-CJK support a boost. Individual user donations are what enable this work.

    • mmooss 3 hours ago

      Hi Eyal - Your hard work as a volunteer does so much for so many - look at that blog post, for example. I really admire it.

      • buovjaga 3 hours ago

        The bug report about long paragraphs and the blog post are by Élie Roux, the CTO of BDRC :)

      • einpoklum 2 hours ago

        I did not author that blog post... I just noticed it as a Hacker News reader :-)

        I did give a talk on the state of Right-to-Left language support at the annual LibreOffice conference, a few days ago:

        https://events.documentfoundation.org/libreoffice-conference...

    • whereistimbo 5 hours ago

      I would appreciate it if you supported Dzongkha as well!

      • buovjaga 3 hours ago

      • einpoklum 2 hours ago

        Well, the "minimal" support is there, as buovjaga noted - but... we need you to tell us which aspects of that support are missing - by filing bugs, or at the very least by talking to us about this (for example, there are "LibreOffice RTL" and "LibreOffice CJK" groups on Telegram).

  • xmly 11 hours ago

    There are over 50 Tibetic languages. Which one do you choose?

  • soheil 7 hours ago

    > first-class citizen

    Why are we anthropomorphizing languages now? Are there really no appropriately-charged words to describe what we wanna say without resorting to political speech?

    • vinay427 7 hours ago

      This is a well-established phrase in computer science and programming languages, and it’s likely that its use here is meant to be evocative of those principles rather than of an anthropomorphic sense.

      https://en.wikipedia.org/wiki/First-class_citizen

      • soheil 7 hours ago

        Well, let's not use it anymore if we agree on principle. Computer science itself is a new field, so I'm not sure how deeply entrenched it really is.

        • scooke 4 hours ago

          Can you give an example or two of the type of terminology you have in mind? Something like first tier, or high level, or primary, or what? Almost any term that I could think of still carries with it a sense of politics. Perhaps someone from a first-class-citizen language, like English or French, doesn’t really notice this dynamic, but someone who uses these other languages surely notices the lack of importance, or urgency, when they can’t live digitally with their own language. Whether or not they use the term “political”, it’s still there. But I’m curious to see what your suggestions might include.