CPython Internals Explained

(github.com)

215 points | by yufiz 6 days ago

53 comments

  • squeedles a day ago

    Had to write a fairly substantial native extension to Python a couple years ago and one of the things I enjoyed was that the details were not easily "Googleable" because implementation results were swamped by language level results.

    It took me back to the old days of source diving and accumulated knowledge that you carried around in your head.

    https://www.dave.org/posts/20220806_python/

    • davepeck a day ago

      I made some small contributions to CPython during the 3.14 cycle. The codebase is an interesting mix of modern and “90s style” C code.

      I found that agentic coding tools were quite good at answering my architectural questions; even when their answers were only half correct, they usually pointed me in the right direction. (I didn’t use AI to write code and I wonder if agentic tools would struggle with certain aspects of the codebase like, for instance, the Cambrian explosion of utility macros used throughout.)

      • squeedles a day ago

        This was around 2021 so AI code tools had not yet eaten everyone. One of the most interesting challenges was finding the right value judgements when blending multiple type systems. I doubt any agentic coding tool could do it today.

        I blended the python type system with a large low-level type system (STEP AIM low level types) and a smaller set of higher-level types (STEP ARM, similar to a database view). I already was familiar with STEP, so I needed to really grok what Python was doing under the covers because I needed to virtualize the STEP ARM and AIM access while making it look like "normal" Python.

        • davepeck a day ago

          Oh, that's very interesting work. And, yes, I'd also be surprised if (today's) agentic tools were at all helpful for that: it's way outside of distribution, and conceptual correctness truly matters.

    • throwaway81523 a day ago

      There's a file on docs.python.org explaining the C API. The rest is pretty straightforward, at least until recently when free threading was introduced (IDK about now). Main hassle is manually having to track reference borrowing etc. Understandable in Python 2, but another tragedy in Python 3.

    • EdwardDiego a day ago

      Great write up, thank you for sharing it! Quick question though, in your first code example (dynamic enum with a metaclass) what is "m" in this line towards the start?

          Py_DECREF(m)

      Is it the metaclass?

      • lozenge 18 hours ago

        That's a standard error clause. If PyImport_ImportModule raised a Python exception, you need to Py_DECREF any C local variables which are new references (not borrowed references) and return -1.

        From the later call PyModule_AddObject, it's clear this code has come from the PyInit_ module initialisation function. This code is running on import of the C extension to initialise the "FruitEnum" module attribute. https://docs.python.org/3/c-api/extension-modules.html#c.PyI...

        • squeedles 16 hours ago

          Exactly so. I didn't notice that missing def when I put together the blog post, but you are right to call it out. In this case that decref was copypasta from some other code -- I don't decref on the other error returns. I combined code that was in several places and omitted the decref for mod_enum too!

          The module init function is where you would normally create the module object (PyModule_Create) and decref it if an error occurs. The blog example is utility code that you would call within the module init function to add an enum.
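
          For readers without the blog post open, here is a rough pure-Python sketch of what such a helper does. It is not the blog's C code: the `add_fruit_enum` helper, the member names, and the `demo_ext` module are hypothetical, and `types.ModuleType` merely stands in for the module object that PyModule_Create would produce.

```python
import importlib
import types


def add_fruit_enum(module):
    """Create an enum at runtime and attach it to `module`.

    Returns 0 on success and -1 on failure, mirroring the C convention
    discussed above (in C, the failure path is also where any new
    references you still own would be Py_DECREF'ed).
    """
    try:
        # Python-level counterpart of PyImport_ImportModule("enum").
        enum_mod = importlib.import_module("enum")
        # Functional API: build the enum class from a name and a mapping.
        # The member names and values are made up for illustration.
        fruit = enum_mod.Enum("FruitEnum", {"APPLE": 1, "BANANA": 2, "CHERRY": 3})
        # Python-level counterpart of adding the object to the module.
        setattr(module, "FruitEnum", fruit)
    except Exception:
        return -1
    return 0


# Stand-in for the extension module object (hypothetical name).
demo_ext = types.ModuleType("demo_ext")
status = add_fruit_enum(demo_ext)
```

          In the C version every object created along the way is a new reference that has to be released on each error return; the Python version gets that for free, which is the bookkeeping being discussed here.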

          Someone should really create a blog post compiler to catch these sorts of things :-)

  • elcapitan a day ago

    This looks quite nice. I always wished there was something like "Ruby Under a Microscope" for Python (and other languages). It was quite instrumental for my deeper understanding of the language.

    • stonecharioteer a day ago

      There is.

      https://realpython.com/products/cpython-internals-book/

      But it's for 3.9 and doesn't cover the massive changes regarding delayed annotations and the GIL updates.

      The Ruby under a Microscope guy is updating it.

      • elcapitan 19 hours ago

        That's nice too, but it seems to be more a tour of the code base, and doesn't have the detailed diagrams of memory layout that the Ruby book and the one posted here have.

  • mvATM99 2 days ago

    Very interesting! Gonna look through this for sure in the next weeks

  • westurner a day ago

    vstinner's Python docs; "Unofficial Python Development (Victor's notes) documentation" > Garbage Collector > "Implement the GC protocol in a type": https://pythondev.readthedocs.io/garbage_collector.html#impl...

    Python Developer's Guide > "CPython's internals": https://devguide.python.org/internals/index.html

    Python/cpython//InternalDocs/README.md > "CPython Internals Documentation": https://github.com/python/cpython/blob/main/InternalDocs/REA...

    • westurner a day ago

      IDK why /InternalDocs/ instead of /Doc/internals/ ? ( `ln -s` works with Mac/Lin/WSL. )

      Ideally what's in InternalDocs/ would be built into the docs.python.org docs.

      Is it just that markdown support in sphinx is not understood to exist?

      Sphinx natively supports reStructuredText; Markdown, including MyST Markdown, is not supported out of the box. To support MyST Markdown in a Sphinx project, you must e.g. `pip install myst-parser` and add "myst_parser" to the extensions list in conf.py.
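
      A minimal conf.py fragment for that setup might look like the following. The extension name "myst_parser" is the one mentioned above; the project name and the `source_suffix` mapping are illustrative assumptions, not CPython's actual docs configuration.

```python
# conf.py (sketch): enable MyST Markdown alongside reStructuredText.
# Assumes the myst-parser package is installed in the docs build environment.

project = "example-docs"  # hypothetical project name

extensions = [
    "myst_parser",  # registers a parser for Markdown sources
]

# Map file suffixes to parsers so .rst and .md files can share one toctree.
source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}
```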

      MyST Markdown supports docutils and sphinx RestructuredText roles and directives: https://myst-parser.readthedocs.io/en/latest/syntax/roles-an...

      Directive in ReStructuredText .rst:

        .. directivename:: arguments
           :key1: val1
           :key2: val2

           This is
           directive content
      
      Directive in MyST Markdown .md:

        ```{directivename} arguments
        :key1: val1
        :key2: val2
        
        This is
        directive content
        ```
      
      RestructuredText Role, MyST Markdown Role:

        :role-name:`role content`
        {role-name}`role content`
      
      Sphinx resolves reference labels at docs build time, so that references will be replaced with the full relative URL to the path#fragment of the document where they occur; in ReStructuredText and then MyST Markdown:

        .. _label-name:
        (label-name)=
      
        :ref:`Link title <label-name>`
        {ref}`Link title <label-name>`

      • shakna a day ago

        > Ideally what's in InternalDocs/ would be built into the docs.python.org docs .

        If you expose internals in documentation, then people depend on internals.

        And when you break it, because it isn't meant to be tracked by any kind of API, there are wonderful groups who will sue you (usually under "devaluation").

        • westurner a day ago

          That's why three different procedures for docs?

          Python docs procedures: (0) Devguide, (1) PEPs w/ front matter in RST, (2) RST in /Doc with Sphinx, (3) MD and TXT in /InternalDocs without a toctree

          The `.. warning::` directive, or admonition directives in general, could be used to indicate that docs under /internals are not public API and can change with or without a PEP. That should also (or at least) be indicated in the source, unless it is already a given that APIs not marked public are not to be considered stable.

          • shakna a day ago

            How many, many times has a project said, "Don't use, internal only", only for it to become an industry-wide common "trick"?

            Saying "here be dragons" is not enough to discourage people whose job it is to be creative.

            • westurner 14 hours ago

              That's the bad kind of lazy.

              It is advantageous to format, interlink, and create a table of contents for documentation. I doubt anyone would advocate for removing the Makefile and conf.py from /Doc.

              That a document describes something internal does not mean that it should be excluded from the docs.

              Is there a consistent standard for whether something is internal or external to the CPython project? Aren't there already internal things documented in /Doc? Should they be removed from the docs build then? Or moved to an internal folder with a DRY additional toctree?

  • tonymet 2 days ago

    I wish they would just go back to calling it Python, since it’s the Python that everyone knows and uses. No one gets confused over Python the spec and Python the implementation. Every time I see “CPython” I have to double check we’re just talking about Python.

    I guess they “CPython’ed” back when people thought Jython would take off, and it never did because Java sucks.

    • vkazanov a day ago

      Just to name alternatives: CPython, PyPy, Jython, IronPython.

      Then, there are quite a few Python-likes out there.

      I wish they would stay precise.

      • appplication a day ago

        Yes, but no one is ever talking about pypy or jython implicitly. They are always mentioned by name because they represent <0.1% of all Python usage and are relegated essentially exclusively to niche or experimental use cases for power users.

        It’s a bit like arguing people should start saying “homo sapiens” when referencing “people” for added precision. It may be useful to anthropologists but the rest of us really don’t need that. Similarly, CPython is really only a sensible level of precision in a discussion directly about alternative Python implementations.

        (although in this case the original post is about implementation internals so I’d give it a pass)

        • rich_sasha a day ago

          This seems to be literally looking at the details of the C implementation of a Python interpreter. Exactly specifying the implementation makes sense here. You wouldn't say "how does the C++ compiler work" then look only at gcc.

          • tonymet a day ago

            C++ / g++ is not comparable because the original C++ reference compilers are not commercially popular today. No one is using Stroustrup's compilers.

            CPython is Python. Every time your buddy says “just download python” you are using CPython. There’s no reason to be pedantic.

            • rich_sasha 20 hours ago

              If you know enough about Python to look at how the dict is implemented, you also know the difference between Python and CPython. It's not a beginner's intro.

            • shakna a day ago

              g++ and clang are comparable. You need to specify the implementation.

        • tonymet a day ago

          I like this debate because it triggers everyone’s pragmatic frustration with the philosophy of language.

          Are things defined by the dictionary or by everyday experiences?

      • foresto a day ago
        • tonymet a day ago

          Another great example that no one would confuse for python.

      • tonymet a day ago

        CPython, pypy, jython are not alternatives.

        CPython is Python. The others are attempts.

        • tonymet a day ago

          I don’t think it’s good form to downvote people you disagree with.

          • palata a day ago

            I did not downvote, but I'm guessing that it is perceived as disrespectful to call them failures to the point where they don't even qualify as "alternatives".

            • nomel a day ago

              The word "failure" was never used.

              But, they are technically correct. The language is defined by CPython: it is the standard!!! None of the others fully meet that standard, which includes quirks! There are known trade-offs with them! They are, literally, attempts to adhere to that standard.

            • tonymet 12 hours ago

              It’s no slight to Jython. They fill an important niche. But Jython will never ever be confused with Python.

    • EdwardDiego a day ago

      I find it's usually referred to as CPython when discussing something specific to the implementation or internals of Python that doesn't apply to PyPy, which seems to be the alternative Python implementation with the most traction.

      No harm in being explicit, right? 'Tis part of the Zen of Python after all.
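
      When code actually needs to know which implementation it is running on, the standard library can tell it. A small sketch (the `describe_runtime` helper name is made up for illustration):

```python
import platform
import sys


def describe_runtime():
    """Return which Python implementation is running this code."""
    return {
        # Human-readable name, e.g. "CPython" or "PyPy".
        "implementation": platform.python_implementation(),
        # Lower-case identifier from sys.implementation, e.g. "cpython".
        "name": sys.implementation.name,
        # Version of the Python language this runtime implements.
        "language_version": platform.python_version(),
    }


runtime = describe_runtime()
is_cpython = runtime["implementation"] == "CPython"
```

      Code that touches reference counts or other interpreter internals is the usual reason to branch on `is_cpython`.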

    • paulddraper a day ago

      A lot to unpack there, but the language and the implementation are different.

      JavaScript and Node.js are different too.

    • palata a day ago

      I feel like when the goal is to talk about the internals of it, then it makes sense to call it CPython.

      In general, I never, ever see anyone saying "I will write a CPython script". Everybody says "Python" in my experience... do you see it differently?

      EDIT: I don't think that your opinion deserves to be downvoted, though...

    • mkoubaa a day ago

      Precision in language is important for software engineering.

  • damjon 2 days ago

    I've been comparing various platforms and discussing them with ChatGPT, for instance why Python's execution is slower than JavaScript's V8. It claimed this is due to technical debt and the inability to change because libraries like NumPy bypass public interfaces and access data directly.

    I'm wondering how much of that is true and what is just a hallucination.

    Btw: JavaScript seems to have similar complexity issues.

    Edit: Python has no JIT

    • gf000 a day ago

      If we are being very pedantic, languages don't have "speed", only implementations do.

      Of course, in real life there are de facto implementations, and language features give way to better/worse tradeoffs.

      With that out of the way, Python is basically the de facto glue language. It is very often used to provide a scripting API over lower level C libraries. To be ergonomic in this function, CPython (the major implementation) exposed some internal details of its execution model, which C libraries can reach into. This makes it very hard to make more aggressive optimizations; as one example, a C library can just increase/decrease the reference count of an object. Another design decision (that got some discussion recently) is the GIL (global interpreter lock) that makes Python much less competitive than something like Java. (JS also does a single thread of execution, though there are ways around it.)
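
      The reference-count point can be seen from Python itself. The sketch below is CPython-only and uses `ctypes.pythonapi` to call the same C API functions an extension would use; absolute counts vary between versions, so only the differences are meaningful.

```python
import ctypes
import sys

# ctypes.pythonapi exposes the C API of the running CPython interpreter.
inc_ref = ctypes.pythonapi.Py_IncRef
dec_ref = ctypes.pythonapi.Py_DecRef
for fn in (inc_ref, dec_ref):
    fn.argtypes = [ctypes.py_object]
    fn.restype = None

obj = object()  # a fresh, ordinary (non-immortal) object

before = sys.getrefcount(obj)
inc_ref(obj)  # what a C extension does with Py_INCREF
during = sys.getrefcount(obj)
dec_ref(obj)  # balance it again, as Py_DECREF would
after = sys.getrefcount(obj)

delta_up = during - before    # extra reference held "from C"
delta_back = after - before   # back to where we started
```

      Because any extension can do this at any time, the interpreter cannot easily swap reference counting for a different memory management scheme without breaking that code.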

      JS has a different use case, so access to the C world doesn't impose such restrictions on it.

      • mkoubaa a day ago

        Both you and the grandparent comment are correct. The implementation is slow because the API that it exposes is so leaky that implementation changes (for example a tracing garbage collector) are impossible to implement without changing the API, and the API cannot easily change because of the dependence of the ecosystem on it (e.g. numpy).

    • johndough a day ago

      > Edit: Python has no JIT

      There are quite a few JITs:

      JIT-compiler for Python https://pypy.org/

      Python enhancement proposal for JIT in CPython https://peps.python.org/pep-0744/

      And there are several JIT-compilers for various subsets of Python, usually with focus on numerical code and often with GPU support, for example

      Numba https://numba.pydata.org/numba-doc/dev/user/jit.html

      Taichi Lang https://github.com/taichi-dev/taichi

    • hmry a day ago

      > Edit: Python has no JIT

      In 3.14 and up you can enable JIT by setting the env var PYTHON_JIT=1
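
      To see what a given interpreter reports, something like the following can be used. It is a sketch that assumes the `sys._jit` introspection namespace present in CPython 3.14; on older versions or other implementations it simply reports the JIT as unavailable. The `jit_status` helper name is made up.

```python
import os
import sys


def jit_status():
    """Report what this interpreter says about its JIT, if anything."""
    jit = getattr(sys, "_jit", None)  # absent before 3.14 and on non-CPython
    if jit is None:
        return {"available": False, "enabled": False}
    return {
        "available": bool(jit.is_available()),  # was JIT support built in?
        "enabled": bool(jit.is_enabled()),      # is it on for this process?
    }


status = jit_status()
requested = os.environ.get("PYTHON_JIT")  # "1" if the env var was set as above
```

      If `available` is False, setting the environment variable has no effect on that build.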

      • brcmthrowaway a day ago

        Who made this JIT? Facebook?

        • simonw 19 hours ago

          Facebook engineers (most notably Sam Gross) contributed a lot of the no-GIL work: https://lwn.net/Articles/939981/

          Much of the initial work on the JIT came from Microsoft's Faster CPython team: https://lwn.net/Articles/1029307/

        • hmry a day ago

          Lots of people. Several people from Arm and Microsoft, various PhD students... I don't know if anyone working at Facebook worked on the JIT, maybe they did.

    • ayhanfuat 2 days ago

      It is not that numpy bypasses public interfaces. It uses documented C APIs. V8, as far as I know, does not have that.

      • wk_end a day ago

        V8 itself might not, but, say, Node does and that doesn't torpedo performance. Was Node-API just better designed than Python's FFI?

        • kccqzy a day ago

          My understanding is that Node still doesn’t give you low-level C APIs into the language itself. It gives you JavaScript APIs that call into I/O libraries (libuv basically).

          In Python it’s not hard to write a module in pure C that manipulates other Python objects. This means the representation of Python objects has to be stable enough for the C code. V8 does not allow that.

          • wk_end a day ago

            I haven’t tried it myself but I don’t think that’s the case. See the documentation here:

            https://nodejs.org/api/n-api.html

            I’ve only skimmed this, but it sure sounds like it lets you write C code that operates on JS objects. In fact, it explicitly says “APIs exposed by Node-API are generally used to create and manipulate JavaScript values.”

    • g947o a day ago

      As someone who has many times dived into deep rabbit holes like this (e.g. how does JavaScript's prototype-based class work?), some effective ways to handle this are to ask follow-up questions, use web search, or ask for references. Deep search also helps. Often it corrects itself or takes back claims that have no basis. At the very least, it provides references that you can read yourself.

      Of course, you can't really do all of that on a free plan.

      That's far from ideal, but if you are motivated and care about these technical details (which you probably do), you can get pretty good results.

      =====

      Putting all of this aside, you can sometimes find YouTube videos on obscure channels that talk about these things. Chances are that someone who cares to make a YouTube video about these hardcore topics knows what they are talking about.

    • palata a day ago

      For what it's worth, I really don't get the downvotes. I think it is an interesting question, and it brought interesting answers.

      No clue if that's the reason for the downvotes, but maybe next time don't mention ChatGPT and just formulate this as "From what I read, [...]".