In the JS world, tracing was abandoned because it didn't scale to real world code.
JS JITs (the production ones, like JSC's) have no such thing as trace blockers that prevent the surrounding code from being optimized. You might have an operation (like a call to some wacky native function) that is itself not optimized, but that won't have any impact on the JIT's ability to optimize the code surrounding that operation.
Tracing is just too much of a benchmark hack overall IMO. Tracing would only be a good idea in a world where it's too expensive to run a real optimizing JIT. But the JS, Java, and .NET experiences show that a real optimizing JIT - with all of its compile-time costs - is exactly what you want, because it results in predictable speed-ups.
PyPy's tracing JIT has worked pretty well for years, so that doesn't seem universal. I admit I have fairly low expectations of a JIT succeeding (in any meaningful fashion) in CPython given the constraints it currently has, but I'm generally a skeptic, so maybe I'm being overly pessimistic.
What does “worked pretty well” really mean though?
When we talk about JS or Java JITs working well, we are making statements based on intense industry competition where if a JIT had literally any shortcoming then a competitor would highlight it in competitive benchmarking and blog posts. So, the competition forced aggressive improvements and created a situation where the top JITs deliver reliable perf across lots of workloads.
OTOH PyPy is awesome but just hasn’t had to face that kind of competitive challenge. So we probably can’t know how far off from JS JITs it is.
One thing I can say is when I compared it to JSC by writing the same benchmark in both Python and JS, JSC beat it by 4x or so.
I think the Java JITs are a better comparison because the workload is more similar: JavaScript is weird for how it’s expected to start in a fraction of a second and soak up a huge bolus of code which may substantially never be used whereas most of the performance-sensitive Python code stabilizes quickly and loads what it uses really early on.
The Java JITs and most JavaScript JITs operate in essentially the same way. The core difference is that the Java language spec sets up a whole bunch of requirements that need to be resolved at startup and are easy to trigger.
For example, static initialization of classes. The JDK has a huge number of classes, and on startup a not-insignificant fraction of those end up getting loaded for all but the simplest applications.
Essentially, both the Java and JS JITs initially run everything interpreted, and when a hot method is detected they progressively send those methods and their profiling statistics to more aggressive JIT compilers.
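That tiered pipeline (interpret everything, promote hot methods to a compiler) can be sketched as a toy in Python. The class, threshold value, and tier names below are invented for illustration and not taken from any real VM:

```python
# Toy sketch of tiered execution: run everything "interpreted" at first,
# then promote functions that cross an invocation threshold.
# All names and numbers here are illustrative, not from a real VM.

HOT_THRESHOLD = 3  # real VMs use much larger, profile-driven thresholds

class TieredVM:
    def __init__(self):
        self.counts = {}    # per-function invocation counters
        self.compiled = {}  # functions promoted to the "optimized" tier

    def call(self, fn, *args):
        if fn in self.compiled:
            return self.compiled[fn](*args)   # optimized tier
        self.counts[fn] = self.counts.get(fn, 0) + 1
        if self.counts[fn] >= HOT_THRESHOLD:
            # Stand-in for handing the method + its profiling data to a JIT;
            # here "compiling" just moves it to the other tier.
            self.compiled[fn] = fn
        return fn(*args)                      # "interpreter" tier

vm = TieredVM()

def square(x):
    return x * x

for i in range(5):
    vm.call(square, i)

assert square in vm.compiled  # promoted after HOT_THRESHOLD invocations
```

A real VM would of course also feed the recorded type/branch statistics into the compiler, which is what makes the upper tiers worth their cost.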
A non-insignificant amount of effort is being spent trying to make Java start faster, and a key portion of that is resolving the class-loading problem.
All JVMs have options to JIT straight away, although that comes with other tradeoffs.
All commercial JVMs have had JIT caches for quite some time, and this is finally also available as free beer on OpenJDK, so code can execute right away as if it were an AOT-compiled language.
In some of those implementations, the JIT cache gets updated after each execution, taking profiling data into account, so we have the possibility of reaching an optimal state across the lifetime of the executable.
The .NET and ART cousins also have similar mechanisms in place.
Which I guess is what your last sentence refers to, but I wasn't sure.
> Which I guess is what your last sentence refers to, but I wasn't sure.
Yup, the CDS and now AOT stuff in OpenJDK (Project Leyden) is what I was referring to.
Yes, my thought was that the lifecycle is different. The average JVM probably runs for days, so a huge percentage of the total runtime is in code which has been aggressively optimized by the JIT, whereas a lot of JavaScript isn't used enough to reach that point, so their respective developers are going to have different tuning goals. I'd expect Python to be closer to Java in that regard, with some features that are harder to optimize than Java's, but fewer than JavaScript's, owing to the richer language and better typing.
Sure. I guess I'd point out that even with the long lifecycle, optimization of live paths happens quickly. It doesn't take too many invocations before the JVM optimizes.
That's similar to how js does things.
Java does have a "client" optimization mode for more short-lived programs (like GUIs, for example), and AFAIK it's basically unused at this point. The more aggressive "server" optimizations are faster than ever and get triggered pretty aggressively now. The nature of the JVM is also changing: with fast scaling and containerization, a slow start and long warmup aren't good. That's why part of JDK development has been dedicated to resolving that.
I don't know how JSC handles it, but in SM `eval` has significant negative effects on surrounding code. (We also decline to optimize functions containing `with` statements, but that's less because it's impossible and more because nobody uses them.)
Last I saw (and I admit this is pretty dated) V8 was doing the same thing. try/catch at one point in V8 would cause the surrounding method to be deoptimized.
Yeah, SM will compile functions with try/catch/finally, but we don't support unwinding directly into optimized code, so the catch block itself will not be optimized.
JSC will still JIT optimize functions that use eval.
It’s true that there are some necessary pessimizations, but nothing as severe as failing to optimize the code at all.
I remember Mozilla tried a tracing JIT with TraceMonkey and ultimately lost to V8.
But LuaJIT is also a tracing JIT, which seems to work well enough.
LuaJIT has no V8-like competitor that runs Lua
I've heard that LuaJIT's perf is no more stable than Mozilla's tracing JIT's was, and I've heard plenty of stories about how flaky LuaJIT's performance is. But we can't know how good it really is due to the lack of competitors.
> Knowledge transfer worked in both ways: I learned a lot about the internal details of CPython's JIT, and conversely I shared with them some of the experience, pain points and gut feelings which I got by working many years on PyPy.
Can cross fertilization between PyPy and CPython JIT efforts help already fast PyPy to get even faster? Like, did CPython JIT team try something PyPy developers didn't attempt before?
PyPy is awesome, btw.
From many points of view, what CPython JIT can do is a subset of what PyPy can do.
The biggest differences between the two JITs are:
1. PyPy is meta-tracing, CPython is tracing.
2. PyPy has "standard" code generation backends, CPython has copy&patch.
3. CPython so far uses "trace projection" while PyPy uses "trace recording".
(1+2) make the CPython JIT much faster to compile and warm up than PyPy, although I suspect that most of the gain is because of (1). However, this comes at the expense of generality, because in PyPy you can automatically trace across all builtins, whereas in CPython you are limited to the bytecode.
Trace projection looked very interesting to me because it automatically solves a problem which I found everywhere in real-world code: if you do trace recording, you don't know whether you will actually be able to close the loop, so you must decide to give up after a certain threshold ("trace too long"). The problem is that there doesn't seem to be a threshold which is generally good, so you always end up either tracing too much (big warmup costs, plus you are literally doing unnecessary work) or not enough (the generated code is less optimal, sometimes by 5-10x).
With trace projection you decide which loop to optimize "in retrospect" so you don't have that specific problem. However, you have OTHER problems (in particular that you don't know the actual values used in the trace) which makes it harder to optimize, so CPython JIT plans to switch to trace recording.
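The "trace too long" problem described above can be illustrated with a toy trace recorder in Python. The bytecode encoding (`{pc: (opname, next_pc)}`) and all names are hypothetical; the point is only that the recorder either reaches the loop's back-edge or burns the threshold and gives up:

```python
# Toy trace recorder illustrating the "trace too long" abort.
# Bytecode is modeled as {pc: (opname, next_pc)}; everything is made up.

TRACE_LIMIT = 100  # the threshold that is hard to pick well

def record_trace(bytecode, start, limit=TRACE_LIMIT):
    """Record ops along the path actually taken until the loop closes,
    or give up once the trace exceeds `limit` ("trace too long")."""
    trace, pc = [], start
    while len(trace) < limit:
        opname, next_pc = bytecode[pc]
        trace.append(opname)
        pc = next_pc
        if pc == start:   # back-edge reached: loop closed, trace is compilable
            return trace
    return None           # threshold hit: all the recording work is wasted

# A tight 3-op loop closes immediately...
loop = {0: ("load", 1), 1: ("add", 2), 2: ("jump", 0)}
assert record_trace(loop, 0) == ["load", "add", "jump"]

# ...while a long straight-line path blows the threshold and is abandoned.
long_chain = {i: ("op", i + 1) for i in range(200)}
assert record_trace(long_chain, 0) is None
```

A smaller `limit` abandons more real loops (less optimal code); a larger one records more dead ends (more warmup cost), which is exactly the tradeoff described above.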
Interesting. If PyPy is this capable, I wonder whether an effort to make non-CPython implementations of Python compatible with C extensions would have been preferable to putting the time and effort into implementing a JIT in CPython.
pypy IS compatible with C extensions, via what we call "cpyext": https://doc.pypy.org/en/latest/faq.html#do-c-extension-modul...
The problem of cpyext is that it's super slow, for good reasons: https://pypy.org/posts/2018/09/inside-cpyext-why-emulating-c...
There are efforts to create a new C API which is more friendly to alternative implementations (including CPython itself, when they want to change how they do things internally): https://hpyproject.org/ https://github.com/py-ni
I'm not optimistic. A simple loop summing numbers in a list runs at least 30 times slower in CPython than in PyPy, GraalPy, or Node.js, regardless of whether the JIT is enabled. I've watched and read everything there is to know about Python performance, and I'm afraid that without Python 4 (which won't be coming), there won't be much to gain. PyPy is great, but I'm afraid it won't be supported for long, and I wouldn't bet on it for a serious project. I migrated my own project to JS and I am rather satisfied (certainly with the performance).
I wonder how much Julia’s JIT could help inform Python’s.
Diagram: https://docs.julialang.org/en/v1/devdocs/img/compiler_diagra...
Documentation: https://docs.julialang.org/en/v1/devdocs/eval/
From what I understand, Julia doesn’t do any tracing at all; it just compiles each function based on the types it receives. Obviously Python doesn’t have multiple dispatch, but that actually might make compilation easier. Swap out the LLVM step for Python's IR and they could probably expect a pretty substantial performance improvement. That said, I don’t know anything about compilers, I just use both Python and Julia.
This is called a method-based JIT, and it is generally the more common approach to JIT compilation. A tracing JIT is a deliberate design choice that is quite different from method-based JITs.
That makes sense. Why exactly would python need a tracing JIT instead of a method one though? It seems like either could work.
(author of the blog post here) "just compile when you know the types" is not a good strategy for Python. My EuroPython talk "Myths and fairy tales about Python performance" explains many reasons why Python is VERY hard to optimize: https://lwn.net/Articles/1031707/
One big advantage of tracing JITs is that they are generally easier to write and to maintain. For the specific case of PyPy, it's actually a "meta-tracing JIT": you trace the interpreter, not the underlying program, which TL;DR means that you can write the interpreter (which is "easy") and you get a JIT compiler for free. The basic assumption of a tracing JIT is that you have one or more "hot loops" in which one (or a few) fast paths are taken most of the time.
If the assumption holds, tracing has big advantages, because you eliminate most of the dynamism and you automatically inline across multiple layers of function calls, which in turn makes it possible to eliminate the allocation of most temporary objects. The problem is that the assumption doesn't always hold, and that's where you start to get problems.
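The "fast path most of the time" idea can be sketched in a few lines of Python. The `SideExit` mechanism and the int-only assumption below are invented for the example: the "trace" is specialized to the common case and bails back to the generic path when a guard fails:

```python
# Minimal illustration of a specialized trace with guards and side-exits.
# SideExit and the int-only specialization are made up for this sketch.

class SideExit(Exception):
    """Raised when a guard in the specialized code fails."""

def traced_sum(xs):
    # "Trace" recorded under the assumption that every element is an int.
    total = 0
    for x in xs:
        if type(x) is not int:   # guard inserted by the tracer
            raise SideExit(x)
        total += x               # specialized int addition, no dynamic dispatch
    return total

def run(xs):
    try:
        return traced_sum(xs)    # stay on the fast path while guards hold
    except SideExit:
        return sum(xs)           # deoptimize: fall back to the generic path

assert run([1, 2, 3]) == 6       # fast path throughout
assert run([1, 2.5]) == 3.5      # guard fails, generic path takes over
```

When the guards almost never fail, the specialized loop wins; when they fail often (the assumption doesn't hold), you pay for both the trace and the fallback, which is where the problems mentioned above come from.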
But method JITs are not THE solution either. Meta has a whole team developing Cinder, a method JIT for Python, but they had to introduce what they call "static python", an opt-in set of constraints that removes some of Python's dynamism to make the JIT's job easier.
Finally, as soon as you call any C extension, any JIT is out of luck and must deoptimize to present a "state of the world" which is compatible with what the C extension expects to find.
IIRC Julia's is a particularly simple method-based JIT for a dynamically-typed language.
I'm not sure exactly how it differs from most JavaScript JITs, but I believe it just compiles each method once for each set of function argument types - for example, it doesn't try to dynamically determine the types of local variables.
I think the compiler figures that if a new method comes up, it’ll just compile that when it needs to.
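That per-signature scheme can be mimicked in a few lines of Python. The `specialize` decorator below is a made-up stand-in where "compiling" just records an entry per tuple of argument types, rather than generating type-specialized machine code the way Julia does:

```python
# Sketch of per-signature specialization: one "compiled" version per
# tuple of argument types. The decorator name and cache are illustrative.
import functools

def specialize(fn):
    cache = {}  # maps (type, type, ...) -> the specialized version

    @functools.wraps(fn)
    def dispatcher(*args):
        sig = tuple(type(a) for a in args)
        if sig not in cache:
            # A real JIT would emit type-specialized code here;
            # we just record that a specialization was "compiled".
            cache[sig] = fn
        return cache[sig](*args)

    dispatcher.specializations = cache  # expose the cache for inspection
    return dispatcher

@specialize
def add(a, b):
    return a + b

add(1, 2)        # "compiles" the (int, int) specialization
add(1.0, 2.0)    # "compiles" (float, float)
assert len(add.specializations) == 2
```

Note this also shows the point made upthread: the signature only covers the arguments, so a new argument-type combination triggers a new compilation the first time it is seen.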
If you can't do it, let the experts do that job and adopt their work. PyPy has done that for ages and works very well.