How does this differ from direct threading interpreters?
It seems like it solves the same problem (saving the function call overhead) and has the same downsides (requires non-standard compiler extensions)
EDIT: it seems the answer is that compilers do not play well with direct-threaded interpreters and they are able to perform more/better optimizations when looking at normal-sized functions rather than massive blocks
http://lua-users.org/lists/lua-l/2011-02/msg00742.html
Unfortunately, most discussion of direct threaded interpreters confuses the implementation techniques (e.g. computed gotos) with the concepts (tail calls, or duality between calls and returns and data and codata, depending on your point of view). What is presented here is conceptually a direct threaded interpreter. It's just implemented in a way that is more amenable to optimization by the compiler technology in use.
(More here: https://noelwelsh.com/posts/understanding-vm-dispatch/)
This is a great summary. When Mike wrote the message you linked, his conclusion was that you have to drop to assembly to get reasonable code for VM interpreters. Later we developed the "musttail" technique which was able to match his assembly language sequences using C. This makes C a viable option for VM interpreters, even if you want best performance, as long as your compiler supports musttail.
> they are able to perform more/better optimizations when looking at normal-sized functions rather than massive blocks
It's not the size of the function that is the primary problem, it is the fully connected control flow that gums everything up. The register allocator is trying to dynamically allocate registers through each opcode's implementation, but it also has to connect the end of every opcode with the beginning of every opcode, from a register allocation perspective.
The compiler doesn't understand that every opcode has basically the same set of "hot" variables, which means we benefit from keeping those hot variables in a fixed set of registers basically all of the time.
With tail calls, we can communicate a fixed register allocation to the compiler through the use of function arguments, which are always passed in registers. When we pass this hot data in function arguments, we force the compiler to respect this fixed register allocation, at least at the beginning and the end of each opcode. Given that constraint, the compiler will usually do a pretty good job of maintaining that register allocation through the entire function.
I feel like using calling conventions to massage the compiler's register allocation strategy is a hack. If the problem is manual control over register allocation, then the ideal solution should be... well, exactly that and no more? An annotation for local variables indicating "always spill this" (for cold-path locals) or "never spill this or else trigger a build error" (for hot-path locals). Isn't that literally why the "register" keyword exists in C? Why don't today's C compilers actually use it?
thanks for the explanation!
>and has the same downsides (requires non-standard compiler extensions)
It's not a downside if:
(a) you have those non-standard compiler extensions in the platforms you target
(b) for the rest, you can ifdef an alternative that doesn't require them
Recent discussion: https://news.ycombinator.com/item?id=42999672
Do check out the articles in the topmost comment about how tail call optimization gets you faster interpreters.
It completely eliminates the overhead of function calls in the generated machine code, while you still write your code modularly using functions.
Yes, that is the same article linked in the first sentence of this "update" article. :)
I published this technique four years ago, and it's very exciting to see that others have taken up the cause and done the work to land it in CPython.
I think this technique has been known since the 1970s as "direct threaded code".
`return goto f()` syntax in C seems interesting
I had a similar idea that Python could have `return from f()` to support tail calls without the issues raised about implicit tail calls
To read about the basics of tail calls optimization:
https://blog.reverberate.org/2021/04/21/musttail-efficient-i...
See also this little bit of discussion about a week back: https://news.ycombinator.com/item?id=42999672
This does NOT mean Python will get Tail Call Optimization, as Guido cannot be shown The Light, and has decided against it.
It is not an optimization; it changes program semantics: it converts programs that will eventually run out of stack regardless of the amount of available memory (and raise an exception in the process, for example, which a program might rely on) into programs that keep running. Either way, the semantics are changed.
It should only be called Tail Call Elimination.
By that standard, any optimization that changes scaling in any dimension changes semantics, which, well, I’m not saying it's wrong, but I would say it is exactly what people looking for optimization want.
I disagree.
An optimization that speeds a program by x2 has the same effect as running on a faster CPU. An optimization that packs things tighter into memory has the same effect as using more memory.
Program semantics are usually referred to as “all output given all input, for any input configuration” but ignoring memory use or CPU time, provided they are both finite (but not limited).
TCE easily converts a program that will halt, regardless of available memory, to one that will never halt, regardless of available memory. That’s a big change in both theoretical and practical semantics.
I probably won’t argue that a change that reduces an O(n^5) space/time requirement to an O(1) requirement is a change in semantics, even though it practically is a huge change. But TCE changes a most basic property of a finite memory Turing machine (halts or not).
We don’t have infinite memory Turing machines.
edited: Turing machine -> finite memory Turing machine.
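Concretely, a minimal CPython sketch of the kind of program being described (countdown is a made-up name; 1000 is just the default recursion limit):

    import sys

    def countdown(n):
        if n == 0:
            return "done"
        return countdown(n - 1)   # a tail call, but CPython still pushes a frame

    print(countdown(500))             # fine under the default limit
    print(sys.getrecursionlimit())    # 1000 by default

    # countdown(10_000_000) raises RecursionError today; with tail-call
    # elimination it would run in constant stack space, and countdown(-1)
    # would go from "crashes eventually" to "never halts".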
>I probably won’t argue that a change that reduces an O(n^5) space/time requirement to an O(1) requirement is a change in semantics, even though it practically is a huge change
Space/time requirements aren't language semantics though, are they?
it changes debug semantics
this is the reason guido avoids it. programs will still fail, except now without a stacktrace
GvR always prioritised ease of debugging over performance, and honestly I'm in the same camp. What good does a fast program do if it's incorrect?
But I think you can get a fine balance by keeping a recent call trace (in a ring buffer?). Lua does this and honestly it's OK, once you get used to the idea that you're not looking at stack frames, but execution history.
IMHO Python should add that, and it should clearly distinguish between which part of a crash log is a stack trace, and which one is a trace of tail calls.
Either way this is going to be quite a drastic change.
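Purely to illustrate the ring-buffer idea at the Python level (not how Lua implements it; the decorator, buffer size, and names are made up):

    from collections import deque
    from functools import wraps

    CALL_TRACE = deque(maxlen=32)     # ring buffer: only the most recent calls survive

    def traced(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            CALL_TRACE.append(f"{fn.__name__}{args}")
            return fn(*args, **kwargs)
        return wrapper

    @traced
    def even(n):
        return True if n == 0 else odd(n - 1)

    @traced
    def odd(n):
        return False if n == 0 else even(n - 1)

    even(5)
    print(list(CALL_TRACE))   # recent execution history, not a stack of live frames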
that's a nice solution!
> By that standard, any optimization that changes scaling in any dimension changes semantics
That doesn't follow. This isn't like going from driving a car to flying an airplane. It's like going from driving a car to just teleporting instantly. (Except it's about space rather than time.)
It's a difference in degree (optimization), yes, but by a factor of infinity (O(n) overhead to 0 overhead). At that point it's not unreasonable to consider it a difference in kind (semantics).
Modern C compilers are able to transform something like this:
for (int i = 0; i < n; i++) a += i;
To:
a += n * (n - 1) / 2;
Is this an optimisation or a change in program semantics? I've never heard anyone call it anything other than an optimisation.
This will get a bit pedantic, but it's probably worthwhile... so here we go.
> Is this an optimisation or a change in program semantics?
Note that I specifically said something can be both an optimization and a change in semantics. It's not either-or.
However, it all depends on how the program semantics are defined. They are defined by the language specifications. Which means that in your example, it's by definition not a semantic change, because it occurs under the as-if rule, which says that optimizations are allowed as long as they don't affect program semantics. In fact, I'm not sure it's even possible to write a program that would be guaranteed to distinguish them based purely on the language standard. Whereas with tail recursion it's trivial to write a program that will crash without tail recursion but run arbitrarily long with it.
We do have at least one optimization that is permitted despite being prohibited by the as-if rule: return-value optimization (RVO). People certainly consider that a change in semantics, as well as an optimization.
You do have a point. However, if I'm allowed to move the goalposts a little: not all changes in semantics are equal. If you take a program that crashes for certain inputs and turn it into one that is semantically equivalent except that in some of those crashing cases, it actually continues running (as if on a machine with infinite time and/or memory), then that is not quite as bad as one that changes a non-crashing result into a different non-crashing result, or one that turns a non-crashing result into a crash.
With this kind of "benign" change, all programs that worked before still work, and some that didn't work before now work. I would argue this is a good thing.
Amazing it can do that. How does it work?
That definitely does seem to change its semantics to me. I am not a c expert but this surely has problems the previous one doesn’t?
It does change the semantics if n is negative or large enough to cause an overflow. The challenge for the compiler is to somehow prove that neither of those things can happen.
It doesn't have to prove absence of overflow since that is undefined behavior in C and thus modern compilers assume it can never happen.
Great point.
It can be, especially when you do something undefined: the compiler can do all sorts of odd things while transforming code
The important thing is whether there's a guarantee of it happening in a particular circumstance or not. Like with Python reference counting: theoretically, finalizers should be called after you lose all references to a file (assuming no cycles), but you can't rely on it.
Python dicts were in insertion order in 3.6, but this only became a guarantee, as opposed to an implementation choice that could be changed at any time, with Python 3.7.
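Both points in one small CPython-flavoured example (the timing of __del__ is an implementation detail, which is exactly the point):

    class Resource:
        def __del__(self):
            print("finalized")

    r = Resource()
    r = None        # CPython's reference counting usually runs __del__ right here,
                    # but the language itself doesn't guarantee when (or whether) it runs

    d = {}
    d["b"] = 1
    d["a"] = 2
    d["c"] = 3
    print(list(d))  # ['b', 'a', 'c'] - insertion order, a documented guarantee
                    # only since Python 3.7 (3.6 merely happened to behave this way)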
> converts programs that will run out of stack eventually regardless of the amount of available memory (and raise exceptions an the process, for example, which a program might rely on
https://xkcd.com/1172/
That's probably one of the more frustrating things about Python. Each release it gets all sorts of questionable new syntax (including the very strange pattern matching "feature" that kind of sucks compared to something like Erlang or Scala), but we never get real useful quality of life improvements for basic functional programming like TCO or multi line lambdas
Python has always been unashamedly imperative, with some functional features entering by slipping through the cracks. The pattern matching thing seemed ok to me when I tried it, but I haven't used it except briefly, since I'm still mostly on Python 3.9. Interestingly, Python has been losing users to Rust. I don't entirely understand that, other than everyone saying how Rust's tooling is so much better.
> Python has been losing users to Rust. I don't entirely understand that, other than everyone saying how Rust's tooling is so much better.
Not to rust, but to Go and C++ for myself. The biggest motivating factor is deployment ease. It is so difficult to offer a nice client install process when large virtual environments are involved. Static executables solve so many painpoints for me in this arena. Rust would probably shine here as well.
If its for some internal bespoke process, I do enjoy using Python. For tooling shipped to client environments, I now tend to steer clear of it.
> For tooling shipped to client environments, I now tend to steer clear of it.
A guy on r/WritingWithAI is building a new writing assistant tool using python and pyQt. He is not a SE by trade. Even so, the installation instructions are:
- Install Python from the Windows app store
- Windows + R -> cmd -> pip install ...
- Then run python main.py
This is fine for technical people. Not regular folks.
For most people, these incantations to be typed as-is in a black window mean nothing and it is a terrible way of delivering a piece of software to the end-user.
As someone that always kept a foot in C++ land, despite mostly working on managed languages, I would say that by C++17 (more so now in C++23), despite all its quirks and warts, C++ has become good enough that I can write Python-like code with it.
Maybe it is only a thing to those of us already damaged by C++, and with enough years of experience using it, but there are still plenty of such folks around to matter, especially to GPU vendors and compiler writers.
Are there any books or curricula you'd recommend to someone starting out, who wants to learn a more modern style? My main worry is just that everything is going to be geared to C++11 (or worse, 98).
I liked "Effective Modern C++" although it is somewhat out of date by now. Stroustrup's recent article "21st century C++" https://cacm.acm.org/blogcacm/21st-century-c/ gives an overview (but not details) of more recent changes. There are also the C++ core guidelines though maybe those are also out of date? https://github.com/isocpp/CppCoreGuidelines
I've been looking at Rust and it's a big improvement over C, but it still strikes me as a work in progress, and its attitude is less paranoid than that of Ada. I'd at least like to see options to crank up the paranoia level. Maybe Ada itself will keep adapting too. Ada is clunky, but it is way more mature than Rust.
Yes, from Bjarne Stroustrup himself,
A Tour of C++, preferably the 2nd edition
Programming -- Principles and Practice Using C++, preferably the 3rd edition
The latest edition of "A Tour of C++" is the 3rd one, from 2022. Is there any specific reason why you would recommend the 2nd edition (from 2018) over that one?
I wasn't aware there is already a 3rd one.
Thanks!
Kind of a summary: "21st century C++" (still by Bjarne Stroustrup)
https://news.ycombinator.com/item?id=42946321
I'm largely still a Python user, but when I've used it, Rust overall seems way more thoughtfully and consistently designed, both in the core language features and in the stdlib.
Python's thirty years of evolution really shows at this point.
>Python has been losing users to Rust
Not really.
> we never get real useful quality of life improvements for basic functional programming like TCO or multi line lambdas
A lambda can be as big of an expression as you want, including spanning multiple lines; it can't (because it is an expression) include statements, which is only different than lambdas in most functional languages in that Python actually has statements.
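For instance, a lambda whose single expression spans several lines (the name classify is arbitrary):

    classify = lambda n: (
        "negative" if n < 0
        else "zero" if n == 0
        else "positive"
    )

    print(classify(-3), classify(0), classify(7))   # negative zero positive

    # What it cannot contain is a statement (an assignment, a for loop, a try),
    # because the body has to be a single expression.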
> most functional languages
Most popular functional languages I can think of, except maybe Haskell, have statements!
The choice of “unique” verbs is weird too. Case match. Try except?
`match/case` looks absolutely fine to me. What's the problem?
`try/except` is definitely weird, though.
The utility value of multi-line lambdas is real, but the readability of these is terrible. And Python prizes readability. So you know where this initiative will end up.
Nothing more readable than a triply-nested list comprehension on an object that exists only to vend its __getattr__ for some unholy DSL
Annoying. Because it “compiles” to less optimal code than writing it explicitly.
I personally find python to be highly UNreadable, especially with all of its syntax and braindead scoping rules
Guido is no longer BDFL though, it's the steering committee that decides.
the steering committee seems way less conservative than Guido, right?
Looking at python from the outside a lot of changes since GvR stepped down seem like stuff he'd not have been fond of.
I think this is a change longer in the making than that. Back when I started working with Python in the mid--late 2000s, the Zen was holy and it seemed very unlikely to ever see multiple ways to do "one thing".
The Python community has since matured and realised that what they previously thought of as "one thing" were actually multiple different things with small nuances and it makes sense to support several of them for different use cases.
One way to do the things. That's why there's 5000 ways to install a module.
And 4900 "wrong ways" that will hurt you one way or another
More like 5001.
You may be right. I checked and found the introduction of the ternary expression, which I found to be wildly "unpythonic", was back in 2006. Time flies.
Any examples? The biggest change since Guido stepped down has been the addition of pattern matching, which he was strongly in favour of.
Moreover, Guido is in favour of ongoing addition of major new features (like pattern matching), worrying that without them Python would become a “legacy language”:
https://discuss.python.org/t/pep-8012-frequently-asked-quest...
I was thinking of the walrus operator, various f-string changes, relenting on the "GIL removal must not cost performance" stance (although "covered" by other improvements), things like that.
I don't follow python closely so it may 100% be stuff that GvR endorsed too, or I'm mixing up the timelines. It just feels to me that python is changing much faster than it did in the 2.x days.
> python is changing much faster than it did in the 2.x days.
I think part of the reason Guido stepped down was that the BDFL structure created too much load on him dealing with actual and potential change, so its unsurprising that the rate of change increased when the governance structure changed to one that managed change without imposing the same load on a particular individual.
This may just be time passing faster now that you're older.
Pattern matching seems like a cool feature that was added just because it was cool. I think the syntax is really odd too - apparently to “be pythonic”. I really see no use for it other than to “look smart”. The fact that case match (switch case is a much better description) is expanded to practically a huge if else is disturbing. Similarly the walrus operator. Other than an answer to “what is a new feature of python that you like” interview trivia question, really, who has actually used it?
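For what it's worth, a rough illustration of that expansion: a structural match and a hand-written if/isinstance chain doing the same checks (the Point class is made up, and this is not the literal code CPython generates):

    from dataclasses import dataclass

    @dataclass
    class Point:
        x: float
        y: float

    def where(p):
        match p:
            case Point(x=0, y=0):
                return "origin"
            case Point(x=0, y=y):
                return f"on the y-axis at {y}"
            case Point():
                return "somewhere else"

    def where_by_hand(p):
        if isinstance(p, Point) and p.x == 0 and p.y == 0:
            return "origin"
        if isinstance(p, Point) and p.x == 0:
            return f"on the y-axis at {p.y}"
        if isinstance(p, Point):
            return "somewhere else"

    print(where(Point(0, 3)), where_by_hand(Point(0, 3)))   # both: on the y-axis at 3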
I don't use pattern matching much, but I use walrus fairly regularly.
> Similarly the walrus operator. Other than an answer to “what is a new feature of python that you like” interview trivia question, really, who has actually used it?
At least in my case I use it all the time, to avoid duplicated operations inside comprehensions.
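For instance (f and data are just placeholders):

    def f(x):
        return x * x - 10

    data = range(8)

    # Without the walrus operator, f(x) is evaluated twice per element:
    slow = [f(x) for x in data if f(x) > 0]

    # With it, each f(x) is evaluated once and reused:
    fast = [y for x in data if (y := f(x)) > 0]

    assert slow == fast == [6, 15, 26, 39]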
Yeah, it was added to tick the box for people who ask "does Python have pattern matching?"
If you look at the feature in detail, and especially how it clashes with the rest of the language, it's awful. For example:
https://x.com/brandon_rhodes/status/1360226108399099909
To be fair, "The Substitution Principle" (more commonly known as "equational reasoning" in this context) has never been valid in any languages that aren't... Haskell, and maybe Ada? Any expression that can trigger side effects is an unsafe substitution. (The reason such substitutions are safe in Haskell and Ada is that those languages prevent expressions from triggering side effects in the first place.)
This isn't about general substitutability though, just about naming constants. If you have `case 404:` and you add a named constant `NOT_FOUND = 404`, you can't change the code to `case NOT_FOUND:` because that completely changes its semantics.
Given that one of the fundamental rules of programming is "don't use magic numbers, prefer named constants", that's terrible language design.
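A runnable sketch of the gotcha; HTTPStatus is used here only to show the dotted-name ("value pattern") workaround:

    from http import HTTPStatus

    NOT_FOUND = 404

    def describe(status):
        match status:
            case 404:                    # literal pattern: tests status == 404
                return "not found"
            case NOT_FOUND:              # capture pattern: matches ANY value and
                return "not found?"      # rebinds the local name NOT_FOUND to it

    def describe_fixed(status):
        match status:
            case HTTPStatus.NOT_FOUND:   # dotted names are value patterns,
                return "not found"       # so this really compares against 404
            case _:
                return "something else"

    print(describe(500))        # "not found?" - the bare name swallowed everything
    print(describe_fixed(500))  # "something else"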
Ah, you’re correct. My comment was mainly meant as a tongue in cheek remark to point out that this definition of tailcall is wholly separate from Python function objects and merely an implementation detail.
Hasn't Guido step down from BD anyway?
Will Python ever get fast? Or even _reasonably_ fast?
The answer is no, it will not. Instead they'll just keep adding more and more syntax. And more and more ways to do the same old things. And they'll say that if you want "fast" then write a native module that we can import and use.
So then what's the point? Is Python really just a glue language like all the rest?
VWWHFSfQ, you may already know this, but: I recommend this talk by Armin Ronacher (Flask creator) on how Python's implementation internals contribute to the difficulties of making Python faster.
https://www.youtube.com/watchv=qCGofLIzX6g
One case study Ronacher gets into is the torturous path taken through the Python interpreter (runtime?) when you evaluate `__add__`. Fascinating stuff.
Your link is broken, here's a working one: https://www.youtube.com/watch?v=qCGofLIzX6g
Python is fast enough for a whole set of problems AND it is a pretty, easy to read and write language. I do think it can probably hit pause on adding more syntax but at least everything it adds is backwards compatible. You won’t be writing a 3D FPS game engine in Python but you definitely can do a whole lot of real time data processing, batch processing, scientific computing, web and native applications, etc. before you need to start considering a faster interpreter.
If your only metric for a language is speed then nothing really beats hand crafted assembly. All this memory safety at runtime is just overhead. If you also consider language ergonomics, Python suddenly is not a bad choice at all.
> If your only metric for a language is speed then nothing really beats hand crafted assembly
Only if you know the micro-architecture of the processor you are running on at great depth and can schedule the instructions accordingly. Modern compilers and vms can do crazy stuff at this level.
> Python is fast enough for a whole set of problems AND it is a pretty, easy to read and write language.
It is definitely easy to read. But speed is debatable. It is slow enough for my workload to start wondering about moving to pypy.
Will your program ever be fast if you don’t learn the microarchitecture of your CPU first? :)
PyPy is a valid option and one I would explore if it fits what you are doing.
Everything it adds is backwards compatible by default, because old programs didn't use it (it wasn't there yet) and so won't break.
Python's problem is that the non-new stuff is not always backwards compatible. It happens way too often that a new Python version comes out and half the Python programs on my system just stop working.
> You won’t be writing a 3D FPS game engine in Python
While Eve Online isn’t an FPS, it is an MMORPG written in Stackless Python, and seems to be doing OK.
They do continuously struggle with CPU load and by all accounts have a mountain range of technical debt from that decision, though.
It was, once.
Nowadays (for about 12 years already I think) there is nothing much stackless about it.
The concept was nice. Stackless and greenlets.. yess. But the way they rewrote the C stack just killed the caches. Even a naive reimplementation just using separate mmapped stacks and wrapping the whole coro concept under then-Python's threading module instantly gained like a 100x speedup on context-switch loads like serving small stuff over HTTP.
Edit: Though at this point it didn't much differ from ye olde FreeBSD N:M pthread implementation. Which ended badly if anyone can remember.
I guess I'm wondering what is the point of tail-call optimizations, or even async/await when it's all super slow and bounded by the runtime itself? There are basically no improvements whatsoever to the core cpython runtime. So really what is all this for? Some theoretical future version of Python that can actually use these features in an optimal way?
This TCO is in how the CPython interpreter works, not in making Python itself tail recursive. Some of the C code in the interpreter has been reorganized to put some calls into tail position where the C compiler turns them into jumps. That avoids some call/return overhead and makes the interpreter run a little faster. It's still interpreting the same language with the same semantics.
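(The reorganisation is in the interpreter's C code; the Python snippet below is only meant to illustrate what "tail position" means, with made-up function names.)

    def helper(n):
        return n * 2

    def tail_call(n):
        return helper(n - 1)        # tail position: nothing left to do after the call,
                                    # so it could in principle become a jump

    def not_a_tail_call(n):
        return 1 + helper(n - 1)    # not a tail call: the addition still runs
                                    # after helper returns

    print(tail_call(5), not_a_tail_call(5))   # 8 9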
The JIT will improve - you can also use PyPy to get speedups on programs that don't use a ton of C extensions.
Also, free-threading is coming so we'll have threads soon.
I don't know if Python can ever really be fast, as by design objects are scattered all over memory, and even for things like iterating a list you're chasing pointers to PyObjects all over the place - it's just not cache friendly.
PyPy has a list implementation that specializes under the hood. So if you stuff it with integers, it will contain the integers directly instead of pointers to them. That's at least how I understood it.
I think if you want python but fast then Mojo is your only hope.
EDIT: yes and there’s pypy as well as pointed out below. Basically you gotta use an alternative python implementation of some kind.
There’s always PyPy - it’s much faster than CPython and, unlike Mojo, is ready to use today.