Discussion:
GIL-Removal Project Takes Another Step (Posting On Python-List Prohibited)
Lawrence D'Oliveiro
2024-03-15 01:51:12 UTC
Permalink
Python takes another step in removing a major bottleneck to
multithreading performance
<https://devclass.com/2024/03/12/python-progresses-towards-faster-concurrency-option-to-disable-gil-merged-into-main-code/>.

Currently Python uses a combination of reference counting and garbage
collection so that programmers do not have to keep track of allocated
objects and remember when to dispose of them.
A pure garbage collection scheme, like in Java or LISP, makes it easier
to support multithreading, but at the cost of memory usage that can
easily get out of hand. Reference counting helps to ensure that objects
disappear as soon as the program forgets its last reference to them,
and this works well for most objects in a typical program, with garbage
collection as a fallback for cleaning up the rest.
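
A quick illustration of that division of labour in CPython (the class
and variable names here are mine, just for demonstration):

    import gc

    class Tracked:
        def __del__(self):
            print("freed:", self)

    # Reference counting: freed the instant the last reference goes away.
    obj = Tracked()
    del obj                    # "freed: ..." prints here, immediately

    # A reference cycle: the counts never reach zero on their own.
    a, b = Tracked(), Tracked()
    a.partner, b.partner = b, a
    del a, b                   # nothing prints yet; the cycle persists
    gc.collect()               # the fallback collector finds and frees both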

But the lock that keeps Python’s reference counts consistent (the
“Global Interpreter Lock” or “GIL”) prevents Python code from taking
full advantage of multiple threads. Some are advocating switching to
the pure garbage-collection approach, but fortunately (I think) this is
not the plan that has been adopted by the Steering Council. Instead,
they are going
to use a technique known as “Biased Reference Counting”. This splits
the reference count into two components, one managed by a thread which
is considered to “own” the object (and is presumably responsible for
most accesses to the object), while the other is managed on a shared
basis by other threads accessing the object (and making fewer accesses
to it). This seems to offer the best performance in tests so far.
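
As a rough sketch of the idea, here is a toy model in pure Python (an
illustration only; the real mechanism is implemented in C inside the
object header, and the merging of the two counts is more involved than
shown):

    import threading

    class BiasedRefCount:
        # Toy illustration, not CPython's actual implementation.
        def __init__(self):
            self.owner = threading.get_ident()  # creating thread owns it
            self.local = 1       # owner's count: plain, unsynchronized
            self.shared = 0      # other threads' count: needs atomicity
            self._lock = threading.Lock()

        def incref(self):
            if threading.get_ident() == self.owner:
                self.local += 1          # fast path, no contention
            else:
                with self._lock:         # slow path, synchronized
                    self.shared += 1

        def decref(self):
            if threading.get_ident() == self.owner:
                self.local -= 1
            else:
                with self._lock:
                    self.shared -= 1
            # The object dies only when both halves are zero; the real
            # scheme merges them when the owner's half reaches zero.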

The switchover is a complicated procedure, which is certain to have
implications for some existing Python code that never had to worry about
thread safety before, as well as far-reaching implications for the
design of the CPython implementation itself. So it will take place in
multiple stages over some years, and if worst comes to worst, the
changes can always be rolled back. (Or a different strategy chosen.)

Seems some people are still smarting over the flak they got from the
Python 2 → 3 transition. “This is not Python 4,” they are saying. But
why not call it “Python 4”, as a warning over the likely compatibility
issues? Even if it probably won’t be quite as painful ...
Paul Rubin
2024-03-19 00:11:52 UTC
Permalink
Post by Lawrence D'Oliveiro
But the lock that keeps Python’s reference counts consistent (the
“Global Interpreter Lock” or “GIL”) prevents Python code from taking
full advantage of multiple threads. Some are advocating switching to
the pure garbage-collection approach, but fortunately (I think) this is
not the plan that has been adopted by the Steering Council.
I don't know the specific pro- and anti-GC arguments in flight, so I
won't take a side here, but the idea that GC uses more memory seems
erroneous to me. MicroPython uses GC and runs on machines with as
little as 16KB of RAM (the BBC micro:bit v1). I've used the
CircuitPython variant on chips with 32KB of RAM and it is reasonably
comfortable on those. I don't think CPython has ever run on machines
that small. GC was invented in the 1950s for use in Lisp, on computers
of that era that were tiny compared with today's computers.

Classic GC adds one bit of overhead to each object, while reference
counting requires storing a refcount that is potentially large. A
strict refcounting approach in a big computer might even need 64 bit
refcounts. Also, the refcount has to be modified constantly. I'm
amazed if that doesn't slow things down badly even in the single
threaded case.
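
You can watch the churn from Python itself (exact numbers may vary by
version; getrefcount's own argument temporarily adds one):

    import sys

    x = object()
    print(sys.getrefcount(x))   # e.g. 2: 'x' plus the call's argument

    xs = [x, x, x]              # each list slot is one more reference
    print(sys.getrefcount(x))   # e.g. 5

    del xs                      # three decrements happen on the spot
    print(sys.getrefcount(x))   # back to 2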

Also, the "with" statement was added to Python partly to support
GC-based implementations, as some applications were relying on
refcounting to release resources when the object went out of scope, a
bad kludge.
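
For example (function names are mine):

    # Leans on refcounting: works in CPython, not guaranteed elsewhere.
    def read_kludge(path):
        return open(path).read()   # file closes whenever the object dies

    # Explicit lifetime, independent of the memory-management scheme.
    def read_portable(path):
        with open(path) as f:      # __exit__ closes it deterministically
            return f.read()
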
Post by Lawrence D'Oliveiro
Instead, they are going to use a technique known as “Biased Reference
Counting”.
That sounds ugly but I'm far away from it so dunno.
Post by Lawrence D'Oliveiro
This seems to offer the best performance in tests so far.
They compared it to a serious GC and it won? If yes, that is
interesting.
Post by Lawrence D'Oliveiro
Seems some people are still smarting over the flak they got from the
Python 2 → 3 transition. “This is not Python 4,” they are saying. But
why not call it “Python 4”, as a warning over the likely compatibility
issues? Even if it probably won’t be quite as painful ...
We seem to be getting a Python 4 transition (i.e. breaking old code)
with each new release of Python 3, so this is just more of the same.

Anyway I'm glad effort is being made to remove the GIL. If it were up
to me, I'd switch to BEAM or something like it as the underlying VM.
Lawrence D'Oliveiro
2024-03-19 02:33:44 UTC
Permalink
Post by Paul Rubin
... the idea that GC uses more memory seems erroneous to me.
It will use whatever memory it is permitted to use.

Consider that every time you call a method, a new bound-method object
is dynamically created, and almost always immediately deleted when the call
returns. So even a very simple, seemingly well-behaved Python script, if
running for long enough, would consume more and more memory if it were not
for reference-counting.
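
You can see the churn directly (recent CPython versions optimize the
direct-call path, but each attribute access still builds a wrapper):

    class Widget:
        def ping(self):
            return "pong"

    w = Widget()
    m1 = w.ping          # each access creates a fresh bound-method object
    m2 = w.ping
    print(m1 is m2)      # False: two distinct wrappers
    print(m1 == m2)      # True: same function bound to the same instance
    w.ping()             # wrapper created, called, refcount drops to zero
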
Paul Rubin
2024-03-20 00:51:54 UTC
Permalink
Post by Lawrence D'Oliveiro
So even a very simple, seemingly well-behaved Python script, if
running for long enough, would consume more and more memory if it were
not for reference-counting.
That is completely false. It's usual to set a GC to run every so-many
allocations. GHC normally does a minor GC every 256K of allocations so
that the most recent stuff fits in the L2 CPU cache, speeding things up
a lot. Refcounting schemes are of course incapable of that optimization
because they don't relocate objects in memory.

You can of course configure a GC to not run very often, in which case
the memory region can get large. That is an optimization you do
intentionally, to spend less CPU time doing GC, and of course you only
do that if you have the memory for it. I think you are imagining that
people always do that, but again remember MicroPython.
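
CPython's own cycle collector is scheduled the same way, by allocation
counts rather than by time, and the thresholds are tunable (the default
shown is typical, but check your version):

    import gc

    print(gc.get_threshold())         # typically (700, 10, 10)

    gc.set_threshold(10_000, 10, 10)  # fewer collections, more memory
    gc.set_threshold(0)               # or disable automatic collection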

The allocation of a new method wrapper on every method call is of course
something that the interpreter could also be optimized to not do. The
Emacs Lisp interpreter does something like that for function args, IIRC.
They are passed on a permanent stack instead of in temporary cons cells.

Erlang on a midsized server can run millions of lightweight processes in
its VM, each with its own GC. The minimum ram size of an Erlang process
is around 2KB iirc. But I don't know if they get bigger than that
before the GC runs.
Lawrence D'Oliveiro
2024-03-20 03:14:49 UTC
Permalink
Post by Paul Rubin
Post by Lawrence D'Oliveiro
So even a very simple, seemingly well-behaved Python script, if running
for long enough, would consume more and more memory if it were not for
reference-counting.
That is completely false. It's usual to set a GC to [fix it so it’s not
false] ...
In other words, it’s not “completely” false if you have to do something to
make it false. But that GC process creates its own overhead, not to
mention the latency when there isn’t quite enough memory for an allocation
and you have to wait until the next GC run to proceed. Run the GC a
thousand times a second, and the latency is still 1 millisecond.

With reference counting, most objects are immediately freed as soon as
they are discarded--no need to wait for the next GC run.
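
For instance, in CPython (other implementations may delay this):

    import weakref

    class Big:
        pass

    obj = Big()
    alive = weakref.ref(obj)
    print(alive() is not None)   # True: the object exists

    del obj                      # last reference gone -> freed right here
    print(alive() is None)       # True: no collector pass was needed
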
Greg Ewing
2024-03-20 07:29:30 UTC
Permalink
Post by Lawrence D'Oliveiro
not to
mention the latency when there isn’t quite enough memory for an allocation
and you have to wait until the next GC run to proceed. Run the GC a
thousand times a second, and the latency is still 1 millisecond.
That's not the way it usually works. If you run out of memory, you
run a GC there and then. You don't have to wait for GCs to occur on
a time schedule.

Also, as a previous poster pointed out, GCs are typically scheduled
by number of allocations, not by time.
--
Greg
Chris Angelico
2024-03-20 07:42:21 UTC
Permalink
Post by Greg Ewing
Post by Lawrence D'Oliveiro
not to
mention the latency when there isn’t quite enough memory for an allocation
and you have to wait until the next GC run to proceed. Run the GC a
thousand times a second, and the latency is still 1 millisecond.
That's not the way it usually works. If you run out of memory, you
run a GC there and then. You don't have to wait for GCs to occur on
a time schedule.
Also, as a previous poster pointed out, GCs are typically scheduled
by number of allocations, not by time.
FYI you're violating someone's request by responding to them in a way
that results in it getting onto python-list, so it's probably safest
to just ignore cranks and trolls and let them stew in their own
juices.

But normally the GC doesn't need to be scheduled at all. In CPython,
the only reason to "run garbage collection" is to detect cycles, so
you would have to be generating inordinate amounts of cyclic garbage
for this to matter at all.
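
For example (the count printed can vary slightly by version):

    import gc

    node = {}
    node["self"] = node    # a cycle: the refcount can never reach zero
    del node               # now unreachable, but not yet freed

    print(gc.collect())    # >= 1: the cycle detector reclaims the dict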

ChrisA
Paul Rubin
2024-03-20 08:37:50 UTC
Permalink
Post by Chris Angelico
FYI you're violating someone's request by responding to them in a way
that results in it getting onto python-list,
I don't know if I'm doing that, but if yes, it's not on purpose. I'm
responding on Usenet.
Post by Chris Angelico
In CPython, the only reason to "run garbage collection" is to detect
cycles
That's the current version. We're discussing the possibility of
switching from refcounting to a GC system as part of the GIL removal
project.

Paul Rubin
2024-03-20 08:35:00 UTC
Permalink
Post by Lawrence D'Oliveiro
In other words, it’s not “completely” false if you have to do something to
make it false.
No you don't have to do anything, it is false by default. In Java the
default heap size (and hence the GC interval) is relatively small, and
to make it bigger you use the -Xmx option iirc.

Remember we are talking about GIL removal to get speedups on multicore
processors. I expect any machine with such a processor these days has
memory to spare.
Post by Lawrence D'Oliveiro
But that GC process creates its own overhead
Refcounting has overhead too!
Post by Lawrence D'Oliveiro
not to mention the latency when there isn’t quite enough memory for an
allocation and you have to wait until the next GC run to proceed.
If you are GC'ing every N allocations and you run out of free space
before you've done those N, you increase the region size by asking for
more memory from the system. If the system is out of memory, it is
out of memory and you need a bigger computer or some other change.
You don't "wait for the next GC run" as if it were a periodic daemon.
Post by Lawrence D'Oliveiro
With reference counting, most objects are immediately freed as soon as
they are discarded--no need to wait for the next GC run.
In other words you effectively GC every time an object is freed instead
of having a tuneable parameter that you can optimize. And of course you
don't get freedom from pauses either. If you allocate a million element
list in Python, then drop the last reference to the list, you have to
decrement the refcounts of each of the million elements, however long
that takes. Plus if that decrements most or all of them to zero, you
have to free them one by one. With a copying-style GC, you never have
to visit those elements or free them individually.
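
Easy to try (timings will vary by machine):

    import time

    big = [object() for _ in range(1_000_000)]  # a million live objects

    start = time.perf_counter()
    del big    # one statement, but a million decrefs and frees
    print(f"teardown took {time.perf_counter() - start:.3f}s")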

Look, widely used GC'd languages include Java, SBCL and other comparable
Lisp systems, GHC, OCaml, Erlang, .NET (C# and F#), Golang, current
incarnations of JavaScript, and others. All of them beat the pants off
CPython in performance. If you're claiming CPython's refcounting
system somehow outperforms the above-mentioned GCs, I'd be interested
in seeing some benchmarks. There may be some trade-offs in CPython that
make its refcount system still advantageous for some things, but
performance is unlikely to be one.