# Concurrency Guide

This is a guide to thinking about concurrency in the CRuby source code, whether you're contributing C code to Ruby
or working on one of the JITs. It covers only the core language, not native extensions. It goes over:

* What needs synchronizing?
* How to use the VM lock, and what you can and can't do when you've acquired this lock.
* What you can and can't do when you've acquired other native locks.
* The difference between the VM lock and the GVL.
* What a VM barrier is and when to use it.
* The lock ordering of some important locks.
* How ruby interrupt handling works.
* The timer thread and what it's responsible for.

## What needs synchronizing?

Before ractors, only one ruby thread could run at once. That didn't mean you could forget about concurrency issues, though. The timer thread
is a native thread that interacts with other ruby threads and changes some VM internals, so any VM state that both the timer thread and a ruby thread
can modify in parallel needs to be synchronized.

When you add ractors to the mix, it gets more complicated. However, ractors let you forget about synchronization for non-shareable objects because
they aren't used across ractors: only one ruby thread can touch such an object at a time. Shareable objects are usually deeply frozen, so the objects
themselves aren't mutated. However, some shared state, like constants, can be read and written from several ractors at once and does need to be synchronized.
Ruby threads need to see a consistent view of the VM, so if publishing an update takes two steps, or even two separate instructions as it does here, synchronization is required.
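The two-step publication problem can be sketched outside of CRuby with plain pthreads. This is a minimal sketch, not CRuby code: `struct const_entry`, `const_set` and `const_get_consistent` are hypothetical names, and a `pthread_mutex_t` stands in for whatever lock the real code uses. Because the writer updates both fields inside one critical section and the reader takes the same lock, a reader can never observe a new `value` paired with an old `version`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a constant-table entry: publishing an update
 * means writing two fields, so a reader must never see one new field
 * paired with one old field. */
struct const_entry {
    long value;
    long version;   /* must always match the value it was written with */
};

static struct const_entry entry;
static pthread_mutex_t entry_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writer: both stores happen inside one critical section. */
static void const_set(long v)
{
    pthread_mutex_lock(&entry_lock);
    entry.value = v;
    entry.version = v;
    pthread_mutex_unlock(&entry_lock);
}

/* Reader: takes the same lock, so it sees either the old pair or the new pair. */
static bool const_get_consistent(long *out)
{
    pthread_mutex_lock(&entry_lock);
    long v = entry.value;
    long ver = entry.version;
    pthread_mutex_unlock(&entry_lock);
    *out = v;
    return v == ver;
}

static void *writer(void *arg)
{
    (void)arg;
    for (long i = 0; i < 100000; i++)
        const_set(i);
    return NULL;
}

/* Runs one writer against a reading loop; returns the number of torn reads seen. */
static int count_torn_reads(void)
{
    pthread_t th;
    int torn = 0;
    pthread_create(&th, NULL, writer, NULL);
    for (int i = 0; i < 100000; i++) {
        long v;
        if (!const_get_consistent(&v))
            torn++;
    }
    pthread_join(th, NULL);
    return torn;
}
```

If you remove the locking from either function, torn reads become possible on a multicore machine.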

Most synchronization is to protect VM internals. These internals include structures for the thread scheduler on each ractor, the global ractor scheduler, the
coordination between ruby threads and ractors, global tables (for `fstrings`, encodings, symbols and global vars), etc. Any data that one ractor can mutate
while another ractor reads or mutates it requires proper synchronization.

## The VM Lock

There's only one VM lock, and it is for critical sections that can only be entered by one ractor at a time.
When only the main ractor is running, the VM lock isn't needed. It does not stop all ractors from running, as ractors can run
without trying to acquire this lock. If you're updating global (shared) data between ractors and aren't using
atomics, you need to use a lock, and this is a convenient one to use. Unlike other locks, you can allocate ruby-managed
memory with it held. When you take the VM lock, there are things you can and can't do during your critical section:

You can (as long as no other locks are also held before the VM lock):

* Create ruby objects, call `ruby_xmalloc`, etc.

You can't:

* Context switch to another ruby thread or ractor. This is important, as many things can cause ruby-level context switches including:

    * Calling any ruby method through, for example, `rb_funcall`. If you execute ruby code, a context switch could happen.
    This also applies to ruby methods defined in C, as they can be redefined in Ruby. Things that call ruby methods such as
    `rb_obj_respond_to` are also disallowed.

    * Calling `rb_raise`. This will call `initialize` on the new exception object. With the VM lock
      held, nothing you call should be able to raise an exception. `NoMemoryError` is allowed, however.

    * Calling `rb_nogvl` or a ruby-level mechanism that can context switch like `rb_mutex_lock`.

    * Entering any blocking operation managed by ruby. This context switches to another ruby thread using `rb_nogvl` or
    something equivalent. A blocking operation is one that blocks the thread's progress, such as `sleep` or `IO#read`.

Internally, the VM lock is the `vm->ractor.sync.lock`.

You need to be on a ruby thread to take the VM lock. You also can't take it inside any function that could be called during sweeping, because MMTk sweeps
on another thread and you need a valid `ec` to grab the lock. For the same reason (among others), you can't take it from the timer thread either.
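One pattern that follows from the "can't raise" rule is to decide the outcome inside the critical section and report it only after releasing the lock. The sketch below uses a hypothetical fixed-size global table, and a `pthread_mutex_t` stands in for the VM lock. In CRuby you would take the real VM lock instead; see `vm_sync.h` for the current macros. The `fprintf` marks where an `rb_raise` would go, once the lock is no longer held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for the VM lock and a global table it protects. */
static pthread_mutex_t vm_lock_standin = PTHREAD_MUTEX_INITIALIZER;
#define TABLE_CAPACITY 4
static int global_table[TABLE_CAPACITY];
static int global_table_len;

enum table_status { TABLE_OK, TABLE_FULL };

/* Critical section: only mutates memory and records an outcome.
 * Nothing in here may raise, call back into ruby code, or block. */
static enum table_status table_append_locked(int v)
{
    if (global_table_len == TABLE_CAPACITY)
        return TABLE_FULL;
    global_table[global_table_len++] = v;
    return TABLE_OK;
}

/* Caller: the error is reported only after the lock is released.
 * In CRuby, this is where an rb_raise() would go. */
static int table_append(int v)
{
    pthread_mutex_lock(&vm_lock_standin);
    enum table_status st = table_append_locked(v);
    pthread_mutex_unlock(&vm_lock_standin);

    if (st == TABLE_FULL) {
        fprintf(stderr, "table full, cannot append %d\n", v); /* stand-in for rb_raise */
        return -1;
    }
    return 0;
}
```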

## Other Locks

All native locks that aren't the VM lock share a stricter set of rules for what's allowed during the critical section. By native locks, we mean
anything that uses `rb_native_mutex_lock`. Some important locks include the `interrupt_lock`, the ractor scheduling lock (protects global scheduling data structures),
the thread scheduling lock (local to each ractor, protects per-ractor scheduling data structures) and the ractor lock (local to each ractor, protects ractor data structures).

While you hold one of these locks, you can:

* Allocate memory through non-ruby allocation such as raw `malloc` or the standard library. But be careful: some functions, like `strdup`, use
ruby allocation through macros!

* Use `ccan` lists, as they don't allocate.

* Do the usual things like set variables or struct fields, manipulate linked lists, signal condition variables, etc.

You can't:

* Allocate ruby-managed memory. This includes creating ruby objects or using `ruby_xmalloc` or `st_insert`. This is disallowed because
if the allocation triggers a GC, all other ruby threads must join a VM barrier as soon as possible
(when they next check interrupts or acquire the VM lock), so that no other ractors are running during GC. If a ruby thread
is waiting (blocked) on this same native lock, it can't join the barrier, and a deadlock occurs because the barrier never finishes.

* Raise exceptions. You also can't use `EC_JUMP_TAG` if it jumps out of the critical section.

* Context switch. See "The VM Lock" section for more info.
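The usual way to stay within these rules is to do any allocation that could trigger a GC before taking the native lock, and only relink pointers while holding it. In this sketch, the `waiter` queue, `sched_lock` and `allocate_payload` are hypothetical names. `malloc` stands in for an allocation that, in CRuby, might be ruby-managed (for example, creating an object).

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical queue protected by a native lock (think of a per-ractor
 * scheduling queue). */
struct waiter {
    struct waiter *next;
    int id;
    void *payload;   /* imagine this is ruby-managed memory */
};

static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;
static struct waiter *waiters_head;

/* Stand-in for an allocation that could trigger GC in CRuby.
 * It must run with no native lock held. */
static void *allocate_payload(void)
{
    return malloc(64);
}

static int enqueue_waiter(int id)
{
    /* 1. Do every allocation before taking the lock. */
    struct waiter *w = malloc(sizeof(*w));
    void *payload = allocate_payload();
    if (!w || !payload) {
        free(w);
        free(payload);
        return -1;
    }
    w->id = id;
    w->payload = payload;

    /* 2. Under the lock, only relink pointers: no allocation, no raising. */
    pthread_mutex_lock(&sched_lock);
    w->next = waiters_head;
    waiters_head = w;
    pthread_mutex_unlock(&sched_lock);
    return 0;
}

static int count_waiters(void)
{
    int n = 0;
    pthread_mutex_lock(&sched_lock);
    for (struct waiter *w = waiters_head; w; w = w->next)
        n++;
    pthread_mutex_unlock(&sched_lock);
    return n;
}
```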

## Difference Between VM Lock and GVL

The VM Lock is a particular lock in the source code. There is only one VM Lock. The GVL, on the other hand, is more of a combination of locks.
It is "acquired" when a ruby thread is about to run or is running. Since many ruby threads can run at the same time if they're in different ractors,
there are many GVLs (1 per `SNT` + 1 for the main ractor). It can no longer be thought of as a "Global VM Lock", as it was before ractors.

## VM Barriers

Sometimes, taking the VM Lock isn't enough and you need a guarantee that all ractors have stopped. This happens when running `GC`, for instance.
To get a barrier, you take the VM Lock and call `rb_vm_barrier()`. For as long as the VM lock is held, no other ractors will be running. Barriers are rarely
used because taking one slows ractor performance down considerably, but they're useful to know about and are sometimes the only solution.
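To make the barrier idea concrete, here is a small stop-the-world barrier written with pthreads. It illustrates the technique using hypothetical names (`safepoint`, `run_with_barrier`, `NWORKERS`) and is not the implementation behind `rb_vm_barrier()`. Worker threads park at their next safe point, much as ruby threads join a barrier when they next check interrupts. The requester proceeds only once every worker is parked. A plain mutex plays the role of the VM lock.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

#define NWORKERS 4

/* Hypothetical barrier state, guarded by a mutex standing in for the VM lock. */
static pthread_mutex_t barrier_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t barrier_cond = PTHREAD_COND_INITIALIZER;
static bool barrier_requested;
static int parked;   /* workers currently waiting inside safepoint() */

static atomic_long progress;
static atomic_bool stop_workers;

/* Workers call this at safe points, like a ruby thread checking interrupts. */
static void safepoint(void)
{
    pthread_mutex_lock(&barrier_lock);
    if (barrier_requested) {
        parked++;
        pthread_cond_broadcast(&barrier_cond);   /* tell the requester */
        while (barrier_requested)
            pthread_cond_wait(&barrier_cond, &barrier_lock);
        parked--;
    }
    pthread_mutex_unlock(&barrier_lock);
}

static void *worker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_workers)) {
        atomic_fetch_add(&progress, 1);   /* "running ruby code" */
        safepoint();
    }
    return NULL;
}

/* Requester: stop every worker, do exclusive work, then release them.
 * Returns true if no worker made progress while the barrier was held. */
static bool run_with_barrier(void)
{
    pthread_mutex_lock(&barrier_lock);
    barrier_requested = true;
    while (parked < NWORKERS)
        pthread_cond_wait(&barrier_cond, &barrier_lock);

    /* Exclusive section: every worker is parked in safepoint(). */
    long before = atomic_load(&progress);
    struct timespec ts = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    nanosleep(&ts, NULL);
    long after = atomic_load(&progress);

    barrier_requested = false;
    pthread_cond_broadcast(&barrier_cond);
    pthread_mutex_unlock(&barrier_lock);
    return before == after;
}

static bool barrier_demo(void)
{
    pthread_t th[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&th[i], NULL, worker, NULL);
    bool quiet = run_with_barrier();
    atomic_store(&stop_workers, true);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(th[i], NULL);
    return quiet;
}
```

`barrier_demo()` returns `true` when no worker made progress while the barrier was held. The cost is visible here: every worker stops doing useful work until the requester releases it.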

## Lock Orderings

It's a good idea not to hold more than 2 locks at once on the same thread, because holding multiple locks can introduce deadlocks. When you do need
multiple locks at once, acquire them in an order that is consistent across the program. Here are the orderings of some important locks:

* VM lock before `ractor_sched_lock`
* `thread_sched_lock` before `ractor_sched_lock`
* `interrupt_lock` before `timer_th.waiting_lock`
* `timer_th.waiting_lock` before `ractor_sched_lock`

These orderings are subject to change, so check the source if you're not sure. On top of this:

* The VM lock can be held around a `ubf` (unblock function) call in some circumstances, for example during VM shutdown.
See the "Ruby Interrupt Handling" section for more details.
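One common way to enforce orderings like these is lock ranking. You give every lock a rank and check that each thread acquires locks in strictly increasing rank. The sketch below shows this general debugging technique; it is not something CRuby is claimed to do. The ranks are hypothetical and form one total order consistent with the pairs listed above. The source only constrains those pairs, so the relative order of pairs not listed (for example `interrupt_lock` and `thread_sched_lock`) is an assumption.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical ranks: one total order consistent with the listed pairs.
 * Locks must be acquired in strictly increasing rank. */
enum lock_rank {
    RANK_VM_LOCK = 1,
    RANK_INTERRUPT_LOCK,
    RANK_THREAD_SCHED_LOCK,
    RANK_TIMER_WAITING_LOCK,
    RANK_RACTOR_SCHED_LOCK,
};

struct ranked_lock {
    pthread_mutex_t mutex;
    enum lock_rank rank;
};

#define MAX_HELD 8
/* Per-thread stack of held ranks (C11 _Thread_local). */
static _Thread_local enum lock_rank held[MAX_HELD];
static _Thread_local int nheld;

/* Returns false (and does not lock) if acquiring l would break the ordering. */
static bool ranked_lock_acquire(struct ranked_lock *l)
{
    if (nheld > 0 && l->rank <= held[nheld - 1])
        return false;   /* out of order: potential deadlock */
    if (nheld == MAX_HELD)
        return false;
    pthread_mutex_lock(&l->mutex);
    held[nheld++] = l->rank;
    return true;
}

/* In this sketch, releases must happen in LIFO order. */
static void ranked_lock_release(struct ranked_lock *l)
{
    assert(nheld > 0 && held[nheld - 1] == l->rank);
    nheld--;
    pthread_mutex_unlock(&l->mutex);
}

/* Hypothetical stand-ins for two of the locks named above. */
static struct ranked_lock vm_lock = { PTHREAD_MUTEX_INITIALIZER, RANK_VM_LOCK };
static struct ranked_lock ractor_sched_lock = { PTHREAD_MUTEX_INITIALIZER, RANK_RACTOR_SCHED_LOCK };
```

The check fires on the first out-of-order acquisition, even if no deadlock actually happens in that run. This is what makes ranking useful for catching ordering bugs early.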

## Ruby Interrupt Handling

When the VM runs ruby code, ruby's threads intermittently check ruby-level interrupts. These software interrupts
are for various things in ruby and they can be set by other ruby threads or the timer thread.

* Ruby threads check whether they should give up their timeslice. When their time is up, the native thread switches to another ruby thread.
* The timer thread sends a "trap" interrupt to the main thread if any ruby-level signal handlers are pending.
* Ruby threads can have other ruby threads run tasks for them by sending them an interrupt. For instance, ractors send
the main thread an interrupt when they need to `require` a file so that it's done on the main thread. They wait for the
main thread's result.
* During VM shutdown, a "terminate" interrupt is sent to all ractor main threads to stop them as soon as possible.
* When calling `Thread#raise`, the caller sends an interrupt to that thread telling it which exception to raise.
* Unlocking a mutex sends the next waiter (if any) an interrupt telling it to grab the lock.
* Signalling or broadcasting on a condition variable tells the waiter(s) to wake up.

This isn't a complete list.
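The set-then-check flow above can be modeled with an atomic bitmask. Senders OR bits in from any thread, and the owning thread atomically takes all pending bits at its next safe point. This is a simplified sketch with hypothetical names (`post_interrupt`, `check_interrupts`, the `INT_*` bits). The real interrupt flags live in the execution context, and CRuby's handlers differ in detail.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical interrupt bits, loosely modeled on the kinds listed above. */
enum {
    INT_TIMESLICE = 1u << 0,   /* timer thread: time is up */
    INT_TRAP      = 1u << 1,   /* timer thread: signal handler pending (main thread) */
    INT_TERMINATE = 1u << 2,   /* VM shutdown */
    INT_PENDING   = 1u << 3,   /* e.g. Thread#raise queued an exception */
};

struct fake_thread {
    atomic_uint interrupt_flags;   /* written by other threads, read by the owner */
    int yields, traps_run, raises, terminated;
};

/* Called by any other thread (or the timer thread) to post an interrupt. */
static void post_interrupt(struct fake_thread *th, unsigned bits)
{
    atomic_fetch_or(&th->interrupt_flags, bits);
}

/* Called by the owning thread at safe points between "instructions".
 * Atomically takes all pending bits, so none are lost or handled twice. */
static void check_interrupts(struct fake_thread *th)
{
    unsigned pending = atomic_exchange(&th->interrupt_flags, 0);
    if (pending & INT_TERMINATE) {
        th->terminated = 1;
        return;   /* termination wins; other bits are dropped in this sketch */
    }
    if (pending & INT_PENDING)
        th->raises++;
    if (pending & INT_TRAP)
        th->traps_run++;
    if (pending & INT_TIMESLICE)
        th->yields++;   /* a real VM would switch to another thread here */
}
```

Because the flags are bits, two "time's up" interrupts posted before the next check collapse into a single yield.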

When an interrupt is sent to a ruby thread, the receiving thread may be blocked. For example, it could be in the middle of a `TCPSocket#read` call. If so,
the receiving thread's `ubf` (unblock function) gets called from the thread (ruby thread or timer thread) that sent the interrupt.
Each ruby thread has a `ubf` that is set when it enters a blocking operation and is unset after returning from it. By default, this `ubf` function sends a
`SIGVTALRM` to the receiving thread to try to unblock it from the kernel so it can check its interrupts. There are other `ubfs` that
aren't associated with a syscall, such as when calling `Ractor#join` or `sleep`. All `ubfs` are called with the `interrupt_lock` held,
so take that into account when using locks inside `ubfs`.

Remember, `ubfs` can be called from the timer thread so you cannot assume an `ec` inside them. The `ec` (execution context) is only set on ruby threads.
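The default `ubf` behavior, and the timer thread re-sending the signal, can be demonstrated with POSIX signals. This sketch uses hypothetical names throughout and is not CRuby code. A thread blocks in `read()` on a pipe that is never written to. The sender sets an interrupt flag and then calls a hypothetical `ubf_default`, which sends `SIGVTALRM` with `pthread_kill`. The handler is installed without `SA_RESTART`, so the blocked `read()` fails with `EINTR` and the thread gets a chance to check its interrupts. The signal can land just before `read()` starts, so the sender keeps re-sending it until the thread reports it has finished.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static atomic_bool interrupted;   /* stand-in for a pending ruby interrupt */
static atomic_bool finished;

/* Empty handler: its only job is to make the blocking syscall return EINTR. */
static void on_sigvtalrm(int sig) { (void)sig; }

/* The "ruby thread": blocks in read() and checks interrupts whenever it wakes. */
static void *blocked_reader(void *arg)
{
    int fd = *(int *)arg;
    char buf[1];
    while (!atomic_load(&interrupted)) {
        ssize_t n = read(fd, buf, sizeof buf);   /* blocks: nobody ever writes */
        if (n < 0 && errno == EINTR)
            continue;                            /* woken by the ubf: re-check */
        break;
    }
    atomic_store(&finished, true);
    return NULL;
}

/* Hypothetical stand-in for the default ubf: poke the target with SIGVTALRM. */
static void ubf_default(pthread_t target)
{
    pthread_kill(target, SIGVTALRM);
}

static bool unblock_demo(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigvtalrm;   /* no SA_RESTART, so read() fails with EINTR */
    sigemptyset(&sa.sa_mask);
    sigaction(SIGVTALRM, &sa, NULL);

    int fds[2];
    if (pipe(fds) != 0)
        return false;
    pthread_t th;
    pthread_create(&th, NULL, blocked_reader, &fds[0]);

    atomic_store(&interrupted, true);   /* 1. set the interrupt flag */
    /* 2. The first signal can arrive before read() starts, so keep re-sending,
     *    like the timer thread does for threads still stuck in a syscall. */
    while (!atomic_load(&finished)) {
        ubf_default(th);
        struct timespec ts = { 0, 1000 * 1000 };   /* 1 ms */
        nanosleep(&ts, NULL);
    }
    pthread_join(th, NULL);
    close(fds[0]);
    close(fds[1]);
    return true;
}
```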

## The Timer Thread

The timer thread has a few responsibilities. They are:

* Send interrupts to ruby threads that have run for their whole timeslice.
* Wake up M:N ruby threads (threads in non-main ractors) blocked on IO or after a specified timeout. This
uses `kqueue` or `epoll`, depending on the OS, to receive IO events on behalf of the threads.
* Keep sending the `SIGVTALRM` signal if a thread is still blocked on a syscall after the first `ubf` call.
* Signal native threads (`SNT`) waiting on a ractor if there are ractors waiting in the global run queue.
* Create more `SNT`s if some are blocked, like on IO or on `Ractor#join`.
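The timeslice part of the timer thread's job can be sketched as a thread that sleeps for one quantum and then sets a flag that the running thread polls at safe points. The 10 ms quantum and all names are assumptions for illustration. This sketch doesn't model the IO, M:N wakeup, or `SNT` duties listed above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

#define TIMESLICE_NS (10 * 1000 * 1000)   /* hypothetical 10 ms quantum */

static atomic_bool timeslice_expired;   /* stand-in for the thread's timer interrupt bit */
static atomic_bool timer_should_exit;

/* Timer thread: after each quantum, tell the running thread its time is up. */
static void *timer_thread(void *arg)
{
    (void)arg;
    struct timespec ts = { 0, TIMESLICE_NS };
    while (!atomic_load(&timer_should_exit)) {
        nanosleep(&ts, NULL);
        atomic_store(&timeslice_expired, true);
    }
    return NULL;
}

/* "Ruby thread": runs until it has given up its timeslice `want` times. */
static int run_until_yields(int want)
{
    pthread_t timer;
    int yields = 0;

    pthread_create(&timer, NULL, timer_thread, NULL);
    while (yields < want) {
        /* ... execute some "instructions", then check at a safe point ... */
        if (atomic_exchange(&timeslice_expired, false))
            yields++;   /* a real VM would switch to another ruby thread here */
    }
    atomic_store(&timer_should_exit, true);
    pthread_join(timer, NULL);
    return yields;
}
```

`run_until_yields(3)` spins until the timer thread has expired its timeslice three times, then returns `3`.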