ruby.git/yjit/src/asm/arm64/mod.rs, branch v3_2_11

YJIT: Respect destination num_bits on STUR (#6848)

2022-12-02T00:13:38+00:00

YJIT: fix 32 and 16 bit register store (#6840)

2022-12-01T15:53:50+00:00

* Fix 32 and 16 bit register store in YJIT

Co-Authored-By: Takashi Kokubun 

* Remove an unnecessary diff

* Reuse an rm_num_bits result

* Use u16::MAX instead

* Update the link

Co-authored-by: Alan Wu 

* Just use sturh for 16 bits

Co-authored-by: Takashi Kokubun 
Co-authored-by: Alan Wu

32 bit comparison on shape id

2022-11-18T20:04:10+00:00

This commit changes the shape id comparisons to use a 32 bit comparison
rather than 64 bit.  That means we don't need to load the shape id to a
register on x86 machines.

Given the following program:

```ruby
class Foo
  def initialize
    @foo = 1
    @bar = 1
  end

  def read
    [@foo, @bar]
  end
end

foo = Foo.new
foo.read
foo.read
foo.read
foo.read
foo.read

puts RubyVM::YJIT.disasm(Foo.instance_method(:read))
```

The machine code we generated _before_ this change is like this:

```
== BLOCK 1/4, ISEQ RANGE [0,3), 65 bytes ======================
  # getinstancevariable
  0x559a18623023: mov rax, qword ptr [r13 + 0x18]
  # guard object is heap
  0x559a18623027: test al, 7
  0x559a1862302a: jne 0x559a1862502d
  0x559a18623030: cmp rax, 4
  0x559a18623034: jbe 0x559a1862502d
  # guard shape, embedded, and T_OBJECT
  0x559a1862303a: mov rcx, qword ptr [rax]
  0x559a1862303d: movabs r11, 0xffff00000000201f
  0x559a18623047: and rcx, r11
  0x559a1862304a: movabs r11, 0xb000000002001
  0x559a18623054: cmp rcx, r11
  0x559a18623057: jne 0x559a18625046
  0x559a1862305d: mov rax, qword ptr [rax + 0x18]
  0x559a18623061: mov qword ptr [rbx], rax

== BLOCK 2/4, ISEQ RANGE [3,6), 0 bytes =======================
== BLOCK 3/4, ISEQ RANGE [3,6), 47 bytes ======================
  # gen_direct_jmp: fallthrough
  # getinstancevariable
  # regenerate_branch
  # getinstancevariable
  # regenerate_branch
  0x559a18623064: mov rax, qword ptr [r13 + 0x18]
  # guard shape, embedded, and T_OBJECT
  0x559a18623068: mov rcx, qword ptr [rax]
  0x559a1862306b: movabs r11, 0xffff00000000201f
  0x559a18623075: and rcx, r11
  0x559a18623078: movabs r11, 0xb000000002001
  0x559a18623082: cmp rcx, r11
  0x559a18623085: jne 0x559a18625099
  0x559a1862308b: mov rax, qword ptr [rax + 0x20]
  0x559a1862308f: mov qword ptr [rbx + 8], rax
```

After this change, it's like this:

```
== BLOCK 1/4, ISEQ RANGE [0,3), 41 bytes ======================
  # getinstancevariable
  0x5560c986d023: mov rax, qword ptr [r13 + 0x18]
  # guard object is heap
  0x5560c986d027: test al, 7
  0x5560c986d02a: jne 0x5560c986f02d
  0x5560c986d030: cmp rax, 4
  0x5560c986d034: jbe 0x5560c986f02d
  # guard shape
  0x5560c986d03a: cmp word ptr [rax + 6], 0x19
  0x5560c986d03f: jne 0x5560c986f046
  0x5560c986d045: mov rax, qword ptr [rax + 0x10]
  0x5560c986d049: mov qword ptr [rbx], rax

== BLOCK 2/4, ISEQ RANGE [3,6), 0 bytes =======================
== BLOCK 3/4, ISEQ RANGE [3,6), 23 bytes ======================
  # gen_direct_jmp: fallthrough
  # getinstancevariable
  # regenerate_branch
  # getinstancevariable
  # regenerate_branch
  0x5560c986d04c: mov rax, qword ptr [r13 + 0x18]
  # guard shape
  0x5560c986d050: cmp word ptr [rax + 6], 0x19
  0x5560c986d055: jne 0x5560c986f099
  0x5560c986d05b: mov rax, qword ptr [rax + 0x18]
  0x5560c986d05f: mov qword ptr [rbx + 8], rax
```

The first ivar read is a bit more complex, but the second ivar read is
much simpler.  I think eventually we could teach the context about the
shape, then emit only one shape guard.

Implement LDURH on Aarch64

2022-11-15T01:04:50+00:00

When RUBY_DEBUG is enabled, shape ids are 16 bits.  I would like to do
16 bit comparisons, so I need to load halfwords sometimes.  This commit
adds LDURH so that I can load halfwords.

  https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/LDURH--Load-Register-Halfword--unscaled--?lang=en

I verified the bytes using clang:

```
$ cat asmthing.s
.global _start
.align 2

_start:
  ldurh w10, [x1]
  ldurh w10, [x1, #123]
$ as asmthing.s -o asmthing.o && objdump --disassemble asmthing.o

asmthing.o:	file format mach-o arm64

Disassembly of section __TEXT,__text:

0000000000000000 :
       0: 2a 00 40 78  	ldurh	w10, [x1]
       4: 2a b0 47 78  	ldurh	w10, [x1, #123]
```

YJIT: fix ARM64 bitmask encoding for 32 bit registers (#6503)

2022-10-06T22:41:38+00:00

For logical instructions such as AND, there is a constraint that the N
part of the bitmask immediate must be 0. We weren't respecting this
condition previously and were silently emitting undefined instructions.

Check for this condition in the assembler and tweak the backend to
correctly detect whether a number could be encoded as an immediate in a
32 bit logical instruction. Due to the nature of the immediate encoding,
the same numeric value encodes differently depending on the size of
the register the instruction works on.

We currently don't have cases where we use 32 bit immediates but we ran
into this encoding issue during development.

A bunch of clippy auto fixes for yjit (#6476)

2022-09-30T15:14:55+00:00

Change IncrCounter lowering on AArch64 (#6455)

2022-09-27T20:58:01+00:00

* Change IncrCounter lowering on AArch64

Previously we were using LDADDAL which is not available on
Graviton 1 chips. Instead, we're going to use an exclusive
load/store group through the LDAXR/STLXR instructions.

* Update yjit/src/backend/arm64/mod.rs

Co-authored-by: Maxime Chevalier-Boisvert

YJIT: Add Opnd#with_num_bits to use only 8 bits (#6359)

2022-09-14T14:27:52+00:00

* YJIT: Add Opnd#sub_opnd to use only 8 bits

* Add with_num_bits and let arm64_split use it

* Add another assertion to with_num_bits

* Use only with_num_bits

Better offsets (#6315)

2022-09-09T15:37:41+00:00

* Introduce InstructionOffset for AArch64

There are a lot of instructions on AArch64 where we take an offset
from PC in terms of the number of instructions. This is for loading
a value relative to the PC or for jumping.

We were usually accepting an A64Opnd or an i32. It can get
confusing and inconsistent though because sometimes you would
divide by 4 to get the number of instructions or multiply by 4 to
get the number of bytes.

This commit adds a struct that wraps an i32 in order to keep all of
that logic in one place. It makes it much easier to read and reason
about how these offsets are getting used.

* Use b instruction when the offset fits on AArch64

Better b.cond usage on AArch64 (#6305)

2022-08-31T19:44:26+00:00

* Better b.cond usage on AArch64

When we're lowering a conditional jump, we previously had a bit of
a complicated setup where we could emit a conditional jump to skip
over a jump that was the next instruction, and then write out the
destination and use a branch register.

Now instead we use the b.cond instruction if our offset fits (not
common, but not unused either) and if it doesn't we write out an
inverse condition to jump past loading the destination and
branching directly.

* Added an inverse fn for Condition (#443)

Prevents the need to pass two params and potentially reduces errors.

Co-authored-by: Jimmy Miller 

Co-authored-by: Maxime Chevalier-Boisvert 
Co-authored-by: Jimmy Miller