Connor Abbott edited this page Jul 31, 2017 · 20 revisions

Introduction

This page aims to describe, in a hopefully understandable way, what I’ve discovered so far about the instruction set for ARM’s Bifrost architecture. In addition to the technical details, I’ll try to give a higher-level view of what’s going on; that is, why I think ARM did something, and the design tradeoffs involved, rather than just the instruction encoding and what it does. Some of this can be ascertained from patents, articles, etc., but some of it will necessarily be guesswork.

Public information from ARM

ARM has publicly explained various parts of the ISA in various places. Obviously, any public information from ARM is incredibly useful in the reverse-engineering process. The two main sources of information are articles/presentations and patents. Articles and presentations are aimed at a general audience, and so they may not go into the detail required to actually understand the ISA, but they are good at giving a broad overview of ARM’s intentions in designing the microarchitecture. Patents tend to be more specific, since ARM needs to describe the technical details more precisely for their patent to be accepted, but they’re also full of legalese that can be confusing and tedious at times.

Reading patents

Note: the usual IANAL disclaimer applies here. The information here is only meant to be used for reverse engineering, I can’t vouch for its accuracy, don’t use it for anything legally significant without consulting a lawyer first, etc. You’ve been warned!

Usually, for our purpose, it’s better to skip the claims section and focus on the "embodiments" section. The claims section is important legally, since it governs when ARM can sue someone for patent infringement (in some sense, it’s the "meat" of the patent), but we aren’t concerned about that. The embodiments section, on the other hand, is supposed to provide the patent examiner and a potential judge/jury context about the invention it’s disclosing by describing how ARM is actually using it in its own products. An "embodiment" here is just a way that the invention could actually be used in an actual, shipping product. Technically, how the patent is actually used in practice is irrelevant from a legal point of view, so you’ll see language like "in one possible embodiment, …​", but it’s usually obvious which "possible embodiment" is the one that ARM is actually using — the patent will go into detail on only one of the possible embodiments, or explicitly talk about a "preferred embodiment" which is usually the one they’re actually using.

  • Anandtech article, gives a general overview

  • UK Patent Application GB2540970A, on the Bifrost clause architecture

  • US Patent Application US 2016-0364209 A1, describing a way to implement more efficient argument reduction for logarithms using a lookup table. It seems that the Itanium math libraries used the exact same technique, described in a paper here, specifically in the "novel reduction" section. Yes, this means the patent application is probably bogus. No, I’m not a lawyer.

Overview

Bifrost is a clause-oriented architecture. Instructions are grouped into clauses that consist of instructions that can be executed back-to-back without any scheduling decisions. A clause consists of a clause header with information required for scheduling followed by a series of instructions and constants (immediates) referenced by the instructions. Instead of choosing instructions to run, the scheduler chooses clauses to run. As we’ll see, instruction decoding is also per-clause instead of per-instruction. Instructions and immediates are packed into a clause and unpacked at run-time. That way, instructions can be packed to save space, and the unpacking cost is amortized over the entire clause.

Internally, each instruction consists of a register read/write stage, an FMA pipeline stage, and an ADD pipeline stage. The register stage is decoupled in the ISA from the functional units, which simplifies the decode hardware. The register stage bits describe which registers to read/write with various ports, while the FMA and ADD stages have only 3 bits for each source, describing which port that source reads from. In addition, the result of each stage from the previous cycle appears as another "port", allowing instructions to bypass the register file, which saves power and reduces register pressure, avoiding spilling. The register stage can also read a constant embedded in the clause or a uniform register, which is yet another "port." Finally, there’s one last port, which always reads as 0 in the FMA unit but gives the result of the FMA unit from the same instruction when read in the ADD unit. For example, an unfused multiply-add is done by doing a multiply in the FMA unit, using this port (reading 0) as the addend, and then an addition in the ADD unit using the same port, which now carries the result of the multiply.
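As a concrete illustration, here is a minimal Python sketch of that two-stage datapath. The port names and the `execute_instruction` helper are invented for illustration; only the routing behavior — the special port reading as 0 in the FMA stage but as the FMA result in the ADD stage — comes from the description above.

```python
def execute_instruction(ports, fma_op, add_op):
    """Run one instruction through the FMA stage, then the ADD stage.

    `ports` maps (hypothetical) port names to values read in the register
    stage. The special source "zero_or_fma" reads as 0 in the FMA stage,
    but as the FMA stage's result in the ADD stage.
    """
    fma_result = fma_op(lambda src: 0 if src == "zero_or_fma" else ports[src])
    add_result = add_op(lambda src: fma_result if src == "zero_or_fma" else ports[src])
    return fma_result, add_result

# Unfused multiply-add a*b + c: multiply in the FMA unit with the
# always-0 port as the addend, then add c in the ADD unit using the
# port that now carries the FMA result.
fma_out, add_out = execute_instruction(
    {"port0": 3.0, "port1": 4.0, "port2": 10.0},
    fma_op=lambda read: read("port0") * read("port1") + read("zero_or_fma"),
    add_op=lambda read: read("zero_or_fma") + read("port2"),
)
# fma_out is 3.0 * 4.0 + 0 = 12.0; add_out is 12.0 + 10.0 = 22.0
```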

Note that in addition to the execution unit, which we’ve been describing, there are a few other units:

  • Varying interpolation unit

  • Attribute fetching unit

  • Load/store unit

  • Texture unit

The execution unit interacts with these fixed-function blocks through special, variable-latency instructions in the FMA and ADD units. They bypass the usual, fixed-latency mechanism for reading/writing registers, and as such instructions in the same clause can’t assume that registers have been read/written after the instruction is done (TODO: verify this for registers being read, and not just written). Instead, any dependent instructions must be put into a separate clause with the appropriate dependencies in the clause header set.

Clauses

Conceptually, each clause consists of a clause header, followed by one or more 78-bit instruction words and then zero or more 60-bit constants. (Constants are actually 64 bits, but they’re loaded through the same port as uniform registers and share the same field in the instruction word. That field includes 7 bits to choose which uniform register to load, more than a constant needs, so ARM decided to be clever and stick the bottom 4 bits of each constant into the instruction word that loads it; the constants in the instruction stream are therefore only 60 bits.) However, the instruction fetching hardware only works in 128-bit quadwords, so each clause has to be a multiple of 128 bits. To make the representation of clauses as compact as possible while still keeping the decoding circuitry relatively simple, the instructions are packed so that two 128-bit quadwords can store three 78-bit instructions, or three 128-bit quadwords can store four instructions and a 60-bit constant. There were some bits left over, which seem to have been used to make the decoding circuitry a little simpler by reducing the need to keep track of state between words.

Quadword formats

The bottom 8 bits of each 128-bit quadword are a "tag" read by the decoding circuitry. The tag describes how to interpret the rest of the quadword, and can also hold a few bits of the instructions being decoded when there were bits to spare. The format of the tag is described below.

Field Bits
C 1
S 1
fmt1 3
fmt2 3

In most cases, except as noted, the next 75 bits specify the low 75 bits of an instruction word. Where the high 3 bits come from, and where the instruction word is placed in the stream, varies though. The C and fmt1 bits select among various formats for the rest of the quadword, described below. The S bit usually means that the clause is ending, apart from the exceptions described below.
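In code, pulling the tag apart might look like the following sketch. The ordering of the fields within the 8-bit tag is an assumption made for illustration; only the field names and widths come from the description above.

```python
def decode_tag(tag):
    """Split an 8-bit quadword tag into its C, S, fmt1, and fmt2 fields.

    NOTE: the bit positions used here are assumed, not confirmed; the text
    only establishes that the tag holds a 1-bit C, a 1-bit S, and two
    3-bit fields fmt1 and fmt2.
    """
    return {
        "C": tag & 1,
        "S": (tag >> 1) & 1,
        "fmt1": (tag >> 2) & 0b111,
        "fmt2": (tag >> 5) & 0b111,
    }
```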

Format when fmt1 == 5

This format is used for the first quadword in a multi-quadword clause. The next 75 bits are the low 75 bits of instruction 0 in the stream, while the high 3 bits of instruction 0 are stored in fmt2. The highest 45 bits of the quadword store the clause header.

Format when fmt1 == 1

This is similar to the format above, except that it’s used when the clause consists of only one quadword. The S bit must be set to indicate that the clause stops after this quadword. It’s not yet clear why this format is separate from the previous one.

Format when fmt1 == 4

When this format is used, the next 75 bits, plus fmt2, specify either instruction 1 if S is not set, or instruction 4 if S is set. The S bit can be reused here, since this quadword can never end a clause. The next 45 bits hold the low 45 bits of the next instruction in the stream (either instruction 2 or 5). Normally, the next quadword has fmt1 == 0, so that the two quadwords together contain instructions 1 through 3 or 4 through 6.

Format when fmt1 == 0

As usual, the next 75 bits and fmt2 together describe an instruction. If a quadword with this format follows one with fmt1 == 4, then this instruction is the next instruction after the partially-filled-in one created by the previous quadword. The next 30 bits after that, plus the high 3 bits of the quadword, give the rest of the 78-bit instruction that was partially specified by the previous quadword. This leaves 12 bits, which seem to be additional "tag" bits. I’ve only seen the second-highest and third-highest bits set. I’m not entirely sure what they do, but if they’re both 0, then the 75 bits seem to be interpreted as a constant instead of an instruction, with the high 15 bits ignored (set to 0) and fmt2 playing the same role as when fmt1 == 6/7.

If that’s not the case, i.e. this follows a quadword with a different format from 0, then it just emits a single instruction following whatever instruction was last completed. This is used for the final instruction if there are 2 or 5 instructions in a clause.

Note that it’s possible for the clause to end after a quadword with this format, in which case S is set to 1.

Format when C == 1

When C is set to 1, fmt1 has a different meaning. The quadword when C == 1 is similar to when fmt1 == 0, except that the high 3 bits of the regular instruction come from fmt1 instead of fmt2, and fmt2 stores the high 3 bits of the partial instruction that started in the last quadword. The extra 12 tag bits are also omitted, meaning that the high 15 bits of the quadword are freed up; they’re used to store the low 15 bits of a constant. Where the constant is emitted is based on the S bit (this bit can be reused because this quadword can never end a clause, since the rest of the constant hasn’t been specified yet). Either this is the third quadword in a clause, in which case S must be 0, or it’s the fifth, in which case S must be 1. Presumably, the S bit controls where the constant goes, except that instead of any possible location as is usual (see the description of fmt1 == 6/7), there are only two hardcoded locations based on the usual positions of this type of quadword in the instruction stream, since the designers were running low on available bits.

Format when fmt1 == 2 or 3

This format always comes after a quadword where C == 1. As usual, the next 75 bits plus fmt2 form an instruction after the last created one. The remaining 45 bits finish the 60-bit constant started in the previous quadword. The low bit of fmt1 must match S from the previous quadword, so that we know where to insert the constant.
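Reassembling such a split constant is simple concatenation: the C == 1 quadword contributes the low 15 bits and the fmt1 == 2/3 quadword the remaining 45. A sketch (function name is mine):

```python
def assemble_split_constant(low15, high45):
    """Concatenate the 15-bit piece from a C == 1 quadword with the
    45-bit piece from the following fmt1 == 2/3 quadword into the
    full 60-bit constant."""
    assert 0 <= low15 < (1 << 15) and 0 <= high45 < (1 << 45)
    return (high45 << 15) | low15
```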

Format when fmt1 == 6 or 7

This format creates two 60-bit literals out of the remaining 120 bits. fmt2, plus the low bit of fmt1, seems to indicate where the constant is placed in the instruction stream. It increases when this quadword comes later in the clause, after more instructions have been emitted. Presumably, there is a shared buffer for decoded instructions and constants, and this tells the decoder where to put the constant. This is usually implied, or partially implied, for the other quadword formats, but fully spelled out here since there are bits to spare; this probably simplifies the decoder a little without bloating code size.

Algorithm for packing clauses

This section describes how the blob compiler seems to use these formats to pack as many instructions and constants into as few words as possible. There may be other equivalent ways to do it, but given how complicated all the different formats are, and the fact that it’s not known how much information on what to do is computed from the current quadword vs. how much is deduced from the previous quadwords, it’s probably best to stick with what the blob does.

First, we assign instructions to quadwords. We may assign an entire instruction to a quadword, or we may split an instruction across two quadwords, in which case it is represented using fmt1 == 4 and fmt1 == 0 (or C == 1) quadwords.

  • Assign the first instruction to the first quadword.

  • Assign the second instruction to the second quadword.

  • Split the third instruction across the second and third quadwords.

  • Assign the fourth instruction to the third quadword.

  • Assign the fifth instruction to the fourth quadword.

  • Split the sixth instruction across the fourth and fifth quadwords.

  • Assign the seventh instruction to the fifth quadword.

Simply go down the list until there are no more instructions left.
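A sketch of how the blob seems to assign instructions, assuming the pattern repeats in groups of three after the first instruction (one instruction alone in the first quadword, then three instructions per pair of quadwords with the middle one split across both):

```python
def assign_instructions(n):
    """Map 0-based instruction index -> list of 0-based quadword indices.

    Quadword 0 holds instruction 0 (plus the clause header); each later
    pair of quadwords holds three instructions, the middle one split.
    """
    layout = {}
    for i in range(n):
        if i == 0:
            layout[i] = [0]
            continue
        group, pos = divmod(i - 1, 3)  # groups of 3 after the first instruction
        base = 1 + 2 * group           # first quadword of this pair
        if pos == 0:
            layout[i] = [base]             # whole instruction in one quadword
        elif pos == 1:
            layout[i] = [base, base + 1]   # split across the pair
        else:
            layout[i] = [base + 1]
    return layout
```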

Now, we assign constants to quadwords. We do this by looking at the last quadword, and do the following:

  • If it only contains an instruction that was split across two quadwords, we must have fmt1 == 0. Put the constant where the next instruction would have gone, and set the 12 high tag bits to 0 to indicate we have a constant.

  • If it only contains one instruction, and the previous quadword has an instruction and a split instruction, then we can change the second-to-last quadword to have C == 1 and the last to fmt1 == 2/3 and spread the constant across those two quadwords.

For any remaining constants, we simply add quadwords with fmt1 == 6/7. Note that if there were no constants, and the last quadword only has a split instruction, we need to add a "dummy" constant to keep the processor from executing anything during the empty slot where an instruction could’ve gone.

Clause Header

TODO figure this out

  • Program stop

  • 32 vs. 64 bit

  • whether a variable-latency instruction exists

  • dependencies

  • "back-to-back" mode from patent?

Instructions

Now that we know how instructions and constants are packed into clauses, let’s take a look at the instructions themselves. Each instruction is executed in 3 stages: the register read/write stage, the FMA stage, and the ADD stage, each of which corresponds to a part of the 78-bit instruction word. We’ll describe what they do, and the format of their part of the instruction word, in the next sections, going in the order they’re executed, which is also the order of the bits in the instruction word.

Register read/write (35 bits)

As the name suggests, this stage reads from and writes to the register file. The current instruction reads from the register file at the same time that the previous instruction writes to it. Thus, the field contains both the reads for the current instruction and the writes for the previous instruction. Presumably, the scheduler makes this happen by interleaving the execution of multiple clauses from different quads. It only executes one instruction from a given quad every 3 cycles, so that the register write phase of one instruction happens at the same time as the register read phase of the next. Of course, it’s possible that the FMA and ADD stages take more than 1 cycle, and more threads are interleaved as a consequence; this is a microarchitectural decision that’s not visible to us. The result is that a write to a register that’s immediately read by the next instruction won’t work, but that’s never necessary anyway thanks to the passthrough sources detailed later.

The register file has four ports, two read ports, a read/write port, and a write port. Thus, up to 3 registers can be read during an instruction. These ports are represented directly in the instruction word, with a field for telling each port the register address to use. There are three outputs, corresponding to the three read ports, and two inputs, corresponding to the FMA and ADD results from the previous stage. The ports are controlled through what the ARM patent calls the "register access descriptor," which is a 4-bit entry that says what each of the ports should do. Finally, there is the uniform/const port, which is responsible for loading uniform registers and constants embedded in the clause. Note that the uniforms and constants share the same port, which means that only one uniform or one constant (but not both) can be loaded for an instruction. This port supplies 64 bits of data, though, so two 32-bit parts of the same 64-bit value can be accessed in the same instruction.

The format of the register part of the instruction word is as follows:

Field Bits
Uniform/const 8
Port 2 (read/write) 6
Port 3 (write) 6
Port 0 (read) 5
Port 1 (read) 6
Control 4

Control is what ARM calls the "register access descriptor." To save bits, if the Control field is 0, then Port 1 is disabled, and the field for Port 1 instead contains the "real" Control field in the upper 4 bits. Bit 1 is set to 1 if Port 0 is disabled, and bit 0 is unknown (always 0?). If the Control field isn’t 0, then both Port 0 and Port 1 are always enabled. In this way, the Control field only needs to describe how Port 2 and Port 3 are configured, except for the magic 0 value, reducing the number of bits required.
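A sketch of that special case, assuming the real Control bits sit in the upper 4 bits of the 6-bit Port 1 field as described (the function name and return convention are mine):

```python
def effective_control(control, port1_field):
    """Apply the Control == 0 special case: the real 4-bit Control lives
    in the upper 4 bits of the 6-bit Port 1 field, Port 1 is disabled,
    and bit 1 of the Port 1 field set means Port 0 is disabled too.

    Returns (control, port0_enabled, port1_enabled).
    """
    if control == 0:
        real_control = (port1_field >> 2) & 0xF
        port0_enabled = ((port1_field >> 1) & 1) == 0
        return real_control, port0_enabled, False
    # Nonzero Control: both read ports are always enabled.
    return control, True, True
```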

Before we get to the actual format of the Control field, though, we need to describe one more subtlety. Each instruction’s register field contains the writes for the previous instruction, but what about the writes of the last instruction in the clause? Clauses should be entirely self-contained, so we can’t look at the first instruction in the next clause. The answer turns out to be that the first instruction in the clause contains the writes for the last instruction. There are a few extra values for the control field, marked "first instruction," which are only used for the first instruction of a clause. The reads are processed normally, but the writes are delayed until the very end of the clause, after the last instruction. The list of values for the control field is below:

Value Meaning
1 Write FMA with Port 2
3 Write FMA with Port 2, read with Port 3
4 Read with Port 3
5 Write ADD with Port 2
6 Write ADD with Port 2, read with Port 3
8 Nothing, first instruction
9 Write FMA, first instruction
11 Nothing
12 Read with Port 3, first instruction
15 Write FMA with Port 2, write ADD with Port 3

TODO: this isn’t complete, missing a few "first instruction" ones.

The uniform/const field describes the uniform/const port. If the high bit is set, then the low 7 bits select which pair of 32-bit uniform registers to load. For example, 10000001 would load from uniform registers 2 and 3. If the high bit isn’t set, then the next-highest 3 bits indicate which 64-bit constant to load, while the low 4 bits contain the low 4 bits of the constant. The mapping from bits to constants is a little strange:

Field value Constant loaded
4 const0/const1
5 const2/const3
6 const4/const5
7 const6/const7
2 const8/const9
0 unused?
1 unused?
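Putting the field layout and the table together, decoding the uniform/const field might look like this sketch (the function name and return convention are invented for illustration):

```python
# Field-value -> constant-pair mapping from the table above.
CONST_SELECT = {4: 0, 5: 1, 6: 2, 7: 3, 2: 4}

def decode_uniform_const(field):
    """Decode the 8-bit uniform/const field.

    High bit set: the low 7 bits pick a pair of 32-bit uniform registers.
    High bit clear: bits 6:4 select a 64-bit clause constant (via the
    table above), and bits 3:0 hold the low 4 bits of that constant.
    """
    if field & 0x80:
        pair = field & 0x7F
        return ("uniform", 2 * pair, 2 * pair + 1)
    pair = CONST_SELECT[(field >> 4) & 0x7]
    low4 = field & 0xF
    return ("const", 2 * pair, 2 * pair + 1, low4)
```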

Source fields

When the FMA and ADD stages want to use the result of the register stage, they do so through a 3-bit source field in the instruction word. There are as many source fields as there are sources for each operation. The following table shows the meaning of this field:

Field value Meaning
0 Port 0
1 Port 1
2 Port 3
3 FMA: always 0; ADD: result of the FMA unit from the same instruction
4 Low 32 bits of uniform/const
5 High 32 bits of uniform/const
6 Result of FMA from previous instruction (FMA passthrough)
7 Result of ADD from previous instruction (ADD passthrough)

FMA (23 bits)

Both the FMA and ADD units have various instruction formats. The high bits are always an opcode of varying length. The opcodes have the property that no opcode from one format is a prefix of an opcode in a different format, which guarantees that no instruction is ambiguous. Since there’s no format tag, it would seem that decoding which format each instruction has is complicated, although it’s possible that some trick is used to speed it up. In the disassembler, we just try matching each known opcode against the instruction bits, masking off the irrelevant bits. I’m only going to list the format categories here, and not the actual opcodes; I’ll leave the latter to the disassembler source code, simply because there are a lot of them, and typing them all up would be tedious and error-prone. The disassembler is at least easy to test, so the chances of making a mistake are lower.
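The mask-and-match approach can be sketched as follows. The opcode values below are made up for illustration (the real tables live in the disassembler source); the prefix-free property is what guarantees at most one entry matches a given word.

```python
# Hypothetical (name, value, mask) opcode table for the 23-bit FMA field.
# A 5-bit opcode occupies bits 22:18; an 8-bit opcode occupies bits 22:15.
FMA_OPS = [
    ("EXAMPLE_FIVE_BIT_OP",  0b11011 << 18,    0b11111 << 18),
    ("EXAMPLE_EIGHT_BIT_OP", 0b10100110 << 15, 0b11111111 << 15),
]

def match_opcode(word, table=FMA_OPS):
    """Try each opcode, masking off the non-opcode (operand) bits.

    Because no opcode is a prefix of another, at most one entry can
    match, so the table order doesn't matter.
    """
    for name, value, mask in table:
        if word & mask == value:
            return name
    return None
```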

One Source (FMAOneSrc)

Field Bits Src0 3 Opcode 20

Two Source (FMATwoSrc)

Field Bits
Src0 3
Src1 3
Opcode 17

Floating-point Comparisons (FMAFcmp)

Field Bits
Src0 3
Src1 3
Src1 absolute value 1
unknown 1
Src1 negate 1
unknown 3
Src0 absolute value 1
Comparison op 3
Opcode 7

Where the comparison ops are given by:

Value Meaning
0 Ordered Equal
1 Ordered Greater Than
2 Ordered Greater Than or Equal
3 Unordered Not-Equal
4 Ordered Less Than
5 Ordered Less Than or Equal

Two Source with Floating-point Modifiers (FMATwoSrcFmod)

Field Bits
Src0 3
Src1 3
unknown 1
Src0 negate 1
Src1 negate 1
unknown 6
Outmod 2
Opcode 6

The output modifier (Outmod) is given by:

Value Meaning
0 Nothing
1 max(output, 0)
2 unknown (used in sin/cos)
3 saturate

Three Source (FMAThreeSrc)

Field Bits
Src0 3
Src1 3
Src2 3
Opcode 14

Three Source with Floating Point Modifiers (FMAThreeSrcFmod)

Field Bits
Src0 3
Src1 3
Src2 3
unknown 6
Outmod 2
Src0 negate 1
Src1 negate 1
Opcode 4

Four Source (FMAFourSrc)

Field Bits
Src0 3
Src1 3
Src2 3
Src3 3
Opcode 11

ADD (20 bits)

The instruction formats for ADD are similar, except that ADD can only have up to two sources instead of FMA’s four.

One Source (ADDOneSrc)

Field Bits
Src0 3
Opcode 17

Two Source (ADDTwoSrc)

Field Bits
Src0 3
Src1 3
Opcode 14

Two Source with Floating-point Modifiers (ADDTwoSrcFmod)

Field Bits
Src0 3
Src1 3
unknown 1
Src0 negate 1
Src1 negate 1
unknown 2
Outmod 2
Opcode 7

Floating-point Comparisons (ADDFcmp)

Field Bits
Src0 3
Src1 3
Comparison op 3
unknown 2
Src0 absolute value 1
Src1 absolute value 1
Src0 negate 1
Opcode 6
