Overall design of Dalvik bytecode

Latest recommended article published on 2022-07-11 07:35:00


License: CC 4.0 BY-SA


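Dex bytecode loosely imitates real instruction-set architectures and C-style calling conventions. It uses a register-based machine model in which the size of each frame is fixed once the frame is created. A frame holds a method-specified number of registers plus adjunct data such as the program counter and a reference to the containing .dex file. Arguments land, in order, in the last registers of the frame, and wide (64-bit) arguments occupy two registers. The instruction stream is built from 16-bit unsigned storage units; some instructions are type-general and others are type-specific. The instruction formats and the constant pools help with optimization and execution efficiency.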

Dalvik bytecode: overall design

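The machine model and calling conventions are meant to roughly imitate common real instruction-set architectures and C-style calling conventions:
- The machine that executes the instructions is register-based, and the size of a frame is fixed at creation time. Each frame consists of a particular number of registers (specified by the method) plus the adjunct data needed to execute the method, for example the program counter (PC) and a reference to the .dex file that contains the method.
- When registers hold bit values (such as int or float), they are considered 32 bits wide. An adjacent pair of registers holds a 64-bit value. Register pairs have no alignment requirement.
- When registers hold object references, they are considered wide enough to hold exactly one reference (in effect, register width = pointer size).
- In terms of bitwise representation, (Object) null == (int) 0.
- The N arguments to a method land, in order, in the last N registers of the method's invocation frame. Wide arguments take two registers. Instance methods receive a this reference as their first argument.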
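The argument-placement rule above can be sketched in a few lines. This is only an illustration: the function name `assign_argument_registers` and its parameters are hypothetical, and the type names are plain strings chosen for the example rather than real dex type descriptors.

```python
def assign_argument_registers(registers_size, param_types, is_static):
    """Map each incoming argument to its register(s) in the callee frame.

    registers_size: total number of registers in the frame (set by the method).
    param_types:    declared parameter types, in order (illustrative strings).
    is_static:      instance methods get an implicit `this` as first argument.
    """
    # Wide (64-bit) types occupy an adjacent register pair; everything else one.
    widths = [] if is_static else [("this", 1)]
    for t in param_types:
        widths.append((t, 2 if t in ("long", "double") else 1))

    ins_size = sum(w for _, w in widths)
    if ins_size > registers_size:
        raise ValueError("frame has fewer registers than the arguments need")

    # Arguments land, in order, in the LAST ins_size registers of the frame.
    reg = registers_size - ins_size
    mapping = []
    for name, w in widths:
        mapping.append((name, [f"v{reg + i}" for i in range(w)]))
        reg += w
    return mapping


# A 6-register instance method taking (int, long):
layout = assign_argument_registers(6, ["int", "long"], is_static=False)
for name, regs in layout:
    print(name, regs)
```

Here v0 and v1 stay free for locals, `this` lands in v2, the int in v3, and the long in the pair v4/v5.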
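The storage unit in the instruction stream is a 16-bit unsigned quantity. Some bits in some instructions are ignored or must be zero.
Instructions are not gratuitously limited to a particular type. For example, an instruction that moves a 32-bit register value without interpreting it does not have to specify whether it is moving an int or a float.
There are separately enumerated and indexed constant pools for references to strings, types, fields, and methods.
Bitwise literal data is represented inline in the instruction stream.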
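Because, in practice, it is uncommon for a method to need more than 16 registers, and because needing more than eight registers is reasonably common, many instructions are limited to addressing only the first 16 registers. When reasonably possible, instructions allow references to up to the first 256 registers. In addition, some instructions have variants that allow much larger register counts, including a pair of catch-all move instructions that can address registers in the range v0 – v65535. When no instruction variant can address a desired register, the register contents are expected to be moved from the original register to a low register before the operation, and/or moved from a low result register to a high register after it.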
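A minimal sketch of how a tool might apply these register limits. The three mnemonics move, move/from16 and move/16 are the real 32-bit move variants; the helper names, the use of add-int/2addr (whose operands are 4-bit register numbers) as the example operation, and the assumption that v0 and v1 are free scratch registers are all illustrative.

```python
def pick_move_variant(dst, src):
    """Return the narrowest 32-bit move variant that can address both registers."""
    if not (0 <= dst <= 0xFFFF and 0 <= src <= 0xFFFF):
        raise ValueError("register number out of range v0 - v65535")
    if dst <= 0xF and src <= 0xF:
        return "move"          # vA, vB: 4 bits each
    if dst <= 0xFF:
        return "move/from16"   # vAA, vBBBB: 8-bit dest, 16-bit source
    return "move/16"           # vAAAA, vBBBB: 16 bits each


def emit_binop_2addr(op, dst, src, scratch=(0, 1)):
    """Emit `op/2addr vA, vB`, shuffling high registers through low ones.

    Assumption for this sketch: the scratch registers hold nothing live and
    are not themselves operands. A real tool has to guarantee that.
    """
    if {dst, src} & set(scratch):
        raise ValueError("scratch registers must not overlap the operands")
    lines = []
    a, b = dst, src
    if dst > 0xF:   # move the first operand down before the operation
        a = scratch[0]
        lines.append(f"{pick_move_variant(a, dst)} v{a}, v{dst}")
    if src > 0xF:   # same for the second operand
        b = scratch[1]
        lines.append(f"{pick_move_variant(b, src)} v{b}, v{src}")
    lines.append(f"{op}/2addr v{a}, v{b}")
    if dst > 0xF:   # move the result back up after the operation
        lines.append(f"{pick_move_variant(dst, a)} v{dst}, v{a}")
    return lines


sequence = emit_binop_2addr("add-int", 20, 3)
print("\n".join(sequence))
```

In practice a compiler would more likely pick a wider variant of the operation itself where one exists; the shuffle is the fallback the text describes.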
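There are several "pseudo-instructions" that hold variable-length data payloads, which are referred to by regular instructions (for example, fill-array-data; see also https://stackoverflow.com/questions/19721477/when-will-the-instruction-filled-new-array-appear ). Such instructions must never be encountered during the normal flow of execution. In addition, they must be located at even-numbered bytecode offsets (that is, 4-byte aligned). To meet this requirement, dex generation tools must emit an extra nop instruction as a spacer if a payload would otherwise be unaligned. Finally, though it is not required, most tools are expected to emit these payloads at the ends of methods, since otherwise additional instructions would probably be needed to branch around them (that is, to jump over the payload so that execution never runs into it).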
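The alignment rule can be sketched as follows, assuming offsets are counted in 16-bit storage units (so an even offset is a multiple of 4 bytes). The helper name `append_payload` is hypothetical, and the payload contents are only an illustrative fill-array-data payload, not output from a real tool.

```python
NOP = 0x0000  # nop occupies one 16-bit unit, all zero


def append_payload(code_units, payload_units):
    """Append a data payload to a method's code, padding with nop if needed.

    code_units is a list of 16-bit units. The payload must start at an
    even-numbered unit offset. Returns the offset where the payload begins.
    """
    if len(code_units) % 2 != 0:
        code_units.append(NOP)      # spacer so the payload is 4-byte aligned
    offset = len(code_units)
    code_units.extend(payload_units)
    return offset


# Illustrative payload: ident 0x0300, element width 1, two elements (1, 2).
payload = [0x0300, 0x0001, 0x0002, 0x0000, 0x0201]

code = [0x000E]                     # a lone return-void, so the next offset is odd
offset = append_payload(code, payload)
print(offset, [hex(u) for u in code])
```

Placing the payload after the final return, as here, means no extra branch is needed to keep execution out of it.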
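When installed on a running system, some instructions may be altered, changing their format, as an install-time static linking optimization. This allows faster execution once linkage is known. See the instruction formats document for the suggested variants. They are only suggestions; implementing them is not mandatory.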
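Human-readable syntax and mnemonics:
- Arguments use dest-then-source ordering.
- Some opcodes have a disambiguating name suffix that indicates the type(s) they operate on:
  - Type-general 32-bit opcodes are unmarked.
  - Type-general 64-bit opcodes are suffixed with -wide.
  - Type-specific opcodes are suffixed with their type, one of: -boolean -byte -char -short -int -long -float -double -object -string -class -void.
- Some opcodes have a disambiguating opcode suffix to distinguish otherwise-identical operations that have different instruction layouts or options. These suffixes are separated from the main name with a slash ("/"). They exist mainly so that there is a one-to-one mapping with the static constants in the code that generates and interprets executables (that is, to reduce ambiguity for humans).
- In the descriptions here, the width of a value (indicating, for example, the range of a constant or the number of registers that can be addressed) is shown by using one letter per four bits of width.
- For example, in the instruction move-wide/from16 vAA, vBBBB:
  - move is the base opcode, indicating the base operation (moving a register's value).
  - wide is the name suffix, indicating that it operates on wide (64-bit) data.
  - from16 is the opcode suffix, indicating a variant whose source is a 16-bit register reference (the 16 bits describe the register number, that is, which register is addressed, not the width of the register itself).
  - vAA is the destination register, which must be in the range v0 – v255 (each A stands for 4 bits, so AA is 8 bits, and an 8-bit value covers 0-255).
  - vBBBB is the source register, which must be in the range v0 – v65535.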
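To make the example concrete, here is a hedged sketch of decoding move-wide/from16 from two 16-bit units. It assumes the layout given in the instruction formats document for this format (first unit: opcode in the low 8 bits and AA in the high 8 bits; second unit: BBBB) and the opcode value 0x05; neither is stated in this text, and the function name `decode_22x` is made up for the example.

```python
MOVE_WIDE_FROM16 = 0x05  # opcode value assumed from the opcode table


def decode_22x(units):
    """Decode an `op vAA, vBBBB` instruction from two 16-bit units."""
    first, second = units[0], units[1]
    opcode = first & 0xFF        # low byte of the first unit
    aa = (first >> 8) & 0xFF     # high byte: 8-bit destination register
    bbbb = second & 0xFFFF       # whole second unit: 16-bit source register
    return opcode, aa, bbbb


units = [0x0205, 0x012C]
opcode, dst, src = decode_22x(units)
if opcode == MOVE_WIDE_FROM16:
    # A wide move copies the pair (vBBBB, vBBBB+1) into (vAA, vAA+1).
    print(f"move-wide/from16 v{dst}, v{src}")
```

The destination can never exceed v255 because only eight bits are available for it, while the source uses a full 16-bit unit.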
See the instruction formats document for more detail on the instruction formats.
See the .dex file format document for how the bytecode instructions relate to the dex file.

Official English original

The machine model and calling conventions are meant to approximately imitate common real architectures and C-style calling conventions:
- The machine is register-based, and frames are fixed in size upon creation. Each frame consists of a particular number of registers (specified by the method) as well as any adjunct data needed to execute the method, such as (but not limited to) the program counter and a reference to the .dex file that contains the method.
- When used for bit values (such as integers and floating point numbers), registers are considered 32 bits wide. Adjacent register pairs are used for 64-bit values. There is no alignment requirement for register pairs.
- When used for object references, registers are considered wide enough to hold exactly one such reference.
- In terms of bitwise representation, (Object) null == (int) 0.
- The N arguments to a method land in the last N registers of the method’s invocation frame, in order. Wide arguments consume two registers. Instance methods are passed a this reference as their first argument.
The storage unit in the instruction stream is a 16-bit unsigned quantity. Some bits in some instructions are ignored / must-be-zero.
Instructions aren’t gratuitously limited to a particular type. For example, instructions that move 32-bit register values without interpretation don’t have to specify whether they are moving ints or floats.
There are separately enumerated and indexed constant pools for references to strings, types, fields, and methods.
Bitwise literal data is represented in-line in the instruction stream.
Because, in practice, it is uncommon for a method to need more than 16 registers, and because needing more than eight registers is reasonably common, many instructions are limited to only addressing the first 16 registers. When reasonably possible, instructions allow references to up to the first 256 registers. In addition, some instructions have variants that allow for much larger register counts, including a pair of catch-all move instructions that can address registers in the range v0 – v65535. In cases where an instruction variant isn’t available to address a desired register, it is expected that the register contents get moved from the original register to a low register (before the operation) and/or moved from a low result register to a high register (after the operation).
There are several “pseudo-instructions” that are used to hold variable-length data payloads, which are referred to by regular instructions (for example, fill-array-data). Such instructions must never be encountered during the normal flow of execution. In addition, the instructions must be located on even-numbered bytecode offsets (that is, 4-byte aligned). In order to meet this requirement, dex generation tools must emit an extra nop instruction as a spacer if such an instruction would otherwise be unaligned. Finally, though not required, it is expected that most tools will choose to emit these instructions at the ends of methods, since otherwise it would likely be the case that additional instructions would be needed to branch around them.
When installed on a running system, some instructions may be altered, changing their format, as an install-time static linking optimization. This is to allow for faster execution once linkage is known. See the associated instruction formats document for the suggested variants. The word “suggested” is used advisedly; it is not mandatory to implement these.
Human-syntax and mnemonics:
- Dest-then-source ordering for arguments.
- Some opcodes have a disambiguating name suffix to indicate the type(s) they operate on:
  - Type-general 32-bit opcodes are unmarked.
  - Type-general 64-bit opcodes are suffixed with -wide.
  - Type-specific opcodes are suffixed with their type (or a straightforward abbreviation), one of: -boolean -byte -char -short -int -long -float -double -object -string -class -void.
- Some opcodes have a disambiguating suffix to distinguish otherwise-identical operations that have different instruction layouts or options. These suffixes are separated from the main names with a slash ("/") and mainly exist at all to make there be a one-to-one mapping with static constants in the code that generates and interprets executables (that is, to reduce ambiguity for humans).
- In the descriptions here, the width of a value (indicating, e.g., the range of a constant or the number of registers possibly addressed) is emphasized by the use of a character per four bits of width.
- For example, in the instruction move-wide/from16 vAA, vBBBB:
  - move is the base opcode, indicating the base operation (move a register’s value).
  - wide is the name suffix, indicating that it operates on wide (64 bit) data.
  - from16 is the opcode suffix, indicating a variant that has a 16-bit register reference as a source.
  - vAA is the destination register (implied by the operation; again, the rule is that destination arguments always come first), which must be in the range v0 – v255.
  - vBBBB is the source register, which must be in the range v0 – v65535.
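The naming rules above are regular enough to take apart mechanically. The sketch below splits a mnemonic into base opcode, name suffix and opcode suffix, and derives an operand's range from its placeholder letters. Both helper names are hypothetical, and the suffix check is deliberately simplistic: it only recognises the suffixes listed above and would mislabel names such as conversion opcodes.

```python
# Name suffixes listed in the text: -wide plus the type-specific ones.
NAME_SUFFIXES = {
    "wide", "boolean", "byte", "char", "short", "int", "long",
    "float", "double", "object", "string", "class", "void",
}


def split_mnemonic(mnemonic):
    """Split e.g. 'move-wide/from16' into (base, name_suffix, opcode_suffix)."""
    main, _, opcode_suffix = mnemonic.partition("/")
    base, sep, last = main.rpartition("-")
    if sep and last in NAME_SUFFIXES:
        name_suffix = last
    else:
        base, name_suffix = main, None   # no recognised name suffix
    return base, name_suffix, opcode_suffix or None


def operand_range(placeholder):
    """Each letter after the 'v' stands for four bits of register number."""
    bits = 4 * len(placeholder.lstrip("v"))
    return 0, (1 << bits) - 1


print(split_mnemonic("move-wide/from16"))
print(operand_range("vAA"), operand_range("vBBBB"))
```

A real disassembler would use a table keyed by opcode value instead of parsing names; this only mirrors the reading rules given in the list.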
See the instruction formats document for more details about the various instruction formats (listed under “Op & Format”) as well as details about the opcode syntax.
See the .dex file format document for more details about where the bytecode fits into the bigger picture.