https://www.obdev.at/articles/implementing-usb-1.1-in-firmware.html
Implementing USB 1.1 in Firmware
by Christian Starkjohann
This article describes the techniques we used to implement NRZI decoding and bitstuff decoding in realtime in our firmware-only USB driver for Atmel AVR microcontrollers.
The Challenge
For a USB 1.1 compatible low-speed device, a bit stream of 1.5 Mbit/s must be decoded. For a processor clocked at 12 MHz, this means that we have 8 CPU cycles for each bit. Being a RISC processor, the AVR executes most instructions in a single clock cycle. This gives us roughly 8 instructions to do the following operations on each bit:
- NRZI decoding. A "1" is encoded as no change of the data lines, a "0" as a change. NRZI decoding can therefore be done with an inverted exclusive-or (XNOR) of the current line state and the previous one (sampled 8 cycles earlier).
- Bitstuff decoding. To preserve synchronization during long sequences of "1", the sender inserts a "0" (a change of the data lines) after every 6 consecutive "1" bits. This "stuffed" bit must be removed during reception.
- End of Packet recognition. The end of a packet is signaled by an "SE0" condition: both data lines (which are normally the inverse of each other) are at logical "0" level for two bit times.
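The first and third task can be modeled off-target in a few lines of Python. This is only a host-side sketch for reasoning about the bit stream, not the AVR code. The function names and the representation of the line state as one 0/1 sample per bit time are assumptions made for illustration.

```python
def nrzi_decode(levels, previous=1):
    """Decode one sample per bit time: same level -> 1, changed level -> 0."""
    bits = []
    for current in levels:
        bits.append(1 ^ (previous ^ current))  # inverted XOR (XNOR)
        previous = current
    return bits


def find_end_of_packet(dplus, dminus):
    """Return the index of the first of two consecutive SE0 samples, or None."""
    for i in range(len(dplus) - 1):
        se0_now = dplus[i] == 0 and dminus[i] == 0
        se0_next = dplus[i + 1] == 0 and dminus[i + 1] == 0
        if se0_now and se0_next:
            return i
    return None
```

The model has no timing constraints at all, which is exactly what the firmware cannot afford: there, each of these steps has to fit into the 8-cycle budget.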
In addition to these tasks, the received byte must be stored and a buffer overflow check performed every 8 data bits.
The Naive Approach
The straightforward solution for the bit-processing loop looked like this (registers, I/O ports and constants have been replaced with symbolic names for better readability):
loop:
    in      x1, port        ; 1 read data from I/O port
    andi    x1, mask        ; 1 check for SE0
    breq    end_of_packet   ; 1 (1 cycle because branch not taken)
    eor     x2, x1          ; 1 NRZI decoding
    ror     x2              ; 1 assuming data bit in LSB -> carry
    ror     shift           ; 1 collect data bits
    mov     x2, x1          ; 1 store input for next cycle
    dec     cnt             ; 1 all 8 bits read?
    brne    loop            ; 2 loop 8 times
                            ; ------------------------------------
                            ; 10 cycles
The first figure in each comment is the number of CPU cycles the instruction takes. As you can clearly see, we have already exceeded the budget of 8 cycles per bit. And we have not even attempted to do bitstuff decoding yet!
The obvious optimization is to unroll the loop. We can simply copy the code 8 times and drop the loop construct (dec and brne). This saves us 3 cycles and we are back in the game with 7 cycles used so far. Provided we can find a way to do bit-unstuffing in 1 cycle, of course!
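The cycle arithmetic so far can be tallied in a short Python sketch. The cycle counts are the ones given in the listing comments above; the variable names are made up for this illustration.

```python
CPU_HZ = 12_000_000      # processor clock from the text
BIT_RATE = 1_500_000     # USB low-speed bit rate
budget = CPU_HZ // BIT_RATE  # cycles available per bit

# (instruction, cycles) as annotated in the naive loop
naive_loop = [
    ("in", 1), ("andi", 1), ("breq", 1), ("eor", 1), ("ror x2", 1),
    ("ror shift", 1), ("mov", 1), ("dec", 1), ("brne", 2),
]

loop_cycles = sum(cycles for _, cycles in naive_loop)

# Unrolling removes the loop construct: dec and brne.
unrolled_cycles = sum(
    cycles for name, cycles in naive_loop if name not in ("dec", "brne")
)

spare = budget - unrolled_cycles  # what is left for bit-unstuffing
```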
As a side note: we have read the inverse of the data. The exclusive or gives 1 if the data lines change and 0 if they don’t, which is the opposite of the NRZI encoding. But it is easy to compensate for that at any later time.
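To see the inversion concretely, here is a small Python model of the eor / ror pair. It is a sketch under assumptions: the helper names are hypothetical, the line is reduced to a single 0/1 level per bit time, and bit stuffing and SE0 are ignored.

```python
def nrzi_encode_byte(byte, level=1):
    """Line level per bit time, LSB first: a 0 bit toggles, a 1 bit holds."""
    levels = []
    for i in range(8):
        if (byte >> i) & 1 == 0:
            level ^= 1
        levels.append(level)
    return levels


def receive_inverted(levels, previous=1):
    """Model of 'eor' followed by 'ror shift' for eight bit times."""
    shift = 0
    for current in levels:
        carry = previous ^ current            # eor: 1 means "line changed"
        shift = (shift >> 1) | (carry << 7)   # ror: carry enters at the MSB
        previous = current
    return shift


raw = receive_inverted(nrzi_encode_byte(0xA5))
data = raw ^ 0xFF  # one complement restores the transmitted byte
```

In this model `raw` is the bitwise complement of the byte that was sent, so a single complement at store time is enough to compensate.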
The naive approach to bitstuff decoding is a counter which is decremented on each no-change bit and reset to 6 when a change is detected. If the counter reaches 0, we must discard the next bit. This procedure involves at least one branch to decide between change and no-change, one decrement instruction, one constant load and one branch taken when the counter reaches zero. It is very hard to craft code with conditional branches into a form where each path takes the same number of cycles. And it is even harder to pack all these instructions into one cycle!
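Written out as a host-side Python reference (a sketch, with hypothetical names, operating on already NRZI-decoded bits where 1 means no change), the counter approach looks like this. Note the nested decisions per bit, which are what makes it so expensive on the AVR.

```python
def unstuff_with_counter(bits):
    """Remove the stuffed 0 that follows every run of six 1 bits."""
    data = []
    ones_left = 6
    discard_next = False
    for bit in bits:
        if discard_next:          # this is the stuffed bit: drop it
            discard_next = False
            ones_left = 6
            continue
        data.append(bit)
        if bit == 1:              # no change on the line
            ones_left -= 1
            if ones_left == 0:
                discard_next = True
        else:                     # a change restarts the count
            ones_left = 6
    return data
```

Such a model is still handy as a reference to compare a faster implementation against.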
Breakthrough in Bitstuff Decoding
What we really need is a completely different algorithm for bitstuff decoding which uses as much as possible of the information already acquired. There's no use in computing the same result twice.
What we already have is the stream of bits read so far in the shift register. The last 6 bits received are certainly available there. Since ror shifts new bits in at the MSB and moves them toward the LSB, we must find out whether the 6 most significant bits are all 0 (which means no-change; remember: we read the inverse bit stream). This is the case exactly when the register value is less than 4, so a single compare with the constant 4 does the job. The code for the first bit in the unrolled loop would now be:
rxbit0:
    in      x1, port        ; 1 read data from I/O port
    andi    x1, mask        ; 1 check for SE0
    breq    end_of_packet   ; 1 (1 cycle because branch not taken)
    eor     x2, x1          ; 1 NRZI decoding
    ror     x2              ; 1 assuming data bit in LSB -> carry
    ror     shift           ; 1 collect data bits
    cpi     shift, 4        ; 1 check for 6 consecutive 0 bits
    brlo    do_unstuff      ; 1 (branch not taken)
    mov     x2, x1          ; 1 store input for next cycle
                            ; ------------------------------------
                            ; 9 cycles
We are now at 9 cycles and have detected the bitstuffing. Almost done. We need to save another cycle and get some spare cycles for saving the data and checking for buffer overflow.
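That the cpi shift, 4 / brlo pair really tests "six most significant bits all zero" can be checked exhaustively on a host. The following Python sketch (function names are hypothetical) compares the unsigned comparison against an explicit bit mask for all 256 register values.

```python
def top_six_bits_clear(shift):
    """Explicit test: bits 7..2 of the 8-bit register are all zero."""
    return (shift & 0xFC) == 0


def below_four(shift):
    """What 'cpi shift, 4' followed by 'brlo' decides (unsigned compare)."""
    return shift < 4


mismatches = [
    value for value in range(256)
    if top_six_bits_clear(value) != below_four(value)
]
```

The two lowest bits are older than the last six received bits, which is why their value does not matter for the test.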
Since the loop is already unrolled, we can drop the mov instruction. If we exchange the roles of x1 and x2 on every bit, we don’t need to move data around. This saves one cycle and we can now read all the bits in time. When we further take into account that an End of Packet state is two bits long, we can skip the SE0 check on every second bit and win enough spare cycles to store the data and do loop control. Does this mean we are done? Not quite! The code at do_unstuff is not yet written. But let us write down what we have so far:
rxbit0:
    in      x1, port        ; 1 read data from I/O port
    andi    x1, mask        ; 1 check for SE0
    breq    end_of_packet   ; 1 (1 cycle because branch not taken)
    eor     x2, x1          ; 1 NRZI decoding
    ror     x2              ; 1 assuming data bit in LSB -> carry
    ror     shift           ; 1 collect data bits
    cpi     shift, 4        ; 1 check for 6 consecutive 0 bits
    brlo    do_unstuff0     ; 1 (branch not taken)
rxbit1:
    in      x2, port        ; 1 read data from I/O port
    andi    x2, mask        ; 1 check for SE0
    breq    end_of_packet   ; 1 (1 cycle because branch not taken)
    eor     x1, x2          ; 1 NRZI decoding
    ror     x1              ; 1 assuming data bit in LSB -> carry
    ror     shift           ; 1 collect data bits
    cpi     shift, 4        ; 1 check for 6 consecutive 0 bits
    brlo    do_unstuff1     ; 1 (branch not taken)
                            ; ------------------------------------
                            ; 16 cycles
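Why swapping the roles of x1 and x2 is safe can be checked with a small Python model. It is a sketch under assumptions: samples are reduced to single 0/1 values, the SE0 check and unstuffing are left out, and the function names are invented for this comparison.

```python
def ror(shift, carry):
    """8-bit rotate right through carry: carry enters at the MSB."""
    return (shift >> 1) | (carry << 7)


def receive_with_mov(samples, previous):
    shift, x2 = 0, previous
    for sample in samples:
        x1 = sample
        x2 ^= x1                      # eor x2, x1
        shift = ror(shift, x2 & 1)
        x2 = x1                       # mov x2, x1
    return shift


def receive_with_swapped_roles(samples, previous):
    shift, x1, x2 = 0, 0, previous
    for i in range(0, len(samples), 2):
        x1 = samples[i]               # rxbit0: x1 is "current"
        x2 ^= x1
        shift = ror(shift, x2 & 1)
        x2 = samples[i + 1]           # rxbit1: x2 is "current"
        x1 ^= x2
        shift = ror(shift, x1 & 1)
    return shift
```

The register that was loaded for one bit is left untouched until the next bit has been read, so it can serve as the "previous" sample without a copy; the eor result only ever destroys the register that is about to be reloaded.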
Removing the Stuffed Bit
Destroying one bit should be easy, at first glance. Just do a dummy-read and wait:
do_unstuff0:                ; 1 (1 extra cycle: branch was taken)
    in      x1, port        ; 1 read data from I/O port
    andi    x1, mask        ; 1 check for SE0
    breq    end_of_packet   ; 1 (1 cycle because branch not taken)
    nop                     ; 1
    nop                     ; 1
    rjmp    rxbit1          ; 2 (branch taken)
                            ; ------------------------------------
                            ; 8 cycles
For the first time we have 2 spare cycles! Now do the necessary copy/paste and replace SE0 checks where we need loop control and we are done. But wait! What happens if the stuffed bit is followed by a no-change bit? Our shift register contains the data we store and it would have 7 leading zeros. The next bit would therefore be destroyed, although a bit stuffing has only just occurred. A bug!
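The bug is easy to reproduce in a host-side Python model of the dummy-read approach. This is a sketch under assumptions: the input is the inverted bit stream produced by eor (0 = no change), stuffed bits included; the initial register value 0xFF and the function name are choices made for this illustration.

```python
def receive_with_dummy_read(xor_bits):
    """Shift-register unstuffing where do_unstuff only skips one bit."""
    shift = 0xFF              # assumed start value: no leading zeros
    kept = []
    drop_next = False
    for bit in xor_bits:
        if drop_next:         # do_unstuff: dummy read, nothing shifted in
            drop_next = False
            continue
        shift = (shift >> 1) | (bit << 7)   # ror shift
        kept.append(bit)
        if shift < 4:         # cpi shift, 4 / brlo do_unstuff
            drop_next = True
    return kept


# Seven data 1s followed by a data 0, with the stuffed bit after the sixth 1.
# In the inverted stream: 0 = data 1, 1 = data 0 or stuffed bit.
on_the_wire = [0, 0, 0, 0, 0, 0, 1, 0, 1]
expected = [0, 0, 0, 0, 0, 0, 0, 1]
received = receive_with_dummy_read(on_the_wire)
```

In this model the seventh no-change bit leaves the register at 0x01, the compare fires again, and the following genuine data bit is thrown away: `received` comes out one bit short of `expected`.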
This time it’s tough. We must prevent more than 5 leading zeros from accumulating in the shift register, but this register should contain the data we store. And we have no control over the data we have to store. We could duplicate the register and use the copy for bitstuff detection. But where are the spare cycles for maintaining the copy? And keeping redundant data can’t be efficient, after all. What we need is another Good Idea.
If there is a solution, it must consist of modifying the shift register. This is the only way the tightly packed code for reading bits can survive without modification. OK. So we have to set at least the MSB in shift during do_unstuff. But how should we reconstruct the received byte then?
The key to a solution is that we know what we modify. We set bits to "1" which are known to be "0". And we know where they will land because the loop is unrolled and the number of shift operations following until the data is stored is fixed. We simply collect the bits we have modified in a separate register and bring them back just before we store. Luckily we had two spare cycles in do_unstuff:
do_unstuff0:                ; 1 (1 extra cycle: branch was taken)
    in      x1, port        ; 1 read data from I/O port
    andi    x1, mask        ; 1 check for SE0
    breq    end_of_packet   ; 1 (1 cycle because branch not taken)
    ori     shift, 0xfc     ; 1 mask out 6 recently received bits
    andi    x3, 0xfe        ; 1 the bits we masked shifted right 7
    rjmp    rxbit1          ; 2 (branch taken)
                            ; ------------------------------------
                            ; 8 cycles
x3 must be pre-initialized to 0xff at the start of each byte and we must store the value shift & x3. These two operations replace another SE0 check. Now we are really done decoding the stream. And we have not a single spare cycle left!
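The whole scheme can be exercised end to end in a host-side Python model. This is a sketch, not the driver code: the loop over the bit index stands in for the unrolled rxbit0..rxbit7 blocks, the mask `0xFC >> (7 - i)` generalizes the constant 0xfe shown for bit 0, the final complement undoes the inverted reading, and the start value 0xFF as well as all names are assumptions.

```python
def stuffed_xor_bits(data):
    """Inverted bit stream as seen after eor: 0 = data 1, 1 = data 0/stuffed."""
    bits, ones = [], 0
    for byte in data:
        for i in range(8):                  # LSB first
            bit = (byte >> i) & 1
            bits.append(bit ^ 1)
            ones = ones + 1 if bit else 0
            if ones == 6:                   # sender inserts a stuffed 0
                bits.append(1)
                ones = 0
    return bits


def receive_bytes(xor_bits):
    shift, x3 = 0xFF, 0xFF                  # assumed start values
    out, i, drop_next = [], 0, False
    for bit in xor_bits:
        if drop_next:                       # stuffed bit: dummy read
            drop_next = False
            continue
        shift = (shift >> 1) | (bit << 7)   # ror shift
        if shift < 4:                       # cpi shift, 4 / brlo
            drop_next = True
            x3 &= ~(0xFC >> (7 - i)) & 0xFF  # where the forced 1s will land
            shift |= 0xFC                   # ori shift, 0xfc
        if i == 7:
            out.append((shift & x3) ^ 0xFF)  # restore zeros, undo inversion
            x3, i = 0xFF, 0
        else:
            i += 1
    return out
```

For example, `receive_bytes(stuffed_xor_bits([0xFF, 0xFF, 0x7E]))` yields `[0xFF, 0xFF, 0x7E]` in this model, although three stuffed bits were inserted along the way.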
The production code has to take care of some other minor problems, e.g. how to accumulate cycles from spared SE0 checks where we need them, which SE0 checks to spare without breaking standards compliance and so on. The code has evolved over the years and has little in common with the snippets shown above. See the assembler module of the driver for more details.
References
- USB in a Nutshell by Craig Peacock.
- Universal Serial Bus Revision 1.1 Specification.
- AVR Instruction Set.
- Atmel's Application Note AVR309 (not yet online, preview available at www.cesko.host.sk).