In light of NV's recent addition of FP4, I'm once again curious about the bottom line for LLMs, at least for inference. Let's go back to the BitNet paper from Microsoft, featuring a 1-bit LLM with 1-bit weights trained from scratch, and the later follow-up featuring 1.58 bits per weight, i.e. a trit.
Sources
a blog post giving a short summary of the 1.58-bit LLM: https://medium.com/@jelkhoury880/bitnet-1-58b-0c2ad4752e4f
==> a sick comment from which I derived the title
While I am excited by these developments, I really wish they would stop calling it 1-bit. It's not 1 binary bit. It's 1 balanced ternary trit.
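==> for reference, the 1.58 figure is just the information content of a balanced ternary digit: log2(3) ≈ 1.585 bits per trit.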
==> though do note that I think the author's take on its hardware implications is not quite sound.
One can very well argue that without matrix multiply-add (MMA) operations, GPUs, or specifically their core SIMT components, are once again back in the driver's seat. It's possible that some highly optimized SIMD architecture could exploit the highly structured computation pattern better, but there is no theoretical guarantee for the best case, and misaligned, lopsided shapes will probably favor SIMT instead.
Aiming for better SIMT PPA means challenging NV on its home turf, and that won't be easy, to say the least.
Perhaps more importantly, for the next 3 to 5 years at least, BitNet-like structures are more likely to be incorporated into full/half-precision networks for partial or inference-only acceleration than shipped as standalone backbones for main server networks. That reality means a general-purpose processor with massive parallelism and a tensor processing unit would still be dominant.
BitNet: Scaling 1-bit Transformers for Large Language Models: https://arxiv.org/pdf/2310.11453.pdf
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits: https://arxiv.org/pdf/2402.17764.pdf
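For concreteness, the 1.58-bit paper quantizes each weight to {-1, 0, +1} by scaling with the mean absolute value ("absmean") before rounding and clipping. A minimal sketch of that quantizer, with the function name being my own choice:

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Quantize weights to {-1, 0, +1} via the paper's absmean scheme:
    scale by the mean absolute value, then round and clip to [-1, 1]."""
    gamma = w.abs().mean().clamp(min=eps)   # absmean scale
    return (w / gamma).round().clamp(-1, 1) # each entry: log2(3) ≈ 1.585 bits
```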
1-bit Transformers
Breakdown
Background
The increasing size of large language models has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption.
As these models grow, the memory bandwidth required to access and process the model parameters becomes a major bottleneck, limiting overall inference performance. Moreover, when deploying these models on distributed systems or multi-device platforms, the inter-device communication overhead can significantly impact inference latency and energy consumption.
Proposal
BitNet, a scalable and stable 1-bit Transformer architecture for LLMs.
Specifically, BitLinear is introduced as a drop-in replacement for the nn.Linear layer, enabling 1-bit weights to be trained from scratch (see the sketch below).
BitNet employs low-precision binary weights and quantized activations, while maintaining high precision for the optimizer states and gradients during training.
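Here is a minimal PyTorch sketch of what such a BitLinear forward pass could look like, based on the paper's description: sign-binarized weights centered by alpha = mean(W), absmax 8-bit activation quantization, and straight-through estimators (STE) so the full-precision latent weights still receive gradients. The class body is my own simplification; the paper's SubLN normalization and the per-group quantization used for parallel training are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Sketch of a 1-bit drop-in for nn.Linear: latent weights stay in full
    precision for the optimizer; the forward pass binarizes on the fly and
    uses STE so gradients reach the latent weights."""

    def forward(self, x: torch.Tensor, bits: int = 8) -> torch.Tensor:
        eps = 1e-5
        # Weight binarization: sign(W - alpha) with alpha = mean(W);
        # beta = mean(|W|) restores the weight scale after the matmul.
        # (torch.sign maps exact zeros to 0; a real kernel would break the tie.)
        alpha = self.weight.mean()
        beta = self.weight.abs().mean()
        w_bin = torch.sign(self.weight - alpha)
        w_q = self.weight + (w_bin - self.weight).detach()  # STE

        # Absmax activation quantization to `bits` bits (8 in the paper).
        Qb = 2 ** (bits - 1)
        gamma = x.abs().max().clamp(min=eps)
        x_scaled = (x * Qb / gamma).clamp(-Qb + eps, Qb - eps)
        x_q = x_scaled + (x_scaled.round() - x_scaled).detach()  # STE

        # Low-precision matmul, then rescale; bias (if any) added afterwards.
        y = F.linear(x_q, w_q) * (beta * gamma / Qb)
        return y if self.bias is None else y + self.bias
```

Swapping this in for nn.Linear is then a one-line change per layer (e.g. `BitLinear(4096, 4096, bias=False)`), while the optimizer states and gradients stay in high precision, which is exactly the split described above.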
Claims
0. (as of 2023.10) the "first to investigate quantization-aware training for 1-bit large language models"
1. BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines.
2. BitNet exhibits a scaling law akin to full-precision Transformers.
3. (minor) better training stability than fp16; can use a much larger learning rate for faster convergence
Key statistics
[Figure 1 of the BitNet paper: four key statistics, including per-datum compression; the figure's numbers are not reproduced here.]