from Machine Learning

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]

Hi everyone, I’ve been working on optimizing a transformer-based neural network for both inference speed and model size, but I feel like I’ve hit a plateau and would appreciate some guidance.

What I’ve tried so far:

- Converted weights to FP16 (~2× size reduction)
- Exported to ONNX and optimized with ONNX Runtime for inference speed
- Unstructured and structured pruning
- ONNX graph optimizations

None of these gave significant additional gains, and I’m still at ~162 MB per model.

Next steps I’m considering:

- Low-rank factorization (SVD/LoRA-style compression)
- More aggressive quantization (INT8/INT4, e.g. GPTQ, AWQ, or SmoothQuant)
- Knowledge distillation into a smaller student model
- Hardware/runtime-specific optimizations like TensorRT or FlashAttention

I’m not sure which of these actually gives meaningful real-world improvements after FP16 + pruning. I’d really appreciate advice on what tends to work best in practice for transformer compression beyond what I’ve already tried, and whether low-rank methods are actually effective post-training, or if distillation/quantization is usually the only real win at this stage.
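For reference, the SVD route I mention would look roughly like this. This is a minimal NumPy sketch of post-training truncated SVD on a single weight matrix; the 768×3072 FFN shape and rank 128 are just illustrative, and a random matrix is used as a stand-in (real trained weights have faster-decaying spectra, so the reconstruction is much better in practice):

```python
import numpy as np

def lowrank_factorize(W, rank):
    """Approximate W (m x n) as A @ B with A (m x rank), B (rank x n)
    via truncated SVD, keeping only the top `rank` singular values."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 3072))   # stand-in for a transformer FFN weight
A, B = lowrank_factorize(W, rank=128)

orig_params = W.size                   # 768 * 3072 = 2,359,296
new_params = A.size + B.size           # 768*128 + 128*3072 = 491,520
print(new_params / orig_params)        # ~0.208: ~5x fewer params for this layer
```

At inference time the single `x @ W` matmul becomes `(x @ A) @ B`, so the same factorization cuts FLOPs as well as size; the open question (which the post asks) is how far the rank can be pushed post-training before accuracy drops.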

submitted by /u/Fragrant_Rate_2583


Tagged with

#Transformer model
#inference speed
#model size
#ONNX Runtime
#quantization
#FP16
#graph optimization
#low-rank factorization
#knowledge distillation
#compression
#unstructured pruning
#structured pruning