EfficientViT always gives wrong predictions after MXQ compilation

Hi, I’m trying to deploy EfficientViT-B0/B1 (MIT-Han Lab) on a Mobilint MLA100 NPU (Jetson Orin, qbcompiler v1.1.0). The float32 ONNX model works correctly, but the compiled `.mxq` gives completely wrong predictions on every image (effectively 0% top-1 accuracy).

What I’ve tried:

- All quantization method/mode combinations (`WChALayer`, `WChAMulti`, `max`, `histogram`, `kl-divergence`) with both `backend="onnx"` and `backend="torch"`: all gave wrong predictions.

- `cpu_offload=True` — predictions still wrong.

- `BitConfig` with all transformer bits set to 16 + `mixed_precision_apply=True` — compiler output showed `Attn=0`, so EfficientViT’s attention wasn’t recognized as an Attention block. Everything stayed at 8-bit.

- To rule out a qbcompiler-specific issue, I ran ONNX Runtime INT8 quantization on the same model. Every method (per-tensor, per-channel, MinMax, Entropy, Percentile) gave 0/10. UINT16 activations improved to 10/10. ONNX Runtime also logged warnings about `elem_type: 7` (INT64) tensors inside `context_module` — the ReLU-based linear attention has `Gather`/`Slice`/`Div` ops that produce INT64 tensors, and these seem to break INT8 quantization.

- Sanity check: ResNet50 through the same pipeline → 99% accuracy. So the pipeline itself is fine.

My questions:

1. Architecture support — Is EfficientViT’s `context_module` (ReLU linear attention with dynamic `Gather`/`Slice` shape ops) actually supported on Aries2? Are there known issues with this kind of attention under INT8?

2. INT16 activation — Does Aries2 actually support INT16 activation at runtime? ONNX Runtime results suggest this model needs at least 16-bit activations. The v1.1.0 API has `BitConfig.LayerOverrides.activation_16bits`, but we can’t tell if the hardware executes it or silently ignores it.

3. `activation_16bits` not working: We extracted the 141 float-op node names from `context_module` directly from the ONNX graph (e.g. `stages/stages.2/blocks/blocks.1/context_module/main/kernel_func/Relu`) and passed them to `activation_16bits` + `weight_16bits` with `backend="onnx"`. Compilation succeeded with no warnings, but the output still showed `8b=1.00` and `Average Bit Width: 8.00 bits`, with no mention of 16-bit anywhere. Is the feature silently a no-op on this hardware, or are we using the wrong name format?

4. cpu_offload verification — Is there a log or runtime check to confirm `cpu_offload=True` is actually active? The PDF notes it requires a compatible qbruntime version.

Any help or pointers to working examples with attention-based models would be really appreciated. Thanks!

Hello,

Thanks for the detailed questions — happy to help. Here are our responses to each point:

1. Yes. The context_module block in EfficientViT is internally converted into a combination of NPU-supported operations, so it should run on Aries2 without issues. INT8 support itself should also be fine, though depending on the model, severe outliers may require switching specific layers to 16-bit. In addition, our next release will include a compiler version with improved EfficientViT performance, so please keep an eye out for that as well.
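To illustrate why a severe outlier can push a layer from 8-bit to 16-bit, here is a toy sketch in plain Python. It is not how qbcompiler quantizes anything: the symmetric per-tensor scheme, the helper names, and the activation values (small values plus a single large outlier) are all assumptions made up for the illustration.

```python
import random


def quantize_dequantize(values, bits):
    """Symmetric per-tensor fake quantization: one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for v in values:
        q = round(v / scale)            # map to the integer grid
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        out.append(q * scale)           # map back to float
    return out


def mean_abs_error(values, bits):
    """Average absolute difference between original and fake-quantized values."""
    deq = quantize_dequantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, deq)) / len(values)


random.seed(0)
# Hypothetical activation tensor: 1000 small values plus one large outlier.
activations = [random.uniform(0.0, 1.0) for _ in range(1000)] + [5000.0]

err8 = mean_abs_error(activations, 8)
err16 = mean_abs_error(activations, 16)
print(f"8-bit mean abs error:  {err8:.4f}")
print(f"16-bit mean abs error: {err16:.4f}")
```

In this toy case the outlier sets the scale, so at 8 bits every small value rounds to zero, while 16 bits leaves enough resolution to keep them apart. Real activation distributions in EfficientViT will differ, so treat this only as intuition for why selected layers may need 16-bit.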

2, 3. If you specify a path in `BitConfig.SaveInfo.save_path`, you can inspect the quantization bit assignments that were actually applied. Note that layer names can change during the onnx→mblt conversion (or while mblt is processed internally), so the reliable approach is to check the saved bit-config file (the one written to `save_path`) and set your 16-bit layers using the names found there.

Regarding your question of whether this is a silent no-op or a wrong name format: it’s closer to the latter. The ONNX node names you extracted were likely renamed during conversion, so none of the names you passed to `activation_16bits` matched an actual layer. As a result, the override applied to zero layers and everything silently stayed at 8-bit, which is why you saw `8b=1.00`. If you take the real internal layer names from the `save_path` file, re-specify them, and recompile, you should be able to confirm that the 16-bit assignment is reflected.
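Since a mismatched name fails silently, it can help to compare the names you plan to pass against the names in the saved bit-config file before recompiling. The sketch below is only a generic helper in plain Python: the format of the saved file is not described in this thread, so it takes two lists of names as input, and the internal names in `actual` are invented for the example.

```python
import difflib


def check_override_names(requested, actual, n_suggestions=3):
    """Split requested layer names into exact matches and misses.

    For each miss, suggest similar-looking names from `actual`.
    """
    actual_set = set(actual)
    matched = [name for name in requested if name in actual_set]
    unmatched = {}
    for name in requested:
        if name not in actual_set:
            unmatched[name] = difflib.get_close_matches(
                name, actual, n=n_suggestions, cutoff=0.4
            )
    return matched, unmatched


# ONNX node name taken from the question above.
requested = [
    "stages/stages.2/blocks/blocks.1/context_module/main/kernel_func/Relu",
]
# Hypothetical internal names, standing in for whatever the saved file lists.
actual = [
    "stages_2_blocks_1_context_module_main_kernel_func_relu",
    "stages_2_blocks_1_context_module_main_qkv_conv",
]

matched, unmatched = check_override_names(requested, actual)
print(f"{len(matched)} of {len(requested)} requested names match exactly")
for name, suggestions in unmatched.items():
    print(f"no exact match for {name}")
    for candidate in suggestions:
        print(f"  similar: {candidate}")
```

If the exact-match count is zero, the override would apply to no layers, which is consistent with the `8b=1.00` output described above. The suggestions are only string-similarity guesses, so confirm each one against the saved file before using it.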

4. For EfficientViT, all internal layers are supported on the NPU, so CPU offloading does not apply; it only kicks in for unsupported layers. There is no API to directly confirm CPU offloading at the code level, but as an indirect check you can look at the output of `getModelSummary`: if it shows 2 or more layers, CPU offload has been applied.

P.S. Please feel free to ask any questions in Korean if that is more comfortable for you.