Enabling MOE Quantization using linear decomposition [WIP] #2043

HDCharles · 2025-04-11T12:16:01Z

Summary: This PR is a first step toward optimizing MoE inference with torchao. The goal of this step is to let the existing quantization kernels and workflows handle MoE quantization by decomposing the grouped GEMM into a sequence of unbalanced linear ops, each of which can use the existing quantized kernels. To enable this, we had to add support for quantizing the 3D weight tensors, as well as for slicing and indexing them.
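As a rough illustration of the decomposition described above (not the code in this PR), the sketch below applies a stacked 3D expert weight one expert at a time with `F.linear` and compares the result against a gathered batched matmul. The function name `moe_linear_decomposed`, the top-1 routing, and the shapes are assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def moe_linear_decomposed(x, expert_weights, expert_ids):
    """Hypothetical helper: one F.linear call per expert instead of a grouped GEMM.

    x:              (num_tokens, in_features)
    expert_weights: (num_experts, out_features, in_features), stacked 3D weight
    expert_ids:     (num_tokens,), expert chosen per token (top-1 is an assumption)
    """
    num_experts, out_features, _ = expert_weights.shape
    out = x.new_zeros(x.shape[0], out_features)
    for e in range(num_experts):
        token_idx = torch.nonzero(expert_ids == e, as_tuple=True)[0]
        if token_idx.numel() == 0:
            continue  # no tokens were routed to this expert
        # Indexing the 3D weight yields an ordinary 2D weight, so a regular
        # linear op applies. The number of tokens per expert varies, which is
        # why the resulting linear ops are unbalanced.
        out[token_idx] = F.linear(x[token_idx], expert_weights[e])
    return out


torch.manual_seed(0)
x = torch.randn(6, 8)
w = torch.randn(4, 16, 8)
ids = torch.tensor([0, 2, 2, 1, 0, 2])  # expert 3 receives no tokens

decomposed = moe_linear_decomposed(x, w, ids)

# Reference: gather each token's weight and run a single batched matmul.
reference = torch.bmm(w[ids], x.unsqueeze(-1)).squeeze(-1)
```

Because each expert's slice is a plain 2D weight, the per-expert call is the point where an existing quantized linear kernel could be substituted, provided the quantized tensor supports the same indexing.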

The current tests only run locally; they will be added to the PR once they are working.

Currently, int8wo and int8dq work for both multi-token and single-token MoE inference; int4wo is still being finished.
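To make the 3D requirement concrete, here is a minimal plain-PyTorch sketch of symmetric int8 weight-only quantization over a stacked expert weight, where indexing one expert must slice the int8 data and its scales together. It does not use torchao's tensor subclasses or kernels; the helper names and the per-output-channel scale layout are assumptions for illustration only.

```python
import torch


def quantize_int8_per_channel(w):
    """Hypothetical helper: symmetric int8 quantization, one scale per output channel.

    w: (num_experts, out_features, in_features)
    Returns int8 data with the same shape as w and float scales of shape
    (num_experts, out_features, 1).
    """
    max_abs = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale


def dequantize_expert(q, scale, e):
    """Index expert e out of both tensors and rebuild its 2D float weight."""
    return q[e].to(scale.dtype) * scale[e]


torch.manual_seed(0)
w = torch.randn(4, 16, 8)
q, scale = quantize_int8_per_channel(w)

# Selecting one expert yields a regular 2D weight; the scales follow the
# same leading index, so data and scales stay aligned.
w1 = dequantize_expert(q, scale, 1)
err = (w1 - w[1]).abs()
```

The rounding error of each element is bounded by half of its channel's scale, which gives a quick sanity check when testing slicing and indexing on quantized 3D weights.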

TODO: move the test set into ao, move the quantizable MoE module code into ao, and test on the HF model definition.

pytorch-bot · 2025-04-11T12:16:06Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2043

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label Apr 11, 2025

HDCharles force-pushed the moe_quant branch from fce1e18 to 22e19a4 April 11, 2025 19:49

HDCharles force-pushed the moe_quant branch from 22e19a4 to 4583d99 April 22, 2025 18:04
