intermediate_source/pt2e_quant_xpu_inductor.rst
This tutorial introduces ``XPUInductorQuantizer``, which aims to serve quantized model inference on Intel GPUs. ``XPUInductorQuantizer`` utilizes the PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.

The PyTorch 2 Export quantization flow uses ``torch.export`` to capture the model into a graph and performs quantization transformations on top of the ATen graph. This approach is expected to have significantly higher model coverage with better programmability and a simplified user experience.
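As a minimal illustration of the capture step (a sketch using a toy ``nn.Linear`` model; any ``nn.Module`` can be captured the same way), the ATen graph that the quantization passes operate on can be inspected like this:

.. code-block:: python

    import torch

    # A toy eager-mode model used only for illustration.
    model = torch.nn.Linear(8, 4).eval()
    example_inputs = (torch.randn(2, 8),)

    # torch.export captures the model into an ExportedProgram whose
    # graph_module holds the ATen-level graph used by the quantization passes.
    exported_program = torch.export.export(model, example_inputs)
    print(exported_program.graph_module.graph)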
TorchInductor is the compiler backend that compiles the FX Graphs generated by TorchDynamo into optimized C++/Triton kernels.
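For a quick sense of that pipeline (a sketch with a hypothetical function ``f``), TorchInductor is the backend that ``torch.compile`` uses by default:

.. code-block:: python

    import torch

    # TorchDynamo traces this function into an FX Graph; TorchInductor (the
    # default backend) compiles that graph into optimized C++/Triton kernels.
    def f(x, y):
        return torch.nn.functional.relu(x @ y)

    compiled_f = torch.compile(f)  # backend="inductor" is the default
    out = compiled_f(torch.randn(4, 4), torch.randn(4, 4))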
The quantization flow mainly includes three steps:
- Step 1: Capture the FX Graph from the eager model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
- Step 2: Apply the quantization flow based on the captured FX Graph, including defining the backend-specific quantizer, generating the prepared model with observers,
  performing the prepared model's calibration, and converting the prepared model into the quantized model.
- Step 3: Lower the quantized model into inductor with the API ``torch.compile``, which would call Triton kernels or oneDNN GEMM/Conv kernels.

During Step 3, inductor decides which kernels each operation is dispatched to. There are two kinds of kernels from which Intel GPUs benefit: oneDNN kernels and Triton kernels. `Intel oneAPI Deep Neural Network Library (oneDNN) <https://github.com/uxlfoundation/oneDNN>`_ contains
highly-optimized quantized Conv/GEMM kernels for both CPU and GPU. Furthermore, oneDNN supports extra operator fusion on these operators, such as quantized linear with an eltwise activation function (ReLU) or a binary operation (add, in-place sum).
Besides the oneDNN kernels, Triton is responsible for generating the remaining kernels on Intel GPUs, such as the ``quantize`` and ``dequantize`` operators. The Triton kernels are optimized by the `Intel XPU Backend for Triton <https://github.com/intel/intel-xpu-backend-for-triton>`_.
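Putting the three steps together, a minimal end-to-end sketch could look as follows. The import paths for ``XPUInductorQuantizer`` and its default configuration helper are assumptions based on the PT2E quantization APIs and may differ across PyTorch versions; an Intel GPU (``xpu`` device) is also assumed to be available.

.. code-block:: python

    import torch
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
    from torch.ao.quantization.quantizer.xpu_inductor_quantizer import (
        XPUInductorQuantizer,
        get_default_xpu_inductor_quantization_config,
    )

    # A toy eager-mode model for illustration; larger models follow the same flow.
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()
    ).eval().to("xpu")
    example_inputs = (torch.randn(1, 3, 32, 32, device="xpu"),)

    # Step 1: capture the FX Graph with torch.export.
    exported_model = torch.export.export(model, example_inputs).module()

    # Step 2: define the backend-specific quantizer, insert observers,
    # run calibration, and convert to the quantized model.
    quantizer = XPUInductorQuantizer()
    quantizer.set_global(get_default_xpu_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    with torch.no_grad():
        prepared_model(*example_inputs)  # calibration with representative data
    converted_model = convert_pt2e(prepared_model)

    # Step 3: lower into inductor; oneDNN / Triton kernels are chosen at compile time.
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)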
The high-level architecture of this flow could look like this:
Post Training Quantization
----------------------------
Static quantization is the only method we support currently.
The dependency packages are recommended to be installed through the Intel GPU channel as follows: