intermediate_source/pt2e_quant_xpu_inductor.rst
This tutorial introduces ``XPUInductorQuantizer``, which aims to serve quantized model inference on Intel GPUs. ``XPUInductorQuantizer`` utilizes the PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.

The PyTorch 2 Export quantization flow uses ``torch.export`` to capture the model into a graph and performs quantization transformations on top of the ATen graph. This approach is expected to have significantly higher model coverage with better programmability and a simplified user experience.
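As a minimal illustration of the capture step (a sketch using a toy ``nn.Linear`` model; any ``nn.Module`` can be captured the same way), the ATen graph that the quantization passes operate on can be inspected like this:

.. code-block:: python

    import torch

    # A toy eager-mode model used only for illustration.
    model = torch.nn.Linear(8, 4).eval()
    example_inputs = (torch.randn(2, 8),)

    # torch.export captures the model into an ExportedProgram whose
    # graph_module holds the ATen-level graph used by the quantization passes.
    exported_program = torch.export.export(model, example_inputs)
    print(exported_program.graph_module.graph)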
TorchInductor is the compiler backend that compiles the FX Graphs generated by TorchDynamo into optimized C++/Triton kernels.
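For a quick sense of that pipeline (a sketch with a hypothetical function ``f``), TorchInductor is the backend that ``torch.compile`` uses by default:

.. code-block:: python

    import torch

    # TorchDynamo traces this function into an FX Graph; TorchInductor (the
    # default backend) compiles that graph into optimized C++/Triton kernels.
    def f(x, y):
        return torch.nn.functional.relu(x @ y)

    compiled_f = torch.compile(f)  # backend="inductor" is the default
    out = compiled_f(torch.randn(4, 4), torch.randn(4, 4))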
The quantization flow mainly includes three steps:
- Step 1: Capture the FX Graph from the eager model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
- Step 2: Apply the quantization flow based on the captured FX Graph, including defining the backend-specific quantizer, generating the prepared model with observers,
  performing the prepared model's calibration, and converting the prepared model into the quantized model.
- Step 3: Lower the quantized model into inductor with the API ``torch.compile``, which would call Triton kernels or oneDNN GEMM/Conv kernels.

During Step 3, inductor decides which kernels each operation is dispatched to. There are two kinds of kernels from which Intel GPUs benefit: oneDNN kernels and Triton kernels. `Intel oneAPI Deep Neural Network Library (oneDNN) <https://github.com/uxlfoundation/oneDNN>`_ contains
highly-optimized quantized Conv/GEMM kernels for both CPU and GPU. Furthermore, oneDNN supports extra operator fusion on these operators, such as quantized linear with an eltwise activation function (ReLU) or a binary operation (add, in-place sum).
Besides the oneDNN kernels, Triton is responsible for generating the remaining kernels on Intel GPUs, such as the ``quantize`` and ``dequantize`` operators. The Triton kernels are optimized by the `Intel XPU Backend for Triton <https://github.com/intel/intel-xpu-backend-for-triton>`_.
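Putting the three steps together, a minimal end-to-end sketch could look as follows. The import paths for ``XPUInductorQuantizer`` and its default configuration helper are assumptions based on the PT2E quantization APIs and may differ across PyTorch versions; an Intel GPU (``xpu`` device) is also assumed to be available.

.. code-block:: python

    import torch
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
    from torch.ao.quantization.quantizer.xpu_inductor_quantizer import (
        XPUInductorQuantizer,
        get_default_xpu_inductor_quantization_config,
    )

    # A toy eager-mode model for illustration; larger models follow the same flow.
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()
    ).eval().to("xpu")
    example_inputs = (torch.randn(1, 3, 32, 32, device="xpu"),)

    # Step 1: capture the FX Graph with torch.export.
    exported_model = torch.export.export(model, example_inputs).module()

    # Step 2: define the backend-specific quantizer, insert observers,
    # run calibration, and convert to the quantized model.
    quantizer = XPUInductorQuantizer()
    quantizer.set_global(get_default_xpu_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    with torch.no_grad():
        prepared_model(*example_inputs)  # calibration with representative data
    converted_model = convert_pt2e(prepared_model)

    # Step 3: lower into inductor; oneDNN / Triton kernels are chosen at compile time.
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)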
The high-level architecture of this flow could look like this:
Post Training Quantization
----------------------------
Static quantization is the only method we support currently.
The dependency packages are recommended to be installed through the Intel GPU channel as follows: