Commit bb3c2f8

Merge remote-tracking branch 'origin/main' into dl/fx/openvino_quantizer

2 parents b7d2781 + 26066b7

13 files changed: +895 −133 lines

.ci/docker/requirements.txt (+2 −2)

@@ -36,7 +36,7 @@ datasets
 transformers
 torchmultimodal-nightly # needs to be updated to stable as soon as it's avaialable
 onnx
-onnxscript
+onnxscript>=0.2.2
 onnxruntime
 evaluate
 accelerate>=0.20.1
@@ -69,5 +69,5 @@ pycocotools
 semilearn==0.3.2
 torchao==0.5.0
 segment_anything==1.0
-torchrec==1.0.0; platform_system == "Linux"
+torchrec==1.1.0; platform_system == "Linux"
 fbgemm-gpu==1.1.0; platform_system == "Linux"

.jenkins/build.sh (+3 −1)

@@ -26,7 +26,9 @@ sudo apt-get install -y pandoc
 # sudo pip3 install torch==2.6.0 torchvision --no-cache-dir --index-url https://download.pytorch.org/whl/test/cu124
 # sudo pip uninstall -y fbgemm-gpu torchrec
 # sudo pip3 install fbgemm-gpu==1.1.0 torchrec==1.0.0 --no-cache-dir --index-url https://download.pytorch.org/whl/test/cu124
-
+sudo pip uninstall -y torch torchvision torchaudio torchtext torchdata torchrl tensordict
+pip3 install torch==2.7.0 torchvision torchaudio --no-cache-dir --index-url https://download.pytorch.org/whl/test/cu126
+#sudo pip uninstall -y fbgemm-gpu
 # Install two language tokenizers for Translation with TorchText tutorial
 python -m spacy download en_core_web_sm
 python -m spacy download de_core_news_sm

.jenkins/validate_tutorials_built.py (+8 −1)

@@ -51,7 +51,14 @@
     "intermediate_source/text_to_speech_with_torchaudio",
     "intermediate_source/tensorboard_profiler_tutorial", # reenable after 2.0 release.
     "advanced_source/semi_structured_sparse", # reenable after 3303 is fixed.
-    "recipes_source/recipes/reasoning_about_shapes"
+    "intermediate_source/mario_rl_tutorial", # reenable after 3302 is fixed
+    "intermediate_source/reinforcement_ppo", # reenable after 3302 is fixed
+    "intermediate_source/pinmem_nonblock", # reenable after 3302 is fixed
+    "intermediate_source/dqn_with_rnn_tutorial", # reenable after 3302 is fixed
+    "advanced_source/pendulum", # reenable after 3302 is fixed
+    "advanced_source/coding_ddpg", # reenable after 3302 is fixed
+    "intermediate_source/torchrec_intro_tutorial", # reenable after 3302 is fixed
+    "recipes_source/recipes/reasoning_about_shapes" # reenable after 3326 is fixed
 ]


 def tutorial_source_dirs() -> List[Path]:

_static/img/install_msvc.png (image changed: 131 KB → 117 KB)

prototype_source/context_parallel.rst (new file, +228 lines)

Introduction to Context Parallel
================================

**Authors**: `Xilun Wu <https://github.com/XilunWu>`__, `Chien-Chin Huang <https://github.com/fegin>`__

.. note::
   |edit| View and edit this tutorial in `GitHub <https://github.com/pytorch/tutorials/blob/main/prototype_source/context_parallel.rst>`__.

.. grid:: 2

   .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
      :class-card: card-prerequisites

      * `Context Parallel APIs <https://pytorch.org/docs/stable/distributed.tensor.html#torch.distributed.tensor.experimental.context_parallel>`__
      * `1M sequence training in TorchTitan with Context Parallel <https://discuss.pytorch.org/t/distributed-w-torchtitan-breaking-barriers-training-long-context-llms-with-1m-sequence-length-in-pytorch-using-context-parallel/215082>`__

   .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
      :class-card: card-prerequisites

      * PyTorch 2.7 or later


Introduction
------------
Context Parallel is an approach used in large language model training to reduce peak activation size by sharding the long input sequence across multiple devices.
It removes the constraint on input sequence length that comes from the peak memory needed to store activations in Transformer blocks.

Ring Attention, a novel parallel implementation of the Attention layer, is critical to performant Context Parallel.
Ring Attention shuffles the KV shards and calculates partial attention scores, repeating until all KV shards have been used on each device. A single-device sketch of this blockwise computation appears after the list below.
Two Ring Attention variants have been implemented: `the all-gather based pass-KV <https://arxiv.org/abs/2407.21783>`__ and `the all-to-all based pass-KV <https://openreview.net/forum?id=WsRHpHH4s0>`__:

1. The all-gather based pass-KV algorithm, used in Llama3 training, initially performs an all-gather on the key and value tensors, followed by computing the attention output for the
   local query tensor chunk. Our modified all-gather based pass-KV algorithm concurrently all-gathers KV shards and computes attention output for the local query tensor chunk
   using local key and value tensor chunks, followed by a final computation of attention output for the local query tensor and the remaining KV shards. This allows some degree of
   overlap between the attention computation and the all-gather collective. For example, in the case of Llama3 training, we also shard ``freq_cis`` over the sequence dimension.
2. The all-to-all approach uses interleaved all-to-all collectives to ring-shuffle KV shards, overlapping the SDPA (Scaled Dot Product Attention) computation with the all-to-all
   communication needed for the next SDPA step.
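Before looking at the APIs, it helps to see why attention can be computed shard-by-shard at all. The following single-process sketch (an illustration of the blockwise merging idea, not part of the PyTorch API; the function name and the simulated ``world_size`` are ours) emulates the pass-KV rotation on one device. Each loop iteration stands in for a ring step in which a rank receives the next KV shard, and the partial outputs are merged with the numerically stable log-sum-exp rescaling that Flash Attention also uses:

.. code:: python

    import torch
    import torch.nn.functional as F


    def ring_attention_reference(q, k, v, world_size):
        # Pretend each of `world_size` ranks owns one KV shard along the
        # sequence dimension (dim 2 of a [batch, heads, seq, dim] tensor).
        k_shards = k.chunk(world_size, dim=2)
        v_shards = v.chunk(world_size, dim=2)

        out, lse = None, None  # running output and log-sum-exp of scores
        for step in range(world_size):
            # On a real ring, this shard would arrive via a collective or P2P op.
            scores = q @ k_shards[step].transpose(-2, -1) / q.shape[-1] ** 0.5
            block_lse = scores.logsumexp(dim=-1, keepdim=True)
            block_out = torch.softmax(scores, dim=-1) @ v_shards[step]
            if out is None:
                out, lse = block_out, block_lse
            else:
                # Rescale both partial results onto the merged normalizer.
                new_lse = torch.logaddexp(lse, block_lse)
                out = out * (lse - new_lse).exp() + block_out * (block_lse - new_lse).exp()
                lse = new_lse
        return out


    q, k, v = (torch.randn(2, 4, 128, 32) for _ in range(3))
    ref = F.scaled_dot_product_attention(q, k, v)  # non-causal reference
    assert torch.allclose(ring_attention_reference(q, k, v, world_size=4), ref, atol=1e-5)

The sketch omits causal masking and all communication; its point is only that the full score matrix never needs to be materialized on any one device.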

The Context Parallel APIs consist of two parts:

1. ``context_parallel()`` allows users to create a Python context where the SDPA function (``torch.nn.functional.scaled_dot_product_attention``)
   will be automatically replaced with Ring Attention. To shard Tensors along a dimension, pass the Tensors and their sharding dimensions to the
   ``buffers`` and ``buffer_seq_dims`` arguments respectively. We recommend adding every tensor that is computed along the sequence dimension to ``buffers``
   and sharding it along that dimension. Taking Llama3 training as an example, omitting ``freq_cis`` from ``buffers`` results in a miscalculated rotary embedding.
2. ``set_rotate_method()`` allows users to choose between the all-gather based pass-KV approach and the all-to-all based pass-KV approach.

Setup
-----

With ``torch.distributed.tensor.experimental.context_parallel()``, users can easily shard the Tensor input and parallelize the execution of the SDPA function.
To better demonstrate the usage of this API, we start with a simple code snippet doing SDPA and then parallelize it using the API:

.. code:: python

    import torch
    import torch.nn.functional as F

    from torch.nn.attention import sdpa_kernel, SDPBackend


    def sdpa_example():
        assert torch.cuda.is_available()
        torch.cuda.set_device("cuda:0")
        torch.cuda.manual_seed(0)

        batch = 8
        nheads = 8
        qkv_len = 8192
        dim = 32
        backend = SDPBackend.FLASH_ATTENTION
        dtype = (
            torch.bfloat16
            if backend == SDPBackend.FLASH_ATTENTION
            or backend == SDPBackend.CUDNN_ATTENTION
            else torch.float32
        )

        qkv = [
            torch.rand(
                (batch, nheads, qkv_len, dim),
                dtype=dtype,
                requires_grad=True,
                device='cuda',
            )
            for _ in range(3)
        ]
        # specify the SDPBackend to use
        with sdpa_kernel(backend):
            out = F.scaled_dot_product_attention(*qkv, is_causal=True)


    if __name__ == "__main__":
        sdpa_example()


Enable Context Parallel
-----------------------

Now, let's first adapt it to a distributed program where each rank has the same tensor input. Then we apply the context parallel API to
shard the input and distribute the computation across ranks:

.. code:: python

    # file: cp_sdpa_example.py
    import os

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.experimental import context_parallel
    from torch.distributed.tensor.experimental._attention import context_parallel_unshard
    from torch.nn.attention import sdpa_kernel, SDPBackend


    def context_parallel_sdpa_example(world_size: int, rank: int):
        assert torch.cuda.is_available()
        assert dist.is_nccl_available()
        torch.cuda.set_device(f"cuda:{rank}")
        torch.cuda.manual_seed(0)

        dist.init_process_group(
            backend="nccl",
            init_method="env://",
            world_size=world_size,
            rank=rank,
        )
        device_mesh = init_device_mesh(
            device_type="cuda", mesh_shape=(world_size,), mesh_dim_names=("cp",)
        )

        batch = 8
        nheads = 8
        qkv_len = 64
        dim = 32
        backend = SDPBackend.FLASH_ATTENTION
        dtype = (
            torch.bfloat16
            if backend == SDPBackend.FLASH_ATTENTION
            or backend == SDPBackend.CUDNN_ATTENTION
            else torch.float32
        )

        qkv = [
            torch.rand(
                (batch, nheads, qkv_len, dim),
                dtype=dtype,
                requires_grad=True,
                device='cuda',
            )
            for _ in range(3)
        ]
        # specify the SDPBackend to use
        with sdpa_kernel(backend):
            out = F.scaled_dot_product_attention(*qkv, is_causal=True)

        # make a clean copy of QKV for output comparison
        cp_qkv = [t.detach().clone() for t in qkv]

        with sdpa_kernel(backend):
            # This `context_parallel()` performs two actions:
            # 1. Shard the tensor objects in `buffers` in-place along the dimensions
            #    specified in `buffer_seq_dims`; the tensors in `buffers` and their
            #    sharding dims in `buffer_seq_dims` are matched by position.
            # 2. Replace the execution of `F.scaled_dot_product_attention` with a
            #    context-parallel-enabled Ring Attention.
            with context_parallel(
                device_mesh, buffers=tuple(cp_qkv), buffer_seq_dims=(2, 2, 2)
            ):
                cp_out = F.scaled_dot_product_attention(*cp_qkv, is_causal=True)

        # The output `cp_out` is still sharded in the same way as QKV;
        # the `context_parallel_unshard` API allows users to easily
        # unshard it to obtain the full tensor.
        (cp_out,) = context_parallel_unshard(device_mesh, [cp_out], [2])

        assert torch.allclose(
            cp_out,
            out,
            atol=(1e-08 if dtype == torch.float32 else 1e-03 * world_size),
        )


    if __name__ == "__main__":
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])

        try:
            context_parallel_sdpa_example(world_size, rank)
        finally:
            dist.barrier()
            dist.destroy_process_group()


You can use the command ``torchrun --standalone --nnodes=1 --nproc-per-node=4 cp_sdpa_example.py`` to launch the above context parallel
SDPA on 4 GPUs. We demonstrate numeric correctness by comparing the output of Ring Attention to that of SDPA on a single GPU.

Select Rotation Approach
------------------------

You can choose the desired shard rotation approach in Ring Attention by using ``torch.distributed.tensor.experimental._attention.set_rotate_method()``:

.. code:: python

    # file: cp_sdpa_example.py
    from torch.distributed.tensor.experimental._attention import set_rotate_method

    set_rotate_method("alltoall")  # rotate shards using all-to-all

    with sdpa_kernel(backend):
        with context_parallel(
            device_mesh, buffers=tuple(cp_qkv), buffer_seq_dims=(2, 2, 2)
        ):
            cp_out = F.scaled_dot_product_attention(*cp_qkv, is_causal=True)


The default rotation approach is the all-gather based pass-KV.
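To switch back explicitly, pass the name of the all-gather variant instead. We believe the accepted string is ``"allgather"``, mirroring the ``"alltoall"`` above, but since this API is experimental, verify it against your PyTorch version:

.. code:: python

    set_rotate_method("allgather")  # rotate shards using all-gather (the default)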

Conclusion
----------

In this tutorial, we have learned how to parallelize the SDPA computation along the sequence dimension easily with our Context Parallel APIs. For
design and implementation details, performance analysis, and an end-to-end training example in `TorchTitan <https://github.com/pytorch/torchtitan>`__,
see our post on `PyTorch native long-context training <https://discuss.pytorch.org/t/distributed-w-torchtitan-breaking-barriers-training-long-context-llms-with-1m-sequence-length-in-pytorch-using-context-parallel/215082>`__.
prototype_source/inductor_windows.rst (new file, +103 lines)

How to use ``torch.compile`` on Windows CPU/XPU
===============================================

**Authors**: `Zhaoqiong Zheng <https://github.com/ZhaoqiongZ>`_, `Xu, Han <https://github.com/xuhancn>`_


Introduction
------------

TorchInductor is the new compiler backend that compiles the FX Graphs generated by TorchDynamo into optimized C++/Triton kernels.

This tutorial introduces the steps for using TorchInductor via ``torch.compile`` on Windows CPU/XPU.

Software Installation
---------------------

Now, we will walk you through a step-by-step tutorial for how to use ``torch.compile`` on Windows CPU/XPU.

Install a Compiler
^^^^^^^^^^^^^^^^^^

A C++ compiler is required for TorchInductor optimization; let's take Microsoft Visual C++ (MSVC) as an example.

1. Download and install `MSVC <https://visualstudio.microsoft.com/downloads/>`_.
2. During installation, select **Workloads** and then **Desktop & Mobile**.
3. Check **Desktop Development with C++** and install.

.. image:: ../_static/img/install_msvc.png

.. note::

   Windows CPU Inductor also supports the `LLVM Compiler <https://github.com/llvm/llvm-project/releases>`_ and the `Intel Compiler <https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compiler-download.html>`_ as C++ compilers for better performance.
   Please check `Alternative Compiler for better performance on CPU <#alternative-compiler-for-better-performance-on-cpu>`_.
Set Up Environment
^^^^^^^^^^^^^^^^^^

Next, let's configure our environment.

#. Open a command line environment via ``cmd.exe``.
#. Activate ``MSVC`` via the command below::

      "C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Auxiliary/Build/vcvars64.bat"

#. Create and activate a virtual environment (a minimal example using the standard ``venv`` module is shown below; the environment name is arbitrary)::
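      rem "venv" is just an example environment name
      python -m venv venv
      venv\Scripts\activate.bat
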
#. Install `PyTorch 2.5 <https://pytorch.org/get-started/locally/>`_ or later for CPU usage. For XPU usage, install PyTorch 2.7 or later by following `Getting Started on Intel GPU <https://pytorch.org/docs/main/notes/get_start_xpu.html>`_.
#. Here is an example of how to use TorchInductor on Windows:

   .. code-block:: python

      import torch

      device = "cpu"  # or "xpu" for XPU

      def foo(x, y):
          a = torch.sin(x)
          b = torch.cos(x)
          return a + b

      opt_foo1 = torch.compile(foo)
      print(opt_foo1(torch.randn(10, 10).to(device), torch.randn(10, 10).to(device)))

#. Below is sample output from the above example (the values differ from run to run because the inputs are random)::

      tensor([[-3.9074e-02,  1.3994e+00,  1.3894e+00,  3.2630e-01,  8.3060e-01,
                1.1833e+00,  1.4016e+00,  7.1905e-01,  9.0637e-01, -1.3648e+00],
              [ 1.3728e+00,  7.2863e-01,  8.6888e-01, -6.5442e-01,  5.6790e-01,
                5.2025e-01, -1.2647e+00,  1.2684e+00, -1.2483e+00, -7.2845e-01],
              [-6.7747e-01,  1.2028e+00,  1.1431e+00,  2.7196e-02,  5.5304e-01,
                6.1945e-01,  4.6654e-01, -3.7376e-01,  9.3644e-01,  1.3600e+00],
              [-1.0157e-01,  7.7200e-02,  1.0146e+00,  8.8175e-02, -1.4057e+00,
                8.8119e-01,  6.2853e-01,  3.2773e-01,  8.5082e-01,  8.4615e-01],
              [ 1.4140e+00,  1.2130e+00, -2.0762e-01,  3.3914e-01,  4.1122e-01,
                8.6895e-01,  5.8852e-01,  9.3310e-01,  1.4101e+00,  9.8318e-01],
              [ 1.2355e+00,  7.9290e-02,  1.3707e+00,  1.3754e+00,  1.3768e+00,
                9.8970e-01,  1.1171e+00, -5.9944e-01,  1.2553e+00,  1.3394e+00],
              [-1.3428e+00,  1.8400e-01,  1.1756e+00, -3.0654e-01,  9.7973e-01,
                1.4019e+00,  1.1886e+00, -1.9194e-01,  1.3632e+00,  1.1811e+00],
              [-7.1615e-01,  4.6622e-01,  1.2089e+00,  9.2011e-01,  1.0659e+00,
                9.0892e-01,  1.1932e+00,  1.3888e+00,  1.3898e+00,  1.3218e+00],
              [ 1.4139e+00, -1.4000e-01,  9.1192e-01,  3.0175e-01, -9.6432e-01,
               -1.0498e+00,  1.4115e+00, -9.3212e-01, -9.0964e-01,  1.0127e+00],
              [ 5.7244e-04,  1.2799e+00,  1.3595e+00,  1.0907e+00,  3.7191e-01,
                1.4062e+00,  1.3672e+00,  6.8502e-02,  8.5216e-01,  8.6046e-01]])

Alternative Compiler for better performance on CPU
--------------------------------------------------

To enhance Inductor performance on Windows CPU, you can use the Intel Compiler or the LLVM Compiler. Both rely on the runtime libraries from Microsoft Visual C++ (MSVC), so your first step should be to install MSVC.

Intel Compiler
^^^^^^^^^^^^^^

#. Download and install the Windows version of the `Intel Compiler <https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compiler-download.html>`_.
#. Set the Windows Inductor compiler via the environment variable ``set CXX=icx-cl``.

LLVM Compiler
^^^^^^^^^^^^^

#. Download and install the `LLVM Compiler <https://github.com/llvm/llvm-project/releases>`_ and choose the win64 version.
#. Set the Windows Inductor compiler via the environment variable ``set CXX=clang-cl``.
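
Putting the steps together, a typical ``cmd.exe`` session that runs the earlier example with the LLVM Compiler might look like the following (the script name ``inductor_example.py`` and the environment name are illustrative; substitute ``set CXX=icx-cl`` for the Intel Compiler)::

      "C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Auxiliary/Build/vcvars64.bat"
      venv\Scripts\activate.bat
      set CXX=clang-cl
      python inductor_example.py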

Conclusion
----------

In this tutorial, we introduced how to use Inductor on Windows CPU with PyTorch 2.5 or later, and on Windows XPU with PyTorch 2.7 or later. We can also use the Intel Compiler or the LLVM Compiler to get better performance on CPU.
