@@ -5,30 +5,43 @@ This directory contains files for running the PyTorch example program,
using PyTorch module `DistributedDataParallel` for parallel training and
`PnetCDF-Python` for reading data from a NetCDF file.
- ---
## Running the MNIST Example Program
- * Firstly, run commands below to generate the python program file and NetCDF file.
+ * Firstly, run the command below to generate the Python program file.
```sh
- make mnist_main.py`
- make mnist_images.nc`
+ make mnist_main.py
```
* Run the command below to train the model using 4 MPI processes.
```sh
mpiexec -n 4 python mnist_main.py --batch-size 4 --test-batch-size 2 --epochs 3 --input-file mnist_images.nc
```
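When run under `mpiexec`, each MPI process trains on its own subset of the samples. As a rough illustration (not the program's actual code; the rank and sample counts here simply match the 4-process, 60-sample example above), a round-robin split in the spirit of PyTorch's `torch.utils.data.distributed.DistributedSampler` looks like this:

```python
# Illustration only: split 60 training samples across 4 ranks round-robin,
# similar in spirit to torch.utils.data.distributed.DistributedSampler.
def partition(num_samples, num_ranks, rank):
    """Return the sample indices assigned to one rank."""
    return list(range(rank, num_samples, num_ranks))

parts = [partition(60, 4, r) for r in range(4)]

# Every rank gets 15 samples, and together they cover all 60 exactly once.
assert all(len(p) == 15 for p in parts)
assert sorted(i for p in parts for i in p) == list(range(60))
print(parts[1][:3])  # first three indices assigned to rank 1: [1, 5, 9]
```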
+ * `mnist_main.py` command-line options:
+ ```
+ -h, --help            show this help message and exit
+ --batch-size N        input batch size for training (default: 64)
+ --test-batch-size N   input batch size for testing (default: 1000)
+ --epochs N            number of epochs to train (default: 14)
+ --lr LR               learning rate (default: 1.0)
+ --gamma M             Learning rate step gamma (default: 0.7)
+ --no-cuda             disables CUDA training
+ --no-mps              disables macOS GPU training
+ --dry-run             quickly check a single pass
+ --seed S              random seed (default: 1)
+ --log-interval N      how many batches to wait before logging training status
+ --save-model          For Saving the current Model
+ --input-file INPUT_FILE
+                       NetCDF file storing train and test samples
+ ```
+
## Testing
* Command `make check` will do the following:
+ Downloads the Python source file
[main.py](https://github.com/pytorch/examples/blob/main/mnist/main.py)
from [PyTorch Examples](https://github.com/pytorch/examples) as file
`mnist_main.py`.
+ Applies patch file [mnist.patch](./mnist.patch) to `mnist_main.py`.
- + Downloads the MNIST data sets from [ ] ( )
- + Run utility program [create_mnist_netcdf.py](./create_mnist_netcdf.py)
- to extract a subset of images into a NetCDF file.
- + Run the training program `mnist_main.py`.
+ + Runs the training program `mnist_main.py` in parallel using 4 MPI processes.
* Testing output shown on screen:
```
@@ -51,25 +64,15 @@ using PyTorch module `DistributedDataParallel` for parallel training and
Test set: Average loss: 1.2531, Accuracy: 7/12 (58%)
```
- ## mnist_main.py command-line options
- ```
- -h, --help            show this help message and exit
- --batch-size N        input batch size for training (default: 64)
- --test-batch-size N   input batch size for testing (default: 1000)
- --epochs N            number of epochs to train (default: 14)
- --lr LR               learning rate (default: 1.0)
- --gamma M             Learning rate step gamma (default: 0.7)
- --no-cuda             disables CUDA training
- --no-mps              disables macOS GPU training
- --dry-run             quickly check a single pass
- --seed S              random seed (default: 1)
- --log-interval N      how many batches to wait before logging training status
- --save-model          For Saving the current Model
- --input-file INPUT_FILE
-                       NetCDF file storing train and test samples
- ```
-
- ## create_mnist_netcdf.py command-line options
+ ## Generate the Input NetCDF File From MNIST Datasets
+ * Utility program [create_mnist_netcdf.py](./create_mnist_netcdf.py)
+ can be used to extract a subset of images into a NetCDF file.
+ * Command `make mnist_images.nc` will first download the MNIST data files from
+ https://yann.lecun.com/exdb/mnist and extract 60 images as training samples
+ and 12 images as testing samples into a new file named `mnist_images.nc`.
+ * `create_mnist_netcdf.py` can also be run individually to extract a different
+ number of images using the command-line options shown below.
+ * `create_mnist_netcdf.py` command-line options:
```
-h, --help            show this help message and exit
--verbose             Verbose mode
@@ -83,9 +86,34 @@ using PyTorch module `DistributedDataParallel` for parallel training and
                      (Optional) input file name of testing data
--test-label-file TEST_LABEL_FILE
                      (Optional) input file name of testing labels
+ --out-file OUT_FILE   (Optional) output NetCDF file name
+ ```
+ * The NetCDF file metadata can be obtained by running the command `ncmpidump -h`
+ or `ncdump -h`.
+ ```sh
+ % ncmpidump -h mnist_images.nc
+ netcdf mnist_images {
+ // file format: CDF-5 (big variables)
+ dimensions:
+         height = 28 ;
+         width = 28 ;
+         train_num = 60 ;
+         test_num = 12 ;
+ variables:
+         ubyte train_samples(train_num, height, width) ;
+                 train_samples:long_name = "training data samples" ;
+         ubyte train_labels(train_num) ;
+                 train_labels:long_name = "labels of training samples" ;
+         ubyte test_samples(test_num, height, width) ;
+                 test_samples:long_name = "testing data samples" ;
+         ubyte test_labels(test_num) ;
+                 test_labels:long_name = "labels of testing samples" ;
+
+ // global attributes:
+                 :url = "https://yann.lecun.com/exdb/mnist/" ;
+ }
```
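As a quick sanity check of this metadata, the on-disk data size of each variable follows directly from the dimensions, since every `ubyte` element occupies one byte. A minimal sketch (plain Python, independent of any NetCDF library):

```python
# Sizes implied by the CDL dump above: ubyte elements are 1 byte each.
dims = {"height": 28, "width": 28, "train_num": 60, "test_num": 12}

train_bytes = dims["train_num"] * dims["height"] * dims["width"]
test_bytes = dims["test_num"] * dims["height"] * dims["width"]

print(train_bytes)  # 60 * 28 * 28 = 47040 bytes of training pixels
print(test_bytes)   # 12 * 28 * 28 = 9408 bytes of testing pixels
```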
- ---
## Files in this directory
* [mnist.patch](./mnist.patch) --
a patch file to be applied on
@@ -103,7 +131,6 @@ using PyTorch module `DistributedDataParallel` for parallel training and
a utility Python program that reads the MNIST files, extracts a subset of the
samples, and stores them into a newly created file in NetCDF format.
- ---
### Notes:
- The test set accuracy may vary slightly depending on how the data is distributed across the MPI processes.
- The accuracy and loss reported after each epoch are averaged across all MPI processes.
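Conceptually, this per-epoch averaging is an MPI `allreduce` sum divided by the number of processes. A sketch without MPI, using made-up per-rank losses (the values are illustrative only, not output of `mnist_main.py`):

```python
# Sketch: global metric = allreduce(SUM) / nprocs. The per-rank values
# below are hypothetical; in the real run each rank would contribute the
# loss computed on its own shard of the test data.
rank_losses = [1.30, 1.22, 1.27, 1.21]  # one local average loss per MPI rank

global_avg = sum(rank_losses) / len(rank_losses)  # what allreduce-average yields
print(round(global_avg, 2))  # 1.25
```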