The `pmapreduce`-related functions are expected to be more performant than `@distributed` for loops. As an example, running the following on a Slurm cluster using 2 nodes with 28 cores each leads to

```julia
julia> @time @distributed (+) for i = 1:nworkers()
           ones(10_000, 1_000)
       end;
 22.355047 seconds (7.05 M allocations: 8.451 GiB, 6.73% gc time)

# An equivalent pmapsum-based call (the exact call is assumed here for illustration)
julia> @time pmapsum(x -> ones(10_000, 1_000), 1:nworkers());
  2.672838 seconds (52.83 k allocations: 78.295 MiB, 0.53% gc time)
```

The difference becomes more apparent as larger amounts of data need to be communicated across workers. This is because the `ParallelUtilities.pmapreduce*` functions perform local reductions on each node before communicating across nodes.
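A conceptual sketch of this two-stage reduction (not the package's internal code; the grouping of results by node below is assumed) might look like:

```julia
# Assume each inner vector holds the results produced by the workers on one node
results_by_node = [[ones(3), 2 .* ones(3)],        # node 1
                   [3 .* ones(3), 4 .* ones(3)]]   # node 2

# Stage 1: reduce locally on each node
node_partials = [reduce(+, node) for node in results_by_node]

# Stage 2: only the per-node partial results travel across nodes
total = reduce(+, node_partials)    # == 10 .* ones(3)
```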
# Usage
The package splits up a collection of ranges into subparts of roughly equal length, so that all the cores are approximately equally loaded. This is best understood using an example: let's say that we have a function `f` that is defined as follows (the definition and the parameter ranges below are assumptions, chosen to be consistent with the numbers used later in this section):
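```julia
# Assumed example: a function of three parameters and the ranges they take
f(x, y, z) = x + y + z

xrange, yrange, zrange = 1:3, 2:4, 3:6   # 3 * 3 * 4 = 36 tuples of parameters in total
```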
The package provides versions of `pmap` with an optional reduction. These differ from the one provided by `Distributed` in a few key aspects: firstly, the function receives the iterator product of the arguments rather than the arguments elementwise, so the i-th task is `Iterators.product(args...)[i]` and not `[x[i] for x in args]`. In particular, the second set of parameters in the example above is `(2,2,3)` and not `(2,3,4)`, as illustrated in the sketch below.
Secondly, the iterator is passed to the function in batches rather than elementwise, and it is left to the function to iterate over the collection. Thirdly, the tasks are assigned to processors sorted by rank, so the first task goes to the first processor and the last to the last active worker. The tasks are also distributed approximately evenly across processors. The exported function `pmapbatch_elementwise` passes the elements to the function one-by-one as splatted tuples; this produces the same result as `pmap` when a single range is passed as the argument.
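The ordering of the parameter tuples can be checked directly, using the ranges assumed above:

```julia
julia> collect(Iterators.product(1:3, 2:4, 3:6))[2]   # the second parameter tuple
(2, 2, 3)

julia> ((1:3)[2], (2:4)[2], (3:6)[2])                 # what elementwise zipping would give instead
(2, 3, 4)
```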
### pmapbatch and pmapbatch_elementwise
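A minimal sketch of the elementwise variant, assuming an illustrative map function and range (these are not the original examples):

```julia
julia> p = pmapbatch_elementwise(x -> x^2, 1:4);

julia> Tuple(p)
(1, 4, 9, 16)
```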
### pmapsum and pmapreduce
Often a parallel execution is followed by a reduction (e.g. a sum over the results). A reduction may be commutative (in which case the order of the results does not matter) or non-commutative (in which case the order does matter). Two exported functions carry out these tasks: `pmapreduce_commutative` and `pmapreduce`, where the former does not preserve ordering and the latter does. For convenience, the package also provides the function `pmapsum`, which chooses `sum` as the reduction operator. The map-reduce operation is similar in many ways to the distributed `for` loop provided by Julia, but the main difference is that the reduction operation is not binary for the functions in this package (e.g. we need `sum` and not `(+)` to add the results). As above, the function receives the parameters in batches, while functions with the suffix `_elementwise` take the parameters individually as splatted `Tuple`s. The function `pmapreduce` does not take parameters elementwise at this point, although this might be implemented in the future.
As an example, a parallel computation followed by a sum may be carried out using `pmapsum`. The examples below assume two workers:

```julia
julia> workers()
2-element Array{Int64,1}:
 2
 3

# A sketch of a pmapsum call (the mapped function here is assumed): each worker
# returns ones(2) scaled by its id, and pmapsum adds up the resulting arrays
julia> pmapsum(x -> ones(2).*myid(), 1:nworkers())
2-element Array{Float64,1}:
 5.0
 5.0

# The signature is pmapreduce(fmap, freduce, range_or_tuple_of_ranges)
julia> pmapreduce(x -> ones(2).*myid(), x -> hcat(x...), 1:nworkers())
2×2 Array{Float64,2}:
 2.0  3.0
 2.0  3.0
```

The function `pmapreduce` produces the same result as `pmapreduce_commutative` if the reduction operator is commutative (i.e. the order of the results received from the child workers does not matter).
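For instance, with a commutative reduction such as a sum, the two variants should agree (a sketch, assuming the two-worker setup above):

```julia
julia> pmapreduce_commutative(x -> ones(2).*myid(), sum, 1:nworkers())
2-element Array{Float64,1}:
 5.0
 5.0

julia> pmapreduce(x -> ones(2).*myid(), sum, 1:nworkers())
2-element Array{Float64,1}:
 5.0
 5.0
```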
The function `pmapsum` sets the reduction function to `sum`.
It is possible to specify the return types of the map and reduce operations in these functions using the following variants:
```julia
# Signature is pmapreduce(fmap, Tmap, freduce, Treduce, range_or_tuple_of_ranges)
julia> pmapreduce(x -> ones(2).*myid(), Vector{Float64}, x -> hcat(x...), Matrix{Float64}, 1:nworkers())
2×2 Array{Float64,2}:
 2.0  3.0
 2.0  3.0

# Signature is pmapsum(fmap, Tmap, range_or_tuple_of_ranges)
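# A call matching this signature (assumed; the result corresponds to the
# two workers with ids 2 and 3 used above)
julia> pmapsum(x -> ones(2).*myid(), Vector{Float64}, 1:nworkers())
2-element Array{Float64,1}:
 5.0
 5.0
```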
A `ProductSplit` object loops over values of `(x,y,z)`, with the values sorted in reverse lexicographic order (the first index increases the fastest and the last index the slowest). The ranges roll over as expected. The tasks are distributed evenly, with the remainder split among the first few processors. In this example the first six processors receive 4 tasks each and the last four receive 3 each. We can see this by evaluating the length of the `ProductSplit` on each processor:
```julia
julia> Tuple(length(ProductSplit((xrange, yrange, zrange), 10, i)) for i = 1:10)
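# the first six processors receive 4 tasks each and the last four receive 3 each
(4, 4, 4, 4, 4, 4, 3, 3, 3, 3)
```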