# Cranelift: towards representing control and effect dependencies in CLIF #10427
V8 is moving in the exact opposite direction, as sea of nodes was found to have several disadvantages in practice, at least for their use case. Not all of them would apply to Cranelift, but it may still be informative. https://v8.dev/blog/leaving-the-sea-of-nodes
Thanks for the link. I'm glad they put their rationale into writing, because it was all tribal knowledge and rumors before this. I'm going to go through their list of downsides and respond to each of them in the context of CLIF and what I wrote up above. All this should be taken with some amount of salt, though, because they've obviously had much more hands-on experience with effect dependencies in their IR than I have!

> Manually/visually inspecting and understanding a Sea of Nodes graph is hard

I think that what I sketched above can largely sidestep this issue because we would still organize things into blocks, and effectively always have an instruction schedule, rather than a big soup of nodes. We would still have a familiar text format, etc... Basically, this issue is proposing: "Let's add the control/effect dependencies, but not remove basic blocks. Then our goal is to do as much aggressive code motion in our existing CFG and basic blocks as we can while satisfying dependencies, rather than to reconstruct a CFG and schedule instructions in blocks from scratch." I think they are ultimately equivalent in theory, but I suspect that the former will be easier than the latter in practice (we basically already do the former during egraph elaboration).

> Too many nodes are on the effect chain and/or have a control input

Quite possible this would still be an issue. The source program (Wasm or rustc MIR) is going to constrain things somewhat. Some programs just don't have as much wiggle room as others. This is the nature of compiler optimization in general, though: many programs might not benefit from a particular analysis, but the ones that do really benefit. The trick is finding the best bang for your buck and complexity budget. It is worth mentioning that the source-to-CLIF frontend could also make things worse by unnecessarily introducing dependencies that aren't strictly needed (for example, by linearizing all effects during CLIF construction, even when they could have been a partial ordering), which would constrain the optimizer and give it less freedom for code motion.

> Memory operations do not float easily

Looking at their example, it seems like the difficulty is somewhat incidental and not fundamental. They have effect dependencies between their two loads, but neither load actually produces an effect AFAICT, and so they should both instead depend on the same earlier effect.
This is a really interesting/revealing statement, because Cranelift already has a "CFG where pure nodes are floating": that is our egraph and side-effectful skeleton! And what I wrote up above is very much not just that, because we would not impose a total ordering on anything that produces or consumes side effects; these operations would be partially ordered, and this is exactly the property that allows for many different schedulings / more aggressive code motion and, ultimately, better optimization. I think a key part of avoiding a single effect chain with a total ordering would be the bits I wrote up about generalizing our alias regions to effect dependencies. If you linearize all memory operations into a total order, that is equivalent to using a single alias region, and we know that we get tons of benefit from using multiple alias regions already today.

> Managing the effect and control chains manually is hard

I believe it. Emitting partially-ordered effects puts even more of this kind of management responsibility on the frontend. This doesn't, by itself, mean the whole endeavor isn't worthwhile.

> The scheduler is too complex

I believe it. But, as mentioned before, I think we can avoid that complexity by doing code motion on an existing schedule (something we already do!) rather than creating a new schedule from scratch. They use partial redundancy as an example: our egraph elaboration already avoids introducing partial redundancy, and I don't see that getting any harder in this new world.

> Finding a good order to visit the graph is difficult

Again, we can avoid this issue since we would retain a CFG and a schedule of instructions.

> Cache unfriendliness

I'm skeptical of how generalizable these numbers are. A lot of this seems tied up in the exact data structures used for the IR. That said, CFGs and basic blocks have implicit ordering and dependencies between instructions, while SoN represents them with explicit edges. Those explicit edges need to have some kind of runtime representation, which will lead to larger in-memory structures and therefore more cache misses. So I believe it, but I also think there is a lot of performance engineering that could be done to mitigate the effects. It seems like the potential code improvement could be worth the compile-time overhead, especially for final-tier compilers.

> Sea of Nodes destroys any prior scheduling, by construction

We already destroy any existing schedule with our egraph mid-end, so introducing effect dependencies doesn't change that for us. 🤷
I should add some thoughts here too -- I found the V8 post interesting, and obviously did a lot of thinking about this when initially researching and implementing our aegraphs mid-end (which has many parallels re: "free-floating nodes", as Nick mentions), so I certainly have some Opinions.
There are ways to have some of the "explicit, principled effect representation" cake without wrangling explicit SoN-style control and effect dependency edges through the entire pipeline. As described by Filip Pizlo (and implemented e.g. in WebKit's B3 JIT), one can pick a uniform representation for the effects of any instruction, including reads/writes of heap alias regions and control flow (branches "write control", control-dependent instructions "read control") and anything else that matters for code motion. This gives a more principled answer to "am I allowed to move this instruction past that one?" than a collection of ad-hoc flags like `can_move`.

It's probably also easier for frontends to get such effect read/write sets right, because they are mostly about what each instruction does (plus some flat context, like whether a load from wasm linear memory is gated under an explicit bounds check). Constructing the fully explicit def-use chains for each kind of effect seems harder to get right, similar to using mutable variables during CLIF construction vs. directly constructing everything in SSA form. Even if CLIF wanted explicit "effect values", it may be useful for frontends to just supply per-instruction read/write sets and have something like the existing SSA builder derive the explicit effect chains from them.

In terms of the IR that optimization passes work on, only knowing the reads/writes of each instruction is of course less convenient for doing aggressive code motion. You can easily tell which instructions have no effects at all, but moving an effectful instruction from program point A to program point B would be absurdly tricky with raw pairwise "can this instruction be swapped with a predecessor/successor" queries. I don't know if and how B3 solves it, but in my mind this is similar to def-use chains for data flow. You can build your entire IR around an explicit, sparse representation of which "writes" are relevant for which "reads" (i.e., SSA for data flow, SoN-style for effects/control), but you don't have to. It can also be handled by a supplementary analysis that's computed only in the part of the pipeline where it's most useful. For data flow dependencies, everyone seems to agree that SSA is a very good trade-off. For effects and control dependencies, no such consensus has emerged yet. But having a clear ground truth for "what effects/dependencies does this instruction have" seems good regardless of if/how you materialize the implied relations between different instructions.
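To make that concrete, here is a minimal sketch of the idea (my own illustration, not B3's or Cranelift's actual API): each instruction carries read/write sets over abstract regions, with control treated as just another region, and pairwise reordering queries fall out of set intersections.

```rust
/// A set of abstract "regions" an instruction may touch, one bit per region.
#[derive(Clone, Copy)]
struct RegionSet(u64);

impl RegionSet {
    const EMPTY: RegionSet = RegionSet(0);
    const CONTROL: RegionSet = RegionSet(1 << 0); // control flow as a region
    const HEAP: RegionSet = RegionSet(1 << 1);    // wasm linear memory

    fn union(self, other: RegionSet) -> RegionSet {
        RegionSet(self.0 | other.0)
    }

    fn intersects(self, other: RegionSet) -> bool {
        self.0 & other.0 != 0
    }
}

/// The uniform effect summary of a single instruction.
struct Effects {
    reads: RegionSet,
    writes: RegionSet,
}

/// Two adjacent instructions may be reordered iff neither writes anything
/// that the other reads or writes.
fn may_swap(a: &Effects, b: &Effects) -> bool {
    !a.writes.intersects(b.reads.union(b.writes))
        && !b.writes.intersects(a.reads.union(a.writes))
}

fn main() {
    // A bounds-checking branch "writes control"; the load it guards "reads
    // control" (and the heap), so the load may not float above the branch.
    let branch = Effects { reads: RegionSet::EMPTY, writes: RegionSet::CONTROL };
    let guarded_load = Effects {
        reads: RegionSet::CONTROL.union(RegionSet::HEAP),
        writes: RegionSet::EMPTY,
    };
    assert!(!may_swap(&branch, &guarded_load));
}
```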
Agreed: there are multiple viable approaches to representing effects and control in an IR. We just have to try and figure out what is the best fit for Cranelift, and what is realistic given our engineering resources.
I see this approach as, roughly, extending our existing alias-region infrastructure from annotating only memory operations to annotating any instruction. I do believe this would be an improvement over what we have today. That said, fully removing our ad-hoc flags would require roughly the same kind and amount of work as adding full effect edges to CLIF, and I don't think it would solve the class of bug we hit that motivated my thinking on this topic (where a memory operation was moved past its safety checks) as effectively as adding effect dependencies would. On the one hand, we could make conditional branches clobber all alias regions, but this is overly conservative and would unnecessarily restrict code motion. We would end up generating worse code than we are able to today; we wouldn't be able to lift a Wasm linear memory's base pointer out of a loop anymore, for example. Alternatively, we could keep track of exactly which operations depend on that conditional branch, and make the branch clobber only those alias regions. This is equivalent to the work necessary to track and manage effect dependencies; however, it has a few downsides:
In particular, I think (3) is the biggest downside of annotating any instruction with alias regions -- as well as our current state of things -- and conversely is the biggest potential benefit of adding effect dependencies. We aren't a large team and we don't have the maintainer-power to write and maintain more and more optimization passes and static analyses, brute-forcing our way to good codegen the traditional way. Therefore, we must pursue techniques that we believe can give us outsized performance-bang for their complexity-buck. With effect dependencies, I see the potential to address a scary class of bugs in a principled manner, simplify our internals, and allow for even more aggressive optimization than we can do today, all at once.
Yeah, there is definitely work we should be able to do in `cranelift-frontend` here. Off the top of my head, a good approach for frontends to start with would be to maintain a "current effect value" for each logical region while building CLIF: have every instruction that reads a region take that region's current effect value as an operand, have every instruction that writes a region define the region's new current effect value, and merge effect values at control-flow join points via block parameters.
This approach would be equivalent to, and no more difficult than, annotating every instruction in the frontend with the alias regions it reads and clobbers. We could wrap this pattern up in helper methods in `cranelift-frontend`.
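For illustration, the helper-method shape might look roughly like this (entirely hypothetical; `cranelift-frontend` has no such API today, and the types are stand-ins):

```rust
use std::collections::HashMap;

/// Stand-in for cranelift's `ir::Value`.
#[derive(Clone, Copy)]
struct Value(u32);

/// A frontend-chosen logical region (e.g. "this wasm memory", "the vmctx").
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Region(u32);

/// Tracks the current effect value of each logical region during CLIF
/// construction, much like `Variable`s track data values today.
struct EffectTracker {
    current: HashMap<Region, Value>,
}

impl EffectTracker {
    /// A read of `region` should take this effect value as an operand.
    fn effect_for_read(&self, region: Region) -> Value {
        self.current[&region]
    }

    /// After emitting a write to `region`, record the new effect value
    /// that the side-effectful instruction defined.
    fn record_write(&mut self, region: Region, new_effect: Value) {
        self.current.insert(region, new_effect);
    }
}
```

(At control-flow joins, the tracker's per-region effect values would be merged via block parameters, exactly like the SSA builder merges data values.)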
Reusing the SSA `Value` infrastructure for effect dependencies means the only ordering constraint the IR enforces by construction is that defs dominate uses.
But in this case, we might have a snippet like this: a store to wasm memory followed by a load from the same memory, with a constant address that's guaranteed in-bounds for the wasm memory type (`v2` is the base address of that memory).
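In the hypothetical effect-value syntax from the issue body (exact spellings and offsets are illustrative), that might look like:

```
;; `v2` is the wasm memory's base address, and the constant offset is
;; statically in-bounds, so the load is read-only and non-trapping: it
;; takes an effect operand but defines no new effect value.
effect1 = store v3, v2+16, effect0
;; This load names `effect0`, the pre-store state, so it is meant to read
;; the value at this address from before the store, yet no def-use edge
;; actually forces the load to be scheduled before the store.
v4 = load.i32 v2+16, effect0
```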
What prevents incorrectly moving the store before the load? Certainly not the SSA "defs dominate uses" requirement. This example could be fixed by making everything that reads some effect/alias region also produce a new effect value. But this flies in the face of the intuition I've tried to apply to these effect values, and it needs some new tweak to make sure successive read-only operations aren't forced into a total order again by the effect chain.

I suppose this gestures at the question in the back of my mind all this time: why don't the various SoN-style and (R)VSDG-style IRs do what's proposed in this issue? Why do they have separate kinds of edges for effects and for data flow?
Ah, this is really interesting! In one sense, the memory-state values are linear (or affine): the store consumes the old memory/effect state (or clobbers it), so it is no longer available for the load to use. I see several ways to encode this into the IR:

1. Actually enforce affinity: an effect value gets at most one consuming use, so once the store consumes `effect0`, the load can no longer use it and must be wired up some other way.
2. Treat the store's consumption of `effect0` as an "anti-def": all other uses of `effect0` must be scheduled before the instruction that consumes it.
3. Have every instruction that reads external state, even a read-only one, define a new effect value, so that the required orderings become ordinary def-use edges.
I personally favor the last one, for simplicity reasons, though it is the most verbose encoding (hence the least space-efficient) if we have n readers for every writer, for example.
Thanks for whipping up this example, it is indeed an interesting one. I had originally been imagining effect values as affine, but at some point Chris convinced me we shouldn't do that. Additionally, affinity gets messy when we want to fence on a bunch of effects: we merge them together, do an operation depending on that merged effect, and then unpack them again from the resulting effect value after the operation (e.g. fanning all effects into a `call`, and then fanning the `call`'s single resulting effect back out to the operations that follow it).

So, to allow the encoding of the dependencies we want purely in SSA values, where the store depends on the load, I think we need all instructions that read from the outside world, even if they don't mutate it, to return new effect values. Then the store can depend on the load's effect, and the desired ordering is encoded in the regular value dependencies again.
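Revisiting the snippet above under that rule (still illustrative syntax), the read-only load now defines a fresh effect value and the store consumes it, so the store can no longer float above the load:

```
;; The load still traps on nothing and writes nothing, but it defines a
;; new effect value representing "this read has happened".
v4, effect1 = load.i32 v2+16, effect0
;; The store consumes the load's effect, so ordinary SSA dominance now
;; forces the load to be scheduled before the store.
effect2 = store v3, v2+16, effect1
```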
I believe this is the third approach Chris enumerated, but making explicit our need for, e.g., loads to return a new effect value. The anti-def idea is an interesting one, but again it is pretty bespoke and separate from the rest of our SSA infrastructure, which is in opposition to some of our goals here and makes me disinclined to pursue it further. I would lean towards making any operation that reads from the outside world also return a new effect value.
Regarding affine/linear values, note that there would have to be a notion of a non-consuming use as well, for operations which are read-only w.r.t. that effect. Otherwise everything related to the same alias region is forced into a single linear chain, which makes the whole thing much less appealing. But then this raises interesting questions for merging effects: if a load depends on both control and the last write to a wasm memory, and these are merged with a `merge_effects`-style instruction, does consuming the merged value also consume the constituent effects? (There were already very good reasons given to not go this route, but it seems there's even more complexity hiding there.)
Hmm, while affine effect values lead to two different kinds of uses, an encoding into regular SSA def-use relationships may also need a similar distinction for defs: if you want to DCE a non-trapping load, you need to know that removing its effect result value (or replacing it with the effect inputs to the load?) is valid. In other words, you need to know which effect values represent actual changes and which ones are "pass through" for the sake of wiring up a read to relevant later writes. This seems to drag in a significant chunk of "explicit effect chain manipulation" in SoN terms, or, in CLIF terms, a significant amount of special knowledge about the semantics of specific instructions that doesn't fall out from the SSA data flow alone.
I don't think this is actually too bad: we can annotate these instructions in the meta crate's instruction definitions.
How do you distinguish loads that may trap from those that don't, if both kinds of load take and produce effect values?
We will still need to annotate loads with the `notrap` flag in their `ir::MemFlags`, the same as we do today.
In #10340, we fixed a bug where we were doing code motion of a non-trapping and read-only `load` operation such that its explicit dependencies were satisfied. However, CLIF only represents data dependencies between instructions explicitly, not control or effect dependencies. It turns out code motion was not safe because the `load`'s safety and non-trapping-ness implicitly depended upon control flow and its block's location in the CFG: specifically, the `load` was in a block that was dominated by a `br_if` that performed a bounds check. Our optimizer could not see, and therefore did not abide by, that implicit constraint, and therefore performed "illegal" code motion leading to a miscompile. This bug could have been a very serious CVE, and basically the only reasons it was not are a combination of luck and the fact that it was lurking in less-exercised, off-by-default, not-yet-production-ready features that our fuzzers hadn't hammered on.

We addressed this issue with an initial stop-gap solution: adding a new `can_move` flag to `ir::MemFlags`, effectively conservatively disabling code motion for all memory operations by default, and requiring that a memory operation set this flag to opt into dependency-based code motion in the optimizer. I remain concerned that we have similar misoptimization bugs still lurking, that we annotated a load with `can_move` when it should not have been, or that we will make an innocuous change in the future that, from a distance, invalidates the soundness of a downstream instruction's `can_move` annotation.

A more-principled solution would be to eliminate implicit dependencies and explicitly represent all dependencies in the IR itself. Taken to its limit, where you no longer have a control-flow graph of basic blocks and instead only have control dependency edges between instructions, this is the "sea of nodes" approach to IR design. After some offline discussions and brainstorming with Chris and Alex, I believe I may have found somewhat of a sweet spot for Cranelift that doesn't require fully boiling the ocean to completely redefine CLIF, transforming it beyond recognition into that full-blown sea-of-nodes limit (we can keep block arguments, for example, which feel very CLIF-y), but that nonetheless makes all dependencies explicit.

## Benefits of Explicit Dependencies
First, let me enumerate the benefits we can reap by making all of an IR instruction's dependencies explicit:
We eliminate the whole class of misoptimization bugs that occur because the optimizer fails to satisfy constraints that it isn't privy to. Implicit constraints that the optimizer is ignorant of no longer exist.
We do not need to conservatively deny optimizations by default and require opting into, e.g., code motion. Our defaults can be to optimize exactly as much as the dependencies allow and no more than that. Optimization is more uniform. CLIF frontends no longer must consider whether a given memory operation is hot enough that it is worth opting into more aggressive optimizations for it and, if so, taking the time to verify whether doing so would even be sound.
Implementing the optimizer becomes easier. All IR passes and transformations must already take care to schedule instructions such that their dependencies are satisfied (i.e. that an instruction is dominated by the defs of any values it uses). However, right now we additionally encode an alias region in memory operations, and we must do an alias analysis in order to soundly perform redundant-load-elimination and store-to-load-forwarding optimizations; we also must perform a state-coloring analysis during CLIF-to-MachInst lowering to determine whether fusing a particular memory operation into an instruction's operand (on architectures like `x86_64`, where many instructions support both register and memory operands) is sound. By making all dependencies explicit, we can remove alias regions from memory operations and vastly simplify the alias and state-coloring analyses (to the point that, I believe, we would barely even consider them full-blown analyses anymore).

Explicit side-effect dependencies generalize our small number of fixed alias regions into any number of logical alias regions: just depend on this logical alias region's most-recent side effect.
We can more easily implement new, more-aggressive optimizations, such as branch-folding and optimizing side-effectful instructions.
Our current architecture makes it difficult to optimize side-effectful and control-flow instructions in the egraph, because the egraph works by rewriting values and organizing values into eclasses; side effects and control flow are not represented by values but are instead implicit in their associated instruction. The optimizer lifts everything it can into the egraph and leaves behind the "skeleton" of side-effectful and control-flow instructions in the CFG and its basic blocks. The operands of this skeleton's instructions act to root demanded computations inside the egraph from the outside.
If we represent control-flow and side-effect dependencies as compile-time-only values, then we can rewrite and optimize these values in the egraph. We don't demand computations from the egraph via a side-effectful skeleton, but instead with a set of control-flow and side-effect value dependencies.
We can additionally do things like: if we `call` a function that is known not to modify the `vmctx`, then we can reuse an already-loaded value from the `vmctx` across that `call`.

## Sketch of the CLIF Changes
With that out of the way, here is a sketch of the changes I am imagining:
We introduce a new `ir::types::Effect`. Values of this type are used to represent side-effect and control dependencies of an instruction at compile time (control flow being a type of side effect, if you squint hard enough). They are not materialized at run time.

The function's entry block's first argument will be an effect value. This represents the control flow and the state of the world (e.g. memory) upon entering the function.

Instructions that mutate external state, like stores, will both take and produce an effect value:
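For example, with one possible spelling of effect operands and results:

```
;; The store consumes `effect0` and defines a new effect value, `effect1`.
effect1 = store v1, v0, effect0
```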
This means that the side effect that `effect0` represents must have already happened before we perform this `store`, and any instruction (such as a `load`) that might depend on this store having been performed can take `effect1` (or another effect value derived from `effect1`) as an operand.

Instructions that depend on external state but do not themselves have side effects -- like read-only, non-trapping memory loads -- will only take an effect value as an operand, and will not return a new effect value:
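For example:

```
;; This load is read-only and non-trapping, so it takes an effect operand
;; but does not define a new effect value.
v2 = load.i32 v1, effect
```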
This means that the side effect that `effect` represents must have already happened before we perform this `load`.

Similarly, instructions which may trap take and produce effect values:
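For example, a load that may trap might be spelled:

```
;; The potential trap is itself a side effect, so this load both consumes
;; an effect value and defines a new one.
v2, effect2 = load.i32 v1, effect1
```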
Unconditional jumps take an effect value operand. Because effect values are just `ir::Value`s, they may also, but are not required to, pass effect values as arguments to the target block.

Conditional branches take an effect value operand and additionally define a unique effect value result on each of their outgoing control edges. This value-definition-on-control-edge is similar to how `try_call` instructions define values on their outgoing control edges, and we can use the same technique proposed there to pass these values to their target blocks: by syntactically passing the `ctrl` placeholder as an argument to the block call, rather than an already-defined value.

We will probably want an instruction to combine multiple effect values into a single new effect value:
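Something like the following, where the instruction name is illustrative:

```
;; `effect2` represents "both `effect0` and `effect1` have already happened".
effect2 = merge_effects effect0, effect1
```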
This can be used to make an instruction depend on both a previous `store` and a `br_if` bounds check.

## More Implications
And here are some more random things that fall out from this design:
It still looks like CLIF! It remains familiar. Most instructions are pure, in fact, and therefore unchanged. We still have block args/params instead of phis. We still have a text format that is easy to parse and dump. It isn't just one big spaghetti mess of instructions. Instructions are still organized in blocks, we just have more freedom to do code motion.
That said, the order of instructions in a block, and the whole of `ir::Layout`, represents only a particular scheduling of instructions and does not encode any implicit constraints about where an instruction may or may not be placed. Placement constraints (i.e. how far and where we can code-motion something) are solely defined by the instruction's explicit operands.

Our existing optimizations that operate on values, uses, and defs, like GVN and unnecessary-phi removal, will Just Work with effect values and will therefore automatically be extended from pure instructions to side-effectful instructions (no more `side_effects_idempotent` flag needed for instructions!) and this will improve code generation. Removing unnecessary effect-value phis, for example, lets us do even more code motion than we otherwise would have with the phi. GVN deduplicating identical `store` instructions (same pointer, same value, and same effect operand) is equivalent to a basic redundant-store-elimination optimization -- without any additional alias analysis required.

We can remove alias regions from `ir::MemFlags` and use effect values instead. Today, the optimizer is free to hoist a load past a store and out of a loop when they are annotated with different alias regions:
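For example, in a loop like the following (a sketch; today these annotations are spelled as `ir::MemFlags` alias-region flags such as `vmctx` and `heap`, and `v0` is the vmctx parameter), the load can be hoisted:

```
block1(v5: i32):
    ;; Annotated with the `vmctx` alias region: load the linear memory's
    ;; base pointer out of the vmctx.
    v6 = load.i64 notrap aligned readonly vmctx v0+80
    ;; Annotated with the `heap` alias region: this store cannot clobber
    ;; the `vmctx` load above, so that load may be hoisted out of the loop.
    store heap v5, v6+16
    jump block1(v5)
```

But we only have a fixed number of alias regions, and we want many more.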
With effect values, we don't have any limit on the number of logical alias regions we can create. As long as we wire up the instruction dependencies correctly, and don't over-constrain instructions, everything falls into place naturally:
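Here is the same loop sketched with effect values instead (hypothetical syntax; the `heap0`/`heap1` names for the wasm heap's logical region are illustrative):

```
block1(v5: i32, heap0: effect):
    ;; This load depends only on the function-entry effect, `effect0`, so
    ;; no store to the heap's logical region constrains it and it can be
    ;; hoisted out of the loop.
    v6 = load.i64 v0+80, effect0
    ;; The store consumes and redefines only the heap's logical region.
    heap1 = store v5, v6+16, heap0
    jump block1(v5, heap1)
```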
## Potential Challenges
Of course, any large change to CLIF like this will also face challenges. Here are some that I foresee:
Although we no longer need to track implicit dependencies in core Cranelift, CLIF frontends must track them and explicitly encode them when producing CLIF. This is additional work that the frontend must perform. That said, frontends are free to over-constrain instructions with false dependencies. That can make initial migration easier, and then they can start refining those dependencies, making them more fine-grained and relaxing over-approximated constraints towards their precise constraints, to unlock more aggressive optimizations as needed.
Incrementally implementing these changes in Cranelift might be difficult. I haven't thought too deeply about this yet. We might be able to introduce these dependencies in CLIF but not take action on them yet, then start using and preserving them in the egraph, then start using them in the alias analysis, etc., until everything is migrated over, at which point we can remove alias regions from `ir::MemFlags` and finalize the migration and all of that. It may also be worth defining new copies of all of our relevant instructions that take and return effect values, alongside their current versions that don't, so that we don't have to move over, e.g., all memory operations in all of Cranelift all at once.

Performance. Explicitly representing all control-flow and side-effect dependencies means a lot of new edges in the IR and additional operands and results. That said, we won't generally need to work with the `ir::Layout` or iterate over instructions inside a block anymore (we really only need that at the end of the mid-end, when scheduling instructions and lowering to machine instructions). So in some sense we are "just" shifting some edges from inside the `ir::Layout` to the instructions themselves. Of course, the devil will be in the details, and we will definitely need to do a fair amount of profiling and performance work to preserve our current compile times.