
Cranelift: towards representing control and effect dependencies in CLIF #10427


Open
fitzgen opened this issue Mar 19, 2025 · 13 comments
Labels
cranelift:area:clif · cranelift:discussion · cranelift:goal:optimize-speed (focus area: the speed of the code produced by Cranelift) · cranelift:mid-end (clif-to-clif related passes, legalizations, etc.) · cranelift:moonshot · cranelift (issues related to the Cranelift code generator)

Comments

@fitzgen
Member

fitzgen commented Mar 19, 2025

Note: pie-in-the-sky ideas incoming. I'm not proposing we stop everything and do this immediately, just that we consider it for Cranelift's long-term evolution. Apologies if anything here isn't super clear, I didn't sleep well last night and I'm really tired.

In #10340, we fixed a bug where we performed code motion on a non-trapping, read-only load operation such that only its explicit dependencies were satisfied. However, CLIF only represents data dependencies between instructions explicitly, not control or effect dependencies. It turns out code motion was not safe because the load's safety and non-trapping-ness implicitly depended upon control flow and its block's location in the CFG: specifically, the load was in a block that was dominated by a br_if that performed a bounds check. Our optimizer could not see, and therefore did not abide by, that implicit constraint, and so performed "illegal" code motion leading to a miscompile. This bug could have been a very serious CVE; that it was not comes down to a combination of luck and the fact that it was lurking in less-exercised, off-by-default, not-yet-production-ready features that our fuzzers hadn't hammered on.

We addressed this issue with an initial stop-gap solution by adding a new can_move flag to ir::MemFlags, effectively conservatively disabling code motion for all memory operations by default and requiring that a memory operation set this flag to opt into dependency-based code motion in the optimizer. I remain concerned that we have similar misoptimization bugs still lurking, that we annotated a load with can_move when it should not have been, or that we will make an innocuous change in the future that, from a distance, invalidates the soundness of a downstream instruction's can_move annotation. A more-principled solution would be to eliminate implicit dependencies and explicitly represent all dependencies in the IR itself. Taken to its limit, where you no longer have a control-flow graph of basic blocks and instead only have control-dependency edges between instructions, this is the "sea of nodes" approach to IR design. After some offline discussions and brainstorming with Chris and Alex, I believe I may have found a sweet spot for Cranelift: one that makes all dependencies explicit but doesn't require boiling the ocean by redefining CLIF and transforming it beyond recognition into that full-blown sea-of-nodes limit (we can keep block arguments, for example, which feel very CLIF-y).

Benefits of Explicit Dependencies

First, let me enumerate the benefits we can reap by making all of an IR instruction's dependencies explicit:

  • We eliminate the whole class of misoptimization bugs that occur because the optimizer fails to satisfy constraints that it isn't privy to. Implicit constraints that the optimizer is ignorant of no longer exist.

  • We do not need to conservatively deny optimizations by default and require opting into, e.g., code motion. Our defaults can be to optimize exactly as much as the dependencies allow and no more than that. Optimization is more uniform. CLIF frontends no longer have to consider whether a given memory operation is hot enough to be worth opting into more aggressive optimizations for, and, if so, take the time to verify whether doing so would even be sound.

  • Implementing the optimizer becomes easier. All IR passes and transformations must already take care to schedule instructions such that their dependencies are satisfied (i.e. that an instruction is dominated by the defs of any values it uses). However, right now we additionally encode an alias region in memory operations, and we must do an alias analysis in order to soundly perform redundant load elimination and store-to-load forwarding optimizations; we also must perform a state-coloring analysis during CLIF-to-MachInst lowering to determine whether fusing a particular memory operation into an instruction's operand (on architectures like x86_64 where many instructions support both register and memory operands) is sound. By making all dependencies explicit, we can remove alias regions in memory operations and vastly simplify the alias and state-coloring analyses (to the point that, I believe, we would barely even consider them full-blown analyses anymore).

  • Explicit side-effect dependencies generalize our small number of fixed alias regions into any number of logical alias regions: just depend on this logical alias region's most-recent side effect.

  • We can more easily implement new, more-aggressive optimizations, such as branch-folding and optimizing side-effectful instructions.

    Our current architecture makes it difficult to optimize side-effectful and control-flow instructions in the egraph because the egraph works by rewriting values and organizing values into eclasses; side-effects and control-flow are not represented by values but are instead implicit in their associated instruction. The optimizer lifts everything it can into the egraph and leaves behind the "skeleton" of side-effectful and control-flow instructions in the CFG and its basic blocks. The operands of this skeleton's instructions act to root demanded computations inside the egraph from the outside.

    If we represent control-flow and side-effect dependencies as compile-time-only values, then we can rewrite and optimize these values in the egraph. We don't demand computations from the egraph via a side-effectful skeleton, but instead with a set of control-flow and side-effect value dependencies.

    Additionally, we can do things like reuse an already-loaded value from the vmctx across a call to a function that is known not to modify the vmctx (see the sketch just after this list).
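For example, here is a hypothetical sketch of that vmctx case, using the effect-value syntax introduced in the next section (none of this exists in CLIF today; the instruction and value names are illustrative only):

    block0(effect0: effect, v0: i64):     ; v0 is the vmctx pointer
      v1 = load.i64 effect0, v0+8         ; load a field out of the vmctx
      effect1 = call fn0(effect0, v1)     ; fn0 is known not to modify the vmctx,
                                          ; so it does not clobber the vmctx's effect
      v2 = load.i64 effect0, v0+8         ; depends only on effect0, not effect1,
                                          ; so GVN can reuse v1 across the call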

Sketch of the CLIF Changes

With that out of the way, here is a sketch of the changes I am imagining:

  • We introduce a new ir::types::Effect. Values of this type are used to represent side-effect and control (control-flow being a type of side-effect, if you squint hard enough) dependencies of an instruction at compile time. They are not materialized at run time.

  • The function's entry block's first argument will be an effect value. This represents the control flow and state of the world (e.g. memory) upon entering the function.

  • Instructions that mutate external state, like stores, will both take and produce an effect value:

    effect1 = store.i32 effect0, value, pointer+offset

    This means that the side-effect that effect0 represents must have already happened before we perform this store, and any instruction (such as a load) that might depend on this store having been performed can take effect1 (or another effect value derived from effect1) as an operand.

  • Instructions that depend on external state but do not themselves have side-effects -- like read-only, non-trapping memory loads -- will only take an effect value as an operand, and will not return a new effect value:

    val = load.i32 effect, pointer+offset

    This means that the side-effect that effect represents must have already happened before we perform this load.

  • Similarly, instructions which may trap take and produce effect values:

    effect1 = trapnz effect0, condition, trap_code
  • Unconditional jumps take an effect value operand. Because effect values are just ir::Values, a jump may also, but is not required to, pass effect values as arguments to the target block.

    jump effect, block(v0, v1, ...)
  • Conditional branches take an effect value operand and additionally define a unique effect value result on each of their outgoing control edges. This value-definition-on-control-edge is similar to how try_call instructions define values on their outgoing control edges and we can use the same technique proposed there to pass these values to their target blocks: by syntactically passing the ctrl placeholder as an argument to the block call, rather than an already-defined value.

    block0(v0: effect, v1: i8, v2: i32, v3: i32):
      ...
      br_if v0, v1, block1(ctrl, v2), block2(v3, ctrl)
    
    block1(v4: effect, v5: i32):
      ...
    
    block2(v6: i32, v7: effect):
      ...
  • We will probably want an instruction to combine multiple effect values into a single new effect value:

    effect2 = combine_effects effect0, effect1, ...

    This can be used to make an instruction depend on both a previous store and a br_if bounds check, as sketched just after this list.
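To make that concrete, here is a hypothetical example combining the pieces sketched above (purely illustrative syntax): a load that must stay ordered after both a preceding store and a bounds-checking br_if can depend on a combined effect:

    block0(v0: effect, v1: i8, v2: i64, v3: i32):
      v4 = store.i32 v0, v3, v2+16               ; produces effect v4
      br_if v0, v1, block1(ctrl, v2, v4), block2(ctrl)

    block1(v5: effect, v6: i64, v7: effect):     ; v5 from the branch edge, v7 from the store
      v8 = combine_effects v5, v7
      v9 = load.i32 v8, v6+8                     ; may not be hoisted above the check or the store
      ...

    block2(v10: effect):
      ...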

More Implications

And here are some more random things that fall out from this design:

  • It still looks like CLIF! It remains familiar. Most instructions are pure, in fact, and therefore unchanged. We still have block args/params instead of phis. We still have a text format that is easy to parse and dump. It isn't just one big spaghetti mess of instructions. Instructions are still organized in blocks, we just have more freedom to do code motion.

  • That said, the order of instructions in a block, and the whole of ir::Layout, represents only a particular scheduling of instructions and does not encode any implicit constraints about where an instruction may or may not be placed. Placement constraints (i.e. how far and where we can code motion something) are solely defined by the instruction's explicit operands.

  • Our existing optimizations that operate on values, uses, and defs, like GVN and unnecessary phi removal, will Just Work with effect values and will therefore automatically be extended from pure instructions to side-effectful instructions (no more side_effects_idempotent flag needed for instructions!) and this will improve code generation. Removing unnecessary effect value phis, for example, lets us do even more code motion than we otherwise would have with the phi. GVN deduplicating identical store instructions (same pointer, same value, and same effect operand) is equivalent to a basic redundant-store-elimination optimization -- without any additional alias analysis required (see the sketch just after this list).

  • We can remove alias regions from ir::MemFlags and use effect values instead. Today, the optimizer is free to hoist the load past the store and out of the loop because they are annotated with different alias regions:

    block0(v0, v1):
      jump block1(v1)
    
    block1(v2):
      store.i32 table v0, v2
      v3 = load.i32 vmctx, v1
      ...
      jump block1(...)
    
    ==>
    
    block0(v0, v1):
      v3 = load.i32 vmctx, v1
      jump block1(v1)
    
    block1(v2):
      store.i32 table v0, v2
      ...
      jump block1(...)

    But we only have a fixed number of alias regions, and we want many more.

    With effect values, we don't have any limit on the number of logical alias regions we create. As long as we wire up the instruction dependencies correctly, and don't over-constrain instructions, then everything falls into place naturally:

    block0(effect0, v0, v1):
      jump effect0, block1(effect0, v1)
    
    block1(effect1, v2):
      effect2 = store.i32 effect1, v0, v2
      v3 = load.i32 effect0, v1
      ...
      jump effect2, block1(effect2, ...)
    
    ==>
    
    block0(effect0, v0, v1):
      v3 = load.i32 effect0, v1
      jump effect0, block1(effect0, v1)
    
    block1(effect1, v2):
      effect2 = store.i32 effect1, v0, v2
      ...
      jump effect2, block1(effect2, ...)
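And here is the GVN/redundant-store-elimination case mentioned above, as a hypothetical sketch (illustrative syntax only): two stores with identical operands, including the same effect input, look like the same instruction to GVN, so the second one is deduplicated and its effect result unified with the first's:

    effect1 = store.i32 effect0, v0, v1
    effect2 = store.i32 effect0, v0, v1    ; same value, pointer, and effect input

    ==>

    effect1 = store.i32 effect0, v0, v1    ; effect2 is unified with effect1;
                                           ; the redundant store disappears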

Potential Challenges

Of course, any large change to CLIF like this will also face challenges. Here are some that I foresee:

  • Although we no longer need to track implicit dependencies in core Cranelift, CLIF frontends must track them and explicitly encode them when producing CLIF. This is additional work that the frontend must perform. That said, frontends are free to over-constrain instructions with false dependencies. That can make initial migration easier, and then they can start refining those dependencies, making them more fine-grained and relaxing over-approximated constraints towards their precise constraints, to unlock more aggressive optimizations as needed.

  • Incrementally implementing these changes in Cranelift might be difficult. I haven't thought too deeply about this yet. We might be able to introduce these dependencies in CLIF, but not take action on them yet, then start using and preserving them in the egraph, then start using them in the alias analysis, etc... until everything is migrated over, at which point we can remove alias regions from ir::MemFlags and finalize the migration and all of that. It may also be worth defining new copies of all of our relevant instructions that take and return effect values, alongside their current versions that don't take and return effect values, so that we don't have to move over, e.g., all memory operations in all of Cranelift all at once.

  • Performance. Explicitly representing all control-flow and side-effect dependencies is a lot of new edges in the IR and additional operands and results. That said, we won't generally need to work with the ir::Layout or iterate over instructions inside a block anymore (we really only need that at the end of the mid-end when scheduling instructions and lowering to machine instructions). So in some sense we are "just" shifting some edges from inside the ir::Layout to the instructions themselves. Of course, the devil will be in the details and we will definitely need to do a fair amount of profiling and performance work to preserve our current compile times.

@fitzgen added the cranelift, cranelift:area:clif, cranelift:discussion, cranelift:goal:optimize-speed, cranelift:mid-end, and cranelift:moonshot labels on Mar 19, 2025
@bjorn3
Contributor

bjorn3 commented Mar 25, 2025

V8 is moving in the exact opposite direction, as sea of nodes was found to have several disadvantages in practice, at least for their use case. Not all of them would apply to Cranelift, but it may still be informative: https://v8.dev/blog/leaving-the-sea-of-nodes

@fitzgen
Member Author

fitzgen commented Mar 25, 2025

Thanks for the link. I'm glad they put their rationale into writing, because it was all tribal knowledge and rumors before this.

I'm going to go through their list of downsides and respond to each of them in the context of CLIF and what I wrote up above. All this should be taken with some amount of salt, though, because they've obviously had much more hands-on experience with effect dependencies in their IR than I have!

Manually/visually inspecting and understanding a Sea of Nodes graph is hard

I think that what I sketched above can largely sidestep this issue because we would still organize things into blocks, and effectively always have an instruction schedule, rather than a big soup of nodes. We would still have a familiar text format, etc...

Basically, this issue is proposing: "Let's add the control/effect dependencies, but not remove basic blocks. Then our goal is to do as much aggressive code motion in our existing CFG and basic blocks as we can while satisfying dependencies, rather than to reconstruct a CFG and schedule instructions in blocks from scratch." I think they are ultimately equivalent in theory, but I suspect that the former will be easier than the latter in practice (we basically already do the former during egraph elaboration).

Too many nodes are on the effect chain and/or have a control input

It's quite possible that this would still be an issue.

The source program (Wasm or rustc MIR) is going to constrain things somewhat. Some programs just don't have as much wiggle room as others. This is the nature of compiler optimization in general though. Many programs might not benefit from a particular analysis, but the ones that do really benefit. The trick is finding the best bang for your buck and complexity budget.

It is worth mentioning that the source-to-CLIF frontend could also make things worse by unnecessarily introducing dependencies that aren't strictly needed (for example, by linearizing all effects during CLIF construction, even when they could have been a partial ordering) which would constrain the optimizer and give it less freedom for code motion.
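As a hypothetical illustration of that last point (using the proposed syntax from the issue text above): a frontend that threads a single effect chain through every memory operation forces two stores to distinct Wasm memories into a fixed order, whereas giving each logical region its own effect leaves them unordered and free to move independently:

    ; over-constrained: one linear effect chain
    effect1 = store.i32 effect0, v0, v1    ; store to memory 0
    effect2 = store.i32 effect1, v2, v3    ; store to memory 1, artificially ordered after it

    ; partial order: one effect per logical region
    effect1 = store.i32 effect0, v0, v1    ; store to memory 0
    effect2 = store.i32 effect0, v2, v3    ; store to memory 1, independent of the first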

Memory operations do not float easily

Looking at their example, it seems like the difficulty is somewhat incidental and not fundamental.

They have effect dependencies between their two loads, but neither load actually produces an effect AFAICT, and so they should both instead depend on start, not on each other. With that change, they should each be able to float to either side of the if, since all their dependencies will be satisfied at that point. It seems like this might be an instance of the issue I was just mentioning where the frontend introduces unnecessary dependencies. They talk about doing analyses to determine when they can relax dependencies and how this is no harder on a CFG than SoN. I don't see any reason that is false, but I would prefer to avoid these analyses completely and instead invest in better dependency construction in the frontend.

SoN is just CFG where pure nodes are floating

This is a really interesting/revealing statement, because Cranelift already has "CFG where pure nodes are floating": that is our egraph and side-effectful skeleton! And what I wrote up above is very much not just that, because we would not impose a total ordering on anything that produces or consumes side effects; these operations would be partially ordered, and this is exactly the property that allows for many different schedulings and more aggressive code motion, and ultimately better optimization. I think a key part of avoiding a single effect chain with a total ordering would be the bits I wrote up about generalizing our alias regions to effect dependencies. If you linearize all memory operations into a total order, that is equivalent to using a single alias region, and we know that we get tons of benefit from using multiple alias regions already today.

Managing the effect and control chains manually is hard

I believe it. Emitting partially-ordered effects puts even more of this kind of management responsibility on the frontend.

This doesn't, by itself, mean the whole endeavor isn't worthwhile.

The scheduler is too complex

I believe it.

But, as mentioned before, I think we can avoid that complexity by doing code motion on an existing schedule (something we already do!) rather than creating a new schedule from scratch. They use partial redundancy as an example: our egraph elaboration already avoids introducing partial redundancy and I don't see that getting any harder in this new world.

Finding a good order to visit the graph is difficult

Again, we can avoid this issue since we would retain a CFG and a schedule of instructions.

Cache unfriendliness

I'm skeptical of how generalizable these numbers are. A lot of this seems tied up in the exact data structures used for the IR.

That said, CFGs and basic blocks have implicit ordering and dependencies between instructions, while SoN represents them with explicit edges. Those explicit edges need to have some kind of runtime representation, which will lead to larger in-memory structures and therefore more cache misses. So I believe it, but also think there is a lot of performance engineering that could be done to mitigate the effects. It seems like the potential code improvement could be worth the compile time overhead, especially for final-tier compilers.

Sea of Nodes destroys any prior scheduling, by construction

We already destroy any existing schedule with our egraph mid-end, so introducing effect dependencies doesn't change that for us. 🤷

@cfallin
Member

cfallin commented Mar 26, 2025

I should add some thoughts here too -- I found the V8 post interesting, and obviously did a lot of thinking about this when initially researching and implementing our aegraphs mid-end (which has many parallels re: "free-floating nodes", as Nick mentions), so I certainly have some Opinions.

  • The kernel of the writeup seems to be that the SoN approach is too complex to debug or introspect; see e.g. all the notes about visualization. I certainly experienced some of this while implementing and debugging aegraphs, and we'd see more complexity for sure if we added deps to CLIF outside of the aegraph phase, but to some degree I think this is philosophical: either one represents this dimension of the IR or one does not, and if the representation is needed to enable some transform, it's usually better to buy into it (and then focus on making the ergonomics for debugging and development better) than to construct an ad-hoc approximation of it.

  • The writeup mentions difficulties with code de-duplication / re-duplication and avoiding creating redundant computations; we solved this with our aegraph "elaboration" algorithm, which ends up giving exactly equivalent results to classical CSE/GVN, and it just isn't really an issue. In fact de-duplication helps with compile time because it means that rewrites only need to happen once for a given node.

  • The writeup states that there was less opportunity than expected to take advantage of the partial-order freedom to reschedule code. I don't doubt that in the context of their JIT backend, but we're motivating this issue in our context with increasingly fine-grained alias information that we know should unlock optimization potential. It seems that messy JavaScript semantics played a large part for V8 here: the "looking at a binary operator wrong could run any code and explode the universe" aspect of the language's dynamism means that code is mostly linearized in practice, whereas with Wasm GC operators' static typing and more constrained semantics, there might be more to do. All we need is a few examples of e.g. LICM-ability of loads unlocked by the partial deps to make the framework worth it.

  • I suspect there is some "project lifecycle" aspect here too. V8 is a very mature and complex compiler backend, and so they have implemented many of the optimizations that they want to build and will be looking for ways to simplify. On the other hand we are still relatively young in some ways and are looking for ways to enable more performance. That doesn't mean that we have to pursue complexity, or can't learn from others; but it does mean that the calculus is a bit different, and simple implementations of ideas like this one (without going all-in or removing blocks completely -- again as Nick describes above) might be appropriate ways to find this performance, or at least to experiment with.

@hanna-kruppe

There are ways to have some of the "explicit, principled effect representation" cake without wrangling explicit SoN-style control and effect dependency edges through the entire pipeline. As described by Filip Pizlo (and implemented e.g. in WebKit's B3 JIT), one can pick a uniform representation for the effects of any instruction, including reads/writes of heap alias regions and control flow (branches "write control", control-dependent instructions "read control") and anything else that matters for code motion. This gives a more principled answer to "am I allowed to move this instruction past that one?" than a collection of ad-hoc flags like can_move, readonly, notrap, and so on.

It's probably also easier for frontends to get such effect read/write sets right because they are mostly about what each instruction does (and some flat context like whether a load from wasm linear memory is gated under an explicit bounds check). Constructing the fully explicit def-use chains for each kind of effect seems harder to get right, similar to using mutable variables during CLIF construction vs. directly constructing everything in SSA form. Even if CLIF wanted explicit "effect values", it may be useful for cranelift-frontend to construct these from per-instruction read/write sets in a generic fashion, instead of leaving producers entirely to their own devices.

In terms of the IR that optimization passes work on, only knowing read/write of each instruction is of course less convenient for doing aggressive code motion. You can easily tell which instructions have no effects at all, but moving an effectful instruction from program point A to program point B would be absurdly tricky with raw pairwise "can this instruction be swapped with a predecessor/successor" queries. I don't know if and how B3 solves it, but in my mind this is similar to def-use chains for data flow. You can build your entire IR around an explicit, sparse representation of which "writes" are relevant for which "reads" (i.e., SSA for data flow, SoN-style for effects/control), but you don't have to. It can also be handled by a supplementary analysis that's computed only in the part of the pipeline where it's most useful. For data flow dependencies, everyone seems to agree that SSA is a very good trade-off. For effects and control dependencies, no such consensus has emerged yet. But having a clear ground truth for "what effects/dependencies does this instruction have" seems good regardless of if/how you materialize the implied relations between different instructions.

@fitzgen
Member Author

fitzgen commented Mar 31, 2025

You can build your entire IR around an explicit, sparse representation of which "writes" are relevant for which "reads" (i.e., SSA for data flow, SoN-style for effects/control), but you don't have to. It can also be handled by a supplementary analysis that's computed only in the part of the pipeline where it's most useful. For data flow dependencies, everyone seems to agree that SSA is a very good trade-off. For effects and control dependencies, no such consensus has emerged yet. But having a clear ground truth for "what effects/dependencies does this instruction have" seems good regardless of if/how you materialize the implied relations between different instructions.

Agreed: there are multiple viable approaches to representing effects and control in an IR. We just have to try and figure out what is the best fit for Cranelift, and what is realistic given our engineering resources.

There are ways to have some of the "explicit, principled effect representation" cake without wrangling explicit SoN-style control and effect dependency edges through the entire pipeline. As described by Filip Pizlo (and implemented e.g. in WebKit's B3 JIT), one can pick a uniform representation for the effects of any instruction, including reads/writes of heap alias regions and control flow (branches "write control", control-dependent instructions "read control") and anything else that matters for code motion. This gives a more principled answer to "am I allowed to move this instruction past that one?" than a collection of ad-hoc flags like can_move, readonly, notrap, and so on.

I see this approach as, roughly, extending our existing alias regions infrastructure from annotating only memory operations to any instruction. I do believe this would be an improvement over what we have today. That said, fully removing our ad-hoc flags would require roughly the same kind and amount of work as adding full effect edges to CLIF and I don't think it would solve the class of bug we hit that motivated my thinking on this topic, where a memory operation was moved past its safety checks, as effectively as adding effect dependencies would.

On the one hand, we could make conditional branches clobber all alias regions, but this is overly conservative and would unnecessarily restrict code motion. We would end up generating worse code than we are able to today. We wouldn't be able to lift a Wasm linear memory's base pointer out of a loop anymore, for example.

Alternatively, we could keep track of exactly which operations depend on that conditional branch, and make the branch clobber only those alias regions. This is equivalent to the work necessary to track and manage effect dependencies; however, it has a few downsides:

  1. It is a little weird since you are not adding a dependency to the instruction that relies on that branch having happened, but instead going back to that branch that was already emitted and mutating its alias region clobber set after the fact. If some optimization pass later removes the instruction that added the branch's clobber for a particular alias region, the pass must either do some kind of analysis to see if any instruction still depends on that branch before it can safely remove the clobber, or it must leave the clobber in place, which adds unnecessary constraints and potentially pessimizes our generated code.

  2. It results in each alias region having a linear side effect chain, rather than an arbitrary partial ordering. This, again, imposes artificial constraints on the optimizer.

  3. While it does unify side-effects and control at the IR level, it does not unify them with pure data dependencies, which means that we don't ultimately unify and simplify dependency-management in our compiler/optimizer: we have to consider an instruction's pure data-flow dependencies and also infer its side-effectful dependencies from which alias regions it uses and which predecessors clobbered those regions.

In particular, I think (3) is the biggest downside of annotating any instruction with alias regions -- as well as our current state of things -- and conversely is the biggest potential benefit of adding effect dependencies. We aren't a large team and we don't have the maintainer-power to write and maintain more and more optimization passes and static analyses, brute-forcing our way to good codegen the traditional way. Therefore, we must pursue techniques that we believe can give us outsized performance-bang for their complexity-buck. With effect dependencies, I see the potential to address a scary class of bugs in a principled manner, simplify our internals, and allow for even more aggressive optimization than we can do today, all at once.

It's probably also easier for frontends to get such effect read/write sets right because they are mostly about what each instruction does (and some flat context like whether a load from wasm linear memory is gated under an explicit bounds check). Constructing the fully explicit def-use chains for each kind of effect seems harder to get right, similar to using mutable variables during CLIF construction vs. directly constructing everything in SSA form. Even if CLIF wanted explicit "effect values", it may be useful for cranelift-frontend to construct these from per-instruction read/write sets in a generic fashion, instead of leaving producers entirely to their own devices.

Yeah, there is definitely work we should be able to do in cranelift-frontend to help improve things on this front.

Off the top of my head, a good approach for frontends to start with would be to

  • define a cranelift_frontend::Variable for each logical alias region they have,
  • initialize those variables to the function's initial, starting effect inside the entry block as the first thing they do before emitting any instructions,
  • and then a store to Wasm memory i (for example) would get the value of the ith Wasm memory's effect Variable, depend on that value when emitting the store instruction, and then redefine that variable with the new effect value returned from the inserted store instruction.

This approach would be equivalent to, and no more difficult than, annotating every instruction with the alias regions it reads and clobbers in the frontend. We could wrap this pattern up in helper methods in cranelift-frontend.
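A minimal, hypothetical Rust-ish sketch of that pattern, assuming a proposed effect type and effect-aware store/load builder methods that do not exist in Cranelift today (the Variable plumbing mirrors the real cranelift-frontend declare_var/def_var/use_var API):

    // One Variable per logical alias region, e.g. one per Wasm linear memory.
    let mem0_effect = Variable::new(0);
    builder.declare_var(mem0_effect, EFFECT); // hypothetical `effect` type

    // In the entry block: seed each region's variable with the function's
    // incoming effect parameter.
    let entry_effect = builder.block_params(entry_block)[0];
    builder.def_var(mem0_effect, entry_effect);

    // A store to Wasm memory 0: depend on the region's current effect and
    // redefine the variable with the store's new effect result.
    let cur = builder.use_var(mem0_effect);
    let new_effect = builder
        .ins()
        .store(MemFlags::trusted(), cur, value, addr, 0); // hypothetical effect operand/result
    builder.def_var(mem0_effect, new_effect);

    // A read-only load from the same memory only consumes the current effect;
    // it does not redefine the variable.
    let cur = builder.use_var(mem0_effect);
    let loaded = builder
        .ins()
        .load(types::I32, MemFlags::trusted(), cur, addr, 0); // hypothetical effect operand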

@hanna-kruppe

hanna-kruppe commented Mar 31, 2025

Reusing the Variable infrastructure for "this instruction reads/writes these heap alias regions" is a very neat unifying idea, and a clear advantage of treating them as just Values. I've started thinking about the implications for control dependencies and other effects that aren't just heap alias regions (e.g., potential trapping). But now I've convinced myself there's something missing even from the description of heap alias regions! The OP said:

Instructions that depend on external state but do not themselves have side-effects -- like read-only, non-trapping memory loads -- will only take an effect value as an operand, and will not return a new effect value:

But in this case, we might have a snippet like this: a load from wasm memory followed by a store to the same memory, with a constant address that's guaranteed in-bounds for the wasm memory type (v2 is the base address of that memory).

block1(v0: effect, v1: i8, v2: i64):
  v3: i8 = load.i8 v0, v2+123
  v4: effect = store.i8 v0, v1, v2+123
  ...

What prevents incorrectly moving the store before the load? Certainly not the SSA "defs dominate uses" requirement. This example could be fixed by making everything that reads some effect/alias region also produce a new effect value. But this flies in the face of the intuition I've tried to apply to these effect values, and needs some new tweak to make sure successive read-only operations aren't forced into a total order again by the effect chain.

I suppose this gestures at the question in the back of my mind all this time: why don't the various SoN-style and (R)VSDG-style IRs do what's proposed in this issue? Why do they have separate kinds of edges for effects and for data flow?

@cfallin
Member

cfallin commented Mar 31, 2025

What prevents incorrectly moving the store before the load?

Ah, this is really interesting! In one sense, the memory-state values are linear (or affine): the store consumes the old memory/effect state (or clobbers it), so it is no longer available for the load to use.

I see several ways to encode this into the IR:

  • Create a notion of linear/affine values and consume them (thus destroying them) when the memory state is overwritten. Not too hard to check, but may have weird/unknown implications on other transforms, because that is definitely not a usual feature of SSA. So the failure mode here is that we have a long tail of IR verification failures.
  • Create a notion of "clobbers" -- a variation on the above -- at the SSA level. An op is allowed to indicate that some SSA value is no longer available at the points it dominates. It's kind of an "anti-def".
  • When lowering from the "variable" level to SSA, track all readers of a given version of an effect, and fan-in deps to the next update. This encodes the deps in a classical SSA way with explicit edges.

I personally favor the last one, for simplicity reasons, though it is the most verbose encoding (hence the least space-efficient) if we have n readers for every writer, for example.

@fitzgen
Member Author

fitzgen commented Mar 31, 2025

What prevents incorrectly moving the store before the load?

Thanks for whipping up this example, it is indeed an interesting one.


I had originally been imagining effect values as affine, but at some point Chris convinced me we shouldn't do that. Additionally, it gets messy when we want to fence on a bunch of effects: we merge them together, do an operation depending on that merged effect, and then need to unpack them again from the resulting effect value after the operation (e.g. fanning all effects into a call and then fanning out to recover the new version of each logical alias region). Agreed with Chris that affine values might have weird implications on various transforms and that we would need a bunch of annoying, specialized CLIF validation rules.

So to allow the encoding of dependencies we want purely in SSA values, where the store depends on the load, I think we need all instructions that read from the outside world, even if they don't mutate it, to return new effect values. Then the store can depend on the load's effect, and the desired ordering is encoded in the regular value dependencies again.

block1(v0: effect, v1: i8, v2: i64):
  v3: i8, v4: effect = load.i8 v0, v2+123
  v5: effect = store.i8 v4, v1, v2+123
  ...

I believe this is the third approach Chris enumerated, but making explicit our need for, e.g., loads to return a new effect value.

The anti-def idea is an interesting one, but again it is pretty bespoke and separate from the rest of our SSA infrastructure, which is in opposition to some of our goals here and makes me disinclined to pursue it further.

I would lean towards making any operation that reads from the outside world also return a new effect value.

@hanna-kruppe

hanna-kruppe commented Mar 31, 2025

Regarding affine/linear values, note that there would have to be a notion of a non-consuming use as well for operations which are read-only w.r.t. that effect. Otherwise everything related to the same alias region is forced into a single linear chain, which makes the whole thing much less appealing.

But then this raises interesting questions for merging effects - if a load depends on both control and the last write to a wasm memory, and these are merged with a combine_effects operation, is that a consuming use of the input effects? If yes, you have to artificially introduce order between operations that depend on intersecting but non-identical sets of effects. If not, this is a way to duplicate effect values and must be accounted for when constructing the consuming use(s) of later writes.

(There were already very good reasons given to not go this route, but it seems there’s even more complexity hiding there.)

@hanna-kruppe

hanna-kruppe commented Mar 31, 2025

Hmm, while affine effect values lead to two different kinds of uses, an encoding into regular SSA def-use relationships may also need a similar distinction for defs: if you want to DCE a non-trapping load, you need to know that removing its effect result value (or replacing it with the effect inputs to the load?) is valid. In other words, you need to know which effect values represent actual changes and which ones are “pass through” for the sake of wiring up a read to relevant later writes. This seems to drag in a significant chunk of “explicit effect chain manipulation” in SoN terms, or in CLIF terms, a significant amount of special knowledge about the semantics of specific instructions that doesn’t fall out from the SSA data flow alone.

@fitzgen
Member Author

fitzgen commented Mar 31, 2025

In other words, you need to know which effect values represent actual changes and which ones are “pass through” for the sake of wiring up a read to relevant later writes. This seems to drag in a significant chunk of “explicit effect chain manipulation” in SoN terms, or in CLIF terms, a significant amount of special knowledge about the semantics of specific instructions that doesn’t fall out from the SSA data flow alone.

I don't think this is actually too bad: we can annotate these instructions in the cranelift-meta crate's instruction definitions/builders to automatically generate predicate methods for use in cranelift-codegen, similar to our side_effects_idempotent flag, for example.

@hanna-kruppe

hanna-kruppe commented Mar 31, 2025

How do you distinguish loads that may trap from those that don’t, if both kinds of load take and produce effect values?

@fitzgen
Member Author

fitzgen commented Mar 31, 2025

We will still need to annotate loads with the ir::TrapCode of their trap, so we can use that.
