This repository was archived by the owner on Sep 15, 2025. It is now read-only.
forked from llvm/llvm-project
ELF note types for LLPC cache hash #2
Open
jaebaek wants to merge 171 commits into GPUOpen-Drivers:amd-gfx-gpuopen-dev from jaebaek:llpc_cache_hash_note
Conversation
This pass attempts to merge s_buffer_load_dword instructions into larger sizes. The constraints are that the resource descriptor, glc flag and size are all the same for contiguous load instructions.

TODO: it should be possible to extend this to other instructions as well, such as s_buffer_store and the vector variants buffer_load/buffer_store.

V2: Fixes to sbuff-merge test
V3: Fixed smrd test for smem merging
V4: Fixed for gfx10 changes

Change-Id: Ie98107ad3b27b3c8c2da7afc3d129336f924895a
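As an illustration of the merging rule described above, here is a hypothetical sketch (not the pass's actual implementation) that greedily merges contiguous dword loads sharing a resource descriptor and glc flag. The size cap is an assumption for illustration; real SMEM loads come in fixed x2/x4/x8/x16 widths, which the sketch ignores.

```python
# Hypothetical sketch of the merging rule described above - not the actual
# pass. Each load is (desc, glc, offset, size), offsets/sizes in dwords,
# with the list sorted by offset.
def merge_sbuffer_loads(loads, max_dwords=16):
    merged = []
    for desc, glc, off, size in loads:
        if merged:
            pdesc, pglc, poff, psize = merged[-1]
            # Merge only if descriptor and glc match, the loads are
            # contiguous, and the result stays within the size cap.
            if (pdesc, pglc) == (desc, glc) and off == poff + psize \
                    and psize + size <= max_dwords:
                merged[-1] = (pdesc, pglc, poff, psize + size)
                continue
        merged.append((desc, glc, off, size))
    return merged
```

For example, four contiguous single-dword loads of the same buffer collapse into one 4-dword load, while loads with different descriptors are left alone.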
Summary: Added intrinsics for the instructions:
- buffer_load_ubyte
- buffer_load_ushort
- buffer_store_byte
- buffer_store_short

Added test cases to the existing buffer load/store tests. Now that upstream supports byte/short overloads of the normal load/store intrinsics, this change is only needed until LLPC stops using the intrinsics added here.

Change-Id: I1b8a0910c508c9520a84b74a14d0aea8293cef38
Vulkan exposed an issue with this for a case with v_mad_mixlo_f16 where the upper 16 bits were not cleared. Modifying this to clear the bits instead of just copying fixed the problem.

V2: Fixed up "Fix issue for zext of f16 to i32"
V3: Fixed fcanonicalize-elimination test

Change-Id: I6128deaf8ebc3489fdd3e0caead837410bb47160
Readlane should have a uniform index, but in some cases it can be non-uniform. The original implementation of readlane defaulted to using a readfirstlane of the VGPR index on the assumption that the index would be uniform across all lanes. However, there are some cases where we might want a readlane-like operation that is lowered to a waterfall that performs a readlane for each distinct index value across lanes.

Essentially we form a loop that uses readfirstlane to get the first enabled lane's index, enable all lanes with the same index, perform a readlane and put the result into the VGPR result register, then disable those lanes and repeat. The pathological case will loop 64 times, but in most cases it will be fewer. In the case where the index IS uniform across all lanes, it will loop once, which is slightly more expensive than the original code, but not by much.

V2: Initialize accumulator register in waterfall code
V3: Fixed for gfx10 changes

Change-Id: Ic52dfb14b053db7713b61c85e8ac766991aecdb8
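The loop structure described above can be modeled outside the compiler. The sketch below is a hypothetical simulation (not the backend code) of the EXEC-mask loop: pick the first active lane's index, service every lane sharing that index, then disable those lanes and repeat.

```python
# Hypothetical simulation of the waterfall readlane lowering described
# above. indices[i] is lane i's VGPR index; values is the table the
# readlane reads from.
def waterfall_readlane(indices, values):
    n = len(indices)
    active = [True] * n          # models the EXEC mask
    result = [None] * n          # per-lane VGPR result register
    iterations = 0
    while any(active):
        # readfirstlane: take the index of the first enabled lane
        first = next(i for i in range(n) if active[i])
        idx = indices[first]
        # every lane sharing that index executes the readlane together,
        # then those lanes are disabled
        for lane in range(n):
            if active[lane] and indices[lane] == idx:
                result[lane] = values[idx]
                active[lane] = False
        iterations += 1
    return result, iterations
```

With a uniform index the loop runs once; with all-distinct indices it runs once per lane, matching the worst case described above.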
Even though writelane doesn't have the same constraints as other VALU instructions, it still can't violate the constraint of at most one SGPR operand. Due to later register propagation (e.g. fixing up VGPR operands via readfirstlane), changing writelane to only have a single SGPR is tricky. This implementation adds a new check after SIFixSGPRCopies that prevents multiple SGPRs being used in any writelane instruction.

The algorithm checks for trivial copy propagation of constants into one of the SGPR operands and performs it if possible. If this isn't possible, it puts an explicit copy of the Src1 SGPR into M0 and uses that instead (this is allowable for writelane because the constraint is on the SGPR read port, not constant-bus access).

V2: Fix up cases where writelane has 2 SGPR operands - bug fixes. Update to the previous commit to fix an issue where immediates are tested for copy propagation but aren't suitable as inline constants. The old method caused issues that resulted in segfaults in later CSE phases.

Change-Id: Ic3045df22db738ca6572af00138520fd28563990
Implements 3 new intrinsics for implementing waterfall code for regions of code. Waterfall is implemented as a loop that iterates over an index of values for active lanes. For each iteration an index is picked (the first active lane) and all lanes with the same index are left active (the rest are disabled). The body of the code is then executed and the result accumulated into a result vector register. The active lanes are then disabled and the next index chosen for the next iteration. The worst case for waterfall is one iteration per lane, but it is usually far fewer.

The implementation uses 3 intrinsics to mark a region:
- llvm.amdgcn.waterfall.begin
- llvm.amdgcn.waterfall.readfirstlane
- llvm.amdgcn.waterfall.end

The group must contain one begin and at least one readfirstlane and end intrinsic; if the readfirstlane uses the begin index, then there will only be a single readfirstlane, not two. The readfirstlane and end intrinsics are not limited to single-dword values. The waterfall loop will enclose all instructions from the begin to the final end intrinsic. See the test case for specific examples of use.

V2: [AMDGPU] Updates to waterfall intrinsic support
Some fixes for waterfall support:
1. Intrinsics are better defined, so enforce type correctness.
2. Creation of the waterfall loop requires removing the kill flag for operands moved into the loop; this could otherwise cause an assert during verification.
3. Fixed an issue causing a tablegen failure due to a float return type being (incorrectly) used for a scalar return value. The matching from intrinsic to pseudo instruction now uses 2 template parameter types to disambiguate when using float src types.
4. Ensure that any analyses are invalidated due to insertion of new basic blocks. This should have been the default anyway.
5. Updated tests in light of these changes. These are good exemplars for implementors to check against.

V3: [AMDGCN] Fixed waterfall intrinsic groups for multiple groups per BB
For cases with more than one waterfall group per basic block, the implementation would go wrong after creating the first waterfall loop (incorrectly tracking the current BB). Also added a test case.

V4: [AMDGCN] Waterfall enhancements
Added an extra waterfall test for multiple readfirstlane intrinsics, making sure that waterfall intrinsic clauses support multiple readfirstlane intrinsics. Extended support for begin to allow multi-dword indices for the waterfall loop. This allows the use of multiple indices (and multiple non-uniform values) in the same waterfall loop by combining the individual indices into a single multi-dword index; this has the same worst case of 64 iterations (wave size) as before. Added a new test case to demonstrate using multi-dword indices. Added support for i16/f16. Changed the implementation to prevent CSE from merging clauses (by tagging the intrinsics as having side effects and enhancing uniform detection to work in this case as well). Left in some support for partial EarlyCSE merging, as it is still valid but probably won't be triggered much now that merging is disabled. Added support for a last.use intrinsic; see the test for an example. This works in the same way as the end intrinsic, but you tag a last use instead (so it includes the next use in the waterfall loop). More than one use is acceptable as long as it is in the same BB. All instructions up to the final last.use will be included in the waterfall loop. An important rule for waterfall clauses (which isn't detected if violated) is that you can't have something inside the loop that depends on an end intrinsic in the same loop; this restriction is logically consistent - you have to use two loops instead.

V5: [AMDGPU] Handle unexpected uniform inputs in waterfall intrinsics
Waterfall intrinsic input operands can sometimes be uniform. The waterfall handling pass should deal with this gracefully, even though the result could be sub-optimal. This is an initial fix for a problem observed in a game; a subsequent, better fix is being worked on which should produce optimal code (by removing redundant waterfall loops entirely if all the elements turn out to be uniform anyway).

V6: [AMDGPU] Improved handling of uniform waterfalls
This builds on the previous fix for uniform values being specified in waterfall loop intrinsic groups. The previous fix would just remove any waterfall.readfirstlane intrinsics whose input was uniform. This fix improves on that by spotting when ALL waterfall.readfirstlanes are uniform and removing the waterfall loop entirely. The tests have been extended to cover this mode, as well as the case where not all waterfall.readfirstlanes are removed.

V7: [AMDGCN] Re-order waterfall and wholequadmode passes
Running the WholeQuadMode pass after waterfall can have bad consequences. Changing the order improves the outcome, but there may still be problems in some unusual cases. For the known uses of the waterfall intrinsics this change will work for now.

V8: waterfall test fixup

Change-Id: I95b4391b8c0570bd399d70ed64af355d13ef4c84
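A minimal usage sketch of the three intrinsics named above. This is illustrative pseudocode only: the exact overload suffixes and operand types are assumptions, not taken from the patch; the in-tree tests are the authoritative examples.

```llvm
; Hypothetical shape of a waterfall group around a buffer load with a
; non-uniform descriptor %desc (overload names are illustrative):
%tok = call i32 @llvm.amdgcn.waterfall.begin.i32(i32 %index)
%u   = call <4 x i32> @llvm.amdgcn.waterfall.readfirstlane.v4i32.v4i32(
                          i32 %tok, <4 x i32> %desc)
%v   = call float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %u, i32 %ofs, i32 0)
%r   = call float @llvm.amdgcn.waterfall.end.f32(i32 %tok, float %v)
```

Everything from the begin to the final end is enclosed in the generated loop, with %u uniform within each iteration.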
Summary: The test shows a case where a full use is not in a subrange because the subreg is undefined at the use point, but, after a coalesce, the subrange incorrectly still does not include the use even though the subreg is now defined at that point. The incorrect live range causes a "subrange join unreachable" assert on a later coalesce. This commit ensures that a subrange is extended to all uses that can be reached from a definition.

V2: Completely different fix, instead of working around the problem in the later coalesce that actually asserted.
V5: Fixed to not extend liveness to an undef use. Spun the second test out into its own D51257 as it is a completely different problem with the same "subrange join unreachable" symptom.
V6: Ignore debug uses.
V7: Set up undefs correctly, to avoid generating invalid subranges.

Differential Revision: https://reviews.llvm.org/D49097
Change-Id: I172cfe16d360690e921ebe606d3e90dd1cdd1b71
eliminateUndefCopy was incorrectly eliminating a copy that was only a partial write of the target. In the test case, the resulting incorrect live range caused a "subrange join unreachable" later. The test case does not fail for me until I have D49097.

Differential Revision: https://reviews.llvm.org/D51257
Change-Id: Ibb767d8a016934431660764b0194634e2113fea2
Summary: findReachingDefs was incorrectly using its fast path of just blitting in the live range extension when it did not see an undef itself but encountered an undef live out of a predecessor block. Now that RegisterCoalescing calls extendToIndices (D49097), the bug was causing an incorrect subrange which led to a "Couldn't join subrange!" assert in a later coalesce.

Subscribers: MatzeB, jvesely, nhaehnle, llvm-commits
Differential Revision: https://reviews.llvm.org/D51574
Change-Id: I01c942b41ab5a145348289c8b221c45a2ec7200f
…vePartialRedundancy

Summary: removePartialRedundancy was using extendToIndices on each subrange without passing in an Undefs vector containing main reg defs that are undef in the subrange, causing the above assert. Unfortunately I can only reproduce this in an ll test; turning it into a mir test makes the problem go away.

Subscribers: MatzeB, qcolombet, jvesely, nhaehnle, llvm-commits
Differential Revision: https://reviews.llvm.org/D51849
Change-Id: I4f3ba2afb20d79f9467afb3d882f0328d38531be
…wCopyChain

valuesIdentical is called to determine whether a def can be considered an "identical erase", because following copy chains proves that the value being copied is the same value as the value of the other register. The tests show up problems where the main range decides that a def is an "identical erase" but a subrange disagrees, because following the copy chain leads to a def that does not define all subregs in the subrange (in the case of one test) or to a different def than that found in the main range (in the case of the other test).

The fix here is to only detect an "identical erase" in the main range. If it is found, it is automatically inherited by the corresponding def in any subrange, without further analysis of that subrange value. This fix supersedes D49535/rL338070 and D51848.

Change-Id: I059b5b2273ed6f186d134b98c118fef0494e02f8
This is a temporary fix to avoid a breakage where the export(s) for a pixel shader are in control flow (due to a kill). A pixel shader must execute an export done, even with exec=0. A better fix would be to have a pixel-shader-only pass that inserts a null export done in the final basic block if there is not one already there. Change-Id: I22f201f95d52ff699aa8cac26bb019717b20432e
LLPC has .ll file library functions that break this verifier check. We need to fix those before re-enabling the check. Change-Id: I4c000793a805a50f1d91d1ae55c567a356e72519
This commit fixes a really fun bug in InstCombine where you have a vector of pointers and only one element of that vector is used. InstCombine correctly notes that an element is not used, and then will go back through a GEP into that vector of pointers and set the unused vector elements to undef. This is fine in all cases except that the LangRef (and a ton of places in the optimizer) requires that indexing into a struct uses the same index for each element of the vector of pointers. To fix the bug, the gep/undef propagation now checks that the current type being indexed into is not a struct before doing the undef propagation.

Differential Revision: https://reviews.llvm.org/D60600
Change-Id: I6eba34b1cde9c14751c39f04627e39425639ab3b
…d non-divergent

Summary: This fixes an issue where values are uniform inside a loop but the uses of those values outside the loop can be divergent. This is a temporary fix until the library linker issues can be resolved in llvm.

Change-Id: I94d3d2e30cc2a6ae8d59e92cadf6f1b6cb7e708b
Implement a pass, enabled by -amdgpu-scratch-bounds-checking or the subtarget feature "enable-scratch-bounds-checks", which adds bounds checks to scratch accesses. When this pass is enabled, out-of-bounds writes have no effect and out-of-bounds reads return zero. This is useful for GFX9, where the hardware no longer performs bounds checking on scratch accesses, and hence out-of-bounds accesses generated by shaders result in page faults.

Change-Id: Id2ee4b1f32e70b6bde2541db755727b6a407721b
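The semantics described above can be stated as a small model. This is a hypothetical illustration of the behavior the pass enforces, not its implementation:

```python
# Hypothetical model of the checked-scratch-access semantics described
# above: out-of-bounds writes are dropped, out-of-bounds reads return 0.
def checked_store(scratch, offset, value, scratch_size):
    if 0 <= offset < scratch_size:  # the emitted bounds check
        scratch[offset] = value     # in bounds: normal store
    # out of bounds: the write has no effect

def checked_load(scratch, offset, scratch_size):
    if 0 <= offset < scratch_size:
        return scratch[offset]
    return 0                        # out of bounds: the read returns zero
```

In the real pass the comparison is against the wave's scratch allocation size and is emitted as compare/select machine instructions around each access.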
stripValuesNotDefiningMask was asserting that it did not leave an empty subrange. However, that was bogus because the subrange could have been empty to start with, in the case that LiveRangeCalc::calculate saw a subreg use that caused a subrange to be created empty.

Differential Revision: https://reviews.llvm.org/D63510
Change-Id: Ibf862415ea422198975cc7a2ca2d98531beec08d
…REL_OFFSET is 0"

This reverts commit 58b3837. That commit was causing an unnecessary reloc when accessing a global in the same section. On PAL, that breaks our use of a read-only global variable as an optimization of a local variable with constant initialization, because the PAL loader ignores the reloc.

Change-Id: I2792404227d08d98baee69504509828de7e9b36c
Implement appropriate register allocation and exec mask usage for wave32 scenarios. Change-Id: I844018f6c07fdda46366af51654017fb4b654d8a
Extended test to check wave32 and gfx10 Change-Id: I620c6d9e737ddccdfd3494abf72632d7ecc4a53f
Change-Id: I3557477344135f7dbece33a2945bcb9f4063c10f
Extend LoadStoreOptimizer to handle IMAGE_LOAD and IMAGE_SAMPLE. Change-Id: Id49c992b9781254e39e1352125f5ffb212fe4f24
Summary: The backend defaults to setting WGP mode to 0 or 1 depending on the cumode feature. For PAL clients we want PAL to set this mode, so add a check to prevent the backend from overriding whatever the front end has set.

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, tpr, t-tye, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D63639
Change-Id: Ia3bac3e562fb47089f05c11cae47d1c92f7195ee
Fix a breakage in areMemAccessesTriviallyDisjoint, caused by trying to merge a non-load MIMG instruction, IMAGE_GET_RESINFO. Only MIMG instructions that load from memory should be considered.

Change-Id: I012435534b3e6161ee7feb686d4711097d2ab980
Currently the atomic optimizer does not support wave32, and may generate DPP that is incompatible with GFX10. This change allows the atomic optimizer to be enabled unconditionally by LLPC for all ASICs.

Change-Id: Ia26b527675b6e07319f4299b3f82ad405916bad6
Summary: This fixes B42473 and B42706. This patch makes the SDA propagate branch divergence until the end of the RPO traversal. Before, the SyncDependenceAnalysis propagated divergence only until the IPD in RPO order. RPO is incompatible with post-dominance in the presence of loops; this made the SDA crash because blocks were missed in the propagation.

Reviewers: foad, nhaehnle
Subscribers: jvesely, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65274
Change-Id: Ic3fcc51e5850fde3a4649ec3dd84a040a07aaca9
Summary: Add support for gfx10, where all DPP operations are confined to work within a single row of 16 lanes, and for wave32.

Reviewers: arsenm, sheredom, critson, rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, jfb, dstuttard, tpr, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65644
Change-Id: I1f8991e9cf732e79a139e79e70a2bf62650693d4
…DD_REL_OFFSET is 0"

This reverts commit ad3ea6fe0eedc9fe48b96453cfed51f8c1dd79de.
…ord in SI_PC_ADD_REL_OFFSET is 0"

(From Jay) D61491 caused us to use relocs when they're not strictly necessary, to refer to symbols in the text section. This is a pessimization and a problem for some loaders that don't support relocs yet.

Differential Revision: https://reviews.llvm.org/D65813
Change-Id: Ia84de963ad099b7d2fcfb9cd2b908fa2e5a4788b
Change-Id: I3dddc18d5dea0fd1050633e35d0ac463a9ed26ca
This PR can be closed now.
This commit adds two new note types, NT_AMD_LLPC_CACHE_HASH and
NT_AMD_LLPC_VERSION, which will be used for note entries holding the LLPC
cache hash and the LLPC version. It also updates ELFDumper to display the
two note types clearly.
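For context, an ELF note entry is a (namesz, descsz, type) header followed by the NUL-terminated owner name and the descriptor, each padded to 4-byte alignment. The sketch below encodes a generic note in that layout; the owner name and the numeric value of the note type are assumptions for illustration, since the commit itself defines the real constants.

```python
import struct

# Hypothetical note-type value for illustration only; the commit defines
# the real NT_AMD_LLPC_CACHE_HASH / NT_AMD_LLPC_VERSION constants.
NT_AMD_LLPC_CACHE_HASH = 0x0A

def encode_elf_note(name: bytes, note_type: int, desc: bytes) -> bytes:
    """Encode one ELF note: namesz, descsz, type (32-bit words), then the
    NUL-terminated name and the descriptor, each padded to 4 bytes."""
    name_z = name + b"\x00"
    pad4 = lambda b: b + b"\x00" * (-len(b) % 4)
    header = struct.pack("<III", len(name_z), len(desc), note_type)
    return header + pad4(name_z) + pad4(desc)

# A 16-byte cache-hash descriptor under a hypothetical "AMD" owner name:
note = encode_elf_note(b"AMD", NT_AMD_LLPC_CACHE_HASH, b"\x11" * 16)
```

A dumper such as ELFDumper walks these entries and maps the (owner name, type) pair to a symbolic name, which is what the readable output added by this commit relies on.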