This repository was archived by the owner on Sep 15, 2025. It is now read-only.

Conversation


@jaebaek jaebaek commented Jan 21, 2021

This commit adds two new note types, NT_AMD_LLPC_CACHE_HASH and
NT_AMD_LLPC_VERSION, to be used for note entries carrying the LLPC cache
hash and LLPC version. It also updates ELFDumper to display these two
note types clearly.

dstutt and others added 30 commits October 10, 2019 14:25
This pass attempts to merge s_buffer_load_dword instructions into larger
sizes.
The constraints are that the resource descriptor, glc flag, and size must all
be the same for contiguous load instructions.
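
The constraint above can be sketched in Python. This is a hypothetical model of the merge strategy, not the actual pass; the tuple encoding of a load and the power-of-two result sizes are assumptions for illustration:

```python
from itertools import groupby

# Hypothetical sketch: single-dword loads are mergeable when they share
# a resource descriptor and glc flag and their offsets are contiguous;
# each contiguous run is emitted as the largest power-of-two-sized loads.
def merge_sbuffer_loads(loads):
    """loads: iterable of (desc, glc, offset) single-dword loads.
    Returns a list of merged (desc, glc, offset, size_in_dwords)."""
    out = []
    for (desc, glc), grp in groupby(sorted(loads), key=lambda l: (l[0], l[1])):
        offsets = [o for _, _, o in grp]
        runs, run = [], [offsets[0]]
        for o in offsets[1:]:
            if o == run[-1] + 1:        # contiguous: extend the run
                run.append(o)
            else:
                runs.append(run)
                run = [o]
        runs.append(run)
        for r in runs:
            i = 0
            while i < len(r):
                size = next(s for s in (16, 8, 4, 2, 1) if len(r) - i >= s)
                out.append((desc, glc, r[i], size))
                i += size
    return out
```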

TODO: it should be possible to extend this to other instructions as well, such
as s_buffer_store and the vector variants buffer_load/buffer_store.

V2: Fixes to sbuff-merge test
V3: Fixed smrd test for smem merging
V4: Fixed for gfx10 changes

Change-Id: Ie98107ad3b27b3c8c2da7afc3d129336f924895a
Summary:
Added intrinsics for the instructions:
- buffer_load_ubyte
- buffer_load_ushort
- buffer_store_byte
- buffer_store_short

Added test cases to existing buffer load/store tests.

Now that upstream supports byte/short overloads of the normal load/store
intrinsics, this change is only needed until LLPC stops using the
intrinsics added here.

Change-Id: I1b8a0910c508c9520a84b74a14d0aea8293cef38
Vulkan exposed an issue with this for a case with v_mad_mixlo_f16 where the
upper 16 bits were not cleared.

Modifying this to clear the bits instead of just copying fixed the problem.

V2: Fixed up "Fix issue for zext of f16 to i32"
V3: Fixed fcanonicalize-elimination test

Change-Id: I6128deaf8ebc3489fdd3e0caead837410bb47160
Readlane should have a uniform index - but in some cases it can be non-uniform.
The original implementation of readlane defaulted to using a readfirstlane of
the vgpr index on the assumption that the index would be uniform across all
lanes.

However, there are some cases where we might want a readlane-like operation
that is lowered to a waterfall loop performing a readlane for each distinct
index value across lanes.

Essentially we form a loop: use readfirstlane to get the first enabled lane's
index, enable all lanes with the same index, perform a readlane and put the
result into the vgpr result register, then disable those lanes and repeat.

The pathological case will loop 64 times (once per lane) - but in most cases it
will be far fewer.
In the case where the index IS uniform across all lanes, it will loop once,
which is slightly more expensive than the original code, but not a lot more.
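
The loop described above can be modelled as a small Python sketch (not compiler code; the lane-set and register encodings are assumptions for illustration):

```python
def waterfall_readlane(indices, vgpr, exec_mask):
    """indices[l]: per-lane index; vgpr[i]: value held in lane i of the
    source register; exec_mask: set of active lane ids.
    Returns (per-lane results, iteration count)."""
    result = {}
    active = set(exec_mask)
    iterations = 0
    while active:
        first = min(active)          # readfirstlane on the index vgpr
        idx = indices[first]
        group = {l for l in active if indices[l] == idx}
        for l in group:
            result[l] = vgpr[idx]    # one readlane serves the whole group
        active -= group              # disable the serviced lanes
        iterations += 1
    return result, iterations
```

When the index is uniform across all lanes, the loop body runs once, matching the cost note above.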

V2: Initialize accumulator register in waterfall code
V3: Fixed for gfx10 changes

Change-Id: Ic52dfb14b053db7713b61c85e8ac766991aecdb8
Even though writelane doesn't have the same constraints as other VALU
instructions, it still can't use more than one SGPR operand.

Due to later register propagation (e.g. fixing up vgpr operands via
readfirstlane) changing writelane to only have a single SGPR is tricky.

This implementation puts a new check after SIFixSGPRCopies that prevents
multiple SGPRs being used in any writelane instructions.

The algorithm used is to check for trivial copy prop of constants into one of
the SGPR operands and perform that if possible. If this isn't possible put an
explicit copy of Src1 SGPR into M0 and use that instead (this is allowable for
writelane as the constraint is for SGPR read-port and not constant-bus access).

V2: Fix-up cases where writelane has 2 SGPR operands - bug fixes

Update to previous commit to fix an issue where immediates are tested for copy
propagation but aren't suitable as inline constants. The old method caused
issues that resulted in seg faults in later cse phases.
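
The algorithm can be sketched as follows. This is a hypothetical Python model, not the pass itself; the operand encoding, the known_consts map, and the inline-constant range are assumptions for illustration:

```python
INLINE_CONST_RANGE = range(-16, 65)  # assumed range of foldable immediates

def legalize_writelane(src0, src1, known_consts):
    """Operands are ("sgpr", name) or ("imm", value); known_consts maps
    an SGPR name to a constant when its def is a trivial constant copy.
    Returns (new_src0, new_src1, extra_copies)."""
    ops = [src0, src1]
    sgpr_idxs = [i for i, (kind, _) in enumerate(ops) if kind == "sgpr"]
    if len(sgpr_idxs) < 2:
        return ops[0], ops[1], []        # at most one SGPR: already legal
    for i in sgpr_idxs:                  # try trivial constant copy-prop,
        c = known_consts.get(ops[i][1])  # but only for valid inline consts
        if c is not None and c in INLINE_CONST_RANGE:
            ops[i] = ("imm", c)
            return ops[0], ops[1], []
    # otherwise route src1 through m0: legal because writelane's limit is
    # the SGPR read port, not constant-bus access
    copy = ("s_mov_b32", "m0", ops[1])
    return ops[0], ("sgpr", "m0"), [copy]
```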

Change-Id: Ic3045df22db738ca6572af00138520fd28563990
Implements three new intrinsics for marking regions of code that should execute
as a waterfall loop.

Waterfall is implemented as a loop that iterates over an index of values for
active lanes. For each iteration an index is picked (the first active lane) and
all lanes with the same index are left active (the rest are disabled).
The body of the code is then executed and the result accumulated into a result
vector register.
The active lanes are then disabled and the next index is chosen for the next
iteration.

The worst case for waterfall is one iteration per lane - but it is usually far
less than this.

The implementation uses 3 intrinsics to mark a region:
  - llvm.amdgcn.waterfall.begin
  - llvm.amdgcn.waterfall.readfirstlane
  - llvm.amdgcn.waterfall.end

The group must contain one begin intrinsic and at least one each of the
readfirstlane and end intrinsics - if the readfirstlane uses the begin index
directly, there will be only a single readfirstlane rather than two.

The readfirstlane and end intrinsics are not limited to single dword values.

The waterfall loop will enclose all instructions from the begin to the final end
intrinsic.

See the test case for specific examples of use.

V2:
[AMDGPU] Updates to waterfall intrinsic support

Some fixes for waterfall support.
1. Intrinsics are better defined, so they enforce type correctness
2. Creation of the waterfall loop requires removing kill flags for operands
moved into the loop; otherwise this can cause an assert during verification
3. Fixed issue causing a tablegen failure due to float return type being used
for a scalar return value (incorrectly). The matching for intrinsic to pseudo
instruction now uses 2 template parameter types to disambiguate when using float
src types.
4. Ensure that any analyses are invalidated due to insertion of new basic
blocks. This should have been the default anyway.
5. Updated test in light of these changes. These are good exemplars for
implementors to check against.

V3:
[AMDGCN] Fixed waterfall intrinsic groups for multiple groups per BB

For cases with more than one waterfall group per basic block, the implementation
would go wrong after creating the first waterfall loop (incorrectly tracking
current BB).

Also added a test case.

V4:
[AMDGCN] Waterfall enhancements

Extra waterfall test for multiple readfirstlane intrinsics. Make sure that the
waterfall intrinsic clauses support multiple readfirstlane intrinsics.

Extended support for begin to allow multi-dword indices for the waterfall
loop. This allows the use of multiple indices (and multiple non-uniform values)
in the same waterfall loop by combining the individual indices into a single
multi-dword index.
This has the same worst case of 64 iterations (wave size) as before.

Added a new test case to demonstrate using multi-dword indices.

Added support for i16/f16

Changed the implementation to prevent CSE from merging clauses (by tagging the
intrinsics as having side effects and enhancing uniform detection to work in
this case as well).
Left in some support for partial EarlyCSE merging, as it is still valid but
probably won't be triggered much now that merging is disabled.

Added support for a last.use intrinsic. See the test for an example of
this. This works in the same way as the end intrinsic, but you tag a last.use
instead (so it includes the next use in the waterfall loop). More than one use
is acceptable as long as it is in the same BB. All instructions up to the final
last.use will be included in the waterfall loop.

An important rule for waterfall clauses (one that is not detected if violated)
is that nothing inside the loop may depend on an end intrinsic in the same
loop - this restriction is logically consistent; use two loops instead.

V5:
[AMDGPU] Handle unexpected uniform inputs in waterfall intrinsics

Waterfall intrinsic input operands can sometimes be uniform. The waterfall
handling pass should deal with this gracefully, even though this could be
sub-optimal.

This is an initial fix for a problem observed in a game.
A better fix is being worked on, which should produce optimal code (by removing
redundant waterfall loops entirely if all the elements turn out to be uniform
anyway).

V6:
[AMDGPU] Improved handling of uniform waterfalls

This builds on a previous fix for uniform values being specified in waterfall
loop intrinsic groups.
The previous fix would simply remove any waterfall.readfirstlane intrinsic
whose input was uniform.

This fix improves on this by spotting when ALL waterfall.readfirstlanes are
uniform and removing the waterfall loop entirely.

The tests have been extended to test this mode.

The update also handles the case where not all waterfall.readfirstlanes are
removed. This is also covered in the associated tests.

V7:
[AMDGCN] Re-order waterfall and wholequadmode passes

Running the WholeQuadMode pass after waterfall can have bad consequences.
Changing the order improves the outcome, but there may still be problems in
some unusual cases.
For the known uses of the waterfall intrinsics, this change will work for now.

V8:
waterfall test fixup

Change-Id: I95b4391b8c0570bd399d70ed64af355d13ef4c84
Summary:
The test shows a case where a full use is not in a subrange because the
subreg is undefined at the use point, but, after a coalesce, the
subrange incorrectly still does not include the use even though the
subreg is now defined at that point. The incorrect live range causes a
"subrange join unreachable" assert on a later coalesce.

This commit ensures that a subrange is extended to all uses that can be
reached from a definition.

V2: Completely different fix, instead of working around the problem in
the later coalesce that actually asserted.
V5: Fixed to not extend liveness to an undef use. Spun second test out
into its own D51257 as it is a completely different problem with the
same "subrange join unreachable" symptom.
V6: Ignore debug uses.
V7: Set up undefs correctly, to avoid generating invalid subranges.

Differential Revision: https://reviews.llvm.org/D49097

Change-Id: I172cfe16d360690e921ebe606d3e90dd1cdd1b71
eliminateUndefCopy was incorrectly eliminating a copy that was only a
partial write of the target. In the test case, the resulting incorrect
live range caused a subrange join unreachable later.

The test case does not fail for me until I have D49097.

Differential Revision: https://reviews.llvm.org/D51257

Change-Id: Ibb767d8a016934431660764b0194634e2113fea2
Summary:
findReachingDefs was incorrectly using its fast path of just blitting in
the live range extension when it did not see an undef itself but
encountered an undef live out of a predecessor block.

Now RegisterCoalescing calls extendToIndices (D49097), the bug was
causing an incorrect subrange which led to a "Couldn't join subrange!"
assert in a later coalesce.

Subscribers: MatzeB, jvesely, nhaehnle, llvm-commits

Differential Revision: https://reviews.llvm.org/D51574

Change-Id: I01c942b41ab5a145348289c8b221c45a2ec7200f
…vePartialRedundancy

Summary:
removePartialRedundancy was using extendToIndices on each subrange
without passing in an Undefs vector containing main reg defs that are
undef in the subrange, causing the above assert.

Unfortunately I can only reproduce this in an ll test. Turning it into a
mir test makes the problem go away.

Subscribers: MatzeB, qcolombet, jvesely, nhaehnle, llvm-commits

Differential Revision: https://reviews.llvm.org/D51849

Change-Id: I4f3ba2afb20d79f9467afb3d882f0328d38531be
…wCopyChain

valuesIdentical is called to determine whether a def can be considered
an "identical erase" because following copy chains proves that the value
being copied is the same value as the value of the other register.

The tests show up problems where the main range decides that a def is
an "identical erase" but a subrange disagrees, because following the
copy chain leads to a def that does not define all subregs in the
subrange (in the case of one test) or a different def to that found in
the main range (in the case of the other test).

The fix here is to only detect an "identical erase" in the main range.
If it is found, it is automatically inherited by the corresponding def
in any subrange, without further analysis of that subrange value.

This fix supersedes D49535/rL338070 and D51848.

Change-Id: I059b5b2273ed6f186d134b98c118fef0494e02f8
This is a temporary fix to avoid a breakage where the export(s) for a
pixel shader are in control flow (due to a kill). A pixel shader must
execute an export done, even with exec=0.

A better fix would be to have a pixel-shader-only pass that inserts a
null export done in the final basic block if there is not one already
there.

Change-Id: I22f201f95d52ff699aa8cac26bb019717b20432e
LLPC has .ll file library functions that break this verifier check. We
need to fix those before re-enabling the check.

Change-Id: I4c000793a805a50f1d91d1ae55c567a356e72519
This commit fixes a really fun bug in InstCombine where you have a
vector-of-pointers and only one element of the vector is used.
InstCombine correctly notes that an element is not used, and then will
go back through a gep into that vector-of-pointers and set all vector
elements to undef. This is fine in all cases except that the langref
(and a ton of places in the optimizer) requires that indexing into a
struct has the same index for each element of the vector-of-pointers.

To fix the bug I've just made the gep/undef propagation check that the
current type being indexed into is not a struct when doing the undef
propagation.

Differential Revision: https://reviews.llvm.org/D60600

Change-Id: I6eba34b1cde9c14751c39f04627e39425639ab3b
…d non-divergent

Summary:
This fixes an issue where values are uniform inside of a loop but
the uses of that value outside the loop can be divergent.

This is a temporary fix until the library linker issues can be resolved
in LLVM.

Change-Id: I94d3d2e30cc2a6ae8d59e92cadf6f1b6cb7e708b
Implement a pass, enabled by -amdgpu-scratch-bounds-checking
or the subtarget feature "enable-scratch-bounds-checks",
which adds bounds checks to scratch accesses.
When this pass is enabled, out-of-bounds writes have no effect
and out-of-bounds reads return zero.
This is useful for GFX9, where the hardware no longer performs
bounds checking on scratch accesses, so out-of-bounds accesses
generated by shaders result in page faults.
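
The checked-access semantics can be modelled minimally (a sketch only, with scratch as a plain Python list; this is not the pass itself):

```python
def checked_scratch_store(scratch, offset, value):
    if 0 <= offset < len(scratch):
        scratch[offset] = value          # in-bounds: normal store
    # out-of-bounds writes are silently dropped

def checked_scratch_load(scratch, offset):
    if 0 <= offset < len(scratch):
        return scratch[offset]
    return 0                             # out-of-bounds reads return zero
```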

Change-Id: Id2ee4b1f32e70b6bde2541db755727b6a407721b
stripValuesNotDefiningMask was asserting that it did not leave an empty
subrange. However that was bogus because the subrange could have been
empty to start with, in the case that LiveRangeCalc::calculate saw a
subreg use that caused a subrange to be created empty.

Differential Revision: https://reviews.llvm.org/D63510

Change-Id: Ibf862415ea422198975cc7a2ca2d98531beec08d
…REL_OFFSET is 0"

This reverts commit 58b3837.

That commit was causing an unnecessary reloc when accessing a global in
the same section. On PAL, that breaks our use of a read-only global
variable as an optimization of a local variable with constant
initialization, because the PAL loader ignores the reloc.

Change-Id: I2792404227d08d98baee69504509828de7e9b36c
Implement appropriate register allocation and exec mask usage
for wave32 scenarios.

Change-Id: I844018f6c07fdda46366af51654017fb4b654d8a
Extended test to check wave32 and gfx10

Change-Id: I620c6d9e737ddccdfd3494abf72632d7ecc4a53f
Change-Id: I3557477344135f7dbece33a2945bcb9f4063c10f
Extend LoadStoreOptimizer to handle IMAGE_LOAD and IMAGE_SAMPLE.

Change-Id: Id49c992b9781254e39e1352125f5ffb212fe4f24
Summary:
The backend defaults to setting WGP mode to 0 or 1 depending on the cumode
feature. For PAL clients we want PAL to set this mode.
Add a check to prevent the backend from overriding whatever the front end has set.

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, tpr, t-tye, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D63639

Change-Id: Ia3bac3e562fb47089f05c11cae47d1c92f7195ee
Fix a breakage in areMemAccessesTriviallyDisjoint, caused
by trying to merge a non-load MIMG instruction, IMAGE_GET_RESINFO.

Only MIMG instructions that load from memory should be considered.

Change-Id: I012435534b3e6161ee7feb686d4711097d2ab980
Currently the atomic optimizer does not support wave32,
and may generate DPP sequences that are incompatible with GFX10.
This change allows the atomic optimizer to be enabled
unconditionally by LLPC for all ASICs.

Change-Id: Ia26b527675b6e07319f4299b3f82ad405916bad6
Summary:
This fixes B42473 and B42706.

This patch makes the SDA propagate branch divergence until the end of the RPO traversal. Before, SyncDependenceAnalysis propagated divergence only until the IPD in RPO order. RPO is incompatible with post-dominance in the presence of loops, which made the SDA crash because blocks were missed in the propagation.

Reviewers: foad, nhaehnle

Subscribers: jvesely, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D65274

Change-Id: Ic3fcc51e5850fde3a4649ec3dd84a040a07aaca9
Summary:
Add support for gfx10, where all DPP operations are confined to work
within a single row of 16 lanes, and wave32.
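
A toy model of the row restriction, assuming an inclusive-add scan for illustration: each row of 16 lanes computes independently, since gfx10 DPP cannot move data across row boundaries.

```python
def row_inclusive_scan_add(lane_values, row_size=16):
    # inclusive prefix sum computed independently per row of lanes,
    # modelling DPP operations confined to a single row
    out = []
    for base in range(0, len(lane_values), row_size):
        acc = 0
        for v in lane_values[base:base + row_size]:
            acc += v
            out.append(acc)
    return out
```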

Reviewers: arsenm, sheredom, critson, rampitec

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, jfb, dstuttard, tpr, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D65644

Change-Id: I1f8991e9cf732e79a139e79e70a2bf62650693d4
…DD_REL_OFFSET is 0"

This reverts commit ad3ea6fe0eedc9fe48b96453cfed51f8c1dd79de.
…ord in SI_PC_ADD_REL_OFFSET is 0"

(From Jay)

D61491 caused us to use relocs when they're not strictly necessary, to
refer to symbols in the text section. This is a pessimization and it's a
problem for some loaders that don't support relocs yet.

Differential Revision: https://reviews.llvm.org/D65813

Change-Id: Ia84de963ad099b7d2fcfb9cd2b908fa2e5a4788b
Change-Id: I3dddc18d5dea0fd1050633e35d0ac463a9ed26ca
@s-perron

This PR can be closed now.

