
[SYCL][Driver] Enable time tracing capability for SYCL applications. #21207

Merged
uditagarwal97 merged 38 commits into intel:sycl from
srividya-sundaram:enable-time-trace-sycl
Mar 27, 2026

Conversation

@srividya-sundaram
Contributor

@srividya-sundaram srividya-sundaram commented Feb 3, 2026

This PR implements -ftime-trace support for SYCL offloading compilation, enabling trace generation for both host and device compilation phases. The implementation handles various compilation modes and ensures trace files are generated with clear, predictable naming conventions.

@srividya-sundaram srividya-sundaram marked this pull request as ready for review February 3, 2026 23:22
@srividya-sundaram srividya-sundaram requested a review from a team as a code owner February 3, 2026 23:22
Comment thread clang/test/Driver/sycl-time-trace.cpp Outdated
@Maetveis
Contributor

Maetveis commented Feb 9, 2026

Hi @srividya-sundaram, this is an area I've explored previously, and I remember that -ftime-trace has usability problems when it comes to offloading. At that time I was looking at CUDA offloading, but I think they also apply to SYCL.

Have you checked the comments at https://reviews.llvm.org/D150282 and https://reviews.llvm.org/D133662, and the GitHub issue llvm/llvm-project#55455?

It would be ideal to resolve this in upstream clang, and do so for all offloading models, not just SYCL.

@srividya-sundaram
Contributor Author

> Hi @srividya-sundaram, this is an area I've explored previously, and I remember that -ftime-trace has usability problems when it comes to offloading. At that time I was looking at CUDA offloading, but I think they also apply to SYCL.
>
> Have you checked the comments at https://reviews.llvm.org/D150282 and https://reviews.llvm.org/D133662, and the GitHub issue llvm/llvm-project#55455?
>
> It would be ideal to resolve this in upstream clang, and do so for all offloading models, not just SYCL.

Hi @Maetveis
Thanks for pointing these out.
I’ve gone through the referenced reviews briefly but haven't studied the comments in depth. I will go through them as well as the linked issue.

> I remember that -ftime-trace has usability problems when it comes to offloading.

Could you please share the usability problems you encountered?

Some questions I have are:
When compilation and linkage are done in one compiler driver invocation, which actions should produce traces?
My understanding is that both the SYCL host compilation and the SYCL device compilation invocations should produce trace files (and not in the /tmp folder).

@Maetveis
Contributor

Maetveis commented Feb 10, 2026

> Could you please share the usability problems you encountered?

Sure :). This was a while ago, and at the time for a different toolchain (AMD's HIP) but I think they mostly still apply.

  1. The fact that there are multiple traces for compiling a single offloading source file goes counter to tooling and user expectations.

    • If a user does clang++ -ftime-trace source.cpp -o source.o, clang also produces source.json. IMO it is reasonable to expect that clang++ -fsycl -fsycl-targets=A,B -ftime-trace sycl-source.cpp -o sycl-source.o should produce a single sycl-source.json, and for this file to contain the traces for the host and all device compilation steps.
    • Tools used for visualizing / understanding build times, like ninjatracing, assume a model like this. I think it's more fruitful to try to match the non-offload behavior in the compiler rather than require tooling to adapt to the niche use-case that is offloading.
  2. The traces for device side compiles have very poor discoverability:

    • They are written to somewhere in /tmp, but the only way to know that is by looking at -v output
    • For some compilations they are not written to /tmp, but rather use the same file-name as the host-trace and are therefore overwritten by it.

To frame these a bit more, I think it's useful to think about the following use-cases:

Use-case A:

As a developer of the library libFoo which uses an offloading API for (some of) its sources, I want to analyze the overall build-time and look for "hot-spots" where I can reduce it the most. In order to do this, I use tools like ninjatracing and pass -ftime-trace to all compilations through the build-system. This gives me an overview of which files take the longest to compile, and also gives me the ability to dig down into details to spot patterns like a particular template instantiation that slows down multiple compilation units.

Use-case B:

I have identified that the file A.cpp takes a long time to compile for offload target foo. In order to understand why, I pass the -ftime-trace option to clang invoked from the command line.

The second case is already reasonably well served by what clang can do for -ftime-trace with offloading if we simply "re-enable" it. Yes, it's annoying to look for files in /tmp, but it's not the end of the world for a single file.

The first case basically breaks down: the level of detail is reduced to the object-file level instead of the fine-grained view we would have without offloading. We don't get any information about which step of the combined offload "compilation" took longest.
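
As an illustration of the kind of analysis Use-case A relies on, here is a minimal sketch that sums event durations across several trace files to surface recurring hot spots. It assumes each file follows the Chrome Trace Event layout (a top-level `traceEvents` list whose complete events have `ph` set to `"X"` and a `dur` field) and that aggregate events are named with a `Total ` prefix; the function name `hottest_events` is hypothetical and not part of any tool mentioned in this thread.

```python
import json
from collections import defaultdict


def hottest_events(trace_paths, top=5):
    """Sum the duration of complete ("X") events per (name, detail) across traces."""
    totals = defaultdict(int)
    for path in trace_paths:
        with open(path) as f:
            events = json.load(f)["traceEvents"]
        for ev in events:
            if ev.get("ph") != "X":
                continue  # skip metadata and other non-duration events
            if ev["name"].startswith("Total "):
                continue  # assumed to be pre-aggregated summaries; avoid double counting
            detail = ev.get("args", {}).get("detail", "")
            totals[(ev["name"], detail)] += ev["dur"]
    # Largest accumulated duration first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Durations of nested events are inclusive, so the totals are best read as a ranking rather than as a breakdown that adds up to the wall-clock time. With only one trace per object file, the device-side events never reach such a summary, which is the loss of detail described above.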

> When compilation and linkage are done in one compiler driver invocation, which actions should produce traces?
> My understanding is that both the SYCL host compilation and the SYCL device compilation invocations should produce trace files (and not in the /tmp folder).

In an ideal world, in my opinion, there should be just one trace that includes every step: host and device compilation, and linking too, assuming the linker is capable of producing compatible traces.
Combined compilation and linking isn't common though, because most projects of moderate complexity will use a build-system, therefore -ftime-trace doesn't accommodate it (even for non-offload compilations). I think this is fine, advanced users who are looking at build time should be able to break down clang++ foo.cpp into clang++ -ftime-trace -c foo.cpp and time clang++ foo.o.

@srividya-sundaram
Contributor Author

srividya-sundaram commented Feb 10, 2026

Thank you for the detailed explanation. The example use cases are super helpful for understanding your POV.
I have some follow-up questions and observations:

> If a user does clang++ -ftime-trace source.cpp -o source.o, clang also produces source.json. IMO it is reasonable to expect that clang++ -fsycl -fsycl-targets=A,B -ftime-trace sycl-source.cpp -o sycl-source.o should produce a single sycl-source.json, and for this file to contain the traces for the host and all device compilation steps.

For a regular, non-offload compilation such as clang++ -ftime-trace source.cpp -o source.o, we currently produce a trace corresponding to a single clang -cc1 invocation.

When -ftime-trace is enabled for offload compilations, Clang could generate one time-trace JSON file per compiler invocation (one JSON file for the host compilation and one JSON file for each device compilation).

This design aligns with the existing semantics of -ftime-trace, which today produces a trace corresponding to a single clang -cc1 invocation.

Also, with my current patch, I was able to generate the SYCL host compilation trace file and the SYCL device compilation trace file separately, and the two traces appear to be quite different: different targets, different passes, different toolchains/backends. Combining the SYCL host and device JSON files into a single one might need namespacing everywhere and proper grouping of the events, passes, etc. It also seems to blur the host/device invocation boundaries.

I used chrome://tracing/ to load the generated SYCL host and device JSON files, and I believe it expects a single timeline from a single process (clang-22). I am not sure the tool would adapt if we were to emit one big JSON file for the host and all the device compilations. As the device count grows, the single JSON file will also grow. Please see the attached image (top right, processes tab).

[Attached image: sycl-device-json]

Given these observations, I was wondering if we could instead generate descriptive JSON filenames in user-discoverable directories, as you have mentioned.
Example:

clang++ -fsycl -fsycl-targets=spir64,nvptx64-nvidia-cuda -ftime-trace source.cpp -o source.o
source.json                (host)
source-sycl-spir64.json    (device)
source-sycl-nvptx64.json   (device)

WDYT?
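
The naming scheme proposed in the example above can be sketched as a small function. This only mirrors the three filenames listed in this comment and is not taken from the merged driver code. Two things are my assumptions: the base name comes from the stem of the `-o` output, and the device suffix is the first component of the target triple. The function name `proposed_trace_names` is hypothetical.

```python
from pathlib import PurePath


def proposed_trace_names(output, sycl_targets):
    """Return the host trace name followed by one trace name per SYCL target."""
    stem = PurePath(output).stem  # "source.o" -> "source"
    names = [f"{stem}.json"]  # host trace keeps the non-offload name
    for triple in sycl_targets:
        arch = triple.split("-", 1)[0]  # "nvptx64-nvidia-cuda" -> "nvptx64"
        names.append(f"{stem}-sycl-{arch}.json")
    return names


print(proposed_trace_names("source.o", ["spir64", "nvptx64-nvidia-cuda"]))
# ['source.json', 'source-sycl-spir64.json', 'source-sycl-nvptx64.json']
```

One consequence of keeping only the architecture component is that two targets sharing an architecture would map to the same filename, so a real implementation would need a tie-breaker.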

@Maetveis
Contributor

> For a regular, non-offload compilation such as clang++ -ftime-trace source.cpp -o source.o, we currently produce a trace corresponding to a single clang -cc1 invocation.
>
> When -ftime-trace is enabled for offload compilations, Clang could generate one time-trace JSON file per compiler invocation (one JSON file for the host compilation and one JSON file for each device compilation).
>
> This design aligns with the existing semantics of -ftime-trace, which today produces a trace corresponding to a single clang -cc1 invocation.

I don't think that was an intentional design choice for -ftime-trace, probably offloading simply wasn't a consideration back then. Allow me to ask this slightly provocative question: clang++ -fsycl source.cpp also produces a single object file: downstream tools like the linker are shielded from the complexity of multiple compiler invocations by the offloading toolchain. What's the difference between object files and compile-time traces that justifies the difference in behaviour?

> Also, with my current patch, I was able to generate the SYCL host compilation trace file and the SYCL device compilation trace file separately, and the two traces appear to be quite different: different targets, different passes, different toolchains/backends. Combining the SYCL host and device JSON files into a single one might need namespacing everywhere and proper grouping of the events, passes, etc. It also seems to blur the host/device invocation boundaries.

There are already separate high-level categories in the traces, like "Frontend" and "Backend"; I don't see why an additional level of "Offload Host", "Offload Device (nvptx)", etc. couldn't be added.

> I used chrome://tracing/ to load the generated SYCL host and device JSON files, and I believe it expects a single timeline from a single process (clang-22). I am not sure the tool would adapt if we were to emit one big JSON file for the host and all the device compilations.
> As the device count grows, the single JSON file will also grow. Please see the attached image (top right, processes tab).

Perfetto is the successor of the chrome-tracing visualizer; it supports binary traces (much smaller sizes) and is designed with multi-process traces in mind.

> Given these observations, I was wondering if we could instead generate descriptive JSON filenames in user-discoverable directories, as you have mentioned.
> WDYT?

I think your suggestion improves the status quo for at least the simpler use case, so SGTM. I understand that implementing a single trace is significantly more work, and there might not be a big enough motivation to do that.
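
Short of a single trace produced by the compiler, separate per-invocation files can be combined after the fact. The sketch below is one way to do that, assuming the Chrome Trace Event JSON layout (a top-level `traceEvents` list, with `process_name` metadata events naming each process row). It gives every input file its own `pid` and a label such as "Offload Host". The function name `merge_traces` and the labels are hypothetical, and nothing here reflects what the patch itself does.

```python
import json


def merge_traces(labelled_paths):
    """Combine several time-trace files into one multi-process trace.

    labelled_paths: list of (label, path) pairs,
    e.g. ("Offload Host", "source.json").
    """
    merged = []
    for pid, (label, path) in enumerate(labelled_paths, start=1):
        with open(path) as f:
            events = json.load(f)["traceEvents"]
        # Metadata event that names this file's process row in the viewer.
        merged.append({"ph": "M", "name": "process_name", "pid": pid,
                       "tid": 0, "args": {"name": label}})
        for ev in events:
            if ev.get("ph") == "M" and ev.get("name") == "process_name":
                continue  # superseded by the label added above
            merged.append({**ev, "pid": pid})  # copy the event, overriding its pid
    return {"traceEvents": merged}
```

The sketch leaves timestamps untouched, so each row starts from its own file's origin rather than being aligned on a shared clock. Aligning them would need a common time base across the files, which is not attempted here.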

@mdtoguchi
Contributor

For short-term usability, having separate traces for each compilation (host/targetA/targetB) with unique file names sounds reasonable to me. The idea of a single time-trace file with all targets embedded when offloading is enabled does make sense, since from a general user's perspective there is one binary generated, at least when generating an object. This of course goes beyond the scope of just modifying the driver.

Documentation should be updated in the SYCL space to show generated file expectations.

@srividya-sundaram srividya-sundaram marked this pull request as draft February 21, 2026 01:17
Comment thread clang/lib/Driver/Driver.cpp
Comment thread clang/lib/Driver/ToolChains/SYCL.cpp Outdated
Comment thread sycl-jit/jit-compiler/lib/rtc/RTC.cpp
@srividya-sundaram srividya-sundaram marked this pull request as ready for review March 18, 2026 22:53
@srividya-sundaram srividya-sundaram requested review from a team and cperkinsintel as code owners March 18, 2026 22:53
@srividya-sundaram
Contributor Author

The SYCL pre-commit failures are unrelated to this patch.

Comment thread sycl/test-e2e/KernelCompiler/sycl_time_trace.cpp
Comment thread clang/test/Driver/sycl-time-trace.cpp
Contributor

@jopperm jopperm left a comment

RTC-related changes LGTM.

Add test with actual sycl compilation
Add COW test case.
Comment thread clang/test/Driver/sycl-time-trace-actual.cpp
Comment thread clang/lib/Driver/Driver.cpp Outdated
Comment thread clang/lib/Driver/ToolChains/SYCL.cpp Outdated
Comment thread clang/lib/Driver/Driver.cpp
Comment thread clang/test/Driver/sycl-time-trace-actual.cpp
@github-actions
Contributor

@intel/llvm-gatekeepers please consider merging

@uditagarwal97 uditagarwal97 merged commit 46f2ff5 into intel:sycl Mar 27, 2026
31 checks passed
7 participants