-
Notifications
You must be signed in to change notification settings - Fork 141
[PATCH v1] linux-gen: queue: add loop unrolling for event conversions #2211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -515,20 +515,52 @@ static inline void event_index_from_hdr(uint32_t event_index[], | |
| _odp_event_hdr_t *event_hdr[], int num) | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since |
||
| { | ||
| int i; | ||
| int idx = 0; | ||
|
|
||
| for (i = 0; i < (num & ~0x3); i += 4, idx += 4) { | ||
| event_index[i] = event_hdr[idx]->index.u32; | ||
| event_index[i+1] = event_hdr[idx+1]->index.u32; | ||
| event_index[i+2] = event_hdr[idx+2]->index.u32; | ||
| event_index[i+3] = event_hdr[idx+3]->index.u32; | ||
| } | ||
| switch (num & 0x3) { | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This looks like it is going to increase code size and add a few new branches (although I did not look at any generated assembly). It is not clear to me that the change is a net positive in actual applications and not just in microbenchmarks that presumably always enqueue long bursts. I believe some ODP apps have high instruction cache pressure. Another possible optimization might be to get rid of the whole header-pointer-to-index conversion and just store the pointers in rings (but that would then use more cache lines in the rings, so I do not know if that is good or not).
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, cache pressure is something we need to understand. Is there any benchmarking application (not a micro benchmark like we used here) that can be executed to confirm the behaviour? |
||
| case 3: | ||
| event_index[i++] = event_hdr[idx++]->index.u32; | ||
| __attribute__((fallthrough)); | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Project convention has been to use simply |
||
| case 2: | ||
| event_index[i++] = event_hdr[idx++]->index.u32; | ||
| __attribute__((fallthrough)); | ||
| case 1: | ||
| event_index[i++] = event_hdr[idx++]->index.u32; | ||
| } | ||
|
|
||
| for (i = 0; i < num; i++) | ||
| event_index[i] = event_hdr[i]->index.u32; | ||
| } | ||
|
|
||
| static inline void event_index_to_hdr(_odp_event_hdr_t *event_hdr[], | ||
| uint32_t event_index[], int num) | ||
| { | ||
| int i; | ||
| int idx = 0; | ||
|
|
||
| for (i = 0; i < num; i++) { | ||
| event_hdr[i] = _odp_event_hdr_from_index_u32(event_index[i]); | ||
| odp_prefetch(event_hdr[i]); | ||
| for (i = 0; i < (num & ~0x3); idx += 4, i += 4) { | ||
| event_hdr[i] = _odp_event_hdr_from_index_u32(event_index[idx]); | ||
| event_hdr[i+1] = _odp_event_hdr_from_index_u32(event_index[idx+1]); | ||
| event_hdr[i+2] = _odp_event_hdr_from_index_u32(event_index[idx+2]); | ||
| event_hdr[i+3] = _odp_event_hdr_from_index_u32(event_index[idx+3]); | ||
| } | ||
|
|
||
| switch (num & 0x3) { | ||
| case 3: | ||
| event_hdr[i++] = _odp_event_hdr_from_index_u32(event_index[idx++]); | ||
|
|
||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Unnecessary newline. |
||
| __attribute__((fallthrough)); | ||
| case 2: | ||
| event_hdr[i++] = _odp_event_hdr_from_index_u32(event_index[idx++]); | ||
| __attribute__((fallthrough)); | ||
| case 1: | ||
| event_hdr[i++] = _odp_event_hdr_from_index_u32(event_index[idx++]); | ||
| } | ||
|
|
||
| } | ||
|
|
||
| static inline int _plain_queue_enq_multi(odp_queue_t handle, | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While testing this I noticed that now we could have separate optimized implementations for single and multi event enq/deq functions (e.g.
odp_queue_enq(),odp_queue_enq_multi()). Currently, the single event variants e.g.plain_queue_enq()) share the same code and I measured some performance penalty from these changes.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this makes sense.