yakra commented Feb 1, 2026

These graph generation speedups finally open up the FreeBSD bottleneck, bringing its performance in line with Linux.
Noreaster is gonna love this.

What's going on under the hood is discussed in separate posts for each of the 3 commits below.
Here are the charts for how performance is affected by all 3 commits in this pull request combined.

4-f2f9
4-f2f9-log

The second image is the same thing on a logarithmic scale.
This can help more easily differentiate the before & after lines and the different machines, with everything less squished together in the lower left corner.
It also helps more easily visualize where the sweet spot is for each machine, before efficiency starts decreasing.

Benchmarks are performed using a RAM disk now.
For those who remember the #626 fiasco (which led to adopting {fmt} in the first place), this avoids both the inconsistently slow times that can result from writing to disk and the falsely fast times recorded when writing to /dev/null.

Never thought I'd see this, but we're just shy of breaking the 1 second barrier.
For graph generation proper (not counting the now-separate "Formatting vertex coordinate strings" task), lab6 averages 1.0707 s @ 7 threads. Individual passes have taken as little as 1.0191 s (6 threads in that case).
I have a few more tweaks in the pipeline long-term that can get us there -- at least, as long as I use old enough HighwayData & UserData revisions. Eventually the data will grow to the point that sub-1s graph generation is permanently out of reach.
Returns will also diminish, and at some point there will be no more efficiency to wring out of the process.

yakra commented Feb 1, 2026

b2d5fc7 use segment_name when no GraphListEntry system restrictions

Of the three ideas here, this one's been kicking around the longest, and has the most straightforward diff.

Each graph edge is already constructed with a segment_name based on all its concurrent active/preview routes.
Except it doesn't actually get used. Instead, when writing graph files we call a function that computes a name on the fly based on whether there are system restrictions for system/multisystem/fullcustom graphs. The vast majority of the time, there are no such restrictions, and what gets written is the same as the segment_name.
We should either use it or lose it, whatever's more efficient.
While I'd love to lose it and cram each HGEdge object down into half a cache line (32 B), using it performs better.
If there are system restrictions, call HighwaySegment::write_label like we already do. Else, just write segment_name.
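
A minimal sketch of that branch, not the actual diff. The `Edge` struct, its `route_systems`/`route_names` fields, and the helpers `write_restricted_label`, `write_edge_label` and `edge_label` are hypothetical stand-ins; only `segment_name` and the idea of a slower `HighwaySegment::write_label` path come from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for HGEdge; the real class differs.
struct Edge {
    std::string segment_name;                // precomputed from all concurrent routes
    std::vector<std::string> route_systems;  // assumed: system code per concurrent route
    std::vector<std::string> route_names;    // assumed: name per concurrent route
};

// Slow path (stands in for HighwaySegment::write_label):
// rebuild the label from only the routes whose system is allowed.
void write_restricted_label(std::ostream& out, const Edge& e,
                            const std::vector<std::string>& allowed) {
    assert(e.route_systems.size() == e.route_names.size());
    bool first = true;
    for (std::size_t i = 0; i < e.route_names.size(); ++i) {
        if (std::find(allowed.begin(), allowed.end(), e.route_systems[i]) == allowed.end())
            continue;
        if (!first) out << ',';
        out << e.route_names[i];
        first = false;
    }
}

// allowed == nullptr means "no system restrictions", the common case:
// just write the name that was already computed at construction.
void write_edge_label(std::ostream& out, const Edge& e,
                      const std::vector<std::string>* allowed) {
    if (allowed) write_restricted_label(out, e, *allowed);
    else         out << e.segment_name;
}

// Convenience wrapper for trying the two paths out.
std::string edge_label(const Edge& e, const std::vector<std::string>* allowed) {
    std::ostringstream out;
    write_edge_label(out, e, allowed);
    return out.str();
}
```

The fast path does no per-route work at all, which is where the savings would come from when restrictions are rare.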
1-b2d5
1-b2d5-log

yakra commented Feb 1, 2026

28be0f0 store+retrieve formatted vertex coord strings

The last big round of graph generation speedups revealed a bottleneck in formatting numbers into text strings. Switching to the {fmt} library from the C++-native stuff (like sprintf and std::ostream's << operator) helped considerably, but formatting is still done far more often than necessary; there's a lot of redundancy to lose.

Every vertex's coordinates are put into at least 9 graph files. There are 2 dimensions to this:

  • Each goes in at least 3 categories: master, continent & region graphs.
    Many additionally go into (multi-region) country graphs, and other custom graphs described by the CSVs.
    The coordinate string is reformatted every time.
  • Each goes in 3 formats: simple, collapsed & traveled.
    This is the only time some of the redundancy is exploited -- the string is formatted once, and goes into the simple graph, and the traveled & collapsed graphs as needed -- if the vertex is a vertex.
    When is a vertex not a vertex? When it's omitted from a collapsed or traveled graph. In these cases the coords will still show up as intermediate points along a collapsed (or traveled) edge. This always gets formatted separately.

How's that work out for redundancy in formatting the coordinate strings?

  • ~71% of vertices get formatted 3 (or more) times. Once for every graph category it's in, which will be at least 3 as noted above.
  • ~29% of vertices get formatted 9 (or more) times, 3x for every category.
    These are not visible in the collapsed or traveled graphs, and get formatted (as a vertex) for the simple graphs only.
    They then get formatted separately as intermediate points for the collapsed & traveled graphs, even if it's the same HGEdge object in both.
  • A tiny minority (like 33 out of almost a million) get formatted 6 (or more) times, 2x for every category.
    These are visible in the traveled, but not collapsed, graphs. They get formatted once as a vertex for both the simple & traveled graphs, and once as an intermediate point for the collapsed graph only.
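
As a back-of-envelope figure only: taking the approximate shares above at face value, treating every "or more" as its minimum, and ignoring the tiny 6x group, the average number of times one vertex's coordinates get formatted works out to at least

```latex
0.71 \times 3 + 0.29 \times 9 = 2.13 + 2.61 = 4.74
```

So formatting each string exactly once should cut that work by a factor of roughly 4.7 or better. This is an estimate derived from the rounded percentages, not a measured number.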

The thing to do is, in a separate (threaded!) pass before graph generation proper, format each string once and store it in a small buffer in the HGVertex object, later retrieving it as needed to write to the files.
This separate task is included in the benchmarks below, as it provides the most apples-to-apples comparison to what graph generation was doing before.
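
Roughly, the format-once-and-store idea looks like the sketch below. This is an illustration under assumptions, not the commit itself: the `Vertex` struct, its field names, the 48-byte buffer size and the `"%.15g"` precision are all made up here, `std::snprintf` stands in for the {fmt} call, and the pass is single-threaded for brevity where the real one is threaded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string_view>
#include <vector>

// Hypothetical stand-in for HGVertex.
struct Vertex {
    double lat = 0.0;
    double lng = 0.0;
    char coord_str[48] = {};      // small inline buffer, filled exactly once
    unsigned char coord_len = 0;  // number of valid chars in coord_str
};

// Format "lat lng" into the vertex's own buffer.
// std::snprintf is used only as a stand-in for the {fmt} library.
void format_coords(Vertex& v) {
    int n = std::snprintf(v.coord_str, sizeof v.coord_str, "%.15g %.15g", v.lat, v.lng);
    assert(n > 0 && static_cast<std::size_t>(n) < sizeof v.coord_str);
    v.coord_len = static_cast<unsigned char>(n);
}

// The separate pass that runs before graph generation proper.
void format_all(std::vector<Vertex>& vertices) {
    for (Vertex& v : vertices) format_coords(v);
}

// Later, every graph file that lists this vertex (as a vertex or as an
// intermediate point on a collapsed edge) just copies the stored bytes.
std::string_view coords(const Vertex& v) {
    return std::string_view(v.coord_str, v.coord_len);
}
```

Keeping the buffer inside the vertex object means retrieving the string touches memory that is likely being read anyway when the vertex is written.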

This commit has the greatest impact out of the three on my Linux machines.
2-28be
2-28be-log

yakra commented Feb 1, 2026

e76ad22 vertex nums cache locality & fmt::print

This is the silver bullet for FreeBSD performance!

Like the commit message says, this one actually has 2 components:

  • Cache locality
    This component on its own had little effect on FreeBSD; fmt::print had more impact on 2 of the 7 Linux machines, this did better on 3, and lab5 was a toss-up.
    • Before: Vertex numbers were stored in small fragments all over the heap.
      Each HGVertex has 3 separate pointers to a small array of vertex numbers for each thread. This is triply inefficient...
      • The stuff that's stored together, numbers for different threads, we don't necessarily need at the same time.
      • The stuff we do need at the same time, numbers for the same simple/collapsed/traveled set, are fragmented all over the heap, not necessarily in the same place.
      • The pointers themselves are a waste of space.
    • After: Each thread gets one big array storing numbers for all vertices. The simple, collapsed & traveled numbers for each vertex are stored sequentially.
      When one vertex number is pulled from main memory into cache, adjacent numbers on the same cache line come with it. These are likely to be geographically nearby, and thus needed for the same graph.
  • fmt::print
    Finally, this very simple change had the biggest impact on FreeBSD.
    Rather than using the << operator to insert a formatted integer into a file, fmt::print is used.
    A little surprising how much this helped -- FreeBSD's standard libraries seem especially slow formatting integers, compared to Linux.
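
To make the "after" layout concrete, here is a small sketch of one thread's array with the simple, collapsed & traveled numbers interleaved per vertex. The `VertexNumbers` class, the `Format` enum and the `-1` "not numbered" sentinel are hypothetical names and choices for illustration; the actual code may index things differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum Format : std::size_t { Simple = 0, Collapsed = 1, Traveled = 2 };

// One instance per thread (hypothetical name).
// Layout: [v0.simple, v0.collapsed, v0.traveled, v1.simple, v1.collapsed, ...]
class VertexNumbers {
    std::vector<int> nums_;
public:
    // -1 marks "no number assigned"; an assumption for this sketch.
    explicit VertexNumbers(std::size_t vertex_count) : nums_(vertex_count * 3, -1) {}

    int& at(std::size_t vertex_index, Format f) {
        assert(vertex_index * 3 + f < nums_.size());
        return nums_[vertex_index * 3 + f];
    }
};
```

With 4-byte ints, a 64-byte cache line holds the numbers for about five consecutive vertices, so if neighboring indices tend to be geographically close, one miss brings in several numbers the same graph needs. No per-vertex pointers are stored at all.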

Another thing I tried out was keeping an array of pre-formatted strings similar to 28be0f0, but that performed worse.
I think formatting integers is less expensive than formatting floats, and it's not worth the cache misses to retrieve strings from a giant buffer vs just computing them on the fly.
That, and there's less redundancy to lose. The majority of vertices only have their number listed twice in tm-master-simple.tmg.
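
For a feel of what formatting an integer on the fly without operator<< can look like, here is a standard-library-only sketch. To be clear, this is not what the commit does (the commit uses fmt::print); `std::to_chars` is swapped in purely so the example has no external dependency, and `write_int` / `int_to_text` are hypothetical helper names.

```cpp
#include <cassert>
#include <charconv>
#include <ostream>
#include <sstream>
#include <string>
#include <system_error>

// Format the integer into a small stack buffer, then hand the raw bytes to
// the stream. This skips the stream's locale-aware number formatting.
void write_int(std::ostream& out, int value) {
    char buf[16];  // enough for any 32-bit int, including the sign
    auto [end, ec] = std::to_chars(buf, buf + sizeof buf, value);
    assert(ec == std::errc());
    out.write(buf, end - buf);
}

// Convenience wrapper for trying it out.
std::string int_to_text(int value) {
    std::ostringstream out;
    write_int(out, value);
    return out.str();
}
```

The conversion is a handful of divisions into a buffer that stays in cache, which fits the guess above that recomputing is cheaper than fetching a stored string from a giant array.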

3-e76a
3-e76a-log

Holy blap! FreeBSD performs better than CentOS now!
