Skip to content

Draft replace rapidjson with simdjson#47

Open
rok wants to merge 92 commits intomainfrom
simdjson
Open

Draft replace rapidjson with simdjson#47
rok wants to merge 92 commits intomainfrom
simdjson

Conversation

@rok
Copy link
Owner

@rok rok commented Jan 9, 2026

This is a discussion PR. It will likely be broken up in multiple PRs.

@rok
Copy link
Owner Author

rok commented Jan 28, 2026

@Taepper this currently in a bit of a cmake limbo :).

@Taepper
Copy link

Taepper commented Jan 28, 2026

@Taepper this currently in a bit of a cmake limbo :).

Hehe :) it is quite the big change, I will also take a look tomorrow!

Copy link

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool nice work. Some suggestions to improve the meson integration, which hopefully can help while the cmake issues get figured out (?)

'RapidJSON',
fallback: ['rapidjson', 'rapidjson_dep'],
)
simdjson_dep = dependency('simdjson', fallback: ['simdjson', 'simdjson_dep'])
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
simdjson_dep = dependency('simdjson', fallback: ['simdjson', 'simdjson_dep'])
simdjson_dep = dependency('simdjson')

The fallback as written is not correct (the variable name in the simdjson module is actually still just simdjson), but I also don't think its required in this case

source_hash = b9290a9a6d444c8e049bd589ab804e0ccf2b05dc5984a19ed5ae75d090064806
source_fallback_url = https://apache.jfrog.io/artifactory/arrow/thirdparty/7.0.0/rapidjson-232389d4f1012dddec4ef84861face2d2ba85709.tar.gz
patch_directory = rapidjson
directory = simdjson-3.12.3
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a wrap entry for 4.2.2 at this point - maybe worth using instead of 3.12? Or, if we wanted to stay in the 3 series, there's also 3.13

# under the License.

project('rapidjson', 'cpp')
project('simdjson', 'cpp', version: '3.12.3', default_options: ['cpp_std=c++17'])
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to remove this file altogether. subprojects/packagefiles contain user-specific overrides for packages (i.e. you want to create a meson.build file that is "better" than what you get upstream).

In the case of simdjson, we should probably trust the configuration provided by the wrapdb, unless we know better

@rok
Copy link
Owner Author

rok commented Feb 4, 2026

Cool nice work. Some suggestions to improve the meson integration, which hopefully can help while the cmake issues get figured out (?)

Thanks @WillAyd ! I've not had a chance to properly look into meson config yet and this helps!

Copy link

@Taepper Taepper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments – impressive how fast you went through the code-base! I would have thought that it would first only replace a few rapidjson usages.

Generally, we can choose whether to use the simdjson::dom or simdjson::ondemand frontend for arrow.

simdjson::ondemand can be faster and less memory-intensive. [1]

Its api is a little more difficult to use, but might be fine for our use-case, as we do not intend to parse multiple values in the document in parallel.

I see that you are using ondemand frontend for json/parser.cc, and dom for all other use-cases. It seems sensible to me at first glance, but it might make sense to document the reasoning for this decision

[1] https://arxiv.org/pdf/2312.17149

py3-numpy-dev \
python3-dev \
rapidjson-dev \
simdjson-dev \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this required when you bundle simdjson in all cases?

add_arrow_example(rapidjson_row_converter EXTRA_LINK_LIBS RapidJSON)
add_arrow_example(from_json_string_example EXTRA_LINK_LIBS RapidJSON)
endif()
add_arrow_example(from_json_string_example)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simdjson is not always enabled. Should this stay conditional?

if(ARROW_JSON OR ARROW_FLIGHT_SQL_ODBC)
  set(ARROW_WITH_SIMDJSON ON)
endif()

)

externalproject_add(rapidjson_ep
set(SIMDJSON_CMAKE_ARGS
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to set this option to OFF:
option(SIMDJSON_EXCEPTIONS "Enable simdjson's exception-throwing interface" ON)
https://github.com/simdjson/simdjson/blob/6fffc99dac4029a67dd5a794d2f59ec6ba84a1b1/cmake/exception-flags.cmake#L1

This disables any exception-throwing API-functions from simdjson

python3-rados \
rados-objclass-dev \
rapidjson-dev \
libsimdjson-dev \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above

pkg-config \
protobuf-compiler \
rapidjson-dev \
libsimdjson-dev \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simdjson is bundled again, right?

@rok
Copy link
Owner Author

rok commented Feb 4, 2026

Thanks for the review @Taepper ! As a general comment, I thought I would first create a PR here and work on a POC "in secret" to see if it's feasible and then split it up into manageable PR against main.
I'm travelling today and will probably have time to work on this by end of week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants