
Conversation

Member

@pitrou pitrou commented Feb 9, 2026

No description provided.


github-actions bot commented Feb 9, 2026

Preview URL: https://pitrou.github.io/arrow-site

If the preview URL doesn't work, you may have forgotten to configure your fork repository for previews.
See https://github.com/apache/arrow-site/blob/main/README.md#forks for how to configure it.

pitrou force-pushed the arrow-10-years branch 3 times, most recently from 32313b0 to 097250b on February 9, 2026 at 13:23
pitrou force-pushed the arrow-10-years branch 2 times, most recently from 77e83f8 to 418acd0 on February 9, 2026 at 17:46
example of how building on top of existing Arrow formats and implementations can
enable groundbreaking efficiency improvements in a very non-trivial problem space.

It should also be noted that Arrow is often used hand in hand with
Member

Do we want to mention the donation of parquet-cpp and the native Parquet implementations in Rust and Go?

Member Author

I don't know, how would you word it? It's more of a rationalization of existing development practices rather than a "donation" (i.e. a gift).

Member

At minimum, it's probably worth mentioning that nearly all official parquet libraries live in Arrow repositories

Member

I think readers may be interested in the story of the "donation" which is less common. I asked Gemini to write the anecdote as below.

Did you know that the Parquet C++ code used to live in its own repository? In the early years, developers found themselves in a 'circular dependency morass.' To fix a bug in PyArrow's Parquet support, you often had to submit a patch to one project, wait for a release, and then update the other. In 2018, the community decided to stop fighting the logistics and merged the C++ and Python development of both into the Apache Arrow mono-repo. This move streamlined the project and solidified the tight-knit relationship between the world's best on-disk and in-memory columnar formats.

Member Author

At minimum, it's probably worth mentioning that nearly all official parquet libraries live in Arrow repositories

Definitely, will do.

Contributor

@alamb alamb left a comment

I love this -- thank you @pitrou for writing it. I left a bunch of editorial comments, but I don't think any of them are required (all "nice to have")

integration tests that are routinely checked against multiple implementations of
Arrow have data files [generated in 2019 by Arrow 0.14.1](https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration/0.14.1).

## The lost Union validity bitmap
Contributor

I would personally suggest removing this section -- it is already mentioned above and I think it distracts from the main narrative about Arrow's stability and widespread adoption.

Union types cannot have a top-level validity bitmap anymore.

I suggest adding a link to the mailing list discussion in that text https://lists.apache.org/thread/przo99rtpv4rp66g1h4gn0zyxdq56m27 and then removing this section

Member Author

I don't know, it might be a bit of interesting trivia for the reader. What do other people think? @ianmcook @paleolimbot @raulcd

Member

I agree with Andrew here, although the post is great either way. This particular piece of trivia caused me mild personal discontent since compatibility with this and previous versions is still exercised in integration testing; however, it does interrupt the narrative a bit and I'm not sure there are many implementors of IPC readers out there.

Member

I agree with @alamb

Contributor

As a reader, I find this part interesting, and I take it as a callback to the wording mentioned earlier. But I think it might be better to highlight the rationale for this change in a short sentence.

Member

@raulcd raulcd Feb 11, 2026

I am ok with both keeping it or removing it. I personally find it interesting and I also find that adding it as a section reinforces the message that the format is stable, there hasn't been any breaking changes on the formats since then and we are very careful and aware of those.

Member

Piggybacking on @raulcd 's comment...

The section could potentially be retitled as "No breaking changes (almost)" or something to that effect. This fits in with the overall narrative while still giving a spot to talk about the trivia.

Since then, there has been precisely zero breaking change in the Arrow Columnar and IPC
formats.

## Apache Arrow 1.0.0
Contributor

I personally suggest moving this paragraph up to the top (right after the introduction)

I realize that the current blog structure is chronological, but I think ordering it in descending order of importance would improve the flow -- if we moved this paragraph to the start, the blog would start with a victory lap about the stability and wide reaching impact (Arrow 1.0) and then discuss some of the path to get there.

Member Author

But the "today" part would still be at the end, or? That might read awkwardly?

Contributor

Yes, you are right -- it would make sense to move the "today" section to the top as well if we go in this direction

Comment on lines +149 to +161
Beyond these subprojects, many third-party efforts have adopted the Arrow formats
for efficient interoperability. [GeoArrow](https://geoarrow.org/) is an impressive
Member

would it be worthwhile to mention polaris (built on top of arrow), NVIDIA Rapids and cuDF (use the Arrow Format on the gpu), duckdb (zero-copy interoperable with Arrow), Dremio (Arrow-native internally, uses FlightSQL), InfluxDB (FlightSQL and Arrow-native), Snowflake returning Arrow, Google BigQuery returning Arrow, Spark Connect using Arrow, etc....

Member Author

I don't know, we'll always be misrepresenting reality since there are so many projects that could be mentioned. I thought GeoArrow is interesting to mention because they are adding their own datatypes to address a particular problem space.

Member

Maybe we link to Powered By or something?

Member Author

We already do above :)

Contributor

@zanmato1984 zanmato1984 left a comment

Thanks for the epic post!

Comment on lines +36 to +45
## How it started

From the start, Arrow has been a joint effort between practitioners of various
horizons looking to build common grounds to efficiently exchange columnar data
between different libraries and systems.
In [this blog post](https://sympathetic.ink/2024/02/06/Chapter-2-From-Parquet-to-Arrow.html),
Julien Le Dem recalls how some of the founders of the [Apache Parquet](https://parquet.apache.org/)
project participated in the early days of the Arrow design phase. The idea of Arrow
as an in-memory format was meant to address the over half of the interoperability
problem, the natural complement to Parquet as a persistent storage format.
Member Author

@julienledem Would you like to do a quick read here, in case I'm misrepresenting things?

Member

@raulcd raulcd left a comment

Thanks @pitrou for working on this. I think it's great! We should celebrate more!

Member

@westonpace westonpace left a comment

Great article, thanks for writing this!

In [this blog post](https://sympathetic.ink/2024/02/06/Chapter-2-From-Parquet-to-Arrow.html),
Julien Le Dem recalls how some of the founders of the [Apache Parquet](https://parquet.apache.org/)
project participated in the early days of the Arrow design phase. The idea of Arrow
as an in-memory format was meant to address the over half of the interoperability
Member

Suggested change
as an in-memory format was meant to address the over half of the interoperability
as an in-memory format was meant to address the other half of the interoperability


Comment on lines +174 to +175
participate constructively. While the specifications are stable, they may still
welcome additions to cater for new use cases, as they have done in the past.
Member

Suggested change
participate constructively. While the specifications are stable, they may still
welcome additions to cater for new use cases, as they have done in the past.
participate constructively. While the specifications are stable, they still
welcome additions to cater for new use cases, as they have done in the past.

Feel free to ignore, I don't have any formal justification for this suggestion, the wording just seemed a little off

welcome additions to cater for new use cases, as they have done in the past.

The Arrow implementations are actively maintained, gaining new features, bug fixes,
performance improvements. We encourage people to contribute to their implementation
Member

Suggested change
performance improvements. We encourage people to contribute to their implementation
and performance improvements. We encourage people to contribute to their implementation



10 participants