Add 10-year anniversary blog post #756
Conversation
Preview URL: https://pitrou.github.io/arrow-site. If the preview URL doesn't work, you may have forgotten to configure your fork repository for preview.
> example of how building on top of existing Arrow formats and implementations can
> enable groundbreaking efficiency improvements in a very non-trivial problem space.
>
> It should also be noted that Arrow is often used hand in hand with
Do we want to mention the donation of parquet-cpp and the native Parquet implementations in Rust and Go?
I don't know, how would you word it? It's more of a rationalization of existing development practices rather than a "donation" (i.e. a gift).
At minimum, it's probably worth mentioning that nearly all official parquet libraries live in Arrow repositories
I think readers may be interested in the story of the "donation", which is less common. I asked Gemini to write the anecdote below.
Did you know that the Parquet C++ code used to live in its own repository? In the early years, developers found themselves in a 'circular dependency morass.' To fix a bug in PyArrow's Parquet support, you often had to submit a patch to one project, wait for a release, and then update the other. In 2018, the community decided to stop fighting the logistics and merged the C++ and Python development of both into the Apache Arrow mono-repo. This move streamlined the project and solidified the tight-knit relationship between the world's best on-disk and in-memory columnar formats.
> At minimum, it's probably worth mentioning that nearly all official parquet libraries live in Arrow repositories
Definitely, will do.
alamb left a comment
I love this -- thank you @pitrou for writing it. I left a bunch of editorial comments, but I don't think any of them are required (all "nice to have")
> integration tests that are routinely checked against multiple implementations of
> Arrow have data files [generated in 2019 by Arrow 0.14.1](https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration/0.14.1).
>
> ## The lost Union validity bitmap
I would personally suggest removing this section -- it is already mentioned above and I think it distracts from the main narrative about Arrow's stability and widespread adoption.
> Union types cannot have a top-level validity bitmap anymore.
I suggest adding a link to the mailing list discussion in that text https://lists.apache.org/thread/przo99rtpv4rp66g1h4gn0zyxdq56m27 and then removing this section
I don't know, it might be a bit of interesting trivia for the reader. What do other people think? @ianmcook @paleolimbot @raulcd
I agree with Andrew here, although the post is great either way. This particular piece of trivia caused me mild personal discontent since compatibility with this and previous versions is still exercised in integration testing; however, it does interrupt the narrative a bit and I'm not sure there are many implementors of IPC readers out there.
I agree with @alamb
As a reader, I find this part interesting, and I take it as a callback to the earlier wording. But I think it might be better to highlight the rationale for this change in a short sentence.
I am OK with either keeping it or removing it. I personally find it interesting, and I also find that adding it as a section reinforces the message that the format is stable: there haven't been any breaking changes to the formats since then, and we are very careful and aware of that.
Piggybacking on @raulcd 's comment...
The section could potentially be retitled as "No breaking changes (almost)" or something to that effect. This fits in with the overall narrative while still giving a spot to talk about the trivia.
> Since then, there has been precisely zero breaking change in the Arrow Columnar and IPC
> formats.
>
> ## Apache Arrow 1.0.0
I personally suggest moving this paragraph up to the top (right after the introduction)
I realize that the current blog structure is chronological, but I think ordering it in descending order of importance would improve the flow -- if we moved this paragraph to the start, the blog would start with a victory lap about the stability and wide reaching impact (Arrow 1.0) and then discuss some of the path to get there.
But the "today" part would still be at the end, right? That might read awkwardly?
Yes, you are right -- it would make sense to move the "today" section to the top as well if we go in this direction
> Beyond these subprojects, many third-party efforts have adopted the Arrow formats
> for efficient interoperability. [GeoArrow](https://geoarrow.org/) is an impressive
Would it be worthwhile to mention Polaris (built on top of Arrow), NVIDIA RAPIDS and cuDF (use the Arrow format on the GPU), DuckDB (zero-copy interoperable with Arrow), Dremio (Arrow-native internally, uses Flight SQL), InfluxDB (Flight SQL and Arrow-native), Snowflake returning Arrow, Google BigQuery returning Arrow, Spark Connect using Arrow, etc.?
I don't know, we'll always be misrepresenting reality since there are so many projects that could be mentioned. I thought GeoArrow is interesting to mention because they are adding their own datatypes to address a particular problem space.
Maybe we link to Powered By or something?
We already do above :)
zanmato1984 left a comment
Thanks for the epic post!
> ## How it started
>
> From the start, Arrow has been a joint effort between practitioners of various
> horizons looking to build common grounds to efficiently exchange columnar data
> between different libraries and systems.
> In [this blog post](https://sympathetic.ink/2024/02/06/Chapter-2-From-Parquet-to-Arrow.html),
> Julien Le Dem recalls how some of the founders of the [Apache Parquet](https://parquet.apache.org/)
> project participated in the early days of the Arrow design phase. The idea of Arrow
> as an in-memory format was meant to address the over half of the interoperability
> problem, the natural complement to Parquet as a persistent storage format.
@julienledem Would you like to do a quick read here, in case I'm misrepresenting things?
raulcd left a comment
Thanks @pitrou for working on this. I think it's great! We should celebrate more!
westonpace left a comment
Great article, thanks for writing this!
> In [this blog post](https://sympathetic.ink/2024/02/06/Chapter-2-From-Parquet-to-Arrow.html),
> Julien Le Dem recalls how some of the founders of the [Apache Parquet](https://parquet.apache.org/)
> project participated in the early days of the Arrow design phase. The idea of Arrow
> as an in-memory format was meant to address the over half of the interoperability

Suggested change:
- as an in-memory format was meant to address the over half of the interoperability
+ as an in-memory format was meant to address the other half of the interoperability
> participate constructively. While the specifications are stable, they may still
> welcome additions to cater for new use cases, as they have done in the past.

Suggested change:
- participate constructively. While the specifications are stable, they may still
- welcome additions to cater for new use cases, as they have done in the past.
+ participate constructively. While the specifications are stable, they still
+ welcome additions to cater for new use cases, as they have done in the past.
Feel free to ignore, I don't have any formal justification for this suggestion, the wording just seemed a little off
> welcome additions to cater for new use cases, as they have done in the past.
>
> The Arrow implementations are actively maintained, gaining new features, bug fixes,
> performance improvements. We encourage people to contribute to their implementation

Suggested change:
- performance improvements. We encourage people to contribute to their implementation
+ and performance improvements. We encourage people to contribute to their implementation