|
| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: "Apache Arrow is 10 years old 🎉" |
| 4 | +date: "2026-02-09 00:00:00" |
| 5 | +author: pmc |
| 6 | +categories: [arrow] |
| 7 | +--- |
| 8 | +<!-- |
| 9 | +{% comment %} |
| 10 | +Licensed to the Apache Software Foundation (ASF) under one or more |
| 11 | +contributor license agreements. See the NOTICE file distributed with |
| 12 | +this work for additional information regarding copyright ownership. |
| 13 | +The ASF licenses this file to you under the Apache License, Version 2.0 |
| 14 | +(the "License"); you may not use this file except in compliance with |
| 15 | +the License. You may obtain a copy of the License at |
| 16 | + |
| 17 | +http://www.apache.org/licenses/LICENSE-2.0 |
| 18 | + |
| 19 | +Unless required by applicable law or agreed to in writing, software |
| 20 | +distributed under the License is distributed on an "AS IS" BASIS, |
| 21 | +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 22 | +See the License for the specific language governing permissions and |
| 23 | +limitations under the License. |
| 24 | +{% endcomment %} |
| 25 | +--> |
| 26 | + |
| 27 | +The Apache Arrow project was officially established and had its |
| 28 | +[first git commit](https://github.com/apache/arrow/commit/d5aa7c46692474376a3c31704cfc4783c86338f2) |
| 29 | +on February 5th 2016, and we are therefore excited to announce its 10-year |
| 30 | +anniversary! |
| 31 | + |
| 32 | +Looking back over these 10 years, the project has developed in many unforeseen |
| 33 | +ways and we believe we have delivered on our objective of providing agnostic, |
| 34 | +efficient, durable standards for the exchange of columnar data. |
| 35 | + |
| 36 | +## Apache Arrow 0.1.0 |
| 37 | + |
| 38 | +The first Arrow release, numbered 0.1.0, was tagged on October 7th 2016. It already |
| 39 | +featured the main data types that are still the bread-and-butter of most Arrow datasets, |
| 40 | +as evidenced in this [Flatbuffers declaration](https://github.com/apache/arrow/blob/e7080ef9f1bd91505996edd4e4b7643cc54f6b5f/format/Message.fbs#L96-L115): |
| 41 | + |
| 42 | +```flatbuffers |
| 43 | + |
| 44 | +/// ---------------------------------------------------------------------- |
| 45 | +/// Top-level Type value, enabling extensible type-specific metadata. We can |
| 46 | +/// add new logical types to Type without breaking backwards compatibility |
| 47 | + |
| 48 | +union Type { |
| 49 | + Null, |
| 50 | + Int, |
| 51 | + FloatingPoint, |
| 52 | + Binary, |
| 53 | + Utf8, |
| 54 | + Bool, |
| 55 | + Decimal, |
| 56 | + Date, |
| 57 | + Time, |
| 58 | + Timestamp, |
| 59 | + Interval, |
| 60 | + List, |
| 61 | + Struct_, |
| 62 | + Union |
| 63 | +} |
| 64 | +``` |
| 65 | + |
| 66 | +The [release announcement](https://lists.apache.org/thread/6ow4r2kq1qw1rxp36nql8gokgoczozgw) |
| 67 | +made the bold claim that **"the metadata and physical data representation should |
| 68 | +be fairly stable as we have spent time finalizing the details"**. Does that promise |
| 69 | +hold? The short answer is: yes, almost! But let us analyse that in a bit more detail: |
| 70 | + |
| 71 | +* the [Columnar format](https://arrow.apache.org/docs/format/Columnar.html), for |
| 72 | + the most part, has only seen additions of new datatypes since 2016. |
| 73 | + **One single breaking change** occurred: Union types cannot have a |
| 74 | + top-level validity bitmap anymore. |
| 75 | + |
| 76 | +* the [IPC format](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) |
| 77 | + has seen several minor evolutions of its framing and metadata format; these |
| 78 | + evolutions are encoded in the `MetadataVersion` field which ensures that new |
| 79 | + readers can read data produced by old writers. The single breaking change is |
| 80 | + related to the same Union validity change mentioned above. |
| 81 | + |
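As a rough illustration of how a reader can use `MetadataVersion` to handle the Union change, here is a minimal sketch in plain Python. It assumes that V4 is the last version in which Union columns carry a validity buffer and that V5 is the first without one. The `union_buffers` helper and the numeric enum values are hypothetical and exist only for this example; they are not part of any Arrow library.

```python
from enum import IntEnum


class MetadataVersion(IntEnum):
    """Names follow the IPC metadata versions; numeric values are illustrative only."""
    V4 = 4
    V5 = 5


def union_buffers(version, dense):
    """Return the buffer names a reader would expect for a Union column.

    Hypothetical helper: versions older than V4 are out of scope for this sketch.
    """
    if version not in (MetadataVersion.V4, MetadataVersion.V5):
        raise ValueError("this sketch only covers V4 and V5")
    buffers = []
    if version < MetadataVersion.V5:
        # Assumption: before V5, unions carried a top-level validity bitmap.
        buffers.append("validity")
    buffers.append("type_ids")
    if dense:
        # Dense unions additionally store an offset into the selected child.
        buffers.append("offsets")
    return buffers


v4_dense = union_buffers(MetadataVersion.V4, dense=True)
v5_dense = union_buffers(MetadataVersion.V5, dense=True)
v5_sparse = union_buffers(MetadataVersion.V5, dense=False)
```

Because the version is recorded in the metadata, a reader can pick the right layout up front instead of guessing from the data.
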
| 82 | +## First cross-language integration tests |
| 83 | + |
| 84 | +Arrow 0.1.0 had two implementations: C++ and Java, with bindings of the former |
| 85 | +to Python. There were also no integration tests to speak of, that is, no automated |
| 86 | +assessment that the two implementations were in sync (what could go wrong?). |
| 87 | + |
| 88 | +Integration tests had to wait until [November 2016](https://issues.apache.org/jira/browse/ARROW-372) |
| 89 | +to be designed, and the first [automated CI run](https://github.com/apache/arrow/commit/45ed7e7a36fb2a69de468c41132b6b3bbd270c92) |
| 90 | +probably occurred in December of the same year. Its results cannot be fetched anymore, |
| 91 | +so we can only assume the tests passed successfully. 🙂 |
| 92 | + |
| 93 | +From that moment, integration tests have grown to follow additions to the Arrow format, |
| 94 | +while ensuring that older data can still be read successfully. For example, the |
| 95 | +integration tests that are routinely checked against multiple implementations of |
| 96 | +Arrow have data files [generated in 2019 by Arrow 0.14.1](https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration/0.14.1). |
| 97 | + |
| 98 | +## The lost Union validity bitmap |
| 99 | + |
| 100 | +As mentioned above, at some point the Union type lost its top-level validity bitmap, |
| 101 | +breaking compatibility for the workloads that made use of this feature. |
| 102 | + |
| 103 | +This change was [proposed back in June 2020](https://lists.apache.org/thread/przo99rtpv4rp66g1h4gn0zyxdq56m27) |
| 104 | +and enacted shortly thereafter. It elicited no controversy and doesn't seem to have |
| 105 | +caused any significant discontent among users, signaling that the feature was |
| 106 | +probably not widely used (if at all). |
| 107 | + |
| 108 | +Since then, there have been precisely zero breaking changes in the Arrow Columnar and IPC |
| 109 | +formats. |
| 110 | + |
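To make the consequence concrete: without a top-level validity bitmap, whether a Union slot is null depends only on the child value it points to. The following plain-Python sketch models a dense union with lists, using `None` for nulls. It assumes that type ids map directly to child indices, and the `dense_union_to_pylist` helper is hypothetical, written only for this illustration.

```python
def dense_union_to_pylist(type_ids, offsets, children):
    """Materialize a dense union as a Python list, using None for nulls.

    There is no union-level validity bitmap: a slot is null only if the
    child value it points to is null.
    Assumption: each type id is also the index of its child.
    """
    return [children[t][o] for t, o in zip(type_ids, offsets)]


ints = [1, None]       # child 0; its second value is null
strings = ["a"]        # child 1
type_ids = [0, 1, 0]   # which child each union slot selects
offsets = [0, 0, 1]    # position inside the selected child

values = dense_union_to_pylist(type_ids, offsets, [ints, strings])
```

Here the third union slot is null because it selects the null entry of the integer child, not because of any flag stored on the union itself.
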
| 111 | +## Apache Arrow 1.0.0 |
| 112 | + |
| 113 | +We have been extremely cautious with version numbering and waited |
| 114 | +[until July 2020](https://arrow.apache.org/blog/2020/07/24/1.0.0-release/) |
| 115 | +before finally switching away from 0.x version numbers. This signalled |
| 116 | +to the world that Arrow had reached its "adult phase" of making formal compatibility |
| 117 | +promises, and that the Arrow formats were ready for wide consumption amongst |
| 118 | +the data ecosystem. |
| 119 | + |
| 120 | +## Apache Arrow, today |
| 121 | + |
| 122 | +Describing the breadth of the Arrow ecosystem today would take a full-fledged |
| 123 | +article of its own, or perhaps even multiple Wikipedia pages. |
| 124 | + |
| 125 | +As for the Arrow project, we will merely refer you to our official documentation: |
| 126 | + |
| 127 | +1. [The various specifications](https://arrow.apache.org/docs/format/index.html#) |
| 128 | + that cater to multiple aspects of sharing Arrow data, such as |
| 129 | + [in-process zero-copy sharing](https://arrow.apache.org/docs/format/CDataInterface.html) |
| 130 | + between producers and consumers that know nothing about each other, or |
| 131 | + [executing database queries](https://arrow.apache.org/docs/format/ADBC.html) |
| 132 | + that efficiently return their results in the Arrow format. |
| 133 | + |
| 134 | +2. [The implementation status page](https://arrow.apache.org/docs/status.html) |
| 135 | + that lists the implementations developed officially under the Apache Arrow |
| 136 | + umbrella; but keep in mind that multiple third-party implementations exist |
| 137 | + in non-Apache projects, either open source or proprietary. |
| 138 | + |
| 139 | +However, that is only a small part of the landscape. The Arrow project hosts |
| 140 | +several official subprojects, such as [ADBC](https://arrow.apache.org/adbc) |
| 141 | +and [nanoarrow](https://arrow.apache.org/nanoarrow). A notable success story is |
| 142 | +[Apache DataFusion](https://datafusion.apache.org/), which began as an Arrow |
| 143 | +subproject and later graduated to become an independent top-level project in the |
| 144 | +Apache Software Foundation, reflecting the maturity and impact of the technology. |
| 145 | + |
| 146 | +Beyond these subprojects, many third-party efforts have adopted the Arrow formats |
| 147 | +for efficient interoperability. [GeoArrow](https://geoarrow.org/) is an impressive |
| 148 | +example of how building on top of existing Arrow formats and implementations can |
| 149 | +enable groundbreaking efficiency improvements in a very non-trivial problem space. |
| 150 | + |
| 151 | +It should also be noted that Arrow is often used hand in hand with |
| 152 | +[Apache Parquet](https://parquet.apache.org/), another open standard for columnar |
| 153 | +data with a significant community overlap and a tremendous usage base. |
| 154 | + |
| 155 | +## Tomorrow |
| 156 | + |
| 157 | +We cannot really predict the future, and the project does not have a formal roadmap. |
| 158 | +While the specifications are stable, they may still welcome additions to cater for |
| 159 | +new use cases, as they have done in the past. |
| 160 | + |
| 161 | +The Arrow implementations are actively maintained, gaining new features, bug fixes, |
| 162 | +and performance improvements. We encourage people to contribute to their implementation |
| 163 | +of choice, and to [engage with us and the community](https://arrow.apache.org/community/). |
| 164 | + |
| 165 | +However, much of the progress is also happening in the broader ecosystem of |
| 166 | +third-party tools and libraries. Even for us, it has become almost impossible |
| 167 | +to keep track of everything happening on the same stable foundations that |
| 168 | +were laid 10 years ago. |