Commit 77e83f8

committed
Add 10-year anniversary blog post
1 parent 0d7326a commit 77e83f8

1 file changed

Lines changed: 168 additions & 0 deletions

@@ -0,0 +1,168 @@
1+
---
2+
layout: post
3+
title: "Apache Arrow is 10 years old 🎉"
4+
date: "2026-02-09 00:00:00"
5+
author: pmc
6+
categories: [arrow]
7+
---
8+
<!--
9+
{% comment %}
10+
Licensed to the Apache Software Foundation (ASF) under one or more
11+
contributor license agreements. See the NOTICE file distributed with
12+
this work for additional information regarding copyright ownership.
13+
The ASF licenses this file to you under the Apache License, Version 2.0
14+
(the "License"); you may not use this file except in compliance with
15+
the License. You may obtain a copy of the License at
16+
17+
http://www.apache.org/licenses/LICENSE-2.0
18+
19+
Unless required by applicable law or agreed to in writing, software
20+
distributed under the License is distributed on an "AS IS" BASIS,
21+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
22+
See the License for the specific language governing permissions and
23+
limitations under the License.
24+
{% endcomment %}
25+
-->
26+
27+
The Apache Arrow project was officially established and had its
28+
[first git commit](https://github.com/apache/arrow/commit/d5aa7c46692474376a3c31704cfc4783c86338f2)
29+
on February 5th 2016, and we are therefore excited to announce its 10-year
30+
anniversary!
31+
32+
Looking back over these 10 years, the project has developed in many unforeseen
33+
ways, and we believe we have delivered on our objective of providing agnostic,
34+
efficient, durable standards for the exchange of columnar data.
35+
36+
## Apache Arrow 0.1.0
37+
38+
The first Arrow release, numbered 0.1.0, was tagged on October 7th 2016. It already
39+
featured the main data types that are still the bread-and-butter of most Arrow datasets,
40+
as evidenced in this [Flatbuffers declaration](https://github.com/apache/arrow/blob/e7080ef9f1bd91505996edd4e4b7643cc54f6b5f/format/Message.fbs#L96-L115):
41+
42+
```flatbuffers
/// ----------------------------------------------------------------------
/// Top-level Type value, enabling extensible type-specific metadata. We can
/// add new logical types to Type without breaking backwards compatibility

union Type {
  Null,
  Int,
  FloatingPoint,
  Binary,
  Utf8,
  Bool,
  Decimal,
  Date,
  Time,
  Timestamp,
  Interval,
  List,
  Struct_,
  Union
}
```
65+
66+
The [release announcement](https://lists.apache.org/thread/6ow4r2kq1qw1rxp36nql8gokgoczozgw)
67+
made the bold claim that **"the metadata and physical data representation should
68+
be fairly stable as we have spent time finalizing the details"**. Does that promise
69+
hold? The short answer is: yes, almost! But let us analyse that in a bit more detail:
70+
71+
* the [Columnar format](https://arrow.apache.org/docs/format/Columnar.html), for
72+
the most part, has only seen additions of new datatypes since 2016.
73+
**One single breaking change** occurred: Union types cannot have a
74+
top-level validity bitmap anymore.
75+
76+
* the [IPC format](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc)
77+
has seen several minor evolutions of its framing and metadata format; these
78+
evolutions are encoded in the `MetadataVersion` field which ensures that new
79+
readers can read data produced by old writers. The single breaking change is
80+
related to the same Union validity change mentioned above.
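
To make one of these framing evolutions concrete, here is a minimal Python sketch of how a reader might accept both the current and the legacy message prefix. It rests on our reading of the IPC specification: since Arrow 0.15.0, each encapsulated message starts with the 32-bit continuation marker `0xFFFFFFFF` followed by a little-endian 32-bit metadata length, whereas older streams start directly with the length. The names `read_message_prefix` and `CONTINUATION` and the shape of the returned tuple are hypothetical, chosen for illustration; this is not code from any Arrow implementation.

```python
import struct

# Assumption: marker value taken from our reading of the IPC specification.
CONTINUATION = 0xFFFFFFFF


def read_message_prefix(buf: bytes, offset: int = 0):
    """Return (metadata_length, metadata_start, framing) for one message.

    Hypothetical helper: it only inspects the length prefix and does not
    decode the Flatbuffers metadata that follows it.
    """
    (first,) = struct.unpack_from("<I", buf, offset)
    if first == CONTINUATION:
        # Current framing: continuation marker, then the int32 length.
        (length,) = struct.unpack_from("<i", buf, offset + 4)
        return length, offset + 8, "current"
    # Legacy framing: the first four bytes are the length itself.
    (length,) = struct.unpack_from("<i", buf, offset)
    return length, offset + 4, "legacy"


if __name__ == "__main__":
    # Two synthetic prefixes, each followed by 16 bytes of placeholder metadata.
    modern = struct.pack("<Ii", CONTINUATION, 16) + b"\x00" * 16
    legacy = struct.pack("<i", 16) + b"\x00" * 16
    print(read_message_prefix(modern))
    print(read_message_prefix(legacy))
```

As far as we understand the specification, a zero length marks the end of the stream in both framings. A real reader would go on to decode the Flatbuffers metadata and check its `MetadataVersion` field.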
81+
82+
## First cross-language integration tests
83+
84+
Arrow 0.1.0 had two implementations: C++ and Java, with bindings of the former
85+
to Python. There were also no integration tests to speak of, that is, no automated
86+
assessment that the two implementations were in sync (what could go wrong?).
87+
88+
Integration tests had to wait until [November 2016](https://issues.apache.org/jira/browse/ARROW-372)
89+
to be designed, and the first [automated CI run](https://github.com/apache/arrow/commit/45ed7e7a36fb2a69de468c41132b6b3bbd270c92)
90+
probably occurred in December of the same year. Its results cannot be fetched anymore,
91+
so we can only assume the tests passed successfully. 🙂
92+
93+
Since then, the integration tests have grown to follow additions to the Arrow format,
94+
while ensuring that older data can still be read successfully. For example, the
95+
integration tests that are routinely checked against multiple implementations of
96+
Arrow have data files [generated in 2019 by Arrow 0.14.1](https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration/0.14.1).
97+
98+
## The lost Union validity bitmap
99+
100+
As mentioned above, at some point the Union type lost its top-level validity bitmap,
101+
breaking compatibility for the workloads that made use of this feature.
102+
103+
This change was [proposed back in June 2020](https://lists.apache.org/thread/przo99rtpv4rp66g1h4gn0zyxdq56m27)
104+
and enacted shortly thereafter. It elicited no controversy and doesn't seem to have
105+
caused any significant discontent among users, signaling that the feature was
106+
probably not widely used (if at all).
107+
108+
Since then, there have been precisely zero breaking changes in the Arrow Columnar and IPC
109+
formats.
110+
111+
## Apache Arrow 1.0.0
112+
113+
We have been extremely cautious with version numbering and waited
114+
[until July 2020](https://arrow.apache.org/blog/2020/07/24/1.0.0-release/)
115+
before finally switching away from 0.x version numbers. This signalled
116+
to the world that Arrow had reached its "adult phase" of making formal compatibility
117+
promises, and that the Arrow formats were ready for wide consumption amongst
118+
the data ecosystem.
119+
120+
## Apache Arrow, today
121+
122+
Describing the breadth of the Arrow ecosystem today would take a full-fledged
123+
article of its own, or perhaps even multiple Wikipedia pages.
124+
125+
As for the Arrow project, we will merely refer you to our official documentation:
126+
127+
1. [The various specifications](https://arrow.apache.org/docs/format/index.html#)
128+
that cater to multiple aspects of sharing Arrow data, such as
129+
[in-process zero-copy sharing](https://arrow.apache.org/docs/format/CDataInterface.html)
130+
between producers and consumers that know nothing about each other, or
131+
[executing database queries](https://arrow.apache.org/docs/format/ADBC.html)
132+
that efficiently return their results in the Arrow format.
133+
134+
2. [The implementation status page](https://arrow.apache.org/docs/status.html)
135+
that lists the implementations developed officially under the Apache Arrow
136+
umbrella; but keep in mind that multiple third-party implementations exist
137+
in non-Apache projects, either open source or proprietary.
138+
139+
However, that is only a small part of the landscape. The Arrow project hosts
140+
several official subprojects, such as [ADBC](https://arrow.apache.org/adbc)
141+
and [nanoarrow](https://arrow.apache.org/nanoarrow). A notable success story is
142+
[Apache DataFusion](https://datafusion.apache.org/), which began as an Arrow
143+
subproject and later graduated to become an independent top-level project in the
144+
Apache Software Foundation, reflecting the maturity and impact of the technology.
145+
146+
Beyond these subprojects, many third-party efforts have adopted the Arrow formats
147+
for efficient interoperability. [GeoArrow](https://geoarrow.org/) is an impressive
148+
example of how building on top of existing Arrow formats and implementations can
149+
enable groundbreaking efficiency improvements in a very non-trivial problem space.
150+
151+
It should also be noted that Arrow is often used hand in hand with
152+
[Apache Parquet](https://parquet.apache.org/), another open standard for columnar
153+
data with a significant community overlap and a tremendous usage base.
154+
155+
## Tomorrow
156+
157+
We cannot really predict the future, and the project does not have a formal roadmap.
158+
While the specifications are stable, they may still welcome additions to cater for
159+
new use cases, as they have done in the past.
160+
161+
The Arrow implementations are actively maintained, gaining new features, bug fixes,
162+
and performance improvements. We encourage people to contribute to their implementation
163+
of choice, and to [engage with us and the community](https://arrow.apache.org/community/).
164+
165+
However, much of the progress is also happening in the broader ecosystem of
166+
third-party tools and libraries. Even for us, it has become almost impossible
167+
to keep track of all the things happening, on the same stable foundations that
168+
were laid 10 years ago.
