Conversation
Working on proper benchmarking.
The items on the list are the extensions to the SedonaDB vectorized UDFs:
@jiayuasu @paleolimbot I think we can start reviewing the changes and the ideas I am proposing in this MR. What I observed is that, this way, a UDF can be even faster than native Sedona functions like ST_Buffer. ST_Area, for instance, is three times slower, so it depends on the specific function. More importantly, the performance is better than the previous UDFs in Sedona. I would mark this functionality as experimental. I also haven't included a documentation update, as we might decide during the review that this MR needs adjustment.
This piece of code works only with Spark 3.5, but I plan to extend it to Spark 4.0.
I would like to extend it to include user-defined table functions, which will allow us to operate on an entire SedonaDB dataframe.
.github/workflows/pyflink.yml
cd python
uv add apache-flink==1.20.1
uv sync
# uv sync --extra flink
matrix:
os: ['ubuntu-latest', 'windows-latest', 'macos-15']
python: ['3.11', '3.10', '3.9', '3.8']
python: ['3.11', '3.10', '3.9']
I had trouble integrating with Python 3.8, and it has already been a year since it reached EOL. What would you think about removing it, and maybe starting to support Python 3.12 and 3.13?
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
This one is almost a copy-paste of what is in Apache Spark. The only difference is the imported worker function:
from sedona.spark.worker.worker import main as worker_main
I don't know if there is a better approach, such as using the import of functions like manager?
I'm not sure if monkeypatching the worker_main property of PySpark's daemon module after importing it would work, but I still prefer the current approach. Maintaining a fork of daemon.py is fine once we know where the modified pieces are.
Yeah, I agree. I'll add that information in the header.
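For reference, a rough sketch of the monkeypatching alternative mentioned above, assuming pyspark.daemon exposes a module-level worker_main and manager() the way the upstream daemon.py does (not verified in this MR):

```python
# Hypothetical sketch only, not the approach taken in this MR: rebind the
# worker entry point on PySpark's daemon module instead of forking daemon.py.
import pyspark.daemon as spark_daemon

from sedona.spark.worker.worker import main as sedona_worker_main

# Upstream daemon.py calls the module-level name `worker_main` when it forks a
# worker process, so rebinding it would redirect workers to the Sedona version.
spark_daemon.worker_main = sedona_worker_main

if __name__ == "__main__":
    spark_daemon.manager()
```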
crs = self.geom_offsets[arg]
fields.append(
    f"ST_GeomFromSedonaSpark(_{arg}, 'EPSG:{crs}') AS _{arg}"
)  # nosec
A theoretical SQL injection, which does not cause any harm here.
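If we ever want to harden it anyway, a minimal sketch could validate the value before interpolation; _safe_epsg is a hypothetical helper and the sample values are made up:

```python
# Hypothetical hardening sketch: force the CRS to an integer EPSG code before
# interpolating it into the SQL fragment, so non-numeric input raises early.
def _safe_epsg(crs) -> int:
    return int(crs)

arg = 0        # made-up argument index
crs = 4326     # made-up CRS value
field = f"ST_GeomFromSedonaSpark(_{arg}, 'EPSG:{_safe_epsg(crs)}') AS _{arg}"
```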
return Py_BuildValue("(Kibi)", geom, geom_type_id, has_z, length);
}

static PyObject *to_sedona_func(PyObject *self, PyObject *args) {
If the Sedona speedup is available, instead of translating to WKB and then loading from WKB with Shapely, we can create Shapely objects directly to speed up the vectorized UDFs.
I'm not sure if it could also be applied to _apply_shapely_series_udf and _apply_geo_series_udf.
Yeah, I think it could be, but with some modifications to the existing Python and Scala code, I suppose.
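As a rough illustration of the idea (generic Shapely 2.x usage, not this MR's code path), geometries can be built directly from coordinate buffers instead of going through WKB; the sample arrays are made up:

```python
# Illustration only: build Shapely geometries straight from coordinate arrays
# (shapely.from_ragged_array) instead of serializing to WKB and parsing it back.
import numpy as np
import shapely
from shapely import GeometryType

# Current-style round trip, roughly: geoms = shapely.from_wkb(wkb_array)

# Direct construction from a ragged coordinate array (three points here):
coords = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
points = shapely.from_ragged_array(GeometryType.POINT, coords)
```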
python/tests/test_base.py
"20",
)
# Pandas on PySpark doesn't work with ANSI mode, which is enabled by default
.config("spark.executor.memory", "10G")
To be removed; I forgot to take it out after testing.
from setuptools import setup
import numpy

setup(
This is needed to make the NumPy C wrappers available.
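Roughly, that amounts to passing NumPy's header directory to the extension build; the module and source names below are placeholders rather than the ones in this MR:

```python
# Sketch of exposing the NumPy C API to a C extension; names are placeholders.
import numpy
from setuptools import Extension, setup

setup(
    ext_modules=[
        Extension(
            "sedona_geom_ext",                    # hypothetical module name
            sources=["sedona_geom_ext.c"],        # hypothetical source file
            include_dirs=[numpy.get_include()],   # makes numpy/arrayobject.h visible
        )
    ]
)
```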
val sedonaArrowStrategy = Try(
  Class
    .forName("org.apache.spark.sql.udf.SedonaArrowStrategy")
    .forName("org.apache.spark.sql.execution.python.SedonaArrowStrategy")
We need some private methods from Spark's execution.python package.
case _ => None
}

schema
Infer the geometry fields by taking the first value.
spark/spark-3.5/src/test/scala/org/apache/sedona/sql/TestBaseScala.scala
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-javadoc-plugin</artifactId>
<!-- <version>3.12.0</version>-->
@Kontinuation can you take a look?
I initially thought that this PR was for offloading some ST functions to SedonaDB to accelerate the evaluation of the functions SedonaDB supports, but that does not seem to be the case after taking a closer look at it. The main purpose seems to be delegating UDFs to SedonaDB, so that the vectorized UDFs will be executed by SedonaDB instead of PySpark. I'm not sure if my understanding is correct.
I'm a bit curious why adding another level of indirection results in a performance improvement, and whether SedonaDB is really playing an important role in that improvement.
val geom = rowMatched.get.get(index, GeometryUDT).asInstanceOf[Array[Byte]]
val preambleByte = geom(0) & 0xff
val hasSrid = (preambleByte & 0x01) != 0

var srid = 0
if (hasSrid) {
  val srid2 = (geom(1) & 0xff) << 16
  val srid1 = (geom(2) & 0xff) << 8
  val srid0 = geom(3) & 0xff
  srid = srid2 | srid1 | srid0
}

(index, srid)
We can extract the SRID-parsing code into a static function.
val row = iterator.next()
val rowMatched = row match {
  case generic: GenericInternalRow =>
    Some(generic)
  case _ => None
}
We are only taking the SRID of the geometry value in the first row as the SRID of the entire field; this does not work well with geometry data with mixed SRIDs. SedonaDB added an item-crs data type for such data (apache/sedona-db#410); I think we should always use this to bridge Sedona and SedonaDB.
mem.map(_ / cores)
}

import java.io._
Move this to the top of the file.
import java.io.DataOutputStream
import java.net.Socket

private[python] trait SedonaPythonArrowInput[IN] extends PythonArrowInput[IN] {
I suggest adding a GitHub link to the comment, noting where most of the code was taken from. This suggestion also applies to SedonaPythonArrowOutput.
eval_type = 6200
if sedona_db_speedup_enabled:
    eval_type = 6201
Define these eval types as constants, such as SQL_SCALAR_SEDONA_DB_UDF. I believe we should follow the same pattern as SEDONA_SCALAR_EVAL_TYPE.
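Something along these lines, as a sketch; the names are only suggestions and mapping 6200/6201 to them is an assumption based on the diff above:

```python
# Suggested constants for the eval types in the diff above; names and the exact
# value-to-name mapping are assumptions, not confirmed Sedona constants.
SEDONA_SCALAR_EVAL_TYPE = 6200      # existing vectorized UDF path (assumed)
SQL_SCALAR_SEDONA_DB_UDF = 6201     # SedonaDB-accelerated path (suggested name)

sedona_db_speedup_enabled = True    # example flag value
eval_type = (
    SQL_SCALAR_SEDONA_DB_UDF if sedona_db_speedup_enabled else SEDONA_SCALAR_EVAL_TYPE
)
```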
PyArrayObject *array = (PyArrayObject *)input_obj;
PyObject **objs = (PyObject **)PyArray_DATA(array);
Do we need to check the type of input_obj using PyArray_Check before casting it to PyArrayObject and calling PyArray_* methods?
df = self.db.create_data_frame(table)
table_name = f"my_table_{index}"

df.to_view(table_name)
Do we need overwrite=True here?
{"from_sedona_func", from_sedona_func, METH_VARARGS,
 "Deserialize bytes-like object to geometry object."},
{"to_sedona_func", to_sedona_func, METH_VARARGS,
 "Deserialize bytes-like object to geometry object."},
These functions are for working with arrays; we should clarify this in the descriptions.
def register_sedona_db_udf(infile, pickle_ser) -> UDFInfo:
    num_udfs = read_int(infile)

    udf = None
    for _ in range(num_udfs):
        udf = read_udf(infile, pickle_ser)

    return udf
Is it intended to discard all UDFs except the last one?
Yes, it supports only one level of function nesting so far. I would like to extend this functionality in the next MRs so as not to overwhelm the reviews.
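For one of those follow-up MRs, a minimal sketch of keeping every deserialized UDF instead of only the last one, reusing read_int and read_udf from the diff above, might look like this:

```python
# Sketch only: collect all UDFs from the stream rather than keeping the last.
# read_int and read_udf are the helpers already used in the diff above.
def register_sedona_db_udfs(infile, pickle_ser) -> list:
    num_udfs = read_int(infile)
    return [read_udf(infile, pickle_ser) for _ in range(num_udfs)]
```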
#2593 (comment) Yeah, that's a good question. I am not sure whether the monkeypatching would be too hacky; maybe we can add a note in the file header that it is a one-to-one copy of what is in Apache Spark with one changed line.
How does it improve the performance of what we already have:
The current solution mitigates all those issues:
Instead of SedonaDB, we could use GeoPandas, as Apache Spark already does with Pandas. However, we already have SedonaDB in the ecosystem, so why not use it? I expect the Python UDFs in SedonaDB to improve further; in the most optimized version we now run Shapely over the Arrow arrays, which is more efficient than what we already have, to the point where the buffer version is faster than the native one in Sedona.
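As a generic illustration of the "Shapely over Arrow/pandas batches" point (plain PySpark APIs, not the new code path added in this MR):

```python
# Generic illustration, not this MR's API: a vectorized buffer over WKB values
# with a pandas UDF, so Shapely runs once per Arrow batch instead of per row.
import pandas as pd
import shapely
from pyspark.sql.functions import pandas_udf

@pandas_udf("binary")
def buffer_wkb(wkb: pd.Series) -> pd.Series:
    geoms = shapely.from_wkb(wkb.to_numpy())   # one call for the whole batch
    buffered = shapely.buffer(geoms, 1.0)
    return pd.Series(list(shapely.to_wkb(buffered)))
```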
@Kontinuation, I'll fix the errors once we have consensus on the direction. I am planning to add support for GeoPandas alongside SedonaDB.
@Kontinuation I would like this to be perceived as fundamental to making spatial Python UDFs more efficient.

Did you read the Contributor Guide?
Is this PR related to a ticket?
[SEDONA-738] my subject.
What changes were proposed in this PR?
A Sedona vectorized UDF (Apache Arrow exchange) that utilizes SedonaDB. It supports:
How was this patch tested?
unit tests
Did this PR include necessary documentation updates?
TODO