PoC: New dataframe read source interface by Jolanrensen · Pull Request #1864 · Kotlin/dataframe

Jolanrensen · 2026-05-18T13:07:12Z

WIP and proof-of-concept.

Drafting and exploring what a new DataFrame.read() could be and do (together with Claude). It is named readSource() for now.

In its current state you can give it anything, and it figures out the rest (be that an ArrowReader, a URL, a String, or an Excel sheet). Extra options can be provided when needed.
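To make the "give it anything" idea concrete, here is a minimal sketch of how dispatching on the input could look. It is not the interface from this PR: `SourceReaderSketch`, `ExtensionReaderSketch`, `registeredReaders` and `findReader` are hypothetical names, and matching is reduced to file extensions on String, File and URI inputs.

```kotlin
import java.io.File
import java.net.URI

// Hypothetical sketch; these names are illustrative and not taken from the PR.
interface SourceReaderSketch {
    val formatName: String

    // Decides whether this reader can handle the given input at all.
    fun accepts(input: Any): Boolean
}

// A reader that recognises String paths, Files and URIs by their extension.
class ExtensionReaderSketch(
    override val formatName: String,
    private val extension: String,
) : SourceReaderSketch {
    override fun accepts(input: Any): Boolean {
        val name = when (input) {
            is File -> input.name
            is URI -> input.path ?: return false // opaque URIs have no path
            is String -> input
            else -> return false
        }
        return name.substringAfterLast('.', "").equals(extension, ignoreCase = true)
    }
}

val registeredReaders: List<SourceReaderSketch> = listOf(
    ExtensionReaderSketch("csv", "csv"),
    ExtensionReaderSketch("json", "json"),
)

// Returns the first registered reader that accepts the input, or null if none does.
fun findReader(input: Any): SourceReaderSketch? =
    registeredReaders.firstOrNull { it.accepts(input) }
```

Under these assumptions, `findReader("../data/movies.csv")` resolves to the CSV reader, while an unsupported input such as `findReader(42)` yields null, which is where explicit options would have to step in.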

DataRow.readSource() also works.

It also comes with a DataFrameSchema.readSource(), if you need just the types (overridden by JDBC), and maybe something like CodeString.read(), if you just need the generated interfaces (overridden by openapi-generator).

I'm also thinking about what a unified system like this could bring to the rest of dataframe. It would be very easy, for instance, to hook it into our parsers or converters! Currently the only format we can parse/convert is JSON Strings -> DataFrame, but this could open up any conversion to DataFrame.

I prototyped it in our convert operation, meaning you can convert any supported type to DataRow, DataFrame, or DataFrameSchema now :)

I also tried to implement it for parse, since JSON parsing was already there. This appears to be a bit trickier, though. There are a lot of edge cases where, for instance, "[a b c]" can successfully be parsed as CSV, causing all sorts of issues later on.
I did manage to make this pass all tests so far, though, by making "parsing to dataframe read source" optional (false by default), enabling it only where needed, and adding some extra checks for String input of CSV and JSON.
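The PR text does not show what the extra checks are, so the following is only one possible heuristic, sketched under that assumption. `ParserOptionsSketch`, `looksLikeCsv` and `shouldTryCsv` are hypothetical names; the check is deliberately naive (it ignores quoting) and only illustrates how an opt-in flag plus a stricter shape test keeps "[a b c]" from being treated as CSV.

```kotlin
// Hypothetical sketch; the real checks in the PR may look quite different.

// Mirrors the idea of an opt-in parser option that is false by default.
data class ParserOptionsSketch(val parseToReadSource: Boolean = false)

// Naive shape test: at least two non-blank lines, more than one column,
// and the same number of columns on every line. Quoted delimiters are ignored.
fun looksLikeCsv(text: String, delimiter: Char = ','): Boolean {
    val lines = text.lines().filter { it.isNotBlank() }
    if (lines.size < 2) return false
    val columns = lines.first().count { it == delimiter } + 1
    return columns > 1 && lines.all { line -> line.count { it == delimiter } + 1 == columns }
}

// Only attempts CSV parsing when the caller opted in and the text has a CSV-like shape.
fun shouldTryCsv(text: String, options: ParserOptionsSketch = ParserOptionsSketch()): Boolean =
    options.parseToReadSource && looksLikeCsv(text)
```

With the default options nothing is attempted; with `ParserOptionsSketch(parseToReadSource = true)`, "a,b\n1,2" qualifies but "[a b c]" does not, because it has a single line and a single column.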

Writing is also quite interesting: it can figure out which format to use based on the extension alone. df.write("path/to/some.json") will write JSON and df.write("some.csv") will write CSV!
For targets that can hold several formats, like buildString { df.write(this) }, you will need to specify a DataFrameWriteOptions explicitly (like DataFrameWriteOptions.Json() or Tsv.WriteOptions()). Otherwise multiple targets match, and for writing, that should not be allowed.
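The rule "exactly one format must match when writing" can be sketched as follows. This is an illustration only: `WriteFormatSketch`, `formatsForExtension` and `resolveWriteFormat` are hypothetical names, and the explicit format parameter merely stands in for passing a DataFrameWriteOptions.

```kotlin
import java.io.File

// Hypothetical sketch; not the PR's DataFrameWriteOptions hierarchy.
enum class WriteFormatSketch(val extensions: Set<String>) {
    JSON(setOf("json")),
    CSV(setOf("csv")),
    TSV(setOf("tsv"))
}

// All formats that claim the given file extension.
fun formatsForExtension(extension: String): List<WriteFormatSketch> =
    WriteFormatSketch.values().filter { extension.lowercase() in it.extensions }

// Picks the write format for a target. An explicit choice always wins;
// otherwise exactly one candidate must match, or the call fails.
fun resolveWriteFormat(target: Any, explicit: WriteFormatSketch? = null): WriteFormatSketch {
    if (explicit != null) return explicit
    val candidates: List<WriteFormatSketch> = when (target) {
        is String -> formatsForExtension(target.substringAfterLast('.', ""))
        is File -> formatsForExtension(target.extension)
        // A text sink such as StringBuilder could hold any text format.
        is Appendable -> WriteFormatSketch.values().toList()
        else -> emptyList()
    }
    return candidates.singleOrNull()
        ?: throw IllegalArgumentException(
            "Expected exactly one write format, found ${candidates.size}; pass one explicitly",
        )
}
```

In this sketch a path like "some.csv" resolves on its own, whereas a StringBuilder target throws unless a format is passed explicitly, which matches the behavior described above for buildString.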

zaleslaw

Thanks for the exploration, and especially for the two approaches, which can now be compared.

zaleslaw · 2026-05-29T14:50:43Z

+
+    override val supportedReadingTypes: Set<KType> =
+        setOf(
+            typeOf<Connection>(),


Meta layer, hmm

zaleslaw · 2026-05-29T14:53:10Z

+        internal val EXTENSIONS: Set<String> = setOf("xls", "xlsx")
+        internal val MIME_TYPES: Set<String> = setOf(
+            "application/vnd.ms-excel",
+            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",


What does it give us?

MIME types? If the file in question has a different extension than expected (or no extension), the MIME type gives us a hint as to what it is. These are just the expected MIME types for the different sorts of Excel files.
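A minimal sketch of that fallback, using the two sets from the diff above. The function name `isExcelSketch` is hypothetical, and the detected MIME type is simply passed in as a parameter here rather than sniffed, so no detection library call is assumed.

```kotlin
// The sets are copied from the diff; the function around them is a hypothetical sketch.
val EXTENSIONS: Set<String> = setOf("xls", "xlsx")
val MIME_TYPES: Set<String> = setOf(
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
)

// Trusts the extension first; falls back to the detected MIME type
// when the extension is missing or unexpected.
fun isExcelSketch(fileName: String, detectedMimeType: String?): Boolean {
    val extension = fileName.substringAfterLast('.', "").lowercase()
    if (extension in EXTENSIONS) return true
    return detectedMimeType != null && detectedMimeType.lowercase() in MIME_TYPES
}
```

So a file called "download" with no extension would still be recognised as Excel if the sniffed type is "application/vnd.ms-excel".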

zaleslaw · 2026-05-29T14:55:23Z

+        val csvPath = "../data/movies.csv"
+        val expected = DataFrame.readCsv(csvPath)
+
+        DataFrame.readSource(csvPath) shouldBe expected


Would read, or the more verbose readDataFrom, probably look better?

Yes, it should be DataFrame.read(), but I didn't want it to clash with the old one. Imagine them all looking like DataFrame.read() ;P

zaleslaw · 2026-05-29T14:57:52Z

+        val tableOpts = Jdbc2.ReadOptions(sqlQueryOrTableName = "Customer")
+
+        DataFrame.readSource(config, tableOpts) shouldBe expected
+        DataFrame.readSource(config, Jdbc2.ReadOptions(sqlQueryOrTableName = "SELECT * FROM Customer")) shouldBe expected


The idea was to make the API for databases simple and lightweight, with default parameters to avoid configuration and fine-tuning, but now we need to build more navigation and edge objects.

Yes, that's true, this is just what Claude came up with to let JDBC work together with DataFrameReadSource and still allow all options to be passed. It is probably better to use the respective function indeed.

No problem, I was just sharing how it would feel if I had to write code on top of such an API.

True, it's also not ideal. For other sources we can provide zero-configuration reading, even allowing it in parsers/converters because no options are needed. For JDBC, unfortunately, you will always need options.

zaleslaw · 2026-05-29T15:00:13Z

+                        conn.prepareStatement("SELECT * FROM Customer").executeQuery(),
+                        H2(),
+                    )
+                    val schema = DataFrameSchema.readSource(rs, Jdbc2.ReadOptions(dbType = H2()))


Jdbc2.ReadOptions gives freedom through a dictionary of properties, but we could be more specific.
I prefer not to demonstrate such an approach in examples because it adds yet another level of complexity.

The good old specific API, with readCsv/readExcel/read<Something>, is easier to tell apart when reading code.

Jolanrensen added 13 commits May 14, 2026 13:04

PoC for DataFrameReadSource

cddad9c

PoC for DataFrameReadSource with csv, tsv and excel support

f169768

PoC for DataFrameReadSource with jdbc

783d1dc

Refactored readReference and readFromData to readSource across …

08179ce

…test and production code for improved API unification and flexibility.

json early exit

6143eab

DataFrameSchema.readSource

0c93a45

added Arrow support to DataFrameReadSource

2283c93

moved supportedType to DataFrameReadSource so we could use it later…

36b722b

… in converters/parsers

DataFrameReadSource openapi support

a8ce712

DataRow.readSource function

b3aa890

put readSource functionality in convert operation

7b759f6

using apache tika to sniff mime types

1931297

api dump

0033315

Jolanrensen force-pushed the new-DataFrameReadSource branch from e592bd7 to 2696eed Compare May 20, 2026 10:47

tests for parsing json columns to check behavior still matches. Added…

aa2bd1b

… parseToDataFrameReadSource parser option.

Jolanrensen force-pushed the new-DataFrameReadSource branch from 2696eed to aa2bd1b Compare May 20, 2026 10:53

Jolanrensen added 6 commits May 26, 2026 21:15

rewrote -orNull logic to Results too

32fe347

adding missing test files

3979f37

wip writing support for DataFrameIO interface and json, todo: tests

2319c50

added some tests for df.write for json

9ea6776

rename all read-Options classes to ReadOptions

32a5a58

added shortcuts for read/write options for better discoverability

d041624

Jolanrensen added the enhancement, research, and API labels May 29, 2026

zaleslaw reviewed May 29, 2026


Jolanrensen wants to merge 20 commits into master from new-DataFrameReadSource


2 participants
