Skip to content

PoC: New dataframe read source interface#1864

Draft
Jolanrensen wants to merge 20 commits into
masterfrom
new-DataFrameReadSource
Draft

PoC: New dataframe read source interface#1864
Jolanrensen wants to merge 20 commits into
masterfrom
new-DataFrameReadSource

Conversation

@Jolanrensen
Copy link
Copy Markdown
Collaborator

@Jolanrensen Jolanrensen commented May 18, 2026

#450
@zaleslaw

WIP and proof-of-concept.

Drafting and exploring what a new DataFrame.read() could be and do (together with claude). (named readSource() for now)

In its current state you can give it anything, and it figures out the rest (be that an ArrowReader, a URL, a String, or an Excel sheet). Extra options can be provided when needed.

DataRow.readSource() also works.

It also comes with a DataFrameSchema.readSource(), if you need just the types (overridden by jdbc), and something like CodeString.read() maybe, if you just need the generated interfaces (overridden by openapi-generator).

I'm also thinking about what a unified system like this could bring to the rest of dataframe. It will be very easy, for instance, to hook it into our parsers or converters! Currently the only format we can parse/convert is json Strings->DataFrame, but this could open up any conversion to DataFrame.

I prototyped it in our convert operation, meaning you can convert any supported type to DataRow, DataFrame, or DataFrameSchema now :)

I also tried to implement it for parse, since JSON parsing was already there. This appears to be a bit trickier though. There's a lot of edge-cases, where, for instance, "[a b c]" can successfully be parsed as CSV, causing all sorts of issues later on.
I did manage to make this pass all tests so far though, by making "parsing to dataframe read source" optional (false by default), enabling it only where needed and adding some extra checks for String input of CSV and JSON.

Writing is also quite interesting, it can figure out which format to use based on the extension alone: df.write("path/to/some.json") will write JSON and df.write("some.csv") will write CSV!
For mixed types like buildString { df.write(this) }, you will need to specify a DataFrameWriteOptions explicitly (Like DataFrameWriteOptions.Json() or Tsv.WriteOptions()), otherwise multiple targets match, and for writing, that should not be allowed.

@Jolanrensen Jolanrensen force-pushed the new-DataFrameReadSource branch from e592bd7 to 2696eed Compare May 20, 2026 10:47
@Jolanrensen Jolanrensen force-pushed the new-DataFrameReadSource branch from 2696eed to aa2bd1b Compare May 20, 2026 10:53
@Jolanrensen Jolanrensen added enhancement New feature or request research This requires a deeper dive to gather a better understanding API If it touches our API labels May 29, 2026
Copy link
Copy Markdown
Collaborator

@zaleslaw zaleslaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the exploration, and especially for the two approaches which could be compared


override val supportedReadingTypes: Set<KType> =
setOf(
typeOf<Connection>(),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meta layer, hmm

internal val EXTENSIONS: Set<String> = setOf("xls", "xlsx")
internal val MIME_TYPES: Set<String> = setOf(
"application/vnd.ms-excel",
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it give us?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mimetypes? if the file in question has a different extension than expected (or no extension), the mime-type will hint us to what it is. These are just expected mime-types for different sorts of excel files.

val csvPath = "../data/movies.csv"
val expected = DataFrame.readCsv(csvPath)

DataFrame.readSource(csvPath) shouldBe expected
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read or more verbose readDataFrom probably looks better?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it should be DataFrame.read() but I didn't want it to clash with the old one. Imagine all looking like DataFrame.read() ;P

val tableOpts = Jdbc2.ReadOptions(sqlQueryOrTableName = "Customer")

DataFrame.readSource(config, tableOpts) shouldBe expected
DataFrame.readSource(config, Jdbc2.ReadOptions(sqlQueryOrTableName = "SELECT * FROM Customer")) shouldBe expected
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was make the API for databases simple and light-weight with default parameters to avoid configurations and fine-tunes, but now we need to build more navigation and edge objects

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's true, this is just what Claude came up with to let JDBC work together with DataFrameReadSource and still allow all options to be passed. It is probably better to use the respective function indeed.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, I just shared my perception if I need to write a code at the top of such API

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, it's also not ideal. For other sources we can provide zero-configuration reading, even allowing it in parsers/converters because no options are needed. For JDBC, unfortunately, you will always need options.

conn.prepareStatement("SELECT * FROM Customer").executeQuery(),
H2(),
)
val schema = DataFrameSchema.readSource(rs, Jdbc2.ReadOptions(dbType = H2()))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jdbc2.ReadOptions gives a freedom with dictionary of properties, but we could specify better
I prefer to not demonstrate such apporach in examples because it gives yet another level of complexity

old-good specified API with

readCsv/readExcel/read something more easier to differ during code reading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

API If it touches our API enhancement New feature or request research This requires a deeper dive to gather a better understanding

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants