PoC: New dataframe read source interface#1864
Conversation
…test and production code for improved API unification and flexibility.
… in converters/parsers
e592bd7 to
2696eed
Compare
… parseToDataFrameReadSource parser option.
2696eed to
aa2bd1b
Compare
zaleslaw
left a comment
There was a problem hiding this comment.
Thanks for the exploration, and especially for the two approaches which could be compared
|
|
||
| override val supportedReadingTypes: Set<KType> = | ||
| setOf( | ||
| typeOf<Connection>(), |
| internal val EXTENSIONS: Set<String> = setOf("xls", "xlsx") | ||
| internal val MIME_TYPES: Set<String> = setOf( | ||
| "application/vnd.ms-excel", | ||
| "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", |
There was a problem hiding this comment.
mimetypes? if the file in question has a different extension than expected (or no extension), the mime-type will hint us to what it is. These are just expected mime-types for different sorts of excel files.
| val csvPath = "../data/movies.csv" | ||
| val expected = DataFrame.readCsv(csvPath) | ||
|
|
||
| DataFrame.readSource(csvPath) shouldBe expected |
There was a problem hiding this comment.
read or more verbose readDataFrom probably looks better?
There was a problem hiding this comment.
yes it should be DataFrame.read() but I didn't want it to clash with the old one. Imagine all looking like DataFrame.read() ;P
| val tableOpts = Jdbc2.ReadOptions(sqlQueryOrTableName = "Customer") | ||
|
|
||
| DataFrame.readSource(config, tableOpts) shouldBe expected | ||
| DataFrame.readSource(config, Jdbc2.ReadOptions(sqlQueryOrTableName = "SELECT * FROM Customer")) shouldBe expected |
There was a problem hiding this comment.
The idea was make the API for databases simple and light-weight with default parameters to avoid configurations and fine-tunes, but now we need to build more navigation and edge objects
There was a problem hiding this comment.
Yes, that's true, this is just what Claude came up with to let JDBC work together with DataFrameReadSource and still allow all options to be passed. It is probably better to use the respective function indeed.
There was a problem hiding this comment.
No problem, I just shared my perception if I need to write a code at the top of such API
There was a problem hiding this comment.
true, it's also not ideal. For other sources we can provide zero-configuration reading, even allowing it in parsers/converters because no options are needed. For JDBC, unfortunately, you will always need options.
| conn.prepareStatement("SELECT * FROM Customer").executeQuery(), | ||
| H2(), | ||
| ) | ||
| val schema = DataFrameSchema.readSource(rs, Jdbc2.ReadOptions(dbType = H2())) |
There was a problem hiding this comment.
Jdbc2.ReadOptions gives a freedom with dictionary of properties, but we could specify better
I prefer to not demonstrate such apporach in examples because it gives yet another level of complexity
old-good specified API with
readCsv/readExcel/read something more easier to differ during code reading
#450
@zaleslaw
WIP and proof-of-concept.
Drafting and exploring what a new
DataFrame.read()could be and do (together with claude). (namedreadSource()for now)In its current state you can give it anything, and it figures out the rest (be that an ArrowReader, a URL, a String, or an Excel sheet). Extra options can be provided when needed.
DataRow.readSource()also works.It also comes with a
DataFrameSchema.readSource(), if you need just the types (overridden by jdbc), and something likeCodeString.read()maybe, if you just need the generated interfaces (overridden by openapi-generator).I'm also thinking about what a unified system like this could bring to the rest of dataframe. It will be very easy, for instance, to hook it into our parsers or converters! Currently the only format we can parse/convert is json Strings->DataFrame, but this could open up any conversion to DataFrame.
I prototyped it in our
convertoperation, meaning you can convert any supported type toDataRow,DataFrame, orDataFrameSchemanow :)I also tried to implement it for
parse, since JSON parsing was already there. This appears to be a bit trickier though. There's a lot of edge-cases, where, for instance,"[a b c]"can successfully be parsed as CSV, causing all sorts of issues later on.I did manage to make this pass all tests so far though, by making "parsing to dataframe read source" optional (false by default), enabling it only where needed and adding some extra checks for String input of CSV and JSON.
Writing is also quite interesting, it can figure out which format to use based on the extension alone:
df.write("path/to/some.json")will write JSON anddf.write("some.csv")will write CSV!For mixed types like
buildString { df.write(this) }, you will need to specify aDataFrameWriteOptionsexplicitly (LikeDataFrameWriteOptions.Json()orTsv.WriteOptions()), otherwise multiple targets match, and for writing, that should not be allowed.