
Add replace { }.by { } API for multi-column replacement using AddDsl  #1749

@Jolanrensen


Converting our "movies" example to the compiler plugin, I came across the following notation:

.split { title }.by {
    listOf<Any>(
        """\s*\(\d{4}\)\s*$""".toRegex().replace(title, ""),
        "\\d{4}".toRegex().findAll(title).lastOrNull()?.value?.toIntOrNull() ?: -1,
    )
}.into("title", "year")

This creates the columns title: DataColumn<String> and year: DataColumn<Int>. The problem with this list notation is that the compiler plugin cannot easily infer the column types: it only sees Any, because that is the element type of the list.
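To illustrate the typing problem outside of DataFrame, here is a minimal plain-Kotlin sketch. The function names `asList` and `asPair` are made up for this example and are not part of any library; the point is only that a `List<Any>` erases the per-element types, while a structure with one type parameter per value keeps them.

```kotlin
// Hypothetical helpers for illustration only; not DataFrame API.

// Both values are stored in one list, so the static element type is Any.
fun asList(title: String, year: Int): List<Any> = listOf<Any>(title, year)

// A Pair has one type parameter per value, so both types stay visible.
fun asPair(title: String, year: Int): Pair<String, Int> = title to year

fun main() {
    val list = asList("Toy Story", 1995)
    // val t: String = list[0]   // would not compile: list[0] is statically Any
    val t = list[0] as String    // a cast is needed to recover the type

    val (title, year) = asPair("Toy Story", 1995) // String and Int, no cast
    println("$t / $title / $year")
}
```

Static analysis such as the compiler plugin sees only the declared types, which is why the list form yields Any for both resulting columns.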

I tried several alternatives, but they all boiled down to multiple steps:

  • add title2 and year, remove the old title, rename title2 to title
  • replace or convert the title column to a column group of title/year, ungroup it again
  • use split and then cast or requireColumn

But all of these seem far too complicated for what we want here: simply replacing one column with two new ones.

A new API could look like:

df.replace { title }.by { // AddDsl
    "title" from {
        """\s*\(\d{4}\)\s*$""".toRegex().replace(title, "")
    }
    "year" from {
        "\\d{4}".toRegex().findAll(title).lastOrNull()?.value?.toIntOrNull() ?: -1
    }
}
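For reference, the two expressions used above can be tried in isolation. This is a plain-Kotlin sketch without DataFrame; the function names `cleanTitle` and `extractYear` and the sample titles are assumptions made for the example.

```kotlin
// Same regular expressions as in the snippet above, wrapped in
// hypothetical helper functions so they can be run on their own.
val yearSuffix = """\s*\(\d{4}\)\s*$""".toRegex()
val fourDigits = "\\d{4}".toRegex()

// Strips a trailing "(YYYY)" and the whitespace around it.
fun cleanTitle(title: String): String = yearSuffix.replace(title, "")

// Takes the last four-digit run in the title, or -1 if there is none.
fun extractYear(title: String): Int =
    fourDigits.findAll(title).lastOrNull()?.value?.toIntOrNull() ?: -1

fun main() {
    println(cleanTitle("Toy Story (1995)"))               // Toy Story
    println(extractYear("Toy Story (1995)"))              // 1995
    println(extractYear("2001: A Space Odyssey (1968)"))  // 1968
    println(extractYear("Untitled"))                      // -1
}
```

Note that the year expression matches any four-digit run, not only one in parentheses, so a title that ends in a number but has no year suffix would still produce a "year".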

or more generally:

df.replace { name and firstName and age }.by {
    "name" from { 
         "$firstName $name ($age)"
    }
    "welcomeMessage" from {
        "Hi, $firstName!"
    }
}

We could explore other notations or names, of course, but it would boil down to doing the following in one operation:

  • remove some columns
  • add some new ones
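The intended semantics can be sketched on a toy model where a row is just a map from column name to value. This is not the DataFrame API and not a proposed implementation: `Row`, `replaceColumns`, and the sample data are hypothetical, and the sketch simply appends the new columns at the end instead of deciding where they should be placed.

```kotlin
// Toy model for illustration only: one row = column name -> value.
typealias Row = Map<String, Any?>

// Hypothetical helper: removes the given columns and adds new ones,
// where every new value is computed from the ORIGINAL row, so the
// removed columns are still readable while computing.
fun List<Row>.replaceColumns(
    removed: Set<String>,
    added: Map<String, (Row) -> Any?>,
): List<Row> = map { row ->
    val kept = row.filterKeys { it !in removed }
    kept + added.mapValues { (_, compute) -> compute(row) }
}

fun main() {
    val rows: List<Row> = listOf(
        mapOf("firstName" to "Alice", "name" to "Cooper", "age" to 15, "city" to "London"),
    )
    val result = rows.replaceColumns(
        removed = setOf("name", "firstName", "age"),
        added = mapOf(
            "name" to { r: Row -> "${r["firstName"]} ${r["name"]} (${r["age"]})" },
            "welcomeMessage" to { r: Row -> "Hi, ${r["firstName"]}!" },
        ),
    )
    println(result)
}
```

A new column may reuse the name of a removed one ("name" here), which is exactly the case that takes several steps with add, remove, and rename.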

Labels: API, Compiler plugin, enhancement