DataFusion provides a large number of built-in functions for performing complex queries without requiring user-defined functions. In here we will cover some of the more popular use cases. If you want to view all the functions go to the :py:mod:`Functions <datafusion.functions>` API Reference.
We'll use the pokemon dataset in the following examples.
.. ipython:: python
import urllib.request
from datafusion import SessionContext
urllib.request.urlretrieve(
"https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
"pokemon.csv",
)
ctx = SessionContext()
ctx.register_csv("pokemon", "pokemon.csv")
df = ctx.table("pokemon")
DataFusion offers mathematical functions such as :py:func:`~datafusion.functions.pow` or :py:func:`~datafusion.functions.log`
.. ipython:: python
from datafusion import col, literal
from datafusion import functions as f
df.select(
f.pow(col('"Attack"'), literal(2)) - f.pow(col('"Defense"'), literal(2))
).limit(10)
There 3 conditional functions in DataFusion :py:func:`~datafusion.functions.coalesce`, :py:func:`~datafusion.functions.nullif` and :py:func:`~datafusion.functions.case`.
.. ipython:: python
df.select(
f.coalesce(col('"Type 1"'), col('"Type 2"')).alias("dominant_type")
).limit(10)
For selecting the current time use :py:func:`~datafusion.functions.now`
.. ipython:: python
df.select(f.now())
Convert to timestamps using :py:func:`~datafusion.functions.to_timestamp`
.. ipython:: python
df.select(f.to_timestamp(col('"Total"')).alias("timestamp"))
Extracting parts of a date using :py:func:`~datafusion.functions.date_part` (alias :py:func:`~datafusion.functions.extract`)
.. ipython:: python
df.select(
f.date_part(literal("month"), f.to_timestamp(col('"Total"'))).alias("month"),
f.extract(literal("day"), f.to_timestamp(col('"Total"'))).alias("day")
)
In the field of data science, working with textual data is a common task. To make string manipulation easier, DataFusion offers a range of helpful options.
.. ipython:: python
df.select(
f.char_length(col('"Name"')).alias("len"),
f.lower(col('"Name"')).alias("lower"),
f.left(col('"Name"'), literal(4)).alias("code")
)
This also includes the functions for regular expressions like :py:func:`~datafusion.functions.regexp_replace` and :py:func:`~datafusion.functions.regexp_match`
.. ipython:: python
df.select(
f.regexp_match(col('"Name"'), literal("Char")).alias("dragons"),
f.regexp_replace(col('"Name"'), literal("saur"), literal("fleur")).alias("flowers")
)
The function :py:func:`~datafusion.functions.in_list` allows to check a column for the presence of multiple values:
.. ipython:: python
types = [literal("Grass"), literal("Fire"), literal("Water")]
(
df.select(f.in_list(col('"Type 1"'), types, negated=False).alias("basic_types"))
.limit(20)
.to_pandas()
)