Binary file modified docs/images/Variable_Transformation.png
90 changes: 27 additions & 63 deletions docs/user_guide/transformation/ArcSinhTransformer.rst
Expand Up @@ -5,15 +5,15 @@
ArcSinhTransformer
==================

The inverse hyperbolic sine (or arcsinh) transformation is a variance-stabilizing
The inverse hyperbolic sine (or arcsinh) transformation is a variance stabilising
transformation that achieves results similar to the logarithmic transformation,
while retaining zero values in a variable, something the logarithm cannot do. It has
gained popularity in recent years; therefore, we add support for it in Feature-engine.

Variance stabilizing transformations
Variance stabilising transformations
------------------------------------

Variance stabilizing transformations are commonly used in regression analysis to make
Variance stabilising transformations are commonly used in regression analysis to make
skewed data more evenly distributed, approximate normality, or reduce heteroscedasticity.
One of the most commonly used transformations is the logarithm. However, the logarithm
transformation has one limitation: it is not defined for the value 0.
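As a minimal sketch of this limitation, the snippet below compares the logarithm with the arcsinh on a negative value, zero, and a positive value, using only Python's standard `math` module. The helper `safe_log` is a hypothetical name introduced for illustration; it is not part of Feature-engine.

```python
import math


def safe_log(x):
    """Return log(x), or None where the logarithm is undefined (x <= 0)."""
    try:
        return math.log(x)
    except ValueError:
        # math.log raises ValueError for zero and negative inputs
        return None


values = [-2.0, 0.0, 2.0]

# The logarithm is undefined for the first two values.
log_results = [safe_log(v) for v in values]

# The arcsinh is defined on the whole real line, including zero.
ihs_results = [math.asinh(v) for v in values]

print(log_results)
print(ihs_results)
```

The logarithm only returns a value for the positive input, while the arcsinh returns a finite value for all three and maps zero to zero.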
Expand All @@ -23,7 +23,7 @@ the logarithm is undefined, researchers developed a number of alternatives to tr
those zeros.

The simplest alternative consists of adding 1 (or another constant value) to the variable. In fact,
the Box-Cox transformation is a generalized version of power transformations that automatically
the Box-Cox transformation is a generalised version of power transformations that automatically
introduces a shift in 0 valued observations before applying the logarithm.

However, adding 1 (or a constant) before applying a log transformation is arbitrary and can
Expand All @@ -42,8 +42,10 @@ The inverse hyperbolic sine (IHS) transformation is defined as follows:

x' = \operatorname{arcsinh}(x) = \ln\left(x + \sqrt{x^2 + 1}\right)

The IHS transformation works with data defined on the whole real line including
negative values and zeros. For large values of x, the IHS behaves like a log
The IHS transformation works with data defined on the whole real space including
negative values and zeros.

For large values of x, the IHS behaves like a log
transformation. For small values of x, or in other words as x approaches 0, IHS(x)
approaches x.
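The definition and its two limiting behaviours can be checked numerically. The following is a small sketch using Python's standard `math` module; the function name `ihs` and the chosen test values are assumptions made for illustration only.

```python
import math


def ihs(x: float) -> float:
    """Inverse hyperbolic sine written out from its definition."""
    return math.log(x + math.sqrt(x * x + 1.0))


# The closed form agrees with the built-in arcsinh, also for negatives and zero.
for x in (-3.0, 0.0, 0.5, 3.0, 1000.0):
    assert math.isclose(ihs(x), math.asinh(x), rel_tol=1e-9, abs_tol=1e-12)

# For large x, arcsinh(x) is close to ln(2x), that is, ln(x) plus a constant.
large_gap = abs(math.asinh(1000.0) - math.log(2 * 1000.0))

# For small x, arcsinh(x) is close to x itself.
small_gap = abs(math.asinh(0.001) - 0.001)

print(large_gap, small_gap)
```

Both gaps are tiny, which illustrates why the IHS behaves like a shifted logarithm for large values and like the identity near zero.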

Expand Down Expand Up @@ -187,7 +189,7 @@ In the bottom panels we see the effect of the inverse hyperbolic sine transforma

The fundamental message of this experiment is that:

- Changing the variable scale will affect the variance stabilizing power of the IHS transformation
- Changing the variable scale will affect the variance stabilising power of the IHS transformation
- Reducing the scale (multiplying by values <1) increases the separation of larger values from zero values (second panel), which is probably not what we want
- Increasing the scale substantially may also result in suboptimal distributions, as shown on the right panel
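The scale dependence summarised in the list above can be sketched with two values of a variable, 1 and 100, expressed in different units. The values, the scale factors, and the variable names below are assumptions chosen for illustration; they do not reproduce the experiment shown in the figure.

```python
import math

# Two observations of the same variable, 100 times apart on the raw scale.
small, large = 1.0, 100.0

# Multiplying by a scale factor mimics a change in the unit of measurement.
scales = (0.001, 1.0, 1000.0)

# Ratio between the transformed values under each unit of measurement.
ratios = {s: math.asinh(s * large) / math.asinh(s * small) for s in scales}

for s, r in ratios.items():
    print(f"scale={s:>8}: arcsinh(large) / arcsinh(small) = {r:.2f}")
```

With a small scale the transformation is almost linear, so the larger value stays far from the smaller one. With a large scale it behaves like a logarithm and compresses the gap strongly.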

Expand Down Expand Up @@ -285,7 +287,7 @@ negative values after the transformation (middle panel).
Limitations of the IHS
----------------------

As with all variance stabilizing transformations, the IHS comes with limitations, being,
As with all variance stabilising transformations, the IHS comes with a limitation:
the result of the transformation depends largely on the variable scale, by the very definition
of the transformation.

Expand Down Expand Up @@ -313,8 +315,8 @@ separation of larger values of the variable from 0.
Unlike :class:`LogTransformer()`, :class:`ArcSinhTransformer()` can handle
zero and negative values without requiring any preprocessing (or so we wanted to think).

Python demo
-----------
Python implementation
---------------------

In this demo, we'll show how to use the inverse hyperbolic sine transformation with care.

Expand Down Expand Up @@ -414,7 +416,7 @@ variables:
test_t.hist(bins=20, figsize=(8,4))
plt.show()

In the following figure, we see that while the arcsinh transformation seemed to stabilize the
In the following figure, we see that while the arcsinh transformation seemed to stabilise the
variance of the variable profit, it does an awful job for the variable net-worth:

.. image:: ../../images/arcsinh-ihs.png
Expand All @@ -426,7 +428,7 @@ Scaling the distribution before arcsinh
center and rescale data before transformation.

We discussed previously that re-scaling the variables before applying the arcsinh transformation
can help achieve better variance stabilizing results.
can help achieve better variance stabilising results.

Let's rescale the variable profit before applying the arcsinh transformation and then display
the histogram of the resulting dataframe:
Expand Down Expand Up @@ -456,7 +458,7 @@ Shifting the distribution before arcsinh
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We mentioned previously that shifting the variables before applying the arcsinh transformation
can help achieve better variance stabilizing results.
can help achieve better variance stabilising results.

Let's shift the variable profit before applying the arcsinh transformation, to make all its
values positive. After that, we display the histogram of the resulting dataframe:
Expand Down Expand Up @@ -550,53 +552,15 @@ For more details on the inverse hyperbolic sine transformation, check the follow
3. `Burbidge, J. B., Magee, L., & Robb, A. L. (1988). Alternative transformations to handle extreme values of the dependent variable. Journal of the American Statistical Association. <https://www.jstor.org/stable/2288929>`_
4. `Aihounton, Henningsen. (2020). Units of measurement and the inverse hyperbolic sine transformation. The Econometrics Journal. <https://academic.oup.com/ectj/article-abstract/24/2/334/5948096>`_

Tutorials, books and courses
----------------------------

For tutorials about variance stabilizing transformations, check out our online course:

.. figure:: ../../images/feml.png
:width: 300
:figclass: align-center
:align: left
:target: https://www.trainindata.com/p/feature-engineering-for-machine-learning

Feature Engineering for Machine Learning

|
|
|
|
|
|
|
|
|
|

Or read our book:

.. figure:: ../../images/cookbook.png
:width: 200
:figclass: align-center
:align: left
:target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587

Python Feature Engineering Cookbook

|
|
|
|
|
|
|
|
|
|
|
|
|

Both our book and course are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.
Additional resources
--------------------

For tutorials about this and other feature engineering methods check out these resources:

- `Feature Engineering for Machine Learning <https://www.trainindata.com/p/feature-engineering-for-machine-learning>`_, online course.
- `Feature Engineering for Time Series Forecasting <https://www.trainindata.com/p/feature-engineering-for-forecasting>`_, online course.
- `Python Feature Engineering Cookbook <https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587>`_, book.

Both our book and courses are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting `Sole <https://linkedin.com/in/soledad-galli>`_,
the main developer of feature-engine.
115 changes: 43 additions & 72 deletions docs/user_guide/transformation/ArcsinTransformer.rst
Expand Up @@ -5,29 +5,35 @@
ArcsinTransformer
=================

The :class:`ArcsinTransformer()` applies the arcsin transformation to
numerical variables.

The arcsine transformation, also called arcsin square root transformation, or
angular transformation, takes the form of arcsin(sqrt(x)) where x is a real number
between 0 and 1.
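As a rough sketch of the formula, the snippet below applies arcsin(sqrt(x)) to a few proportions with Python's standard `math` module. The function `arcsin_sqrt` and its range check are hypothetical and written for illustration; they are not the implementation used by :class:`ArcsinTransformer()`.

```python
import math


def arcsin_sqrt(x: float) -> float:
    """Apply the arcsin square root transformation to a value in [0, 1]."""
    if not 0.0 <= x <= 1.0:
        # The transformation is only defined for values between 0 and 1.
        raise ValueError(f"value {x} is outside the range [0, 1]")
    return math.asin(math.sqrt(x))


proportions = [0.0, 0.25, 0.5, 1.0]
transformed = [arcsin_sqrt(p) for p in proportions]

print(transformed)
```

The transformed values lie between 0 and pi/2, and the transformation stretches the regions close to 0 and 1 relative to the middle of the range.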

The arcsin square root transformation helps in dealing with probabilities,
percentages, and proportions.
.. tip::

The arcsin square root transformation helps in dealing with probabilities,
percentages, and proportions.

:class:`ArcsinTransformer()` applies the arcsin transformation to
numerical variables.

.. note::

:class:`ArcsinTransformer()` only works with numerical variables with values
between 0 and 1. If the variable contains a value outside of this range, the
transformer will raise an error.

The :class:`ArcsinTransformer()` only works with numerical variables with values
between 0 and 1. If the variable contains a value outside of this range, the
transformer will raise an error.
Python implementation
---------------------

Example
~~~~~~~
In this section, we'll show how to apply the arcsin square root transformation with
:class:`ArcsinTransformer()`.

Let's load the breast cancer dataset from scikit-learn and separate it into train and
test sets.

.. code:: python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
Expand All @@ -43,8 +49,8 @@ test sets.
# Separate data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Now we want to apply the arcsin transformation to some of the variables in the
dataframe. These variables values are in the range 0-1, as we will see in coming
We want to apply the arcsin transformation to some of the variables in the
dataframe. These variables' values are in the range 0-1, as we will see in coming
histograms.

First, let's make a list with the variable names:
Expand All @@ -65,7 +71,7 @@ First, let's make a list with the variable names:
'worst symmetry',
'worst fractal dimension']

Now, let's set up the arcsin transformer to modify only the previous variables:
Now, let's set up the arcsin transformer to modify the previous variables:

.. code:: python

Expand All @@ -74,9 +80,11 @@ Now, let's set up the arscin transformer to modify only the previous variables:

# fit the transformer
tf.fit(X_train)

The transformer does not learn any parameters when applying the fit method. It does
check however that the variables are numericals and with the correct value range.

.. note::

The transformer does not learn any parameters when applying the fit method. It does
check, however, that the variables are numerical and that their values are in the correct range.

We can now go ahead and transform the variables:

Expand All @@ -86,20 +94,21 @@ We can now go ahead and transform the variables:
train_t = tf.transform(X_train)
test_t = tf.transform(X_test)

And that's it, now the variables have been transformed with the arcsin formula.
That's it, now the variables have been transformed with the arcsin formula.

Finally, let's make a histogram for each of the original variables to examine their
distribution:
Let's go ahead and check out the effect of the transformation on the variables' distribution.
We'll start by making a histogram for each of the original variables:

.. code:: python

# original variables
X_train[vars_].hist(figsize=(20,20))

You can see in the following image that the variables are skewed. Note
that all variables have values between 0 and 1:

.. image:: ../../images/breast_cancer_raw.png

You can see in the previous image that many of the variables are skewed. Note however,
that all variables had values between 0 and 1.

Now, let's examine the distribution after the transformation:

Expand All @@ -108,60 +117,22 @@ Now, let's examine the distribution after the transformation:
# transformed variable
train_t[vars_].hist(figsize=(20,20))

In the following image, we see that many of the variables have a more Gaussian looking
shape after the transformation:

.. image:: ../../images/breast_cancer_arcsin.png

You can see in the previous image that many variables have after the transformation a
more Gaussian looking shape.


Additional resources
--------------------

For more details about this and other feature engineering methods check out these resources:


.. figure:: ../../images/feml.png
:width: 300
:figclass: align-center
:align: left
:target: https://www.trainindata.com/p/feature-engineering-for-machine-learning

Feature Engineering for Machine Learning

|
|
|
|
|
|
|
|
|
|

Or read our book:

.. figure:: ../../images/cookbook.png
:width: 200
:figclass: align-center
:align: left
:target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587

Python Feature Engineering Cookbook

|
|
|
|
|
|
|
|
|
|
|
|
|

Both our book and course are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.
For tutorials about this and other feature engineering methods check out these resources:

- `Feature Engineering for Machine Learning <https://www.trainindata.com/p/feature-engineering-for-machine-learning>`_, online course.
- `Feature Engineering for Time Series Forecasting <https://www.trainindata.com/p/feature-engineering-for-forecasting>`_, online course.
- `Python Feature Engineering Cookbook <https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587>`_, book.

Both our book and courses are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting `Sole <https://linkedin.com/in/soledad-galli>`_,
the main developer of feature-engine.