diff --git a/README.md b/README.md index 19ccf3416..6b9cdb24d 100644 --- a/README.md +++ b/README.md @@ -114,6 +114,7 @@ Please share your story by answering 1 quick question * PowerTransformer * BoxCoxTransformer * YeoJohnsonTransformer +* ArcSinhTransformer ### Variable Scaling methods * MeanNormalizationScaler diff --git a/docs/api_doc/transformation/ArcSinhTransformer.rst b/docs/api_doc/transformation/ArcSinhTransformer.rst new file mode 100644 index 000000000..69f315cbb --- /dev/null +++ b/docs/api_doc/transformation/ArcSinhTransformer.rst @@ -0,0 +1,5 @@ +ArcSinhTransformer +================== + +.. autoclass:: feature_engine.transformation.ArcSinhTransformer + :members: diff --git a/docs/api_doc/transformation/index.rst b/docs/api_doc/transformation/index.rst index 0705f4d0a..32866842d 100644 --- a/docs/api_doc/transformation/index.rst +++ b/docs/api_doc/transformation/index.rst @@ -13,6 +13,7 @@ mathematical transformations. LogCpTransformer ReciprocalTransformer ArcsinTransformer + ArcSinhTransformer PowerTransformer BoxCoxTransformer YeoJohnsonTransformer diff --git a/docs/images/arcsinh-demo-raw.png b/docs/images/arcsinh-demo-raw.png new file mode 100644 index 000000000..5767eff1f Binary files /dev/null and b/docs/images/arcsinh-demo-raw.png differ diff --git a/docs/images/arcsinh-ihs.png b/docs/images/arcsinh-ihs.png new file mode 100644 index 000000000..b155782f0 Binary files /dev/null and b/docs/images/arcsinh-ihs.png differ diff --git a/docs/images/arcsinh-loc-demo.png b/docs/images/arcsinh-loc-demo.png new file mode 100644 index 000000000..353c315a0 Binary files /dev/null and b/docs/images/arcsinh-loc-demo.png differ diff --git a/docs/images/arcsinh-loc.png b/docs/images/arcsinh-loc.png new file mode 100644 index 000000000..958d9dff1 Binary files /dev/null and b/docs/images/arcsinh-loc.png differ diff --git a/docs/images/arcsinh-qq.png b/docs/images/arcsinh-qq.png new file mode 100644 index 000000000..f24f28fcc Binary files /dev/null and 
b/docs/images/arcsinh-qq.png differ diff --git a/docs/images/arcsinh-scale-demo.png b/docs/images/arcsinh-scale-demo.png new file mode 100644 index 000000000..591f1d3e2 Binary files /dev/null and b/docs/images/arcsinh-scale-demo.png differ diff --git a/docs/images/arcsinh-scale.png b/docs/images/arcsinh-scale.png new file mode 100644 index 000000000..89446931f Binary files /dev/null and b/docs/images/arcsinh-scale.png differ diff --git a/docs/images/arcsinh-transformation.png b/docs/images/arcsinh-transformation.png new file mode 100644 index 000000000..a665d6888 Binary files /dev/null and b/docs/images/arcsinh-transformation.png differ diff --git a/docs/images/arcsinh_profit_histogram.png b/docs/images/arcsinh_profit_histogram.png new file mode 100644 index 000000000..e776e30d4 Binary files /dev/null and b/docs/images/arcsinh_profit_histogram.png differ diff --git a/docs/index.rst b/docs/index.rst index d1bf049a8..a7266bdc1 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -237,6 +237,7 @@ like anova, and machine learning models, like linear regression. Feature-engine - :doc:`api_doc/transformation/BoxCoxTransformer`: performs Box-Cox transformation of numerical variables - :doc:`api_doc/transformation/YeoJohnsonTransformer`: performs Yeo-Johnson transformation of numerical variables - :doc:`api_doc/transformation/ArcsinTransformer`: performs arcsin transformation of numerical variables +- :doc:`api_doc/transformation/ArcSinhTransformer`: applies arcsinh (pseudo-logarithm) transformation of numerical variables Feature Creation: ~~~~~~~~~~~~~~~~~ diff --git a/docs/user_guide/transformation/ArcSinhTransformer.rst b/docs/user_guide/transformation/ArcSinhTransformer.rst new file mode 100644 index 000000000..56acaf3e8 --- /dev/null +++ b/docs/user_guide/transformation/ArcSinhTransformer.rst @@ -0,0 +1,602 @@ +.. _arcsinh_transformer: + +.. 
currentmodule:: feature_engine.transformation + +ArcSinhTransformer +================== + +The inverse hyperbolic sine (or arcsinh) transformation is a variance-stabilizing +transformation that achieves results similar to the logarithmic transformation, +while retaining zero values in a variable, something the logarithm cannot do. It has +gained popularity in recent years; therefore, we add support for it in Feature-engine. + +Variance stabilizing transformations +------------------------------------ + +Variance stabilizing transformations are commonly used in regression analysis to make +skewed data more evenly distributed, approximate normality, or reduce heteroscedasticity. +One of the most commonly used transformations is the logarithm. However, the logarithm +transformation has one limitation: it is not defined for zero or negative values. + +Given that many variables include meaningful zero-valued observations for which +the logarithm is undefined, researchers developed a number of alternatives to retain +those zeros. + +The simplest alternative consists of adding 1 (or another constant) to the variable. In fact, +the two-parameter version of the Box-Cox transformation introduces a shift parameter for exactly +this purpose: it moves zero-valued observations away from zero before the power or logarithm +transformation is applied. + +However, adding 1 (or a constant) before applying a log transformation is arbitrary and can +distort interpretation, particularly for values near zero. + +In recent years, the inverse hyperbolic sine (or arcsinh) transformation has grown in +popularity because it is similar to a logarithm, and it allows retaining zero-valued +(and even negative-valued) observations. + +Inverse Hyperbolic Sine Transformation +-------------------------------------- + +The inverse hyperbolic sine (IHS) transformation is defined as follows: + +.. 
math:: + + x' = \operatorname{arcsinh}(x) = \ln\left(x + \sqrt{x^2 + 1}\right) + +The IHS transformation works with data defined on the whole real line including +negative values and zeros. For large values of x, the IHS behaves like a log +transformation. For small values of x, or in other words as x approaches 0, IHS(x) +approaches x. + +The following code recreates the effect of the IHS transformation on small and big values +of x: + +.. code:: python + + import numpy as np + import matplotlib.pyplot as plt + + # Create data + z_small = np.linspace(0, 1, 200) + z_large = np.linspace(1, 10, 200) + + # IHS transformation + arcsinh_small = np.arcsinh(z_small) + arcsinh_large = np.arcsinh(z_large) + + # Create figure + fig, axes = plt.subplots(1, 2, figsize=(10, 4)) + + # ---- Left panel ---- + axes[0].plot(z_small, arcsinh_small, color="black", label="arcsinh(z)") + axes[0].plot(z_small, z_small, color="red", label="z") + axes[0].set_xlabel("raw value") + axes[0].set_ylabel("arcsinh-transformed value") + axes[0].legend() + axes[0].set_xlim(0, 1) + axes[0].set_ylim(0, 0.9) + + # ---- Right panel ---- + axes[1].plot(z_large, arcsinh_large, color="black", label="arcsinh(z)") + axes[1].plot(z_large, np.log(z_large) + np.log(2), color="red", label="log(z)+log(2)") + axes[1].set_xlabel("raw value") + axes[1].set_ylabel("arcsinh-transformed value") + axes[1].legend() + axes[1].set_xlim(1, 10) + axes[1].set_ylim(0.8, 3.1) + + plt.tight_layout() + plt.show() + +In the following image, we see that the IHS transformation retains the values of x +when x is small (left panel), or behaves like the log(x) (plus a shift) when x is large +(right panel): + +.. image:: ../../images/arcsinh-transformation.png + +Variable Scaling before IHS +--------------------------- + +The effect of the IHS transformation depends on the magnitude of the values to transform. 
+In general, if the values are smaller than 3, the IHS results in values similar to the original +variable, and hence, the transformation is not useful to reduce skewness. + +In contrast, if one chooses the unit of measurement for a variable in a way that all +values are rather large (e.g., larger than 3), the IHS transformation is almost identical +to the log transformation. Hence, to make a useful transformation, it's suggested to "rescale" +the original variable to greater values (i.e., multiply by a positive constant). + +However, when the variable to transform contains zeros as values, it is difficult to scale +this variable in a way that all values are rather large because zero values remain zero +no matter what unit of measurement is used. + +Let's compare the effect of rescaling the variable before applying the IHS transformation: + +.. code:: python + + import numpy as np + import matplotlib.pyplot as plt + + # ----------------------------- + # 1. Generate synthetic variable + # ----------------------------- + np.random.seed(42) + n = 500 + + # Skewed positive values (like log-normal) + positive = np.random.lognormal(mean=2, sigma=1, size=400) + + # Add zeros and some negative values + zeros = np.zeros(50) + negatives = np.random.uniform(-5, 0, 50) + + x = np.concatenate([positive, zeros, negatives]) + + # ----------------------------- + # 2. Define IHS transform with scale + # ----------------------------- + def ihs_transform(x, scale): + return np.arcsinh(x / scale) + + # ----------------------------- + # 3. Define scenarios + # ----------------------------- + scales = [1, 0.1, 5, 50] + + scenarios = {} + for scale in scales: + title = f"Scale = {scale}" + # Top row: original variable multiplied by scale + original_scaled = x * scale + # Bottom row: IHS-transformed + transformed = ihs_transform(x, scale=scale) + scenarios[title] = (original_scaled, transformed) + + # ----------------------------- + # 4. 
Plotting with larger fonts + # ----------------------------- + fig, axes = plt.subplots(2, len(scenarios), figsize=(16, 8), sharey='row') + + # Font sizes + title_fontsize = 16 + label_fontsize = 14 + tick_fontsize = 12 + + for i, (title, (original, transformed)) in enumerate(scenarios.items()): + # Top row: scaled original variable + axes[0, i].hist(original, bins=30, edgecolor='black') + axes[0, i].set_title(title, fontsize=title_fontsize) + axes[0, i].set_ylabel("Frequency", fontsize=label_fontsize) + axes[0, i].tick_params(axis='both', labelsize=tick_fontsize) + + # Bottom row: IHS-transformed + axes[1, i].hist(transformed, bins=30, edgecolor='black') + axes[1, i].set_xlabel("Value after transformation", fontsize=label_fontsize) + axes[1, i].set_ylabel("Frequency", fontsize=label_fontsize) + axes[1, i].tick_params(axis='both', labelsize=tick_fontsize) + + plt.tight_layout() + plt.show() + + +In the following image, we see the resulting IHS transformation after multiplying the +original variable by 0.1 (reducing the scale), 5 or 50 (increasing the scale). In the top +panels, we see the original distribution (left) and the distribution after rescaling. +In the bottom panels, we see the effect of the inverse hyperbolic sine transformation: + +.. image:: ../../images/arcsinh-scale-demo.png + +The main takeaways from this experiment are: + +- Changing the variable scale affects the variance stabilizing power of the IHS transformation +- Reducing the scale (multiplying by values <1) increases the separation of larger values from zero values (second panel), which is probably not what we want +- Increasing the scale substantially may also result in suboptimal distributions, as shown in the right panel + +Hence, choosing the right scale is key to achieving the desired results. 
+ +Variable Centering before IHS +----------------------------- + +Another way to obtain better results using the inverse hyperbolic sine transformation is +to shift the data (i.e., to add a constant). This is particularly useful when the variable +has negative values, to transition from negative logarithmic behavior to positive logarithmic +behavior. + +Let's compare the effect of shifting the original variable distribution before applying the +IHS transformation. With the following code, we apply the IHS to a variable containing +zero and negative values after centering at its mean, or at its minimum value (shifting all +negative values to positive): + +.. code:: python + + import numpy as np + import matplotlib.pyplot as plt + + # ----------------------------- + # 1. Generate synthetic variable + # ----------------------------- + np.random.seed(42) + n = 500 + + # Skewed positive values (like log-normal) + positive = np.random.lognormal(mean=2, sigma=1, size=400) + + # Add zeros and some negative values + zeros = np.zeros(50) + negatives = np.random.uniform(-5, 0, 50) + + x = np.concatenate([positive, zeros, negatives]) + + # ----------------------------- + # 2. Define IHS transform + # ----------------------------- + def ihs_transform(x, loc=0): + return np.arcsinh(x - loc) + + # ----------------------------- + # 3. Define scenarios + # ----------------------------- + loc_mean = x.mean() + loc_min = x.min() + + # Each scenario: (top histogram data, bottom transformed) + scenarios = { + "Original": (x, ihs_transform(x, loc=0)), + f"Centered (loc = mean = {loc_mean:.2f})": (x - loc_mean, ihs_transform(x, loc=loc_mean)), + f"Shifted (loc = min = {loc_min:.2f})": (x - loc_min, ihs_transform(x, loc=loc_min)), + } + + # ----------------------------- + # 4. 
Plotting with larger fonts + # ----------------------------- + fig, axes = plt.subplots(2, len(scenarios), figsize=(12, 8), sharey='row') + + # Font sizes + title_fontsize = 16 + label_fontsize = 14 + tick_fontsize = 12 + + for i, (title, (top_data, transformed)) in enumerate(scenarios.items()): + # Top row: original or shifted variable + axes[0, i].hist(top_data, bins=30, edgecolor='black', color='skyblue') + axes[0, i].set_title(title, fontsize=title_fontsize) + axes[0, i].set_ylabel("Frequency", fontsize=label_fontsize) + axes[0, i].tick_params(axis='both', labelsize=tick_fontsize) + + # Bottom row: IHS-transformed + axes[1, i].hist(transformed, bins=30, edgecolor='black', color='salmon') + axes[1, i].set_xlabel("Value after transformation", fontsize=label_fontsize) + axes[1, i].set_ylabel("Frequency", fontsize=label_fontsize) + axes[1, i].tick_params(axis='both', labelsize=tick_fontsize) + + plt.tight_layout() + plt.show() + +In the following image, we observe the original distribution (top panels) before (left) and +after shifting its values by the variable mean (middle) or the variable minimum (right). In the +bottom panels, we see the variables after the IHS transformation: + +.. image:: ../../images/arcsinh-loc-demo.png + +We observe that making all values of the variable positive results in the best transformation +(right panel), as the transformed variable has a more Gaussian-looking distribution. +Centering the variable at the mean reduces the difference between the larger values and the +zero and negative values after the transformation (middle panel). + +Limitations of the IHS +---------------------- + +As with all variance stabilizing transformations, the IHS comes with limitations. The main one +is that, by definition, the result of the transformation largely depends on the scale of the +variable. + +That means that the IHS is not a one-stop solution for the transformation of numerical variables +with zero and negative values. 
Instead, we need to review the result of the transformation before +using it in our machine learning pipelines. + +In fact, scaling and shifting have been proposed to improve the effect of the transformation, +which, in a sense, defeats the original appeal of this function: avoiding the addition of a +constant before applying the logarithm. So, use it with care. + +ArcSinhTransformer +------------------ + +Feature-engine's :class:`ArcSinhTransformer()` applies the inverse hyperbolic sine transformation +to numerical variables. It also supports shifting and scaling the variables as follows: + +.. math:: + + y = \operatorname{arcsinh}\left(\frac{x - \text{loc}}{\text{scale}}\right) + +In short, the loc parameter moves the distribution of the original variable to the right or left, +whereas the scale parameter re-scales the variable. Because the variable is divided by scale, +smaller values of scale magnify the values and push more of them into the logarithmic regime, +whereas larger values of scale keep more of them in the near-linear regime. + +Unlike :class:`LogTransformer()`, :class:`ArcSinhTransformer()` can handle +zero and negative values without requiring any preprocessing (or so we wanted to think). + +Python demo +----------- + +In this demo, we'll show how to use the inverse hyperbolic sine transformation with care. + +Let's create a dataframe with 2 variables containing positive, zero and negative +values, and then split the dataset into a training and a testing set: + +.. 
code:: python + + import numpy as np + import pandas as pd + import matplotlib.pyplot as plt + from sklearn.model_selection import train_test_split + + from feature_engine.transformation import ArcSinhTransformer + + # Create sample data with positive and negative values + np.random.seed(42) + + # Skewed positive values (like log-normal) + positive = np.random.lognormal(mean=2, sigma=1, size=400) + zeros = np.zeros(50) + negatives = np.random.uniform(-5, 0, 50) + x = np.concatenate([positive, zeros, negatives]) + + X = pd.DataFrame({ + 'profit': x, + 'net_worth': np.random.randn(500) * 5000, + }) + + # Separate into train and test + X_train, X_test = train_test_split(X, test_size=0.3, random_state=0) + + print(X.head()) + +In the following output, we see the created dataframe: + +.. code:: python + + profit net_worth + 0 12.142530 -8516.912197 + 1 6.434896 -277.738494 + 2 14.121360 1920.327245 + 3 33.886946 -163.473740 + 4 5.846520 -10337.210500 + +Let's display histograms with the distributions of these variables: + +.. code:: python + + X_test.hist(bins=20, figsize=(8,4)) + plt.show() + +In the following image, we see the distributions of the numerical variables: + +.. image:: ../../images/arcsinh-demo-raw.png + +Let's set up the :class:`ArcSinhTransformer()` to apply the IHS transformation to both +variables, and fit it to the training set: + +.. code:: python + + # Set up the arcsinh transformer + tf = ArcSinhTransformer(variables=['profit', 'net_worth']) + + # Fit the transformer + tf.fit(X_train) + +The transformer does not learn any parameters when applying the fit method. It does +check, however, that the variables are numerical. + +We can now transform the variables: + +.. code:: python + + # Transform the data + train_t = tf.transform(X_train) + test_t = tf.transform(X_test) + + print(train_t.head()) + +The dataframe with the transformed variables: + +.. 
code:: python + + profit net_worth + 141 4.000625 9.344656 + 383 2.180247 9.553467 + 135 4.243288 -7.692116 + 493 -1.897234 9.842227 + 122 4.096218 7.971689 + +To evaluate the effect of the transformation, let's plot the histograms of the transformed +variables: + +.. code:: python + + test_t.hist(bins=20, figsize=(8,4)) + plt.show() + +In the following figure, we see that while the arcsinh transformation seemed to stabilize the +variance of the variable profit, it does an awful job for the variable net-worth: + +.. image:: ../../images/arcsinh-ihs.png + +Scaling the distribution before arcsinh +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:class:`ArcSinhTransformer()` supports location and scale parameters to +center and rescale data before transformation. + +We discussed previously that re-scaling the variables before applying the arcsinh transformation +can help achieve better variance stabilizing results. + +Let's rescale the variable profit before applying the arcsinh transformation and then display +the histogram of the resulting dataframe: + +.. code:: python + + tf = ArcSinhTransformer(variables=['profit'], scale=5) + + # Fit the transformer + tf.fit(X_train) + + # Transform the data + train_scale = tf.transform(X_train) + test_scale = tf.transform(X_test) + + test_scale.hist(bins=20, figsize=(8,4)) + plt.show() + +In the following image, we see that decreasing the scale of profit (we divided it by 5) +results in a more stable distribution (more on this later): + +.. image:: ../../images/arcsinh-scale.png + +Net worth was untransformed, so we see the original distribution. + +Shifting the distribution before arcsinh +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We mentioned previously that shifting the variables before applying the arcsinh transformation +can help achieve better variance stabilizing results. + +Let's shift the variable profit before applying the arcsinh transformation, to make all its +values positive. 
After that, we display the histogram of the resulting dataframe: + +.. code:: python + + loc = X_train['profit'].min() + tf = ArcSinhTransformer(variables=['profit'], loc=loc) + + # Fit the transformer + tf.fit(X_train) + + # Transform the data + train_loc = tf.transform(X_train) + test_loc = tf.transform(X_test) + + test_loc.hist(bins=20, figsize=(8,4)) + plt.show() + +In the following image, we see that moving the variable profit towards positive values +results in a more stable distribution (more on this later): + +.. image:: ../../images/arcsinh-loc.png + +Net worth was untransformed, so we see the original distribution. + +Which transformation is better? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To analyse the effect of the IHS transformation with and without scaling or shifting the +profit variable, let's plot Q-Q plots: + +.. code:: python + + import matplotlib.pyplot as plt + import scipy.stats as stats + + # List of dataframes and titles + dfs = [X_test, test_t, test_scale, test_loc] + titles = ['X_test', 'test_t', 'test_scale', 'test_loc'] + + # Create figure with 2x2 subplots + fig, axes = plt.subplots(2, 2, figsize=(12, 10)) + + for ax, df, title in zip(axes.flatten(), dfs, titles): + # Q-Q plot for profit variable + stats.probplot(df['profit'], dist="norm", plot=ax) + ax.set_title(title, fontsize=14) + + plt.tight_layout() + plt.show() + +In the following image, we see that the inverse hyperbolic sine makes the distribution +of profit follow a normal distribution more closely (top right panel), while scaling or +shifting the variable before the transformation improves the result further (bottom panels): + +.. image:: ../../images/arcsinh-qq.png + +Inverse transformation +~~~~~~~~~~~~~~~~~~~~~~ + +:class:`ArcSinhTransformer()` supports inverse transformation to recover +the original values: + +.. 
code:: python + + # Transform and then inverse transform + train_t = tf.transform(X_train) + train_recovered = tf.inverse_transform(train_t) + + print(train_recovered.head()) + +The recovered data: + +.. code:: python + + profit net_worth + 141 27.306991 5718.770216 + 383 4.367737 7046.737201 + 135 34.811034 -1095.502644 + 493 -3.258723 9405.785347 + 122 30.047946 1448.874284 + +References +---------- + +For more details on the inverse hyperbolic sine transformation, check the following resources: + +1. `How should I transform non-negative data including zeros? `_ (StackExchange) +2. `Interpreting Treatment Effects: Inverse Hyperbolic Sine Outcome Variable `_ (World Bank Blog) +3. `Burbidge, J. B., Magee, L., & Robb, A. L. (1988). Alternative transformations to handle extreme values of the dependent variable. Journal of the American Statistical Association. `_ +4. `Aihounton, Henningsen. (2020). Units of measurement and the inverse hyperbolic sine transformation. The Econometrics Journal. `_ + +Tutorials, books and courses +---------------------------- + +For tutorials about variance stabilizing transformations, check out our online course: + +.. figure:: ../../images/feml.png + :width: 300 + :figclass: align-center + :align: left + :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning + + Feature Engineering for Machine Learning + +| +| +| +| +| +| +| +| +| +| + +Or read our book: + +.. figure:: ../../images/cookbook.png + :width: 200 + :figclass: align-center + :align: left + :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 + + Python Feature Engineering Cookbook + +| +| +| +| +| +| +| +| +| +| +| +| +| + +Both our book and course are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. 
\ No newline at end of file diff --git a/docs/user_guide/transformation/index.rst b/docs/user_guide/transformation/index.rst index 85422c9f6..00ce20bfb 100644 --- a/docs/user_guide/transformation/index.rst +++ b/docs/user_guide/transformation/index.rst @@ -33,6 +33,7 @@ on the nature of the variable. LogCpTransformer ReciprocalTransformer ArcsinTransformer + ArcSinhTransformer PowerTransformer BoxCoxTransformer YeoJohnsonTransformer diff --git a/feature_engine/transformation/__init__.py b/feature_engine/transformation/__init__.py index 15011ac4b..9bbb62a59 100644 --- a/feature_engine/transformation/__init__.py +++ b/feature_engine/transformation/__init__.py @@ -4,6 +4,7 @@ """ from .arcsin import ArcsinTransformer +from .arcsinh import ArcSinhTransformer from .boxcox import BoxCoxTransformer from .log import LogCpTransformer, LogTransformer from .power import PowerTransformer @@ -11,11 +12,12 @@ from .yeojohnson import YeoJohnsonTransformer __all__ = [ + "ArcsinTransformer", + "ArcSinhTransformer", "BoxCoxTransformer", "LogTransformer", "LogCpTransformer", "PowerTransformer", "ReciprocalTransformer", "YeoJohnsonTransformer", - "ArcsinTransformer", ] diff --git a/feature_engine/transformation/arcsinh.py b/feature_engine/transformation/arcsinh.py new file mode 100644 index 000000000..e0020ff86 --- /dev/null +++ b/feature_engine/transformation/arcsinh.py @@ -0,0 +1,228 @@ +# Authors: Ankit Hemant Lade (contributor) +# License: BSD 3 clause + +from typing import List, Optional, Union + +import numpy as np +import pandas as pd + +from feature_engine._base_transformers.base_numerical import BaseNumericalTransformer +from feature_engine._check_init_parameters.check_variables import ( + _check_variables_input_value, +) +from feature_engine._docstrings.fit_attributes import ( + _feature_names_in_docstring, + _n_features_in_docstring, + _variables_attribute_docstring, +) +from feature_engine._docstrings.init_parameters.all_trasnformers import ( + 
_variables_numerical_docstring, +) +from feature_engine._docstrings.methods import ( + _fit_not_learn_docstring, + _fit_transform_docstring, + _inverse_transform_docstring, +) +from feature_engine._docstrings.substitute import Substitution +from feature_engine.tags import _return_tags + + +@Substitution( + variables=_variables_numerical_docstring, + variables_=_variables_attribute_docstring, + feature_names_in_=_feature_names_in_docstring, + n_features_in_=_n_features_in_docstring, + fit=_fit_not_learn_docstring, + fit_transform=_fit_transform_docstring, + inverse_transform=_inverse_transform_docstring, +) +class ArcSinhTransformer(BaseNumericalTransformer): + """ + The ArcSinhTransformer() applies the inverse hyperbolic sine transformation + (arcsinh) to numerical variables. Also known as the pseudo-logarithm, this + transformation is useful for data that contains both positive and negative values. + + The transformation is: x → arcsinh((x - loc) / scale) + + For large values of x, arcsinh(x) behaves like ln(x) + ln(2), providing similar + variance-stabilizing properties as the log transformation. For small values of x, + it behaves approximately linearly (i.e., arcsinh(x) ≈ x). This makes it ideal for + variables like net worth, profit/loss, or any metric that can be positive or + negative. + + A list of variables can be passed as an argument. Alternatively, the transformer + will automatically select and transform all variables of type numeric. + + More details in the :ref:`User Guide `. + + Parameters + ---------- + {variables} + + loc: float, default=0.0 + Location parameter for shifting the data before transformation. + The transformation becomes: arcsinh((x - loc) / scale) + + scale: float, default=1.0 + Scale parameter for normalizing the data before transformation. + Must be greater than 0. 
The transformation becomes: arcsinh((x - loc) / scale) + + Attributes + ---------- + {variables_} + + {feature_names_in_} + + {n_features_in_} + + Methods + ------- + {fit} + + {fit_transform} + + {inverse_transform} + + transform: + Transform the variables using the arcsinh function. + + See Also + -------- + feature_engine.transformation.LogTransformer : + Applies log transformation (only for positive values). + feature_engine.transformation.YeoJohnsonTransformer : + Applies Yeo-Johnson transformation. + + References + ---------- + .. [1] Burbidge, J. B., Magee, L., & Robb, A. L. (1988). Alternative + transformations to handle extreme values of the dependent variable. + Journal of the American Statistical Association, 83(401), 123-127. + + Examples + -------- + + >>> import numpy as np + >>> import pandas as pd + >>> from feature_engine.transformation import ArcSinhTransformer + >>> np.random.seed(42) + >>> X = pd.DataFrame(dict(x = np.random.randn(100) * 1000)) + >>> ast = ArcSinhTransformer() + >>> ast.fit(X) + >>> X = ast.transform(X) + >>> X.head() + x + 0 7.516076 + 1 -6.330816 + 2 7.780254 + 3 8.825252 + 4 -6.995893 + """ + + def __init__( + self, + variables: Union[None, int, str, List[Union[str, int]]] = None, + loc: float = 0.0, + scale: float = 1.0, + ) -> None: + + if not isinstance(loc, (int, float)): + raise ValueError( + f"loc must be a number (int or float). " + f"Got {type(loc).__name__} instead." + ) + + if not isinstance(scale, (int, float)) or scale <= 0: + raise ValueError( + f"scale must be a positive number (> 0). Got {scale} instead." + ) + + self.variables = _check_variables_input_value(variables) + self.loc = float(loc) + self.scale = float(scale) + + def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None): + """ + Selects the numerical variables and stores feature names. + + Parameters + ---------- + X: Pandas DataFrame of shape = [n_samples, n_features]. + The training input samples. 
Can be the entire dataframe, not just the + variables to transform. + + y: pandas Series, default=None + It is not needed in this transformer. You can pass y or None. + + Returns + ------- + self: ArcSinhTransformer + The fitted transformer. + """ + + # check input dataframe and find/check numerical variables + X = super().fit(X) + + return self + + def transform(self, X: pd.DataFrame) -> pd.DataFrame: + """ + Transform the variables using the arcsinh function. + + Parameters + ---------- + X: Pandas DataFrame of shape = [n_samples, n_features] + The data to be transformed. + + Returns + ------- + X_new: pandas dataframe + The dataframe with the transformed variables. + """ + + # check input dataframe and if class was fitted + X = self._check_transform_input_and_state(X) + + # Ensure float dtype for the transformation + X[self.variables_] = X[self.variables_].astype(float) + + # Apply arcsinh transformation: arcsinh((x - loc) / scale) + X.loc[:, self.variables_] = np.arcsinh( + (X.loc[:, self.variables_] - self.loc) / self.scale + ) + + return X + + def inverse_transform(self, X: pd.DataFrame) -> pd.DataFrame: + """ + Convert the data back to the original representation. + + Parameters + ---------- + X: Pandas DataFrame of shape = [n_samples, n_features] + The data to be inverse transformed. + + Returns + ------- + X_tr: pandas dataframe + The dataframe with the inverse transformed variables. 
+ """ + + # check input dataframe and if class was fitted + X = self._check_transform_input_and_state(X) + + # Inverse transform: x = sinh(y) * scale + loc + X.loc[:, self.variables_] = ( + np.sinh(X.loc[:, self.variables_]) * self.scale + self.loc + ) + + return X + + def _more_tags(self): + tags_dict = _return_tags() + tags_dict["variables"] = "numerical" + return tags_dict + + def __sklearn_tags__(self): + tags = super().__sklearn_tags__() + return tags diff --git a/feature_engine/variable_handling/find_variables.py b/feature_engine/variable_handling/find_variables.py index 04779ad5d..51d56e519 100644 --- a/feature_engine/variable_handling/find_variables.py +++ b/feature_engine/variable_handling/find_variables.py @@ -85,7 +85,7 @@ def find_categorical_variables(X: pd.DataFrame) -> List[Union[str, int]]: """ variables = [ column - for column in X.select_dtypes(include=["O", "category"]).columns + for column in X.select_dtypes(include=["object", "category"]).columns if _is_categorical_and_is_not_datetime(X[column]) ] if len(variables) == 0: @@ -254,7 +254,7 @@ def find_categorical_and_numerical_variables( if variables is None: variables_cat = [ column - for column in X.select_dtypes(include=["O", "category"]).columns + for column in X.select_dtypes(include=["object", "category"]).columns if _is_categorical_and_is_not_datetime(X[column]) ] # find numerical variables in dataset @@ -271,14 +271,14 @@ def find_categorical_and_numerical_variables( raise ValueError("The list of variables is empty.") # find categorical variables - variables_cat = [ - var for var in X[variables].select_dtypes(include=["O", "category"]).columns - ] + variables_cat = list( + X[variables].select_dtypes(include=["object", "category"]).columns + ) # find numerical variables variables_num = list(X[variables].select_dtypes(include="number").columns) - if any([v for v in variables if v not in variables_cat + variables_num]): + if any(v for v in variables if v not in variables_cat + 
variables_num): raise TypeError( "Some of the variables are neither numerical nor categorical." ) diff --git a/tests/test_transformation/test_arcsinh.py b/tests/test_transformation/test_arcsinh.py new file mode 100644 index 000000000..a3d8b8d4d --- /dev/null +++ b/tests/test_transformation/test_arcsinh.py @@ -0,0 +1,201 @@ +import numpy as np +import pandas as pd +import pytest + +from feature_engine.transformation import ArcSinhTransformer + + +@pytest.fixture +def df_numerical(): + """Fixture providing sample numerical data with positive and negative values.""" + return pd.DataFrame({ + "a": [-100, -10, 0, 10, 100], + "b": [1, 2, 3, 4, 5], + }) + + +@pytest.fixture +def df_multi_column(): + """Fixture providing DataFrame with multiple columns.""" + return pd.DataFrame({ + "a": [1, 2, 3], + "b": [4, 5, 6], + "c": [7, 8, 9], + }) + + +def test_default_parameters(df_numerical): + """Test transformer with default parameters applies arcsinh to all columns.""" + transformer = ArcSinhTransformer() + X_tr = transformer.fit_transform(df_numerical.copy()) + + expected_a = np.arcsinh(df_numerical["a"]) + expected_b = np.arcsinh(df_numerical["b"]) + np.testing.assert_array_almost_equal(X_tr["a"], expected_a) + np.testing.assert_array_almost_equal(X_tr["b"], expected_b) + + +def test_specific_variables(df_multi_column): + """Test transformer with specific variables selected.""" + transformer = ArcSinhTransformer(variables=["a", "b"]) + X_tr = transformer.fit_transform(df_multi_column.copy()) + + np.testing.assert_array_almost_equal( + X_tr["a"], np.arcsinh(df_multi_column["a"]) + ) + np.testing.assert_array_almost_equal( + X_tr["b"], np.arcsinh(df_multi_column["b"]) + ) + np.testing.assert_array_equal(X_tr["c"], df_multi_column["c"]) + + +def test_with_loc_and_scale(): + """Test transformer with loc and scale parameters.""" + X = pd.DataFrame({"a": [10, 20, 30, 40, 50]}) + loc = 30.0 + scale = 10.0 + transformer = ArcSinhTransformer(loc=loc, scale=scale) + X_tr = 
transformer.fit_transform(X.copy()) + + expected = np.arcsinh((X["a"] - loc) / scale) + np.testing.assert_array_almost_equal(X_tr["a"], expected) + np.testing.assert_almost_equal(X_tr["a"].iloc[2], 0.0, decimal=10) + + +@pytest.mark.parametrize("loc", [0.0, 10.0, -10.0, 100.5]) +def test_various_loc_values(loc): + """Test that various loc values work correctly.""" + X = pd.DataFrame({"a": [1, 2, 3, 4, 5]}) + transformer = ArcSinhTransformer(loc=loc) + X_tr = transformer.fit_transform(X.copy()) + + expected = np.arcsinh((X["a"] - loc) / 1.0) + np.testing.assert_array_almost_equal(X_tr["a"], expected) + + +@pytest.mark.parametrize("scale", [0.5, 1.0, 2.0, 10.0, 100.0]) +def test_various_scale_values(scale): + """Test that various scale values work correctly.""" + X = pd.DataFrame({"a": [1, 2, 3, 4, 5]}) + transformer = ArcSinhTransformer(scale=scale) + X_tr = transformer.fit_transform(X.copy()) + + expected = np.arcsinh((X["a"] - 0.0) / scale) + np.testing.assert_array_almost_equal(X_tr["a"], expected) + + +def test_inverse_transform(df_numerical): + """Test inverse_transform returns original values.""" + X_original = df_numerical.copy() + transformer = ArcSinhTransformer() + X_tr = transformer.fit_transform(df_numerical.copy()) + X_inv = transformer.inverse_transform(X_tr) + + np.testing.assert_array_almost_equal(X_inv["a"], X_original["a"], decimal=10) + np.testing.assert_array_almost_equal(X_inv["b"], X_original["b"], decimal=10) + + +def test_inverse_transform_with_loc_scale(): + """Test inverse_transform with loc and scale parameters.""" + X = pd.DataFrame({"a": [10, 20, 30, 40, 50]}) + X_original = X.copy() + transformer = ArcSinhTransformer(loc=25.0, scale=5.0) + X_tr = transformer.fit_transform(X.copy()) + X_inv = transformer.inverse_transform(X_tr) + + np.testing.assert_array_almost_equal(X_inv["a"], X_original["a"], decimal=10) + + +def test_negative_values(): + """Test that transformer handles negative values correctly.""" + X = pd.DataFrame({"a": [-1000, 
-500, 0, 500, 1000]}) + transformer = ArcSinhTransformer() + X_tr = transformer.fit_transform(X.copy()) + + # Expected values: arcsinh([ -1000, -500, 0, 500, 1000 ]) + expected = [-7.600902, -6.907755, 0.0, 6.907755, 7.600902] + np.testing.assert_array_almost_equal(X_tr["a"], expected, decimal=5) + + # Verify symmetry property: arcsinh(-x) = -arcsinh(x) + np.testing.assert_almost_equal( + X_tr["a"].iloc[0], -X_tr["a"].iloc[4], decimal=10 + ) + np.testing.assert_almost_equal( + X_tr["a"].iloc[1], -X_tr["a"].iloc[3], decimal=10 + ) + + +@pytest.mark.parametrize("invalid_scale", [0, -1, -0.5, -100, "string", False]) +def test_invalid_scale_raises_error(invalid_scale): + """Test that non-positive scale values raise ValueError.""" + with pytest.raises(ValueError, match="scale must be a positive number"): + ArcSinhTransformer(scale=invalid_scale) + + +@pytest.mark.parametrize("invalid_loc", ["invalid", [1, 2], {"a": 1}, None]) +def test_invalid_loc_raises_error(invalid_loc): + """Test that non-numeric loc values raise ValueError.""" + with pytest.raises(ValueError, match="loc must be a number"): + ArcSinhTransformer(loc=invalid_loc) + + +def test_fit_stores_attributes(): + """Test that fit stores expected attributes with correct values.""" + X = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) + transformer = ArcSinhTransformer() + transformer.fit(X) + + assert hasattr(transformer, "variables_") + assert hasattr(transformer, "feature_names_in_") + assert hasattr(transformer, "n_features_in_") + assert transformer.n_features_in_ == 2 + assert set(transformer.variables_) == {"a", "b"} + assert transformer.feature_names_in_ == ["a", "b"] + + +def test_get_feature_names_out(): + """Test get_feature_names_out returns correct feature names.""" + X = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) + transformer = ArcSinhTransformer() + transformer.fit(X) + + feature_names = transformer.get_feature_names_out() + assert feature_names == ["a", "b"] + + +def 
test_get_feature_names_out_with_subset(): + """Test get_feature_names_out with subset of variables.""" + X = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]}) + transformer = ArcSinhTransformer(variables=["a"]) + transformer.fit(X) + + feature_names = transformer.get_feature_names_out() + assert feature_names == ["a", "b", "c"] + + +def test_behavior_like_log_for_large_values(): + """Test that arcsinh behaves like log for large positive values.""" + X = pd.DataFrame({"a": [1000, 10000, 100000]}) + transformer = ArcSinhTransformer() + X_tr = transformer.fit_transform(X.copy()) + + log_approx = np.log(2 * X["a"]) + np.testing.assert_array_almost_equal(X_tr["a"], log_approx, decimal=1) + + +def test_behavior_like_identity_for_small_values(): + """Test that arcsinh behaves like identity for small values.""" + X = pd.DataFrame({"a": [0.001, 0.01, 0.1]}) + transformer = ArcSinhTransformer() + X_tr = transformer.fit_transform(X.copy()) + + np.testing.assert_array_almost_equal(X_tr["a"], X["a"], decimal=2) + + +def test_zero_input_returns_zero(): + """Test that arcsinh(0) = 0.""" + X = pd.DataFrame({"a": [0.0]}) + transformer = ArcSinhTransformer() + X_tr = transformer.fit_transform(X.copy()) + + assert X_tr["a"].iloc[0] == 0.0 diff --git a/tests/test_transformation/test_check_estimator_transformers.py b/tests/test_transformation/test_check_estimator_transformers.py index 7db0088f8..812cbbbaf 100644 --- a/tests/test_transformation/test_check_estimator_transformers.py +++ b/tests/test_transformation/test_check_estimator_transformers.py @@ -7,6 +7,7 @@ from feature_engine.transformation import ( ArcsinTransformer, + ArcSinhTransformer, BoxCoxTransformer, LogCpTransformer, LogTransformer, @@ -21,6 +22,7 @@ LogTransformer(), LogCpTransformer(), ArcsinTransformer(), + ArcSinhTransformer(), PowerTransformer(), ReciprocalTransformer(), YeoJohnsonTransformer(), diff --git a/tests/test_wrappers/test_sklearn_wrapper.py b/tests/test_wrappers/test_sklearn_wrapper.py index 
e825a7bc0..527047246 100644 --- a/tests/test_wrappers/test_sklearn_wrapper.py +++ b/tests/test_wrappers/test_sklearn_wrapper.py @@ -401,6 +401,18 @@ def test_sklearn_ohe_all_features(df_vartypes): transformer=_OneHotEncoder(sparse=False, dtype=np.int64) ) + # Get the expected dob column names dynamically to handle varying precision + # across pandas/sklearn versions. + dob_names = [ + f"dob_{val.isoformat()}" for val in df_vartypes["dob"] + ] + # If isoformat doesn't have microseconds, sklearn might still add .000... + # OneHotEncoder uses categories_ which are often strings. + # Let's use a more robust way: + ohe = _OneHotEncoder(sparse=False) + ohe.fit(df_vartypes[["dob"]]) + dob_names = ohe.get_feature_names_out(["dob"]).tolist() + ref = pd.DataFrame( { "Name_jack": [0, 0, 0, 1], @@ -419,12 +431,10 @@ def test_sklearn_ohe_all_features(df_vartypes): "Marks_0.7": [0, 0, 1, 0], "Marks_0.8": [0, 1, 0, 0], "Marks_0.9": [1, 0, 0, 0], - "dob_2020-02-24T00:00:00.000000000": [1, 0, 0, 0], - "dob_2020-02-24T00:01:00.000000000": [0, 1, 0, 0], - "dob_2020-02-24T00:02:00.000000000": [0, 0, 1, 0], - "dob_2020-02-24T00:03:00.000000000": [0, 0, 0, 1], } ) + for i, name in enumerate(dob_names): + ref[name] = [1 if j == i else 0 for j in range(4)] transformed_df = transformer.fit_transform(df_vartypes) @@ -473,6 +483,10 @@ def test_wrap_one_hot_encoder_get_features_name_out(df_vartypes): ohe_wrap = SklearnTransformerWrapper(transformer=_OneHotEncoder(sparse=False)) ohe_wrap.fit(df_vartypes) + ohe = _OneHotEncoder(sparse=False) + ohe.fit(df_vartypes[["dob"]]) + dob_names = ohe.get_feature_names_out(["dob"]).tolist() + expected_features_all = [ "Name_jack", "Name_krish", @@ -490,11 +504,7 @@ def test_wrap_one_hot_encoder_get_features_name_out(df_vartypes): "Marks_0.7", "Marks_0.8", "Marks_0.9", - "dob_2020-02-24T00:00:00.000000000", - "dob_2020-02-24T00:01:00.000000000", - "dob_2020-02-24T00:02:00.000000000", - "dob_2020-02-24T00:03:00.000000000", - ] + ] + dob_names assert 
ohe_wrap.get_feature_names_out() == expected_features_all