diff --git a/docs/user_guide/creation/CyclicalFeatures.rst b/docs/user_guide/creation/CyclicalFeatures.rst index 2bb02d8b0..ae33e27a8 100644 --- a/docs/user_guide/creation/CyclicalFeatures.rst +++ b/docs/user_guide/creation/CyclicalFeatures.rst @@ -34,7 +34,7 @@ Cyclical encoding The trigonometric functions sine and cosine are periodic and repeat their values every 2 pi radians. Thus, to transform cyclical variables into (x, y) coordinates using these -functions, first we need to normalize them to 2 pi radians. +functions, first we need to normalise them to 2 pi radians. We achieve this by dividing the variables' values by their maximum value. Thus, the two new features are derived as follows: @@ -51,9 +51,9 @@ In Python, we can encode cyclical features by using the Numpy functions `sin` an X[f"{variable}_sin"] = np.sin(X["variable"] * (2.0 * np.pi / X["variable"]).max()) X[f"{variable}_cos"] = np.cos(X["variable"] * (2.0 * np.pi / X["variable"]).max()) -We can also use Feature-engine to automate this process. +We can also use feature-engine to automate this process. -Cyclical encoding with Feature-engine +Cyclical encoding with feature-engine ------------------------------------- :class:`CyclicalFeatures()` creates two new features from numerical variables to better @@ -69,7 +69,7 @@ Finding the max_value ~~~~~~~~~~~~~~~~~~~~~ :class:`CyclicalFeatures()` attempts to automate the process of cyclical encoding by -automatically determining the value used to normalize the feature between +automatically determining the value used to normalise the feature between 0 and 2 * pi radians, which coincides with the cycle of the periodic functions sine and cosine. @@ -86,7 +86,7 @@ Applying cyclical encoding -------------------------- We'll start by applying cyclical encoding to a toy dataset to get familiar with how to -use Feature-engine for cyclical encoding. +use feature-engine for cyclical encoding. 
In this example, we'll encode the cyclical features **days of the week** and **months**. Let's create a toy dataframe with the variables "days" and "months": @@ -131,6 +131,8 @@ The maximum values used for the transformation are stored in the attribute cyclical.max_values_ +Below we see the maximum values of each variable: + .. code:: python {'day': 7, 'months': 12} @@ -141,7 +143,7 @@ Let's have a look at the transformed dataframe: print(X.head()) -We see that the new variables were added at the right of our dataframe. +We see that the new variables were added at the right of our dataframe: .. code:: python @@ -166,7 +168,7 @@ the feature creation, we can set the parameter to `True`: print(X.head()) The resulting dataframe contains only the cyclical encoded features; the original variables -are removed: +were removed: .. code:: python @@ -200,7 +202,7 @@ Understanding cyclical encoding ------------------------------- We now know how to convert cyclical variables into (x, y) coordinates of a circle by using -the sine and cosine functions. Let’s now carry out some visualizations to better understand +the sine and cosine functions. Let’s now carry out some visualisations to better understand the effect of this transformation. Let's create a toy dataframe: @@ -266,7 +268,7 @@ These are the sine and cosine features that represent the hour: Let's now plot the hour variable against its sine transformation. We add perpendicular -lines to flag the hours 0 and 22. +lines to flag the hours 0 and 22: .. code:: python @@ -369,10 +371,10 @@ functions and cyclical encoding. Feature-engine vs Scikit-learn ------------------------------ -Let's compare the implementations of cyclical encoding between Feature-engine and Scikit-learn. +Let's compare the implementations of cyclical encoding between feature-engine and scikit-learn. 
We'll work with the Bike sharing demand dataset, and we'll follow the implementation of -Cyclical encoding found in the `Time related features documentation `_ -from Scikit-learn. +cyclical encoding found in the `Time related features documentation `_ +from scikit-learn. Let's load the libraries and dataset: @@ -409,7 +411,7 @@ In the following output we see the bike sharing dataset: 3 14.395 0.75 0.0 13 4 14.395 0.75 0.0 1 -To apply cyclical encoding with Scikit-learn, we can use the `FunctionTransformer`: +To apply cyclical encoding with scikit-learn, we can use the `FunctionTransformer`: .. code:: python @@ -476,7 +478,7 @@ and hour: [17379 rows x 6 columns] -With Feature-engine, we can do the same as follows: +With feature-engine, we can do the same as follows: .. code:: python @@ -533,10 +535,10 @@ the variable hour by 23, instead of 24, because the values of these variables va {'month': 12, 'weekday': 6, 'hour': 23} Practically, there isn't a big difference between the values of the dataframes returned by -Scikit-learn and Feature-engine, and I doubt that this subtle difference will incur in a big +scikit-learn and feature-engine, and I doubt that this subtle difference will incur in a big change in model performance. -However, if you want to divide the varibles weekday and hour by 7 and 24 respectively, you can +However, if you want to divide the variables weekday and hour by 7 and 24 respectively, you can do so like this: .. code:: python @@ -574,35 +576,6 @@ the user, with automation, we can only go that far. Additional resources -------------------- -For tutorials on how to create cyclical features, check out the following courses: - -.. figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -.. 
figure:: ../../images/fetsf.png - :width: 300 - :figclass: align-center - :align: right - :target: https://www.trainindata.com/p/feature-engineering-for-forecasting - - Feature Engineering for Time Series Forecasting - -| -| -| -| -| -| -| -| -| -| - For a comparison between one-hot encoding, ordinal encoding, cyclical encoding and spline encoding of cyclical features check out the following `sklearn demo `_. @@ -610,3 +583,14 @@ encoding of cyclical features check out the following Check also these Kaggle demo on the use of cyclical encoding with neural networks: - `Encoding Cyclical Features for Deep Learning `_. + + +For tutorials about this and other feature engineering methods check out these resources: + +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. + +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/creation/DecisionTreeFeatures.rst b/docs/user_guide/creation/DecisionTreeFeatures.rst index 5569e34c0..00fe389dd 100644 --- a/docs/user_guide/creation/DecisionTreeFeatures.rst +++ b/docs/user_guide/creation/DecisionTreeFeatures.rst @@ -7,19 +7,19 @@ DecisionTreeFeatures The winners of the KDD 2009 competition observed that many features had high mutual information with the target, but low correlation, leading them to conclude -that the relationships were non-linear. While non-linear relationships can be +that the relationships were non-linear. 
+ +While non-linear relationships can be captured by non-linear models, to leverage the information from these features with -linear models, we need to somehow transform that information into a linear, or +linear models, we need to transform that information into a linear, or monotonic relationship with the target. The output of decision trees, that is, their predictions, should be monotonic with -the target, if there is a good fit for the tree. - -In addition, decision trees trained on 2 or more features could capture feature -interactions that simpler models would miss. +the target, if there is a good fit for the tree. In addition, decision trees trained on +2 or more features could capture feature interactions that simpler models would miss. By enriching the dataset with features resulting from the predictions of decision trees, -we can create better performing models. On the downside the features resulting +we can create better performing models. On the downside, the features resulting from decision trees, are not easy to interpret or explain. :class:`DecisionTreeFeatures()` creates and adds features resulting from the predictions @@ -39,13 +39,13 @@ are the output of the `predict_proba` method of the model corresponding to the p of class 1. If the output is multiclass, on the other hand, the features are derived from the `predict` method, and hence return the predicted class. -Examples --------- +Python implementation +--------------------- In the rest of the document, we'll show the versatility of :class:`DecisionTreeFeatures()` to create multiple features by using decision trees. -Let's start by loading and displaying the California housing dataset from sklearn +Let's start by loading and displaying the California housing dataset from sklearn: .. 
code:: python @@ -76,7 +76,7 @@ Let's split the dataset into a training and a testing set: X, y, test_size=0.3, random_state=0) Combining features - integers ------------------------------ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We'll set up :class:`DecisionTreeFeatures()` to create **all possible** combinations of 2 features. To create all possible combinations we use integers with the `features_to_combine` @@ -88,7 +88,7 @@ parameter: dtf.fit(X_train, y_train) If we leave the parameter `variables` to `None`, :class:`DecisionTreeFeatures()` will combine -all numerical variables in the training set, in the way we indicate in `features_to_combine`. +all numerical variables in the training set in the way we indicate in `features_to_combine`. Since we set `features_to_combine=2`, the transformer will create all possible combinations of 1 or 2 variables. @@ -191,7 +191,7 @@ decision trees: [5 rows x 27 columns] Combining features - Lists --------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~ Let's say that we want to create features based of trees trained of 2 or more variables. Instead of using an integer in `features_to_combine`, we need to pass a list of integers, telling :class:`DecisionTreeFeatures()` @@ -266,7 +266,7 @@ In the following output we see the dataframe with the new features: 15709 1.843904 Specifying the feature combinations - tuples --------------------------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We can indicate precisely the features that we want to use as input of the decision trees. Let's make a tuple containing the features combinations. 
We want a tree trained with @@ -316,7 +316,7 @@ And now we can go ahead and add the features to the data: test_t = dtf.transform(X_test) Examining the new features --------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~ :class:`DecisionTreeFeatures()` appends the word `tree` to the new features, so if we wanted to display only the new features, we can do so as follows @@ -344,7 +344,7 @@ we wanted to display only the new features, we can do so as follows Evaluating individual trees ---------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~ We can evaluate the performance of each of the trees used to create the features, if we so wish. Let's set up the :class:`DecisionTreeFeatures()`: @@ -355,7 +355,7 @@ we so wish. Let's set up the :class:`DecisionTreeFeatures()`: dtf.fit(X_train, y_train) :class:`DecisionTreeFeatures()` trains each tree with cross-validation. If we do not -pass a grid with hyperparameters, it will optimize the depth by default. We can find +pass a grid with hyperparameters, it will optimise the depth by default. We can find the trained estimators like this: .. code:: python @@ -396,7 +396,7 @@ of the feature **Population** to predict house price: {'max_depth': 2} -If we want to check out the performance of the best tree during found in the grid search, +If we want to check out the performance of the best tree found with the grid search, we can do so like this: .. code:: python @@ -404,13 +404,16 @@ we can do so like this: tree.score(X_test[['Population']], y_test) The following performance value corresponds to the negative of the mean squared error -which is the metric optimised durign the search (you can select the metric to optimize -through the `scoring` parameter of :class:`DecisionTreeFeatures()`). +which is the metric optimised during the search: .. code:: python -1.3308515769033213 +.. note:: + + You can select the metric to optimise through the `scoring` parameter of :class:`DecisionTreeFeatures()`. 
+ Note that you can also isolate the tree, and then obtain a performance metric: .. code:: python @@ -418,18 +421,18 @@ Note that you can also isolate the tree, and then obtain a performance metric: tree.best_estimator_.score(X_test[['Population']], y_test) In this case, the following performance metric corresponds to the R2, which is the -default metric returned by scikit-learn's DecisionTreeRegressor. +default metric returned by scikit-learn's DecisionTreeRegressor: .. code:: python 0.0017890442253447603 Dropping the original variables -------------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -With :class:`DecisionTreeFeatures()`, we can automatically remove from the resulting -dataframe the features used as input from the decision trees. We need to set `drop_original` -to `True`. +With :class:`DecisionTreeFeatures()`, we can automatically remove the features used as +input for the decision trees from the resulting dataframe. We need to set `drop_original` +to `True`: .. code:: python @@ -446,7 +449,7 @@ to `True`. print(test_t.head()) -We see in the resulting dataframe that the variables ["AveRooms", "AveBedrms", "Population"] +We see in the resulting dataframe that the variables `AveRooms`, `AveBedrms`, `Population` are not there: .. code:: python @@ -473,64 +476,27 @@ are not there: 15709 1.843904 Creating features for classification ------------------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If we are creating features for a classifier instead of a regressor, the procedure is identical. We just need to set the parameter `regression` to False. -Note that if you are creating features for binary classification, the added features -will contain the probabilities of class 1. If you are creating features for multi-class -classification, on the other hand, the features will contain the prediction of the class. +.. 
note:: + + Note that if you are creating features for binary classification, the added features + will contain the probabilities of class 1. If you are creating features for multi-class + classification, on the other hand, the features will contain the prediction of the class. Additional resources -------------------- -For more details about this and other feature engineering methods check out these resources: - - -.. figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -| -| -| -| -| -| -| -| -| -| - -Or read our book: - -.. figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 - - Python Feature Engineering Cookbook - -| -| -| -| -| -| -| -| -| -| -| -| -| - -Both our book and course are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. \ No newline at end of file +For tutorials about this and other feature engineering methods check out these resources: + +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. + +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. 
\ No newline at end of file diff --git a/docs/user_guide/creation/GeoDistanceFeatures.rst b/docs/user_guide/creation/GeoDistanceFeatures.rst index db9984f5b..5c991cae5 100644 --- a/docs/user_guide/creation/GeoDistanceFeatures.rst +++ b/docs/user_guide/creation/GeoDistanceFeatures.rst @@ -9,7 +9,7 @@ GeoDistanceFeatures coordinate pairs (latitude/longitude) and adds the result as a new feature. :class:`GeoDistanceFeatures()` is useful for location-based machine learning problems such as -real estate pricing, delivery route optimization, ride-sharing applications, +real estate pricing, delivery route optimisation, ride-sharing applications, and any domain where geographic proximity is relevant. Distance Methods @@ -34,9 +34,8 @@ The distance can be returned in various units: - **meters**: Meters - **feet**: Feet -Python Demo ------------ - +Python implementation +--------------------- Let's create a dataframe with origin and destination coordinates: .. code:: python @@ -88,7 +87,7 @@ Using different distance methods ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We can use the Euclidean distance method, which provides a faster but less accurate -calculation suitable for short distances: +calculation (suitable for short distances): .. code:: python @@ -234,3 +233,16 @@ The pipeline successfully trains and returns predictions: .. code:: python Predictions: [100. 150. 80. 200.] + +Additional resources +-------------------- + +For tutorials about this and other feature engineering methods check out these resources: + +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. + +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. 
\ No newline at end of file diff --git a/docs/user_guide/creation/MathFeatures.rst b/docs/user_guide/creation/MathFeatures.rst index 3a83b2c0a..09cc83cba 100644 --- a/docs/user_guide/creation/MathFeatures.rst +++ b/docs/user_guide/creation/MathFeatures.rst @@ -53,8 +53,8 @@ The variable **total_number_payments** is obtained by adding up the features indicated in `variables`, whereas the variable **mean_number_payments** is the mean of those 4 features. -Examples --------- +Python implementation +--------------------- Let's dive into how we can use :class:`MathFeatures()` in more details. Let's first create a toy dataset: @@ -71,7 +71,7 @@ create a toy dataset: "City": ["London", "Manchester", "Liverpool", "Bristol"], "Age": [20, 21, 19, 18], "Marks": [0.9, 0.8, 0.7, 0.6], - "dob": pd.date_range("2020-02-24", periods=4, freq="T"), + "dob": pd.date_range("2020-02-24", periods=4, freq="min"), }) print(df) @@ -101,8 +101,8 @@ strings to indicate the functions: print(df_t) -And we obtain the following dataset, where the new variables are named after the function -used to obtain them, plus the group of variables that were used in the computation: +We obtain the following dataset, where the new variables are named after the function +used to create them, plus the group of variables that were used in the computation: .. code:: python @@ -132,7 +132,7 @@ For more flexibility, we can pass existing functions to the `func` argument as f print(df_t) -And we obtain the following dataframe: +We obtain the following dataframe: .. code:: python @@ -229,51 +229,12 @@ provided: Additional resources -------------------- -For more details about this and other feature engineering methods check out these resources: - - -.. 
figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -| -| -| -| -| -| -| -| -| -| - -Or read our book: - -.. figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 - - Python Feature Engineering Cookbook - -| -| -| -| -| -| -| -| -| -| -| -| -| - -Both our book and course are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. \ No newline at end of file +For tutorials about this and other feature engineering methods check out these resources: + +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. + +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/creation/RelativeFeatures.rst b/docs/user_guide/creation/RelativeFeatures.rst index c9262feb3..0568632c2 100644 --- a/docs/user_guide/creation/RelativeFeatures.rst +++ b/docs/user_guide/creation/RelativeFeatures.rst @@ -44,8 +44,8 @@ The precedent code block will return a new dataframe, Xt, with 4 new variables t calculated as the division of each one of the variables in `variables` and 'total_payments'. -Examples --------- +Python implementation +--------------------- Let's dive into how we can use :class:`RelativeFeatures()` in more details. 
Let's first create a toy dataset: @@ -61,7 +61,7 @@ create a toy dataset: "City": ["London", "Manchester", "Liverpool", "Bristol"], "Age": [20, 21, 19, 18], "Marks": [0.9, 0.8, 0.7, 0.6], - "dob": pd.date_range("2020-02-24", periods=4, freq="T"), + "dob": pd.date_range("2020-02-24", periods=4, freq="min"), }) print(df) @@ -76,8 +76,8 @@ The dataset looks like this: 2 krish Liverpool 19 0.7 2020-02-24 00:02:00 3 jack Bristol 18 0.6 2020-02-24 00:03:00 -We can now apply several functions between the numerical variables Age and Marks and Age -as follows: +We can now apply several functions between the numerical variables `Age` and `Marks` +and `Age` as follows: .. code:: python @@ -91,7 +91,7 @@ as follows: print(df_t) -And we obtain the following dataset, where the new variables are named after the variables +We obtain the following dataset, where the new variables are named after the variables that were used for the calculation and the function in the middle of their names. Thus, `Mark_sub_Age` means `Mark - Age`, and `Marks_mod_Age` means `Mark % Age`. @@ -144,51 +144,12 @@ Which will return the names of all the variables in the transformed data: Additional resources -------------------- -For more details about this and other feature engineering methods check out these resources: - - -.. figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -| -| -| -| -| -| -| -| -| -| - -Or read our book: - -.. figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 - - Python Feature Engineering Cookbook - -| -| -| -| -| -| -| -| -| -| -| -| -| - -Both our book and course are suitable for beginners and more advanced data scientists -alike. 
By purchasing them you are supporting Sole, the main developer of Feature-engine. \ No newline at end of file +For tutorials about this and other feature engineering methods check out these resources: + +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. + +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/creation/index.rst b/docs/user_guide/creation/index.rst index 88672bc1a..b97dd1eec 100644 --- a/docs/user_guide/creation/index.rst +++ b/docs/user_guide/creation/index.rst @@ -3,7 +3,7 @@ Feature Creation ================ -Feature creation, is a common step during data preprocessing, and consists of constructing new +Feature creation is a common step during data preprocessing and consists of constructing new variables from the dataset’s original features. By combining two or more variables, we develop new features that can improve the performance of a machine learning model, capture additional information or relationships among variables, or simply make more sense within the domain we @@ -15,18 +15,21 @@ is a feature engineering technique used to transform a categorical feature into binary variables that represent each category. Another common feature extraction procedure consist of creating new features from past -values of time series data, for example through the use of lags and windows. +values of time series data, for example through the use of :ref:`lags ` +and :ref:`windows `. -In general, creating features requires a dose of domain knowledge and significant time -invested in analyzing the raw data, including evaluating the relationship between the independent or -predictor variables and the dependent or target variable in the dataset. +.. 
note:: + + In general, creating features requires a dose of domain knowledge and significant time + invested in analysing the raw data, including evaluating the relationship between the independent or + predictor variables and the dependent or target variable in the dataset. Feature creation can be one of the more creative aspects of feature engineering, and the new features can help improve a predictive model’s performance. -Lastly, a data scientist should be mindful that creating new features may increase the dimensionality -of the dataset quite dramatically. For example, one hot encoding of highly cardinal categorical -features results in lots of binary variables, and so does polynomial combinations of high powers. +A data scientist should be mindful that creating new features may increase the dimensionality +of the dataset quite dramatically. For example, one-hot encoding of highly cardinal categorical +features results in lots of binary variables, and so do polynomial combinations of high degree. This may have downstream effects depending on the machine learning algorithm being used. For example, decision trees are known for not being able to cope with huge number of features. @@ -34,27 +37,33 @@ Creating New Features with Feature-engine ----------------------------------------- Feature-engine has several transformers that create and add new features to the dataset. One of -the most popular ones is the `OneHotEncoder `_ +the most popular ones is `OneHotEncoder `_ that creates dummy variables from categorical features. With Feature-engine we can also create new features from time series data through lags and windows by using `LagFeatures `_ or `WindowFeatures `_. +Feature-engine also supports the creation of new features from :ref:`datetime ` variables +as well as the extraction of features from :ref:`text `. 
+ Feature-engine’s creation module, supports transformers that create and add new features to a pandas dataframe by either combining existing features through different mathematical or statistical operations, or through feature transformations. These transformers operate with numerical variables, that is, those with integer and float data types. -Summary of Feature-engine’s feature-creation transformers: - -- **CyclicalFeatures** - Creates two new features per variable by applying the trigonometric operations sine and cosine to the original feature. - -- **MathFeatures** - Combines a set of features into new variables by applying basic mathematical functions like the sum, mean, maximum or standard deviation. +Summary of Feature-engine’s creation transformers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -- **RelativeFeatures** - Utilizes basic mathematical functions between a group of variables and one or more reference features, appending the new features to the pandas dataframe. - -- **DecisionTreeFeatures** - Creates new features as the output of decision trees trained on 1 or more feature combinations. +================================== ===================================================================================================================================== + Transformer Description +================================== ===================================================================================================================================== +:class:`CyclicalFeatures()` Creates 2 new features per variable by applying the trigonometric operations sine and cosine. +:class:`DecisionTreeFeatures()` Creates new features as the output of decision trees trained on 1 or more feature combinations. +:class:`GeoDistanceFeatures()` Creates distance features from latitude and longitude. +:class:`MathFeatures()` Combines a set of features into new variables by applying mathematical functions like sum, mean, maximum or standard deviation. 
+:class:`RelativeFeatures()` Combines features with math functions like subtraction, division, or modulo. +================================== ===================================================================================================================================== Feature creation module ----------------------- @@ -63,24 +72,24 @@ Feature creation module :maxdepth: 1 CyclicalFeatures - MathFeatures - RelativeFeatures DecisionTreeFeatures GeoDistanceFeatures + MathFeatures + RelativeFeatures Feature-engine in Practice -------------------------- -Here, you'll get a taste of the transformers from the feature creation module from Feature-engine. +Here, you'll get a taste of the transformers from the feature creation module from feature-engine. We'll use the wine quality dataset. The dataset is comprised of 11 features, including `alcohol`, -`ash`, and ``flavonoids``, and has `quality` as its target variable. +`ash`, and `flavonoids`, and has `quality` as its target variable. Through exploratory data analysis and our domain knowledge which includes real-world experimentation, i.e., drinking various brands/types of wine, we believe that we can create better features to train our algorithm by combining original features with various mathematical operations. -Let's load the dataset from Scikit-learn. +Let's load the dataset from scikit-learn. .. code:: python @@ -118,7 +127,7 @@ Below we see the wine quality dataset: Now, we create a new feature by removing non-flavonoid phenols from the total phenols to -obtain the phenols that are not flavonoid. +obtain the phenols that are not flavonoid: .. code:: python @@ -229,7 +238,7 @@ We see the new features at the right of the resulting pandas dataframe: In the above examples, we used `RelativeFeature()` and `MathFeatures` to perform automated feature engineering on the input data by applying the transformations defined in the `func` parameter on -the features identified in `variables` and ``reference`` parameters. 
+the features identified in `variables` and `reference` parameters. The original and new features can now be used to train a regression model, or a multiclass classification algorithm, to predict the `quality` of the wine. @@ -237,75 +246,36 @@ classification algorithm, to predict the `quality` of the wine. Summary ------- -Through feature engineering and feature creation, we can optimize the machine learning algorithm's +Through feature engineering and feature creation, we can optimise the machine learning algorithm's learning process and improve its performance metrics. We'd strongly recommend the creation of features based on domain knowledge, exploratory data analysis and thorough data mining. We also understand that this is not always possible, particularly with big datasets and limited time allocated to each project. In this situation, we can combine -the creation of features with feature selection procedures to let machine learning algorithms +the creation of features with :ref:`feature selection ` procedures to let machine learning algorithms select what works best for them. Good luck with your models! -Tutorials, books and courses ----------------------------- - -For tutorials about this and other feature engineering for machine learning methods check out -our online course: - -.. figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -| -| -| -| -| -| -| -| -| -| - -Or read our book: - -.. 
figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 +Additional resources +-------------------- - Python Feature Engineering Cookbook +For tutorials about this and other feature engineering methods check out these resources: -| -| -| -| -| -| -| -| -| -| -| -| -| +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. -Both our book and course are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. Transformers in other Libraries ------------------------------- -Check also the following transformer from Scikit-learn: +Check also the following transformer from scikit-learn: * `PolynomialFeatures `_ * `SplineTransformer `_ diff --git a/docs/user_guide/datetime/DatetimeFeatures.rst b/docs/user_guide/datetime/DatetimeFeatures.rst index 9f3a58db3..62414142c 100644 --- a/docs/user_guide/datetime/DatetimeFeatures.rst +++ b/docs/user_guide/datetime/DatetimeFeatures.rst @@ -7,7 +7,7 @@ DatetimeFeatures In datasets commonly used in data science and machine learning projects, the variables very often contain information about date and time. **Date of birth** and **time of purchase** are two -examples of these variables. They are commonly referred to as “datetime features”, that is, +examples of these variables. They are commonly referred to as *“datetime features”*, that is, data whose data type is date and time. 
We don’t normally use datetime variables in their raw format to train machine learning models, @@ -15,28 +15,57 @@ like those for regression, classification, or clustering. Instead, we can extrac from these variables by extracting the different date and time components of the datetime variable. -Examples of date and time components are the year, the month, the week_of_year, the day -of the week, the hour, the minutes, and the seconds. +Examples of date and time components are the year, the month, the week of the year, the day +of the week, the hour, the minutes, and the seconds, among others. Datetime features with pandas ----------------------------- In Python, we can extract date and time components through the `dt` module of the open-source -library pandas. For example, by executing the following: +library pandas. For example, if we have the following dataframe: .. code:: python + import pandas as pd data = pd.DataFrame({"date": pd.date_range("2019-03-05", periods=20, freq="D")}) + +We can extract the features *year*, *quarter* and *month* by executing the following: + +.. code:: python + data["year"] = data["date"].dt.year data["quarter"] = data["date"].dt.quarter data["month"] = data["date"].dt.month -In the former code block we created 3 features from the timestamp variable: the *year*, the -*quarter* and the *month*. - - -Datetime features with Feature-engine +If we now execute `print(data)`, we'll obtain the following dataframe with the date and time +components added as columns: + +.. 
code:: python + + date year quarter month + 0 2019-03-05 2019 1 3 + 1 2019-03-06 2019 1 3 + 2 2019-03-07 2019 1 3 + 3 2019-03-08 2019 1 3 + 4 2019-03-09 2019 1 3 + 5 2019-03-10 2019 1 3 + 6 2019-03-11 2019 1 3 + 7 2019-03-12 2019 1 3 + 8 2019-03-13 2019 1 3 + 9 2019-03-14 2019 1 3 + 10 2019-03-15 2019 1 3 + 11 2019-03-16 2019 1 3 + 12 2019-03-17 2019 1 3 + 13 2019-03-18 2019 1 3 + 14 2019-03-19 2019 1 3 + 15 2019-03-20 2019 1 3 + 16 2019-03-21 2019 1 3 + 17 2019-03-22 2019 1 3 + 18 2019-03-23 2019 1 3 + 19 2019-03-24 2019 1 3 + +Datetime features with feature-engine ------------------------------------- :class:`DatetimeFeatures()` automatically extracts several date and time features from @@ -46,7 +75,7 @@ format. It *cannot* extract features from numerical variables. :class:`DatetimeFeatures()` uses the pandas `dt` module under the hood, therefore automating datetime feature engineering. In two lines of code and by specifying which features we -want to create with :class:`DatetimeFeatures()`, we can create multiple date and time variables +want to create, with :class:`DatetimeFeatures()` we can create multiple date and time variables from various variables simultaneously. :class:`DatetimeFeatures()` can automatically create all features supported by pandas `dt` @@ -81,7 +110,7 @@ First, we will create a toy dataframe with 2 date variables: }) Now, we will extract the variables month, month-end and the day of the year from the -second datetime variable in our dataset. +second datetime variable in our dataset: .. code:: python @@ -107,9 +136,11 @@ We see the new features in the following output: 2 Jan-1999 8 0 215 3 Feb-2002 10 1 305 -By default, :class:`DatetimeFeatures()` drops the variable from which the date and time -features were extracted, in this case, *var_date2*. To keep the variable, we just need -to indicate `drop_original=False` when initializing the transformer. +.. 
note:: + + By default, :class:`DatetimeFeatures()` drops the variable from which the date and time + features were extracted, in this case, *var_date2*. To keep the variable, we just need + to indicate `drop_original=False` when initialising the transformer. Finally, we can obtain the name of the variables in the returned data as follows: @@ -117,6 +148,9 @@ Finally, we can obtain the name of the variables in the returned data as follows dtfs.get_feature_names_out() +In the following output, we see the name of the remaining original variables plus the +newly created features: + .. code:: python ['var_date1', @@ -131,7 +165,7 @@ Extract time features In this example, we are going to extract the feature *minute* from the two time variables in our dataset. -First, let's create a toy dataset with 2 time variables and an object variable. +First, let's create a toy dataset with 2 time variables and an object variable: .. code:: python @@ -181,15 +215,19 @@ The variables detected as datetime are stored in the transformer's `variables_` dfts.variables_ +In the following output we see the names of the datetime variables identified by the transformer: + .. code:: python ['var_time1', 'var_time2'] -The original datetime variables are dropped from the data by default. This leaves the -dataset ready to train machine learning algorithms like linear regression or random forests. +.. note:: + + The original datetime variables are dropped from the data by default. This leaves the + dataset ready to train machine learning algorithms like linear regression or random forests. -If we want to keep the datetime variables, we just need to indicate `drop_original=False` -when initializing the transformer. + If we want to keep the datetime variables, we just need to indicate `drop_original=False` + when initialising the transformer.
Finally, if we want to obtain the names of the variables in the output data, we can use: @@ -197,6 +235,8 @@ Finally, if we want to obtain the names of the variables in the output data, we dfts.get_feature_names_out() +Below the names of the variables in the resulting dataframe: + .. code:: python ['not_a_dt', 'var_time1_minute', 'var_time2_minute'] @@ -209,7 +249,7 @@ In this example, we will combine what we have seen in the previous two examples and extract a date feature - *year* - and time feature - *hour* - from two variables that contain both date and time information. -Let's go ahead and create a toy dataset with 3 datetime variables. +Let's go ahead and create a toy dataset with 3 datetime variables: .. code:: python @@ -224,7 +264,7 @@ Let's go ahead and create a toy dataset with 3 datetime variables. Now, we set up the :class:`DatetimeFeatures()` to extract features from 2 of the datetime variables. In this case, we do not want to drop the datetime variable after extracting -the features. +the features: .. code:: python @@ -254,6 +294,13 @@ We can see the resulting dataframe in the following output: And that is it. The new features are now added to the dataframe. +.. tip:: + + The original datetime variables remain in the dataframe, in case we want to calculate + the time difference between them (see :ref:`DatetimeSubtraction() `) + or recode them as time passed since a specific date (see :ref:`DatetimeOrdinal() `). + + Time series ~~~~~~~~~~~ @@ -305,7 +352,7 @@ We can extract features from the index as follows: Xtr We can see that the transformer created the default time features and added them at -the end of the dataframe. +the end of the dataframe: .. code:: python @@ -335,6 +382,8 @@ We can obtain the name of all the variables in the output dataframe as follows: dtf.get_feature_names_out() +Below the name of the variables in the resulting dataframe: + .. 
code:: python ['ambient_temp', @@ -385,6 +434,9 @@ And now we mistakenly extract only date features: df_transf +As you see in the following output, the transformer will still create features derived +from today's date (the date of creating the docs). + .. code:: python not_a_dt var_time1_year var_time1_month var_time1_day_of_week var_time2_year \ @@ -399,8 +451,7 @@ And now we mistakenly extract only date features: 2 12 2 3 12 2 -The transformer will still create features derived from today's date (the date of -creating the docs). + If instead we have a dataframe with only date variables: @@ -426,6 +477,8 @@ And we mistakenly extract the hour and the minute: print(df_transf) +The new features will contain the value 0, as seen in the resulting dataframe: + .. code:: python var_date1_hour var_date1_minute var_date2_hour var_date2_minute @@ -434,7 +487,6 @@ And we mistakenly extract the hour and the minute: 2 0 0 0 0 3 0 0 0 0 -The new features will contain the value 0. Automating feature extraction ----------------------------- @@ -476,6 +528,9 @@ To do this, we leave the parameter `features_to_extract` to `None`. df_transf +The resulting dataset contains the original features plus the new variables extracted +from them: + .. code:: python var_dt1 var_dt2 var_dt3 var_dt1_month \ @@ -493,15 +548,15 @@ To do this, we leave the parameter `features_to_extract` to `None`. 1 0 0 2 0 0 -Our new dataset contains the original features plus the new variables extracted -from them. -We can find the group of features extracted by the transformer in its attribute: +We can find the group of features extracted by the transformer in the following attribute: .. code:: python dfts.features_to_extract_ +Below, the date and time features that will be extracted from the datetime variables: + .. 
code:: python ['month', @@ -530,6 +585,8 @@ We can also extract all supported features automatically, by setting the paramet print(df_transf) +Below we see the resulting dataframe with all the features supported by feature-engine: + .. code:: python var_dt1 var_dt2 var_dt3 var_dt1_month \ @@ -562,12 +619,14 @@ We can also extract all supported features automatically, by setting the paramet 1 0 2 0 -We can find the group of features extracted by the transformer in its attribute: +We can find the group of features extracted by the transformer in the following attribute: .. code:: python dfts.features_to_extract_ +Below we see the date and time features that will be extracted from the datetime variable: + .. code:: python ['month', @@ -598,7 +657,7 @@ If we have a dataframe with date variables, time variables and date and time var we can extract all features, or the most common features from all the variables, and then go ahead and remove the irrelevant features with the `DropConstantFeatures()` class. -Let's create a dataframe with a mix of datetime variables. +Let's create a dataframe with a mix of datetime variables: .. code:: python @@ -613,11 +672,12 @@ Let's create a dataframe with a mix of datetime variables. "var_dt": ['08/31/00 12:34:45', '12/01/90 23:01:02', '04/25/01 11:59:21', '04/25/01 11:59:21'], }) -Now, we line up in a Scikit-learn pipeline the :class:`DatetimeFeatures` and the -`DropConstantFeatures()`. The :class:`DatetimeFeatures` will create date features -derived from today for the time variable, and time features with the value 0 for the -date only variable. `DropConstantFeatures()` will identify and remove these features -from the dataset. +Now, we set up a scikit-learn pipeline with :class:`DatetimeFeatures` and +`DropConstantFeatures()`. + +:class:`DatetimeFeatures` will create date features with today's date for the time variable, +and time features with the value 0 for the date variable. 
`DropConstantFeatures()` will +identify and remove these features from the dataset. .. code:: python @@ -628,17 +688,23 @@ from the dataset. pipe.fit(toy_df) +Below we see the output of fitting the pipeline: + .. code:: python Pipeline(steps=[('datetime', DatetimeFeatures()), ('drop_constant', DropConstantFeatures())]) +Now, we extract the datetime features and remove those that are constant: + .. code:: python df_transf = pipe.transform(toy_df) print(df_transf) +In the following output we see the resulting dataframe: + .. code:: python var_date_month var_date_year var_date_day_of_week var_date_day_of_month \ @@ -676,7 +742,7 @@ with such variables in three different scenarios. **Case 1**: our dataset contains a time-aware variable in object format, with potentially different timezones across different observations. -We pass `utc=True` when initializing the transformer to make sure it +We pass `utc=True` when initialising the transformer to make sure it converts all data to UTC timezone. .. code:: python @@ -699,6 +765,8 @@ converts all data to UTC timezone. df_transf +Below the resulting dataframe: + .. code:: python var_tz var_tz_hour var_tz_minute @@ -708,10 +776,12 @@ converts all data to UTC timezone. 3 08:44:23Z 8 44 -**Case 2**: our dataset contains a variable that is cast as a localized +**Case 2**: our dataset contains a variable that is cast as a localised datetime in a particular timezone. However, we decide that we want to get all the datetime information extracted as if it were in UTC timezone. +Let's create a toy dataset with a datetime variable localised in the US eastern time zone: + .. code:: python import pandas as pd @@ -722,6 +792,8 @@ the datetime information extracted as if it were in UTC timezone. var_tz = var_tz.dt.tz_localize("US/eastern") var_tz +Below our toy dataset: + .. code:: python 0 2000-08-31 12:34:45-04:00 @@ -729,8 +801,8 @@ the datetime information extracted as if it were in UTC timezone. 
2 2001-04-25 11:59:21-04:00 dtype: datetime64[ns, US/Eastern] -We need to pass `utc=True` when initializing the transformer to revert back to the UTC -timezone. +We need to pass `utc=True` when initialising the transformer to revert back to the UTC +timezone before extracting the features: .. code:: python @@ -746,6 +818,10 @@ timezone. df_transf +In the output we see the resulting dataframe, where the variable was first set to UTC and +after that the features were created (see the mismatch between the hour in the original variable +and the extracted feature): + .. code:: python var_tz var_tz_day_of_month var_tz_hour @@ -755,7 +831,7 @@ timezone. **Case 3**: given a variable like *var_tz* in the example above, we now want -to extract the features keeping the original timezone localization, +to extract the features keeping the original timezone localisation, therefore we pass `utc=False` or `None`. In this case, we leave it to `None` which is the default option. @@ -771,6 +847,9 @@ is the default option. print(df_transf) +In the following dataset, we can see that the features were extracted respecting the US +eastern time zone: + .. code:: python var_tz var_tz_day_of_month var_tz_hour @@ -791,62 +870,12 @@ when a missing value is encountered in a datetime variable. Additional resources -------------------- -You can find an example of how to use :class:`DatetimeFeatures()` with a real dataset in -the following `Jupyter notebook `_ - -For tutorials on how to create and use features from datetime columns, check the following courses: - -.. figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -.. 
figure:: ../../images/fetsf.png - :width: 300 - :figclass: align-center - :align: right - :target: https://www.trainindata.com/p/feature-engineering-for-forecasting - - Feature Engineering for Time Series Forecasting - -| -| -| -| -| -| -| -| -| -| - -Or read our book: - -.. figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 - - Python Feature Engineering Cookbook - -| -| -| -| -| -| -| -| -| -| -| -| -| - - -Both our book and course are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. \ No newline at end of file +For tutorials about this and other feature engineering methods check out these resources: + +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. + +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/datetime/DatetimeOrdinal.rst b/docs/user_guide/datetime/DatetimeOrdinal.rst index 4ca543e06..54ad52b2e 100644 --- a/docs/user_guide/datetime/DatetimeOrdinal.rst +++ b/docs/user_guide/datetime/DatetimeOrdinal.rst @@ -48,7 +48,7 @@ The output shows the new ordinal feature: In the variable `ordinal`, the value `738521` means that `2023-01-01` is 738521 days *after* the 1st of January of the year 1. -Datetime ordinal with Feature-engine +Datetime ordinal with feature-engine ------------------------------------ :class:`DatetimeOrdinal()` automatically converts one or more datetime variables into @@ -62,8 +62,8 @@ functionalities are: - It can compute the ordinal number relative to a `start_date`. 
- It can automatically find and select datetime variables. -Example -~~~~~~~ +Python implementation +--------------------- First, let's create a toy dataframe with 2 date variables: @@ -78,7 +78,7 @@ First, let's create a toy dataframe with 2 date variables: "other_var": [1, 2, 3, 4] }) -Now, we will set up the transformer to convert `var_date2` into an ordinal feature. +Now, we will set up the transformer to convert `var_date2` into an ordinal feature: .. code:: python @@ -105,7 +105,7 @@ Calculate days from a start date ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ :class:`DatetimeOrdinal()` can also calculate the number of days elapsed since a -specific `start_date`. +specific `start_date` as follows: .. code:: python @@ -119,7 +119,7 @@ specific `start_date`. df_transf The new feature now represents the number of days between `var_date2` and January 1st, -2010. Note that dates before the `start_date` will result in negative numbers. +2010. Note that dates before the `start_date` will result in negative numbers: .. code:: python @@ -146,67 +146,12 @@ ordinal feature will contain `NaN` (or `pd.NA`) in their place. Additional resources -------------------- -For tutorials on how to create and use features from datetime columns, check the following courses: - -.. figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -.. figure:: ../../images/fetsf.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-forecasting - - Feature Engineering for Time Series Forecasting - -| -| -| -| -| -| -| -| -| -| -| -| -| -| -| -| -| -| - -Or read our book: - -.. 
figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 - - Python Feature Engineering Cookbook - -| -| -| -| -| -| -| -| -| -| -| -| -| - - -Both our book and course are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. \ No newline at end of file +For tutorials about this and other feature engineering methods check out these resources: + +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. + +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/datetime/DatetimeSubtraction.rst b/docs/user_guide/datetime/DatetimeSubtraction.rst index 0085b8945..fb055533e 100644 --- a/docs/user_guide/datetime/DatetimeSubtraction.rst +++ b/docs/user_guide/datetime/DatetimeSubtraction.rst @@ -5,7 +5,7 @@ DatetimeSubtraction =================== -Very often, we have datetime variables in our datasets, and we want to determine the time +Often we have datetime variables in our datasets and we want to determine the time difference between them. For example, if we work with financial data, we may have the variable **date of loan application**, with the date and time when the customer applied for a loan, and also the variable **date of birth**, with the customer's date of birth. With those @@ -17,11 +17,11 @@ In a different example, if we are trying to predict the price of the house and w information about the year in which the house was built, we can infer the age of the house at the point of sale. Generally, older houses cost less. 
To calculate the age of the house, we’d simply compute the difference in years between the sale date and the date at which -it was built. +the house was built. -The Python program offers many options for making operations between datetime objects, like, +Python offers many options for making operations between datetime objects, like, for example, the datetime module. Since most likely you will be working with Pandas dataframes, -we will focus this guide on pandas and then how we can automate the procedure with Feature-engine. +we will focus this guide on pandas and then how we can automate the procedure with feature-engine. Subtracting datetime features with pandas ----------------------------------------- @@ -56,7 +56,7 @@ This is the data that we created, containing two datetime variables: 4 2019-03-09 2018-04-08 Now, we can subtract `date2` from `date1` and capture the difference in a new variable by -utilizing the pandas subtraction operator: +utilising the pandas subtraction operator: .. code:: python @@ -98,22 +98,22 @@ We see the new variable now expressing the difference in years, at the right of 4 2019-03-09 2018-04-08 0.917199 If you wanted to subtract various datetime variables, you would have to write lines of code -for every subtraction. Fortunately, we can automate this procedure with :class:`DatetimeSubstraction()`. +for every subtraction. Fortunately, we can automate this procedure with :class:`DatetimeSubtraction()`. -Datetime subtraction with Feature-engine +Datetime subtraction with feature-engine ---------------------------------------- -:class:`DatetimeSubstraction()` automatically subtracts several date and time features from +:class:`DatetimeSubtraction()` automatically subtracts several date and time features from each other. You just need to indicate the features at the right of the subtraction operation -in the `variables` parameters and those on the left in the `reference parameter`. 
You can also +in the `variables` parameters and those on the left in the `reference` parameter. You can also change the output unit through the `output_unit` parameter. -:class:`DatetimeSubstraction()` works with variables whose `dtype` is datetime, as well as +:class:`DatetimeSubtraction()` works with variables whose `dtype` is datetime, as well as with object-like and categorical variables, provided that they can be parsed into datetime format. This will be done under the hood by the transformer. Following up with the former example, here is how we obtain the difference in number of -days using :class:`DatetimeSubstraction()`: +days using :class:`DatetimeSubtraction()`: .. code:: python @@ -133,7 +133,7 @@ days using :class:`DatetimeSubstraction()`: print(data) -With `transform()`, :class:`DatetimeSubstraction()` returns a new dataframe containing the +With `transform()`, :class:`DatetimeSubtraction()` returns a new dataframe containing the original variables and also the new variables with the time difference: .. code:: python @@ -188,8 +188,8 @@ Subtract multiple variables simultaneously We can perform multiple subtractions at the same time. In this example, we will add new datetime variables to the toy dataframe as strings. The idea is to show that -:class:`DatetimeSubstraction()` will convert those strings to datetime under the hood to -carry out the subtraction operation. +:class:`DatetimeSubtraction()` will convert those strings to datetime under the hood to +carry out the subtraction operation: .. code:: python @@ -210,7 +210,7 @@ carry out the subtraction operation. print(data) The resulting dataframe contains the original variables plus the new variables expressing -the time difference between the date objects. +the time difference between the date objects: .. code:: python @@ -228,7 +228,7 @@ the time difference between the date objects. 
Working with missing values ~~~~~~~~~~~~~~~~~~~~~~~~~~~ -By default, :class:`DatetimeSubstraction()` will raise an error if the dataframe passed +By default, :class:`DatetimeSubtraction()` will raise an error if the dataframe passed to the `fit()` or `transform()` methods contains NA in the variables to subtract. We can override this behaviour and allow computations between variables with nan by setting the parameter `missing_values` to `"ignore"`. Here is a code example: @@ -275,7 +275,7 @@ Working with different timezones ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If we have timestamps in different timezones or variables in different timezones, we can -still perform subtraction operations with :class:`DatetimeSubstraction()` by first setting +still perform subtraction operations with :class:`DatetimeSubtraction()` by first setting all timestamps to the universal central time zone. Here is a code example, were we return the time difference in microseconds: @@ -314,8 +314,8 @@ We see the resulting dataframe with the time difference in microseconds: Adding arbitrary names to the new variables ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Often, we want to compute just a few time differences. In this case, we may want as well -to assign the new variables specific names. In this code example, we do so: +Often, we want to compute just a few time differences. In this case, we may also want +to assign specific names to the new variables. In this code example, we do so: .. code:: python @@ -348,15 +348,17 @@ called `my_new_var`: 3 2019-03-08 2018-04-01 341.0 4 2019-03-09 2018-04-08 335.0 -We should be mindful to pass a list of variales containing as many names as new variables. -The number of variables that will be created is obtained by multiplying the number of variables -in the parameter `variables` by the number of variables in the parameter `reference`. +.. note:: + + Be mindful to pass a list containing as many names as new variables. 
+ The number of variables that will be created is obtained by multiplying the number of variables + in the parameter `variables` by the number of variables in the parameter `reference`. get_feature_names_out() ~~~~~~~~~~~~~~~~~~~~~~~ -Finally, we can extract the names of the transformed dataframe for compatibility with the -Scikit-learn pipeline: +Finally, we can extract the names of the variables in the transformed dataframe for compatibility with +scikit-learn: .. code:: python @@ -454,59 +456,12 @@ difference: Additional resources -------------------- -For tutorials on how to create and use features from datetime columns, check the following courses: - -.. figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -.. figure:: ../../images/fetsf.png - :width: 300 - :figclass: align-center - :align: right - :target: https://www.trainindata.com/p/feature-engineering-for-forecasting - - Feature Engineering for Time Series Forecasting - -| -| -| -| -| -| -| -| -| -| - -Or read our book: - -.. figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 - - Python Feature Engineering Cookbook - -| -| -| -| -| -| -| -| -| -| -| -| -| - - -Both our book and course are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. \ No newline at end of file +For tutorials about this and other feature engineering methods check out these resources: + +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. 
+ +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/datetime/index.rst b/docs/user_guide/datetime/index.rst index 7aa90215e..fea5f34c6 100644 --- a/docs/user_guide/datetime/index.rst +++ b/docs/user_guide/datetime/index.rst @@ -1,10 +1,21 @@ -.. -*- mode: rst -*- +.. _datetime_module: Datetime Features ================= -Feature-engine’s datetime transformers are able to extract a wide variety of datetime -features from existing datetime or object-like data. +Feature-engine’s datetime transformers extract a wide variety of date and time features +from datetime variables. Datetime variables can be cast as datetime or object. + +Summary of Feature-engine’s datetime transformers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +================================ =============================================================================== + Transformer Description +================================ =============================================================================== +:class:`DatetimeFeatures()` Extracts features like day, month, year, hour, minute, second, and more. +:class:`DatetimeOrdinal()` Recodes variable as time elapsed since a certain date. +:class:`DatetimeSubtraction()` Calculates time difference between 2 datetime variables. +================================ =============================================================================== .. toctree:: :maxdepth: 1 diff --git a/docs/user_guide/text/TextFeatures.rst b/docs/user_guide/text/TextFeatures.rst index 84d1b4e22..04ae5511a 100644 --- a/docs/user_guide/text/TextFeatures.rst +++ b/docs/user_guide/text/TextFeatures.rst @@ -2,8 +2,8 @@ .. 
currentmodule:: feature_engine.text -Extracting Features from Text -============================= +TextFeatures +============ Short pieces of text are often found among the variables in our datasets. For example, in insurance, a text variable can describe the circumstances of an accident. Customer @@ -31,7 +31,7 @@ contain text data via the `variables` parameter. Unlike scikit-learn's CountVectorizer or TfidfVectorizer which create sparse matrices, :class:`TextFeatures()` extracts metadata features that remain in DataFrame format -and can be easily combined with other Feature-engine or sklearn transformers in a pipeline. +and can be easily combined with other feature-engine or sklearn transformers in a pipeline. Text Features ------------- @@ -119,8 +119,8 @@ count: 1 NaN 0 2 World 5 -Python demo ------------ +Python implementation +--------------------- In this section, we'll show how to use :class:`TextFeatures()`. Let's create a dataframe with text data: @@ -279,8 +279,8 @@ extracted features remain: 2 Average 9 27 3 Awful 4 20 -Combining with scikit-learn Bag-of-Words -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Combining with sklearn's Bag-of-Words +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In most NLP tasks, it is common to use bag-of-words (e.g., `CountVectorizer`) or TF-IDF (e.g., `TfidfVectorizer`) to represent the text. :class:`TextFeatures()` can be used @@ -363,3 +363,16 @@ By adding statistical metadata through :class:`TextFeatures()`, we provided the with information about text length, complexity, and style that is not explicitly captured by a word-count-based approach like TF-IDF, leading to a small but noticeable improvement in performance. + +Additional resources +-------------------- + +For tutorials about this and other feature engineering methods check out these resources: + +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. 
+- `Python Feature Engineering Cookbook `_, book. + +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. diff --git a/docs/user_guide/text/index.rst b/docs/user_guide/text/index.rst index 0a7ce55bb..f37d34835 100644 --- a/docs/user_guide/text/index.rst +++ b/docs/user_guide/text/index.rst @@ -1,4 +1,4 @@ -.. -*- mode: rst -*- +.. _text_module: Text Feature Extraction =======================