In recent years we have made real progress in automatic model selection and hyperparameter tuning, but the most important part of the machine learning pipeline, feature engineering, has been largely neglected. In this article, we will use the Featuretools library to see how automated feature engineering changes and improves the way machine learning is done.
Featuretools is an open source Python library for automated feature engineering
Automated feature engineering is a relatively new technique for tackling the data science problems posed by real-world data sets. It reduces development time, builds better predictive models, generates more meaningful features, and prevents data leakage. It is powerful enough that I believe it will become a standard part of any machine learning workflow.
Next, we will explore its power through the following two projects, each of which illustrates some of the advantages of automated feature engineering:
Loan repayment prediction: compared with manual feature engineering, automated feature engineering cuts machine learning development time by a factor of 10 while also delivering better model performance.
Project notebooks:
https://github.com/Featuretools/Automated-Manual-Comparison/tree/master/Loan%20Repayment
Retail spending prediction: automated feature engineering creates meaningful features and prevents data leakage through built-in time-based filtering, making the model safe to deploy.
Project notebooks:
https://github.com/Featuretools/Automated-Manual-Comparison/tree/master/Retail%20Spending
Manual feature engineering vs. automated feature engineering
Feature engineering is the process of taking a data set and constructing explanatory feature variables that can be used to train a machine learning model and make predictions. Often the data is spread across multiple tables and must be aggregated into a single table, where rows represent observations and columns represent features.
Manual feature engineering is the traditional approach: an analyst uses domain knowledge to construct features one at a time. This is a tedious, time-consuming, and error-prone process. Moreover, hand-written feature engineering code is problem-specific: whenever we face a new problem or a new data set, the code must be rewritten from scratch.
Automated feature engineering automatically extracts useful and meaningful features from a set of related data tables, using a standard framework that can be applied to any suitable data set. It not only reduces the time spent on feature engineering, but also creates interpretable features and prevents data leakage by filtering time-dependent data.
Loan repayment project
Build models faster and better
The Home Credit loan problem is a machine learning competition hosted on Kaggle whose goal is to predict whether a customer will repay a loan. For data scientists, the challenge lies in the size and structure of the data: the full data set contains 58 million rows spread across seven tables. Machine learning requires a single table for training, so feature engineering must extract and merge all the information about each customer into one table.
Feature engineering must gather all the information from the set of data tables and combine it into a single table
For this problem, I first tried traditional manual feature engineering: it took 10 hours to create a set of features by hand. First, I studied the work of other data scientists and explored the data to acquire the necessary domain knowledge. Then I translated that knowledge into code, building one feature at a time. One hand-built feature, for example, required joining 3 different tables to find the total number of late payments a customer made on previous loans.
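To make the effort concrete, here is a hypothetical pandas sketch of what one such hand-built feature looks like; the table and column names are invented for illustration, not taken from the actual competition data:

import pandas as pd

# Invented toy tables standing in for 3 of the 7 Home Credit tables
clients = pd.DataFrame({'client_id': [1, 2]})
previous_loans = pd.DataFrame({'loan_id': [10, 11, 12],
                               'client_id': [1, 1, 2]})
installments = pd.DataFrame({'loan_id': [10, 10, 11, 12],
                             'days_late': [0, 5, 3, 0]})

# Count late payments per loan, roll up to loans, then up to clients
late = installments[installments['days_late'] > 0]
n_late = late.groupby('loan_id').size().rename('n_late')
loans = previous_loans.join(n_late, on='loan_id')
per_client = loans.groupby('client_id')['n_late'].sum()
clients['total_late_payments'] = clients['client_id'].map(per_client).fillna(0)

Every hand-built feature requires a chain of joins and aggregations like this one, written and debugged individually.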
In the end, manual feature engineering performed quite well: a 65% improvement over the baseline features, which shows the value and importance of feature engineering.
However, the process was so inefficient that I cannot even describe it in full here. Each hand-built feature took more than 15 minutes, because the approach creates only one feature at a time.
Manual feature engineering process
Besides being tedious and time-consuming, manual feature engineering has further disadvantages:
Problem-specific: the code I spent hours writing for this project cannot be applied to any other problem
Error-prone: every line of hand-written code is an opportunity for a mistake
Furthermore, hand-built features are limited by human creativity and patience: a problem may call for a vast number of candidate features, and each one takes substantial time to construct.
From manual to automated feature engineering
As Featuretools demonstrates, automated feature engineering can create thousands of features from a set of related data tables; all we need to supply is the basic structure of the tables and the relationships between them. We store the tables and their relationships in a single data structure called an entity set. Once we have an entity set, we can build thousands of features with a single call to a method called deep feature synthesis (DFS).
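As a minimal sketch of what building an entity set looks like (using the pre-1.0 Featuretools API; the table and column names here are invented for illustration):

import featuretools as ft
import pandas as pd

# Invented toy tables: one row per client, several rows per loan
clients_df = pd.DataFrame({'client_id': [1, 2], 'income': [50000, 62000]})
loans_df = pd.DataFrame({'loan_id': [10, 11, 12],
                         'client_id': [1, 1, 2],
                         'loan_amount': [5000, 1200, 800]})

# Register each table as an entity, then declare the one-to-many relationship
es = ft.EntitySet(id='clients')
es = es.entity_from_dataframe(entity_id='clients', dataframe=clients_df,
                              index='client_id')
es = es.entity_from_dataframe(entity_id='loans', dataframe=loans_df,
                              index='loan_id')
es = es.add_relationship(ft.Relationship(es['clients']['client_id'],
                                         es['loans']['client_id']))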
Use Featuretools for automated feature engineering
DFS builds features using functions called "primitives" to aggregate and transform our data. These primitives can be as simple as taking the mean or maximum of a column, or as complex as domain expertise allows, because Featuretools lets us define custom primitives for our task.
Feature primitives include many of the operations we would otherwise perform by hand, but with Featuretools we can use the exact same syntax on any relational data set, so there is no need to rewrite these operations for each new data set. Moreover, when we stack primitives on top of one another to create deep features, the power of DFS becomes obvious.
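For example, we can browse the built-in primitives and choose which ones to apply. The lists below are just an illustrative selection of standard Featuretools primitives; they define the agg_primitives and trans_primitives variables later passed to ft.dfs():

import featuretools as ft

# Browse the built-in primitives (returns a DataFrame describing each one)
print(ft.list_primitives().head())

# An illustrative selection: aggregations applied across related tables,
# and transformations applied within a single table
agg_primitives = ['sum', 'max', 'mean', 'count', 'num_unique']
trans_primitives = ['percentile', 'day', 'month']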
For more information about DFS, you can refer to: https://
Below, I will show how this works in practice. With a single call to DFS, I can use the data in all 7 tables to create thousands of features for each customer, as shown below, where ft is the imported featuretools library:
# Deep feature synthesis
feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity='clients',
                                  agg_primitives=agg_primitives,
                                  trans_primitives=trans_primitives)
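The call returns both the feature matrix and the list of feature definitions; a quick way to inspect them:

# One row per client, one column per generated feature
print(feature_matrix.shape)
# The first few feature definitions, e.g. SUM(loans.loan_amount)
print(features[:5])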
Below are a few of the 1820 features Featuretools generated automatically, including:
The maximum total amount the customer paid on a previous loan. This is created by stacking MAX and SUM across 3 tables (hand-built for illustration in the sketch after this list).
The customer's average percentile rank of credit card debt. This is created using PERCENTILE and MEAN across 2 tables.
Whether the customer submitted two documents during the application process. This uses an AND transform on 1 table.
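As a sketch of how such a stacked feature could be expressed by hand (hypothetical table and column names, pre-1.0 Featuretools API):

from featuretools.primitives import Sum, Max

# Total amount paid per previous loan, then the max of that per client
total_paid = ft.Feature(es['payments']['amount'],
                        parent_entity=es['previous_loans'],
                        primitive=Sum)
max_total_paid = ft.Feature(total_paid,
                            parent_entity=es['clients'],
                            primitive=Max)
# Featuretools renders this name as 'MAX(previous_loans.SUM(payments.amount))'

DFS discovers and names features like this one automatically, without our having to write them out.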
Each feature is built from simple aggregations, so it remains interpretable. Featuretools can create many of the same features we could build by hand, plus a large number we would never have thought of, or that would have been too expensive to construct. Not every feature will be relevant to the problem, and some will be highly correlated with each other, but having too many features is a better problem than having too few.
After some feature selection and model optimization, these features also yielded better predictive performance, with a total development time of 1 hour, a 10x reduction compared with the manual process. Featuretools is fast and efficient, and because it requires less domain expertise, it also takes far fewer lines of code than manual feature engineering.
Learning Featuretools takes some time, but it is a worthwhile investment: after about an hour of study, you can apply it to any machine learning feature engineering problem.
The following chart is a summary of my loan repayment project:
Automated Feature Engineering vs Manual Feature Engineering: Development Time, Feature Number, and Performance Comparison
Development time: everything required to produce the final feature engineering code. Manual feature engineering took 10 hours; automated feature engineering took 1 hour.
Number of features generated: manual feature engineering produced 30 features, while automated feature engineering created 1820.
Performance improvement over the baseline when training a model on the extracted features: manual feature engineering improved performance by 65%, automated feature engineering by 66%.
In addition, the Featuretools code I wrote for this project can be applied to any data set, whereas the manual engineering code would have to be rewritten for each new data set.
Retail spending project
Build meaningful features and prevent data leakage
The second project is a customer retail spending forecast, using a data set of online customer transactions. The prediction task is to classify customers into two groups: those who will spend more than $500 in the next month and those who will not. Each customer yields multiple labels, one per month: for example, the customer's spending in May serves as one label, June as another, and so on.
Each client is a training sample used multiple times
Using each customer as a training example multiple times makes building the training data tricky: when we extract features for a customer as of a given month, we must not use any information from later months, even though that data sits in our data set. At deployment time, future data does not exist, so a model trained on it is trained on invalid data. This is a challenge we often face with real-world data sets: a model trained on leaked future information looks great in development but performs very poorly in production.
Fortunately, this problem is easy to handle in Featuretools. In the deep feature synthesis (DFS) call, we supply a cutoff time: the point past which no data may be used when computing a label's features. Featuretools then automatically takes time into account when building features.
For a given month, customer features are built only from data filtered to before that month. Note that the call to create the feature set is the same as in the loan repayment project, with one extra cutoff_time parameter, as shown below:
# Deep feature synthesis with cutoff times to prevent data leakage
feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity='customers',
                                  agg_primitives=agg_primitives,
                                  trans_primitives=trans_primitives,
                                  cutoff_time=cutoff_times)
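The cutoff_times argument is a table with one row per (customer, label time) pair; a hypothetical sketch of what it might contain:

import pandas as pd

# One row per customer per label month; features for each row are computed
# only from data recorded before 'time'. The 'label' column here is an
# assumed example, not part of the required schema.
cutoff_times = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'time': pd.to_datetime(['2011-05-01', '2011-06-01', '2011-05-01']),
    'label': [True, False, True],
})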
Running deep feature synthesis yields a feature table with one row per customer per month. We can use these features and labels to train our model and then make predictions for any future month, without worrying that the features contain future information or produce misleadingly high training scores.
Using the automated features, I built a machine learning model to predict customers' monthly spending. The results show that our model reaches 0.90 ROC AUC, significantly better than the baseline model's 0.69.
Beyond predictive performance, Featuretools provides something else of great value: interpretability. Below, let's look at the 15 most important features in the random forest model:
The 15 most important features in the random forest model, obtained using Featuretools
The feature importances tell us the most important factors for predicting how much a customer will spend next month. Here we see that the customer's total spending last month, SUM(purchases.total), and total purchase quantity, SUM(purchases.quantity), are the key drivers. These features could have been built by hand, but we would then have to worry about leaking data and building a model that performs deceptively well during development.
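For concreteness, a minimal sketch of how such a model and its importances might be computed, assuming feature_matrix from the DFS call above and a label series labels aligned with its rows (both assumptions, not code from the project):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import pandas as pd

# Assumed inputs: X from the DFS feature matrix, y the monthly labels
X = feature_matrix.fillna(0)
y = labels

# Random split shown for brevity; in practice a time-based split should
# be used so evaluation also respects the cutoff times
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print('ROC AUC:', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Rank features by importance, as in the chart above
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(15))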
If an automated tool can create meaningful features without any risk of leaking invalid data, why build them by hand? Moreover, the automated features are completely transparent and can be interpreted in real-world terms.
Even spending far more time than Featuretools required, I could not hand-build a feature set with anywhere near the same performance. The figure below shows the ROC curves of models trained on the two feature sets for predicting future monthly customer spending:
ROC curves for automated vs. manual feature engineering; a curve closer to the upper left indicates better performance
I am not even sure whether my manual features were built using only valid data, but Featuretools guarantees this by design, so I do not need to worry about data leakage in time-dependent problems. Being unable to hand-design a set of useful, valid features does not make someone a failed data scientist; but if an automated tool can do it safely, why not use it?
Conclusion
Based on these projects, I believe automated feature engineering will become an indispensable part of the machine learning workflow. The technology is not yet perfect, but it already delivers significant gains in efficiency.
Below I summarize some key points of automated feature engineering:
Can reduce development time by a factor of 10
Builds models with the same or even better performance
Provides meaningful, interpretable features
Prevents the model from using invalid, leaked data
Fits into existing workflows and machine learning models
Automated feature engineering makes all of these tasks easier. Our earlier introduction to Python-based automated feature engineering shows how to quickly get started creating machine learning features automatically.