How To Select The Right Variables From A Large Dataset?
What Is Feature Selection?
Imagine that we have a dataset with 1,000 columns and millions of rows. How do we decide which columns matter more than others when we develop a prediction model? The answer is feature selection. By applying methods ranging from simple to sophisticated, we find the necessary input variables and discard the noise in the data.
Why do we select features?
You might think that more data always results in a better machine learning model. That is true for the number of rows (instances), but not for the number of columns (features). If the dataset contains redundant features, the model will not perform as we expect.
Before starting to discover how to do feature selection, we should understand the motivation for feature selection.
1. Reduce the number of dimensions
High-dimensional inputs can be problematic: they are difficult to sample densely and they introduce many practical challenges. Reducing the number of dimensions:
- reduces storage space
- reduces the computational cost and lets the machine learning algorithm train faster
2. Show the key relationships in the data
By discovering the relationships in the data, we decrease the model complexity and increase the interpretability of our model. This improves the power of data visualizations.
3. Keep only the task-relevant information
When we select the most important features, we remove unnecessary features (noise) while still capturing the key relationships. In principle we keep the same information but represent it more compactly. This improves the accuracy of our model and helps prevent overfitting.
How do we select features?
The first and most important technique for selecting the right features is intuition combined with domain knowledge. The most crucial, and most often neglected, step when working with data is to understand it: ask the people who know what each column means before starting any technical process. We may think we are solving a math problem, but we are actually trying to answer real-world questions.
Beyond intuition, automated techniques fall into two broad families: filter methods, which score each feature with a statistical measure (such as its correlation with the target) before any model is trained, and wrapper methods, which evaluate subsets of features by training a model on them.

Filter methods:
- Fast compared to wrapper methods, and they save you from the curse of dimensionality.
- The selected features can be used with any machine learning algorithm.
- Each feature is scored independently, so feature dependencies are ignored.
- It is unclear how to set the threshold that separates the necessary features from the noise.
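A minimal sketch of a univariate filter in NumPy: each column is scored by its absolute correlation with the target, independently of the other columns, and the top-k indices are kept. The function name, the toy data, and the choice of Pearson correlation as the score are illustrative assumptions, not a prescribed recipe.

```python
import numpy as np

def filter_select(X, y, k):
    """Univariate filter: score each column by |Pearson correlation|
    with the target and return the indices of the top-k scores.
    Note that columns are scored independently, so dependencies
    between features are ignored."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

# Toy data (assumed for illustration): column 0 drives the target,
# column 1 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=200)

print(filter_select(X, y, k=1))  # → [0], the informative column
```

Note that the choice of k (or of a score threshold) is left to us, which is exactly the weakness listed above.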
Wrapper methods:
- Give an accurate indicator of model performance, because they evaluate features with the actual model.
- High computational cost, since a model is trained for every candidate subset.
- Tied to the model used during selection: if we change the model, we need to reselect the subset of features.
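A minimal sketch of a wrapper method, assuming NumPy and a plain least-squares linear model as the wrapped model: greedy forward selection adds, at each step, the feature whose inclusion most reduces the training error of a model refit on the current subset. The function name and toy data are illustrative; note how many models get trained even for this tiny example, which is where the computational cost comes from.

```python
import numpy as np

def forward_select(X, y, k):
    """Greedy forward selection (a wrapper method): at each step,
    refit a least-squares linear model on every candidate subset
    and keep the feature that lowers the training error the most."""
    def sse(cols):
        A = X[:, cols]
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        return resid @ resid

    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = min(remaining, key=lambda j: sse(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (assumed for illustration): columns 0 and 2 drive the
# target, columns 1 and 3 are noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = 2 * X[:, 2] - X[:, 0] + 0.05 * rng.normal(size=100)

print(forward_select(X, y, k=2))  # the two informative columns
```

Because the selection criterion is the wrapped model's own error, the chosen subset is tailored to that model; swapping in a different model generally means rerunning the search.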
By applying feature selection, we choose the right subset of features and change a high dimensional dataset into something more manageable.
It can be difficult to decide which method is better suited to a specific task, and we can apply both iteratively: filter and wrapper methods can be used on the same dataset. Since feature selection and model selection are both iterative processes, we can go back and forth, using our domain knowledge of the data to refine the set of relevant features.
Any suggestions and comments are much appreciated!