Image by Pete Linforth from Pixabay

How To Select The Right Variables From A Large Dataset?

A Guide To Feature Selection

What Is Feature Selection?

Imagine that we have a dataset with 1,000 columns and millions of rows. How do we decide which columns are more important than the others when we develop a prediction model? The answer is feature selection. By applying a mix of simple and sophisticated methods, we keep the necessary input variables and discard the noise in the data.

Why Do We Select Features?

You may think that more data results in a better machine learning model. That is true for the number of rows (instances) but not for the number of columns (features). Redundant or irrelevant features act as noise, so the model does not perform as well as we expect.

Before discovering how to do feature selection, we should understand the motivation behind it.

High-dimensional inputs can be problematic: the feature space becomes difficult to sample densely, and the extra dimensions introduce computational and statistical challenges.

Dimensionality reduction:

  • reduces storage space
  • reduces the computational cost and enables the machine learning algorithm to train faster

By discovering the relationships in the data, we decrease model complexity and increase the interpretability of our model. Fewer dimensions also make data visualizations more effective.

When we select the most important features, we remove unnecessary features (noise) and still capture the key relationships. Theoretically, we keep the same information but change its representation. This improves the accuracy of our model and reduces the risk of overfitting.

How Do We Select Features?

Domain Knowledge

The first and most important technique for selecting the right features is your intuition combined with domain knowledge. While working with data, the most crucial and most neglected step is to understand the data: ask the people who know what each column is about before starting any technical process. We think that we are solving a math problem, but in fact we are trying to find answers to real-world problems.

Technical Methods

Filter Methods

Filter Method, Image by Author
  • Filter methods are fast compared to wrapper methods and help you avoid the curse of dimensionality.
  • The selected features can be used with any machine learning algorithm.
  • Each feature is considered independently, so feature dependencies are ignored.
  • It is unclear how to set the threshold that separates the necessary features from the noise.
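
As a concrete illustration, here is a minimal filter-method sketch using scikit-learn's SelectKBest, which scores every feature independently with a model-agnostic statistic (mutual information in this case) and keeps the top k. The synthetic dataset and the choice of k = 10 are arbitrary assumptions made for the example.

```python
# A minimal filter-method sketch: score each feature independently,
# then keep the k highest-scoring ones. Dataset and k are illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data: 1,000 features, only 20 of which are informative
X, y = make_classification(n_samples=500, n_features=1000,
                           n_informative=20, random_state=42)

# Score every feature on its own and keep the k best
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # (500, 10)
print(selector.get_support(indices=True))  # indices of the kept features
```

Because the scores are computed without training any model, the same selected subset can be fed to any downstream algorithm, which is exactly the strength listed above.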

Wrapper Methods

Wrapper Method, Image by Author
  • Give a more accurate indicator of model performance, because subsets of features are evaluated by actually training the model.
  • Have a high computational cost.
  • The selected subset is tied to the model used during selection. If we change the model, we need to reselect the features.
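
As one common wrapper-method example, here is a sketch of recursive feature elimination (RFE) with scikit-learn. RFE repeatedly fits a chosen estimator and drops the weakest features, so the result is tailored to that estimator. The dataset, the estimator, and the numbers below are illustrative assumptions.

```python
# A minimal wrapper-method sketch: recursive feature elimination (RFE).
# The selected subset depends on the wrapped estimator.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=10, random_state=42)

# Wrap a specific model; RFE uses its coefficients to rank features
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=10, step=5)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; higher ranks were eliminated earlier
```

Since RFE retrains the estimator at every elimination step, its cost grows quickly with the number of features, which is the high computational cost mentioned above.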

Conclusion

By applying feature selection, we choose the right subset of features and turn a high-dimensional dataset into something more manageable.

It can be difficult to decide which method is more appropriate for a specific task, and we can apply both filter and wrapper methods on the same dataset. Since feature and model selection are both iterative processes, we can go back and forth and refine the selected features by using our domain knowledge about the data.

Contact me

If you want to be updated with my latest articles follow me on Medium. You can connect with me on LinkedIn, and email me at seymatas@gmail.com!

Any suggestions and comments will be greatly appreciated!

