Contents
  1. Background
    1. The Boruta algorithm

Background

High-dimensional data is common these days (e.g. TF-IDF features), which makes it all the more important to choose features that are uncorrelated and non-redundant. Good feature selection helps accelerate your model and improves its accuracy, precision, or recall.

The three major feature selection methods are summarized in this DataCamp post.

  1. Filter methods
  2. Wrapper methods
  3. Embedded methods (e.g. LASSO)

In wrapper methods, a model is trained on a subset of features, and the resulting performance is compared to decide which features to add or remove (see the sketch below). Here, I would like to summarize the “Boruta” method in more detail.
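Recursive feature elimination (RFE) from scikit-learn is one common wrapper method; the following is a minimal sketch on synthetic data (my own illustration, not code from the post):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# RFE repeatedly fits the model and drops the least important
# feature until only n_features_to_select remain.
selector = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
               n_features_to_select=5)
selector.fit(X, y)

print("Selected features:", selector.support_)
print("Feature ranking:  ", selector.ranking_)
```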

The Boruta algorithm

The Boruta algorithm builds on the Random Forest algorithm and estimates feature importance against randomly shuffled copies of the features. The intuition behind this clever method is to keep only the features that overshadow their randomly shuffled images. The original Boruta algorithm was first introduced in R. This author then reimplemented it for the Python family with improved multiprocessing support; the source code can be found in this GitHub repo.
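To make the shadow-feature idea concrete, here is a single-iteration sketch of the core comparison (my own illustration on synthetic data, not the author's code; the real algorithm repeats this over many iterations and applies a statistical test to the hit counts):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical setup for illustration).
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

rng = np.random.default_rng(0)
# Shadow features: each column of X shuffled independently, which
# destroys any real relationship with y but keeps each column's
# marginal distribution intact.
X_shadow = np.apply_along_axis(rng.permutation, 0, X)
X_boruta = np.hstack([X, X_shadow])

rf = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
rf.fit(X_boruta, y)

n = X.shape[1]
real_imp = rf.feature_importances_[:n]
shadow_imp = rf.feature_importances_[n:]

# A real feature scores a "hit" when its importance beats the best
# shadow feature; Boruta confirms features with significantly many hits.
hits = real_imp > shadow_imp.max()
print("Features beating the best shadow:", np.where(hits)[0])
```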

The team recommends using pruned trees with a depth between 3 and 7.
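Following the usage documented in that repo's README, a minimal BorutaPy sketch might look like this (the synthetic data and parameter values are my own illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# Synthetic data; BorutaPy expects numpy arrays, not DataFrames.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Shallow, pruned trees as recommended (depth between 3 and 7).
rf = RandomForestClassifier(n_jobs=-1, max_depth=5)

# n_estimators='auto' lets BorutaPy choose the number of trees
# based on the dataset size; verbose=2 prints each iteration.
feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)
feat_selector.fit(X, y)

print("Confirmed features:", np.where(feat_selector.support_)[0])
print("Ranking:", feat_selector.ranking_)

# Keep only the confirmed features.
X_filtered = feat_selector.transform(X)
```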

This post is inspired by the Kaggle Elo competition, where the leaderboard score is sensitive to the features selected. Good references can be found in this Kaggle kernel and the DataCamp post.