SIG TFX-Addons

Project Proposal

Your name: Nirzari Gupta

Your email: [email protected]

Your company/organization: Outreachy

Project name: Feature selection custom component

Project Description

This project provides a custom component that applies feature selection algorithms to datasets in TFX pipelines. It also emits the feature scores for the selected features as a custom artifact.

Project Category

Component

Project Use-Case(s)

This project will allow the user to choose among different algorithms for performing feature selection on dataset artifacts in TFX pipelines.

Project Implementation

The Feature Selection Custom Component is implemented as a Python function-based component.

The component takes the following approach:

  • Get dataset artifact generated by ExampleGen
  • Convert it into a format compatible with scikit-learn functions (TFRecord to NumPy dictionaries)
  • Perform univariate feature selection with the SelectorFunc specified in the module file
  • Output the following two artifacts:
    • updated_data: Duplicate of the input Example artifact, but with updated URI and data values
    • feature_selection: Contains data about the feature selection process with the following values available:
      • scores: Metric scores from the selector
      • p_values: Calculated p-values from the selector
      • selected_features: List of the columns that remain after feature selection
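
The steps above can be sketched outside of TFX as follows. This is a minimal illustration, not the component's actual code: it assumes the TFRecord data has already been decoded into a dictionary of equally sized columns, and the helper name select_features, the toy data, and the choice to keep the target column in updated_data are all assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2


def select_features(columns, target_feature, selector_func, selector_params):
    """Run a univariate selector over a dict of equally sized columns."""
    feature_names = [name for name in columns if name != target_feature]
    # Stack the feature columns into the (n_samples, n_features) matrix
    # that scikit-learn selectors expect.
    X = np.column_stack(
        [np.asarray(columns[name], dtype=float) for name in feature_names]
    )
    y = np.asarray(columns[target_feature])

    selector = selector_func(**selector_params).fit(X, y)
    mask = selector.get_support()
    selected = [name for name, keep in zip(feature_names, mask) if keep]

    return {
        "scores": dict(zip(feature_names, selector.scores_.tolist())),
        "p_values": dict(zip(feature_names, selector.pvalues_.tolist())),
        "selected_features": selected,
        # Assumption: the target column is kept alongside the selected features.
        "updated_data": {name: columns[name] for name in selected + [target_feature]},
    }


# Toy data standing in for decoded examples (hypothetical values).
columns = {
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "noise": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
    "species": [0, 0, 1, 1, 2, 2],
}

result = select_features(
    columns, "species", SelectKBest, {"score_func": chi2, "k": 2}
)
print(result["selected_features"])
```

In the real component the two outputs would be written as TFX artifacts rather than returned as a dictionary.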

Module file

Structure

The module file must define the following three names:

  • SELECTOR_PARAMS: Parameters for SelectorFunc
  • TARGET_FEATURE: The target feature in the dataset
  • SelectorFunc: Univariate function for feature selection
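
One way a component could check that a module file follows this structure is sketched below. The helper read_module_config and the stand-in module are hypothetical and only illustrate the idea; the proposal does not say how the component actually loads or validates the module file.

```python
import types

REQUIRED_NAMES = ("SELECTOR_PARAMS", "TARGET_FEATURE", "SelectorFunc")


def read_module_config(module):
    """Return the three required values, or raise if any is missing."""
    missing = [name for name in REQUIRED_NAMES if not hasattr(module, name)]
    if missing:
        raise AttributeError(f"module file is missing: {', '.join(missing)}")
    return {name: getattr(module, name) for name in REQUIRED_NAMES}


# Stand-in for an imported module file (hypothetical); a real one would
# typically be obtained with importlib.import_module on its dotted path.
demo_module = types.ModuleType("demo_module_file")
demo_module.SELECTOR_PARAMS = {"k": 2}
demo_module.TARGET_FEATURE = "species"
demo_module.SelectorFunc = dict  # placeholder callable, not a real selector

config = read_module_config(demo_module)
print(sorted(config))
```

Failing early with a clear message when a name is missing is a design choice of this sketch; it keeps a misconfigured module file from surfacing later as an opaque error inside the selector.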

Example module file

The example below uses sklearn functions directly for simplicity. You may define custom functions instead, as long as the overall input/output structure stays the same.

from sklearn.feature_selection import SelectKBest as SelectorFunc
from sklearn.feature_selection import chi2

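# Parameters for SelectorFunc: keep the 2 features with the best chi2 scores
SELECTOR_PARAMS = {"score_func": chi2, "k": 2}
# Name of the target feature in the dataset
TARGET_FEATURE = "species"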
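
As an illustration of a custom function with the same input/output structure, the sketch below swaps chi2 for a hand-written scoring function. It relies on the documented SelectKBest contract that score_func takes (X, y) and returns a pair of arrays (scores, p-values). The function abs_pearson, the numeric target named "target", and the toy data are assumptions for this example, and whether the component accepts such a module file unchanged is not confirmed by the proposal.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import SelectKBest as SelectorFunc


def abs_pearson(X, y):
    """Score each column by |Pearson r| with a numeric target."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores, p_values = [], []
    for j in range(X.shape[1]):
        r, p = pearsonr(X[:, j], y)
        scores.append(abs(r))
        p_values.append(p)
    # Same return shape as chi2: one score and one p-value per feature.
    return np.array(scores), np.array(p_values)


SELECTOR_PARAMS = {"score_func": abs_pearson, "k": 2}
TARGET_FEATURE = "target"  # hypothetical numeric target column

# Quick check on toy data (hypothetical values): the third column is
# only weakly correlated with y, so it should be dropped.
X = np.array(
    [[1, 2, 1], [2, 1, -1], [3, 4, 1], [4, 3, -1], [5, 6, 1], [6, 5, -1]],
    dtype=float,
)
y = np.array([1, 2, 3, 4, 5, 6], dtype=float)
selector = SelectorFunc(**SELECTOR_PARAMS).fit(X, y)
mask = selector.get_support().tolist()
print(mask)
```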

Example usage

You can use the feature selection component in much the same way as StatisticsGen:

feature_selector = FeatureSelection(
    orig_examples=example_gen.outputs['examples'],
    module_file='example.modules.iris_module_file',
)

Project Dependencies

The implementation will use the scikit-learn feature selection functions.

Project Team

Project Leader: Nirzari Gupta, nirzu97, [email protected]

  1. Fatimah Adwan, FatimahAdwan, [email protected]
  2. Kshitijaa Jaglan, deutranium, [email protected]
  3. Pratishtha Abrol, pratishtha-abrol, [email protected]