Your name: Nirzari Gupta
Your email: [email protected]
Your company/organization: Outreachy
Project name: Feature selection custom component
This project provides a custom component for running feature selection algorithms on datasets in TFX pipelines. It also emits the feature scores for the selected features as a custom artifact.
Component
This project will allow the user to choose among different algorithms for performing feature selection on dataset artifacts in TFX pipelines.
The Feature Selection Custom Component is implemented as a Python function-based component.
The component works as follows:
- Get the dataset artifact generated by ExampleGen
- Convert it into a format compatible with Scikit-Learn functions (TFRecord to NumPy dictionaries)
- Perform univariate feature selection with the SelectorFunc specified in the module file
- Output the following two artifacts:
  - updated_data: Duplicate of the input Example artifact, but with updated URI and data values
  - feature_selection: Contains data about the feature selection process with the following values available:
    - scores: Metric scores from the selector
    - p_values: Calculated p-values from the selector
    - selected_features: List of selected columns after feature selection
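The selection and output steps above can be sketched outside of TFX as follows. This is a minimal illustration, not the component's actual code: the helper name select_features, the dictionary-of-columns layout, and the toy data are assumptions made for the example. Only the output keys (scores, p_values, selected_features) come from the description above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2


def select_features(data, target_feature, selector):
    """Fit a univariate selector on a dict of columns (hypothetical helper).

    Returns the selection metadata and the reduced dataset.
    """
    feature_names = [name for name in data if name != target_feature]
    # Stack the per-feature columns into the 2-D matrix scikit-learn expects.
    X = np.column_stack([data[name] for name in feature_names])
    y = data[target_feature]

    selector.fit(X, y)
    mask = selector.get_support()  # boolean mask over feature_names
    selected = [name for name, keep in zip(feature_names, mask) if keep]

    feature_selection = {
        "scores": selector.scores_.tolist(),
        "p_values": selector.pvalues_.tolist(),
        "selected_features": selected,
    }
    # Keep only the selected columns plus the target.
    updated_data = {name: data[name] for name in selected + [target_feature]}
    return feature_selection, updated_data


# Toy data (assumed): "a" and "b" track the label, "c" is constant.
toy = {
    "a": np.array([0, 0, 0, 5, 5, 5]),
    "b": np.array([1, 1, 1, 4, 4, 4]),
    "c": np.array([2, 2, 2, 2, 2, 2]),
    "label": np.array([0, 0, 0, 1, 1, 1]),
}
info, updated = select_features(toy, "label", SelectKBest(score_func=chi2, k=2))
print(info["selected_features"])
```

In the real component the dictionary would come from the converted ExampleGen output, and the two return values would be written out as the updated_data and feature_selection artifacts.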
The module file is required to have a structure with the following three values:
- SELECTOR_PARAMS: Parameters for SelectorFunc
- TARGET_FEATURE: The target feature in the dataset
- SelectorFunc: Univariate function for feature selection
The example below uses sklearn functions directly for simplicity. You may define custom functions as long as the overall input/output structure stays the same.
from sklearn.feature_selection import SelectKBest as SelectorFunc
from sklearn.feature_selection import chi2
SELECTOR_PARAMS = {"score_func": chi2, "k": 2}
TARGET_FEATURE = 'species'
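As a rough illustration of how these three values might be consumed, the sketch below applies them to scikit-learn's bundled iris data. The in-memory layout (one array per column, keyed by feature name) is an assumption for the example, and the component's real data handling may differ.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest as SelectorFunc
from sklearn.feature_selection import chi2

# The same three values the module file defines.
SELECTOR_PARAMS = {"score_func": chi2, "k": 2}
TARGET_FEATURE = "species"

iris = load_iris()
# Assumed layout: one NumPy array per column, keyed by column name.
columns = {name: iris.data[:, i] for i, name in enumerate(iris.feature_names)}
columns[TARGET_FEATURE] = iris.target

feature_names = [name for name in columns if name != TARGET_FEATURE]
X = np.column_stack([columns[name] for name in feature_names])
y = columns[TARGET_FEATURE]

# SELECTOR_PARAMS is unpacked into the selector's constructor.
selector = SelectorFunc(**SELECTOR_PARAMS)
X_new = selector.fit_transform(X, y)

selected = [n for n, keep in zip(feature_names, selector.get_support()) if keep]
print(X_new.shape, selected)
```

Keeping the parameters in a plain dictionary means a different score function or a different k can be swapped in by editing the module file alone.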
The feature selection component is used in much the same way as StatisticsGen:
feature_selector = FeatureSelection(
    orig_examples=example_gen.outputs['examples'],
    module_file='example.modules.iris_module_file'
)
The implementation will use the Scikit-learn feature selection functions.
Project Leader: Nirzari Gupta, nirzu97, [email protected]
- Fatimah Adwan, FatimahAdwan, [email protected]
- Kshitijaa Jaglan, deutranium, [email protected]
- Pratishtha Abrol, pratishtha-abrol, [email protected]