FeatureEngineering-Framework-ML
Variable Types
Categorical Variables: Nominal and Ordinal
Numerical Variables: Discrete and Continuous
Mixed variables: strings and numbers
Datetime variables
Variable Types
Code + Blog Link
Video Link
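As a rough illustration of the variable types listed above, the sketch below sorts the columns of a small, made-up DataFrame into numerical, datetime, and everything else (categorical or mixed). The column names and values are hypothetical and not taken from the linked material.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "colour": ["red", "blue", "red"],        # categorical (nominal)
    "n_children": [0, 2, 1],                 # numerical (discrete)
    "income": [1500.5, 2300.0, 1800.75],     # numerical (continuous)
    "cabin": ["A12", "B7", "C103"],          # mixed: letters and numbers
    "signup": pd.to_datetime(["2021-01-05", "2021-03-17", "2021-07-30"]),
})

# Numerical and datetime columns can be picked out by dtype.
numerical = df.select_dtypes(include="number").columns.tolist()
datetimes = df.select_dtypes(include="datetime").columns.tolist()

# Whatever remains is stored as text: categorical or mixed variables.
other = [c for c in df.columns if c not in numerical + datetimes]
```

Dtypes alone cannot separate nominal from ordinal or categorical from mixed variables, so the remaining columns still need manual inspection.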
Variable Characteristics
Missing Data
Cardinality
Category Frequency
Distributions
Outliers
Magnitude
Variable Characteristics
Code + Blog Link
Video Link
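A minimal sketch of how some of these characteristics might be inspected with pandas. The data is invented, and the 1.5 x IQR rule used for outliers is only one common convention, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "city": ["London", "Paris", "London", None, "Rome", "London"],
    "age": [25.0, 31.0, np.nan, 40.0, 38.0, 120.0],
})

# Missing data: share of missing values per column.
missing_fraction = df.isnull().mean()

# Cardinality: number of distinct categories (missing values excluded).
cardinality = df["city"].nunique()

# Category frequency: relative frequency of each category.
category_freq = df["city"].value_counts(normalize=True)

# Outliers: values above Q3 + 1.5 * IQR (assumed proximity rule).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr
outliers = df.loc[df["age"] > upper_bound, "age"]
```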
Missing Data Imputation
For Numerical Variables
Mean and Median Imputation
Arbitrary value imputation
End of Tail Imputation
For Categorical Variables
Frequent category imputation
Adding a missing category
Random Sample Imputation
Adding a missing indicator
Imputation with Scikit-learn
Imputation with Feature-engine
Missing Data Imputation
Code + Blog Link
Video Link
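The sketch below combines three of the techniques listed above: median imputation, frequent category imputation, and a missing indicator. It uses plain pandas rather than Scikit-learn or Feature-engine, and the `impute` helper and the data are hypothetical. The imputation values are learned from the training set only and then reused on the test set.

```python
import numpy as np
import pandas as pd

# Hypothetical train / test split.
train = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 40.0],
    "city": ["Rome", "Rome", None, "Paris"],
})
test = pd.DataFrame({"age": [np.nan, 50.0], "city": [None, "Paris"]})

# Learn the imputation values from the training set only.
age_median = train["age"].median()     # 30.0
city_mode = train["city"].mode()[0]    # "Rome"

def impute(df):
    """Illustrative helper: apply the learned values to any dataset."""
    out = df.copy()
    out["age_missing"] = out["age"].isnull().astype(int)  # missing indicator
    out["age"] = out["age"].fillna(age_median)            # median imputation
    out["city"] = out["city"].fillna(city_mode)           # frequent category
    return out

train_imputed = impute(train)
test_imputed = impute(test)
```

The indicator is created before filling, otherwise the information about which rows were missing would be lost.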
Multivariate Imputation
Multivariate Imputation
Code + Blog Link
Video Link
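The outline does not say which multivariate method is covered, so the following is only one possible sketch. It assumes Scikit-learn is installed and uses its `KNNImputer`, which fills a missing value from the rows that are closest on the other features.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the second feature is missing in the third row.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
])

# Each missing value is replaced by the mean of that feature
# over the 2 nearest rows (distance computed on observed features).
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
```

Here the two nearest rows to the incomplete one are the second and fourth, so the gap is filled with the mean of their second feature.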
Categorical Variable Encoding
Traditional Techniques
One hot encoding: simple and of frequent categories
Ordinal / Label encoding: arbitrary and ordered
Count / Frequency encoding
Monotonic Relationship
Target mean encoding
Weight of evidence
Ordered label encoding
Alternative Techniques
Binary encoding
Feature hashing
Probability Ratio
For Rare Labels
One hot encoding of frequent categories
Grouping of rare categories
Rare Label encoding
Encoding with Scikit-learn
Encoding with category encoders
Categorical Variable Encoding
Code + Blog Link
Video Link
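A few of the encodings above can be sketched with plain pandas, as below. The data, the 0.2 rarity threshold, and the column names are assumptions made for illustration. In practice the mappings would be learned on a training set and applied unchanged to new data.

```python
import pandas as pd

# Hypothetical training data with a binary target.
train = pd.DataFrame({
    "city": ["London", "London", "London", "Paris", "Paris", "Rome"],
    "target": [1, 0, 1, 0, 0, 1],
})

# Count / frequency encoding: each category becomes its share of rows.
freq_map = train["city"].value_counts(normalize=True).to_dict()
train["city_freq"] = train["city"].map(freq_map)

# Target mean encoding: each category becomes the mean target in that category.
mean_map = train.groupby("city")["target"].mean().to_dict()
train["city_target_mean"] = train["city"].map(mean_map)

# Grouping of rare categories: labels below the threshold become "Rare".
threshold = 0.2  # assumed cut-off
frequent = [cat for cat, freq in freq_map.items() if freq >= threshold]
train["city_grouped"] = train["city"].where(train["city"].isin(frequent), "Rare")

# One hot encoding of the grouped variable.
one_hot = pd.get_dummies(train["city_grouped"], prefix="city")
```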
Variable Transformation
Mathematical Transformations
Logarithmic
Exponential / Power
Reciprocal
Box-Cox
Yeo-Johnson
Discretisation
Unsupervised
Equal-width
Equal-frequency
K-means
Supervised
Other
Transformation with Scikit-learn
Variable Transformation
Code + Blog Link
Video Link
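The mathematical transformations above can be sketched with NumPy as follows. The `box_cox` helper is illustrative and takes a fixed lambda. In practice lambda is usually estimated from the data, for example with `scipy.stats.boxcox` or Scikit-learn's `PowerTransformer`.

```python
import numpy as np

# Strictly positive, right-skewed toy values.
x = np.array([1.0, 10.0, 100.0, 1000.0])

log_x = np.log(x)           # logarithmic; requires x > 0
sqrt_x = np.power(x, 0.5)   # power transformation with exponent 0.5
recip_x = 1.0 / x           # reciprocal; undefined at x == 0

def box_cox(values, lam):
    """Box-Cox transform for a fixed lambda; values must be strictly positive."""
    values = np.asarray(values, dtype=float)
    if lam == 0:
        return np.log(values)
    return (values ** lam - 1.0) / lam

bc_x = box_cox(x, lam=0.5)
```

Yeo-Johnson extends the same idea to zero and negative values and is not shown here.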
Discretisation
Arbitrary
Equal-frequency discretisation
Equal-width discretisation
K-means discretisation
Discretisation with trees
Discretisation with Scikit-learn
Discretisation with Feature-engine
Discretisation
Code + Blog Link
Video Link
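Arbitrary, equal-width, and equal-frequency discretisation can be sketched with pandas `cut` and `qcut`. The ages and the arbitrary bin edges below are made up for illustration.

```python
import pandas as pd

age = pd.Series([1, 5, 12, 18, 25, 33, 47, 60, 72, 90], name="age")

# Equal-width: 3 intervals of the same width across the value range.
equal_width = pd.cut(age, bins=3, labels=False)

# Equal-frequency: 5 intervals holding the same number of observations.
equal_freq = pd.qcut(age, q=5, labels=False)

# Arbitrary: intervals chosen by the user (assumed edges).
arbitrary = pd.cut(age, bins=[0, 17, 64, 120], labels=["child", "adult", "senior"])
```

Equal-width bins can end up very unevenly populated on skewed data, which is the usual reason for preferring equal-frequency bins.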
Outliers
Discretisation
Capping / Censoring
Trimming / Truncation
Outliers
Code + Blog Link
Video Link
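Capping and trimming might look like the sketch below. It assumes the 1.5 x IQR proximity rule to find the boundaries; other rules, such as mean plus or minus 3 standard deviations or fixed quantiles, are equally valid choices.

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95], name="value")

# Boundaries from the inter-quartile range (assumed rule).
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Capping / censoring: outliers are replaced by the boundary value.
capped = s.clip(lower=lower, upper=upper)

# Trimming / truncation: rows holding outliers are removed.
trimmed = s[(s >= lower) & (s <= upper)]
```

Capping keeps every row, while trimming shrinks the dataset.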
Feature Scaling
Standardisation (common one)
MinMaxScaling (common one)
MaxAbsoluteScaling
RobustScaling
Scaling to absolute maxima
Scaling to median & quantiles
Scaling to unit norm
Models affected by feature magnitude
Linear & Logistic Regression
SVM
KNN
K-means Clustering
LDA
PCA
Neural Networks
Models insensitive to feature magnitude
Tree-based models
Classification & Regression Trees
Random Forest
Gradient Boosted Trees
Feature Scaling
Code + Blog Link
Video Link
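Standardisation and min-max scaling reduce to simple arithmetic, sketched here with NumPy. The `standardise` and `min_max` helpers are illustrative; Scikit-learn's `StandardScaler` and `MinMaxScaler` provide the same operations. The parameters are learned on the training set and reused on the test set.

```python
import numpy as np

# Hypothetical data: two features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

# Parameters are learned from the training set only.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)

def standardise(X):
    """Centre at 0 and scale to unit variance."""
    return (X - mean) / std

def min_max(X):
    """Rescale so the training range maps to [0, 1]."""
    return (X - x_min) / (x_max - x_min)

X_train_std = standardise(X_train)
X_test_mm = min_max(X_test)
```

Test values outside the training range fall outside [0, 1] after min-max scaling, as the second feature of `X_test` does here.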
Mixed variables
Creating new variables from strings and numbers
Mixed variables
Code + Blog Link
Video Link
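One way to create new variables from a mixed column is to split each value into its text part and its numeric part. The sketch below assumes a hypothetical `cabin` column whose values are a letter followed by a number.

```python
import pandas as pd

# Hypothetical mixed variable: letter prefix plus number, with one missing value.
df = pd.DataFrame({"cabin": ["C85", "E46", None, "B28"]})

# Categorical part: the leading letters.
df["cabin_letter"] = df["cabin"].str.extract(r"([A-Za-z]+)", expand=False)

# Numerical part: the digits, converted from text to numbers.
df["cabin_number"] = pd.to_numeric(df["cabin"].str.extract(r"(\d+)", expand=False))
```

Missing values in the original column stay missing in both derived columns.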
Datetime Variables
Extracting day, month, week, semester, year, etc.
Extracting hour, minute, second, etc.
Capturing Elapsed time
Time between transactions
Age
Working with timezones
Datetime
Code + Blog Link
Video Link
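The datetime features above map onto the pandas `.dt` accessor, roughly as sketched below. The `opened` and `closed` columns are hypothetical, and the timestamps are assumed to be recorded in UTC for the timezone step.

```python
import pandas as pd

# Hypothetical timestamps.
df = pd.DataFrame({
    "opened": pd.to_datetime(["2020-01-15 08:30:00", "2021-06-01 17:45:10"]),
    "closed": pd.to_datetime(["2020-03-15 08:30:00", "2021-06-03 05:45:10"]),
})

# Date parts.
df["year"] = df["opened"].dt.year
df["month"] = df["opened"].dt.month
df["semester"] = (df["opened"].dt.quarter + 1) // 2   # quarters 1-2 -> 1, 3-4 -> 2
df["day_of_week"] = df["opened"].dt.dayofweek         # Monday = 0

# Time parts.
df["hour"] = df["opened"].dt.hour

# Elapsed time between two events, in days.
df["elapsed_days"] = (df["closed"] - df["opened"]).dt.total_seconds() / 86400

# Timezones: mark the naive timestamps as UTC, then convert.
opened_utc = df["opened"].dt.tz_localize("UTC")
opened_ny = opened_utc.dt.tz_convert("America/New_York")
```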
Text
Characters, Words, Unique words
Lexical diversity
Sentences, Paragraphs
Bag of Words
TF-IDF
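The text features above can be sketched with the standard library alone. The two documents and the `text_features` and `tf_idf` helpers are illustrative. The TF-IDF shown is the plain textbook form (term frequency times log of inverse document frequency); libraries such as Scikit-learn apply smoothing and normalisation, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical two-document corpus.
docs = ["the cat sat on the mat", "the dog chased the cat"]

def text_features(text):
    """Simple counts plus lexical diversity (unique words / total words)."""
    words = text.lower().split()
    unique = set(words)
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_unique_words": len(unique),
        "lexical_diversity": len(unique) / len(words) if words else 0.0,
    }

features = [text_features(d) for d in docs]

# Bag of words: word counts per document.
bags = [Counter(d.lower().split()) for d in docs]

# Document frequency: in how many documents each word appears.
n_docs = len(docs)
doc_freq = Counter(word for bag in bags for word in bag)

def tf_idf(bag):
    """Unsmoothed TF-IDF for one bag of words."""
    total = sum(bag.values())
    return {w: (c / total) * math.log(n_docs / doc_freq[w]) for w, c in bag.items()}

tfidf = [tf_idf(bag) for bag in bags]
```

Words that occur in every document, such as "the" here, receive a weight of zero.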
Transactions & Time Series
Aggregate data
Number of payments in last 3, 6, 12 months
Time since last transaction
Total spending in last month
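Aggregations like these are usually computed per customer relative to a reference date. The sketch below assumes a hypothetical transaction table and a `snapshot` date chosen for illustration.

```python
import pandas as pd

# Hypothetical transaction log.
tx = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b"],
    "date": pd.to_datetime(
        ["2023-01-10", "2023-05-20", "2023-06-15", "2022-11-01", "2023-06-01"]
    ),
    "amount": [50.0, 20.0, 30.0, 100.0, 10.0],
})
snapshot = pd.Timestamp("2023-06-30")  # assumed reference date

# Windows looking back from the reference date.
recent_3m = tx[tx["date"] > snapshot - pd.DateOffset(months=3)]
last_month = tx[tx["date"] > snapshot - pd.DateOffset(months=1)]

features = pd.DataFrame({
    # Number of payments in the last 3 months.
    "n_payments_3m": recent_3m.groupby("customer").size(),
    # Total spending in the last month.
    "spend_last_month": last_month.groupby("customer")["amount"].sum(),
    # Time since last transaction, in days.
    "days_since_last": (snapshot - tx.groupby("customer")["date"].max()).dt.days,
}).fillna(0)  # customers with no activity in a window get 0
```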
Feature Combination
Ratio: total debt divided by income --> debt-to-income ratio
Sum: debt across different credit cards --> total debt
Subtraction: income minus expenses --> disposable income
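The three combinations above translate directly into column arithmetic. The column names and figures in this sketch are invented.

```python
import pandas as pd

# Hypothetical customer data.
df = pd.DataFrame({
    "income": [4000.0, 2500.0],
    "expenses": [3000.0, 2000.0],
    "debt_card_1": [500.0, 0.0],
    "debt_card_2": [1500.0, 250.0],
})

# Sum: debt on each credit card --> total debt.
df["total_debt"] = df[["debt_card_1", "debt_card_2"]].sum(axis=1)

# Ratio: total debt / income --> debt-to-income ratio.
df["debt_to_income"] = df["total_debt"] / df["income"]

# Subtraction: income - expenses --> disposable income.
df["disposable_income"] = df["income"] - df["expenses"]
```

Ratios need care when the denominator can be zero, which this toy data avoids.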
Pipelines
Classification Pipeline
Regression Pipeline
Pipeline with cross-validation
Pipelines
Code + Blog Link
Video Link
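A classification pipeline with cross-validation might be assembled as in the sketch below. It assumes Scikit-learn is installed; the synthetic data, the step names, and the choice of median imputation, standardisation, and logistic regression are illustrative and not necessarily the steps used in the linked material.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Inject roughly 5% missing values so the imputer has work to do.
X[rng.random(X.shape) < 0.05] = np.nan

# Each step is fitted on the training folds only, which avoids leakage.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# 5-fold cross-validation; one accuracy score per fold.
scores = cross_val_score(pipe, X, y, cv=5)
```

A regression pipeline follows the same pattern with a regressor, such as `LinearRegression`, as the final step.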