Ensuring Fully Time-Aware Cross-Validation with FLAML & DoubleML to Prevent Data Leakage in Time Series #329

paul-jdfagan · 2025-05-30T08:50:15Z

Hello,
Huge fan of the package. I'm trying to apply it together with FLAML to time series data, but I'm getting tripped up making sure the cross-validation is fully time-aware and avoids data leakage.

Below is the base code I'm using. I pass split_type="time" inside FlamlRegressorDoubleML, but the final DoubleML cross-fitting will not be time-aware.

dml_plr = dml.DoubleMLPLR(
    dml_data,
    ml_l=ml_l,
    ml_m=ml_m,
    n_folds=5,
    n_rep=5,
    score="partialling out",
    draw_sample_splitting=True
)

My ask: I would appreciate any advice on feeding time-stamped data into a standard K-fold splitter while avoiding data leakage. Maybe I'm deep down a rabbit hole of overthinking it.

I believe I'd need to switch off the automatic splitting with draw_sample_splitting=False and supply my own TimeSeriesSplit, but previous attempts would not run properly and ran into partition errors. An alternative is to not respect the time ordering in the final cross-fitting and accept that the final model results are not time-sensitive. Maybe it's good enough that my two nuisance models are. All very confusing...
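
One way to see where the partition errors may come from is to check whether a splitter's test folds cover every row exactly once. The sketch below assumes that this is what the sample-splitting check expects, which the "partition" wording of the error suggests but which should be confirmed against the DoubleML documentation. `is_partition` is a hypothetical helper, not part of scikit-learn or DoubleML.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

def is_partition(splits, n_obs):
    """Return True if the test folds cover each row index exactly once."""
    counts = np.zeros(n_obs, dtype=int)
    for _, test_idx in splits:
        counts[test_idx] += 1
    return bool(np.all(counts == 1))

n_obs = 20
X = np.zeros((n_obs, 1))

kfold_splits = list(KFold(n_splits=5).split(X))
ts_splits = list(TimeSeriesSplit(n_splits=5).split(X))

kfold_ok = is_partition(kfold_splits, n_obs)  # every row is tested once
ts_ok = is_partition(ts_splits, n_obs)        # the earliest block is never tested

# Rows that never appear in any TimeSeriesSplit test fold
tested = np.concatenate([test_idx for _, test_idx in ts_splits])
never_tested = np.setdiff1d(np.arange(n_obs), tested)
print(kfold_ok, ts_ok, never_tested)
```

With TimeSeriesSplit the earliest block only ever serves as training data, so its test folds cannot form a partition of the full sample.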

I'd greatly appreciate any support. Thank you.

Paul


import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from flaml import AutoML
import doubleml as dml
import logging
import time

# --- Configure Logging ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# 1) Pure temporal transformer
class PureTemporalFeatures(BaseEstimator, TransformerMixin):
    """
    Add only date-based, leak-safe features.
    """
    def __init__(self, date_col="Date"):
        self.date_col = date_col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        t0 = time.time()
        logger.info("Starting PureTemporalFeatures.transform")
        df = X.copy()
        df[self.date_col] = pd.to_datetime(df[self.date_col])
        df = df.sort_values(self.date_col).reset_index(drop=True)

        # Basic cyclical encodings
        df["month_sin"] = np.sin(2*np.pi * df[self.date_col].dt.month/12)
        df["month_cos"] = np.cos(2*np.pi * df[self.date_col].dt.month/12)
        df["dow_sin"]   = np.sin(2*np.pi * df[self.date_col].dt.weekday/7)
        df["dow_cos"]   = np.cos(2*np.pi * df[self.date_col].dt.weekday/7)
        df["doy_sin"]   = np.sin(2*np.pi * df[self.date_col].dt.dayofyear/365)
        df["doy_cos"]   = np.cos(2*np.pi * df[self.date_col].dt.dayofyear/365)

        # calendar flags
        df["is_weekend"]       = (df[self.date_col].dt.weekday >= 5).astype(int)
        df["is_month_start"]   = (df[self.date_col].dt.day <= 7).astype(int)
        df["is_month_end"]     = (df[self.date_col].dt.day >= 24).astype(int)
        df["is_quarter_start"] = df[self.date_col].dt.month.isin([1,4,7,10]).astype(int)
        df["is_quarter_end"]   = df[self.date_col].dt.month.isin([3,6,9,12]).astype(int)

        # shopping-season flags
        df["is_holiday_season"]           = (
            (df[self.date_col].dt.month==12) |
            ((df[self.date_col].dt.month==11)&(df[self.date_col].dt.day>=15))
        ).astype(int)
        df["is_black_friday_week"]        = (
            (df[self.date_col].dt.month==11)&(df[self.date_col].dt.day>=20)
        ).astype(int)
        df["is_nordstrom_anniversary_sale"] = (
            ((df[self.date_col].dt.month==7)&(df[self.date_col].dt.day>=15)) |
            ((df[self.date_col].dt.month==8)&(df[self.date_col].dt.day<=10))
        ).astype(int)
        df["is_sephora_spring_sale"]      = (
            (df[self.date_col].dt.month==4)&
            df[self.date_col].dt.day.between(1,20)
        ).astype(int)
        df["is_walmart_july_event"]       = (
            (df[self.date_col].dt.month==7)&
            df[self.date_col].dt.day.between(10,20)
        ).astype(int)
        df["is_target_july_event"]        = df["is_walmart_july_event"]
        df["is_amazon_prime_days_window"] = (
            (df[self.date_col].dt.month==7)&
            df[self.date_col].dt.day.between(9,18)
        ).astype(int)

        # simple trend
        df["time_trend"]    = np.arange(len(df))
        df["time_trend_sq"] = df["time_trend"] ** 2

        elapsed = time.time() - t0
        logger.info(f"PureTemporalFeatures.transform completed in {elapsed:.2f}s, added {len(df.columns)-1} features")
        return df.drop(columns=[self.date_col])


# 2) FLAML wrapper for DoubleML
class FlamlRegressorDoubleML:
    _estimator_type = "regressor"
    def __init__(self, time=120, estimator_list=None, metric="rmse", **kwargs):
        self.auto_ml = AutoML(**kwargs)
        self.time = time
        self.estimator_list = estimator_list or ["lgbm","xgboost","histgb","xgb_limitdepth","catboost"]
        self.metric = metric
        self.best_loss_ = None
        self.tuned_model = None

    def set_params(self, **params):
        self.auto_ml.set_params(**params)
        return self

    def get_params(self, deep=True):
        p = self.auto_ml.get_params(deep)
        p.update(time=self.time, estimator_list=self.estimator_list, metric=self.metric)
        return p

    def fit(self, X, y):
        t0 = time.time()
        logger.info(f"FLAML.fit: starting on {X.shape[0]} rows, budget={self.time}s")
        self.auto_ml.fit(
            X, y,
            task="regression",
            time_budget=self.time,
            estimator_list=self.estimator_list,
            metric=self.metric,
            split_type="time",
            verbose=False
        )
        self.tuned_model = self.auto_ml.model.estimator
        try:
            self.best_loss_ = self.auto_ml.best_loss
        except AttributeError:
            self.best_loss_ = getattr(self.auto_ml._state, "best_loss", None)
        elapsed = time.time() - t0
        logger.info(f"FLAML.fit completed in {elapsed:.2f}s, best_loss_={self.best_loss_}")
        return self

    def predict(self, X):
        return self.tuned_model.predict(X)


# 3) Load & preprocess 
t0_total = time.time()
logger.info("Starting data load & preprocessing")
df = dataset
df["Amount spent (USD)"] = df["Amount spent (USD)"].ffill()

def geometric_adstock(s, alpha=0.9, lags=7):
    w = alpha ** np.arange(lags)
    return s.rolling(lags, min_periods=1) \
            .apply(lambda x: np.dot(x[::-1], w[:len(x)]), raw=True)

df["adstock_spend"] = geometric_adstock(df["Amount spent (USD)"])
df.dropna(subset=["gmv","adstock_spend"], inplace=True)
df.reset_index(drop=True, inplace=True)
logger.info(f"Data loaded & preprocessed in {time.time() - t0_total:.2f}s — {len(df)} obs")

#  4) Feature engineering & DMLData 
t0 = time.time()
logger.info("Applying PureTemporalFeatures")
df_temp = PureTemporalFeatures(date_col="Date").fit_transform(df)

business = ["10_yr_ir","dj_close_open_ratio","App DAUs (iOS+Android)","posts"]
temp_feats = [
    "month_sin","month_cos","dow_sin","dow_cos","doy_sin","doy_cos",
    "is_weekend","is_month_start","is_month_end",
    "is_quarter_start","is_quarter_end","is_holiday_season","is_black_friday_week",
    "is_nordstrom_anniversary_sale","is_sephora_spring_sale",
    "is_walmart_july_event","is_target_july_event","is_amazon_prime_days_window",
    "time_trend","time_trend_sq"
]
all_x = business + temp_feats

df_model = pd.concat([
    df_temp[temp_feats],
    df[business],
    df[["gmv","adstock_spend"]]
], axis=1).loc[:, lambda d: ~d.columns.duplicated()]

dml_data = dml.DoubleMLData(df_model, y_col="gmv", d_cols="adstock_spend", x_cols=all_x)
logger.info(f"Feature engineering & DMLData creation took {time.time() - t0:.2f}s — {dml_data.n_obs} obs × {len(all_x)} covs")

# 5) Nuisance learners 
t0 = time.time()
logger.info("Setting up nuisance learners")
ml_l = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    FlamlRegressorDoubleML(time=120, metric="rmse")
)
ml_m = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    FlamlRegressorDoubleML(time=120, metric="rmse")
)
logger.info(f"Nuisance learners ready in {time.time() - t0:.2f}s")

# 6) DoubleMLPLR fit 
t0 = time.time()
logger.info("Fitting DoubleMLPLR (n_folds=5, n_rep=10)")
dml_plr = dml.DoubleMLPLR(
    dml_data,
    ml_l=ml_l,
    ml_m=ml_m,
    n_folds=5,
    n_rep=10,
    score="partialling out"
)
res = dml_plr.fit(n_jobs_cv=-1, store_models=True)
logger.info(f"DoubleMLPLR fit completed in {time.time() - t0:.2f}s")

#  7) Results & diagnostics 
logger.info("Causal effect estimate:")
logger.info(f"\n{res.summary}")

fl_l = dml_plr.models['ml_l'][0].named_steps['flamlregressordoubleml'].best_loss_
fl_m = dml_plr.models['ml_m'][0].named_steps['flamlregressordoubleml'].best_loss_
logger.info(f"FLAML tuning RMSEs → ml_l: {fl_l:.3f}, ml_m: {fl_m:.3f}")

evals = dml_plr.evaluate_learners()
for learner, scores in evals.items():
    rmse = np.mean([s[0] for s in scores])
    logger.info(f"OOS RMSE ({learner}): {rmse:.2f}")

logger.info(f"Total script time: {time.time() - t0_total:.2f}s")

SvenKlaassen · 2025-06-01T14:16:21Z

Maintainer

Hi @paul-jdfagan,

Generally, I agree that a time-aware CV procedure should be used if you want to apply DoubleML to time series data.
And I can see that external partitioning might not be very simple to use in this setting.

A possible alternative would be the use of external predictions. This way you would be able to fully control the CV-procedure.
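
As a rough sketch of what "external predictions" look like in practice: the layout below mirrors the dictionary used later in this thread (treatment column name, then learner name, then an array of shape (n_obs, n_rep)). `build_external_predictions` is a hypothetical helper and the numbers are placeholders; the exact requirements should be checked against the DoubleML documentation.

```python
import numpy as np

def build_external_predictions(d_col, preds_l, preds_m, n_obs, n_rep=1):
    """Hypothetical helper: shape-check and pack nuisance predictions."""
    packed = {}
    for name, preds in (("ml_l", preds_l), ("ml_m", preds_m)):
        arr = np.asarray(preds, dtype=float).reshape(n_obs, n_rep)
        if np.isnan(arr).any():
            raise ValueError(f"{name} has NaN predictions for some rows")
        packed[name] = arr
    return {d_col: packed}

n_obs = 6
external = build_external_predictions(
    "adstock_spend",
    preds_l=np.linspace(100.0, 105.0, n_obs),  # stand-in for predictions of the outcome
    preds_m=np.linspace(10.0, 15.0, n_obs),    # stand-in for predictions of the treatment
    n_obs=n_obs,
)
print({k: v.shape for k, v in external["adstock_spend"].items()})
```

The NaN check is there because every row needs a prediction from each learner, however the folds that produced them were constructed.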

I will try to take a more detailed look at your issue in the upcoming week.
You could also provide me with your current version of the time-relevant sample splitting (and I can try to see what might be the reason for the partitions errors).

paul-jdfagan · 2025-06-02T12:58:03Z

Author

Thanks @SvenKlaassen for looking into the case. Much appreciated.

Below is an updated flow built around the external_predictions argument. Apologies if it's overcomplicated.

I needed to add a fixed-schema guard, as features were being dropped during training and then not found at prediction time.
The latest error seems to be caused by NaNs, despite there being none in the dataset. Somehow the model returned NaN predictions.

Thanks again!

2025-06-02 14:39:06,220 - INFO - Loading data …
2025-06-02 14:39:06,239 - INFO - Data ready: 418 obs
2025-06-02 14:39:06,252 - INFO - Feature engineering & DMLData creation took 0.01s — 418 obs × 24 covs
2025-06-02 14:39:06,253 - INFO - Generating external predictions with TimeSeriesSplit
2025-06-02 14:39:06,256 - INFO - Fold 1: train 0–105 | test 106–209
2025-06-02 14:41:06,271 - INFO - Fold 2: train 0–209 | test 210–313
2025-06-02 14:43:06,354 - INFO - Fold 3: train 0–313 | test 314–417
2025-06-02 14:45:06,427 - INFO - CV finished in 360.17s

AssertionError Traceback (most recent call last)
Cell In[158], line 258
255 preds_m_ext[tei, 0] = robust_m.predict(dml_data.x[tei, :])
257 logger.info("CV finished in %.2fs", time.time() - t0_manual_cv)
--> 258 assert not np.isnan(preds_l_ext).any()
259 assert not np.isnan(preds_m_ext).any()
261 external_predictions_dict = {
262 "adstock_spend": {"ml_l": preds_l_ext, "ml_m": preds_m_ext}
263 }

AssertionError:
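
A likely explanation, going only by the fold boundaries in the log above: the NaNs are not model output but the initial fill values of the prediction arrays, left in place for rows that never land in a test fold. The sketch below replays the logged boundaries with a constant stand-in for `predict()`, so it needs neither FLAML nor the real data.

```python
import numpy as np

n_obs = 418
# (train_start, train_end, test_start, test_end), inclusive, copied from the log
folds = [(0, 105, 106, 209), (0, 209, 210, 313), (0, 313, 314, 417)]

preds = np.full((n_obs, 1), np.nan)
for _, _, test_start, test_end in folds:
    # Stand-in for model.predict(); any finite value works for this diagnosis
    preds[test_start:test_end + 1, 0] = 0.0

missing = np.flatnonzero(np.isnan(preds[:, 0]))
print(f"{missing.size} rows without a prediction: {missing[0]}..{missing[-1]}")
```

Rows 0 to 105 are only ever used for training, so 106 entries stay NaN regardless of how well the learners behave.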


import numpy as np
import pandas as pd
import logging
import time
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
from flaml import AutoML
import doubleml as dml

# Global timers & constants 
T0_TOTAL = time.time()
N_FOLDS_TS = 3
N_REP_MANUAL = 1

# Configure Logging 
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

# 1) Pure temporal transformer (leak‑safe)
class PureTemporalFeatures(BaseEstimator, TransformerMixin):
    """Generate leak‑safe calendar features."""

    def __init__(self, date_col: str = "Date"):
        self.date_col = date_col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        t0 = time.time()
        df = X.copy()
        if self.date_col not in df.columns:
            raise KeyError(f"Date column '{self.date_col}' not found.")

        df[self.date_col] = pd.to_datetime(df[self.date_col])
        df = df.sort_values(self.date_col).reset_index(drop=True)

        # Cyclical encodings
        df["month_sin"] = np.sin(2 * np.pi * df[self.date_col].dt.month / 12)
        df["month_cos"] = np.cos(2 * np.pi * df[self.date_col].dt.month / 12)
        df["dow_sin"]   = np.sin(2 * np.pi * df[self.date_col].dt.weekday / 7)
        df["dow_cos"]   = np.cos(2 * np.pi * df[self.date_col].dt.weekday / 7)
        df["doy_sin"]   = np.sin(2 * np.pi * df[self.date_col].dt.dayofyear / 365.25)
        df["doy_cos"]   = np.cos(2 * np.pi * df[self.date_col].dt.dayofyear / 365.25)

        # Flags
        df["is_weekend"] = (df[self.date_col].dt.weekday >= 5).astype(int)
        df["is_month_start"] = (df[self.date_col].dt.day <= 7).astype(int)
        df["is_month_end"]   = (df[self.date_col].dt.day >= 24).astype(int)
        df["is_quarter_start"] = df[self.date_col].dt.is_quarter_start.astype(int)
        df["is_quarter_end"]   = df[self.date_col].dt.is_quarter_end.astype(int)

        df["is_holiday_season"] = ((df[self.date_col].dt.month == 12) |
                                     ((df[self.date_col].dt.month == 11) &
                                      (df[self.date_col].dt.day >= 15))).astype(int)
        df["is_black_friday_week"] = ((df[self.date_col].dt.month == 11) &
                                        df[self.date_col].dt.day.between(20, 30)).astype(int)
        df["is_nordstrom_anniversary_sale"] = (((df[self.date_col].dt.month == 7) &
                                                  (df[self.date_col].dt.day >= 15)) |
                                                 ((df[self.date_col].dt.month == 8) &
                                                  (df[self.date_col].dt.day <= 10))).astype(int)
        df["is_sephora_spring_sale"] = ((df[self.date_col].dt.month == 4) &
                                          df[self.date_col].dt.day.between(1, 20)).astype(int)
        df["is_walmart_july_event"]  = ((df[self.date_col].dt.month == 7) &
                                          df[self.date_col].dt.day.between(10, 20)).astype(int)
        df["is_target_july_event"]   = df["is_walmart_july_event"]
        df["is_amazon_prime_days_window"] = ((df[self.date_col].dt.month == 7) &
                                               df[self.date_col].dt.day.between(9, 18)).astype(int)

        # Time trend
        df["time_trend"]    = np.arange(len(df))
        df["time_trend_sq"] = df["time_trend"] ** 2

        return df.drop(columns=[self.date_col])

# 2) Fixed‑schema guard 
class FixedSchema(BaseEstimator, TransformerMixin):
    """Freeze feature schema at fit() and reproduce at transform()."""

    def __init__(self, fill_value: float = 0.0, as_ndarray: bool = True):
        self.fill_value = fill_value
        self.as_ndarray = as_ndarray

    def fit(self, X, y=None):
        if isinstance(X, pd.DataFrame):
            self.columns_ = list(X.columns)
            self.n_features_in_ = len(self.columns_)
        else:
            self.columns_ = None
            self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            Xf = X.copy()
            for col in set(self.columns_) - set(Xf.columns):
                Xf[col] = self.fill_value
            Xf = Xf[self.columns_]
        else:
            X_arr = np.asarray(X)
            if X_arr.shape[1] < self.n_features_in_:
                pad = np.full((X_arr.shape[0], self.n_features_in_ - X_arr.shape[1]), self.fill_value)
                Xf = np.hstack([X_arr, pad])
            else:
                Xf = X_arr[:, :self.n_features_in_]
        return Xf if self.as_ndarray else pd.DataFrame(Xf, columns=self.columns_)

#3) FLAML regressor wrapper 
class FlamlRegressorDoubleML(BaseEstimator):
    _estimator_type = "regressor"

    def __init__(self, time_budget: int = 60, metric: str = "rmse", estimator_list=None):
        self.auto_ml = AutoML()
        self.time_budget = time_budget
        self.metric = metric
        self.estimator_list = estimator_list or [
            "lgbm", "xgboost", "histgb", "catboost", "rf"
        ]

    def fit(self, X, y):
        X_df = pd.DataFrame(X)
        self.auto_ml.fit(
            X_train=X_df,
            y_train=y,
            task="regression",
            time_budget=self.time_budget,
            estimator_list=self.estimator_list,
            metric=self.metric,
            split_type="time",
            verbose=0,
        )
        self.model_ = self.auto_ml.model.estimator

        # robust feature tracking
        names = None
        if hasattr(self.model_, "get_booster"):
            booster = self.model_.get_booster()
            if booster is not None:
                names = booster.feature_names
        if names is None and hasattr(self.model_, "feature_names_in_"):
            names = list(self.model_.feature_names_in_)
        # fallback: keep incoming order
        self.training_features_ = names if names is not None else list(X_df.columns)
        self.n_expected_ = getattr(self.model_, "n_features_in_", len(self.training_features_))
        return self


    def predict(self, X):
        X_df = pd.DataFrame(X)
        # ensure all expected cols
        for col in self.training_features_:
            if col not in X_df.columns:
                X_df[col] = 0.0
        X_df = X_df[self.training_features_]
        # align width to what the estimator finally used (after dropping constants)
        if X_df.shape[1] > self.n_expected_:
            X_df = X_df.iloc[:, : self.n_expected_]
        elif X_df.shape[1] < self.n_expected_:
            pad = np.zeros((len(X_df), self.n_expected_ - X_df.shape[1]))
            X_df = pd.concat([X_df, pd.DataFrame(pad, index=X_df.index)], axis=1)
        return self.model_.predict(X_df)

# 4) Robust AutoML builder 

def make_robust_automl(time_budget=120, metric="rmse"):
    """FixedSchema ➜ Imputer ➜ Scaler ➜ FLAML"""
    return make_pipeline(
        FixedSchema(fill_value=0.0, as_ndarray=False),
        SimpleImputer(strategy="median"),
        StandardScaler(),
        FlamlRegressorDoubleML(time_budget=time_budget, metric=metric),
    )

# 5) Data load & preprocessing
logger.info("Loading data …")
df = dataset.copy()  # `dataset`: the raw DataFrame, as in the first script
df["Date"] = pd.to_datetime(df["Date"])
df["Amount spent (USD)"] = df["Amount spent (USD)"].ffill()

# Past‑only ad‑stock → no leakage
alpha, lags = 0.9, 7
w = alpha ** np.arange(lags)
df["adstock_spend"] = (
    df["Amount spent (USD)"]
    .rolling(window=lags, min_periods=1)
    .apply(lambda x: np.dot(x[::-1], w[:len(x)]), raw=True)  # newest spend gets weight alpha**0
)

df = df.sort_values("Date").dropna(subset=["gmv", "adstock_spend"]).reset_index(drop=True)
logger.info("Data ready: %d obs", len(df))

# 6) Feature engineering & DoubleML data


T0_FE = time.time()
featurizer = PureTemporalFeatures("Date")
df_feat = featurizer.fit_transform(df)

business_cols = ["10_yr_ir", "dj_close_open_ratio", "App DAUs (iOS+Android)", "posts"]

temp_feats = [
    "month_sin","month_cos","dow_sin","dow_cos","doy_sin","doy_cos","is_weekend","is_month_start",
    "is_month_end","is_quarter_start","is_quarter_end","is_holiday_season","is_black_friday_week",
    "is_nordstrom_anniversary_sale","is_sephora_spring_sale","is_walmart_july_event","is_target_july_event",
    "is_amazon_prime_days_window","time_trend","time_trend_sq",
]

all_x_cols = business_cols + temp_feats

# build modelling frame -------------------------------------------------
df_model = df_feat.copy()
df_model["gmv"]            = df["gmv"]
df_model["adstock_spend"]  = df["adstock_spend"]

df_model.dropna(subset=["gmv","adstock_spend"], inplace=True)
df_model.reset_index(drop=True, inplace=True)

dml_data = dml.DoubleMLData(
    df_model,
    y_col="gmv",
    d_cols=["adstock_spend"],
    x_cols=all_x_cols,
)
logger.info("Feature engineering & DMLData creation took %.2fs — %d obs × %d covs",
            time.time()-T0_FE, dml_data.n_obs, len(dml_data.x_cols))


# 7) Manual, time-aware CV to obtain external predictions


logger.info("Generating external predictions with TimeSeriesSplit")
t0_manual_cv = time.time()

tscv = TimeSeriesSplit(n_splits=N_FOLDS_TS)
robust_l = make_robust_automl(time_budget=60, metric="rmse")
robust_m = make_robust_automl(time_budget=60, metric="rmse")

preds_l_ext = np.full((dml_data.n_obs, 1), np.nan)
preds_m_ext = np.full((dml_data.n_obs, 1), np.nan)

for k, (tri, tei) in enumerate(tscv.split(dml_data.x), 1):
    logger.info("Fold %d: train %d–%d | test %d–%d", k, tri[0], tri[-1], tei[0], tei[-1])

    # outcome model ------------------------------------------------------
    robust_l.fit(dml_data.x[tri, :], dml_data.y[tri])
    preds_l_ext[tei, 0] = robust_l.predict(dml_data.x[tei, :])

    # treatment model ----------------------------------------------------
    robust_m.fit(dml_data.x[tri, :], dml_data.d[tri])
    preds_m_ext[tei, 0] = robust_m.predict(dml_data.x[tei, :])

logger.info("CV finished in %.2fs", time.time() - t0_manual_cv)
assert not np.isnan(preds_l_ext).any()
assert not np.isnan(preds_m_ext).any()

external_predictions_dict = {
    "adstock_spend": {"ml_l": preds_l_ext, "ml_m": preds_m_ext}
}


# 8) DoubleMLPLR with external predictions


t0_dml = time.time()
logger.info("Fitting DoubleMLPLR with external predictions")
dml_plr = dml.DoubleMLPLR(
    dml_data,
    ml_l=None,               # not used because we pass external preds
    ml_m=None,
    n_folds=N_FOLDS_TS,
    n_rep=N_REP_MANUAL,
    score="partialling out",
    draw_sample_splitting=False,
)

res = dml_plr.fit(external_predictions=external_predictions_dict, store_models=False)
logger.info("DoubleMLPLR completed in %.2fs", time.time() - t0_dml)

print("\nCausal estimate (theta):", res.coef)
print("Std. error:", res.se)

logger.info("Total script time: %.2fs", time.time() - T0_TOTAL)

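# 9) Results & Diagnostics (with External Predictions)
logger.info("Causal effect estimate (from external predictions):")
logger.info(f"\n{res.summary}")

# Per-fold out-of-sample RMSEs from the manual TimeSeriesSplit predictions
test_folds = [tei for _, tei in tscv.split(dml_data.x)]
oos_rmse_l_scores = [np.sqrt(mean_squared_error(dml_data.y[t], preds_l_ext[t, 0])) for t in test_folds]
oos_rmse_m_scores = [np.sqrt(mean_squared_error(dml_data.d[t], preds_m_ext[t, 0])) for t in test_folds]

print("\n--- External Prediction Run (Time-Series CV) ---")
try:
    print("Fit with external predictions completed successfully.")
    print("Coefficients (theta):", res.coef)
    print("Standard errors:", res.se)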
    print(dml_plr)

    print("\nEvaluating learners (based on manual OOS predictions from TimeSeriesSplit):")
    if oos_rmse_l_scores:
        mean_oos_rmse_l = np.mean(oos_rmse_l_scores)
        std_oos_rmse_l = np.std(oos_rmse_l_scores)
        print(f"Mean OOS RMSE (ml_l) across TS folds: {mean_oos_rmse_l:.4f}")
        print(f"Std OOS RMSE (ml_l) across TS folds: {std_oos_rmse_l:.4f}")
        # print(f"  All ml_l RMSEs: {[f'{s:.4f}' for s in oos_rmse_l_scores]}")
    if oos_rmse_m_scores:
        mean_oos_rmse_m = np.mean(oos_rmse_m_scores)
        std_oos_rmse_m = np.std(oos_rmse_m_scores)
        print(f"Mean OOS RMSE (ml_m) across TS folds: {mean_oos_rmse_m:.4f}")
        print(f"Std OOS RMSE (ml_m) across TS folds: {std_oos_rmse_m:.4f}")
        # print(f"  All ml_m RMSEs: {[f'{s:.4f}' for s in oos_rmse_m_scores]}")

except Exception as e:
    print(f" Post-fit error (external predictions): {e}")

logger.info(f"Total script time: {time.time() - T0_TOTAL:.2f}s")

1 reply

SvenKlaassen Jun 3, 2025
Maintainer

Could you adapt your example such that you use a simple generated dataset (so I can test it)?

paul-jdfagan · 2025-06-03T18:09:35Z

Author

Hey Sven,

The only fully time-aware solution I can make work is to decouple the approach. The nuisance models (ml_l and ml_m) are tuned internally by FLAML using its split_type="time" cross-validation and are trained once on an initial historical data segment. They then generate a single set of leak-safe "external predictions" for the entire dataset. These predictions are passed to DoubleML with n_rep set to 1 to match the single prediction set, and DoubleML then performs its cross-fitting on the residuals derived from these pre-computed, time-aware predictions. Hopefully this preserves both temporal integrity and valid causal estimation.

===== Results =====
θ = 2.14
SE = 0.20
True θ = 2.50
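
One thing worth checking with a single-cutoff design: predictions for rows before the cutoff come from a model that has already seen those rows, so their residuals can look better than genuinely out-of-sample ones. The sketch below illustrates the effect with ordinary least squares on made-up data as a stand-in for a flexible learner. It does not use FLAML or the data from this thread, and the cutoff of 750 is borrowed from the script that follows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, train_end = 1_000, 750           # cutoff mirrors TRAIN_END in the script
X = rng.normal(size=(n_obs, 300))       # many noisy features, to mimic a flexible learner
y = X[:, 0] + rng.normal(size=n_obs)    # only the first feature matters

# Fit once on the initial segment, then predict every row
Xb = np.column_stack([np.ones(n_obs), X])
beta, *_ = np.linalg.lstsq(Xb[:train_end], y[:train_end], rcond=None)
resid = y - Xb @ beta

rmse_in = np.sqrt(np.mean(resid[:train_end] ** 2))   # rows the model saw
rmse_out = np.sqrt(np.mean(resid[train_end:] ** 2))  # rows it did not see
print(f"in-sample RMSE {rmse_in:.3f} vs out-of-sample RMSE {rmse_out:.3f}")
```

If the same gap appears with the FLAML learners, the residuals for the first 750 rows are partly a product of overfitting, which may be one reason the estimate drifts away from the true value.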


"""
Pre-quential DoubleML demo (leak-safe, no cross-fitting)
Truth θ = 2.50
"""

# ------------------------------------------------------------------
# Imports & synthetic data 
# ------------------------------------------------------------------
import numpy as np, pandas as pd, random, logging
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.dummy import DummyRegressor
from flaml import AutoML
import doubleml as dml

np.random.seed(42)
random.seed(42)
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)

# 1. Synthetic generator 
N_DAYS = 1_000
dates  = pd.date_range("2021-01-01", periods=N_DAYS, freq="D")
alpha, LAGS, true_theta = 0.9, 7, 2.50

df = pd.DataFrame({"Date": dates})
df["10_yr_ir"]            = np.random.normal(2, 0.2,  N_DAYS)
df["dj_close_open_ratio"] = np.random.normal(1, 0.02, N_DAYS)
df["App DAUs (iOS+Android)"] = np.random.normal(50_000, 5_000, N_DAYS)
df["posts"]               = np.random.poisson(80, N_DAYS)
df["Amount spent (USD)"]  = np.random.gamma(5, 200,  N_DAYS)

w = alpha ** np.arange(LAGS)
df["adstock_spend"] = (
    df["Amount spent (USD)"]
    .rolling(LAGS, 1)
    .apply(lambda x: np.dot(x, w[len(w)-len(x):]), raw=True)
)

baseline    = 100_000
seasonality = 10_000*np.sin(2*np.pi*df["Date"].dt.dayofyear/365.25)
noise       = np.random.normal(0, 5_000, N_DAYS)
df["gmv"]   = baseline + true_theta*df["adstock_spend"] + seasonality + noise
logger.info("Synthetic data ready: %d rows", len(df))

# 2. Feature builder 
class PureTemporalFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, date_col="Date"): self.date_col = date_col
    def fit(self, X, y=None): return self
    def transform(self, X):
        X = X.copy()
        X[self.date_col] = pd.to_datetime(X[self.date_col])
        X.sort_values(self.date_col, inplace=True)
        X["month_sin"] = np.sin(2*np.pi*X[self.date_col].dt.month/12)
        X["month_cos"] = np.cos(2*np.pi*X[self.date_col].dt.month/12)
        X["dow_sin"]   = np.sin(2*np.pi*X[self.date_col].dt.weekday/7)
        X["dow_cos"]   = np.cos(2*np.pi*X[self.date_col].dt.weekday/7)
        X["doy_sin"]   = np.sin(2*np.pi*X[self.date_col].dt.dayofyear/365.25)
        X["doy_cos"]   = np.cos(2*np.pi*X[self.date_col].dt.dayofyear/365.25)
        X["time_trend"]    = np.arange(len(X))
        X["time_trend_sq"] = X["time_trend"]**2
        return X.drop(columns=[self.date_col])

# 3. FixedSchema & FLAML wrapper 
class FixedSchema(BaseEstimator, TransformerMixin):
    def __init__(self, fill_value=0.0): self.fill_value = fill_value
    def fit(self, X, y=None):
        self.columns_ = list(X.columns)
        return self
    def transform(self, X):
        Xf = X.copy()
        for c in set(self.columns_) - set(Xf.columns):
            Xf[c] = self.fill_value
        return Xf[self.columns_]

class FlamlRegressor(BaseEstimator):
    _estimator_type = "regressor"
    def __init__(self, time_budget=20):
        self.time_budget = time_budget
        self.auto_ml = AutoML()
    def fit(self, X, y):
        # Ensure column names are strings for FLAML
        X_df = pd.DataFrame(X)
        X_df.columns = [str(col) for col in X_df.columns]
        self.auto_ml.fit(
            X_train=X_df, y_train=y, task="regression",
            time_budget=self.time_budget, split_type="time", # Split_type 'time' is for FLAML's internal CV
            metric="rmse", verbose=0
        )
        self.model_ = self.auto_ml.model.estimator
        # Store stringified column names for prediction
        self.training_features_ = list(X_df.columns)
        self.n_expected_ = getattr(self.model_, "n_features_in_", len(self.training_features_))
        return self
    def predict(self, X):
        X_df = pd.DataFrame(X)
        X_df.columns = [str(col) for col in X_df.columns] # Ensure prediction columns are also strings

        # Reindex to match training features, adding missing columns as 0.0
        X_df = X_df.reindex(columns=self.training_features_).fillna(0.0)

        # Ensure numeric width matches what the estimator finally expects
        current_n_features = X_df.shape[1]
        if current_n_features > self.n_expected_:
            X_df = X_df.iloc[:, :self.n_expected_]
        elif current_n_features < self.n_expected_:
            pad_width = self.n_expected_ - current_n_features
            pad_df = pd.DataFrame(np.zeros((len(X_df), pad_width)), index=X_df.index,
                                    columns=[f"pad_{i}" for i in range(pad_width)])
            X_df = pd.concat([X_df, pad_df], axis=1)

        return self.model_.predict(X_df.to_numpy())

def make_automl():
    return make_pipeline(
        FixedSchema(fill_value=0.0), # FixedSchema ensures column consistency
        SimpleImputer(strategy="median"),
        StandardScaler(),
        FlamlRegressor(time_budget=20)
    )

# 4. Model frame & DoubleML container 
business = ["10_yr_ir","dj_close_open_ratio","App DAUs (iOS+Android)","posts"]
temps    = ["month_sin","month_cos","dow_sin","dow_cos","doy_sin","doy_cos",
            "time_trend","time_trend_sq"]
x_cols   = business + temps

feat  = PureTemporalFeatures("Date")
df_x  = feat.fit_transform(df)
df_x["gmv"]           = df["gmv"]
df_x["adstock_spend"] = df["adstock_spend"]

dml_data = dml.DoubleMLData(
    df_x, y_col="gmv", d_cols=["adstock_spend"], x_cols=x_cols
)

# 5. Pre-quential split & nuisance models
# Define the split for pre-quential prediction
# This defines the "training cutoff" for external predictions.
TRAIN_END = 750
# The remaining part of the data will be used for testing. For a true pre-quential
# setup, you'd typically have multiple training/test splits, but for this simplified
# example, we'll use a single split for external predictions that covers the entire dataset.

# Train your learners on the initial training segment
ml_l = make_automl()
ml_m = make_automl()

# Convert to DataFrame for FixedSchema and FLAML's internal use (especially for column names)
X_train_df = pd.DataFrame(dml_data.x[:TRAIN_END, :], columns=x_cols)

ml_l.fit(X_train_df, dml_data.y[:TRAIN_END])
ml_m.fit(X_train_df, dml_data.d[:TRAIN_END])

# Generate predictions for the *entire* dataset using the models trained
# on the initial segment.
# Convert the entire dml_data.x to DataFrame for prediction consistency
X_full_df = pd.DataFrame(dml_data.x[:, :], columns=x_cols)

preds_l = ml_l.predict(X_full_df)[:, None]
preds_m = ml_m.predict(X_full_df)[:, None]

external = {"adstock_spend": {"ml_l": preds_l, "ml_m": preds_m}}

# 6. DoubleML PLR 
# When external_predictions are supplied, DoubleML still uses its own cross-fitting scheme (default K-fold) to aggregate the nuisance estimates.

N_FOLDS_DML = 5 # DoubleML's default number of folds for internal cross-fitting

plr = dml.DoubleMLPLR(
    dml_data,
    ml_l=DummyRegressor(),
    ml_m=DummyRegressor(),
    n_folds=N_FOLDS_DML,
    n_rep=1, # Since you're providing predictions for all data points, 1 repetition is usually fine.
    score="partialling out",
    draw_sample_splitting=True # Allow DoubleML to generate standard cross-fitting splits
                               # to aggregate the externally provided predictions.
)

res = plr.fit(external_predictions=external, store_models=False)

print("\n===== Results =====")
print(f"θ  = {res.coef[0]:.2f}")
print(f"SE = {res.se[0]:.2f}")
print(f"True θ = {true_theta:.2f}")

1 reply

SvenKlaassen Jun 4, 2025
Maintainer

@paul-jdfagan, thanks for the example.

Working with pre-defined sample splits does not seem possible in a time-series setting: the package currently expects scores for all observations, which cannot be provided for the first training period even if the CV split respects the time structure.

Your example already looks fine to me.
The only extension would be a rolling or expanding-window prediction setting. This would still respect the time structure while yielding out-of-sample predictions for a larger share of the sample.
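As a rough illustration of the rolling/expanding idea, here is a minimal sketch using scikit-learn's `TimeSeriesSplit`. The synthetic data, the `LinearRegression` stand-in learner, the helper name `window_predictions`, and the window length of 35 are all assumptions made for this example, not part of the original code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy series, ordered in time
rng = np.random.RandomState(0)
n = 200
X = np.column_stack([np.arange(n), np.sin(np.arange(n) / 12)])
y = 0.3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n)


def window_predictions(X, y, splitter):
    """Out-of-sample predictions; rows that are never test data stay NaN."""
    preds = np.full(len(y), np.nan)
    for train_idx, test_idx in splitter.split(X):
        # Every training index precedes every test index, so no look-ahead
        assert train_idx.max() < test_idx.min()
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        preds[test_idx] = model.predict(X[test_idx])
    return preds


# Expanding window: the training set grows with each split
expanding = TimeSeriesSplit(n_splits=5)
# Rolling window: the training set is capped at the most recent 35 rows
rolling = TimeSeriesSplit(n_splits=5, max_train_size=35)

preds_expanding = window_predictions(X, y, expanding)
preds_rolling = window_predictions(X, y, rolling)

n_predicted = int((~np.isnan(preds_expanding)).sum())
print(f"rows with out-of-sample predictions: {n_predicted} of {n}")
```

Only the initial training block is left without a prediction; every later row is predicted by a model that saw earlier rows only.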

OliverSchacht · 2025-06-04T12:41:19Z

OliverSchacht
Jun 4, 2025
Collaborator

Hi @paul-jdfagan, thanks for the example.

I think the solution suggested by @SvenKlaassen works great.

If your use case requires more sophisticated splitting, I would recommend working with a customized splitter, as proposed in this discussion.

As you can see in the code, you can hand a custom TimeSeriesSplit() object to flaml to get reproducible splits. This is also what flaml does internally when you call it as above. You can then use the same splitter to generate out-of-sample external predictions for DoubleML. You can also fine-tune this further.

Your code could look something like the example below. One advantage is that this is potentially more data-efficient, as you only discard the initial training block (roughly the first $n/(k+1)$ samples for TimeSeriesSplit(n_splits=k)). On the other hand, it uses a different CV scheme than default DoubleML, so you might want to read up on the caveats in the double machine learning literature.

import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import root_mean_squared_error

from flaml import AutoML

# create dummy time series
n = 200
date_idx = pd.date_range("2021-01-01", periods=n, freq="D")
X = pd.DataFrame({
        "trend": np.arange(n),
        "sin":   np.sin(np.arange(n)/12),
        "cos":   np.cos(np.arange(n)/12)
    }, index=date_idx)
rng = np.random.RandomState(42)
y = 0.3 * X["trend"] + X["sin"] + rng.normal(scale=0.5, size=n)

# define custom splitter (equivalent to setting `split_type="time"` in flaml)
my_splitter = TimeSeriesSplit(n_splits=5) # you can do customizations here

# define AutoML
automl = AutoML()
automl_settings = dict(
    task="regression",
    metric="rmse",           
    time_budget=60,            
    split_type=my_splitter,
    verbose=0,
)
automl.fit(X_train=X, y_train=y, **automl_settings)

# Create predictions with same splits
cv_preds = np.full_like(y, fill_value=np.nan, dtype=float)

best_estimator = automl.model.estimator

for train_idx, test_idx in my_splitter.split(X):
    best_estimator.fit(X.iloc[train_idx], y.iloc[train_idx])
    cv_preds[test_idx] = best_estimator.predict(X.iloc[test_idx])

mask = ~np.isnan(cv_preds)
cv_loss = root_mean_squared_error(y[mask], cv_preds[mask])

# mask first data that is never predicted before creating DoubleMLData object
X_for_doubleml = X.iloc[mask]
y_for_doubleml = y[mask]
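
To make the masking step more concrete, here is a minimal sketch of how out-of-sample nuisance predictions for both the outcome and the treatment could feed a partialling-out estimate once the never-predicted rows are dropped. It computes the standard partialling-out formulas by hand with NumPy rather than through DoubleML; the synthetic data, `theta_true`, and the `LinearRegression` stand-in for the tuned estimator are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data with a known treatment effect
rng = np.random.RandomState(42)
n, theta_true = 600, 0.5
t = np.arange(n)
X = np.column_stack([np.sin(t / 12), np.cos(t / 12)])
d = X[:, 0] + rng.normal(size=n)                    # treatment
y = theta_true * d + X[:, 1] + rng.normal(size=n)   # outcome

splitter = TimeSeriesSplit(n_splits=5)
l_hat = np.full(n, np.nan)  # out-of-sample predictions of y given X
m_hat = np.full(n, np.nan)  # out-of-sample predictions of d given X
for train_idx, test_idx in splitter.split(X):
    fit_l = LinearRegression().fit(X[train_idx], y[train_idx])
    fit_m = LinearRegression().fit(X[train_idx], d[train_idx])
    l_hat[test_idx] = fit_l.predict(X[test_idx])
    m_hat[test_idx] = fit_m.predict(X[test_idx])

# Drop the initial block that was only ever used for training
mask = ~np.isnan(l_hat)
u = y[mask] - l_hat[mask]   # outcome residuals
v = d[mask] - m_hat[mask]   # treatment residuals

# Partialling-out estimate and its standard error
theta_hat = np.sum(u * v) / np.sum(v ** 2)
psi = (u - theta_hat * v) * v
se = np.sqrt(np.mean(psi ** 2) / np.mean(v ** 2) ** 2 / mask.sum())

print(f"theta_hat = {theta_hat:.3f}, se = {se:.3f}, n_used = {mask.sum()}")
```

In the DoubleML workflow, the same masked arrays would instead be passed through `external_predictions`, as in the earlier example in this thread.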

Best, Oliver

