Ensuring Fully Time-Aware Cross-Validation with FLAML & DoubleML to Prevent Data Leakage in Time Series #329
Replies: 4 comments 2 replies
-
Hi @paul-jdfagan, generally I would agree that a time-aware CV procedure should be used if you would like to apply DoubleML to time series data. A possible alternative would be the use of external predictions. This way you would be able to fully control the CV procedure. I will try to take a more detailed look at your issue in the upcoming week.
-
Thanks @SvenKlaassen for looking into the case. Much appreciated. Below is an updated flow built around the external_predictions argument. Apologies if it's overcomplicated.
Thanks again!
-
Hey Sven, the only fully time-aware solution I can make work is to decouple the approach. The nuisance models (ml_l and ml_m) are tuned internally by FLAML using its split_type="time" cross-validation and trained once on an initial historical data segment to generate a single set of leak-safe "external predictions" for the entire dataset. These predictions are passed to DoubleML, with the n_rep parameter set to 1 to match the single prediction set. DoubleML then performs its cross-fitting on the residuals derived from these pre-computed, time-aware predictions, which hopefully ensures both temporal integrity and valid causal estimation.
===== Results =====
-
Hi @paul-jdfagan, thanks for the example. I think the solution suggested by @SvenKlaassen works great. If your use case requires more sophisticated splitting, I would recommend working with a customized splitter as proposed in this discussion. As you see in the code, you can hand over a custom Your code could look something like below. An advantage would be that this is potentially more data efficient, as you only throw out the first
Best, Oliver
-
Hello,
Huge fan of the package. I'm trying to apply it and FLAML to time series data, but I'm getting tripped up ensuring that I have a fully time-aware CV fit and avoid data leakage.
Below is the base code I'm using. I use split_type="time" for the FlamlRegressorDoubleML, but the final DoubleML cross-fitting will not be time-aware.
My ask: I would appreciate any advice on feeding time-stamped data into a standard K-Fold splitter while avoiding data leakage. Maybe I'm deep down a rabbit hole of overthinking it.
I believe I'd need to turn off the automatic splitting with draw_sample_splitting=False and do my own TimeSeriesSplit, but my previous attempts would not run properly and ran into partition errors. An alternative is to not respect the time series ordering in the final CV and just accept that my final model results are not time-sensitive. Maybe it's good enough that my two nuisance models are. All very confusing...
I'd greatly appreciate any support. Thank you.
Paul