Model Convergence Issues: Missing Data with Pygformula
Problem Statement
Several participant submissions for the Covid Causal Diagram DREAM Challenge Mid-Point Evaluation have encountered model convergence failures due to data limitations. Upon investigation, these convergence issues appear to be primarily caused by missing data patterns.
Technical Background
The pygformula package fits parametric models for time-varying covariates and outcomes. When substantial missing data is present, several issues can arise:
Model fitting failures: Insufficient complete cases during automated data validation prevent model fitting
Numerical instability: Missing data patterns can lead to singular matrices or convergence warnings
Biased estimates: Listwise deletion can introduce selection bias if missingness is not completely at random
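These failure modes can usually be diagnosed before any model is fit. A minimal sketch, assuming the data are loaded into a pandas DataFrame (the variable names below are hypothetical, not actual SDY1760 columns):

```python
import pandas as pd

def missingness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize per-variable missingness and report the complete-case count."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    }).sort_values("pct_missing", ascending=False)
    # Rows that survive listwise deletion -- the effective sample pygformula sees
    print(f"Complete cases: {len(df.dropna())} of {len(df)} rows")
    return report

# Toy data with hypothetical clinical variable names
df = pd.DataFrame({
    "age":     [34, 51, None, 47],
    "crp":     [2.1, None, None, 5.6],
    "outcome": [0, 1, 0, 1],
})
print(missingness_report(df))
```

Running a report like this on each candidate covariate set shows at a glance whether the complete-case sample is large enough to support model fitting.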
Current Impact on Challenge
This issue is affecting participants' ability to complete the model validation phase successfully.
Proposed Solutions
We are seeking participant input on the preferred approach to address this issue. Please review the following options and provide feedback:
Option A: Curated Complete-Case Variable List
Approach: Provide a curated list of variables with no missing data that participants can use for data layering during structural causal model development.
Advantages:
✅ Improves model convergence and stability
✅ Eliminates need for imputation methodology decisions
✅ Faster model development and iteration
✅ Consistent data availability across all participants
Disadvantages:
❌ May exclude clinically important variables
❌ Reduces model complexity and potential insights
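If Option A is adopted, the curated list could be derived mechanically by keeping only fully observed columns. A sketch under that assumption (DataFrame and column names are illustrative):

```python
import pandas as pd

def complete_case_variables(df: pd.DataFrame) -> list[str]:
    """Return the names of columns that contain no missing values."""
    return df.columns[df.notna().all()].tolist()

df = pd.DataFrame({
    "age":     [34, 51, 29],      # fully observed -> kept
    "crp":     [2.1, None, 4.0],  # has missingness -> excluded
    "outcome": [0, 1, 0],         # fully observed -> kept
})
print(complete_case_variables(df))  # → ['age', 'outcome']
```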
Option B: Multiple Imputation Framework
Approach: Implement a standardized multiple imputation procedure for the dataset, potentially using methods such as:
Multiple Imputation by Chained Equations (MICE)
Missingness pattern-aware imputation
Domain-specific imputation strategies for clinical variables
Advantages:
✅ Preserves full variable set for analysis
✅ Maintains statistical power with full sample size
Disadvantages:
❌ Adds methodological complexity
❌ May introduce imputation-related bias
❌ Longer computational time for model fitting
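As one concrete possibility for Option B (not necessarily what the organizers would standardize on), a MICE-style procedure can be sketched with scikit-learn's IterativeImputer, drawing from the posterior predictive with a different seed per imputation. Column names are hypothetical:

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental in scikit-learn and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def multiply_impute(df: pd.DataFrame, m: int = 10) -> list[pd.DataFrame]:
    """Generate m imputed copies of df, MICE-style: each copy samples
    imputed values from the posterior predictive under a different seed."""
    imputed = []
    for i in range(m):
        imp = IterativeImputer(sample_posterior=True, random_state=i)
        imputed.append(pd.DataFrame(imp.fit_transform(df), columns=df.columns))
    return imputed

df = pd.DataFrame({
    "age": [34.0, 51.0, np.nan, 47.0, 29.0],
    "crp": [2.1, np.nan, 3.3, 5.6, 1.8],
})
datasets = multiply_impute(df, m=5)
```

Each of the m datasets would then be analyzed separately and the results pooled.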
Implementation Considerations
For Option A (Complete-Case Variables)
Variables available for causal identification and estimation would be limited based on missingness patterns
Documentation would include missingness percentages and counts for transparency
For Option B (Multiple Imputation)
Standardized imputation would be performed by organizers
Multiple imputed datasets (e.g., m=10 or m=20) would be provided
Documentation describing how to pool results across imputations would be provided
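Pooling across imputed datasets typically follows Rubin's rules: average the point estimates, then combine within- and between-imputation variance. A minimal sketch with hypothetical estimates from m=5 imputed analyses:

```python
import math

def pool_rubin(estimates, variances):
    """Pool point estimates and their variances from m imputed analyses
    using Rubin's rules; returns the pooled estimate and standard error."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                # pooled estimate
    u_bar = sum(variances) / m                                # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)    # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b
    return q_bar, math.sqrt(total_var)

# Hypothetical risk-difference estimates (and SE^2) from m=5 imputed datasets
est, se = pool_rubin([0.12, 0.10, 0.14, 0.11, 0.13], [0.02**2] * 5)
print(round(est, 3), round(se, 4))  # → 0.12 0.0265
```

Note that the pooled standard error exceeds any single within-imputation standard error, reflecting the extra uncertainty due to missingness.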
Request for Participant Feedback
Please respond to this post with your preference and rationale by [INSERT DEADLINE].
Consider the following questions in your response:
Which approach better aligns with your research objectives?
How important is it to include variables with moderate missingness (10-20%) in your causal models?
Next Steps
Based on participant feedback, we will:
Implement the preferred solution within 1 week of the feedback deadline
Provide an updated data dictionary describing missingness rates
Extend relevant deadlines if needed to accommodate the model submission timeline
Offer additional technical support for the chosen approach
Additional Resources
Pygformula Documentation: https://pygformula.readthedocs.io/en/latest/
Missing Data in Causal Inference: Recommended readings on handling missingness in SCMs
https://arxiv.org/pdf/2310.17434
https://arxiv.org/pdf/2310.16207
https://journals.sagepub.com/doi/10.1177/09622802251316971
Technical Support: Contact [challenges@cstructure.io] for specific convergence issues
Timeline Impact: Depending on the chosen solution, the final submission deadline may be extended by 1-2 weeks to ensure all participants have adequate time to adapt their models.
Questions? Please post them as replies to this wiki entry or contact the organizers directly.
Created by Erick Scott (scottie)
Update:
The SDY1760 Data Dictionary has been updated with missingness information: the percentage of patients with missing values and the total number of missing values for each variable.
To see the updated Data Dictionary information, please complete the following in the cStructure interface:
1) Open the Data Sidebar
2) Turn the Data Dictionary switch to the On position
3) Delete the current sdy1760_data_dictionary.csv file by scrolling to the right of its row in the Data Sources component and clicking the trashcan button
4) Reload the webpage using your browser's reload button
5) Open the Data Sidebar
6) Turn the Data Dictionary switch to the On position
7) Select the sdy1760_data_dictionary.csv option in the Stored Dictionaries dropdown in the Challenge Data Dictionary component.
This additional information should help participating teams select variables that are more likely to converge.
Dear organizer,
I am Tsai-Min Chen, writing on behalf of our team, Metformin-121. We have selected Option A based on the following considerations:
Validation datasets must be clear and correct to accurately assess model performance. Using complete-case data ensures that the validation dataset is free from imputation-induced bias, providing a transparent and stable foundation for performance evaluation. Utilizing a clear, complete-case sdy1760 dataset enhances the reproducibility and generalizability of causal inference models. The resulting consistency ensures that insights generated are valid and applicable beyond the sdy1662 dataset.
Given that all participants built their causal models without seeing the missingness of variables in the sdy1662 dataset, missingness in the sdy1760 dataset should not affect our causal models' performance when competing against other models. However, it might distort our causal models' performance when evaluating transferability across different datasets.
Yours,
Tsai-Min, on behalf of Metformin-121