Model Convergence Issues: Missing Data with Pygformula

Problem Statement

Several participant submissions for the Covid Causal Diagram DREAM Challenge Mid-Point Evaluation have encountered model convergence failures due to data limitations. Upon investigation, these convergence issues appear to be caused primarily by missing-data patterns.

Technical Background

The pygformula package fits parametric models for time-varying covariates and outcomes. When substantial missing data is present, several issues can arise:

- Model fitting failures: insufficient complete cases during automated data validation prevent model fitting
- Numerical instability: missing-data patterns can lead to singular matrices or convergence warnings
- Biased estimates: listwise deletion can introduce selection bias if missingness is not completely at random

Current Impact on the Challenge

This issue is affecting participants' ability to complete the model validation phase successfully.

Proposed Solutions

We are seeking participant input on the preferred approach to address this issue. Please review the following options and provide feedback.

Option A: Curated Complete-Case Variable List

Approach: Provide a curated list of variables with no missing data that participants can use for data layering during structural causal model development.
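The curated list described in Option A could be derived mechanically from any dataset. A minimal pandas sketch with toy data (the variable names here are hypothetical illustrations, not columns from the actual challenge datasets):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a challenge dataset: some variables fully observed,
# some with missing entries.
df = pd.DataFrame({
    "age": [54, 61, 47, 70],
    "sex": ["F", "M", "F", "M"],
    "crp": [3.2, np.nan, 1.1, 0.8],
    "il6": [np.nan, 2.4, np.nan, 1.9],
})

# Variables with no missing data are the candidates for complete-case layering.
complete_vars = df.columns[df.notna().all()].tolist()
print(complete_vars)  # → ['age', 'sex']
```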
Advantages:
✅ Improves model convergence and stability
✅ Eliminates the need for imputation methodology decisions
✅ Faster model development and iteration
✅ Consistent data availability across all participants

Disadvantages:
❌ May exclude clinically important variables
❌ Reduces model complexity and potential insights

Option B: Multiple Imputation Framework

Approach: Implement a standardized multiple imputation procedure for the dataset, potentially using methods such as:

- Multiple Imputation by Chained Equations (MICE)
- Missingness pattern-aware imputation
- Domain-specific imputation strategies for clinical variables

Advantages:
✅ Preserves the full variable set for analysis
✅ Maintains statistical power with the full sample size

Disadvantages:
❌ Adds methodological complexity
❌ May introduce imputation-related bias
❌ Longer computational time for model fitting

Implementation Considerations

For Option A (Complete-Case Variables):

- Variables available for causal identification and estimation would be limited based on missingness patterns
- Documentation would include missingness percentages and counts for transparency

For Option B (Multiple Imputation):

- Standardized imputation would be performed by the organizers
- Multiple imputed datasets (e.g., m=10 or m=20) would be provided
- Documentation describing how to pool results across imputations would be provided

Request for Participant Feedback

Please respond to this post with your preference and rationale by [INSERT DEADLINE]. Consider the following questions in your response:

- Which approach better aligns with your research objectives?
- How important is it to include variables with missingness (10-20%) in your causal models?
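Under Option B, results fitted on each of the m imputed datasets would be combined with Rubin's rules (pooled estimate is the mean of the per-imputation estimates; total variance combines within- and between-imputation variance). A minimal sketch of that pooling step, using hypothetical effect estimates rather than output from any real model:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool point estimates and within-imputation variances from m
    imputed datasets using Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    w = variances.mean()                # within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    t = w + (1 + 1 / m) * b             # total variance
    return q_bar, t

# Hypothetical effect estimates and variances from m=5 imputed datasets
est = [0.42, 0.45, 0.40, 0.44, 0.43]
var = [0.010, 0.011, 0.009, 0.010, 0.012]
q, t = pool_rubin(est, var)
print(f"pooled estimate = {q:.3f}, total variance = {t:.4f}")
```

The between-imputation term grows when the imputations disagree, so pooled intervals honestly reflect the extra uncertainty that imputation introduces.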
Next Steps

Based on participant feedback, we will:

- Implement the preferred solution within 1 week of the feedback deadline
- Provide an updated data dictionary describing missingness rates
- Extend relevant deadlines if needed to give teams adequate time before model submission
- Offer additional technical support for the chosen approach

Additional Resources

- Pygformula documentation: https://pygformula.readthedocs.io/en/latest/
- Missing data in causal inference: recommended readings on handling missingness in SCMs:
  https://arxiv.org/pdf/2310.17434
  https://arxiv.org/pdf/2310.16207
  https://journals.sagepub.com/doi/10.1177/09622802251316971
- Technical support: contact [challenges@cstructure.io] for specific convergence issues
- Timeline impact: depending on the chosen solution, the final submission deadline may be extended by 1-2 weeks to ensure all participants have adequate time to adapt their models

Questions? Please post them as replies to this wiki entry or contact the organizers directly.

Created by Erick Scott (scottie)
Update: The SDY1760 Data Dictionary has been updated with missingness information: the percentage of patients with missing values and the total number of missing values for each variable. To see the updated Data Dictionary, complete the following steps in the cStructure interface:

1) Open the Data Sidebar
2) Turn the Data Dictionary switch to the On position
3) Delete the current sdy1760_data_dictionary.csv file by scrolling to the right of its row in the Data Sources component and clicking the trashcan button
4) Reload the webpage using your browser's reload button
5) Open the Data Sidebar
6) Turn the Data Dictionary switch to the On position
7) Select the sdy1760_data_dictionary.csv option in the Stored Dictionaries dropdown in the Challenge Data Dictionary component

This additional information should help participating teams select variables whose models are more likely to converge.
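The two fields added to the dictionary (count and percentage of missing values per variable) can be reproduced from any dataframe, which may help teams sanity-check their own extracts. A small pandas sketch with toy data (column names are hypothetical, not from SDY1760):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a patient-level table: rows are patients, columns are variables.
df = pd.DataFrame({
    "age": [54, 61, np.nan, 47, 70],
    "crp": [3.2, np.nan, np.nan, 1.1, 0.8],
    "sex": ["F", "M", "M", "F", "M"],
})

# Per-variable missingness: total count and percentage of patients missing.
missing_counts = df.isna().sum()
missing_pct = (df.isna().mean() * 100).round(1)
summary = pd.DataFrame({"n_missing": missing_counts, "pct_missing": missing_pct})
print(summary)
```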
Dear organizers,

I am Tsai-Min Chen, writing on behalf of our team, Metformin-121. We select Option A based on the following considerations:

- Validation datasets must be clear and correct to assess model performance accurately. Using complete-case data ensures that the validation dataset is free from imputation-induced bias, providing a transparent and stable foundation for performance evaluation.
- Using a clean, complete-case sdy1760 dataset enhances the reproducibility and generalizability of causal inference models. The resulting consistency ensures that the insights generated are valid and applicable beyond the sdy1662 dataset.
- Given that all participants built their causal models without seeing the missingness of variables in the sdy1662 dataset, missingness in the sdy1760 dataset should not affect our causal models' performance when competing with other models. However, it might distort our models' measured transferability across datasets.

Yours,
Tsai-Min, on behalf of Metformin-121
