Differences in variables between datasets

Hi Organizers, I noticed that very few variables share a name between the 1662 and 1760 datasets (AGE is one of the few). A few quick questions to clarify: 1. Should we only use variables that are available in both datasets? 2. What if my model uses a variable in the testing dataset 1662 that is not available in the scoring dataset 1760 (e.g. ALBUMIN)? Will my model still run without using this variable, or will it fail? 3. Do you have a mapping between the two datasets for variables that have different names but the same meaning (such as SEX vs Gender)? BTW, to my understanding, the 1662 dataset is used for testing models during development, and the 1760 dataset is used to score midpoint and final submissions. I hope that is correct. Thanks, Morgan

Created by Morgan Su Morgan_Su101
Scott, I sent an email regarding a separate issue to challenges@cstructure.io. I would appreciate it if you could read the email. Thanks, Morgan
Yes, associate the Gender node with feature Gender for 1662 and feature Sex for 1760. I do not have a mapping table between the features; it seems like a very easy task for an LLM. 'Using all information available' is not the focus of the challenge. The ultimate goal is for participants to build knowledge-based causal graphs that can retrieve the causal effect (risk ratio) of systemic steroids (versus no steroids) on 28d survival in hospitalized patients, stratified by disease severity/oxygen requirements at the start of therapy, using real-world data. To accomplish this, the causal graph must reduce confounding. The primary evaluation criterion is the number of bootstrapped point estimates that overlap the published confidence intervals from the RECOVERY trial using the cohort data. Additional criteria to break ties include the plausibility of the causal model's edges and the predictive calibration of the model using the cohort data.
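To make the primary criterion concrete, here is a minimal sketch of counting bootstrapped risk-ratio estimates that land inside a published confidence interval. It is not the organizers' scoring code: the function names, the (treated, survived) row layout, and the interval bounds in the usage note are all assumptions for illustration, and the risk ratio here is crude (unadjusted), whereas the challenge estimates come from the fitted causal model.

```python
import random


def risk_ratio(rows):
    """Crude risk ratio of survival, treated vs. untreated.

    rows: list of (treated, survived) pairs, each value 0 or 1 (assumed layout).
    Returns None when an arm is empty or the control survival rate is zero.
    """
    treated = [s for t, s in rows if t == 1]
    control = [s for t, s in rows if t == 0]
    if not treated or not control:
        return None
    p_control = sum(control) / len(control)
    if p_control == 0:
        return None
    return (sum(treated) / len(treated)) / p_control


def count_overlap(rows, ci_low, ci_high, n_boot=1000, seed=0):
    """Count bootstrap point estimates that fall inside [ci_low, ci_high]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_boot):
        # Resample patients with replacement, keeping the cohort size fixed.
        sample = rng.choices(rows, k=len(rows))
        rr = risk_ratio(sample)
        if rr is not None and ci_low <= rr <= ci_high:
            hits += 1
    return hits
```

A call such as `count_overlap(rows, ci_low=0.9, ci_high=1.1)` would return the number of resamples inside the interval; the bounds 0.9 and 1.1 are placeholders, not RECOVERY results, and the real comparison is made per severity stratum.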
Scott, Thanks a lot for your reply. That is very helpful. So when I develop the model, I should associate the "Gender" node with feature Gender for the 1662 dataset and feature Sex for the 1760 dataset? Do you have a mapping table between the features of the two datasets? The case of Gender vs Sex is easy to identify. Others are not so clear. Also, I want to ask a naive question: is the ultimate goal of the model to 1) predict the survival probability as accurately as possible using all information available (steroid treatment included), or 2) assess the effect of steroid treatment on survival probability, with all confounding factors adjusted for? Again, I really appreciate your very kind help. Thanks, Morgan
Hi Morgan, I now understand your question. The node labels do not have any impact on statistical estimation; the feature name associated with a node is what is used for model fitting. In your example, you have a node labeled Gender. This 'Gender' node is associated with the dataset sdy1662 feature Gender. The same 'Gender' node is also associated with the dataset sdy1760 feature Sex. When you submit the model for benchmarking by selecting the sdy1760 dataset in the dropdown of the Estimate tab, you should see the available features in the Baseline/Time-Varying component update to Sex. If the sdy1662 dataset is selected in the Estimate tab dropdown, you should see Gender as a variable option in the Baseline/Time-Varying component. I hope that answers your question. Cheers.
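The node-to-feature association described above can be pictured as a small lookup table. This is only a sketch for keeping track of the bindings outside the tool; it is not cStructure's API or storage format, and `NODE_FEATURES`, `feature_for`, and the "Age" node label are hypothetical names.

```python
# Hypothetical bookkeeping table: node label -> {dataset: feature name}.
# The node label is only a display name; the feature is what gets fitted.
NODE_FEATURES = {
    "Gender": {"sdy1662": "Gender", "sdy1760": "Sex"},
    "Age": {"sdy1662": "AGE", "sdy1760": "AGE"},
}


def feature_for(node, dataset):
    """Return the dataset-specific feature bound to a node label."""
    try:
        return NODE_FEATURES[node][dataset]
    except KeyError:
        raise KeyError(
            f"node {node!r} has no feature bound in dataset {dataset!r}"
        ) from None
```

With this table, `feature_for("Gender", "sdy1760")` returns `"Sex"`, so the node label never has to be renamed when switching datasets.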
Scott, Thanks a lot for your reply. Really appreciate it! For Question 1, it seems I did not ask my question clearly, so let me try again with a specific example. In the testing dataset (1662), there is a variable "Gender", and assume I use this variable in my model. However, in the scoring dataset there is no variable called "Gender". Instead, there is a variable called "Sex". I imagine it means the same thing but with a different variable name. Now my question is: when I develop my model, I will assign "Gender" to a node and run my model with the test dataset. But before I submit my model, which will be run against the scoring dataset, should I change the variable name of that node to "Sex" in my model? Thanks, Morgan
Hi Morgan, 1. A strong benefit of structural causal models is flexibility in which variables are used as the adjustment set. Best practice is to build a structural causal model using domain knowledge, then assess which sets of nodes can be used as an adjustment set to achieve d-separation (confounding control, indicated by the absence of orange edges in cStructure). There may be more than one adjustment set that achieves d-separation; however, different adjustment sets may yield causal estimates with different levels of precision. Participants are free to choose any variables in a dataset that they believe will reduce confounding and yield the best precision. 2. The statistical estimations for 1662 and 1760 are separate procedures; a failure to converge in one dataset will not affect convergence in the other. 3. We have provided data dictionaries for both datasets; it is up to the participants to assess which variables are similar. Yes, 1662 can be used for model development, and 1760 is used as the evaluation dataset for the midpoint and final submissions.
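For point 3, a first pass over the two data dictionaries could be sketched as below. It uses only the Python standard library; the function name and the 0.8 similarity cutoff are assumptions, and the variable lists in the usage note contain only the names mentioned in this thread, not the full dictionaries.

```python
import difflib


def compare_dictionaries(names_a, names_b, cutoff=0.8):
    """Split names_a into exact (case-insensitive) matches in names_b,
    near-matches by spelling, and names with no match at all."""
    lower_b = {name.lower(): name for name in names_b}
    shared, candidates, unmatched = {}, {}, []
    for name in names_a:
        key = name.lower()
        if key in lower_b:
            shared[name] = lower_b[key]
            continue
        # Spelling similarity only; it cannot detect synonyms.
        close = difflib.get_close_matches(key, list(lower_b), n=3, cutoff=cutoff)
        if close:
            candidates[name] = [lower_b[c] for c in close]
        else:
            unmatched.append(name)
    return shared, candidates, unmatched
```

Called with `["AGE", "Gender", "ALBUMIN"]` for 1662 and `["AGE", "Sex"]` for 1760, this pairs AGE with AGE and leaves Gender and ALBUMIN unmatched. Name similarity cannot link Gender to Sex, so the unmatched list still has to be reviewed against the descriptions in the data dictionaries.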
