I have found that the fingerprint data in the test file (Step1_TestData_Target2035.parquet) have a different number of bits than the fingerprints in the training dataset provided for model training. This makes it impossible to run machine learning models trained on the provided dataset, because the fingerprint vectors differ in length. We could ignore the inconsistent rows, but even the rows with the correct number of elements might be unreliable.

Is it possible for AIRCHECK to provide a fix for the dataset, and would an extension of the result submission deadline be possible?
Created by frederico kremer fredericokremer

Hi,
Thanks for looking into the data. We're happy to apply edits if there's an error.
However, please note that using len(str) counts the number of characters, not the number of integers in the list.
The data should first be split by commas, and then len() can be used to count the elements.
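To illustrate the difference, here is a minimal sketch in Python. The two fingerprint strings are made-up values, not taken from the challenge data, and `count_features` is a hypothetical helper name used only for this example:

```python
# Two hypothetical fingerprints stored as comma-separated strings.
# Both hold 5 features, but one value has two digits.
fp_a = "0,1,0,12,3"
fp_b = "0,1,0,2,3"


def count_features(fp: str) -> int:
    """Count the features in a comma-separated fingerprint string."""
    return len(fp.split(","))


# len() on the raw string counts characters, including the commas,
# so the two strings appear to have different lengths.
print(len(fp_a), len(fp_b))  # 10 9

# Splitting on commas first gives the real number of features.
print(count_features(fp_a), count_features(fp_b))  # 5 5
```

With this check, rows that looked inconsistent by character count turn out to have the same number of features.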
Alternatively, you can use the provided code to read the Parquet file correctly. It works fine for both the train and test data:
https://github.com/StructuralGenomicsConsortium/Target2035_Aircheck_Utils/tree/main/ReadingParquetFiles
Please feel free to reach out if you have any other concerns.
Best of luck in the challenge!

Hello,
The fingerprints in the test dataset are given as strings, not as arrays.
As a consequence, their string lengths might differ from those in the training set, or from other entries, since the strings include the separator character ',' and a feature can take more than one character (if its value is 10 or more).
We provide scripts to convert the fingerprint from a string to an array [here](https://github.com/StructuralGenomicsConsortium/Target2035_Aircheck_Utils/).
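For readers who want to see the idea without opening the repository, the conversion can be sketched roughly as follows. This is not the code from the linked scripts: `parse_fingerprint` and `parse_all` are hypothetical helper names, the sample rows are made up, and the sketch assumes every fingerprint is a plain comma-separated list of integers.

```python
from typing import List


def parse_fingerprint(fp: str) -> List[int]:
    """Convert one comma-separated fingerprint string to a list of ints."""
    return [int(value) for value in fp.split(",")]


def parse_all(rows: List[str]) -> List[List[int]]:
    """Parse every row and check that all vectors share one length."""
    parsed = [parse_fingerprint(row) for row in rows]
    lengths = {len(vector) for vector in parsed}
    if len(lengths) > 1:
        raise ValueError(f"Inconsistent fingerprint lengths: {sorted(lengths)}")
    return parsed


# Hypothetical rows: the string lengths differ, the feature counts do not.
rows = ["0,1,0,12,3", "0,1,0,2,3"]
vectors = parse_all(rows)
print(vectors)  # [[0, 1, 0, 12, 3], [0, 1, 0, 2, 3]]
```

If `parse_all` raises, the rows genuinely differ in feature count and are worth reporting; if it does not, the earlier mismatch came only from counting characters.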
Hope this solves your issue.