Hi all!
I looked for tools to help download pGWAS data from the UKB‑PPP project but could not find a suitable one. The dataset is also large to download and process, even though one usually ends up keeping only a tiny fraction of it: the significant QTLs above a given LOG10P threshold.
I have therefore created a Python package, `UKBPPP-DL`, for easy, robust, and memory‑efficient downloading of pGWAS data from UKB-PPP, with the option of filtering on a given −log10(P) threshold on the fly.
I hope this package can help fellow scientists and avoid having each of us reinvent the wheel. If you spot any problems or think additional features would be useful, please don’t hesitate to contact me.
You can check the [GitHub repository](https://github.com/nglm/ukbppp_dl) and [documentation](https://ukbppp-dl.readthedocs.io/en/latest/index.html#) for more information.
Natacha Galmiche
### In short:
#### Installation
```bash
pip install ukbppp-dl
```
#### Usage
```python
from ukbppp_dl.pgwas import keep_significant_qtls_from_region, PGWAS_REGIONS
# Synapse directory containing pQTL summary statistics (here for Europe)
REGION = PGWAS_REGIONS["European"]
# Significance threshold for pQTLs (LOG10P > 7 corresponds to p-value < 1e-7)
LOG10P_THRESHOLD = 7
# Whether to create a log file
# (0: no log file, >0: create different levels of log files)
CREATE_LOG = 2
# Whether to have an output text describing the function's run
# (0: no text, >0: create different levels of verbosity)
VERBOSE = 3
# Set to a list of protein tar file names or Synapse IDs to process only specific proteins
# PROTEIN_TO_PROCESS = ["ACOT13_Q9NPJ3_OID31522_v1_Oncology_II.tar", "syn52363271"]
# Otherwise, set to None to process all proteins in the region
PROTEIN_TO_PROCESS = None
all_significant_qtls, log_reg = keep_significant_qtls_from_region(
    synapse_folder_id=REGION,
    download_location="./data",
    res_location="./results",
    log10p_threshold=LOG10P_THRESHOLD,
    create_log=CREATE_LOG,
    verbose=VERBOSE,
    delete_downloaded_tar=True,
    delete_chr_csv=True,
    protein_to_process=PROTEIN_TO_PROCESS,
    delete_tar_csv=False,
    delete_tar_log=False,
    delete_partial_logs=False,
    delete_partial_outputs=False,
)
```
```
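
To illustrate what filtering "on the fly" means, here is a minimal standalone sketch using only the standard library. It is not taken from the package and may differ from how `UKBPPP-DL` works internally. The function name `iter_significant_rows`, the space-delimited layout, and the toy variant IDs are assumptions made for illustration; only the `LOG10P` column name comes from the description above. Rows are read one at a time, so memory use stays small regardless of file size.

```python
import csv
import io


def iter_significant_rows(handle, log10p_threshold, delimiter=" "):
    """Yield rows whose LOG10P is strictly above the threshold.

    Hypothetical helper: rows are streamed one by one, so the full
    summary-statistics file is never held in memory.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    for row in reader:
        if float(row["LOG10P"]) > log10p_threshold:
            yield row


# Toy, made-up data standing in for one summary-statistics file
# (assumed to be space-delimited with a LOG10P column).
example = io.StringIO(
    "ID LOG10P\n"
    "variant_a 2.5\n"
    "variant_b 9.1\n"
    "variant_c 7.0\n"
)

significant = list(iter_significant_rows(example, log10p_threshold=7))
print([row["ID"] for row in significant])  # ['variant_b']
```

Note that the comparison is strict, so a row with LOG10P exactly equal to the threshold is dropped, matching the `LOG10P > 7` convention in the usage comment above.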