This directory contains all the code needed to run RegVar and to reproduce the figures and results in the RegVar manuscript.

Directories
===========
Python_scripts/ -- contains the Python scripts to annotate SNP-TSS samples (RegVar_annotate_variants.py for DHS filtered models and RegVar_annotate_variants_full_model.py for full models), to train RegVar models (RegVar_training.py for DHS filtered models and RegVar_training_full_model.py for full models), and to retrieve RegVar scores for annotated samples (RegVar_prediction.py for DHS filtered models and RegVar_prediction_full_model.py for full models). It also includes the scripts to annotate the HGMD variants (annotate_variants_hgmd.py), to train the pathogenic model (RegVar_training_hgmd.py), and to retrieve pathogenic scores (RegVar_prediction_hgmd.py).

RegVar_paper/ -- contains all result files and the R scripts that generate the figures in the RegVar manuscript.

Tmp/ -- initially empty; used by the annotation script for temporary annotation files.

Compressed files
================
Annotation_profiles.tar.gz -- contains the processed sequential, evolutionary, and epigenetic profiles used as RegVar features.

Training_sets.tar.gz -- contains the GTEx SNP-TSS samples and HGMD variants used in RegVar training, the SNP-TSS samples used in external evaluation, the variants in chromosome 22, and 100,000 SNPs randomly selected across the genome.

Running RegVar
==============

Required
--------
We recommend a Linux setup. Our own clusters ran Ubuntu 18.04 with an NVIDIA TITAN Xp GPU to speed up training. The software requires the following programs and packages:

-- Python 2.7.* https://www.python.org/
-- tensorflow https://www.tensorflow.org/
-- pandas https://pandas.pydata.org/
-- scikit-learn https://scikit-learn.org/stable/
-- bedtools https://bedtools.readthedocs.io/en/latest/

Files preparation
-----------------
SNP-TSS samples should be in a 6-column, tab-delimited BED file; the files in the Training_sets/ directory serve as examples.

Annotation
----------
Go into the Python_scripts/ directory and run the RegVar_annotate_variants.py script to annotate the variant samples (the other Python scripts described in this Readme file should also be run from this directory). In the following command line, option -i is followed by the variant file containing the SNP-TSS samples (eqtls.bed); option -o is followed by the output file that will contain the annotated features of the SNP-TSS samples (annotated.txt); option -t is followed by the tissue in which you want to annotate the samples (tissue). Running the script with the single option -h displays the help information.

    python RegVar_annotate_variants.py -i eqtls.bed -o annotated.txt -t tissue

Model training
--------------
RegVar models can be trained with the RegVar_training.py script. In the following command line, options -p and -n are followed by the annotated files of positive (positive_annotation.txt) and negative (negative_annotation.txt) samples, respectively; option -m is followed by the file path where you want to save the trained model (model_path); option -t is followed by the tissue for which you want to train a DNN model (tissue). Training a DNN model takes tens of minutes, depending on the number of training samples.

    python RegVar_training.py -p positive_annotation.txt -n negative_annotation.txt -m model_path -t tissue

Prediction
----------
Predictions for an annotated file can be obtained with the RegVar_prediction.py script.
In the following command line, option -i is followed by the annotated file of your SNP-TSS samples (annotated.txt); option -m is followed by the file path where the trained model was saved (model_path); option -t is followed by the tissue in which you want to predict the samples (tissue).

    python RegVar_prediction.py -i annotated.txt -m model_path -t tissue

The scripts that process the HGMD variants are used in the same way as described above, except that no tissue parameter is needed. Their help information is likewise displayed with the single option -h.

Reproducing figures from the manuscript
=======================================
All figures in the RegVar manuscript can be reproduced from the files and R scripts in the RegVar_paper/ directory. Go into the directory of the figure you want and run its R script. Example command lines:

    cd Figure2/
    Rscript Figure2.R
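
Scripting the workflow (sketch)
===============================
The annotation, training, and prediction command lines described under "Running RegVar" can also be assembled from a small Python helper, which may be convenient when looping over several tissues or input files. The following is only a minimal sketch and not part of the RegVar distribution: the function names (count_bad_bed_lines, annotate_cmd, train_cmd, predict_cmd, run_all) are hypothetical, and only the script names and options documented in this Readme file are used. The input check tests only the number of tab-delimited fields per line and assumes nothing about what the six columns contain. run_all assumes it is started from the directory that contains Python_scripts/, so any relative file paths are resolved inside Python_scripts/.

```python
import subprocess


def count_bad_bed_lines(lines, n_columns=6):
    """Count non-empty lines that do not have exactly n_columns tab-delimited fields."""
    bad = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if line and len(line.split("\t")) != n_columns:
            bad += 1
    return bad


def annotate_cmd(bed_file, out_file, tissue):
    """Build the annotation command line (options -i, -o, -t)."""
    return ["python", "RegVar_annotate_variants.py",
            "-i", bed_file, "-o", out_file, "-t", tissue]


def train_cmd(positive_file, negative_file, model_path, tissue):
    """Build the training command line (options -p, -n, -m, -t)."""
    return ["python", "RegVar_training.py",
            "-p", positive_file, "-n", negative_file,
            "-m", model_path, "-t", tissue]


def predict_cmd(annotated_file, model_path, tissue):
    """Build the prediction command line (options -i, -m, -t)."""
    return ["python", "RegVar_prediction.py",
            "-i", annotated_file, "-m", model_path, "-t", tissue]


def run_all(commands, cwd="Python_scripts"):
    """Run the commands in order inside cwd; stop at the first failure."""
    for cmd in commands:
        subprocess.check_call(cmd, cwd=cwd)


# Placeholder file names and tissue, taken from the examples in this Readme file.
commands = [
    annotate_cmd("eqtls.bed", "annotated.txt", "tissue"),
    train_cmd("positive_annotation.txt", "negative_annotation.txt",
              "model_path", "tissue"),
    predict_cmd("annotated.txt", "model_path", "tissue"),
]

# Print the command lines for inspection; nothing is executed here.
for cmd in commands:
    print(" ".join(cmd))

# To actually execute them (assumption: started next to Python_scripts/):
# run_all(commands)
```

To check an input file before annotation, pass an open file handle, for example count_bad_bed_lines(open("eqtls.bed")); a result of 0 means every non-empty line has six tab-delimited fields. Building each command as a list of arguments rather than one string avoids shell quoting problems with unusual file names.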