This directory contains all the code necessary to run RegVar and to reproduce the figures and results in the RegVar manuscript.

Directories
===========
Python_scripts/ -- contains the Python scripts to annotate SNP-TSS samples (RegVar_annotate_variants.py for DHS filtered models and RegVar_annotate_variants_full_model.py for full models), to train RegVar models (RegVar_training.py for DHS filtered models and RegVar_training_full_model.py for full models), and to retrieve RegVar scores for any annotated samples (RegVar_prediction.py for DHS filtered models and RegVar_prediction_full_model.py for full models). It also includes scripts to annotate the HGMD variants (annotate_variants_hgmd.py), to train the pathogenic model (RegVar_training_hgmd.py), and to retrieve pathogenic scores (RegVar_prediction_hgmd.py).

RegVar_paper/ -- contains all result files and R scripts needed to generate the figures in the RegVar manuscript.

Tmp/ -- initially empty, but used by the annotation script for temporary annotation files.

Compressed files
================
Annotation_profiles.tar.gz -- contains the processed sequential, evolutionary, and epigenetic profiles used as RegVar features.

Training_sets.tar.gz -- contains GTEx SNP-TSS samples and HGMD variants used in RegVar training, SNP-TSS samples used in external evaluation, variants in chromosome 22, and 100,000 SNPs randomly selected across the genome.

Running RegVar
==============
Required
--------
The software requires the following programs and packages, and we recommend a Linux setup. Our own clusters ran Ubuntu 18.04 with an NVIDIA TITAN Xp GPU to speed up training.

-- Python 2.7.* https://www.python.org/
-- tensorflow https://www.tensorflow.org/
-- pandas https://pandas.pydata.org/
-- scikit-learn https://scikit-learn.org/stable/
-- bedtools https://bedtools.readthedocs.io/en/latest/
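
Before running the scripts, it can help to confirm that the requirements listed above are visible from your Python environment. The following is only a minimal sketch and is not part of RegVar: the function name missing_requirements is hypothetical, and it only checks that the packages can be imported and that bedtools is on the PATH, not that the versions are suitable.

```python
import importlib

try:
    from shutil import which  # Python 3
except ImportError:
    from distutils.spawn import find_executable as which  # Python 2.7


def missing_requirements(modules, programs):
    """Return the modules that fail to import and the programs not found on PATH."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    for name in programs:
        if which(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # scikit-learn is imported under the module name "sklearn".
    problems = missing_requirements(["tensorflow", "pandas", "sklearn"], ["bedtools"])
    print("Missing: " + ", ".join(problems) if problems else "All requirements found.")
```

An empty result only means that everything was found; it says nothing about whether the installed versions match the ones RegVar was developed with.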

File preparation
----------------
SNP-TSS samples should be in a 6-column BED file with the following tab-delimited columns (see the Training_sets/ directory for examples):
<chr>	<start>	<end>	<SNP_id>	<gene>	<unique_id>
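
A quick format check on the input file can catch problems before annotation. The sketch below is not part of RegVar: the function name parse_snp_tss_line and the example values are made up for illustration, and the check that end is not smaller than start is an assumption based on the usual BED convention rather than a documented RegVar requirement.

```python
def parse_snp_tss_line(line):
    """Split one SNP-TSS line into its 6 tab-delimited fields and check them."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError("expected 6 tab-delimited columns, got %d" % len(fields))
    chrom, start, end, snp_id, gene, unique_id = fields
    # int() raises ValueError itself if a coordinate is not a whole number.
    start, end = int(start), int(end)
    if start < 0 or end < start:
        # Assumption: coordinates follow the usual BED ordering (start <= end).
        raise ValueError("invalid coordinates: start=%d, end=%d" % (start, end))
    return chrom, start, end, snp_id, gene, unique_id


if __name__ == "__main__":
    # Made-up example line, not taken from the RegVar training sets.
    example = "chr1\t1000\t1001\trs0000001\tGENE1\tsample_1\n"
    print(parse_snp_tss_line(example))
```

To check a whole file, call the function on every line of the file; the first malformed line raises a ValueError that describes the problem.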

Annotation
----------
Go into the Python_scripts/ directory and run the RegVar_annotate_variants.py script to annotate the variant samples (the other Python scripts described in this Readme file should also be run from this directory). In the following command line, option -i is followed by the variant file containing the SNP-TSS samples (eqtls.bed); option -o is followed by the output file that will contain the annotated features of the SNP-TSS samples (annotated.txt); option -t is followed by the tissue in which you want to annotate the samples (tissue). Running the script with the single option -h displays the help information.
python RegVar_annotate_variants.py -i eqtls.bed -o annotated.txt -t tissue

Model training
--------------
RegVar models can be trained with the RegVar_training.py script. In the following command line, options -p and -n are followed by the annotated files of positive (positive_annotation.txt) and negative (negative_annotation.txt) samples, respectively; option -m is followed by the file path where you want to save the trained model (model_path); option -t is followed by the tissue for which you want to train a DNN model (tissue). Training a DNN model takes tens of minutes, depending on the number of training samples.
python RegVar_training.py -p positive_annotation.txt -n negative_annotation.txt -m model_path -t tissue

Prediction
----------
Prediction for a specific annotated file can be run with the RegVar_prediction.py script. In the following command line, option -i is followed by the annotated file of your SNP-TSS samples (annotated.txt); option -m is followed by the file path where the trained model was saved (model_path); option -t is followed by the tissue in which you want to predict the samples (tissue).
python RegVar_prediction.py -i annotated.txt -m model_path -t tissue
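
The annotation, training and prediction steps can also be chained from a small driver script. The following is only a sketch and is not part of RegVar: the function names regvar_commands and run_commands are hypothetical, only the options described in this Readme are used, and it assumes that the script is run from within Python_scripts/ and that the positive and negative annotation files were already produced by earlier annotation runs.

```python
import subprocess


def regvar_commands(bed, annotated, positive, negative, model_path, tissue):
    """Build the annotation, training and prediction command lines as argument lists."""
    return [
        ["python", "RegVar_annotate_variants.py", "-i", bed, "-o", annotated, "-t", tissue],
        ["python", "RegVar_training.py", "-p", positive, "-n", negative,
         "-m", model_path, "-t", tissue],
        ["python", "RegVar_prediction.py", "-i", annotated, "-m", model_path, "-t", tissue],
    ]


def run_commands(commands, dry_run=True):
    """Print each command; also execute it unless dry_run is True."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.check_call(command)  # raises if a step exits with an error


if __name__ == "__main__":
    commands = regvar_commands("eqtls.bed", "annotated.txt",
                               "positive_annotation.txt", "negative_annotation.txt",
                               "model_path", "tissue")
    run_commands(commands, dry_run=True)  # set dry_run=False to really run the steps
```

With dry_run=True the script only prints the three command lines, which is a convenient way to check file names and the tissue before starting a training run.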

The scripts that process the HGMD variants are used in the same way as described above, except that no tissue parameter is needed. Their help information is likewise displayed with the single option -h.

Reproducing figures from the manuscript
=======================================
All figures in the RegVar manuscript can be reproduced from the files and R scripts in the RegVar_paper/ directory. Go into the directory of the figure you want and run its R script to generate that figure. Example command lines:
cd Figure2/
Rscript Figure2.R