How to
This how-to follows the normal workflow, from demultiplexing through processing to the various analyses. Although each step offers several methods, the general procedure in GPM is the same; the methods differ in the scripts and files they provide. You can of course jump to whichever step you need.
Demultiplex
If you start with BCL raw data, you should start with:
gpm demultiplex --help
Currently, there are the following methods available for demultiplexing:
bcl2fastq
cellranger_mkfastq
cellranger_atac_mkfastq
evercode_WT
Once you choose a demultiplexing method, GPM generates the corresponding template scripts and files together with instructions.
For example, if you want to run cellranger_mkfastq, you can do:
gpm demultiplex --method cellranger_mkfastq \
--raw /path/to/BCL/folder \
--output /path/where/new/folder/is/created
This command creates a folder with the same name as the BCL folder under the given output folder and adds the following files to it:
run_cellranger_mkfastq.sh
run_merge_lanes.sh
samplesheet_cellranger.csv
Note
The output folder keeps the same name as the BCL folder because this makes it easier to trace back. In addition, one sequencing run might include reads from several projects, which have to be demultiplexed together. This is why the demultiplexing step is not project-specific.
These files are everything you need for this task. Go through them and follow the instructions inside to modify them for your needs. Then you can run the script with:
bash run_cellranger_mkfastq.sh
Note
It is recommended to run the command within a detachable session such as screen or tmux.
GPM just populates the template files for you. You still need to read and understand how cellranger or bcl2fastq work, and how to define the corresponding samplesheets.
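As a sketch of the recommendation above to use a detachable session, the following wraps the generated script in a detached tmux session. The helper name run_detached and the session name demux are made up for illustration and are not part of GPM.

```shell
# Hypothetical helper (not part of GPM): run a script inside a detached
# tmux session so that it keeps running after you log out.
# Arguments: <session name> <script>
run_detached() {
    session_name=$1
    script=$2
    # -d starts the session detached, -s names it; the quoted string is the
    # command executed inside the session.
    tmux new-session -d -s "$session_name" "bash $script"
}

# Usage (the session name "demux" is only an example):
#   run_detached demux run_cellranger_mkfastq.sh
#   tmux attach -t demux    # reattach later to check the progress
```

The session ends when the script finishes, so redirect the output to a log file inside the script if you want to keep it.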
Processing
After you have FASTQ files, you can initiate a new project with gpm init. Please check the help message with:
gpm init --help
Before initiating a project, you need to know the following:
Project name in the format of YYMMDD_Name1_Name2_Institute_Application.
If available, PATH/to/FASTQ/.
If available, the path to project.ini from demultiplexing, which contains the information about the raw data.
How you want to process the data (see the available methods with gpm processing --help).
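Since several later steps rely on the project name format listed above, it can help to check a name before running gpm init. The following is a minimal sketch, assuming that the date is six digits and that none of the four remaining fields contains an underscore; the helper name and the example project name are made up, and GPM itself may apply different rules.

```shell
# Hypothetical helper (not part of GPM): check that a project name looks like
# YYMMDD_Name1_Name2_Institute_Application.
# Assumption: six digits followed by four non-empty fields without "_".
is_valid_project_name() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{6}_[^_]+_[^_]+_[^_]+_[^_]+$'
}

# Example with a made-up project name:
if is_valid_project_name "240131_Doe_Roe_UKA_RNAseq"; then
    echo "project name looks fine"
fi
```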
If you have everything ready, you can do:
gpm init --from-config /PATH/FASTQ/project.ini \
--fastq /PATH/FASTQ/FASTQ_FOLDER \
--name YYMMDD_Name1_Name2_Institute_Application \
--processing nfcore_RNAseq
This command will:
Create a new folder named YYMMDD_Name1_Name2_Institute_Application
Duplicate the previous project.ini and add new information
Create a subfolder, nfcore_RNAseq, and generate the template files and scripts for executing this pipeline
If you later want to add another processing method to this project, you can do:
gpm processing --fastq /PATH/FASTQ/FASTQ_FOLDER \
--processing nfcore_miRNAseq \
/PATH/TO/PROJECT/project.ini
Analysis
After processing the data, you will probably want to perform customized analyses according to the experimental design or the initial results. GPM provides a wide range of ready-to-use analyses. You can check the help message with:
gpm analysis --help
--report generates a report from our templates according to the application; --add specifies which analysis method you need and generates the template scripts and files. You can view all the available analyses with:
gpm analysis --list project.ini
For example, if you have a 3' mRNA-Seq run and want to generate the report and do differential expression analysis, you can do the following:
gpm analysis --report RNAseq \
--add DGEA_RNAseq \
project.ini
This command will:
Create a folder analysis
Generate an Analysis_Report_RNAseq.Rmd for rendering an HTML report
Create the folder analysis/DGEA_RNAseq and add the scripts and files needed for this analysis
Then check the files within the analysis folder to learn how to continue the analysis. There might be Rmd files or Jupyter notebooks to guide you.
Export
After the analysis is done, you will want to export the data to the clients. GPM provides the export command, which soft-links everything to the export destination (such as a web server) and creates the .htaccess and .htpasswd files for your project.
Note
The purpose of soft-linking the files is to avoid duplicating any file or folder.
Please check the help message by:
gpm export --help
You should run this command from the root of the project folder where project.ini is.
gpm export --config project.ini \
--symprefix /mnt/nextgen/ \
/mnt/web/var/www/html/data/YYMMDD_Name1_Name2_Institute_Application
This command will do the following:
Load all the information in project.ini
Create the folder /mnt/web/var/www/html/data/YYMMDD_Name1_Name2_Institute_Application on the web server
Generate .htaccess in this folder according to your configuration
Generate a user and its login credentials and write them into .htpasswd. The user name is extracted from the folder name (Name1 from YYMMDD_Name1_Name2_Institute_Application). However, you can also specify it with --user.
Export the folders according to config/export.config by symbolic links.
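The default user name described above is simply the second underscore-separated field of the export folder name. As a small illustration (the helper name and the example project name are made up; this is not GPM's own code):

```shell
# Hypothetical helper (not part of GPM): derive the default user name (Name1)
# from an export folder named YYMMDD_Name1_Name2_Institute_Application.
user_from_project() {
    # Strip any leading directories, then take the second "_"-separated field.
    basename "$1" | cut -d_ -f2
}

# Example with a made-up project name:
user_from_project /mnt/web/var/www/html/data/240131_Doe_Roe_UKA_RNAseq
```

If this default is not what you want, pass the user name explicitly with --user.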
Note
--symprefix is crucial here because it defines how the symbolic links in the export destination refer back to the source files. In this example, /mnt/nextgen/ is the point where the computational server is mounted on the web server.
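To make the role of --symprefix more concrete, here is a rough sketch of how a link target could be built. It assumes that the prefix is simply prepended to the absolute source path on the computational server; this is an assumption for illustration, and the helper name and the source path are made up, so check the resulting links from your own export.

```shell
# Hypothetical helper (not part of GPM): build the path under which a source
# file on the computational server is visible from the web server.
# Assumption: the prefix is prepended to the absolute source path.
export_link_target() {
    symprefix=$1
    source_path=$2
    # Drop a trailing "/" from the prefix and a leading "/" from the path,
    # so that exactly one "/" joins them.
    printf '%s/%s\n' "${symprefix%/}" "${source_path#/}"
}

# Example with a made-up source path:
export_link_target /mnt/nextgen/ /data/projects/240131_Doe_Roe_UKA_RNAseq/analysis
```

A symbolic link created on the web server with such a target resolves only if the mount point is correct, which is why a wrong prefix leads to broken links.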
In case you want to create an empty export folder, you can do the following on the web server where you export your data:
gpm export --user myclient YYMMDD_Name1_Name2_Institute_Application
This command will still generate .htaccess and .htpasswd, but leaves the folder empty for you.
Finally, you can tar the exported folders for the users to download:
gpm tar-export .
This command needs to be executed in the export project folder (web server) and it will:
Create the compressed_tar folder
Iterate through every subfolder except compressed_tar and compress each subfolder, including soft-linked files/folders
Name each tar file Project_name_Subfolder_name.tar
Generate an md5 file as well
In case you want to re-tar any subfolder, you need to delete that tar file first and redo this step. When you are not sure, you can run the command with --dry-run to see what is going to happen without actually tarring anything.
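For readers who want to see what the steps above amount to, here is a rough shell equivalent. It is a sketch under assumptions, not the code behind gpm tar-export: the function name is made up, existing tar files are assumed to be skipped (inferred from the re-tar remark above), and the md5 file is left out because its name and layout are not described here.

```shell
# Hypothetical sketch (not GPM's implementation): tar every subfolder of an
# export folder into compressed_tar/, following symbolic links.
tar_export_sketch() {
    export_dir=$(cd "$1" && pwd) || return 1
    project=$(basename "$export_dir")
    out_dir="$export_dir/compressed_tar"
    mkdir -p "$out_dir"
    for sub in "$export_dir"/*/; do
        [ -d "$sub" ] || continue
        name=$(basename "$sub")
        if [ "$name" = "compressed_tar" ]; then
            continue
        fi
        tarfile="$out_dir/${project}_${name}.tar"
        if [ -e "$tarfile" ]; then
            # Assumption: an existing tar file is kept until it is deleted.
            echo "skip existing $tarfile"
            continue
        fi
        # -h stores the files that the symbolic links point to,
        # instead of the links themselves.
        tar -chf "$tarfile" -C "$export_dir" "$name"
    done
}

# Usage, from inside the export project folder:
#   tar_export_sketch .
```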
Clean
GPM also provides the clean command to remove files or folders that you don’t want to archive after a project is closed. The regex patterns for those files/folders are defined in the [CLEAN] section of config/gpm.ini under the key PATTERNS. Please read the help message with:
gpm clean --help
You can clean multiple projects at the same time:
gpm clean ./2022*
Or you can simulate the cleaning with a dry run:
gpm clean -v -d ./2022*
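If you want to try out a regex pattern before relying on it, you can list what it would match without deleting anything. The sketch below is independent of GPM: the helper name is made up, the pattern in the usage line is only an example and is not taken from config/gpm.ini, and GPM may match patterns differently (for instance against file names rather than full paths).

```shell
# Hypothetical helper (not part of GPM): list the paths under a project folder
# that match an extended regex, without removing anything.
list_clean_candidates() {
    project_dir=$1
    pattern=$2
    # The pattern is matched against the full path printed by find.
    find "$project_dir" | grep -E -- "$pattern" | sed 's/^/would remove: /'
}

# Usage (example pattern, matches any path ending in /work):
#   list_clean_candidates ./240131_Doe_Roe_UKA_RNAseq '/work$'
```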
Archive
The last stage in the life cycle of a project is archiving, which means backing up the whole project to the archive destination and deleting the source files. Please read the help message:
gpm archive --help
You can archive multiple projects at the same time:
gpm archive ./2022* /PATH/TO/ARCHIVE/SPACE
If you are not sure how much data will be archived or removed, you can use --dry-run:
gpm archive --dry-run --verbose ./2022* /PATH/TO/ARCHIVE/SPACE
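Independently of the dry run, you can estimate the amount of data yourself with standard tools. This is a small sketch that does not use GPM; the helper name is made up.

```shell
# Hypothetical helper (not part of GPM): total size in KiB of the given folders.
total_size_kb() {
    # -s: one summary line per argument, -c: add a grand total line,
    # -k: report sizes in 1 KiB units. The last line holds the total.
    du -sck "$@" | tail -n 1 | cut -f1
}

# Usage:
#   total_size_kb ./2022*
#   df -Pk /PATH/TO/ARCHIVE/SPACE    # free space at the archive destination
```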