How to

This how-to tutorial is written according to the normal workflow from demultiplexing, through processing and to the various analyses. Although there are many different options (methods) for each step, the general how-to in GPM is the same. The methods are different in the scripts or files within the methods. Of course you can refer to any step according to your need.

Demultiplex

If you start with BCL raw data, you should start with:

gpm demultiplex --help

Currently, there are the following methods available for demultiplexing:

  • bcl2fastq

  • cellranger_mkfastq

  • cellranger_atac_mkfastq

  • evercode_WT

By defining the method for demultiplexing, the template scripts and files will be generated as well as the instructions.

For example, if you want to run cellranger_mkfastq, you can do:

gpm demultiplex --method cellranger_mkfastq \
--raw /path/to/BCL/folder \
--output /path/where/new/folder/is/created

This command will create a folder with the same name as the BCL folder under the defined output folder, and then add the following files in it:

  • run_cellranger_mkfastq.sh

  • run_merge_lanes.sh

  • samplesheet_cellranger.csv

Note

The reason to keep the output folder with the same name as the BCL folder is that it is easier for tracing back. In addition, one sequencing run might include the reads from several projects and we have to do the demultiplexing together. This is why we don’t want to make demultiplexing step project specific.

These files are everything you need for this task and you need to go through these files and follow the instruction inside to modify it for your need. Then you can run it with:

bash run_cellranger_mkfastq.sh

Note

It is recommended to run the command within a detachable session such as screen or tmux.

GPM just populates the template files for you. You still need to read and understand how cellranger or bcl2fastq work, and how to define the corresponding samplesheets.

Processing

After you have FASTQ files, you can initiate a new project by gpm init. Please check the help message by:

gpm init --help

Before initiating a project, you have to know the followings:

  • Project name in the format of YYMMDD_Name1_Name2_Institute_Application.

  • If available, PATH/to/FASTQ/.

  • If available, PATH to project.ini from demultiplexing which contains the information of the raw data.

  • How you want to process the data (see available methods by gpm processing --help)

If you have everything ready, you can do:

gpm init --from-config /PATH/FASTQ/project.ini \
--fastq /PATH/FASTQ/FASTQ_FOLDER \
--name YYMMDD_Name1_Name2_Institute_Application \
--processing nfcore_RNAseq

This command will:

  • Create a new folder with the name, YYMMDD_Name1_Name2_Institute_Application

  • Duplicate the previous project.ini and add new information

  • Create a subfolder, nfcore_RNAseq and generate the template files and scripts for executing this pipeline

If later you want to add any other processing methods in this project, you can do:

gpm processing --fastq /PATH/FASTQ/FASTQ_FOLDER \
--processing nfcore_miRNAseq \
/PATH/TO/PROJECT/project.ini

Analysis

After processing the data, now you want to perform some customized analyses according to the experimental design or the initial results. GPM also provides a wide range of analyses ready to use. You can check the help messages by:

gpm analysis --help

--report can be generated from our templates according to the application; --add can specify which analysis method you need and generate the template scripts and files. You can view all the available analyses by:

gpm analysis --list project.ini

For example, you have a 3’mRNA-Seq run and want to generate the report and do differential expression analysis, you can do the followings:

gpm analysis --report RNAseq \
--add DGEA_RNAseq \
project.ini

This command will:

  • Create a folder analysis

  • Generate a Analysis_Report_RNAseq.Rmd for rendering a html report

  • Create the folder analysis/DGEA_RNAseq and add the scripts and files needed for this analysis

Then you need to check the files within the analysis folder for learning how to continue the analysis. There might be Rmd or JupyterNotebooks for guiding the analysis.

Export

After the analysis is done and now you want to export the data to the clients. GPM provides the command export for soft-linking everything to the export destination (such as web server) and create the .htaccess and .htpasswd for your project.

Note

The purpose for soft-linking the files is to avoid duplicating any file or folder.

Please check the help message by:

gpm export --help

You should run this command from the root of the project folder where project.ini is.

gpm export --config project.ini \
--symprefix /mnt/nextgen/
/mnt/web/var/www/html/data/YYMMDD_Name1_Name2_Institute_Application

This command will do the followings:

  • Load all the information in project.ini

  • Create the folder in the web server /mnt/web/var/www/html/data/YYMMDD_Name1_Name2_Institute_Application

  • Generate .htaccess in this folder according to your configuration

  • Generate a user and its login credential and write into .htpasswd. This user name will be extracted from the folder name Name1 from YYMMDD_Name1_Name2_Institute_Application. However, you can also specify by --user.

  • Export the folders according to config/export.config by symbolic links.

Note

--symprefix is crucial here because it defines how the source files are referred from the export destination to the source. In this example, /mnt/nextgen/ refers to the mounting point of the computational server on the web server.

In case you want to create an empty export folder, you can do the following on the web server where you export your data:

gpm export --user myclient YYMMDD_Name1_Name2_Institute_Application

This command will still generate .htaccess and .htpasswd, but leaves the folder empty for you.

Eventually, you can tar those exporting folders for the users to download.

gpm tar-export .

This command needs to be executed in the export project folder (web server) and it will:

  • Create compressed_tar folder

  • Iterate through every subfolders except compressed_tar and compress each subfolder including softlinked files/folders

  • The file name of the tar files is Project_name_Subfolder_name.tar

  • md5 file is also generated.

In case you want to re-tar any subfolder, you need to delete that tar file first and redo this step. When you are not sure, you can run the script with --dry-run to see what is going to happen without actually tarring anything.

Clean

GPM also provides clean command to remove the files or folders which you don’t want to archive after the projects closed. The regex patterns of those files/folders are defined in config/gpm.ini section [CLEAN] by the key PATTERNS. Please read the help message by:

gpm clean --help

You can clean multiple projects at the same time:

gpm clean ./2022*

Or you can simulate by --dry-run:

gpm clean -v -d ./2022*

Archive

The last stage in the life cycle of a project is archiving, which means to backup the whole project to the archive destination and delete the source files. Please read the help messages:

gpm archive --help

You can archive multiple projects at the same time:

gpm archive ./2022* /PATH/TO/ARCHIVE/SPACE

If you are not sure how much data will be archived or removed, you can use --dry-run:

gpm archive --dry-run --verbose ./2022* /PATH/TO/ARCHIVE/SPACE