2.1 Core Concepts

2.1.1 Manifest File

The Manifest.txt file is a tab-separated values (TSV) file with two columns: key and filepath. Each row gives the location of the input files for one sample together with a unique key. This key identifies exactly one sample (a single row) in the sample sheet (discussed below). The manifest file may contain fewer samples than the sample sheet, so it can be revised (e.g. after QC) without changing the sample sheet.

An example manifest file: -

key filepath
tavij MS/BATCH1_outputs/C36/outs/raw_feature_bc_matrix
gavon MS/BATCH1_outputs/C48/outs/raw_feature_bc_matrix
sisos MS/BATCH1_outputs/C54/outs/raw_feature_bc_matrix
vubul MS/BATCH1_outputs/MS411/outs/raw_feature_bc_matrix
larim MS/BATCH1_outputs/MS430/outs/raw_feature_bc_matrix
famuv MS/BATCH1_outputs/MS461/outs/raw_feature_bc_matrix
pobas MS/BATCH1_outputs/MS513/outs/raw_feature_bc_matrix
dovim MS/BATCH1_outputs/MS527/outs/raw_feature_bc_matrix
honiz MS/BATCH1_outputs/MS530/outs/raw_feature_bc_matrix
kurus MS/BATCH1_outputs/MS535/outs/raw_feature_bc_matrix

In this example, the key column is a single-word proquint (pronounceable quintuplet) identifier generated using the ids package in R. Any unique identifier is valid.
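
For readers who prepare manifests outside R, the following is a minimal Python sketch of how five-letter keys of this shape could be generated. It follows the published proquint alphabet (16 consonants, 4 vowels, 16 bits per word) and is not the implementation used by the ids package; the function names proquint and unique_keys are hypothetical.

```python
import secrets

CONSONANTS = "bdfghjklmnprstvz"  # 16 consonants, 4 bits each
VOWELS = "aiou"                  # 4 vowels, 2 bits each


def proquint(value):
    """Encode a 16-bit integer as a consonant-vowel-consonant-vowel-consonant word."""
    if not 0 <= value < (1 << 16):
        raise ValueError("value must fit in 16 bits")
    return (
        CONSONANTS[(value >> 12) & 0xF]
        + VOWELS[(value >> 10) & 0x3]
        + CONSONANTS[(value >> 6) & 0xF]
        + VOWELS[(value >> 4) & 0x3]
        + CONSONANTS[value & 0xF]
    )


def unique_keys(n):
    """Draw n distinct random proquint words (hypothetical helper)."""
    if n > (1 << 16):
        raise ValueError("only 65536 distinct single-word proquints exist")
    keys = set()
    while len(keys) < n:
        keys.add(proquint(secrets.randbelow(1 << 16)))
    return sorted(keys)


print(proquint(0x7F00))  # → lusab
print(unique_keys(3))    # three random, distinct five-letter keys
```

Because the keys are drawn at random, they should be generated once and stored in the manifest and sample sheet rather than regenerated for each run.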

Each entry in the filepath column of the manifest file should point to a folder containing the matrix.mtx.gz, features.tsv.gz and barcodes.tsv.gz files for an individual sample.
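
The requirements above can be checked before a run. The following Python sketch is one way to do so, assuming the manifest has exactly the header key and filepath; the function names read_manifest and missing_inputs are hypothetical and not part of scFlow.

```python
import csv
import io
from pathlib import Path

# The three files each sample folder is expected to contain.
REQUIRED_FILES = ("matrix.mtx.gz", "features.tsv.gz", "barcodes.tsv.gz")


def read_manifest(handle):
    """Read a two-column, tab-separated manifest into a {key: filepath} dict."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != ["key", "filepath"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    manifest = {}
    for row in reader:
        if row["key"] in manifest:
            raise ValueError(f"duplicate key: {row['key']}")
        manifest[row["key"]] = row["filepath"]
    return manifest


def missing_inputs(manifest, root="."):
    """Return {key: [missing file names]} for folders lacking a required file."""
    problems = {}
    for key, folder in manifest.items():
        missing = [
            name for name in REQUIRED_FILES
            if not (Path(root) / folder / name).is_file()
        ]
        if missing:
            problems[key] = missing
    return problems


example = "key\tfilepath\ntavij\tMS/BATCH1_outputs/C36/outs/raw_feature_bc_matrix\n"
manifest = read_manifest(io.StringIO(example))
print(manifest)
print(missing_inputs(manifest))  # lists any files not found under the current directory
```

In practice the manifest would be opened with open("Manifest.txt", newline="") rather than read from an in-memory string.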

2.1.2 Sample Sheet File

The SampleSheet.tsv file is a tab-separated values (TSV) file with sample metadata. Some specifications for the sample sheet are as follows: -

  • The sample sheet must include a manifest column containing the appropriate unique identifier from the manifest file.
  • Values that are unavailable for a particular sample should be left blank so that they are read as NA when loaded into R.

An example sample sheet file: -

individual group diagnosis sex age capdate prepdate seqdate manifest
C36 Control Control M 68 20180802 20180803 201808 tavij
C48 Control Control M 68 20180803 20180803 201808 gavon
C54 Control Control M 66 20180806 20180807 201808 sisos
PDC05 Control Control M 58 20181002 20181008 201811 hajov
MS527 High MS M 47 20180807 20180808 201808 dovim
MS535 High MS F 65 20180806 20180807 201808 kurus
MS430 Low MS F 61 20180802 20180803 201808 larim
MS461 Low MS M 43 20180803 20180803 201808 famuv
MS530 Low MS M 42 20180806 20180807 201808 honiz
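
The relationship between the two files can be verified with a short script. The Python sketch below reads a sample sheet, treats blank cells as missing (the counterpart of NA in R), and reports manifest keys that do not match exactly one sample sheet row. The function names are hypothetical, and the check reflects only the rules stated in this section.

```python
import csv
import io


def read_sample_sheet(handle):
    """Read a tab-separated sample sheet; blank cells become None."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({col: (val if val != "" else None) for col, val in row.items()})
    return rows


def unmatched_keys(manifest_keys, sample_rows):
    """Return manifest keys that do not match exactly one sample sheet row."""
    counts = {}
    for row in sample_rows:
        counts[row["manifest"]] = counts.get(row["manifest"], 0) + 1
    return sorted(key for key in manifest_keys if counts.get(key, 0) != 1)


sheet = (
    "individual\tgroup\tage\tmanifest\n"
    "C36\tControl\t68\ttavij\n"
    "C48\tControl\t\tgavon\n"  # age left blank, so it is read as missing
)
rows = read_sample_sheet(io.StringIO(sheet))
print(rows[1]["age"])                                    # → None
print(unmatched_keys(["tavij", "gavon", "sisos"], rows))  # → ['sisos']
```

Sample sheet rows without a manifest entry are not reported, because the manifest is allowed to contain fewer samples than the sample sheet.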

2.1.2.1 Sample Sheet Values

As sample sheet values may be used in figures and figure legends generated by the pipeline, relatively brief values in PascalCase are recommended. For example, Low is preferable to low, and MS is preferable to MultipleSclerosis. Spaces are not supported (e.g. Multiple Sclerosis).
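
As an illustration, these recommendations could be applied as a simple check on each value. This is a sketch under the assumption that whitespace is an error and a lowercase first letter only a style warning; check_value is a hypothetical helper, not an scFlow function.

```python
def check_value(value):
    """Return a list of issues for one sample sheet value (empty if none)."""
    issues = []
    if any(ch.isspace() for ch in value):
        issues.append("contains whitespace (not supported)")
    if value[:1].islower():
        issues.append("starts with a lowercase letter (PascalCase recommended)")
    return issues


for value in ("Low", "low", "MS", "Multiple Sclerosis", "68"):
    print(repr(value), check_value(value))
```

Numeric values such as ages and dates pass unchanged, since the checks only concern whitespace and the case of the first character.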

2.1.2.2 Sample Sheet Variables

For 10X runs, sample sheet variables will often recur across experiments. Common variables include: -

variable      description
capdate       date of nuclei extraction, 10X capture and reverse transcription
prepdate      library preparation date
seqdate       Illumina sequencing date
RIN           RNA integrity number
PMI           postmortem interval
cdnaconc      cDNA concentration, one of the initial readouts for library quality (full-length)
libraryconc   post-library preparation DNA quantification (from PCR bands)
individual    an identifier for an individual sample (not necessarily unique, e.g. multiple seqdate)
group         fine disease groups (e.g. Control, Low, High)
diagnosis     coarse disease groups (e.g. Control, MS)
age           age of the individual (e.g. 67)
sex           M, F

For certain downstream applications (e.g. database accessibility) it may be useful to implement ontologies in the sample sheet for consistency across experiments.

2.1.3 Parameters File

The parameter configuration for an analysis is defined in the scflow_analysis.config file in the conf directory. This file defines over 120 tunable analysis parameters. Defaults are recommended for most parameters, while those defining the experimental design should be configured before a run. The pipeline supports parameter tuning with cache-based workflow resume through the Nextflow -resume option. This is highly recommended when optimizing parameters that are sensitive to differences in dataset size or quality (e.g. clustering, dimensionality reduction).

Documentation and schema for each parameter are defined in the nextflow_schema.json file. After release, the schema and help information will be available in human-readable format at https://nf-co.re/scflow/{version}/parameters

2.1.4 Ensembl Mappings File

The optional Ensembl gene mappings file, specified by the ensembl_mappings parameter in nextflow.config, is a .tsv file containing mappings between ensembl_gene_id and external_gene_name. Additional gene-level annotations can also be included. If this file is not provided, the mappings are retrieved using biomaRt, which can be slow (and the service is occasionally offline). Providing this file is highly recommended to accelerate the analysis. The head of an example ensembl_mappings.tsv file: -

ensembl_gene_id gene_biotype external_gene_name
ENSG00000000003 protein_coding TSPAN6
ENSG00000000005 protein_coding TNMD
ENSG00000000419 protein_coding DPM1
ENSG00000000457 protein_coding SCYL3
ENSG00000000460 protein_coding C1orf112
ENSG00000000938 protein_coding FGR
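
To illustrate how the two required columns are used, the Python sketch below loads such a file into a lookup from Ensembl gene ID to gene name and ignores any additional annotation columns. The function name load_gene_names is hypothetical; the column names are those given above.

```python
import csv
import io


def load_gene_names(handle):
    """Map ensembl_gene_id to external_gene_name from a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    required = {"ensembl_gene_id", "external_gene_name"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return {row["ensembl_gene_id"]: row["external_gene_name"] for row in reader}


head = (
    "ensembl_gene_id\tgene_biotype\texternal_gene_name\n"
    "ENSG00000000003\tprotein_coding\tTSPAN6\n"
    "ENSG00000000005\tprotein_coding\tTNMD\n"
)
names = load_gene_names(io.StringIO(head))
print(names["ENSG00000000003"])  # → TSPAN6
```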

2.1.5 Cell-type Data Folder

Automated cell-type annotation functionality in scFlow is currently implemented using the EWCE package by Dr Nathan Skene. The ctd_folder parameter specified in nextflow.config is the location of a folder with reference dataset marker gene files, or cell-type data (CTD), saved in .rds format. Currently, only human/mouse brain cell-type annotations are supported and the ctd_folder should contain both the Allan2019.rds and Zeisel2018.rds files.

2.1.6 Cell-type Mapping File

scFlow will automatically annotate cell-types and generate a celltype_mappings.tsv file indicating the predicted cell-type for each numbered cluster. This file can be manually edited in Excel/Calc, copied to the refs folder, and passed as the celltype_mappings parameter to manually revise the cell-types (e.g. based on preferred nomenclature, marker expression, etc.). The revised file should comprise two columns: clusters, containing an entry for each numbered cluster, and cluster_celltype, specifying the final cell-type annotation, e.g.

clusters cluster_celltype
1 Oligo
2 Micro
3 IN-SST
4 Micro
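
After manual editing it is easy to drop or duplicate a cluster. The Python sketch below checks a revised file for those two problems, assuming the file is tab-separated and the cluster labels are integers; the function names and the list of observed clusters are hypothetical.

```python
import csv
import io


def read_celltype_mappings(handle):
    """Read clusters -> cluster_celltype, rejecting duplicated clusters."""
    mapping = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        cluster = row["clusters"]
        if cluster in mapping:
            raise ValueError(f"cluster {cluster} appears more than once")
        mapping[cluster] = row["cluster_celltype"]
    return mapping


def unmapped_clusters(mapping, observed_clusters):
    """Return observed cluster labels that have no entry in the mapping."""
    return sorted(set(observed_clusters) - set(mapping), key=int)


revised = "clusters\tcluster_celltype\n1\tOligo\n2\tMicro\n3\tIN-SST\n4\tMicro\n"
mapping = read_celltype_mappings(io.StringIO(revised))
print(mapping["3"])                                            # → IN-SST
print(unmapped_clusters(mapping, ["1", "2", "3", "4", "5"]))  # → ['5']
```

Several clusters may share one cell-type (clusters 2 and 4 above), so only the clusters column is checked for duplicates.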

2.1.7 Genes of Interest File

The optional reddim_genes.yml file allows genes of interest to be specified for expression plotting on a reduced dimension plot; this kind of plot is typically used to highlight cluster marker genes. The genes are specified using a YAML format organized by user-defined categories which become folder names for the generated plots. Here is an example reddim_genes.yml file: -

EN:
  - SYT1
  - RBFOX3
  - CUX2
  - SATB2
  - RORB
  - TLE4
IN:
  - GAD1
  - GAD2
  - PVALB
  - SST
  - VIP
  - SV2C
MICROGLIA:
  - P2RY12
  - CD74
  - CD68
  - FTL
  - APOE
  - SPP1
OTHER:
  - TP53
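
When the gene lists are maintained in a script, a file in this layout can be written without a YAML library. The following is a minimal Python sketch, assuming category and gene names are plain alphanumeric strings that need no YAML quoting; to_reddim_yaml is a hypothetical helper.

```python
def to_reddim_yaml(genes_by_category):
    """Render {category: [genes]} in the block-list layout shown above."""
    lines = []
    for category, genes in genes_by_category.items():
        if not genes:
            raise ValueError(f"category {category} has no genes")
        lines.append(f"{category}:")
        for gene in genes:
            lines.append(f"  - {gene}")
    return "\n".join(lines) + "\n"


text = to_reddim_yaml({"MICROGLIA": ["P2RY12", "CD74"], "OTHER": ["TP53"]})
print(text, end="")
# To save the file: open("reddim_genes.yml", "w").write(text)
```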

2.1.8 Working Directory

The workDir parameter defined in nextflow.config specifies the working directory where temporary outputs and cached data are stored. Further details are available in the Nextflow documentation. It is important not to use the same workDir for different analyses.