2.1 Core Concepts

2.1.1 Manifest File

The Manifest.txt file is a tab-separated values (TSV) file with two columns: key and filepath. Each row gives the location of the input files for one sample together with a unique key. The key identifies a single sample (a single row) in the sample sheet (discussed below). The manifest file may contain fewer samples than the sample sheet, and can be revised (e.g. after QC) without changing the sample sheet.

An example manifest file: -

key	filepath
tavij	MS/BATCH1_outputs/C36/outs/raw_feature_bc_matrix
gavon	MS/BATCH1_outputs/C48/outs/raw_feature_bc_matrix
sisos	MS/BATCH1_outputs/C54/outs/raw_feature_bc_matrix
vubul	MS/BATCH1_outputs/MS411/outs/raw_feature_bc_matrix
larim	MS/BATCH1_outputs/MS430/outs/raw_feature_bc_matrix
famuv	MS/BATCH1_outputs/MS461/outs/raw_feature_bc_matrix
pobas	MS/BATCH1_outputs/MS513/outs/raw_feature_bc_matrix
dovim	MS/BATCH1_outputs/MS527/outs/raw_feature_bc_matrix
honiz	MS/BATCH1_outputs/MS530/outs/raw_feature_bc_matrix
kurus	MS/BATCH1_outputs/MS535/outs/raw_feature_bc_matrix

In this example, the key column is a single-word proquint (pronounceable-quintuplet) identifier generated using the ids package in R. Any unique identifier is valid.

The filepath column of the manifest file should point to folders containing matrix.mtx.gz, features.tsv.gz, barcodes.tsv.gz for individual samples.
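
The requirements above (a key and filepath header, unique keys, and three matrix files per folder) can be checked before a run. The following is a minimal sketch in Python, not part of scFlow; the function name check_manifest and the base_dir argument are hypothetical names chosen for illustration.

```python
import csv
from pathlib import Path

# File names expected inside each sample folder (taken from the text above).
REQUIRED_FILES = ("matrix.mtx.gz", "features.tsv.gz", "barcodes.tsv.gz")


def check_manifest(manifest_path, base_dir="."):
    """Return a list of human-readable problems found in a manifest file.

    Hypothetical helper: relative filepaths are resolved against base_dir.
    """
    problems = []
    seen = set()
    with open(manifest_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames != ["key", "filepath"]:
            return [f"unexpected header: {reader.fieldnames}"]
        for row in reader:
            key = row["key"]
            folder = Path(base_dir) / row["filepath"]
            # Each key must appear only once in the manifest.
            if key in seen:
                problems.append(f"duplicate key: {key}")
            seen.add(key)
            # Each folder must contain all three matrix files.
            for name in REQUIRED_FILES:
                if not (folder / name).is_file():
                    problems.append(f"{key}: missing {folder / name}")
    return problems
```

An empty list means that no problems were detected by these particular checks; it does not guarantee that the files themselves are valid.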

2.1.2 Sample Sheet File

The SampleSheet.tsv file is a tab-separated values (TSV) file with sample metadata. Some specifications for the sample sheet are as follows: -

The sample sheet must include a manifest column containing the appropriate unique identifier from the manifest file.
Data which is unavailable for a particular sample should be left blank so that it is subsequently assigned NA when loaded into R.

An example sample sheet file: -

individual	group	diagnosis	sex	age	capdate	prepdate	seqdate	manifest
C36	Control	Control	M	68	20180802	20180803	201808	tavij
C48	Control	Control	M	68	20180803	20180803	201808	gavon
C54	Control	Control	M	66	20180806	20180807	201808	sisos
PDC05	Control	Control	M	58	20181002	20181008	201811	hajov
MS527	High	MS	M	47	20180807	20180808	201808	dovim
MS535	High	MS	F	65	20180806	20180807	201808	kurus
MS430	Low	MS	F	61	20180802	20180803	201808	larim
MS461	Low	MS	M	43	20180803	20180803	201808	famuv
MS530	Low	MS	M	42	20180806	20180807	201808	honiz
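
The link between the two files can be checked outside the pipeline. The sketch below, in Python, is an illustration only and not the scFlow implementation: read_tsv and cross_check are hypothetical helper names, and blank cells are mapped to None as a stand-in for the NA that R would assign.

```python
import csv


def read_tsv(path):
    """Read a tab-separated file into a list of dicts; blank cells become None."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    return [{k: (v if v != "" else None) for k, v in row.items()} for row in rows]


def cross_check(manifest_rows, sample_rows):
    """Compare manifest keys against the sample sheet's manifest column."""
    manifest_keys = [row["key"] for row in manifest_rows]
    # Ignore sample sheet rows whose manifest cell is blank.
    sheet_keys = [row["manifest"] for row in sample_rows if row["manifest"]]
    return {
        # Keys with input data but no metadata row.
        "missing_from_sheet": sorted(set(manifest_keys) - set(sheet_keys)),
        # Metadata rows without input data (the manifest may list fewer samples).
        "not_in_manifest": sorted(set(sheet_keys) - set(manifest_keys)),
        # Identifiers that appear on more than one sample sheet row.
        "duplicated_in_sheet": sorted(
            {k for k in sheet_keys if sheet_keys.count(k) > 1}
        ),
    }
```

Entries under not_in_manifest are expected whenever samples have been dropped from the manifest, so only the other two lists would normally need attention.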

2.1.2.1 Sample Sheet Values

As sample sheet values may be used in figures and figure legends generated by the pipeline, relatively brief values in PascalCase are recommended. For example, Low is preferable to low, and MS is preferable to MultipleSclerosis. Spaces are not supported (e.g. Multiple Sclerosis).
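
These recommendations can be screened for mechanically. The Python sketch below is a rough heuristic under stated assumptions, not a rule enforced by scFlow: it treats any whitespace as unsupported and flags a leading lowercase letter as discouraged, and the function name check_value is hypothetical.

```python
import re


def check_value(value):
    """Classify a sample sheet value against the recommendations above."""
    # Whitespace (e.g. "Multiple Sclerosis") is not supported.
    if re.search(r"\s", value):
        return "unsupported: contains whitespace"
    # A leading lowercase letter (e.g. "low") is not PascalCase.
    if value[:1].islower():
        return "discouraged: starts with a lowercase letter"
    # Blank values, numbers, and values such as "Low" or "MS" pass.
    return "ok"
```

Brevity is not checked here because the text gives no specific length limit.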

2.1.2.2 Sample Sheet Variables

For 10X runs, sample sheet variables will often recur across experiments. Common variables include: -

variable	description
`capdate`	date of nuclei extraction, 10X capture, and reverse transcription
`prepdate`	library preparation date
`seqdate`	Illumina sequencing date
`RIN`	RNA integrity number
`PMI`	postmortem interval
`cdnaconc`	cDNA concentration is one of the initial readouts for library quality (full-length)
`libraryconc`	post-library preparation DNA quantification (from PCR bands)
`individual`	an identifier for an individual sample (not necessarily unique, e.g. multiple seqdate)
`group`	fine disease groups (e.g. Control, Low, High)
`diagnosis`	coarse disease groups (e.g. Control, MS)
`age`	(e.g. 67)
`sex`	M, F

For certain downstream applications (e.g. database accessibility) it may be useful to implement ontologies in the sample sheet for consistency across experiments.

2.1.3 Parameters File

The parameter configuration for an analysis is defined in the scflow_analysis.config file in the conf directory. This file defines over 120 tunable analysis parameters. Defaults are recommended for most parameters, while those defining the experimental design should be configured before a run. The pipeline supports parameter tuning with cache-based workflow resume using the Nextflow -resume feature. This feature is highly recommended when optimizing parameters that are highly sensitive to differences in dataset size or quality (e.g. clustering, dimensionality reduction).

Documentation and schema for each parameter are defined in the nextflow_schema.json file. After release, the schema and help information will be available in human-readable format at https://nf-co.re/scflow/{version}/parameters

2.1.4 Ensembl Mappings File

The optional Ensembl gene mappings file specified by the ensembl_mappings parameter in nextflow.config is a .tsv file containing mappings between ensembl_gene_id and external_gene_name. Additional gene-level annotations can also be included. If this file is not provided, the mappings are retrieved from Ensembl using biomaRt, which can be slow (and the service is occasionally offline). Providing this file is highly recommended to accelerate the analysis. The head of an example ensembl_mappings.tsv file: -

ensembl_gene_id	gene_biotype	external_gene_name
ENSG00000000003	protein_coding	TSPAN6
ENSG00000000005	protein_coding	TNMD
ENSG00000000419	protein_coding	DPM1
ENSG00000000457	protein_coding	SCYL3
ENSG00000000460	protein_coding	C1orf112
ENSG00000000938	protein_coding	FGR
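
To illustrate how such a file is used, the following Python sketch loads the two required columns into a lookup table. It is an assumption-laden illustration rather than scFlow code; load_gene_names is a hypothetical name, and any additional annotation columns are simply ignored.

```python
import csv


def load_gene_names(path):
    """Map ensembl_gene_id to external_gene_name from a tab-separated file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        required = {"ensembl_gene_id", "external_gene_name"}
        # Fail early if either required column is absent from the header.
        missing = required - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return {
            row["ensembl_gene_id"]: row["external_gene_name"] for row in reader
        }
```

With the example file above, looking up ENSG00000000003 in the returned dictionary would give TSPAN6.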

2.1.5 Cell-type Data Folder

Automated cell-type annotation functionality in scFlow is currently implemented using the EWCE package by Dr Nathan Skene. The ctd_folder parameter specified in nextflow.config is the location of a folder with reference dataset marker gene files, or cell-type data (CTD), saved in .rds format. Currently, only human/mouse brain cell-type annotations are supported and the ctd_folder should contain both the Allan2019.rds and Zeisel2018.rds files.

2.1.6 Cell-type Mapping File

scFlow will automatically annotate cell-types and generate a celltype_mappings.tsv file indicating the predicted cell-type for each numbered cluster. This file can be manually edited in Excel/Calc, copied to the refs folder, and passed as the celltype_mappings parameter to manually revise the cell-types (e.g. based on preferred nomenclature, marker expression, etc.). This revised file should comprise two columns: clusters, containing an entry for each numbered cluster, and cluster_celltype, specifying the final cell-type annotation, e.g.

clusters	cluster_celltype
1	Oligo
2	Micro
3	IN-SST
4	Micro
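
A revised file can be sanity-checked before it is passed back to the pipeline. The Python sketch below is illustrative only; the helper names are hypothetical, and cluster numbers are handled as strings because they are read from text.

```python
import csv


def load_celltype_mappings(path):
    """Read a celltype_mappings.tsv file into a {cluster: celltype} dict."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["clusters"]: row["cluster_celltype"] for row in reader}


def clusters_by_celltype(mappings):
    """Group cluster numbers by their final cell-type annotation."""
    grouped = {}
    for cluster, celltype in mappings.items():
        grouped.setdefault(celltype, []).append(cluster)
    return grouped


def unmapped_clusters(mappings, cluster_ids):
    """List the clusters that have no entry in the mappings file."""
    return [c for c in map(str, cluster_ids) if c not in mappings]
```

In the example above, clusters 2 and 4 would be grouped together under Micro, which shows that several clusters may share one annotation.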

2.1.7 Genes of Interest File

The optional reddim_genes.yml file allows genes of interest to be specified for expression plotting on a reduced dimension plot; this kind of plot is typically used to highlight cluster marker genes. The genes are specified using a YAML format organized by user-defined categories which become folder names for the generated plots. Here is an example reddim_genes.yml file: -

EN:
  - SYT1
  - RBFOX3
  - CUX2
  - SATB2
  - RORB
  - TLE4
IN:
  - GAD1
  - GAD2
  - PVALB
  - SST
  - VIP
  - SV2C
MICROGLIA:
  - P2RY12
  - CD74
  - CD68
  - FTL
  - APOE
  - SPP1
OTHER:
  - TP53
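
When the gene lists are maintained elsewhere, a file in this layout can be generated rather than written by hand. The Python sketch below assumes plain gene symbols that need no YAML quoting; to_reddim_yaml is a hypothetical helper and not part of scFlow.

```python
def to_reddim_yaml(genes_by_category):
    """Render {category: [genes]} in the layout of reddim_genes.yml.

    Names are written unquoted, which is adequate for plain gene symbols
    such as those in the example above; this is not a general YAML writer.
    """
    lines = []
    for category, genes in genes_by_category.items():
        # The category becomes a top-level key (and a folder name for plots).
        lines.append(f"{category}:")
        # Each gene is a list item indented by two spaces.
        lines.extend(f"  - {gene}" for gene in genes)
    return "\n".join(lines) + "\n"
```

The returned string can be written to reddim_genes.yml; categories appear in the order in which they were added to the dictionary.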

2.1.8 Working Directory

The workDir parameter defined in nextflow.config specifies the working directory where temporary outputs and cached data are stored. Further details are available in the Nextflow documentation. It is important not to use the same workDir for different analyses.