2.1 Core Concepts
2.1.1 Manifest File
The Manifest.txt file is a tab-separated values file with two columns: key and filepath. It specifies the location of the input files for each sample in an analysis, together with a unique key. Each key identifies exactly one sample (a single row) in the sample sheet (discussed below). The manifest file may contain fewer samples than the sample sheet, so it can be revised (e.g. after QC) without changing the sample sheet.
An example manifest file:
key | filepath |
---|---|
tavij | MS/BATCH1_outputs/C36/outs/raw_feature_bc_matrix |
gavon | MS/BATCH1_outputs/C48/outs/raw_feature_bc_matrix |
sisos | MS/BATCH1_outputs/C54/outs/raw_feature_bc_matrix |
vubul | MS/BATCH1_outputs/MS411/outs/raw_feature_bc_matrix |
larim | MS/BATCH1_outputs/MS430/outs/raw_feature_bc_matrix |
famuv | MS/BATCH1_outputs/MS461/outs/raw_feature_bc_matrix |
pobas | MS/BATCH1_outputs/MS513/outs/raw_feature_bc_matrix |
dovim | MS/BATCH1_outputs/MS527/outs/raw_feature_bc_matrix |
honiz | MS/BATCH1_outputs/MS530/outs/raw_feature_bc_matrix |
kurus | MS/BATCH1_outputs/MS535/outs/raw_feature_bc_matrix |
In this example, the key column is a single-word proquint (pronounceable quintuplet) identifier generated using the ids package in R. Any unique identifier is valid.
The filepath column of the manifest file should point to folders containing the matrix.mtx.gz, features.tsv.gz, and barcodes.tsv.gz files for individual samples.
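The manifest requirements above can be checked before a run. The following is a minimal sketch in Python, not part of scFlow: the function names read_manifest and missing_inputs are hypothetical, and it assumes the manifest has a header row with the key and filepath columns described above.

```python
import csv
from pathlib import Path

# File names expected inside every sample folder, as listed in the text above.
REQUIRED_FILES = ("matrix.mtx.gz", "features.tsv.gz", "barcodes.tsv.gz")


def read_manifest(path):
    """Return {key: filepath} from a tab-separated manifest with a header row.

    Hypothetical helper for illustration; raises if a key is repeated.
    """
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    keys = [row["key"] for row in rows]
    duplicates = sorted({k for k in keys if keys.count(k) > 1})
    if duplicates:
        raise ValueError(f"duplicate manifest keys: {duplicates}")
    return {row["key"]: row["filepath"] for row in rows}


def missing_inputs(manifest, base_dir="."):
    """Map each key to the required files that are absent from its folder."""
    problems = {}
    for key, filepath in manifest.items():
        folder = Path(base_dir) / filepath
        missing = [name for name in REQUIRED_FILES if not (folder / name).is_file()]
        if missing:
            problems[key] = missing
    return problems
```

Calling missing_inputs(read_manifest("Manifest.txt")) from the directory that the filepath entries are relative to returns an empty dictionary when every sample folder is complete.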
2.1.2 Sample Sheet File
The SampleSheet.tsv file is a tab-separated values file containing sample metadata. The sample sheet has the following requirements:
- The sample sheet must include a manifest column containing the appropriate unique identifier from the manifest file.
- Data that are unavailable for a particular sample should be left blank so that they are assigned NA when loaded into R.
An example sample sheet file:
individual | group | diagnosis | sex | age | capdate | prepdate | seqdate | manifest |
---|---|---|---|---|---|---|---|---|
C36 | Control | Control | M | 68 | 20180802 | 20180803 | 201808 | tavij |
C48 | Control | Control | M | 68 | 20180803 | 20180803 | 201808 | gavon |
C54 | Control | Control | M | 66 | 20180806 | 20180807 | 201808 | sisos |
PDC05 | Control | Control | M | 58 | 20181002 | 20181008 | 201811 | hajov |
MS527 | High | MS | M | 47 | 20180807 | 20180808 | 201808 | dovim |
MS535 | High | MS | F | 65 | 20180806 | 20180807 | 201808 | kurus |
MS430 | Low | MS | F | 61 | 20180802 | 20180803 | 201808 | larim |
MS461 | Low | MS | M | 43 | 20180803 | 20180803 | 201808 | famuv |
MS530 | Low | MS | M | 42 | 20180806 | 20180807 | 201808 | honiz |
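Because the manifest and the sample sheet are linked only through the key, it can help to compare the two before a run. The sketch below is an illustration in Python rather than scFlow code: the helper names read_tsv and check_keys are hypothetical, and the two small inputs are abbreviated versions of the example files.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def check_keys(manifest_rows, sample_rows):
    """Compare manifest keys with the sample sheet's manifest column."""
    manifest_keys = [row["key"] for row in manifest_rows]
    sheet_keys = [row["manifest"] for row in sample_rows]
    return {
        # Keys with input files but no metadata row.
        "missing_from_sheet": sorted(set(manifest_keys) - set(sheet_keys)),
        # Rows without input files: allowed, the manifest may hold fewer samples.
        "not_in_manifest": sorted(set(sheet_keys) - set(manifest_keys)),
        # A key should identify a single sample sheet row.
        "duplicated_in_sheet": sorted(
            {k for k in sheet_keys if sheet_keys.count(k) > 1}
        ),
    }


# Abbreviated, illustrative inputs (columns trimmed for brevity).
manifest = read_tsv(io.StringIO(
    "key\tfilepath\n"
    "tavij\tMS/BATCH1_outputs/C36/outs/raw_feature_bc_matrix\n"
    "vubul\tMS/BATCH1_outputs/MS411/outs/raw_feature_bc_matrix\n"
))
sheet = read_tsv(io.StringIO(
    "individual\tgroup\tmanifest\n"
    "C36\tControl\ttavij\n"
    "PDC05\tControl\thajov\n"
))
report = check_keys(manifest, sheet)
print(report)
```

Entries under missing_from_sheet are the ones most likely to need attention, since those samples would have input files but no metadata.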
2.1.2.1 Sample Sheet Values
Because sample sheet values may be used in figures and figure legends generated by the pipeline, relatively brief values in PascalCase are recommended. For example, Low is preferable to low, and MS is preferable to MultipleSclerosis. Spaces are not supported (e.g. Multiple Sclerosis).
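These conventions can be screened for automatically. The sketch below is one possible reading of the recommendation, written in Python: the function flag_value is hypothetical, it treats whitespace as an error and a lowercase first letter as a style warning, and it does not attempt to judge brevity.

```python
def flag_value(value):
    """Return a list of issues for one sample sheet value (empty if none).

    Blank values are accepted, since missing data should be left blank.
    """
    issues = []
    if any(ch.isspace() for ch in value):
        issues.append("contains whitespace")
    first = value[:1]
    if first.isalpha() and first.islower():
        issues.append("starts with a lowercase letter")
    return issues


for candidate in ("Low", "low", "MS", "Multiple Sclerosis", ""):
    print(repr(candidate), flag_value(candidate))
```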
2.1.2.2 Sample Sheet Variables
For 10X runs, sample sheet variables will often recur across experiments. Common variables include:
variable | description |
---|---|
capdate | nuclei extraction, 10x capture, reverse transcription date |
prepdate | library preparation date |
seqdate | Illumina sequencing date |
RIN | RNA integrity number |
PMI | postmortem interval |
cdnaconc | cDNA concentration, one of the initial readouts for library quality (full-length) |
libraryconc | post-library preparation DNA quantification (from PCR bands) |
individual | an identifier for an individual sample (not necessarily unique, e.g. multiple seqdate) |
group | fine disease groups (e.g. Control, Low, High) |
diagnosis | coarse disease groups (e.g. Control, MS) |
age | |
sex | M, F |
For certain downstream applications (e.g. database accessibility) it may be useful to implement ontologies in the sample sheet for consistency across experiments.
2.1.3 Parameters File
The parameter configuration for an analysis is defined in the scflow_analysis.config file in the conf directory. This file defines over 120 tunable analysis parameters. Defaults are recommended for most parameters, while those defining the experimental design should be configured before a run. The pipeline supports parameter tuning with cache-based workflow resume using the Nextflow -resume feature. This feature is highly recommended when optimizing parameters that are sensitive to differences in dataset size or quality (e.g. clustering, dimensionality reduction).
The documentation and schema for each parameter are defined in the nextflow_schema.json file. After release, the schema and help information will be available in human-readable format at https://nf-co.re/scflow/{version}/parameters
2.1.4 Ensembl Mappings File
The optional Ensembl gene mappings file, specified by the ensembl_mappings parameter in nextflow.config, is a .tsv file containing mappings between ensembl_gene_id and external_gene_name. Additional gene-level annotations can also be included. If this file is not provided, the mappings are retrieved from the biomaRt database, which can be slow (and is occasionally offline). Providing this file is highly recommended to accelerate the analysis. The head of an example ensembl_mappings.tsv file:
ensembl_gene_id | gene_biotype | external_gene_name |
---|---|---|
ENSG00000000003 | protein_coding | TSPAN6 |
ENSG00000000005 | protein_coding | TNMD |
ENSG00000000419 | protein_coding | DPM1 |
ENSG00000000457 | protein_coding | SCYL3 |
ENSG00000000460 | protein_coding | C1orf112 |
ENSG00000000938 | protein_coding | FGR |
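To illustrate how such a file is used, the sketch below loads the two required columns into a lookup table in Python. It is not how scFlow reads the file internally; the function load_gene_names is hypothetical, and extra annotation columns such as gene_biotype are simply ignored.

```python
import csv
import io


def load_gene_names(handle):
    """Map ensembl_gene_id to external_gene_name from a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    required = {"ensembl_gene_id", "external_gene_name"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return {row["ensembl_gene_id"]: row["external_gene_name"] for row in reader}


# First rows of the example file above, written with tab separators.
example = io.StringIO(
    "ensembl_gene_id\tgene_biotype\texternal_gene_name\n"
    "ENSG00000000003\tprotein_coding\tTSPAN6\n"
    "ENSG00000000005\tprotein_coding\tTNMD\n"
)
names = load_gene_names(example)
print(names["ENSG00000000003"])  # TSPAN6
```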
2.1.5 Cell-type Data Folder
Automated cell-type annotation in scFlow is currently implemented using the EWCE package by Dr Nathan Skene. The ctd_folder parameter specified in nextflow.config is the location of a folder containing reference dataset marker gene files, or cell-type data (CTD), saved in .rds format. Currently, only human/mouse brain cell-type annotations are supported, and the ctd_folder should contain both the Allan2019.rds and Zeisel2018.rds files.
2.1.6 Cell-type Mapping File
scFlow will automatically annotate cell-types and generate a celltype_mappings.tsv file indicating the predicted cell-type for each numbered cluster. This file can be manually edited in Excel/Calc, copied to the refs folder, and passed as the celltype_mappings parameter to manually revise the cell-types (e.g. based on preferred nomenclature, marker expression, etc.). The revised file should consist of two columns: clusters, containing an entry for each numbered cluster, and cluster_celltype, specifying the final cell-type annotation, e.g.
clusters | cluster_celltype |
---|---|
1 | Oligo |
2 | Micro |
3 | IN-SST |
4 | Micro |
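Note that several clusters may share one annotation, as clusters 2 and 4 do in the example. The sketch below, in Python, reads a revised mappings file and groups the clusters by their final cell-type, which is a quick way to review an edited file. The helper names are hypothetical and the code is independent of scFlow.

```python
import csv
import io
from collections import defaultdict


def load_celltype_mappings(handle):
    """Map each cluster label to its cell-type from a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["clusters"]: row["cluster_celltype"] for row in reader}


def clusters_by_celltype(mappings):
    """Group cluster labels by their assigned cell-type."""
    grouped = defaultdict(list)
    for cluster, celltype in mappings.items():
        grouped[celltype].append(cluster)
    return dict(grouped)


# The example table above, written with tab separators.
example = io.StringIO(
    "clusters\tcluster_celltype\n"
    "1\tOligo\n"
    "2\tMicro\n"
    "3\tIN-SST\n"
    "4\tMicro\n"
)
mappings = load_celltype_mappings(example)
grouped = clusters_by_celltype(mappings)
print(grouped)
```

Cluster labels are kept as strings here, since they are identifiers rather than quantities.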
2.1.7 Genes of Interest File
The optional reddim_genes.yml file allows genes of interest to be specified for expression plotting on a reduced dimension plot; this kind of plot is typically used to highlight cluster marker genes. The genes are specified in YAML format, organized by user-defined categories which become folder names for the generated plots. Here is an example reddim_genes.yml file:
EN:
- SYT1
- RBFOX3
- CUX2
- SATB2
- RORB
- TLE4
IN:
- GAD1
- GAD2
- PVALB
- SST
- VIP
- SV2C
MICROGLIA:
- P2RY12
- CD74
- CD68
- FTL
- APOE
- SPP1
OTHER:
- TP53
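The structure of this file is a mapping from category names to lists of gene symbols. The Python sketch below reads that restricted layout with a small hand-written parser so that it has no dependencies; it is not a general YAML parser, and the function parse_gene_categories is hypothetical. In practice a YAML library would normally be used instead (for example yaml.safe_load from PyYAML).

```python
def parse_gene_categories(text):
    """Parse 'CATEGORY:' lines followed by '- GENE' items into a dict of lists.

    Handles only the simple layout shown above, not YAML in general.
    """
    categories = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("- "):
            if current is None:
                raise ValueError(f"gene listed before any category: {line!r}")
            categories[current].append(line[2:].strip())
        elif line.endswith(":"):
            current = line[:-1].strip()
            categories[current] = []
        else:
            raise ValueError(f"unrecognised line: {line!r}")
    return categories


example = "EN:\n- SYT1\n- RBFOX3\nOTHER:\n- TP53\n"
genes = parse_gene_categories(example)
for category, symbols in genes.items():
    # Each category becomes a folder name for the generated plots.
    print(category, symbols)
```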
2.1.8 Working Directory
The workDir parameter defined in nextflow.config specifies the working directory where temporary outputs and cached data are stored. Further details are available in the Nextflow documentation. It is important not to use the same workDir for different analyses.