Reporting of bulk-RNA-Seq data analysis

Otoniel Maya

Warning!

  • We’re going to touch on a lot of topics and move pretty quickly.
  • You won’t be an expert by the end of this (yet), but that’s okay!
  • My main goal is to make sure you know these tools are out there.

About:

This tutorial focuses mainly on three topics:

  1. Bulk-RNASeq
  2. Data Analysis
  3. Reporting

Bulk-RNASeq

According to Conesa et al. 2019, a generic (bulk) RNASeq analysis includes three main steps:

  • Preprocessing: includes experimental design, sequencing design and quality control steps.
  • Core analysis: includes transcriptome profiling, differential gene expression, and functional profiling.
  • Advanced analysis: includes visualization, other RNASeq technologies, and data integration.

You should keep a record of all of them, but…

Sometimes research is not FAIR

FAIR stands for findability, accessibility, interoperability, and reusability of data.

  • Findable (F): how do you find the data?
  • Accessible (A): how do you gain access to the data?
  • Interoperable (I): are the data and metadata interoperable?
  • Reusable (R): is it possible for others to use the data in the future?

What does ‘the data’ mean in an RNASeq-FAIR context?

The data refers to:

  • Raw sequencing files

  • Metadata in a plain text file

  • Expression profile files: quantification, annotation, and differential expression files in binary or text format.

  • Analysis source code files

  • Plots

Findable (F)

  • Assign a globally unique and persistent identifier to your data (such as a DOI or UUID)
  • Describe your data with rich metadata that includes essential information.
  • Explicitly link metadata to the data they describe.
  • Ensure that your data are registered or indexed in a searchable resource.
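As a minimal sketch of the first three points, the snippet below builds a metadata record that carries a UUID and names the file it describes. The field names (`identifier`, `describes`, `organism`, `library_layout`) and the file name `sample_A_R1.fastq.gz` are illustrative assumptions, not a formal metadata standard. A DOI, unlike a UUID, is issued by a repository rather than generated locally.

```python
import json
import uuid


def make_metadata_record(data_file, **attributes):
    """Return a metadata dict with a unique identifier linked to one data file."""
    return {
        "identifier": str(uuid.uuid4()),  # randomly generated, globally unique
        "describes": data_file,           # explicit link from metadata to data
        **attributes,                     # free-form descriptive attributes
    }


record = make_metadata_record(
    "sample_A_R1.fastq.gz",  # hypothetical file name
    organism="Homo sapiens",
    library_layout="paired",
)
print(json.dumps(record, indent=2))
```

Storing the record as JSON next to the data file keeps the link between the two explicit and machine-readable.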

Accessible (A)

  • Make your data retrievable by their identifier.
  • Use a standardized protocol that is open, free, and universally implementable.
  • If necessary, incorporate an authentication and authorization procedure.
  • Consider international privacy regulations (e.g. GDPR, CCPA)

Interoperable (I)

  • Represent your data using a formal, shared, and broadly applicable language, format, and structure.
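One way to illustrate this is a sample sheet stored as tab-separated text, which most languages and tools can read. The sketch below writes a small table to TSV and parses it back with the Python standard library. The column names (`sample_id`, `condition`, `replicate`) and their values are hypothetical.

```python
import csv
import io

# Hypothetical sample sheet; the column names are illustrative only.
samples = [
    {"sample_id": "S1", "condition": "control", "replicate": "1"},
    {"sample_id": "S2", "condition": "treated", "replicate": "1"},
]


def to_tsv(rows):
    """Serialize a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_tsv(text):
    """Parse tab-separated text back into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


tsv_text = to_tsv(samples)
print(tsv_text)
```

Because the table survives a round trip through plain text, the same file can be read from R, Python, or a shell pipeline without a language-specific binary format.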

Reusable (R)

  • Richly describe your data and metadata with accurate and relevant attributes.
  • Associate your data with a clear and accessible data usage license.
  • Provide detailed provenance information.
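Provenance can start small. The sketch below records a checksum of a file's content together with the interpreter version, platform, and a timestamp. The record layout, the file name `counts.tsv`, and its content are assumptions made for illustration; a real project would typically also record package versions and the exact commands that were run.

```python
import hashlib
import platform
import sys
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(name, content):
    """Describe where and when a piece of content was produced."""
    return {
        "file": name,
        "sha256": sha256_of(content),                 # detects later changes
        "python_version": platform.python_version(),  # interpreter used
        "platform": sys.platform,                     # operating system family
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical count table content, kept in memory for the example.
example = provenance_record("counts.tsv", b"gene\tS1\nGENE1\t10\n")
for key, value in example.items():
    print(f"{key}: {value}")
```

Anyone who later obtains the file can recompute the checksum and confirm that it matches the recorded value.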

HOW:

  • ENA/SRA identifiers
  • Zenodo/Figshare
  • GitHub/GitLab/Bitbucket repository
  • pip/CRAN/Bioconductor package
  • Plain text (json, yaml, csv, tsv, etc.)
  • Common formats (hdf5, gzip, pickle, RData)
  • Workflow managers: Snakemake, Nextflow
  • Containerization: Docker/Podman, Singularity/Apptainer

Reporting

You are familiar with:

  • Introduction
  • Methods
  • Results
  • Discussion
  • Conclusion

Have you tried to reproduce a bioinformatic Methods section?

Good practice

  • Select the right SHELL: Bash
  • Use environments: renv, python environments (conda, pyenv, pipenv, poetry, pipx, etc.)
  • Select a good text editor/IDE
  • Use version control
  • Use notebooks: Rmd, Quarto, Jupyter, Pluto
  • Add a README file and a LICENSE (optional)
  • When possible, use a workflow manager
  • Structure your code: functions, classes, modules, etc.
  • Avoid hardcoded paths
  • Follow code-style rules: PEP 8, Tidyverse style
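To illustrate the point about hardcoded paths, the sketch below takes the project directory as a command-line argument and derives every other path from it. The `--project-dir` option, the `data/counts.tsv` and `results/figures` layout, and the `my_project` directory are hypothetical choices for this example.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Read the project directory from the command line instead of the code."""
    parser = argparse.ArgumentParser(description="Locate project inputs and outputs.")
    parser.add_argument(
        "--project-dir",
        type=Path,
        default=Path.cwd(),
        help="Root folder of the analysis (defaults to the current directory).",
    )
    return parser.parse_args(argv)


def project_paths(project_dir):
    """Build all paths relative to the project root."""
    return {
        "counts": project_dir / "data" / "counts.tsv",
        "figures": project_dir / "results" / "figures",
    }


# Arguments are passed explicitly here so the example runs anywhere.
args = parse_args(["--project-dir", "my_project"])
paths = project_paths(args.project_dir)
for name, path in paths.items():
    print(f"{name}: {path}")
```

The same script then runs unchanged on a laptop, a cluster, or a collaborator's machine; only the argument differs.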

Working platform

Clone a repository to get a shared project

  1. In RStudio, go to “File > New Project”
  2. Click on “Version Control: Checkout a project from a version control repository”
  3. Click on “Git: Clone a project from a repository”
  4. Fill in the info: URL: https://github.com/ATGenomics/rnaseq_report
  5. Browse to where you would like to create this folder: ~/workshop