According to Conesa et al. 2019, a generic (bulk) RNASeq analysis includes three main steps:
- Preprocessing: includes experimental design, sequencing design and quality control steps.
- Core analysis: includes transcriptome profiling, differential gene expression, and functional profiling.
- Advanced analysis: includes visualization, other RNASeq technologies and data integration.
You should keep a record of all of them, but…
Sometimes research is not FAIR
FAIR stands for findability, accessibility, interoperability, and reusability of data.
- Findable (F): how do you find the data?
- Accessible (A): how do you gain access to the data?
- Interoperable (I): are the data and metadata interoperable?
- Reusable (R): is it possible for others to use the data in the future?
What does ‘the data’ mean in an RNASeq-FAIR context?
The data refers to:
- Raw sequencing files
- Metadata in a plain text file
- Expression profile files: quantification, annotation, and differential expression files in binary or text format
- Analysis source code files
- Plots
Findable (F)
- Assign a globally unique and persistent identifier to your data (such as a DOI or UUID)
- Describe your data with rich metadata that includes essential information.
- Explicitly link metadata to the data they describe.
- Ensure that your data are registered or indexed in a searchable resource.
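As a minimal sketch of the first three points, the Python snippet below assigns a UUID to a data file and writes a metadata record that explicitly points back to it. The file name `counts.tsv`, the field names, and the helper `describe_dataset` are hypothetical and chosen only for illustration. A UUID on its own is not indexed anywhere, so depositing the data in a searchable repository would still be needed.

```python
import hashlib
import json
import tempfile
import uuid
from pathlib import Path


def describe_dataset(data_file: Path, description: str) -> dict:
    """Build a metadata record that identifies and points to one data file."""
    checksum = hashlib.sha256(data_file.read_bytes()).hexdigest()
    return {
        "identifier": str(uuid.uuid4()),  # globally unique, randomly generated
        "file_name": data_file.name,      # explicit link to the described data
        "sha256": checksum,               # lets others verify they hold the same file
        "description": description,
    }


# Demo with a throwaway file; in a real project this would be an existing result.
workdir = Path(tempfile.mkdtemp())
counts = workdir / "counts.tsv"           # hypothetical file name
counts.write_text("gene_id\tsample_1\nGENE_A\t10\n")

record = describe_dataset(counts, "Toy gene-level count table")
metadata_file = workdir / "counts.metadata.json"
metadata_file.write_text(json.dumps(record, indent=2))
print(metadata_file.read_text())
```

Storing the checksum next to the identifier is one possible design choice: it ties the metadata to the exact bytes of the file, not only to its name.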
Accessible (A)
- Make your data retrievable by their identifier.
- Use an open, free, and universally implementable standard protocol.
- If necessary, incorporate an authentication and authorization procedure.
- Consider international privacy regulations (e.g. GDPR, CCPA)
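To illustrate "retrievable by their identifier" over an open protocol, the sketch below builds an HTTPS request URL from an accession. It assumes the ENA Portal API `filereport` endpoint and its `accession`, `result`, `fields`, and `format` parameters; check these against the current ENA documentation before relying on them. The accession `SRR0000000` is a placeholder, not a real run, and no network request is made here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint of the ENA Portal API; verify against the ENA documentation.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_filereport_url(accession: str) -> str:
    """Return a URL asking ENA for the FASTQ locations of one accession."""
    query = {
        "accession": accession,
        "result": "read_run",
        "fields": "fastq_ftp,fastq_md5",
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(query)}"


url = ena_filereport_url("SRR0000000")  # placeholder accession
print(url)
```

Anyone holding the identifier can rebuild the same URL, which is the point of the principle: access depends on the identifier and a standard protocol, not on a private file path.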
Interoperable (I)
- Represent your data using a formal, shared, and broadly applicable language, format, and structure.
Reusable (R)
- Richly describe your metadata with accurate and relevant attributes.
- Associate your data with a clear and accessible data usage license.
- Provide detailed provenance information.
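Provenance can start small. The sketch below records when and where an analysis ran and which package versions were installed, using only the Python standard library. The function name `provenance_record`, the field names, and the example package list are assumptions made for illustration.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def provenance_record(packages: list[str]) -> dict:
    """Collect basic facts about the environment that produced a result."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "platform": platform.platform(),
        "command": sys.argv,       # how the script was invoked
        "packages": versions,
    }


record = provenance_record(["numpy", "pandas"])  # example package names
print(json.dumps(record, indent=2))
```

Saving this record next to each result file gives later readers a starting point for rebuilding the environment.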
HOW:
- ENA/SRA identifiers
- Zenodo/Figshare
- GitHub/GitLab/Bitbucket repository
- pip/CRAN/Bioconductor package
- Plain text (json, yaml, csv, tsv, etc.)
- Common formats (hdf5, gzip, pickle, RData)
- Workflow managers: Snakemake, Nextflow
- Containerization: Docker/Podman, Singularity/Apptainer
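For the plain-text option, a sample sheet can be kept as a TSV file that any tool can read. The sketch below writes one and reads it back with Python's `csv` module. The column names and rows are invented for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Invented example rows; values are kept as strings, as they would be in a text file.
samples = [
    {"sample_id": "S1", "condition": "control", "replicate": "1"},
    {"sample_id": "S2", "condition": "treated", "replicate": "1"},
]
columns = ["sample_id", "condition", "replicate"]

# Write the table as tab-separated text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(samples)
tsv_text = buffer.getvalue()
print(tsv_text)

# Read it back: the header row supplies the keys of each record.
restored = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
print(restored == samples)
```

Unlike `pickle` or `RData`, this file can be opened in R, Python, a spreadsheet, or a text editor without language-specific code.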
Reporting
You are familiar with:
- Introduction
- Methods
- Results
- Discussion
- Conclusion
Have you tried to reproduce a bioinformatic Methods section?
Good practice
- Select the right shell: Bash
- Use environments: renv, python environments (conda, pyenv, pipenv, poetry, pipx, etc.)
- Select a good text editor/IDE
- Use version control
- Use notebooks: Rmd, Quarto, Jupyter, Pluto
- Add a README file and, optionally, a LICENSE file
- When possible, use a workflow manager
- Structure your code: functions, classes, modules, etc.
- Avoid hardcoded paths
- Follow code-style rules: PEP 8, Tidyverse style
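One way to avoid hardcoded paths is to pass them as command-line arguments and derive output names from the inputs. The sketch below uses `argparse` and `pathlib`. The option names `--counts` and `--outdir`, the default directory `results`, and the `.summary.tsv` suffix are assumptions for illustration.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Read input and output locations from the command line."""
    parser = argparse.ArgumentParser(description="Summarise a count table.")
    parser.add_argument("--counts", type=Path, required=True, help="input count table")
    parser.add_argument("--outdir", type=Path, default=Path("results"), help="output directory")
    return parser.parse_args(argv)


def output_path(args: argparse.Namespace) -> Path:
    """Derive the output file name from the input instead of hardcoding it."""
    return args.outdir / f"{args.counts.stem}.summary.tsv"


# Example call with explicit arguments; on the command line, argv stays None
# and argparse reads sys.argv instead.
args = parse_args(["--counts", "data/counts.tsv"])
print(output_path(args))
```

The same script then runs unchanged on another machine, because no path is tied to one user's home directory.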
Clone a repository to get a shared project
- In RStudio, go to “File > New Project”
- Click on “Version Control: Checkout a project from a version control repository”
- Click on “Git: Clone a project from a repository”
- Fill in the info: URL: https://github.com/ATGenomics/rnaseq_report
- Browse to where you would like to create this folder:
~/workshop
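Outside RStudio, the same clone can be scripted. The sketch below builds the equivalent `git clone` command in Python and, by default, only prints it (a dry run). The helper names `clone_command` and `clone` are hypothetical. Running with `dry_run=False` assumes that git is installed and that the network is reachable.

```python
import shlex
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/ATGenomics/rnaseq_report"


def clone_command(url: str, parent: Path) -> list[str]:
    """Build the git command that clones `url` into a folder under `parent`."""
    target = parent / url.rstrip("/").split("/")[-1]  # folder named after the repository
    return ["git", "clone", url, str(target)]


def clone(url: str, parent: Path, dry_run: bool = True) -> list[str]:
    """Return the clone command, and run it unless this is a dry run."""
    command = clone_command(url, parent)
    if not dry_run:
        parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(command, check=True)  # requires git and network access
    return command


command = clone(REPO_URL, Path.home() / "workshop")  # dry run: nothing is downloaded
print(shlex.join(command))
```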