ENA Raw Read Submission Pipeline
Raw reads associated with a study must be deposited in a public repository, such as ENA, SRA NCBI, DDBJ, etc. I found the submission process for ENA to be the easiest.
This pipeline walks you through submitting raw sequencing reads to the European Nucleotide Archive (ENA)
. See Resource 1
below for the official documentation from ENA
:clipboard: Before uploading raw reads to ENA
The first step is to register as user on the Webin Submissions Portal
of ENA. This profile can be used to submit multiple studies
Supported read file formats are: CRAM, BAM, Single-end FASTQ, Paired-end FASTQ, HDF5 and FAST5
Read submission can be completed in 3 steps:
- Register study on web interface
- Sample metadata upload via web interface
- Read upload via command line interface
- Read metadata upload via web interface
For required mandatory metadata information, see Metadata collection
ALWAYS complete a test version
first and then repeat the steps for the production version
📤 Uploading data to ENA Webin Submissions Portal
-
Register study on the test service. Release date can be pushed back or forward. Abstract and description can be added later. Study acession will then become available in “Studies Report”. Save it for later use
The 3 lines on top left part of the screen opens the dashboard for navigation
-
Register samples -> Download spreadsheet. For human or mouse RNA-seq data, choose “Other checklists”. Next, for human/mouse data, choose “default checklist”. Fill it up (
see metadata collection for help
), go back to “Register samples” on dashboard and select Upload. Save assigned sample accession IDs from “Samples Report” for the read files checklistFor environmental and organismal (host-associated) samples, check
resource 2
-
Reads upload: To fill out the read files checklist, follow Step 4 in
resource 2
a. First, generate md5 checksum to verify read file integrity after upload. In the folder with the read files, run in terminal:
for f in *fastq.gz; do md5 $f | awk '{ gsub(/\(|\)/,""); print $1"\t" $2 }'; done > md5sums.tsv
Note the correct file extension used for your files - fastq.gz, fq.gz, etc. This command creates a tab-separated file with the md5 checksums and the file names. It should look like this (or in the reverse order of the columns):
| 3b078583e52381db7d88abf7912b76c1 | i712_0001_CGGCGCA_i512A_0001_GTCTCCCT_R1.fastq.gz | | de91c8fc0d76dbfe05a45e7431109c97 | i712_0002_AAAATCC_i512A_0002_TTATGGGT_R1.fastq.gz |
Troubleshoot: If the correct info is not on the tsv file, run the following in the terminal with one file name. Examine which output columns contain the filename and checksum.
md5sum i712_0031_AAAATCCCAGTT_i512A_0031_AACGTTTAGGGG_R1.fastq.gz | awk '{ gsub(/\(|\)/,""); print $1"\t" $2 }'
For example, print different columns such as $1 and $3 instead of $1 and $2 in the command above.
This works for Mac and Linux.
Example on Windows
)b. Next, upload the read files to the ftp server of ENA, preferrably remotely, as it can take a long time.
## Connect to ENA FTP server lftp webin2.ebi.ac.uk -u Webin-XXXXX ## Expected response: lftp Webin-XXXXX@webin2.ebi.ac.uk:~> ## Transfer read files mput ~/read-files/*.fastq.gz ## Expected response: Total x files transferred ## Disconnect from server bye
This works for Mac and Linux.
Example on Windows
-
Reads metadata upload: to upload the read files checklist along with the md5 checksum for each file, go to Dashboard -> Submit Reads -> Select download option based on file format -> fill it up (
see here for help
) -> upload.
Here, it is critical that the md5sums on the checklist should match those on the FTP server for each file. Once they do, “Run Files Report” on dashboard will indicate this with “File archived” or similar. If there are errors, you will see it immediately.
Troubleshoot, if md5 checksums don’t match: Check md5sums of data at source to check if downloading was the issue. Redo md5sum step. For further help, contact ENA via “Support” in Webin user area. “Internal errors” usually resolve without intervention.
That’s it!! All done for the test service 🎉
Now log in to production service
and repeat all steps.
Caution: Study and sample accessions will be different to test service
For other errors like “internal checking error”, etc, give it a couple of days to resolve on the production service. Contact support if errors persist.
Helpful resources:
Resource 1 - ena-docs.readthedocs.io
Resource 2 - biodiversitydata-se.github.io