

Common File Formats Used by the ENCODE Consortium

Overview

The ENCODE consortium uses several file formats to store, display, and disseminate data:

FASTQ: a text-based format for storing nucleotide sequences (reads) and their quality scores.

These file formats were originally designed to be generic and flexible. Because the ENCODE consortium is a collaborative effort, it has made several specifications on the file formats to facilitate data archival, presentation, and distribution, as well as integrative analysis of the data.

The consortium considers FASTQ the basic file format for archival purposes, so the FASTQ specifications aim to preserve the raw sequence data. In comparison, the other file formats are geared towards data visualization and dissemination, so their specifications aim to facilitate user-friendliness. Additional information about the file formats can be viewed at the UCSC Genome Browser ENCODE-specific File Formats.

FASTQ

FASTQ file content

FASTQ files are submitted as they come off the sequencing instrument to leave as many processing decisions as possible to downstream users. The files are accompanied by documentation detailing how the sequencing libraries were constructed, to inform end users about how they might want to process the data, the strengths and limitations of the various data-processing options, and how these may apply to the user's biological questions of interest.

ENCODE produces replicate data for most experiments to quantify reliability. Biological replicates involve different biological samples, e.g., different tissue preparations for cell growth and expansion when cell lines are used. Biological replicates are contrasted with technical replicates, for which different sequencing libraries are prepared from the same sample, or different sequencing lanes are used for the same library. Reads from different replicates are stored in separate files and should include flow cell and lane ID. If multiple lanes are used for the same biological or technical replicate, they are stored in the same file (after a QC check to eliminate failed lanes), with information on flow cell and lane ID included. For experiments that produce paired-end reads, the two reads in each pair are stored in two separate files, with the reads in the same order in the two files.

The reads in FASTQ files are unfiltered, i.e., barcodes, adapter sequences, and spike-ins remain in the files.

For Illumina sequencing, the barcodes that are in the so-called third read position should not be present in the sequence.
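
The paired-end convention described in this document (two files, with the reads in the same order in both) can be checked with a short script. The following is a minimal sketch, not an ENCODE tool. It assumes unwrapped four-line FASTQ records and assumes that mates share a read name once any whitespace-separated comment or trailing `/1` or `/2` suffix is removed; neither naming convention is specified by the text above. The function names `read_fastq`, `pair_key`, and `first_order_mismatch` are hypothetical.

```python
import io
from itertools import zip_longest
from typing import Iterator, NamedTuple, Optional, TextIO


class FastqRecord(NamedTuple):
    name: str      # header line without the leading '@'
    sequence: str  # nucleotide sequence
    quality: str   # quality string, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream (assumes unwrapped four-line records)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records or at the end
        sequence = handle.readline().strip()
        separator = handle.readline()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:].strip(), sequence, quality)


def pair_key(name: str) -> str:
    """Reduce a read name to the part shared by both mates (assumed conventions)."""
    parts = name.split()
    key = parts[0] if parts else ""   # drop any comment after whitespace
    if key.endswith(("/1", "/2")):    # drop an older-style mate suffix
        key = key[:-2]
    return key


def first_order_mismatch(read1: TextIO, read2: TextIO) -> Optional[int]:
    """Return the 1-based index of the first out-of-sync pair, or None if in sync."""
    pairs = zip_longest(read_fastq(read1), read_fastq(read2))
    for index, (rec1, rec2) in enumerate(pairs, start=1):
        if rec1 is None or rec2 is None:
            return index  # one file holds more reads than the other
        if pair_key(rec1.name) != pair_key(rec2.name):
            return index
    return None


# Tiny in-memory demonstration with made-up reads.
r1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCC\n+\nIIII\n")
r2 = io.StringIO("@read1/2\nTTAA\n+\nIIII\n@read2/2\nCCGG\n+\nIIII\n")
print(first_order_mismatch(r1, r2))  # prints None: the two streams are in sync
```

For real data, the same functions accept any text-mode file handle, for example one returned by `gzip.open(path, "rt")` for compressed files. Because the reads are unfiltered, the check only compares read names and makes no assumption about barcodes or adapters within the sequences.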
