Published on 27 January 2021
Danaus plexippus genome annotation
Description
The Funannotate v1.5.3 Docker image was used to train Augustus v3.2.3, predict gene models, and perform functional annotation. To optimize Augustus v3.2.3, funannotate used 2,404 PASA v2.3.3 gene models as training input. To obtain this training set, transcripts were assembled de novo with Trinity v2018-2.8.3 (--SS_lib_type RF) from all poly(A) RNA-seq paired reads after adapter removal with Trimmomatic v0.32. These transcripts were aligned to the genome with BLAT v36 within PASA, yielding a first set of gene models. The 500 longest non-redundant ORFs associated with the PASA gene models were used to train TransDecoder v5.2.0. The gene models were then selected according to their abundance, as estimated by Kallisto v0.44.0 (--rf-stranded) from the Trinity-normalized reads. Finally, BRAKER v2.0.3b trained Augustus with the retained gene models.
For gene prediction, funannotate aligned mRNAs and proteins from the previous annotation (official gene set 2, OGS2) with minimap2 v2.14-r883 (-ax splice --cs -u b -G 3000) and Diamond blastx v0.8.22, respectively. Funannotate further refined the protein alignments by extending each aligned region by 3 kb upstream and downstream and then running Exonerate v2.4.0. Additionally, funannotate parsed the introns supported by poly(A) RNA-seq read alignments generated with HISAT v2.1 (--rna-strandness RF --max-intronlen 10000). Augustus used this combination of hints (protein alignments, transcript alignments, and intron locations) to predict a second set of 16,756 gene models. Of these, 9,695 were classified as highly supported, i.e. more than 90% of the model was supported by intron hints, transcript alignments, or protein alignments. GeneMark-ET v4.35 (--max_intron 3000 --soft_mask 2000) was also run independently to predict a third set of gene models, relying only on intron hints.
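The "more than 90% supported" criterion can be illustrated with a minimal Python sketch. It assumes that support is measured as the fraction of exonic bases overlapped by evidence intervals; this is one possible reading of the criterion, not funannotate's or Augustus's actual implementation. The function names and the interval representation are hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: 1-based, inclusive (start, end) coordinates.
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def supported_fraction(exons: List[Interval], evidence: List[Interval]) -> float:
    """Fraction of exonic bases covered by at least one evidence interval."""
    merged_exons = merge_intervals(exons)
    merged_evidence = merge_intervals(evidence)
    total = sum(end - start + 1 for start, end in merged_exons)
    if total == 0:
        return 0.0
    covered = 0
    # Both lists are disjoint after merging, so no base is counted twice.
    for ex_start, ex_end in merged_exons:
        for ev_start, ev_end in merged_evidence:
            lo, hi = max(ex_start, ev_start), min(ex_end, ev_end)
            if lo <= hi:
                covered += hi - lo + 1
    return covered / total


def is_highly_supported(exons: List[Interval], evidence: List[Interval],
                        threshold: float = 0.90) -> bool:
    """Apply the strict 'more than 90%' rule described in the text."""
    return supported_fraction(exons, evidence) > threshold
```

Under this assumption, a model whose support is exactly 90% is not counted as highly supported, because the description says "more than 90%".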
The PASA, Augustus highly supported, Augustus not highly supported, and GeneMark prediction sets were combined with EVidenceModeler, with relative weights of 10, 5, 1, and 1, respectively. The predictions were further filtered by removing gene models that were shorter than 50 aa, that had high sequence similarity (diamond blastp --sensitive --evalue 1e-10) to the repeat database included in funannotate, or that had more than 90% of the model overlapping regions masked by RepeatMasker. The filtered set of gene models was then updated to include UTR information through two rounds of the PASA annotation comparison using the Trinity transcripts, with gene models filtered by transcripts per million (TPM) as calculated by Kallisto. Alternative transcripts were kept only if their expression was at least 10% of that of the most highly expressed transcript of the same gene.
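The filtering thresholds above can be sketched in Python as follows. This is an illustration of the stated rules only, not the code funannotate runs; the function names, argument names, and input structures are hypothetical, and the repeat-similarity result is assumed to be precomputed as a boolean.

```python
from typing import Dict, List


def keep_gene_model(protein_length_aa: int, masked_fraction: float,
                    has_repeat_hit: bool) -> bool:
    """Return True if a gene model passes the three filters in the text."""
    if protein_length_aa < 50:      # shorter than 50 aa
        return False
    if has_repeat_hit:              # similar to a repeat database entry
        return False
    if masked_fraction > 0.90:      # more than 90% in repeat-masked regions
        return False
    return True


def filter_isoforms(tpm_by_transcript: Dict[str, float],
                    min_ratio: float = 0.10) -> List[str]:
    """Keep transcripts of one gene whose TPM is at least min_ratio times
    the TPM of the gene's most highly expressed transcript."""
    if not tpm_by_transcript:
        return []
    top = max(tpm_by_transcript.values())
    # If every transcript has TPM 0, all of them pass (0 >= 0).
    return [name for name, tpm in tpm_by_transcript.items()
            if tpm >= min_ratio * top]
```

For example, for a gene with transcripts at 50, 6, and 1 TPM, the first two would be retained and the third dropped, because 1 TPM is below 10% of 50 TPM.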
Non-coding genes were annotated with the following tools: tRNAscan-SE v2.0 for tRNA genes, RNAmmer v1.2 for rRNA genes, and Infernal v1.1.1 for a variety of other RNA genes. For miRNA-encoding genes specifically, we used BLASTn to locate the genes from their most recent annotation. Lastly, FEELnc classified lncRNAs from the transcripts assembled by StringTie v1.3.2d, taking into account the protein-coding predictions described above.