Tutorial for the Grapevine Genome Annotation

Exhaustive identification of the genes within the family:

The first step of the analysis is to determine a preliminary set of genes that potentially belong to the family that you are curating. The strategy depends on the type of family. We have set up a color code indicating which steps you should follow according to your category. The search by protein domain is, however, possible for most genes, independently of the category. When possible, we recommend combining this approach with another one to obtain an exhaustive list. The category “metabolic and signaling pathways” requires different approaches depending on the individual genes.

For details, see the attached PDF (Horticulture Research, Volume 10, Issue 5, May 2023):

By protein domain:

If the members of the family that you are studying are known to share a protein domain, you should identify all the genes in the genome that carry this domain with the GBrowse. The first step is to identify the domain code corresponding to your search. The query page of the GBrowse allows you to search with identifiers from three databases: PFAM, PROSITE and SMART. Typing relevant keywords into their websites should return the corresponding code. Once you have the code, go to the query page of the GBrowse and enter it in the code field. Alternatively, you can browse the domains or enter a keyword to retrieve the corresponding domain. You should then obtain the list of genes containing the domain.
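The same filtering can be reproduced offline once a table of domain hits is available. The sketch below is only illustrative: it assumes a hypothetical tab-separated layout (gene identifier, source database, domain code), which is not the GBrowse export format, and the function name, gene identifiers and domain codes are placeholders chosen for the example.

```python
import csv
import io


def genes_with_domain(table_text, domain_code):
    """Return the sorted gene identifiers annotated with the given domain code.

    Assumes a hypothetical tab-separated table with three columns:
    gene identifier, source database (PFAM, PROSITE or SMART), domain code.
    """
    genes = set()
    for row in csv.reader(io.StringIO(table_text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        gene_id, _database, code = row[0], row[1], row[2]
        # Compare codes case-insensitively so "pf00249" and "PF00249" match.
        if code.strip().upper() == domain_code.strip().upper():
            genes.add(gene_id.strip())
    return sorted(genes)


# Made-up example data; the identifiers are placeholders, not real loci.
EXAMPLE_TABLE = (
    "gene_A\tPFAM\tPF00249\n"
    "gene_B\tSMART\tSM00717\n"
    "gene_C\tPFAM\tPF00249\n"
    "gene_A\tPROSITE\tPS51294\n"
)

print(genes_with_domain(EXAMPLE_TABLE, "PF00249"))
```

A set is used so that a gene carrying the same domain several times is only reported once.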

Transcription factor families:

There are two main databases for plant transcription factors. For Vitis, the analysis was done on the 8X genome for the first one and on a re-annotation of the 8X genome made at the NCBI for the second. Once you have retrieved the list of genes corresponding to your family from these websites, you can BLAST them in the GBrowse to retrieve the genes from the 12X v1. On both websites, you have to click on the transcription factor family and then on the Vitis species to access the page with the list of genes. On the first website, that page also tells you which domains should or must be present in the sequences.

Transporters:

The main database for transporters is the Transporter Classification Database. This database is not plant-specific. To retrieve the proteins of a given family, type your keywords in the search form of the website. Once you have identified the corresponding family, you can BLAST the plant proteins from the list against the grape sequences in the GBrowse.

The difference between the BLAST-based approaches and the domain approach is that, if a domain is missing because of a problem in the structural annotation made by GAZE or JIGSAW, you might still be able to identify potential candidates with BLAST. However, this depends on how well the sequences are conserved outside the important domains.
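When both approaches have been run, comparing the two candidate lists shows which genes deserve a closer look. The following is a minimal sketch with made-up gene identifiers; the function name and the dictionary keys are assumptions introduced for the example.

```python
def combine_candidates(domain_hits, blast_hits):
    """Compare candidate genes found by domain search and by BLAST.

    Returns a dictionary of sorted lists:
      both        - genes found by both approaches
      domain_only - genes found only through the domain search
      blast_only  - genes found only through BLAST (the domain may be
                    missing because of a structural annotation problem)
      all         - the union, i.e. the preliminary set to curate
    """
    domain_set = set(domain_hits)
    blast_set = set(blast_hits)
    return {
        "both": sorted(domain_set & blast_set),
        "domain_only": sorted(domain_set - blast_set),
        "blast_only": sorted(blast_set - domain_set),
        "all": sorted(domain_set | blast_set),
    }


# Placeholder identifiers, not real loci.
result = combine_candidates(
    domain_hits=["gene_A", "gene_B", "gene_C"],
    blast_hits=["gene_B", "gene_C", "gene_D"],
)
for category, genes in result.items():
    print(category, genes)
```

Genes that appear only in the BLAST list are the ones for which the gene model is worth checking during the structure verification step.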

Metabolic and signaling pathways:

Here you have to retrieve your gene list through homology with known genes (from Vitis and other species) previously identified in the literature. Depending on the type of gene, you might want to use one of the approaches mentioned previously or to run BLAST against public databases to retrieve your sequences. Some of your genes might also belong to a family studied by another group; in that case we ask you to work in collaboration with that group, especially for defining symbol names.

Other families:

Most of these families probably share a common protein domain, although you might also want to BLAST genes from other species, particularly Arabidopsis genes, against the Vitis genes in the GBrowse.

Verification of the structure:

The goal is to validate that the structural annotations made by GAZE and JIGSAW are accurate. Pay particular attention to genes in your preliminary set whose names indicate proximity on the genome: they might be two parts of the same gene. You can perform a blastx analysis against the public databases. Make sure that the coverage of both the query and the target is near 100% for the best hits. The potential function of the target does not matter here; you will probably match unknown poplar genes first.

If the coverage of the query is low, the query might be two different genes wrongly merged into one. To double-check, remove the matching part of the sequence and BLAST the remaining part again to see whether it matches a completely different gene; note, however, that the remaining part can simply be non-coding sequence.

If the coverage of the target is low, it might be because the grapevine gene has been wrongly split into multiple parts. To identify the missing part, BLAST the target against the grape genes in the GBrowse to retrieve the genes corresponding to the different parts of the actual gene; these parts have to be adjacent on the genome sequence. Parts of the gene might also not have been detected by GAZE and JIGSAW; in that case, you might want to BLAST the heterologous gene against the complete grape genome sequence.

If the coverage of both the query and the target is low, it might be because of poor homology between the grape gene and genes from other species. However, it could also result from a complex misassembly of the genes, and you should perform both analyses described above to determine what is wrong. If you still have trouble, you might want to contact Jerome Grimplet for help. If there is a problem, the sequence needs to be edited in the GBrowse. We are currently defining a procedure to do that.
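The coverage reasoning above can be written down as a small decision helper. This is a sketch under stated assumptions: it considers a single alignment per hit, the 90% threshold stands in for "near 100%" and is an arbitrary choice, and the function names and coordinates are made up for the example. With blastx, query coordinates and length are in nucleotides while target coordinates and length are in amino acids; each coverage is computed in its own units.

```python
def coverage(start, end, length):
    """Percentage of a sequence covered by one alignment.

    Coordinates are 1-based and inclusive; start may be larger than end
    (as happens for hits on the reverse strand).
    """
    if length <= 0:
        raise ValueError("sequence length must be positive")
    low, high = min(start, end), max(start, end)
    return 100.0 * (high - low + 1) / length


def diagnose(query_cov, target_cov, threshold=90.0):
    """Suggest what to check next from query and target coverage.

    The threshold is an assumption of this sketch, not a rule from the tutorial.
    """
    query_ok = query_cov >= threshold
    target_ok = target_cov >= threshold
    if query_ok and target_ok:
        return "structure looks consistent"
    if not query_ok and target_ok:
        return "query poorly covered: possibly two genes merged"
    if query_ok and not target_ok:
        return "target poorly covered: possibly a split or partly missing gene"
    return "both poorly covered: weak homology or complex misassembly"


# Made-up example: the alignment spans half of an 1800 nt grape query
# and almost all of a 310 aa target protein.
q_cov = coverage(1, 900, 1800)
t_cov = coverage(1, 300, 310)
print(round(q_cov, 1), round(t_cov, 1), diagnose(q_cov, t_cov))
```

The helper only points to the next check; the actual decision still requires looking at the alignments and at the neighbouring gene models.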

Characterization of the role of the genes:

Database analysis:

Blast against public databases to check the function:

You can use different platforms such as the NCBI, KEGG or UniProt. You can reuse the BLAST results from the structure verification step. You have to verify that the best hits actually correspond to the function that you expect. However, it is normal to find many unknown sequences, from Vitis or poplar for example, among the best hits. Also be careful when your genes match poplar and Ricinus communis genes: very few of them have been manually verified and the automatic criteria used to define their function appear to be quite loose, so it is quite possible that an Arabidopsis gene with a lower score will give you a better indication of the function. In some cases you will have to run the BLAST analysis on a relatively large number of sequences; note that the BLAST service at the NCBI now accepts more than one sequence per analysis. On the result page you will see a drop-down menu listing the sequences that you have uploaded.
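When many sequences are analyzed, a first pass over the hit list can skip descriptions that carry no functional information. The sketch below assumes the hits are already available as (description, bit score) pairs; the keyword list, the function name and the example descriptions are assumptions made for illustration, and the result does not replace reading the hits manually.

```python
# Words that usually mean a description says nothing about function.
# This list is an assumption of the sketch and can be extended.
UNINFORMATIVE = (
    "unknown",
    "hypothetical",
    "uncharacterized",
    "predicted protein",
    "unnamed",
)


def first_informative_hit(hits):
    """Return the best-scoring hit whose description is informative.

    hits is a list of (description, bit_score) tuples.
    Returns None if every description looks uninformative.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    for description, score in ranked:
        text = description.lower()
        if not any(word in text for word in UNINFORMATIVE):
            return description, score
    return None


# Made-up hit list for one query.
example_hits = [
    ("hypothetical protein [Populus trichocarpa]", 512.0),
    ("unknown [Vitis vinifera]", 498.0),
    ("example kinase [Arabidopsis thaliana]", 455.0),
]
print(first_informative_hit(example_hits))
```

In this made-up case the Arabidopsis hit is returned even though its score is lower, which mirrors the advice given above.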

Domain analysis:

If you have to perform the protein domain analysis, here are the addresses of the databases used in the GBrowse and of the INTERPRO database, which groups all of these databases and a few others.

Identification of subfamilies:

Within a family, it might be important to identify members of subfamilies, following the architecture proposed in other species. In some cases subfamilies can be defined according to the presence or absence of a specific domain, but they can also be identified according to the overall similarity of the protein structures; most likely both criteria have to be taken into account. Multiple sequence alignment and phylogenetic analysis can help to decipher the relative proximity between the members of a family. Here we propose several options for this analysis.

Multiple sequence alignments:

One of the most popular multiple sequence alignment (MSA) programs is ClustalW. However, its results are often far from perfect, and in recent years many new MSA programs have been developed. Among these, two of the most promising are MUSCLE and MAFFT. MUSCLE is renowned for its speed and accuracy, but uses a progressive alignment algorithm reminiscent of that employed by ClustalW. MAFFT version 6, on the other hand, introduced new programs in its suite which are inspired by the T-COFFEE and PROBCONS approaches. The new programs (G-INS-i, L-INS-i and E-INS-i) try to align the sequences using a consistency criterion in the refinement step. While G-INS-i is suited for aligning globally similar sequences, L-INS-i and E-INS-i take into account the possibility of large gaps in the alignments. Although at the time of writing the author advised against using these algorithms for large datasets (more than about 200 sequences) due to the heavy computational burden, on a modern laptop (1.73 GHz Intel Core Duo processor, 2 GB RAM, operating system: Ubuntu 9.10 32-bit) they performed very well even on datasets of almost 300 sequences. L-INS-i has been reported to be probably the best protein sequence alignment program currently available, along with PROBCONS; however, MAFFT was consistently faster in the simulations performed. For these reasons, L-INS-i and E-INS-i were our programs of choice for MSA. The outputs were analyzed using Jalview 2.4.

For phylogenetic analyses, the protein sequences of the genes of interest can be chosen as input instead of their corresponding DNA sequences. DNA or cDNA sequences tend to display saturation problems: if gaps are allowed in the alignment, as is the case with the output of most MSA programs, the proportion of residues that can be identical by chance between two sequences is as high as 50% on average. Moreover, substitution saturation at the wobble base can pose significant problems for the analysis, as the third codon position rapidly becomes randomized in distantly related sequences. This problem is much reduced with protein sequences, where on average only 10-15% of the residues between two aligned sequences could be identical by chance.
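For reference, the three MAFFT strategies mentioned above correspond to command-line options of the mafft program. The sketch below only builds the command; the wrapper functions and the file name are assumptions made for the example, and the option sets should be checked against the manual of the installed MAFFT version, since recommended settings may differ between releases.

```python
import subprocess

# Option sets for the iterative refinement strategies of MAFFT.
MAFFT_STRATEGIES = {
    "G-INS-i": ["--globalpair", "--maxiterate", "1000"],
    "L-INS-i": ["--localpair", "--maxiterate", "1000"],
    "E-INS-i": ["--genafpair", "--maxiterate", "1000"],
}


def mafft_command(strategy, fasta_path):
    """Build the mafft command line for the chosen strategy."""
    if strategy not in MAFFT_STRATEGIES:
        raise ValueError("unknown strategy: " + strategy)
    return ["mafft", *MAFFT_STRATEGIES[strategy], fasta_path]


def run_mafft(strategy, fasta_path, output_path):
    """Run MAFFT and write the alignment (FASTA) to output_path.

    Requires the mafft executable to be installed and on the PATH.
    """
    with open(output_path, "w") as handle:
        subprocess.run(mafft_command(strategy, fasta_path), stdout=handle, check=True)


# Only print the command here; "family_proteins.fasta" is a placeholder name.
print(" ".join(mafft_command("L-INS-i", "family_proteins.fasta")))
```

MAFFT writes the alignment to standard output, which is why run_mafft redirects it to a file that can then be opened in Jalview.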

Phylogenetic analysis:

Generally we perform two different phylogenetic analyses for each family. In order to ascertain the evolution of the family in grape, a first tree is generated using MrBayes 3.1.2. The analysis is performed using the prior Unconstrained:Exponential, the default in the program, which should be uninformative enough for the dataset to give reasonable results. Moreover, the instruction “prset aamodelpr=mixed” is always included, so that the program estimates which substitution matrix best describes the dataset. The resulting tree is highly informative about the evolution of each family in the grape lineage, and can be used in future studies of selection on the genes, for example with CODEML. The analyses are performed on the BioHPC server of Cornell University.

A second phylogenetic analysis is performed for each family on a new dataset that comprises both the Vitis vinifera sequences and the Arabidopsis thaliana sequences of interest. This model plant species is selected because, although it is evolutionarily more distant from Vitis than, for example, Populus trichocarpa, the number of its characterized genes exceeds that available for any other plant. The inferred phylogeny therefore makes it possible to propose working hypotheses on the role of many Vitis genes, based on the function in Arabidopsis of other genes in the same clades. The retrieval information for each protein family is described. However, although Bayesian inference of phylogeny provides good results, it is too computationally expensive for the analysis of the combined datasets. These are therefore analyzed using the maximum likelihood program PhyML. The best protein model is chosen by looking at both the results of the MrBayes run on the Vitis subset and the results of the ProtTest suite, which always converged. The option of using the SPR algorithm instead of the classic NNI is selected; this should minimize the probability of the program remaining stuck in a local maximum of the likelihood function. The reliability of the branches is estimated by testing the aLRT against a chi-square distribution.
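To give an idea of what testing a likelihood ratio against a chi-square distribution means, the sketch below converts a pair of log-likelihoods into a p-value using one degree of freedom. It is an illustration of the general principle only, not a reproduction of the PhyML implementation, which may differ in details such as the exact reference distribution; the function names and the log-likelihood values are made up.

```python
import math


def lrt_statistic(lnl_best, lnl_second):
    """Likelihood ratio statistic: twice the log-likelihood difference."""
    return 2.0 * (lnl_best - lnl_second)


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom.

    For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))


# Made-up log-likelihoods of the best and second-best arrangement around a branch.
statistic = lrt_statistic(-1234.50, -1237.00)
print(statistic, chi2_sf_1df(statistic))
```

The larger the gain in log-likelihood of the best arrangement over the alternative, the smaller the p-value and the better supported the branch.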

Level of evidence:

If, during the BLAST analysis, a gene appears to be identical to a previously reported Vitis gene, you should cite the publication providing evidence of its functional characterization and flag it as a well-characterized gene.

Naming convention of the symbol:

If a gene has been described previously in grape, you might want to give preference to the corresponding symbol. If several symbols have already been given to a gene, the other symbols have to be reported.

Full name:

The full name allows more freedom than the symbol. However, we ask you to be consistent among the genes within a family. Please also try to be as descriptive as possible. You will be able to enter alternative names in the gene product report if there are ambiguities.

Fields in the manual curation form (Gene product report):

Locus:

Locus name provided by Giorgio Valle. You should not have to modify it.

Symbol:

The symbol should give an indication of the function but also be as concise as possible. It should be between 3 and 8 characters long; for Arabidopsis, TAIR recommends 5.
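The length rule is easy to check automatically before submitting a batch of symbols. A minimal sketch, with a hypothetical function name and placeholder symbols:

```python
def symbol_length_ok(symbol, min_len=3, max_len=8):
    """Return True if the symbol has between min_len and max_len characters."""
    return min_len <= len(symbol.strip()) <= max_len


# Placeholder symbols, not real gene symbols.
for candidate in ("AB", "ABC1", "ABCDEFGH", "ABCDEFGHI"):
    print(candidate, symbol_length_ok(candidate))
```

Only the length is checked here; whether a symbol is informative and consistent within the family still has to be judged by the curator.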

Full name:

Complete name, similar to the best hits. If there is no evidence of the characterization of the gene in grapevine, you should add “-like” at the end of the function.
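The “-like” rule can be applied consistently with a small helper. This is a sketch; the function name, its arguments and the example name are assumptions made for illustration.

```python
def full_name(best_hit_name, characterized_in_grapevine):
    """Build the full name from the name of the best hits.

    Appends "-like" when there is no evidence of characterization in
    grapevine, unless the name already ends with it.
    """
    name = best_hit_name.strip()
    if characterized_in_grapevine or name.lower().endswith("-like"):
        return name
    return name + "-like"


print(full_name("chalcone synthase", characterized_in_grapevine=False))
print(full_name("chalcone synthase", characterized_in_grapevine=True))
```

Checking for an existing suffix avoids names such as “chalcone synthase-like-like” when the best hit itself was already annotated that way.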

Other names:

If a gene has been previously characterized under different names, they should be reported, especially for enzymes for which different names correspond to the same function. These names can be found in KEGG.

E.C number:

They have been obtained automatically for enzymes in the KEGG but have to be reviewed.

KO number:

As for E.C numbers, they have been obtained automatically for a wider range of proteins in the KEGG but also have to be reviewed.

Description (of the potential role):

If the role is inferred from another organism, this has to be indicated.
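For curators who prepare their entries in a script before filling in the form, the fields listed above can be grouped in one record. The sketch below is a hypothetical representation: the class name, the attribute names and the completeness check are assumptions and do not describe the actual form or its database.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeneProductReport:
    """One entry of the manual curation form (hypothetical layout)."""

    locus: str                        # provided, normally not modified
    symbol: str                       # concise, 3 to 8 characters
    full_name: str                    # similar to the best hits, "-like" if uncharacterized
    other_names: List[str] = field(default_factory=list)
    ec_number: Optional[str] = None   # obtained automatically, to be reviewed
    ko_number: Optional[str] = None   # obtained automatically, to be reviewed
    description: str = ""             # potential role
    inferred_from: Optional[str] = None  # organism, if the role is inferred

    def missing_fields(self):
        """List the free-text fields that are still empty."""
        missing = []
        if not self.symbol:
            missing.append("symbol")
        if not self.full_name:
            missing.append("full_name")
        if not self.description:
            missing.append("description")
        return missing


# Placeholder values only.
report = GeneProductReport(
    locus="LOCUS_PLACEHOLDER",
    symbol="ABC1",
    full_name="example kinase-like",
)
print(report.missing_fields())
```

Keeping the organism in a separate attribute makes it harder to forget to state that a role was inferred from another species.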
