Virtual-SAGE: A New Approach to EST Data Analysis

Valeriy Poroyko, Vladimir Calugaru, Mark Fredricksen, Hans J. Bohnert

Department of Plant Biology and Department of Crop Sciences, University of Illinois, 1201 W. Gregory Drive, Urbana, IL 61821, USA


The V-SAGE
The approach described here, termed 'Virtual SAGE' (V-SAGE), combines the efficiency, speed, and reliability of data mining in classical SAGE with the expediency of gene identification that characterizes EST analysis. The concept is based on establishing a correlation between several tags extracted from EST sequence collections at different distances from the poly(A) region. Extracting a tag at the extreme 3'-end and internal tags from each EST sequence reduces complexity and allows similar transcripts to be clustered into larger groups (populations of tags). BLAST analysis then marks contigs, and 3'-terminal variants within these groups, mostly in the 3'-UTR, are rapidly identified.
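The clustering idea described above can be illustrated with a small Python sketch. It is one possible reading of the concept, not the V-SAGE code itself: it assumes that ESTs sharing an internal tag form one group, and that more than one distinct 3'-end tag inside a group points to candidate 3'-terminal variants. The function name `group_by_internal_tag`, the record layout, and all clone names and tag sequences are made up for illustration.

```python
from collections import defaultdict


def group_by_internal_tag(records):
    """Group (clone, internal_tag, end_tag) records.

    Returns {internal_tag: {end_tag: [clone, ...]}}.
    Assumption: the internal tag defines the group and the
    poly(A)-adjacent tag distinguishes 3' ends within it.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for clone, internal_tag, end_tag in records:
        groups[internal_tag][end_tag].append(clone)
    return groups


# Made-up records, for illustration only.
records = [
    ("clone_1", "ACGTACGTTT", "AATTCCGGTT"),
    ("clone_2", "ACGTACGTTT", "AATTCCGGTT"),
    ("clone_3", "ACGTACGTTT", "GGCCTTAAGC"),
    ("clone_4", "TTGACCGTAA", "CCATTGGACA"),
]

groups = group_by_internal_tag(records)
for internal_tag, by_end in groups.items():
    # Group size is a simple count of ESTs carrying this internal tag.
    size = sum(len(clones) for clones in by_end.values())
    note = "candidate 3'-end variants" if len(by_end) > 1 else "single 3' end"
    print(internal_tag, size, len(by_end), note)
```

The group size serves here as a simple digital count of how often a transcript is represented in the library.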

The V-SAGE software structure.
The V-SAGE software is an in silico emulation of the established SAGE protocol. The input data consist of a FASTA file containing one or more EST sequences from the library of interest that have been determined from the 3' end. Data processing includes the following steps:

1. Identifying the poly(A) region (minimum eight A residues).
2. Extracting the 10-base tag immediately upstream of the poly(A) region.
3. Collecting the 10-base tags that are immediately 3'-adjacent to the 3'-most cleavage site of a selected restriction endonuclease, e.g., NlaIII (CATG), and recording the distance in nucleotides from this site to the poly(A)-adjacent tag.
4. Assigning a clone name identifier to each pair of tags.

The result of the data processing is the set of tags located upstream of the poly(A) region of each EST. This set is a unique identifier for any transcript, which can then be used for further analyses, such as a digital representation of cellular gene expression, studies of 3'-UTR variability, and as a map for regular SAGE transcript profiling.

The principle of V-SAGE is applicable, with certain precautions, to any number of sequence collections and short oligonucleotide strings. The choice of restriction endonucleases with four-bp recognition sites, for example, is determined by the average length of the sequences targeted. In our example, the average length of approximately 500 nucleotides provided a suitable frequency of sites (the theoretical optimum is 256 bp, which corresponds to 4^4, the expected spacing of a four-bp site in random sequence; the experimental data range from 100 to 600 nucleotides). This length does not, however, allow the extraction of tags at restriction sites of enzymes with 6-bp recognition sequences, which would require sequence lengths of at least 1 kb, and preferably more. When longer sequences are available, splice variants within the coding region of transcripts, for example, can also be identified by V-SAGE.
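The four data-processing steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the distributed Perl script. It assumes the sequences are given in sense orientation with the poly(A) run at the 3' end, that the 3'-most run of eight or more A residues is the one used, that the clone name is the first word of the FASTA header, and that the distance is counted from the first base of the restriction site to the first base of the poly(A)-adjacent tag. The names `read_fasta` and `extract_tags` and the example sequence are hypothetical.

```python
import re

POLY_A = re.compile(r"A{8,}")  # source: minimum of eight A residues
TAG_LEN = 10                   # source: 10-base tags
SITE = "CATG"                  # source: NlaIII recognition site


def read_fasta(lines):
    """Yield (clone_id, sequence) pairs from an iterable of FASTA lines."""
    clone_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if clone_id is not None:
                yield clone_id, "".join(chunks)
            # Assumption: the clone name is the first word of the header.
            header = line[1:].split()
            clone_id, chunks = (header[0] if header else ""), []
        elif line and clone_id is not None:
            chunks.append(line)
    if clone_id is not None:
        yield clone_id, "".join(chunks)


def extract_tags(seq):
    """Return (end_tag, internal_tag, distance), or None without a usable poly(A)."""
    seq = seq.upper()
    runs = list(POLY_A.finditer(seq))
    if not runs:
        return None
    poly_a_start = runs[-1].start()  # assumption: use the 3'-most run
    end_tag_start = poly_a_start - TAG_LEN
    if end_tag_start < 0:
        return None  # fewer than 10 bases upstream of the poly(A)
    end_tag = seq[end_tag_start:poly_a_start]
    # 3'-most site lying entirely upstream of the poly(A)-adjacent tag,
    # so that its own 10-base tag never runs into the poly(A) region.
    site = seq.rfind(SITE, 0, end_tag_start)
    if site < 0:
        return end_tag, None, None
    tag_start = site + len(SITE)
    internal_tag = seq[tag_start:tag_start + TAG_LEN]
    return end_tag, internal_tag, end_tag_start - site


# Made-up FASTA record, for illustration only.
fasta = [
    ">clone_1 made-up example",
    "GGCATGACGTACGTTTCCGGAATT",
    "CCGGTT" + "A" * 10,
]
for clone, seq in read_fasta(fasta):
    print(clone, extract_tags(seq))
```

Returning `None` for the internal tag when no site is found keeps the poly(A)-adjacent tag usable on its own. Whether the original script reports or discards such ESTs is not stated in the text.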

Web executable V-SAGE script.
Browse to a local FASTA file.
Click the Submit button.


CGI programming by Michael Brukman (http://misha.brukman.net).
Perl script by Lihua Jiang.

To run V-SAGE on a local computer, you need to have Perl installed on your machine and to download the file perlcode_v2.pl.
Use the command line:
c:\...\perl perlcode_v2.pl [filename.txt]
 
Last modified 10/14/03