Background to the disease
The Wellcome Trust
Current control measures
Objectives of the project Partner laboratories Positions available Publications arising from the project Microarray software Useful links

Microarray software

During the course of this project, a bioinformatic pipeline was developed to facilitate the analysis of expression data from NimbleGen custom microarray experiments. The software was specifically designed to process T. annulata microarray data, however many of the scripts are suitable for use with alternate array designs and target genomes without modification. The software suite comprises a collection of inter-dependent scripts implemented in Perl, which generate and process a series of files using a variety of custom file formats. Although the software was developed using ActivePerl on a PC running Windows XP, the code should be platform-independent. The software pipeline and documentation are available for download -

- Custom microarray Perl scripts (zipped)
- Script and filetype documentation (pdf)

1. Extracting data from genomic annotation files
Genomic sequence and annotation data for T. annulata are contained in a series of files in EMBL format. Unfortunately, EMBL files cannot be loaded into the viewing software, IGB (The Integrated Genome Browser - Affymetrix). The information must be parsed in order to generate a genomic annotation file, a FASTA file and a contig information file for viewing. This task is performed by the script process_embl_files.pl (Figure 1), which also performs the important function of generating an exon table, i.e. a list of the chromosomal location of all the exons and their ‘parental’ genes in the genome. The exon table is used as a reference for position-mapping probes to chromosomal features such as exons and genes.

2. Mapping probes to exons and genes
Two independent approaches were undertaken to associate parasite genes with probes on the array. In each case either ‘sense’ or ‘antisense’ probe sets can be identified.

2.1. Mapping probes to exons and genes based on chromosomal position
The first approach integrates data from the exon table (i.e. chromosomal location of exons) with that from probe-mapping (i.e. chromosomal location of probes). If a probe is identified as lying completely within an exon, then it is designated as mapping to that exon (Figure 4). This function is performed by the script map_probes_to_exons.pl (Figure 1). In turn, this information is assimilated by the script map_probes_to_genes.pl to produce a gene-probe mapping file, where each gene is represented by all the probes mapping to its exons.
Figure 4.  Position-mapping

2.2. Mapping probes to genes based on sequence similarity
The second approach uses BLAT, the BLAST-like alignment tool, to match probe sequences to gene sequences (Figure 5). BLAT is more accurate and many times faster than existing tools for nucleic acid alignment and operates by indexing all non-overlapping k mers in the genome. The sequence of each probe on the array is compared with the coding sequence of the genome to produce a PSL file (similar to a tabular BLAST report).
Figure 5.  BLAT-mapping
The script filter.pl reads the information in the PSL file in order to generate a list of genes and the probes that map to them by employing a flagging system similar to the web-based application, ProbeLynx. Briefly, a flag value of 1 represents a perfect, full-length alignment between a probe and gene in the BLAT-database, while a flag value of 5 represents a poor alignment. For each individual probe, if a clear best match in the coding sequence can be identified, that coding sequence (i.e. gene) is designated as that probes target sequence and any poorer scoring BLAT-aligned sequences are designated as cross-hybridisation candidates. Both target gene and cross-hybridisation candidates are assigned a flag. When generating a BLAT-mapped probe set, different threshold values can be used, for example the default is to only select probes which have a target flag of 1 or 2 (perfect or highly similar) and which have no cross-hybridisations candidates with a flag better than 3. These particular settings are based on experimental sensitivity and specificity studies of oligonucleotide arrays and similar to the position-mapping process, a gene-probe mapping file is created.

3. Sliding window and interval analysis
The process of mapping probes directly to genes and calculating their expression levels relies on the accuracy of pre-conceived gene models. An alternative approach is to perform a sliding window analysis, summarising probe signal across entire chromosomes in a series of overlapping frames, independent of gene information. The scripts sliding_window.pl and interval.pl were created to perform this function in a similar manner to the Affymetrix Tiling Array Software application, TAS (Figure 3). This form of analysis may be used to compare results from two ‘conditions’ with a single hybridisation or replicate hybridisations representing each condition. Two conditions may represent treatment and control or two different life-cycle stages. For example, the majority of TashAT genes are believed to be expressed in the macroschizont and down-regulated during differentiation and this is illustrated in Figure 6.

Figure 6. Sliding windows and interval analysis
Predicted transcripts compare favourably with the TashAT gene models. To summarise, a sliding window is moved along the chromosome in a series of steps and each successive window (or band) represents a different collection of neighbouring probes. The values of the probes in the first condition are compared to the values in the second condition using a Wilcoxon Rank-sum Test. This tests the hypothesis that the values in each group are different and calculates a p value for each band. The Hodges-Lehmann Estimator (HLE) is also calculated and may be used as a measure of the difference in signal intensity between the two conditions. Further details regarding the implementation of this test can be found in the documentation and in the Perl source code. The p values and HLE values for each window may be visualised directly in IGB to give an impression of differential expression between conditions.

The results of sliding window analysis can be used to predict the location of up-regulated and down-regulated transcripts and this is achieved using a three-step process called interval analysis. The result for each band is examined in turn and a band is designated as ‘valid’ if it has a p value below a preset threshold and an HLE value above a threshold. Following this, neighbouring valid bands are grouped together to form blocks of bands, containing gaps no larger than a pre-determined length. Finally, predicted transcripts are identified as blocks of bands above a minimum length. Details of each predicted transcript can be saved as genomic annotation and used for down-stream analysis.