

ISCB Student Council at ISMB 2006
[Special Page on our Wiki]


Poster Abstracts

2nd ISCB Student Council Symposium, August 6th, Fortaleza, Brazil


1. Peter Van Loo, Stein Aerts, Diether Lambrechts, Sunit Maity, Bert Coessens, Frederik De Smet, Leon-Charles Tranchevent, Bart De Moor, Peter Marynen, Bassem Hassan, Peter Carmeliet, Yves Moreau. University of Leuven, Herestraat 49, box 602, Leuven, B-2230, Belgium. [ PDF ]

Gene prioritization by genomic data fusion

Short Abstract: We developed a novel bioinformatics method, ENDEAVOUR, to prioritize candidate genes underlying pathways or diseases, based on similarity to genes known to be involved in these processes. ENDEAVOUR can fuse information from multiple heterogeneous data sources. We successfully validated ENDEAVOUR computationally, as well as in vitro and in vivo.


Long Abstract: The identification of genes involved in health and disease remains a formidable challenge. Here, we describe a novel bioinformatics method to prioritize candidate genes underlying pathways or diseases, based on their similarity to genes known to be involved in these processes. It is freely accessible as an interactive software tool, ENDEAVOUR, at http://www.esat.kuleuven.be/endeavour. Unlike previous methods, ENDEAVOUR generates distinct prioritizations from multiple heterogeneous data sources, which are then integrated, or fused, into one global ranking using order statistics. The data sources include sequence information (Blast, Interpro, regulatory motifs and cis-regulatory modules, disease probability), annotation (Gene Ontology, KEGG, text mining), expression data (EST libraries, microarray Gene Atlas), and protein-protein interaction. In addition, ENDEAVOUR offers the flexibility of including external data sources, such as in-house microarray data. ENDEAVOUR prioritizes candidate genes in a three-step process. First, information about a disease or pathway is gathered from a set of known "training" genes by consulting multiple data sources. Next, the candidate genes are ranked based on similarity with the training properties obtained in the first step, resulting in one prioritized list for each data source. Finally, ENDEAVOUR fuses each of these rankings into a single ranking, providing an overall prioritization of the candidate genes. We validated ENDEAVOUR by a large-scale leave-one-out cross-validation. In each validation run, one gene was removed from a set of training genes and added to 99 random genes. We then determined the ranking of this left-out gene. We used 627 known disease genes from 29 different diseases and 76 known pathway genes from 3 receptor signaling pathways. The median rank of the left-out disease genes was 3 out of 100 and of the pathway genes 2 out of 100. Thus, ENDEAVOUR can efficiently prioritize both disease and pathway genes. To assess whether ENDEAVOUR can also identify novel monogenic and polygenic disease genes, we performed 16 prioritizations of recently identified disease genes, each time using literature information only up to one year prior to their identification. For monogenic diseases, 50% of disease genes were prioritized within the top 2% of candidate genes. For polygenic diseases, 50% were prioritized within the top 15%. Furthermore, in a study of the myeloid differentiation pathway, we prioritized genes that had been linked to this pathway by microarray and cis-regulation analysis. In vitro validation showed that prioritization resulted in a significant increase in the number of true regulatory targets. Finally, as the most stringent test, we validated ENDEAVOUR in an animal model in vivo. DiGeorge syndrome (DGS) is a common congenital disorder, in which craniofacial dysmorphism and other defects result from abnormal development of the pharyngeal arches. ENDEAVOUR prioritization of 58 candidate genes in a 2 Mb region, involved in atypical cases of DGS, identified YPEL1 as a novel putative DiGeorge syndrome gene. In vivo validation in zebrafish by morpholino knockdown revealed that this gene is indeed involved in pharyngeal arch development. In conclusion, ENDEAVOUR offers novel opportunities for gene discovery.
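
To make the rank-fusion step concrete, here is a minimal Python sketch of combining per-source ranks with an order statistic. The recursion used for the Q-statistic and the toy gene names are assumptions added for illustration; the abstract only states that the rankings are fused using order statistics.

```python
from math import factorial

def q_statistic(rank_ratios):
    # Probability of observing rank ratios at least this small under a uniform
    # null model, via an order-statistics recursion (formula assumed; the
    # abstract only says the per-source rankings are fused with order statistics).
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0]                                  # V_0
    for k in range(1, n + 1):
        vk = sum((-1) ** (i - 1) * v[k - i] * r[n - k] ** i / factorial(i)
                 for i in range(1, k + 1))
        v.append(vk)
    return factorial(n) * v[n]

def fuse_rankings(per_source_ranks, n_candidates):
    # per_source_ranks: {gene: [rank in source 1, rank in source 2, ...]}
    scores = {gene: q_statistic([rank / n_candidates for rank in ranks])
              for gene, ranks in per_source_ranks.items()}
    return sorted(scores, key=scores.get)      # smaller score = stronger evidence

# toy usage: three candidate genes ranked by two data sources
print(fuse_rankings({"geneA": [1, 2], "geneB": [3, 1], "geneC": [2, 3]}, 3))
```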

2. Shay Zakov, Avraham A. Melkman. Computer Science Department, Ben Gurion University, Beer Sheva, 84105, Israel. [ PDF ]

Power CoClustering - detecting multiple regulatory influences in gene expression data

Short Abstract: Thermodynamic models of transcription regulation suggest that gene expression levels may obey a power law dependence on the activity level of the regulatory mechanism. We present a new method of CoClustering micro-array data, which aims to detect such power law relations, as well as encouraging preliminary results.


Long Abstract: The usual methods of coClustering, or bi-clustering, for the analysis of microarray expression data are ad hoc in the sense that they attempt to identify groups of co-expressed genes without referring to a model of transcription regulation. Often it is implicitly assumed that the expression level of a gene depends linearly on the concentration of the transcription factors that regulate its expression, even though thermodynamic models indicate the existence of complex relationships between the expression level and the concentrations of the factors. We report here on encouraging preliminary results of the analysis of gene expression data, using a novel algorithm that allows for the possibility of a power law relation between a gene's expression level and the activity level of the regulating mechanism. Such a relation may be expected, for example, in case the activation of transcription of a gene is driven by binding of the same transcription factor to several upstream binding sites. If transcription initiation takes place only when all regulatory sites are occupied by the transcription factor, the rate of transcription initiation is proportional to a power of the concentration of the transcription factor. In order to accommodate such power law behavior we define a new measure of distance between genes, which is smaller the closer one gene comes to being a constant power of the other in the set of experiments under consideration. The measure can also be generalized to sets of genes, in which case its value is smaller the closer each and every one of the genes comes to being modeled as a constant power of a single underlying activity variable. Although, as we prove, the problem of finding the exact value of this measure is computationally intractable, we present an iterative computational method that efficiently finds near-optimal solutions for most instances, and provably finds the optimal value for the case of two genes. Because genes usually serve in more than one capacity, it is to be expected that they are co-regulated only in certain experiments, whereas in others there is no similarity in their behavior. In order to deal with this possibility we have developed a Monte-Carlo algorithm, based on our distance measure, which identifies subsets of genes and simultaneously subsets of experiments, such that the measure of the subset of genes in the subset of experiments does not exceed some pre-specified bounding value. Preliminary results on synthetic data have shown that this algorithm has a high success rate in identifying planted coClusters in a large data set, suggesting that the algorithm has the potential to become a powerful tool for uncovering previously unknown regulatory mechanisms.
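
As a rough illustration of the kind of measure described, the sketch below scores how close one expression profile is to a constant power of another via a log-log least-squares fit. This is an illustrative stand-in with assumed details, not the authors' actual distance measure.

```python
import numpy as np

def power_law_distance(x, y, eps=1e-9):
    # Mean squared residual after fitting log(y) ~ a*log(x) + b, i.e. y ~ c*x**a.
    # Small values mean one gene is close to a constant power of the other.
    lx = np.log(np.asarray(x, dtype=float) + eps)
    ly = np.log(np.asarray(y, dtype=float) + eps)
    a, b = np.polyfit(lx, ly, 1)
    return float(np.mean((ly - (a * lx + b)) ** 2))

# toy usage: gene2 is (roughly) gene1 squared, gene3 is unrelated
gene1 = np.array([1.0, 2.0, 3.0, 4.0])
print(power_law_distance(gene1, gene1 ** 2))             # near 0
print(power_law_distance(gene1, [5.0, 1.0, 4.0, 2.0]))   # much larger
```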


3. Sérgio Manuel Serra da Cruz, Carla Corrêa Tavares dos Reis. Fundação Oswaldo Cruz, Av. Comandante Guaranis, 447, Jacarepaguá, Rio de Janeiro, 22775610, Brazil. [ PDF ]

Text Mining Provenance Logs of Bioinformatics Workflows

Short Abstract: Mining the provenance of bioinformatics datasets allows e-scientists to gain insight into the data and come up with new scientific hypotheses. In order to explore this opportunity, we present TM-BioWSLogA (Text Mining Bioinformatics Web Services Log Architecture), a framework that relies on text mining of Web Services log data captured from bioinformatics in silico experiments.


Long Abstract: 1. Introduction Genome data and genome analysis initiatives have been growing fast over the last few years, giving rise to huge amounts of data available over the Internet [1]. Such amounts of data and the complexity involved in analyzing them have originated the area of Bioinformatics, where in silico scientific experiments are applied to solve biological problems, encompassing multiple combinations of computational resources and mathematical algorithms. In bioinformatics environments, program composition is a frequent operation, requiring complex management [2]. A scientist faces many challenges when building an experiment: finding the right program to use, the adequate parameters to tune, managing input/output data, building and reusing workflows, and, last but not least, visualizing the results in order to enhance their understanding of the problem. The emergence of Web services technology represents a significant contribution to the reuse of scientific applications, since it provides unprecedented infrastructure for connecting otherwise isolated computing resources. Recently, Web services have been pointed out in the bioinformatics area as a potential technology to allow heterogeneous distributed biological data to be fully exploited [2,3]. Another feature is related to data mining: huge amounts of data in a wide variety of formats are used in different scientific experiments, which often requires transformations and predictive methods for analyzing the unstructured information. 2. Text Mining of Web Services Utilization Logs Recording the execution context of bioinformatics Web services experiments into proper logs is an important phase of a scientific workflow execution. A concrete usage of a Web services log is to help e-scientists avoid redundant efforts when repeating experiments. Usually, bioinformatics tools have several input parameters, which can modify the behavior of their algorithm and consequently modify the service results [5]. Another issue regarding service monitoring in bioinformatics environments is that e-scientists need efficiency. Some queries consist of hundreds of sequences at a time and can take several days to run. Besides, due to the confidential nature of such experiments, it may be useful to keep track of security issues, such as recording which scientists are submitting queries and when. Finally, e-scientists should also keep track of the program outputs they have used if they want to obtain faster results and be able to reproduce the same experiments on another occasion. In order to address those issues, we previously presented an architecture named BioWSLogA, which supports flexible log generation without changing the code of existing services, generating XML-based log repositories about service execution [5]. It can store complex and heterogeneous data structures and, at the same time, describe them through encapsulated metadata. In spite of this facility, e-scientists still need to understand the results of bioinformatics experiments and workflows, so we propose the use of text mining techniques to help them elucidate structures in complex datasets. This approach can play an important role in exploratory data analysis, where mining representations can help them build up an understanding of the content of their in silico datasets and experiments. The text mining resources can be applied to refine experimental biological methods by searching any workflows, annotations and derivation histories of bioinformatics data products.
These are statistical methods that lack prior knowledge and counterbalance that deficiency with massive processing of data, finding patterns in word combinations that are recurring and predictive. The goal of this paper is to present a Web Services log text mining provenance tool, named TMP-BioWSLogA (Text Mining Provenance-Bioinformatics Web Services Log Architecture), which relies on mining the provenance of Web Services log data captured in a bioinformatics environment. The contribution of TMP-BioWSLogA is twofold: first, it provides a human visual perception of biological experiments towards e-scientist personalization and auditing, supporting a new range of experimental strategies; and second, it also supports refined investigations of service quality monitoring, addressing administrative issues like performance, security and availability. TMP-BioWSLogA uses both Web server logs and intercepted SOAP messages recorded by BioWSLogA [4, 5]. 3. A Text Mining Provenance Web Services Log Mining Architecture The TMP-BioWSLogA proposal addresses e-scientists' needs for gathering knowledge about their in silico experiments. The architecture's prototype can be used to amplify the perception of rules, patterns, regularities and behaviors. It aids e-scientists in visualizing four different aspects of bioinformatics experiment data sets: (i) the usage of suitable experiment parameters; (ii) an information extraction method for analyzing unstructured information of bioinformatics Web services composition; (iii) an easy way to audit and track Web services utilization; (iv) a feasible way to keep track of data provenance. TMP-BioWSLogA is a multi-layered architecture capable of dealing with both Web services XML-based logs and traditional Web server logs as input data. The integration layer is a set of programs used to prepare data for further processing, for instance: extraction, cleaning, transformation and loading. This layer uses XQuery, XSLT and XML Schemas to feed the data repository, i.e., a native XML database. The Web server log parser component is used to parse and transform plain ASCII files produced by a Web server into a standard XML format. This component is important to make the architecture independent from the Web server supplier. The sessionization layer is used to tie the instances of Web services and Web pages (through database foreign keys) to sessions and users. This layer is important for investigating the usage of Web services compositions across user sessions. The database layer is a repository of input/output bioinformatics experiment data. It also stores pre-processed logs, e-scientist sessions, and information about Web services execution. The miner engine layer is a text mining engine and is in charge of bulk loading XML data from bioinformatics queries and executing the mining algorithms. It assumes the query collection is in XML format and examines the unstructured text to identify useful features. The first step in handling text is to break the query stream of characters into words or, more precisely, tokens. The characteristic features of queries are the tokens or words they contain. So we can choose to describe each query by features that represent the most frequent tokens. This layer should be used to present implicit and useful knowledge from in silico experiments and Web services usage and composition. Data can be viewed at different levels of granularity and abstraction as parallel coordinates graphs [6, 7].
This model easily shows the interrelationships and dependencies between different dimensions such as users, experiments, services, parameters and results. Interactively, the model can be used to discover sensitivities and to do approximate optimization, providing a simple decision support environment. 4. Conclusion and Future Work To the best of our knowledge, there are no other initiatives for text mining the provenance of bioinformatics Web Services logs or service compositions. TMP-BioWSLogA is being tested with data originated by a collection of real-world bioinformatics Web services; we are involved in refining the architecture, which was implemented as a Java prototype using Tomcat/Axis as the SOAP engine, with the Java classes of the miner engine layer able to run on a variety of platforms.
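
To illustrate the tokenization step mentioned above, here is a minimal Python sketch: each logged query is broken into tokens and described by the most frequent tokens in the collection. The token definition and example queries are assumptions for illustration, not taken from TMP-BioWSLogA.

```python
import re
from collections import Counter

def frequent_token_features(queries, top_k=10):
    # Tokenize each query and describe it by presence/absence of the
    # corpus-wide most frequent tokens (token = alphanumeric run; an assumption).
    tokenized = [re.findall(r"[A-Za-z0-9_]+", q.lower()) for q in queries]
    counts = Counter(tok for toks in tokenized for tok in toks)
    vocab = [tok for tok, _ in counts.most_common(top_k)]
    features = [[int(tok in toks) for tok in vocab] for toks in tokenized]
    return vocab, features

# toy usage with two hypothetical service-log queries
vocab, X = frequent_token_features([
    "blastp query=P53_HUMAN db=nr evalue=1e-5",
    "blastp query=BRCA1_HUMAN db=swissprot evalue=1e-3",
])
print(vocab)
print(X)
```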


4. George S Vernikos, Julian Parkhill. The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. [ PDF ]

Interpolated Variable Order Motifs for identification of horizontally acquired DNA: revisiting the Salmonella Pathogenicity Islands

Short Abstract: Interpolated Variable Order Motifs (IVOMs) exploit compositional biases using variable-order motif distributions to predict Horizontal Gene Transfer (HGT) events more reliably than fixed-order methods. For optimal localization of the predicted boundaries, a 2nd-order, 2-state Hidden Markov Model (HMM) is implemented in a change-point detection framework.


Long Abstract: There is a growing literature on the detection of Horizontal Gene Transfer (HGT) events by means of parametric, non-comparative methods. Such approaches rely only on sequence information and utilize different low-order (e.g. G+C content, δ* difference) and high-order (e.g. 6-9mers) indices to capture compositional deviation from the genome backbone; the superiority of the latter over the former has been shown elsewhere. However, even high-order k-mers may be poor estimators of HGT when insufficient information is available, e.g. in short sliding windows. Most of the current HGT prediction methods require pre-existing annotation, which may restrict their application to newly sequenced genomes, or bias the results. We introduce a novel computational method, Interpolated Variable Order Motifs (IVOMs), that combines the best of the two approaches (low- and high-order distributions) and requires no existing annotation. An IVOM approach exploits compositional biases using variable-order motif distributions and captures the local composition of a sequence more reliably than fixed-order methods. For optimal localization of the boundaries of each predicted region, a 2nd-order, 2-state Hidden Markov Model (HMM) is implemented in a change-point detection framework. We applied the IVOM approach to the genome of Salmonella enterica serovar Typhi CT18, a well-studied prokaryote in terms of HGT events, and we show that IVOMs outperform state-of-the-art low- and high-order motif methods, predicting not only the already characterized Salmonella Pathogenicity Islands (SPI-1 to SPI-10) but also three novel SPIs (SPI-13, SPI-14, SPI-15) and other HGT events. Validation of the predicted novel SPIs was carried out by comparative genome analysis between Escherichia coli K12 and 8 representatives of the Salmonella lineage. Availability: The software is available under a GPL license as a standalone application at http://www.sanger.ac.uk/Software/analysis/alien_hunter.
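
As a simplified illustration of the boundary-localization step, the sketch below runs a plain first-order, 2-state Viterbi pass over per-window scores (native vs. alien). The actual method uses a 2nd-order HMM in a change-point framework; the scores and switch penalty here are invented for the example.

```python
import numpy as np

def two_state_viterbi(log_lik_native, log_lik_alien, switch_penalty=3.0):
    # Most likely native(0)/alien(1) labelling of consecutive windows,
    # with a fixed log-penalty for switching state (first-order sketch only).
    ll = np.vstack([log_lik_native, log_lik_alien])
    n = ll.shape[1]
    score = ll[:, 0].copy()
    back = np.zeros((2, n), dtype=int)
    for t in range(1, n):
        new = np.empty(2)
        for s in (0, 1):
            stay, switch = score[s], score[1 - s] - switch_penalty
            back[s, t] = s if stay >= switch else 1 - s
            new[s] = max(stay, switch) + ll[s, t]
        score = new
    path = [int(np.argmax(score))]
    for t in range(n - 1, 0, -1):
        path.append(back[path[-1], t])
    return path[::-1]

# toy usage: the middle three windows look compositionally "alien"
native = [-1, -1, -5, -6, -5, -1, -1]
alien  = [-4, -4, -1, -1, -1, -4, -4]
print(two_state_viterbi(native, alien))   # expected: [0, 0, 1, 1, 1, 0, 0]
```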


5. Mileidy Gonzalez, Stephen J. Freeland. University of Maryland Baltimore County, Biological Sciences, 1000 Hilltop Circle, Baltimore, MD 21250, USA. [ PDF ]

Analyzing the Effects of Generalizations Implicit within the BLAST Algorithm

Short Abstract: We have mapped the entire network of simplifying mathematical assumptions used by the BLAST algorithm (the most widely-used pairwise sequence comparison method) to identify the properties of unusual, natural sequences that could compromise the program's reliability. We present preliminary data that evaluates the magnitude of potential problems.


Long Abstract: Biological sequence comparison is one of the most widely used techniques of modern biology. In particular, because this method can be used to make quantitative estimates of whether and how two sequences are homologous, its use is implicit within many fundamental bioinformatics techniques (e.g. phylogenetic tree construction, genome annotation, threading, protein family assembly, etc.). The underlying algorithm has been developed over the course of sixteen years and has gone through one major conceptual change [Altschul et al 1990, Altschul et al 1997]. Thus, although the use of local pairwise alignment algorithms such as BLAST is extremely widespread.


6. Francisco M Couto, Mário J Silva, Vivian Lee, Emily Dimmer, Evelyn Camon, Rolf Apweiler, Harald Kirsch, Dietrich Rebholz-Schuhmann. Faculdade de Ciencias da Universidade de Lisboa, Campo Grande, Lisboa 1749-016, Portugal. [ PDF ]

GOAnnotator: linking protein GO annotations to evidence text

Short Abstract: GOAnnotator is a tool for assisting the GO annotation of UniProt entries. GOAnnotator links the GO terms present in the uncurated annotations with evidence text automatically extracted from the documents linked to UniProt entries.


Long Abstract: This abstract illustrates how text mining can be integrated into a biological database curation process, by describing GOAnnotator (http://xldb.fc.ul.pt/rebil/tools/goa/), a tool for assisting the GO annotation of UniProt entries. GOAnnotator links the GO terms present in the uncurated annotations with evidence text automatically extracted from the documents linked to UniProt entries. Initially, the curator provides a UniProt accession number to GOAnnotator. GOAnnotator follows the bibliographic links found in the UniProt database and retrieves the documents. Additional documents are retrieved from the GeneRIF database. Curators can also provide any other text for mining. GOAnnotator then extracts from the documents GO terms similar to the GO terms present in the uncurated annotations. The extraction of GO terms is performed by FiGO, a method that receives text and returns the GO terms detected. The degree of similarity between two GO terms is calculated through the semantic similarity measure proposed by Lin. GOAnnotator ranks the documents based on the GO terms extracted from the text and their similarity to the GO terms present in the uncurated annotations. Any extracted GO term is an indication of the topic of the document, which is also taken from the UniProt entry. GOAnnotator displays a table for each uncurated annotation with the GO terms that were extracted from a document and found similar to the GO term present in the uncurated annotation. For each uncurated annotation, GOAnnotator shows the similar GO terms extracted from a sentence of the selected document. If any of the sentences provides correct evidence for the uncurated annotation, or if the evidence supports a GO term similar to that present in the uncurated annotation, the curator can store the annotation together with the document reference, the evidence codes and additional comments. The sentences from which the GO terms were extracted are also displayed. Words that have contributed to the extraction of the GO terms are highlighted. GOAnnotator gives the curators the opportunity to manipulate the confidence and similarity thresholds to modify the number of predictions. Assessment: From the set of UniProt/SwissProt proteins with uncurated annotations and without manual annotations, we selected 66 proteins for which GOAnnotator identified evidence texts with more than 40% similarity and 50% confidence. For 80 uncurated annotations to these proteins, GOAnnotator extracted 89 similar annotations and their evidence text from 118 MEDLINE abstracts. The 80 uncurated annotations included 78 GO terms. After analyzing the 89 evidence texts, GOA curators found that 83 were valid to substantiate 77 distinct uncurated annotations, i.e. 93% precision. In most cases where the evidence text was correct, the GO term present in the extracted annotation was the same as the GO term present in the uncurated annotation (65 cases). Although the evidence text was correct, most of the time it did not contain exactly any of the known representations of the extracted GO term. In the other cases the extracted GO term was similar: in 15 cases the extracted GO term was in the same lineage as the GO term in the uncurated annotation; in 3 cases the extracted GO term was in a different lineage, but both terms were similar (they share a parent). Discussion: Researchers need more than facts, they need the source from which the facts derive.
GOAnnotator provides not only facts but also their evidence, since it links existing annotations to the scientific literature. GOAnnotator uses text mining methods to extract GO terms from scientific papers and provides this information together with a GO term from an uncurated annotation. In general, we can expect GOAnnotator to confirm the uncurated annotation using the findings from the scientific literature, but it is obvious as well that GOAnnotator can propose new GO terms. GOAnnotator provided correct evidence text at 93% precision, and in 78% of those cases the GO term present in the uncurated annotation was confirmed. This performance meets the expectations of the curation process. However, sometimes the displayed sentence from the abstract of a document did not contain enough information for the curators to evaluate an evidence text with sufficient confidence. Apart from the association between a protein and a GO term, the curator needs additional information, such as the type of experiments applied and the species from which the protein originates. Unfortunately, quite often this information is only available in the full text of the scientific publication. GOAnnotator can automatically retrieve the abstracts, but in the case of the full text the curator has to copy and paste the text into the GOAnnotator interface, which only works for a limited number of documents. In addition, the list of documents cited in the UniProt database was not sufficient for the curation process. In most cases, the curators found additional sources of information in PubMed. GOAnnotator ensures high accuracy, since all GO terms that did not have similar GO terms in the uncurated annotations were rejected. This meets the GOA team's need for tools with high precision in preference to those with high recall, and explains the strong restriction on the similarity of two GO terms: only those that were from the same lineage or had a shared parent were accepted. Thus, GOAnnotator not only predicted the exact uncurated annotation but also more specific GO annotations, which was of strong interest to the curators. To avoid general terms, GOAnnotator takes advantage of uncurated annotations by extracting only similar terms, i.e. popular proteins tend to be annotated with specific terms and therefore GOAnnotator will also extract specific annotations for them.
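
For reference, here is a minimal sketch of Lin's semantic similarity between two GO terms, the measure used above to compare extracted and uncurated terms. The toy term probabilities and DAG are invented for illustration.

```python
import math

def lin_similarity(t1, t2, prob, ancestors):
    # Lin's measure: 2*IC(most informative common ancestor) / (IC(t1) + IC(t2)),
    # with IC(t) = -log p(t) estimated from an annotation corpus (assumed given).
    ic = lambda t: -math.log(prob[t])
    common = (ancestors[t1] | {t1}) & (ancestors[t2] | {t2})
    if not common:
        return 0.0
    return 2.0 * max(ic(t) for t in common) / (ic(t1) + ic(t2))

# toy DAG: GO:b is a child of the root GO:a; GO:d and GO:e are children of GO:b
prob = {"GO:a": 1.0, "GO:b": 0.2, "GO:d": 0.05, "GO:e": 0.04}
ancestors = {"GO:a": set(), "GO:b": {"GO:a"},
             "GO:d": {"GO:a", "GO:b"}, "GO:e": {"GO:a", "GO:b"}}
print(lin_similarity("GO:d", "GO:e", prob, ancestors))  # terms sharing parent GO:b
```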


7. Scott, Luis P. B. Unifev - Centro Universitário de Votuporanga, Rua Pernambuco n. 4196, Votuporanga, São Paulo 15500006, Brazil. [ PDF ]

Application of MLP Networks to Protein Secondary Structure Prediction

Short Abstract: The prediction of protein secondary structure can contribute to elucidating the protein folding problem. In order to predict these structures we used Artificial Neural Network (ANN) methods starting from the primary sequences of amino acids. The obtained results are compared with the published predictors PSA, PSIPRED and PHD in order to assess the quality of the prediction.


Long Abstract: The term "protein" comes from the Greek (proteios), meaning "of the first magnitude". Proteins are complex molecules that have a specific tertiary structure. These macromolecules perform tasks such as catalysis of chemical reactions, transport, recognition and signal transmission. We therefore need to know the 3D structure of these molecules, and the prediction of protein secondary structure can contribute to elucidating the protein folding problem. In order to predict these structures we used Artificial Neural Network (ANN) methods starting from the primary sequences of amino acids. ANNs are good tools for classifying and recognizing patterns, and are therefore good tools for 1D prediction. Our main objective was to develop software for 1D prediction on the Web. In the present work we used ANNs to predict the secondary structures of proteins, taking as patterns the structures in helix (H), beta sheet (E) and coil (C) form. The ANNs were trained with the MATLAB simulator. The obtained results are compared with the published predictors PSA, PSIPRED and PHD in order to assess the quality of the prediction. The present work is composed of three network levels. The outputs from all first-level ANNs are fed into a single second-level ANN. The third level consists of a jury decision.
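
As an illustration of the kind of input such first-level networks take, the sketch below one-hot encodes sliding windows of a protein sequence, producing one feature row per residue. The window size and padding scheme are assumptions added for the example, since the abstract does not specify them.

```python
import numpy as np

AMINO = "ACDEFGHIKLMNPQRSTVWY"

def window_encodings(seq, win=13):
    # One-hot encode a sliding window around each residue as MLP input
    # for helix/strand/coil (H/E/C) prediction; 'X' padding is an assumption.
    pad = win // 2
    padded = "X" * pad + seq.upper() + "X" * pad
    rows = []
    for i in range(len(seq)):
        vec = np.zeros((win, len(AMINO)))
        for j, aa in enumerate(padded[i:i + win]):
            if aa in AMINO:
                vec[j, AMINO.index(aa)] = 1.0
        rows.append(vec.ravel())
    return np.array(rows)

X = window_encodings("MKTAYIAKQR")
print(X.shape)  # (10, 13*20): one feature row per residue
```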


8. Diogo Fernando Veiga, Fábio Fernandes da Rocha Vicente, Marco Grivet, Ana Tereza Ribeiro Vasconcelos. Laboratório de Bioinformática, Laboratório Nacional de Computação Científica, Av Getulio Vargas, 333, Petropolis, Rio de Janeiro 25651-075, Brazil. [ PDF ]

Recovering Regulatory Interactions in Escherichia coli through Partial Correlation Analysis of Microarray Data

Short Abstract: In this work, we analyzed a large microarray dataset of Escherichia coli Affymetrix GeneChips using partial correlation coefficients as a way to predict regulatory interactions in transcriptome data. We found that partial coefficients were able to correctly recover a large proportion of transcription factor interactions as well as coregulated operons.


Long Abstract: Transcriptional control is an essential regulatory mechanism employed by bacteria (Lin and Lynch, 1996). In this type of regulation, transcription factors (TFs) bind to an operon's cis-regulatory region to induce or repress its expression. Even for the most studied bacterium, Escherichia coli, much about regulation remains to be discovered. Genome annotation carried out using sequence analysis tools, such as motif detection, was not able to assign transcriptional units for over two thousand genes in E. coli, as found in RegulonDB (Salgado et al., 2006). At the same time, transcriptomics-related techniques, such as high density oligonucleotide arrays as well as cDNA spotted arrays, have produced invaluable datasets that should be explored for the purpose of elucidating the underlying regulatory mechanisms of biological systems. In this work, we have analyzed a large microarray dataset of E. coli Affymetrix GeneChips using partial correlation coefficients as a way to predict regulatory interactions in transcriptome data. First, we assembled a dataset with 1077 genes (grouped into 434 operons and 137 transcription factors) and 58 observations, each one corresponding to a single hybridization performed with an E. coli Antisense Genome Array chip obtained from GEO (Barrett et al., 2005). The genes used in this analysis make up the RegulonDB 5.0 transcriptional regulatory network, which was used as a gold-standard regulatory dataset for validation of the results. The preprocessing of raw data (CEL files), including quantification of probesets and normalization, was carried out using the mas5 algorithm available in Bioconductor (Gentleman et al., 2004). The annotation of probesets was performed with the aid of the NetAffx online tool (Cheng et al., 2004). We then applied low-order partial correlation - zeroth-order Pearson correlation (Pearson 0th), first-order Pearson correlation (Pearson 1st) and second-order Pearson correlation (Pearson 2nd) - using the software ParCorA (de la Fuente et al., 2004). The partial coefficients are obtained by conditioning the original correlation between two variables on one or more controlling variables. According to this measure, if the controlled correlation vanishes, the correlation between the variables can be fully explained by the control variables (i.e., an indirect effect); otherwise it is a true correlation. Thus, the algorithm only selects the interactions for which the partial coefficients were significantly different from zero, at a defined p-value, producing an undirected dependence graph (UDG). In the graph, each node corresponds to a gene and an edge between a pair of genes indicates a direct dependence between their expression profiles. The classification of the inferred interactions was done using version 5.0 of the E. coli transcriptional network RegulonDB. In the Pearson 0th graph, obtained using the common correlation without control and a p-value of 10e-2, a large number of edges was retrieved (93,641), and only 12.2% of them could be characterized as TF-gene or coregulated. As the order of correlation increases, the proportion of characterized interactions also rapidly increases, because indirect edges are eliminated. For instance, the Pearson 1st p-value 10e-2 graph had 886 interactions, among them 65.6% uncharacterized and 34.4% characterized.
Nevertheless, better results were achieved with Pearson 2nd partial coefficients, where 70% of the edges found (p-value 10e-2) correspond to TF-gene interactions (22.8%) or co-regulated operons (47.2%). Also, as one could expect, as we decrease the p-value the inference becomes more precise, with the drawback of being very stringent and discarding some direct edges. For example, with Pearson 2nd and p-value 10e-4, 75.4% of links are experimentally known, although there was a reduction in the number of links identified, from 81 (10e-2) to 40 interactions. In consequence, we found that partial coefficients were able to correctly recover a large proportion of TF-gene interactions as well as coregulated operons, mainly with first-order and second-order coefficients. Therefore, partial correlation analysis can be employed as a method for prediction of putative regulatory interactions using expression data, as a complementary approach to transcription factor binding site tools and other tools that aim to detect co-regulated genes. In this sense, the interactions classified as uncharacterized in this study can be seen as feasible hypotheses generated by the model and may be biochemically validated through an experimental assay, such as chromatin immunoprecipitation (ChIP). For the Pearson 1st p-value 10e-2 graph alone, there are 68 predicted TF-gene interactions to be further studied. As future work, the analysis of the whole transcriptome of E. coli is in progress. REFERENCES Barrett,T., Suzek,T.O., Troup,D.B. et al. (2005) NCBI GEO: mining millions of expression profiles -- database and tools. Nucleic Acids Res, 33(Database issue):D562-6. Cheng,J., Sun,S., Tracy,A. et al. (2004) NetAffx Gene Ontology Mining Tool: a visual approach for microarray data analysis. Bioinformatics, 20(9):1462-3. de la Fuente,A., Bing,N., Hoeschele,I., Mendes,P. (2004) Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics, 20(18):3565-74. Gentleman,R.C., Carey,V.J., Bates,D.M. et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol, 5(10):R80. Lin,E.C.C. and Lynch,A.S. (1996) Regulation of Gene Expression in Escherichia coli. Chapman & Hall, USA. Salgado,H., Gama-Castro,S., Peralta-Gil,M. et al. (2006) RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res, 34(Database issue):D394-7.
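
For concreteness, here is a small sketch of a first-order partial correlation, i.e. the correlation between two genes conditioned on one controlling gene. The toy data simulate two targets driven by a common regulator, so the ordinary correlation is high while the partial correlation (controlling for the regulator) collapses; the significance testing performed by ParCorA is omitted.

```python
import numpy as np

def first_order_partial_corr(x, y, z):
    # r_xy.z = (r_xy - r_xz*r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2))
    r = lambda a, b: np.corrcoef(a, b)[0, 1]
    rxy, rxz, ryz = r(x, y), r(x, z), r(y, z)
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# toy example: x and y are both driven by the regulator z (an indirect link)
rng = np.random.default_rng(0)
z = rng.normal(size=200)
x = z + 0.1 * rng.normal(size=200)
y = z + 0.1 * rng.normal(size=200)
print(np.corrcoef(x, y)[0, 1])             # high zeroth-order correlation
print(first_order_partial_corr(x, y, z))   # near zero once z is controlled for
```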


9. Todd A. Gibson, Debra S. Goldberg. University of Colorado Health Sciences Center, Mail Stop 8303, PO Box 6511, Aurora, CO 80045, USA. [ PDF ]

Modeling the evolution of gene and protein interactions

Short Abstract: Current protein network analyses either use abstract evolutionary models lacking gene context, or model current-day protein interactions without a dynamic evolutionary component. We present a generalizable method for evolving an organism's putative ancestral protein interaction network to its current-day interactions. We compare evolutionary parameters and topology of distinct protein families.


Long Abstract: Graph-theoretic models ('networks') which use nodes to represent genes and links between the nodes to represent interactions provide valuable insight into our understanding of genetic network evolution [4,1]. However, such evolutionary models are decontextualized, theoretical constructs. Gene networks can be contextualized by mapping them to experimentally-derived protein interactions of model organisms. Although comparative analyses between organisms' gene and protein networks enable evolutionary inferences [3], they do not capture the dynamics of evolution present in their theoretical counterparts. We present a model which evolves the genetic network of an organism from a set of putative ancestral protein interactions to its current-day protein interaction network. Our evolutionary model incorporates both gene and genome duplication events for gene birth, and loss of gene duplicates for gene death. Recently derived rates for the birth and death of paralogous Arabidopsis thaliana genes (the 'paranome') [2] inform our model. Subfunctionalization and neofunctionalization are modeled as the gain and loss of interactions between genes. We use a duplication and divergence evolutionary model [4], modified to account for homodimers and genome duplication events. This evolving genetic network model of Arabidopsis thaliana permits views of the entire evolving paranome as well as comparative topological analyses between functional subnetworks of genes. The evolving protein interaction networks and subnetworks can also be visualized as an animation of gene births and deaths accompanied by the gain and loss of interactions. The method is applied to Arabidopsis thaliana but is generalizable to other model organisms. Bibliography 1 S. N. Dorogovtsev and J. F. F. Mendes. Evolution of networks. Advances in Physics, 51:1079, 2002. 2 Steven Maere, Stefanie De Bodt, Jeroen Raes, Tineke Casneuf, Marc Van Montagu, Martin Kuiper, and Yves Van de Peer. Modeling gene and genome duplications in eukaryotes. Proc Natl Acad Sci U S A, 102(15):5454-5459, Apr 2005. 3 Roded Sharan and Trey Ideker. Modeling cellular machinery through biological network comparison. Nat Biotechnol, 24(4):427-433, Apr 2006. 4 Ricard V Sole, Romualdo Pastor-Satorras, Eric Smith, and Thomas B Kepler. A model of large-scale proteome evolution. Advances in Complex Systems, 5:43, 2002.
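
A minimal sketch of one duplication-and-divergence step on an undirected interaction network is given below. The retention and homodimer probabilities are placeholders, not the Arabidopsis birth/death rates cited in the abstract.

```python
import random

def duplication_divergence_step(genes, edges, p_keep=0.6, p_homodimer=0.05):
    # One gene-duplication event: the new copy inherits each interaction of its
    # parent with probability p_keep, and may interact with the parent itself
    # (a crude stand-in for homodimer handling). Parameters are illustrative.
    parent = random.choice(list(genes))
    child = max(genes) + 1
    genes.add(child)
    for a, b in list(edges):
        if parent in (a, b) and random.random() < p_keep:
            other = b if a == parent else a
            edges.add(tuple(sorted((child, other))))
    if random.random() < p_homodimer:
        edges.add(tuple(sorted((child, parent))))
    return genes, edges

# toy usage: grow a tiny network by five duplication events
genes, edges = {0, 1, 2}, {(0, 1), (1, 2)}
for _ in range(5):
    genes, edges = duplication_divergence_step(genes, edges)
print(len(genes), len(edges))
```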


10. Raul Rodriguez-Esteban, Ivan Iossifov, Andrey Rzhetsky. Columbia University, 1130 St. Nicholas Av. #812, New York, New York 10032, USA. [ PDF ]

Imitating manual curation of text-mined facts in biomedicine

Short Abstract: Text-mining algorithms make mistakes in extracting facts from natural-language texts. In biomedicine, it is critical to assess the extraction quality of individual facts. Using a large set of almost 100,000 manually produced evaluations, we implemented algorithms that mimic human evaluation of facts provided by an automated information-extraction system.


Long Abstract: Text-mining algorithms make mistakes in extracting facts from natural-language texts. In biomedical applications, which rely on the use of text-mined data, it is critical to assess the extraction quality (the probability that the message is correctly extracted) of individual facts. Using a large set of almost 100,000 manually produced evaluations (most facts were independently reviewed more than once, producing independent evaluations), we implemented and tested a collection of algorithms that mimic human evaluation of facts provided by an automated information-extraction system [1]. The algorithms used include several Bayesian classifiers, SVMs, Neural Networks and Maximum Entropy methods. The performance of our best automated classifier, a second-order Maximum Entropy classifier, closely approached that of our human evaluators (ROC score close to 0.95). Were we to use a larger number of human experts to evaluate any given sentence, we could implement an artificial-intelligence curator that would perform the classification job at least as accurately as an average individual human evaluator. Hence we present a system that automatically curates the interactions that are extracted from the biomedical literature. This system is useful for enhancing the quality of information gathered by text-mining techniques. We illustrate our analysis by visualizing the predicted accuracy of the text-mined relations involving cocaine. [1] Rzhetsky A, et al. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform 37:43-53.
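
As a toy stand-in for the classification task described (binary logistic regression is the standard maximum-entropy formulation for two classes), the sketch below scores an extracted fact from a few made-up features. The feature names, data and library choice are assumptions, not the system's actual features or model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per extracted fact: parser confidence, sentence length,
# and whether the gene/substance pair is already known. Labels are toy stand-ins
# for human evaluator verdicts (1 = fact judged correct).
X = np.array([[0.9, 12, 1], [0.4, 30, 0], [0.8, 15, 1],
              [0.2, 40, 0], [0.7, 18, 1], [0.3, 35, 0]])
y = np.array([1, 0, 1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[0.85, 14, 1]])[0, 1])  # estimated P(fact is correct)
```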


11. Jin-Wu Nam, Je-Geun Joung, Byoung-Tak Zhang. Center for Bioinformation Technology, Seoul National University, Silimdong kwanak-ku, Seoul, 151-742, Korea. [ PDF ]

Evolutionary optimization of parametric tree structured program to search RNA common structures

Short Abstract: Parameterized genetic programming (PGP) is an evolutionary algorithm for finding common-structure descriptors (CSDs) from unaligned sequences for new RNA prediction. We searched for the CSDs of several ncRNAs, including pre-miRNAs. Applying the CSDs, we performed a generalization test with pre-miRNAs and obtained performance comparable to previous methods.


Long Abstract: The functional classes of ncRNAs can be better described through their structures than through their base sequences, because the structures of ncRNAs have been conserved by evolutionary pressure, allowing their functions to be maintained despite substantial sequence variations (1). Hence, various computational methods for structure prediction, modeling, and identification have been suggested to reveal the hidden functions of ncRNAs (2-5). Among those methods, computational methods to probe conserved structure might lead not only to the discovery of new RNAs but also to the comprehension of functional and regulatory relationships among related RNAs (4;5). A straightforward way to identify conserved structures is by multiple sequence alignment and its profiles. Profile hidden Markov models (HMMs), such as HMMer, based on the frequency and the transition probability of the sequences, are usually used to detect conserved motifs using multiple sequence alignment (2). Although this is a common approach to detecting conserved motifs, it does not consider structural conservation. Covariance models, such as INFERNAL, are usually used to detect sequentially and structurally conserved motifs by introducing a concept of co-evolution of base pairs (6). The success of covariance models, however, depends on finely curated structural multiple alignments. An alternative way is to model all structural motifs, to evaluate the presence or absence of these motifs and to optimize the stability of the structures. From this viewpoint, several methods of modeling RNA structure have been developed to define and search for abstract RNA structural motifs (7). These methods represent various RNA secondary structure elements such as loops, bulges, stems and mispairs, as well as peculiar structural motifs such as pseudo-knots, using context-sensitive models. Meanwhile, structure definition languages such as RNAMotif allow abstraction of a structural pattern into a "descriptor" incorporating structural parameters, so that it can give detailed information regarding base pairing, length and motifs based on a context-sensitive grammar (8). However, defining or searching for structural descriptors representing common structures of arbitrary RNAs is computationally challenging. Therefore, methods which can learn common structures or motifs of given RNAs are required. Evolutionary algorithms make it possible to go beyond exhaustive search over a large search space by optimizing the stability and the similarity of structures using free energy, length, and sequence similarity (3;9;5). Evolutionary algorithms have been broadly applied to various optimization problems in biological fields (10;11). However, these genetic algorithms do not evolve the structural language itself, representing structural motifs while maintaining a high-level description, but instead evolve string individuals converted to describe structural information. A solution to overcome this drawback is genetic programming (GP), which can automatically create a working genetic program (12), generally represented by a tree structure. Specifically, genetic programming with grammatical information, which has been used in areas such as data mining, is promising for such a problem (13). Here, we propose parameterized genetic programming (PGP) based on our previous work (14).
PGP is a new type of genetic programming in which genetic programs are augmented with parameters, and is thus natural for searching for an RNA common-structure descriptor (CSD) without aligning the sequences. Alignment requires substantial computational time and introduces unwanted biases. In addition, structural alignment is a computationally intractable problem. PGP evolves CSDs based on a training data set through grammatical tree structures by encoding the parameterized rules for structural descriptors. The definition of the rules makes it possible to optimize the CSDs via the genetic operators used in genetic programming. Our method has been applied to various RNA sequences including tRNAs, 5S rRNAs, U7 snRNA and pre-miRNAs. Importantly, the optimized CSDs can be used as classifiers to search for new RNAs in a database. In the prediction of ncRNAs with family types, the accuracy of the method could be improved by introducing the committee concept for CSDs. Finally, with the evolved CSDs, we predicted and verified 5 new pre-miRNAs on human chromosomes 20 and X. 1 Klosterman, P.S., Hendrix, D.K., Tamura, M., Holbrook, S.R. and Brenner, S.E. (2004) Three-dimensional motifs from the SCOR, structural classification of RNA database: extruded strands, base triples, tetraloops and U-turns. Nucleic Acids Res, 32, 2342-2352. 2 Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755-763. 3 Chen, J.H., Le, S.Y. and Maizel, J.V. (2000) Prediction of common secondary structures of RNAs: a genetic algorithm approach. Nucleic Acids Res, 28, 991-999. 4 Mathews, D.H. (2005) Predicting a set of minimal free energy RNA secondary structures common to two sequences. Bioinformatics, 21, 2246-2253. 5 Taneda, A. (2005) Cofolga: a genetic algorithm for finding the common folding of two RNAs. Comput Biol Chem, 29, 111-119. 6 Eddy, S.R. (2002) A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3, 18. 7 Matsui, H., Sato, K. and Sakakibara, Y. (2005) Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures. Bioinformatics, 21, 2611-2617. 8 Macke, T.J., Ecker, D.J., Gutell, R.R., Gautheret, D., Case, D.A. and Sampath, R. (2001) RNAMotif, an RNA secondary structure definition and search algorithm. Nucleic Acids Res, 29, 4724-4735. 9 Fogel, G.B., Porto, V.W., Weekes, D.G., Fogel, D.B., Griffey, R.H., McNeil, J.A., Lesnik, E., Ecker, D.J. and Sampath, R. (2002) Discovery of RNA structural elements using evolutionary computation. Nucleic Acids Res, 30, 5310-5317. 10 Cooper, L.R., Corne, D.W. and Crabbe, M.J. (2003) Use of a novel Hill-climbing genetic algorithm in protein folding simulations. Comput Biol Chem, 27, 575-580. 11 Fogel, G.B. and Corne, D.W. (2003). Morgan Kaufmann Publishers, San Francisco. 12 Saetrom, P., Sneve, R., Kristiansen, K.I., Snove, O., Jr., Grunfeld, T., Rognes, T. and Seeberg, E. (2005) Predicting non-coding RNA genes in Escherichia coli with boosted genetic programming. Nucleic Acids Res, 33, 3263-3270. 13 Wong, M.L. and Leung, K.S. (2000) Data Mining Using Grammar Based Genetic Programming. Springer. 14 Nam, J.W., Joung, J.G., Ahn, Y.S. and Zhang, B.T. (2004) Two-Step genetic programming for optimization of RNA common-structure. Lecture Notes in Computer Science, 3005, 73-83.
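
To give a flavour of what a parameterized structural descriptor evaluates, here is a deliberately tiny sketch that checks a single stem-loop, with stem length and loop-length bounds as parameters. Real CSDs in this work are evolved, tree-structured programs covering far richer motif combinations; this example and its parameter values are assumptions for illustration only.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def matches_stem_loop(seq, stem_len, loop_min, loop_max):
    # True if the sequence can fold into one stem-loop whose stem length and
    # loop-length range match the descriptor parameters (Watson-Crick + GU pairs).
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    loop = n - 2 * stem_len
    if not (loop_min <= loop <= loop_max):
        return False
    return all((seq[i], seq[n - 1 - i]) in PAIRS for i in range(stem_len))

print(matches_stem_loop("GGGAAAACCC", stem_len=3, loop_min=3, loop_max=5))  # True
print(matches_stem_loop("GGGAAAAAAA", stem_len=3, loop_min=3, loop_max=5))  # False
```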


12. Alok Mishra, Duncan Gillies. Imperial College, 180 Queen's Gate, London, SW7 2AZ, UK. [ PDF ]

Effect of microarray data heterogeneity on regulatory gene module discovery algorithms

Short Abstract: To validate our hypothesis that merging microarrays from different researchers (but the same experiment type) reinforces genes specific to that experiment type, whereas merging microarrays from diverse experiment types reinforces global housekeeping genes, we are conducting a systematic study of cluster variation with increasingly heterogeneous experiments.


Long Abstract: Microarrays allow us to study a large proportion of genome expression simultaneously. Initially, clustering algorithms were used to study the co-expression of genes in order to make sense of growing amounts of this data. Recently, many researchers [SSR+ 03, BJGL+ 03, TSKS04] have incorporated prior knowledge in the form of known transcription factors or DNA binding data to guide the clustering process. Results of such clustering have been evaluated either by studying the robustness of the clusters themselves or by analysing the biological significance of the clusters, either with the help of domain experts or using Gene Ontology (GO) terms. Despite researchers' claims about the significance of the results of such methods, there has not been a systematic study of the effectiveness of such algorithms. Our hypothesis is that as microarrays from different types of conditions (but the same experiment type, e.g. stress) are merged, we should be able to obtain stronger signals, i.e. the clusters obtained should be similar across such experiments and should thus reinforce the signal (co-expressed and co-regulated genes) while suppressing the noise (noisy data). But when microarrays from various experiment types are merged together, the local signals (genes co-expressed under individual conditions) will be replaced by global signals (housekeeping genes). So, in the end, when we mix microarrays from different types of experiments to obtain the clusters, the only clusters that should be significantly enriched are the ones that represent the housekeeping genes related to the cell cycle that are active under all conditions. In order to validate our hypothesis we are conducting a systematic study of the variation among the resulting gene clusters (where the clustering process is guided with prior knowledge), starting from individual experiment types and gradually mixing in heterogeneous data. As a side effect this also results in a study of the effectiveness of these algorithms under changing amounts of data (number of experiments). Some of the other questions we are interested in are: Are more experiments better, or do they contribute more noise than signal? Are separate types of experiments better, or can we just mix them all together to obtain some background biological knowledge? When we merge a large number of experiments, the required computation soon starts to become intractable. In future we would like to extend our study by using active learning to select the more informative experiments from a huge set of replicate experiments. If our hypothesis is true, our framework can incrementally use the growing tide of microarray data as it arrives to come up with a stable background model. This background information can then be subtracted from the individual condition experiments in order to come up with more stable, specific signals. Another direction of research, if our hypothesis is true, is that the background information can be encoded in terms of a pooled covariance matrix in order to shrink the individual covariance matrix (for specific experiments) towards a stable matrix that can be inverted for GGM (Graphical Gaussian Model) analysis. References [BJGL+ 03] Ziv Bar-Joseph, Georg K Gerber, Tong Ihn Lee, Nicola J Rinaldi, Jane Y Yoo, François Robert, D Benjamin Gordon, Ernest Fraenkel, Tommi S Jaakkola, Richard A Young, and David K Gifford. Computational discovery of gene modules and regulatory networks. Nature Biotechnology, 21(11):1337-1342, 2003.
[SSR+ 03] Eran Segal, Michael Shapira, Aviv Regev, Dana Pe'er, David Botstein, Daphne Koller, and Nir Friedman. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genetics, 34(2):166-176, 2003. [TSKS04] Amos Tanay, Roded Sharan, Martin Kupiec, and Ron Shamir. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. PNAS, 101(9):2981-2986, 2004.
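
One simple way to quantify the cluster variation being studied is to compare cluster assignments before and after merging data with an agreement index. The sketch below uses random data and plain k-means purely as placeholders for the guided, prior-knowledge-driven clustering described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Placeholder expression matrices (genes x arrays) for two experiment types.
rng = np.random.default_rng(1)
stress = rng.normal(size=(200, 20))       # 200 genes x 20 "stress" arrays
cellcycle = rng.normal(size=(200, 20))    # 200 genes x 20 "cell-cycle" arrays

# Cluster genes on one experiment type, then on the column-wise merged data,
# and measure how much the gene clusters shift when heterogeneous data is added.
base = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(stress)
merged = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(
    np.hstack([stress, cellcycle]))
print(adjusted_rand_score(base, merged))  # 1.0 = identical clusters, ~0 = unrelated
```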


13. Alejandro Reyes, Juan Rodrigo Cubillos, Andrés Cubillos, María Mercedes Zambrano. Molecular Genetics Group, Corpogen, Cra 5 #66A-34, Bogotá, Distrito Capital 00000, Colombia. [ PDF ]

Comparative Genomics in Mycobacteria: insights from Multiple Genome Alignments

Short Abstract: Since the first Mycobacterium tuberculosis genome sequence was published, a great effort has been made to identify genomic differences among phenotypically different strains. Our studies based on Multiple Genome Alignment have allowed the characterization of the main differences not only within the M. tuberculosis complex but also among different Mycobacterium species.


Long Abstract: Genetic variability in mycobacteria has generated a lot of interest in the effort to understand differences among strains and closely related species. Since the first complete mycobacterial genome sequence was published, different molecular biology techniques have been employed with the aim of identifying genomic differences among M. tuberculosis strains and of explaining changes in pathogenicity among them. Understanding these changes is particularly interesting since variability in mycobacteria is thought to be generated mostly by mutations. The recent development of more accurate techniques in comparative genomics and the sequencing of new mycobacterial genomes have allowed a more detailed analysis of genetic variability, including the detection of Single Nucleotide Polymorphisms (SNPs) among sequenced strains. As a result, great advances in the development of typing techniques used in molecular epidemiology for mycobacteria have been made. However, until now, it has not been possible to fully understand the mutations responsible for the differences in pathogenicity between strains. In order to learn more about the differences that characterize distinct mycobacteria, we have begun using bioinformatic tools to analyze mycobacterial genomes. To do this we utilized the program MAUVE, a multiple genome alignment program (Darling, A. et al; Genome Research 14:1394-1403, 2004), to align the different mycobacterial genomes available. An initial alignment of M. bovis AF2122/97 and two M. tuberculosis strains (H37Rv and CDC1551) allowed the identification of all the previously described regions of difference (RDs and RvDs). Additionally, when the variability within the species of the TB complex was analyzed, a great degree of conservation was found (> 90%) and an almost complete colinearity was evident between the genomes; the variability among those genomes was mainly associated with IS and phage sequences, or proteins from the PE/PPE family. When genomes of mycobacteria not belonging to the TB complex were aligned, it was surprising to find a high number of conserved genes in spite of the large genomic rearrangements detected. This analysis also revealed the presence of hot spot regions for the accumulation of mutations and the occurrence of genomic rearrangements. This finding is in agreement with reports that identified accumulation of mutations within highly variable regions of the M. tuberculosis genome. Additionally, it has also been found that genes conserved in mycobacteria that do not belong to the TB complex, and absent from mycobacteria of this group, are found predominantly in regions of low variability within the TB complex strains. Genomic analysis of mycobacterial strains can therefore allow identification of regions and/or genes that could be responsible for the various phenotypes observed. Given that many of the changes observed among M. tuberculosis complex strains involve small insertions or deletions, it is possible that differences in pathogenicity could be due to changes in gene regulation more than to the presence or absence of specific genes. This difference in regulation could be caused by rearrangements or by the insertion or deletion of IS sequences and other mobile elements. Having identified the differences among strains via genomic analysis, the next step will be to study the biological significance of these changes in a laboratory setting.


14. Prerna Sethi, Chokchai Box Leangsuksun. Louisiana Tech University, 1109 Rita Lane, Ruston, LA 71270, USA. [ PDF ]

A Computational Paradigm for Fast, Parallel Knowledge Integration for Analysis of Gene Expression Data

Short Abstract: We present a novel computational framework to achieve fast, robust, and accurate (biologically significant) multi-class classification of gene expression data, specifically for cancer genomics applications, using distributed knowledge discovery and integration computational routines. The framework is implemented on a shared-memory multiprocessor supercomputing environment, and several experimental results are provided to demonstrate the strength of the algorithms.


Long Abstract: Rapid technological advancements in the field of microarray analysis have generated enormous amounts of gene expression data, a trend that has been increasing exponentially over the last few years. However, the handling and analysis of such data has become one of the major bottlenecks in the utilization of the technology. We present a novel framework to achieve fast, robust, and accurate (biologically-significant) multi-class classification of gene expression data using distributed knowledge discovery and integration computational routines, specifically for cancer genomics applications. The proposed paradigm consists of the following key computational steps: (a) preprocess and normalize the gene expression data; (b) discretize the data for the knowledge mining application; (c) partition the data using the proposed methods; (d) perform knowledge discovery on the partitioned data-spaces for association rule discovery; (e) integrate association rules from partitioned data and knowledge spaces on distributed processor nodes using a novel knowledge integration algorithm; and (f) perform post-analysis and functional elucidation of the discovered gene rule sets. As our target gene expression dataset, we took the global cancer maps (GCM) as reported in [Ramaswamy et al. 2001]. Due to the inherent noise, the dataset is preprocessed and normalized. The dataset now consists of 10,887 genes and 127 samples and is further discretized for the knowledge mining application. The following strategies are applied to partition the dataset. Overlapped Vertical Partitioning: the dataset is partitioned into w windows of equal size, with an overlap between consecutive windows. Thus, if the overlap is w - 1, then the resulting total number of windows is N - w + 1, where N is the total number of data elements and w is the specified window size. Adaptive Selection: adaptive partitioning is achieved by the k-means clustering algorithm. This algorithm selects the first k initial clusters by choosing k rows of data (each representing a particular sample) randomly from the dataset. It then calculates the arithmetic mean of each cluster (the mean of all individual records) formed in the dataset. The degree of deviation of each record from the center of the cluster is also calculated. If the deviation is less than the threshold value lambda, then the record is assigned to that cluster. Based on the threshold value, each record in the dataset can belong to more than one cluster, creating complex overlap among the records. The association rule mining algorithm FP-growth [Han et al. 2000] is applied in parallel to the partitioned datasets to obtain gene products and to find regulatory relationships among the genes. Implementation of the FP-growth algorithm on the partitioned datasets generated frequent gene-sets pertaining to the processor node to which the partition belonged. In order for us to analyze, interpret, and mine the rules between inter-processor gene-sets, all rules must be collected on a single node. A novel and efficient algorithm, Genesetmining, is employed to merge the frequent gene-sets residing on the various processor nodes. The algorithm discovers frequent gene-sets by using updates occurring in the form of new sample-spaces to aggregate frequent gene-sets. The details of the Genesetmining algorithm can be found in [Sethi et al. 2006].
The framework is implemented on a shared-memory multiprocessor supercomputing environment, and several experimental results are obtained to demonstrate the strength of the algorithms. We have presented a computational framework for fast knowledge discovery from multi-class gene expression data. The framework harnesses the power of distributed computing and novel knowledge discovery and integration computational routines. As demonstrated in our experiments, the framework exhibits significant gains over traditional knowledge discovery on unpartitioned data-spaces running on a uni-processor machine.
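As a concrete illustration of the overlapped vertical partitioning step described above, the following Python sketch (an assumed, simplified rendering, not the authors' implementation) shows how a dataset of N elements split into windows of size w with an overlap of w - 1 yields N - w + 1 windows.

```python
# Illustrative sketch of overlapped vertical partitioning; the function and
# parameter names are chosen here for clarity and are not from the authors' code.
from typing import List, Sequence


def overlapped_partitions(data: Sequence, w: int, overlap: int) -> List[Sequence]:
    """Split `data` into windows of size w, consecutive windows sharing `overlap` items."""
    if not 0 <= overlap < w:
        raise ValueError("overlap must satisfy 0 <= overlap < w")
    step = w - overlap
    return [data[i:i + w] for i in range(0, len(data) - w + 1, step)]


# With overlap = w - 1 the step is 1, giving N - w + 1 windows, matching the abstract.
windows = overlapped_partitions(list(range(12)), w=4, overlap=3)
assert len(windows) == 12 - 4 + 1
```

Each window could then be mined independently (for example with FP-growth) on its own processor node before the Genesetmining merge step.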


15. Marcio Succar Moreira, Sergio Manuel Serra da Cruz, Carla Tavares dos Reis. UFRJ, Rua Ivo do Prado, 94, Rio de Janeiro, Rio de Janeiro 23080200, Brazil. [ PDF ]

SIMaS: Retrieving biological images through the use of semantic Web services

Short Abstract: Biological images are vital for managing natural resources. We present an architecture based on Web services that provides a semantic reasoning layer between users and stored data. It enables handling traditional data and obtaining semantically rich results from queries of the image database. It also maintains provenance and annotations for images.


Long Abstract: There is an increasing awareness about environmental issues. Scientists, citizens and governments need informatics to support their efforts in shaping public policies and managing natural resources. The recent group of information systems that deal with environmental, ecological and biological problems, from research to practical applications, has led to a new field called environmental informatics (1). Environmental informatics needs to integrate vast amounts of information from different sources and of different types, ranging from biological/temporal/geographical data to images stored in databases. Therefore, these systems also need to handle semantic heterogeneity (3). Semantic aspects of information integration are drawing attention from the research community, and ontologies are a valuable artifact for database integration (2). They capture the semantics of information and can be used to retrieve or store the related metadata. Although multiple engineering artifacts exist in the domain of ecology, ontologies do not exist there to the same extent as, for instance, in cell biology. As a result, we can take advantage of lessons learned from developing ontologies in other biology disciplines, most notably in molecular biology. Environmental and ecological studies use images as a vital part of the scientific record, and the significance of such images for identifying the species from a given geo-spatio-temporal region cannot be overstated. Indeed, an appropriate set of images can even be used to help describe new species. So, in order to help field biologists, we present an open-source tool that provides a semantic reasoning layer between the users and the stored metadata. It enables scientists to: manage traditional data; use a semantic matching algorithm that provides more accurate results in response to semantically rich searches of the image database; and, finally, record annotations and maintain provenance logs of biological images in a convenient way. The problem of creating metadata for biological images has been of vital importance to biology. Some approaches are commonly used in annotating images, such as keywords, controlled vocabularies, classifications and free-text descriptions. Unfortunately, they all present open issues. So, in our case study, we use the ontological approach. We build ontological models of the concepts involved in the image repository. Our ontology is based both on the classification proposed by Linnaeus (4) and on phenetics principles (7). For instance, these allow biologists to add annotations about a species' morphological and behavioral characteristics, the localization of its occurrences, its relationships to other groups of organisms, and so on. As in (5), our ontology may be used for four purposes: annotation terminology (the model provides the terminology and concepts by which metadata of the images are expressed); view-based search (the model provides different views of the concepts); semantic browsing (after finding an image, the ontology model together with image instance data can be used to find relations between the selected image and other images in the repository); and provenance log storage (a biological image's metadata should include information about the pedigree of the data, including facts such as who, where, or what processes created the image, what instrument recorded it along with machine-specific settings and parameters, and when it was generated and/or used).
SIMaS is a Java, object-oriented, open-source, web-enabled tool with a distributed architecture based on a set of semantic Web services (6) that helps biologists to: (i) store biological images and annotation data in a MySQL database; (ii) retrieve images and provenance data from semantic queries; (iii) browse the ontology to retrieve semantically related images; and (iv) recover, in a single search, a geographic visualization provided by a third-party service provider such as GoogleEarth (8). The SIMaS ontology was implemented as an OWL file in Protege, and the query mechanism uses SPARQL for all inferences. Our system stores not only the image itself but also its annotations, and it registers the geographic coordinates (latitude, longitude) and the moment at which the image was taken. So, we are able to retrieve, in a single search, an image's attributes and its geographical position from the point of view of GoogleEarth satellites; this point of view is often very useful in environmental impact studies. Semantic browsing is another interesting SIMaS feature: for instance, if a biologist queries images of a given species, a phenetic tree (presented as a directed acyclic graph) is automatically built, showing both the annotations and the Linnaean taxonomy. Our ongoing work has shown how semantic Web services can be used for precise information retrieval, helping scientists in annotation and provenance tasks. Furthermore, the use of ontologies enriched knowledge about the management of natural resources and allowed more meaningful answers to queries. As future work, we will enhance the image classification process by improving the ontologies. 1- Radermacher, F.J., Riekert, W.-F., Page, B., Hilty, L.M. Trends in Environmental Information Processing, in Applications and Impacts, Information Processing 94, Volume 2, Proceedings of the IFIP 13th World Computer Congress, E. Raubold, Ed. Hamburg, Germany, pp. 597-604. 1994. 2- Gruber, T.R., Guarino, N., Poli, R. Toward Principles for the Design of Ontologies Used for Knowledge Sharing. In Formal Ontology in Conceptual Analysis and Knowledge Representation. Kluwer Academic Publishers, in press. Substantial revision of a paper presented at the International Workshop on Formal Ontology, Padova, Italy, March 1993. 3- Sheth, A. Changing Focus on Interoperability in Information Systems: from System, Syntax, Structure to Semantics, in Interoperating Geographic Information Systems, C. Kottman, Ed. Norwell, MA: Kluwer Academic, pp. 5-29. 1999. 4- Animal Diversity Web. Available at: http://animaldiversity.ummz.umich.edu. Last access: 26/05/2006. 5- Hyvonen, E., Styrman, A., Saarela, S. Ontology-Based Image Retrieval. Available at: http://www.seco.tkk.fi/publications/2002/hyvonen-styrman-saarela-ontology-based-image-retrieval-2002.pdf. 6- Wroe, C., Stevens, R., Goble, C., Roberts, A., Greenwood, M.: A Suite of DAML+OIL Ontologies to Describe Bioinformatics Web Services and Data. The International Journal of Cooperative Information Systems 12, pp. 597-624. 2003. 7- Legendre, P., Legendre, L. Numerical Ecology. 2nd edition, Elsevier Science BV, Amsterdam. 1998. 8- Google Earth. Available at: http://earth.google.com/. Last access: 26/05/2006.
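For illustration only, a semantic query of the kind SIMaS supports might look as follows in Python with rdflib; the ontology file, namespace and property names are hypothetical placeholders rather than the actual SIMaS schema.

```python
# Hypothetical sketch: querying an image ontology with SPARQL via rdflib.
from rdflib import Graph

g = Graph()
g.parse("simas_ontology.owl", format="xml")  # assumed local OWL (RDF/XML) file

query = """
PREFIX ex: <http://example.org/simas#>
SELECT ?image ?lat ?lon
WHERE {
    ?image a ex:BiologicalImage ;
           ex:depictsTaxon ?taxon ;
           ex:latitude ?lat ;
           ex:longitude ?lon .
    ?taxon ex:scientificName "Panthera onca" .
}
"""

# Each result carries the image identifier plus the coordinates needed to place
# it on a GoogleEarth-style geographic view.
for row in g.query(query):
    print(row.image, row.lat, row.lon)
```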


16. Je-Keun Rhee, Byoung-Tak Zhang. Graduate Program in Bioinformatics, Seoul National University, Silimdong Kwanak-ku, Seoul, 151-742, Korea. [ PDF ]

Analysis of Primary microRNAs by Structural Position-Weight-Matrix Mixture Model

Short Abstract: We make a structural chain for each primary microRNA based on its secondary structure. The structured pri-miRNAs are clustered by a mixture model of structural position weight matrices, learned by an EM algorithm. In our results, we find distinct characteristics in the miRNA secondary structures of each cluster.


Long Abstract: MicroRNAs (miRNAs) are ~22 nt molecules that act as post-transcriptional suppressors [1]. Generally, miRNAs bind to the 3' UTR of target genes and down-regulate the expression of those genes by translational repression or mRNA destabilization. Thus, miRNAs are important molecules in cellular processes, including gene regulation. However, the process by which mature miRNAs are generated is not clearly understood. MiRNA genes are transcribed by RNA polymerase II, and primary miRNAs, several kilobases long, are generated [2]. When these molecules are processed by a member of the RNase III enzyme family called Drosha, hairpin-shaped precursor miRNAs are created, and the pre-miRNAs are then cleaved by Dicer to generate mature miRNAs. Among these processing steps, the binding of Drosha to primary miRNAs occurs very precisely and is a critical step in miRNA biogenesis [3]. However, little is known about how Drosha recognizes its substrates, the pri-miRNAs. Although it is known that Drosha can crop the pri-miRNAs, no common sequence motif has been found. Therefore, it is plausible that the Drosha cut site is recognized by structural rather than sequence information. We therefore aimed to clarify the structural characteristics of primary miRNAs by computational learning approaches. We collected pri-miRNA sequences of 110 nt in length from miRBase release 7.0 and predicted the secondary structures of 280 human primary miRNAs using mfold. With mfold, the thermodynamic stability at each position in the primary miRNA secondary structures was calculated, distinguishing stacking energy, interior loop, bulge loop, external loop, and hairpin loop. We abbreviated each structural element to one letter: stack is S, interior loop is I, bulge loop is B, external loop is E, and hairpin loop is H. Secondary structures of primary miRNAs are therefore represented by the 5 letters S, I, B, E, H. We made a position matrix of all primary miRNA secondary structures, aligned on the Drosha cleavage sites. From this, we can derive a structural position weight matrix of human primary miRNAs. It differs from a general position weight matrix in that it uses the abbreviated secondary-structure letters rather than nucleic acid or amino acid sequences. However, not enough human miRNAs are known; the number of human miRNAs has been estimated at ~1000 [4]. That is, the currently known information is insufficient and may be biased. In addition, pri-miRNAs give rise to 5'-donors and 3'-donors, whose structural characteristics are distinct. Therefore, the characteristics of the secondary structures need to be analyzed in clustered groups. We investigated the structural characteristics using multiple structural position weight matrices instead of a single one. Given the basis for a structural PWM, we clustered the primary miRNA secondary structures using mixture models. All parameters of the position weight matrix mixture model are learned by the Expectation-Maximization (EM) algorithm [5], which maximizes the expected likelihood over the structural position weight matrices and mixing weights and iterates until the likelihood converges. After learning the mixture model, we can cluster the primary miRNAs using the hidden variables, which indicate how well each miRNA structure is explained by a specific structural position weight matrix.
Because the lengths of the aligned pri-miRNA secondary structures vary, we focused on the structural positions near the Drosha cleavage site in our computational experiments. After convergence of the optimization process, we clustered the total of 280 primary miRNA secondary structures. From each cluster, we could verify distinct structural characteristics of the secondary structures, including 5'-donor and 3'-donor structures, which cannot be found with a single structural position matrix covering all miRNAs. In this work, we thus found useful structural characteristics. Because of the limited number of known miRNAs, general position weight matrices may be strongly affected by noise; our model overcomes this problem through probabilistic mixture modeling, which is one of the major reasons why it carries more information about pri-miRNA secondary structures. Our analyses will help us to understand the molecular processing of primary miRNAs by Drosha. Moreover, the cleavage of small RNA transcripts by Drosha is one of the most important factors determining whether a transcript becomes a real miRNA or not. Therefore, our results will be useful for the prediction and verification of novel miRNAs. References [1] Bartel, D.P. (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116, 281-297. [2] Kim, V.N. (2005) MicroRNA biogenesis: coordinated cropping and dicing. Nat. Rev. Mol. Cell Biol., 6, 376-385. [3] Han, J., Lee, Y., Yeom, K.-H., Nam, J.-W., Hur, I., Rhee, J.-K., Son, S., Cho, Y., Zhang, B.-T., and Kim, V.N. (2006) Molecular basis for the recognition of primary microRNAs by the Drosha-DGCR8 complex, Cell, 125, 887-901. [4] Bentwich, I., Avniel, A., Karov, Y., Aharonov, R., Gilad, S., Barad, O., Barzilai, A., Einat, P., Einav, U., Meiri, E., Sharon, E., Spector, Y., Bentwich, Z. (2005) Identification of hundreds of conserved and nonconserved human microRNAs, Nat. Genet., 37(7), 766-770. [5] Hannenhalli, S. and Wang, L.S. (2005) Enhanced position weight matrices using mixture models, Bioinformatics, 21, Suppl. 1, i204-i212.
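A minimal sketch of the kind of mixture-of-PWMs clustering described above is given below (our own illustrative rendering with assumed parameter choices, not the authors' code); it fits K structural position weight matrices over the alphabet {S, I, B, E, H} by EM and assigns each aligned structure to the component that explains it best.

```python
# Illustrative EM for a mixture of structural position weight matrices.
import numpy as np

ALPHABET = "SIBEH"  # stack, interior loop, bulge loop, external loop, hairpin loop


def encode(structures):
    """Map structural strings of equal (aligned) length to an (n, L) integer matrix."""
    idx = {c: i for i, c in enumerate(ALPHABET)}
    return np.array([[idx[c] for c in s] for s in structures])


def em_pwm_mixture(X, K, n_iter=200, seed=0, pseudo=1e-3):
    rng = np.random.default_rng(seed)
    n, L = X.shape
    A = len(ALPHABET)
    pwm = rng.dirichlet(np.ones(A), size=(K, L))      # pwm[k, position, letter]
    mix = np.full(K, 1.0 / K)                         # mixing weights
    onehot = np.eye(A)[X]                             # (n, L, A)
    for _ in range(n_iter):
        # E-step: log-likelihood of each structure under each component PWM
        loglik = np.einsum("nla,kla->nk", onehot, np.log(pwm)) + np.log(mix)
        loglik -= loglik.max(axis=1, keepdims=True)
        resp = np.exp(loglik)
        resp /= resp.sum(axis=1, keepdims=True)       # responsibilities (n, K)
        # M-step: update mixing weights and position-specific letter frequencies
        mix = resp.mean(axis=0)
        counts = np.einsum("nk,nla->kla", resp, onehot) + pseudo
        pwm = counts / counts.sum(axis=2, keepdims=True)
    return pwm, mix, resp.argmax(axis=1)              # hard cluster assignment


# Example usage: cluster 280 aligned structural strings into K = 2 groups.
# structures = [...]  # strings over S/I/B/E/H, all of the same aligned length
# pwm, mix, labels = em_pwm_mixture(encode(structures), K=2)
```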


17. Talapady N Bhat. Bldg 227, NIST, 100 Bureau Drive, Gaithersburg, MD 20899, USA. [ PDF ]

Semantic Web and Chemical Ontology for Fragment-Based Drug Design and AIDS

Short Abstract: Chemistry and biology are emerging as among the most active areas for the information technology and Semantic Web efforts advocated by the W3C. We discuss and illustrate a use case for chemical taxonomies for the Semantic Web. http://esw.w3.org/topic/HCLS/ChemicalTaxonomiesUseCase.


Long Abstract: The popular paradigm in fragment-based drug discovery seeks to collect, compare, and test many chemically similar compounds. The success of such a method critically depends on three factors: 1) how complete and reliable the chemical ontology is; 2) how well the compounds in the database have been annotated, classified, and indexed to produce meaningful and manageable hits; and 3) how relevant steps (1) and (2) are to enzyme-drug interactions in the real world. 3-D structures narrate the secret stories of enzyme-drug interactions and unravel nature's secrets of drug resistance. Scientists use these stories to discover new drugs and to develop strategies to combat drug resistance. Solving 3-D structures is an expensive process; for this reason, only a small fraction of the over 1000 2-D structures have been solved in 3-D. A successful use of this vast amount of information on 2-D structures for drug development requires efficient techniques and standards to cross-index 2-D and 3-D structures and their fragments. We have developed a novel fragment-based technique called Chem-BLAST (Chemical Block Layered Alignment of Substructure Technique) to dynamically establish structural relationships among compounds using their fragments. Using this method, for the first time we have enabled the users of HIVSDB to criss-cross between 2-D and 3-D data for drug design purposes. This novel fragment-based technology readily circumvents the difficulty faced by medicinal chemists in taking advantage of 2-D structural data to understand their mode of interaction with the HIV enzymes. HIVSDB (http://xpdb.nist.gov/hivsdb/hivsdb.html) addresses all these issues through the development and deployment of state-of-the-art fragment-based techniques and tools for the 21st century. HIVSDB has over a thousand 2-D structures of potent HIV protease inhibitors, collected mostly from the published literature, and it has about 50% more 3-D structures of HIV protease inhibitor complexes than found in the PDB. HIVSDB uses a unique indexing and annotating technique for inhibitors that allows the organization of the information into a chemical data relationship (see http://xpdb.nist.gov/hivsdb/advanced_query_files/slide0002.htm or http://xpdb.nist.gov/hiv2_d/advanced_query_files/slide0002.htm) that may be readily translated into Semantic Web concepts either through text strings (pyridine-2-sulfonamide-or-2-pyridinesulfonamide is-type pyridine) or visual tools. For additional details see http://esw.w3.org/topic/HCLS/ChemicalTaxonomiesUseCase
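As a toy illustration of the 'is-type' fragment relations mentioned above (the sulfonamide parent is invented for demonstration, and this is not the HIVSDB implementation), such a chemical taxonomy can be held as simple triples and walked upward to more general scaffolds:

```python
# Illustrative encoding of fragment "is-type" relations as plain triples that
# could later be serialized to RDF/OWL; assumes an acyclic fragment hierarchy.
IS_TYPE = "is-type"

triples = [
    ("pyridine-2-sulfonamide", IS_TYPE, "pyridine"),      # example from the abstract
    ("pyridine-2-sulfonamide", IS_TYPE, "sulfonamide"),   # hypothetical second parent
]


def ancestors(fragment, triples):
    """Collect all more general scaffolds reachable via is-type edges."""
    parents = {p for (child, rel, p) in triples if child == fragment and rel == IS_TYPE}
    out = set(parents)
    for p in parents:
        out |= ancestors(p, triples)
    return out


print(ancestors("pyridine-2-sulfonamide", triples))
# -> {'pyridine', 'sulfonamide'}
```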


18. Laura Kavanaugh, David Curiel, Fred Dietrich, David Messina. Duke University, 2325 Stagecoach Drive, Hillsborough, NC 27278, USA. [ PDF ]

BioPerl Deobfuscator: Bridging the gap between programmers and biologists

Short Abstract: The BioPerl toolkit is a well-known open-source software package of tremendous value to biologists for accomplishing computational tasks. However, its object-oriented design introduces a level of complexity that makes it inaccessible to many users. The Deobfuscator has been developed to bridge the gap between BioPerl and its intended audience.


Long Abstract: The Deobfuscator program bridges the gap between the vast resources of the BioPerl toolkit and users in the biological community wishing to tap into these resources. The BioPerl toolkit is an open-source software package of tremendous potential value to biologists for accomplishing computational tasks common to many biological problems. However, its object-oriented design introduces a level of complexity that can make it difficult for users to determine how the different parts of BioPerl interact. The BioPerl Deobfuscator addresses this issue by providing a searchable web resource that shows the inheritance relationships among the BioPerl modules, along with the documentation for those modules, to enable users to navigate the toolkit. This capability is extremely important given that many of the BioPerl toolkit's users are experts in biology, not computer programming. At a broader level, the conceptual design of the Deobfuscator can be applied to any complex object-oriented system written in Perl. The Deobfuscator program is freely available via the internet at http://bioperl.org/cgi-bin/deob_interface.cgi


19. Feng Chen, Aaron J Mackey, Jeroen K Vermunt, David S Roos. Biological Chemistry Graduate Program, University of Pennsylvania, 231 S 34 St, #314, Philadelphia, PA 19104, US. [ PDF ]

Evaluating Orthology Detection Approaches in the Absence of a Gold Standard

Short Abstract: Latent Class Analysis permits performance evaluation for multiple tests based on observed patterns of agreement. Applying this statistical methodology to various orthology identification methods reveals that INPARANOID, Orthostrapper, and OrthoMCL exhibit the best balance between sensitivity and specificity. OrthoMCL also offers additional advantages of speed and applicability to multi-species datasets.


Long Abstract: The rapid growth of genome sequence data, from an ever-increasing range of relatively obscure species, places a premium on the automated identification of orthologs to facilitate functional annotation, comparative genomics and evolutionary studies. The problem is particularly acute for eukaryotic genomes, because of their large size, the difficulty of defining accurate gene models, the complexity of protein domain architecture, and rampant gene duplications. Several strategies have been employed to distinguish probable orthologs from paralogs: phylogeny-based approaches include RIO (Resampled Inference of Orthology) and Orthostrapper/HOPS (Hierarchical grouping of Orthologous and Paralogous Sequences); strategies based on evolutionary distance metrics include RSD (Reciprocal Smallest Distance); BLAST-based approaches include Reciprocal Best Hits (RBH), COG/KOG (Clusters of Orthologous Groups), and INPARANOID. We have previously described the OrthoMCL algorithm, which improves on RBH by recognizing many-to-many co-ortholog relationships, using a normalization step to correct for systematic biases when comparing specific pairs of genomes, and using a Markov clustering algorithm to define ortholog groups. The OrthoMCL algorithm is fully automatable, requiring no manual curation. Despite the large number of orthology identification approaches now available, no comprehensive comparison has yet been reported, in part because the lack of a genomic-scale error-free 'gold standard' data set makes it difficult to analyze performance. Functional genomics data are sometimes used as a surrogate for true orthology in performance assessments, but such data are known to include many false positives (FP) and false negatives (FN), and are difficult to apply across large evolutionary distances. Latent Class Analysis (LCA) is a statistical approach that has been widely employed for the analysis of multivariate categorical data in clinical diagnostics, marketing research, sociology, and other areas. LCA uses the agreement or disagreement data between methods to infer FP and FN rates, permitting quantitative comparisons in the absence of a gold standard dataset. Many biological questions have been addressed by multiple methods yielding binary (yes/no) outcomes but no clear definition of truth, making LCA an attractive method for applications in Computational Biology. We have employed LCA to evaluate orthology identification methods, including all of the above algorithms, in addition to some homology detection algorithms, such as BLAST and TribeMCL. Whether a given pair of cross-species proteins is or is not orthologous is an unobserved or 'latent' class, and all of the algorithms under consideration are used to make yes/no predictions as to orthology. Given the prediction results for a large set of homologous protein pairs, the likelihood function for this model can be expressed using the overall orthology probability and the FP and FN error rates for each algorithm. A maximum likelihood estimate of these model parameters is then used to represent the benchmarking result. For this analysis, LCA was used to examine homologous protein pairs among 18,202 protein sequences from six complete eukaryotic genomes (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Saccharomyces cerevisiae, and Schizosaccharomyces pombe), representing 1092 Pfam protein families. From this analysis, most methods can be seen to trade off sensitivity against specificity.
For example, homology detection methods display FP rates of 53-59% and FN rates of ~5%, while orthology detection methods based on phylogeny or evolutionary distance exhibit FP of 1-4% and FN of 21-67%. Between these two extremes are the diverse performances of BLAST-based orthology identification methods. As the first step for many BLAST-based approaches, RBH displays a low FP (9%), but its inability to recognize many-to-many co-ortholog relationships results in a high FN (33%). The consideration of co-orthologs by INPARANOID reduces FN to 21%, and FN rates are further decreased in methods that consider multiple genomes: 11% for OrthoMCL, and 4% for KOG. Ortholog clustering across multiple genomes inevitably bears a cost of increased FP rates, however: 18% for OrthoMCL, and 37% for KOG. Three algorithms exhibit both FP and FN error rates < 25%, and can therefore be considered the best orthology identification methods: Orthostrapper, INPARANOID and OrthoMCL. In further studies, we have investigated the performance and optimization of these methods under different parameters, varying the orthology bootstrapping cutoff in phylogeny-based approaches, the E-value cutoff in BLAST-based approaches, and the MCL inflation value in Markov clustering algorithms (OrthoMCL, TribeMCL). To compare OrthoMCL with the widely-used KOG algorithm, the stand-alone version of OrthoMCL was applied to the KOG reference dataset, including all proteins from seven eukaryotic genomes (the six species noted above, plus Encephalitozoon cuniculi). More than 50% of KOG groups were identically grouped by OrthoMCL, indicating that automated application of this method performs comparably to the manually-curated KOG database. ~35% of KOG groups were split into smaller groups by OrthoMCL, but comparison with Enzyme Commission (EC) annotations and protein domain architecture suggests that OrthoMCL groups exhibit a higher degree of consistency in the identification of putative protein function. Similar trends were observed in the application of OrthoMCL to the prokaryotic COG dataset. To facilitate applications of OrthoMCL's orthology prediction capabilities, proteome data from 55 complete genomes (all available eukaryotes, plus a representative selection of eubacteria and archaebacteria) were clustered into ortholog groups, and these data may be queried and downloaded from http://orthomcl.cbil.upenn.edu.
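The following Python sketch (an assumed, simplified two-class formulation, not the authors' code) shows how EM can estimate the latent orthology prevalence and each method's FP and FN rates from nothing but the agreement pattern of binary calls:

```python
# Illustrative two-class Latent Class Analysis fitted by EM.
import numpy as np


def lca_em(Y, n_iter=500, seed=0):
    """Y: (n_pairs, n_methods) 0/1 matrix of orthology calls from each method."""
    rng = np.random.default_rng(seed)
    n, K = Y.shape
    pi = 0.5                                  # P(pair is truly orthologous)
    sens = rng.uniform(0.6, 0.9, K)           # P(call = 1 | ortholog)
    spec = rng.uniform(0.6, 0.9, K)           # P(call = 0 | non-ortholog)
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is orthologous
        p1 = pi * np.prod(sens ** Y * (1 - sens) ** (1 - Y), axis=1)
        p0 = (1 - pi) * np.prod((1 - spec) ** Y * spec ** (1 - Y), axis=1)
        w = p1 / (p1 + p0)
        # M-step: update prevalence, sensitivities and specificities
        pi = w.mean()
        sens = (w[:, None] * Y).sum(axis=0) / w.sum()
        spec = ((1 - w)[:, None] * (1 - Y)).sum(axis=0) / (1 - w).sum()
    fn_rate = 1 - sens                        # P(call = 0 | ortholog)
    fp_rate = 1 - spec                        # P(call = 1 | non-ortholog)
    return pi, fp_rate, fn_rate


# Example usage (hypothetical input file of 0/1 calls, one column per method):
# Y = np.loadtxt("orthology_calls.txt", dtype=int)
# pi, fp, fn = lca_em(Y)
```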


20. Gabriele Schweikert, Georg Zeller, Richard Clark, Stefan Ossowski, Norman Warthmann, Paul Shinn, Kelly Frazer, Joe Ecker, Daniel Huson, Detlef Weigel, Bernhard Schoelkopf, Gunnar Raetsch. Friedrich Miescher Laboratory of the Max Planck Society, Spemannstr. 39, Tuebingen, 72076, Germany. [ PDF ]

Machine Learning Algorithms for Polymorphism Detection

Short Abstract: Based on high-density oligo-nucleotide array measurements and sophisticated machine learning methods, we obtain a genome-wide inventory of polymorphisms (including SNPs, deletions and highly polymorphic regions) in natural populations of Arabidopsis thaliana, representing an unprecedented resource for the study of genetic variation in a multicellular model organism.


Long Abstract: As extensive studies of natural variation require the identification of sequence differences among complete genomes, there exists a high demand for precise, yet inexpensive high-throughput sequencing techniques. While high-density oligo-nucleotide arrays are capable of rapid and comparatively cheap genomic scans, algorithmic approaches for the accurate identification of sequence polymorphisms from this kind of data remain a challenge [1]. We will present two machine-learning-based methods tackling the problem of identifying SNPs as well as deletions and highly polymorphic regions. We have collaborated with Perlegen Sciences Inc. to use array hybridization technology for polymorphism discovery in 20 wild strains of the model plant Arabidopsis thaliana, which has a genome of about 125 Mb. From this project we obtained nearly 19.2 billion measurements (four 25 nt probes for each base on each genomic strand and strain). For the identification of SNPs, we trained support vector machines (SVMs) on a set of known sequences [2] from each of the 19 non-reference strains. Our approach was two-layered, incorporating information from each strain and also across the strains to maximize the SNP prediction accuracy. An important feature of our approach is that each called SNP is given a confidence level, which has been lacking in earlier approaches. This allows us to adjust the recovery and accuracy along a ROC curve according to experimentalists' needs. Neighboring polymorphisms (distance <25 nt) disrupt the signal for SNP detection, so that algorithms tend to predict the fewest SNPs in the most highly polymorphic regions. A comparison of our method to a previously used model-based method (similar to [3]) revealed that our approach excels in the polymorphic regions that make up much of the genome. With our approach we recover many SNPs with a distance to the next polymorphic feature as small as 10 bp. Per strain we are able to identify on average about 166,000 SNPs (34% recovery rate) at a false discovery rate of 5%, leading to a total of 747,000 predicted SNPs. Many of them cause major functional effects (e.g. premature stop codons, disruption of splice sites, or deletions of coding sequence), of which 300 were confirmed by dideoxy sequencing. Considerable portions of genomes consist of regions with very high SNP density, where single-SNP detection algorithms fail to reliably identify SNPs. We have therefore developed methods to detect highly polymorphic regions containing clusters of SNPs, insertions and deletions. Based on a comparison of the hybridization signal from the target strain and the reference array, we distinguish between meaningful candidates and other regions with low signal intensity caused by experimental variability. We are currently developing algorithms based on novel label sequence learning techniques (following a discriminative approach related to Hidden Markov SVMs [4]) in order to predict highly polymorphic regions including their composition. In a preliminary attempt we predicted about 700 such regions per strain, and a subset of the predictions were validated by dideoxy sequencing, confirming about 100 deletions of length 150 bp to 10 kb with major effects on genes. First results indicate that more advanced techniques will reduce the length of polymorphic regions that are reliably detectable to 30 bp.
In summary, we obtain the first genome-wide inventory of polymorphisms in natural populations, representing an unprecedented resource for the study of genome-wide genetic variation in an experimentally tractable, multicellular model organism. References: [1] D. Gresham, D.M. Ruderfer, S.C. Pratt, J. Schacherer, M.J. Dunham, D. Botstein, L. Kruglyak. Genome-wide detection of polymorphisms at nucleotide resolution with a single DNA microarray. Science 311(5769):1932-1936, 2006. [2] M. Nordborg, T.T. Hu, Y. Ishino, J. Jhaveri, C. Toomajian, H. Zheng, E. Bakker, P. Calabrese, J. Gladstone, R. Goyal, M. Jakobsson, S. Kim, Y. Morozov, B. Padhukasahasram, V. Plagnol, N.A. Rosenberg, C. Shah, J.D. Wall, J. Wang, K. Zhao, T. Kalbfleisch, V. Schulz, M. Kreitman, J. Bergelson. The pattern of polymorphism in A. thaliana. PLoS Biology, 3(7):e196, May 2005. [3] D.A. Hinds, L.L. Stuve, G.B. Nilsen, E. Halperin, E. Eskin, D.G. Ballinger, K.A. Frazer, D.R. Cox. Whole-genome patterns of common DNA variation in three human populations. Science 307(5712):1072-1079, 2005. [4] I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453-1484, 2005.
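Purely for illustration, a per-strain SNP classifier of the kind described could be sketched as follows with a present-day SVM library; the feature files and parameters here are hypothetical, and the original work used its own SVM setup and feature construction.

```python
# Illustrative sketch of a per-position SNP classifier with confidence scores.
import numpy as np
from sklearn.svm import SVC

# X: one row per genomic position, with hybridization intensities of the probes
# covering that position in the reference and target strain (plus any derived
# features); y: 1 if the position is a known SNP in the training set, else 0.
X_train = np.load("probe_features_train.npy")   # hypothetical file names
y_train = np.load("snp_labels_train.npy")
X_test = np.load("probe_features_test.npy")

clf = SVC(kernel="rbf", C=1.0, probability=True)
clf.fit(X_train, y_train)

# Per-position confidence scores allow trading recovery against accuracy along
# a ROC curve, as described in the abstract.
snp_confidence = clf.predict_proba(X_test)[:, 1]
```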


21. Caroline Farrelly, Douglas Kell, Joshua Knowles. University of Manchester, Faraday Tower, North Campus, P.O. Box 88, Sackville Street, Manchester, Lancashire M60 1QD, UK. [ PDF ]

From Spectrum to Structure Using Machine Learning

Short Abstract: Current high-throughput techniques often synthesise novel compounds and have created a demand for rapid analytical techniques. This project addresses structure elucidation from carbon NMR spectra of organic molecules (<500 molecular weight) through machine learning, applying Ant Colony Optimisation algorithms to a known dataset.


Long Abstract: Current trends in the use of high-throughput techniques have created a demand for rapid analytical techniques. In the areas of combinatorial chemistry and metabolomics alike, huge datasets are generated, often relating to novel compounds. This project addresses structure elucidation of carbon NMR spectra for small organic molecules (<500 molecular weight). NMR is a powerful analytical tool, able to generate spectra from tiny sample sizes, in as little as a few seconds for protons. However, spectral analysis is time consuming and traditionally involves high levels of human expertise. This has fuelled the development of automated techniques for structure elucidation from NMR spectra. Previous efforts have focused on empirical methods or expert systems, with some work in the semi-empirical area. This project offers a fresh perspective by using Ant Colony Optimisation methods to learn associations between spectral peaks and substructural features. When a molecule is placed in an external magnetic field, each nucleus gives rise to a different spectral peak according to its electronic environment. The positioning of peaks on the spectrum depends on the external field strength and the total secondary magnetic field (Hornak 1997). Certain shift ranges are associated with functional groups, but these are distorted by adjacent environments. Automated spectrum prediction is an established field and has been developed as an aid in structure elucidation. By predicting the spectrum for an expected molecule, a chemist can quickly establish any similarity between the two. The inverse problem of automated structure elucidation is much more computationally challenging. Secondary magnetic fields mean that chemical shifts are dependent upon a molecule's total electronic configuration. Consequently, conventional algorithms are not suited to the levels of pattern recognition required for spectral analysis, so artificial intelligence methods are relied upon. Although expert systems and neural networks have been applied with some success to this problem, swarm intelligence methods remain largely unexplored (see Gasteiger, 2003). Ant Colony Optimisation (ACO) algorithms are the result of modeling the behaviour of ants foraging for a food source and finding the shortest path back to their colony. The algorithms use artificial ants, initially taking random search paths and depositing pheromone trails as they travel. Ants that have taken the shorter path will return to the nest more quickly, so a stronger pheromone scent will remain on this path. Hence, more ants will be attracted to take this same path. Some algorithms allow for pheromone evaporation, making the system more flexible in case of the emergence of new paths or the closure of existing paths. Ant 'memory' has been incorporated into other models; in this case the ants remember the location of their nest, have a better sense of direction, and can adjust their path accordingly (Bonabeau et al. 1999). Starting with a dataset of ~3000 fully assigned C-13 NMR spectra, each molecule has been broken down into its substructural components, allowing a substructural matrix to be constructed that reveals any correlations between functional groups and chemical shifts.
A new method has been developed to identify an exhaustive set of substructures in any given structure by applying graph theory to molfiles. This increases the searchable chemical space. By varying substructure size and looking at the strength of the electronic environment, patterns can begin to be identified. For the application phase, a molecule's C-13 NMR spectrum and empirical formula must be available. Using the empirical formula, any compatible substructures present in the matrix are identified. These can be used to build a list of potential substructures giving rise to the experimental spectrum being analysed. Thus, the chemical space is drastically reduced. Chemical shift information from the substructural matrix is used to assign the probability of each compatible substructure being present in the spectrum's underlying structure. An additional function is utilised to increase the incorporation of low-frequency elements present in the empirical formula, thus increasing the search speed. A Perl script has been written that bolts building blocks together into larger substructures, so that new isomers can be built. At each build step, compatibility checks are made according to the portion of the empirical formula not yet accounted for. The next substructure to be added is randomly selected via a roulette wheel that includes heuristic information. Each time a new isomer is built it must be assessed using an evaluation function to determine its fitness. This is conducted by performing the substructural analysis previously described, this time ensuring larger substructural sizes are selected. Their frequencies at the experimentally determined shifts are looked up. These are treated as weightings in a matching cardinality algorithm which is used to establish which substructure is most likely to have caused which shift. This process avoids double counting during the evaluation phase, and the resulting weightings are taken as the fitness level. The ACO program needs to respond to the fact that groups of substructures occurring together distort typical ranges, because of the secondary magnetic field effects. Hence, after a set number of iterations, the isomer with the highest fitness value is used to update a pheromone matrix, in an ordered manner, by increasing the values of its substructures. This pheromone matrix is used to attract more ants to substructures with higher pheromone levels. Upon convergence, the isomer with the highest pheromone trail can be taken as the most persuasive underlying structure. Initial results show promise. Increasing the initial dataset is expected to generate more accurate results. Further work will be conducted to optimise the ACO. This project is likely to be extended to larger structures and proton spectra. Bibliography: Bonabeau, E., et al. (1999), 'Swarm Intelligence: From Natural to Artificial Systems'; ISBN 0195131592, Oxford University Press. Gasteiger, J. (ed.) (2003), 'Handbook of Chemoinformatics: From Data to Knowledge'; ISBN 3-527-30680-3, Wiley-VCH, Weinheim. Griffiths, L. (2003).
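The roulette-wheel selection and pheromone update at the heart of the ACO loop can be sketched as follows (a generic ACO-style sketch with assumed parameter names such as alpha, beta and the evaporation rate rho, not the project's actual Perl code):

```python
# Illustrative roulette-wheel selection biased by pheromone and heuristic scores,
# plus a simple evaporate-and-reinforce pheromone update.
import random


def roulette_select(candidates, pheromone, heuristic, alpha=1.0, beta=2.0):
    """candidates: substructure ids; pheromone/heuristic: dicts of non-negative scores."""
    weights = [(pheromone[c] ** alpha) * (heuristic[c] ** beta) for c in candidates]
    total = sum(weights)
    r = random.uniform(0.0, total)
    acc = 0.0
    for cand, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return cand
    return candidates[-1]


def update_pheromone(pheromone, best_isomer_substructures, deposit=1.0, rho=0.1):
    """Evaporate all trails, then reinforce the substructures of the fittest isomer."""
    for key in pheromone:
        pheromone[key] *= (1.0 - rho)
    for sub in best_isomer_substructures:
        pheromone[sub] += deposit
```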


22. Anna Badimo. School of Computer Science, University of Witwatersrand, Witwatersrand, South Africa. [ PDF ]

A comparison of different distance metrics as applied to recombination detection in HIV

Short Abstract: Using distance-based metrics for sequence comparison offers an alternative to some of the problems associated with alignment-based sequence comparison. Recombination detection entails identifying breakpoints that signal changes in phylogenetic history. An existing technique of correlation measures is used with enhanced distance measures to analyse the detected recombination and compare results.


Long Abstract: In order to infer recombination, a determination has to be made about the phylogenetic histories of the sequences under study. To determine this history, multiple sequence alignment (MSA) is performed to determine the similarity between the sequences. The more similar the sequences are, the more closely they are related in their history. To do a multiple sequence alignment, the closest sequences are aligned first and progressively more distant ones are added. A score is kept for each successive alignment, and the more similar new sequences are to the ones already aligned, the closer one gets to an optimal score. If dissimilar sequences are added, this affects the best score that can be obtained. In fact, it is known that many multiple sequence alignment algorithms perform poorly where sequence identity is very low [Madsen2000]. As a result, introducing a single divergent sequence into a set of closely related sequences causes the iteration to diverge away from the best alignment [Thompson99]. Sequences can also be compared by calculating the distance between them. Initially, distance measures like the Hamming distance were used. Unfortunately, the Hamming distance is not a true reflection of how sequences have evolved, since events like multiple substitutions, large insertions, and so on are not accounted for. Distance measures like Kimura-2 and others account for some evolutionary events and were seen as an improvement on the crude Hamming distance. However, in order to calculate the Hamming distance or any of the improved distances, MSA is still used first to determine positional homology. So, in essence, even these improved distance measures are still based on the calculation of an alignment. As a result, MSA remains the most popular and commonly used technique for comparing sequences. Many approaches have been developed for detecting recombination; over 90% of them use MSA to determine sequence similarity. On the other hand, MSA as a technique for sequence comparison is continuously being questioned, not only from a performance and accuracy point of view but also in terms of how it handles recombination. In attempting to address some of the problems associated with MSA and Hamming-like distance measures, authors like [Vinga2003] have proposed alignment-free and resolution-free distance methods that offer an alternative to some of these complexities. The use of distance measures for recombination detection becomes relevant in HIV studies. Most recombination detection methods look for a phylogenetic signal to determine the breakpoints. In HIV, these breakpoints are easy to find where the recombination involves different subtypes, so most methods have been successful in detecting recombination where two subtypes have combined. In cases where recombination occurred within one subtype, correlation measures become sensitive to that combination. A study was conducted to determine the effect of the different distance-based methods on the outcome of recombination detection. The distance measures used were: a) Hamming (dH); b) Kimura 2-parameter (dK80); c) d2; d) edit distance (dE); e) Mahalanobis (dM); and f) Chaos Game Representation (dCGR). Where applicable, suitable nucleotide substitution models were factored into the distance calculations. The recombination detection method Phylopro [Weiller98] was then used for inferring recombination.
The above distance measures were fed into Phylopro to produce a correlation measure that was used to infer recombination. The null hypothesis was that the recombination result inferred was the same for each distance measure. Ten datasets (available at http://titan.cs.wits.ac.za/anna/paper2.html) were used for the tests. Each distance measure was run on the ten datasets, and ANOVA was used to test the differences between the means. References: [Madsen2000], Madsen D. and Kleywegt G, Research Project - Multiple Sequence Alignment (FarOut).
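For reference, two of the simpler distance measures compared here can be computed as in the following sketch (gap handling, the remaining metrics and additional substitution-model corrections are omitted; this is an illustration, not the study's code):

```python
# Illustrative Hamming (p-distance) and Kimura 2-parameter distances for a pair
# of aligned nucleotide sequences of equal length.
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hamming(s1, s2):
    """Proportion of aligned sites that differ."""
    diffs = sum(a != b for a, b in zip(s1, s2))
    return diffs / len(s1)


def kimura_2p(s1, s2):
    """Kimura 2-parameter distance, correcting separately for transitions (P)
    and transversions (Q): d = -0.5 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    n = len(s1)
    P = sum((a, b) in TRANSITIONS for a, b in zip(s1, s2)) / n
    Q = sum(a != b and (a, b) not in TRANSITIONS for a, b in zip(s1, s2)) / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


print(hamming("ACGTACGT", "ACGTACGA"), kimura_2p("ACGTACGT", "ACGTACGA"))
```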


23. Renato Milani, Eduardo Galembeck. Department of Biochemistry, Unicamp, Rua Vicente de Carvalho, 908 - Vila Amorim, Americana, SP 13469-130, Brazil. [ PDF ]

Alignment of the Glycolytic Enzymes Amino Acid Sequences for Heterogeneity Analysis and Building of a Hexokinase Phylogenetic Tree

Short Abstract: We tried to determine the heterogeneity of the glycolytic enzyme sequences, towards a better understanding of their phylogenetic relationships. The ten glycolytic enzymes retrieved from KEGG had their sequences aligned by CLUSTALW. A phylogenetic tree was built for the sequences related to hexokinase.


Long Abstract: The molecular phylogenetic analysis of biological sequences has been used to assess some extremely important biological questions, such as the relationship between humans and primates, the origin of AIDS, etc. These methods have proven successful due to the development of more rigorous and efficient techniques, which were applied here in order to determine the heterogeneity of the glycolytic enzyme sequences, aiming at a better understanding of their phylogenetic relationships. A MySQL database was built with information pertaining to all the enzymes associated with glycolysis in KEGG, ExPASy, MetaCyc and PUMA2, for a total of 30 enzymes. The data were collected from KEGG and incorporated into the database. As a first step, FASTA files specific for each enzyme, containing ortholog sequences, were generated, including every organism present in KEGG related to the ten glycolytic enzymes.


24. Robert Hoehndorf, Janet Kelso, Heinrich Herre. Max-Planck-Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig, 04103, Germany. [ PDF ]

GFO-Bio: A biomedical core ontology

Short Abstract: A core ontology provides the most general concepts of a domain, which can be used to define all other concepts in this domain. We present GFO-Bio, a biomedical core ontology based on the top-level ontology GFO. GFO-Bio is fully formalized in OWL-DL and can be found on http://onto.eva.mpg.de.


Long Abstract: In recent years, the number and quality of biomedical ontologies is continually increasing. Of high relevance to biomedical research are the ontologies united under the umbrella


25. Eduardo Eyras, Mireya Plass. Universitat Pompeu Fabra, Passeig Maritim Barceloneta 37-49 Barcelona, Barcelona E08005, Spain. [ PDF ]

Differentiated evolutionary rates in alternative exons and the implications for splicing regulation

Short Abstract: Two contradictory properties have been associated with conserved alternative exons: higher sequence conservation and a higher rate of non-synonymous substitutions, relative to constitutive exons. We provide evidence showing that most of the observed differences can be explained by the conservation of the transcript exonic structure. These results provide evidence for a selection pressure related to the regulation of splicing.


Long Abstract: Two contradictory properties have been associated with evolutionarily conserved alternative exons: higher sequence conservation (1,2) and a higher rate of non-synonymous substitutions (3,4), relative to constitutive exons. In order to clarify this issue, we have performed an analysis of the evolution of alternative and constitutive exons, using a large set of protein-coding exons conserved between human and mouse and taking into account the conservation of the transcript exonic structure. Splicing regulatory elements are abundant in constitutive and alternative exons and this is likely to exert some selection pressure on the exon sequence at the pre-mRNA level. On the other hand, changes in the regulatory elements are known to affect the splicing pattern of a gene. Thus, a purifying selection at the pre-mRNA level would be linked to a constraint on the splicing regulation, and therefore to a conservation of the exonic structure, whereas a relaxation of this selection would be linked to a weaker constraint on the splicing regulation, and to a lack of conservation of the exonic structure. In summary, the conservation of the exonic structure is expected to influence the sequence conservation of alternative and constitutive exons, and consequently, the measurements of the synonymous and non-synonymous substitutions. Accordingly, we separated our set of constitutive and alternative exons into four groups according to whether they were part of a transcript with an exonic structure that is conserved in the orthologous gene or not. Exons in transcripts with conserved exonic structure (CES) are called CES exons, whereas exons in transcripts with non-conserved exonic structure are called non-CES exons. A non-CES exon is such that there is a pattern of splicing of the pre-mRNA, which includes this exon, and which is never the same in the orthologous mRNA whenever the orthologous exon is included. We find evidence for a relation between the lack of conservation of the exonic structure and the weakening of the sequence evolutionary constraints in alternative and constitutive exons. Non-CES exons have higher synonymous (dS) and non-synonymous (dN) substitution rates than CES exons. Moreover, alternative non-CES exons are the least constrained in sequence evolution, and at high EST-inclusion levels they are found to be very similar to constitutive exons, whereas alternative CES exons have dS values significantly lower than average at all EST-inclusion levels. At high inclusion levels, alternative and constitutive exons of the same type (CES or non-CES) have indistinguishable dN distributions. However, at high inclusion levels, CES and non-CES exons can still be separated by their dN. We conclude that most of the differences in dN observed between alternative and constitutive exons can be explained by the conservation of the transcript exonic structure. Additionally, low dS values are characteristic of alternative CES exons at all EST-inclusion levels, but not of alternative non-CES exons. These results provide evidence for a selection pressure related to the splicing of the pre-mRNA. Furthermore, we have also defined a measure of the variation of the arrangement of exonic splicing enhancers (ESE-conservation score) to study the evolution of splicing regulatory sequences. We have used this measure to correlate the changes in the arrangement of ESEs with the divergence of exon and intron sequences.
We find a higher conservation in the arrangement of ESEs in constitutive exons compared to alternative ones. Additionally, the sequence conservation at flanking introns remains constant for constitutive exons at all ESE-conservation values, but increases for alternative exons at high ESE-conservation values, indicating a higher density of regulatory signals. 1. Xing, Y. and Lee, C. 2005. Evidence of functional selection pressure for alternative splicing events that accelerate evolution of protein subsequences. Proc Natl Acad Sci USA. 102(38):13526-31. 2. Chen, F-C., Wang S-S., Chen, C-K., Li, W-H. and Chuang, T-J. 2006. Alternatively and Constitutively Spliced Exons Are Subject to Different Evolutionary Forces. Mol. Biol. Evol. 23:675-682. 3. Sorek, R., Shemesh, R., Cohen,Y., Basechess, O., Ast, G. and Shamir, R. 2004. A non-EST-based method for exon-skipping prediction. Genome Res. 14(8):1617-1623. 4. Philipps, D.L., Park, J.W. and Graveley, B.R. 2004. A computational and experimental approach toward a priori identification of alternatively spliced exons. RNA 10(12):1838-44. 5. Plass, M., Eyras, E. 2006. Differentiated evolutionary rates in alternative exons and the implications for splicing regulation. BMC Evolutionary Biology, in press.


26. Rosina Piovani, Hugo Naya, Victor Sabbía, Héctor Musto. Facultad de Ciencias, Igua 4225, Montevideo, Montevideo 11400, Uruguay. [ PDF ]

Codon instability at high GC level in the comparison between Human and Chimpanzee

Short Abstract: We obtained a set of 2482 orthologous genes between Human and Chimpanzee and analyzed the codon usage at sites where the amino acid was conserved. We found that as GC3 increases, the level of conservation decreases for most amino acids. We discuss the results as a consequence of natural selection.


Long Abstract: Over the past ten years there has been a huge increase in the number of completely sequenced genomes, which has led molecular evolutionists to start thinking at a higher level, the genomic level. This has made possible studies such as comparing rates of genome evolution, finding orthologous genes among genomes, timing genomic divergence, and so on. All these studies are progressively making it possible to understand and analyze the evolutionary patterns among species, and also the compositional trends that they might have undergone. This has been a very important step because, as said before, it is a way to study these features at another level: not the genes nor the proteins but the whole genome. Taking this into account and focusing on the human genome, this study aims to analyze the trends in codon usage in this genome in comparison with its most closely related one: the chimpanzee genome. The first step in this study was to identify orthologous genes between the two genomes. Reciprocal BLAST searches led to a set of 2482 orthologous genes, which were used for the whole study. To start the analysis of codon usage between both species, we first asked whether the amino acid was conserved in the set of orthologous genes; if so, we then asked whether the codon was also conserved. Several papers claim that the differences between codon usage strategies in vertebrates are due to differences in genome organization, namely the presence of isochores. These are fragments of the genome that have a relatively constant GC content, so according to these studies the differences in codon usage might be due only to the localization of the genes in these regions. To avoid what we could call this 'isochore effect', and since we are studying synonymous changes (changes in codons that do not change the amino acid), we separated all the orthologous genes into groups spanning 5% intervals of GC3 content and analyzed the eleven resulting windows separately. By doing this, the GC content of the third codon position (GC3), which has the greatest influence on synonymous changes, has less influence on codon choice, making this situation ideal for identifying other features influencing codon usage. We wrote a script to calculate all the synonymous and non-synonymous changes per codon. Using this script we calculated what we called the 'coincidence coefficient', the ratio between the conserved codons and the total codons for each amino acid. This is a very useful way to analyze this type of change because it differs from estimating a synonymous substitution distance in the sense that we are comparing not at the gene level but at the amino acid level, which allows us to better understand codon usage trends. When we plotted this coincidence coefficient against the average GC3 of all the windows, we observed that all the amino acids had a negative correlation, which was significant for thirteen out of eighteen amino acids. That means that synonymous changes are more frequent in GC3-rich genes than in GC3-poor ones. The amino acids with the most significant correlation were Ser, Pro, Ala and Thr. Surprisingly, these are all the amino acids that have C in the second codon position and are also part of quartets, so they can form the CpG dinucleotide in the second and third codon positions.
We analyzed the frequency of the NCG codons in these amino acids with respect to the others and compared it with that of other amino acids with A, T or G in the second codon position. We found that in GC3-rich genes the NCG codon is used less than expected, and the results suggest that this is due to the known effect of CpG avoidance, since this dinucleotide is very susceptible to deamination of 5-mC to T, which leads to CpA and TpG if it is not repaired. This could explain, for these four amino acids, the presence of more synonymous changes in GC3-rich genes, at least in the transition NCG to NCC. Furthermore, we analyzed the conservation of the amino acid by dividing the number of conserved residues by the total number of amino acids, excluding positions paired with an indel in the alignment. We plotted the frequency of conserved amino acids (calculated for each window) against the GC3 content of the genes and found a negative correlation. This result indicates that the instability observed at higher levels of GC3 at the codon level is also seen at the amino acid level. To get a more detailed picture, we calculated the coincidence coefficient for each codon instead of each amino acid (the fraction of conserved codons over the total number of codons for a given amino acid). We plotted these values (calculated for each GC3 window) against the number of copies of the isoacceptor tRNA for each codon and found a positive correlation in each window. This means that the most conserved codons are the ones recognized by the isoacceptor tRNAs with higher copy numbers, implying a tendency to conserve the codons that guarantee better speed and/or accuracy of translation. One interesting feature of this result is that the correlation is more significant in GC3-rich genes, which suggests that these genes could have a higher level of expression, which would explain the conservation of the major codons. To sum up, using a set of orthologous genes between human and chimpanzee we found that genes with higher GC3 content have a lower level of amino acid conservation, but when the amino acid is conserved the number of synonymous codon substitutions is higher. The conserved codons are probably fixed by the action of natural selection acting at the level of translation (accuracy and/or speed).
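
For illustration, a minimal sketch of how such a coincidence coefficient could be computed per amino acid follows; it is not the authors' script, and the helper names, the Biopython dependency and the binning strategy are assumptions.

# A minimal sketch (not the authors' script) of the "coincidence coefficient":
# for each amino acid, the fraction of aligned, amino-acid-conserved codon pairs
# in which the codon itself is also identical. Biopython is assumed for translation.
from collections import defaultdict
from Bio.Seq import Seq

def translate(codon):
    return str(Seq(codon).translate())

def gc3(codons):
    """Fraction of G or C at the third codon position of a gene."""
    return sum(c[2] in "GC" for c in codons) / len(codons)

def coincidence_coefficients(aligned_codon_pairs):
    """aligned_codon_pairs: (human_codon, chimp_codon) at indel-free positions.
    Returns {amino_acid: conserved_codons / total_codons}, restricted to sites
    where the amino acid is conserved between the two species."""
    conserved, total = defaultdict(int), defaultdict(int)
    for h, c in aligned_codon_pairs:
        aa = translate(h)
        if aa != translate(c):
            continue                      # only amino-acid-conserved sites
        total[aa] += 1
        if h == c:
            conserved[aa] += 1
    return {aa: conserved[aa] / total[aa] for aa in total}

# Genes would first be binned into 5% GC3 windows using gc3(), and the
# coefficients computed separately for each window before being correlated
# with the window's average GC3.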


27. Todd Stokes, John Phan, Chang F Quo. Biomedical Engineering Department, Georgia Institute of Technology and Emory University, 313 Ferst Drive, Atlanta, GA 30332, USA. [ PDF ]

Intelligence-based Center Information Integration and Report System in CCNE for Personalized Oncology

Short Abstract: We established the Emory-Georgia Tech Center of Cancer Nanotechnology Excellence to combine cancer biology with nanotechnology and informatics for personalized oncology. One key objective is to develop an intelligence-based research result processing system to improve the interpretation of results and to speed up the translation of such discoveries to the cancer care community.


Long Abstract: In translational biomedical informatics research, the goal is to translate lab-based research results, including genomic and proteomic data mining, sequence analysis, biomarker discovery and pathway modeling, into bedside patient care. In 2005, we established the Emory-Georgia Tech Center of Cancer Nanotechnology Excellence (CCNE) to combine cancer biology with nanotechnology and informatics to deliver novel molecular imaging probes, nano-therapeutics, and computing tools to treat cancer. Seventy-five faculty members from biology, nanotechnology, oncology, imaging, chemistry and other fields conduct research on different aspects of this goal, and research results from these individual groups are reported monthly. To speed up the translation of novel discoveries from lab to bedside, it is critical to integrate all the results in a meaningful way. As evidenced by the strong funding commitment of the National Institutes of Health to multi-disciplinary clinical research centers, the acceleration of clinical research achieved by such relatively large-scale collaborative centers is invaluable. Information management and status reporting to funding agencies are integral parts of the infrastructure supporting the center goals. Common information management objectives are as follows [1-4]: (i) to reduce administrative cost and redundancy in laboratory management; (ii) to develop laboratory methods that facilitate the research process using laboratory instruments and equipment; (iii) to integrate different instruments into an automated workflow architecture; (iv) to facilitate data production, storage, mining and visualization; (v) to ensure data quality and accessibility to other scientists for information dissemination; and (vi) to report research status to funding agencies as required. Individual research groups may accomplish these objectives with relative ease, but common solutions are not easily scalable to large centers. A. The Goals of a Center Status Reporting System. The objective of our research is to develop an intelligence-based information processing system that analyzes these results by comparing them with existing patents and literature, and provides a prioritized list based on significance requirements from the center director. We have designed and developed our system (1) to be simple and flexible in order to easily adapt to the dynamics of novel discoveries, and (2) to have a multi-scale information reporting hierarchy ranging from a detailed technical version to a higher-level executive summary. This web-based system collects incremental research progress updates from each researcher in the center and allows PIs, directors, managers, and mentors to customize and generate status reports based on predefined templates, reporting hierarchies, and intelligent text mining and scoring methods. The system can be configured to deal with a variety of dynamic collaborative efforts and is simple enough to be used by project teams with little computational expertise. Furthermore, the increasingly multi-disciplinary nature of collaborative projects dictates specific needs that must be addressed, such as data heterogeneity, diverse professional cultures and traceability at various organizational levels. B. System Concept and Design. Our research information processing and reporting system is designed by analogy with a newspaper office. Researchers take the role of reporters working on a variety of stories. They submit regular, categorized, and proof-ready updates to the Editor (the center director).
When one of their projects reaches a planning period, they submit a project planning update. In this way, researchers give frequent and uniform updates, leaving the pressure of deciding how to format the front page of each edition to the Editor. This saves time and allows researchers to focus more on their research. At the same time, the quality of their reports should improve because updates are submitted at the moment when researchers are focused on that project, rather than being recalled and organized from memory later. Our system assumes that some amount of content cleaning, expansion or focusing will take place on the side of the Editor, and is designed to ensure that the Editor has plenty of high-quality content to work with. For example, figures may be required for certain update types on a per-member basis. The information is scored based on human knowledge such as Scientific Importance, Urgency, etc. These scores can then be used to create interpretation reports that help the center director understand the scope of a discovery. The system is developed on the LAMP (Linux, Apache, MySQL, PHP) web platform. Cascading Style Sheets (CSS) are used so that the look and feel of the system can easily be modified to integrate with existing LIMS. The email reminder system depends on being hosted on a Linux/Unix operating system, as cron and sendmail are used. The LaTeX formatting standard is used to enhance the readability of reports. This formatting language has long been used in the scientific community to format papers for submission to conferences and journals and is thus a de facto standard for this type of system. Our team evaluated the HTML standard but found that it did not provide important features such as automated equation formatting and handling of figure references and citations. Other tools and utilities incorporated into the system are LaTeX2rtf (http://latex2rtf.sourceforge.net/), for creating an output format readable by Microsoft Word, and sMArTH (http://smarth.sourceforge.net/), for giving users a friendly interface for creating complex equations without having to learn LaTeX syntax. The flexibility of our system is built into the data model used in the reporting database. Currently, the database is composed of fewer than 20 tables and maintains a history of updates, figures, and reports as they are generated. C. Conclusions and Future Work. Our system has proven useful for improving organization and status reporting among our collaborators. We would also like to implement an interface that allows the Editor to do more high-level reorganization before the report compilation and generation step is executed. For future work specific to our computing core, we plan to integrate this reporting system with GForge, an open-source software bug tracking system that will organize issues reported by users of our software and report on the progress of our developers in resolving those issues. In the future, grant administrators may support this type of system by providing LaTeX templates of reports in their reporting requirements.
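
As a rough illustration of the scoring-and-prioritization step described above (and only that; the actual system is a LAMP/PHP application), a minimal Python sketch might look as follows. The field names and weights are assumptions, not the center's schema.

# Minimal sketch of ranking researcher updates by human-assigned scores.
# Field names and weights are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Update:
    author: str
    project: str
    text: str
    scientific_importance: int   # e.g. 1-5, assigned during review
    urgency: int                 # e.g. 1-5

def priority(update, w_importance=0.7, w_urgency=0.3):
    """Combine the human-assigned scores into a single ranking value."""
    return w_importance * update.scientific_importance + w_urgency * update.urgency

def prioritized_report(updates, top_n=10):
    """Return the top-N updates for the director's summary, highest score first."""
    return sorted(updates, key=priority, reverse=True)[:top_n]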


28. Pavol Hanus, Johanna Weindl, Zaher Dawy, Juergen Zech, Joachim Hagenauer, Jakob C. Mueller. TU-Munich, Arcisstr. 21 Munich, 80290, Germany. [ PDF ]

Synchronization Model of Transcription Initiation in Prokaryotes and its Kinetic Interpretation

Short Abstract: Transcription initiation in prokaryotes has been extensively studied in the past decades. However, little is known about the kinetics and dynamics involved in promoter detection by the RNA polymerase (RNAP) and its sigma subunit. Using an analogy to the well-understood synchronization techniques commonly used in digital data transmission, we try to gain more insight into this process.


Long Abstract: There exist several analogies between the processing of genetic information at the molecular level in the cell and particular scenarios in communication systems. In this work we concentrate on the analogy between promoter detection in prokaryotes and the synchronization pattern search in asynchronous communication. Our model organism is E. coli and its Sigma 70 promoter sites. In the first step of the transcription cycle, the sigma factor protein associates with the RNAP core enzyme. The resulting complex, the RNAP holoenzyme, subsequently binds to the DNA double helix at a random position and slides along it until the sigma factor detects the two promoter regions, -35 and -10, that indicate the proximate beginning of a gene. Similarly, in an asynchronous communication scenario, the receiver starts listening to the data stream at an arbitrary point in time and looks for the synchronization pattern indicating the beginning of an information frame. The choice of the synchronization pattern is crucial for reliable detection of the information frame. In our analysis we compared the synchronization patterns used in technical systems for quaternary alphabets with the actual promoter binding site patterns, finding that the strongest promoter binding sites have very good synchronization properties. Weight matrices are an important tool for determining putative protein-binding sites. They are constructed using the nucleotide occurrence frequencies of known binding site patterns and have already been applied to promoter analysis. However, recent gene expression data analysis has shown that the consensus binding site pattern obtained in this manner does not necessarily coincide with the strongest binding site pattern in terms of binding energies and expression strength. In order to model the underlying biological process as closely as possible, we constructed the weight matrix based on the position-specific, nucleotide-dependent contribution to the binding energy and expression strength. Analogous to synchronization in digital data transmission, we modeled promoter detection performed by the RNAP during transcription initiation using a sliding window that is shifted in single-nucleotide steps over the DNA. The algorithm accounts for the fact that the sigma factor is capable of stretching or squeezing, thereby adapting to different promoter spacings between the -35 and -10 regions in order to bind to the energetically most favorable site at each step of the sliding process. Our main goal was to gain more insight into the process of promoter recognition at the molecular level common to all promoters; we were not trying to develop yet another method for promoter detection. Thus, we applied the developed synchronization algorithm to all promoters in our dataset and derived an average signal to eliminate the strong fluctuations of individual sequences. As expected, we observed a strong recognition signal in the promoter regions. When applied to a wider range around the promoter regions, our algorithm revealed a characteristic increase of the binding energies in a region of approximately 500 base pairs surrounding the promoters. When binding to promoterless sequences, the RNAP holoenzyme performs a random walk without a preference for a particular direction of sliding. Our results suggest that this random walk becomes directed towards the transcription start site as soon as the complex enters this region of approximately 500 base pairs surrounding the promoters.
At the same time, the speed of sliding decreases, i.e. the complex is slowed down. These two facts imply that the sequence around the promoter site contains important information apart from the promoter itself that guides the RNAP holoenzyme to the transcription start site and thereby ensures expression of the corresponding genes.
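
A minimal sketch of this kind of sliding-window scoring with a flexible spacer is given below; the weight matrices and the spacer range are placeholders, since the authors derived their matrices from binding-energy and expression-strength data rather than from these illustrative values.

# Minimal sketch of sliding-window promoter scoring with a stretchable spacer
# between the -35 and -10 elements. Weight matrices (4 x motif_length, rows in
# A/C/G/T order) and the spacer range are illustrative assumptions.
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def window_score(window, wm35, wm10, spacer_range=(15, 20)):
    """Best combined -35/-10 score for one window, over all allowed spacers."""
    len35, len10 = wm35.shape[1], wm10.shape[1]
    s35 = sum(wm35[BASES[b], i] for i, b in enumerate(window[:len35]))
    best = -np.inf
    for spacer in range(*spacer_range):
        start10 = len35 + spacer
        if start10 + len10 > len(window):
            continue
        s10 = sum(wm10[BASES[b], i] for i, b in enumerate(window[start10:start10 + len10]))
        best = max(best, s35 + s10)
    return best

def slide(dna, wm35, wm10, window=60):
    """Score every single-nucleotide shift of the window along the DNA strand."""
    return [window_score(dna[i:i + window], wm35, wm10)
            for i in range(len(dna) - window + 1)]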


29. Perret, Bairoch, Veuthey. Swiss Institute of Bioinformatics, 1 rue Michel-Servet, Geneva, 1211, Switzerland. [ PDF ]

CaPSuLo, an integrative web tool for predicting the subcellular location of proteins

Short Abstract: CaPSuLo provides common access to a set of selected subcellular location prediction programs in an integrative tool, easing the interpretation of otherwise inconsistent results. It displays results together with information on sequence homology and topology to help users choose the most reliable prediction (http://biosapiens.isb-sib.ch/capsulo).


Long Abstract: Knowing the subcellular location of a protein is a crucial step in the discovery of its function. The huge increase in available biological data requires the development of prediction tools to better functionally characterize genomes and proteomes. For instance, knowing how a protein is targeted to a subcellular compartment can lead to further investigations regarding its function. As a consequence, scientists have taken advantage of the implementation of experimental data in sequence databases and have developed many subcellular location prediction programs. These programs can be categorized into two groups according to the protein features they use to perform the prediction, namely (i) the composition and physicochemical properties of amino acids from part of or the entire protein sequence, or (ii) the detection of a targeting peptide. They also diverge in the methods used to perform the prediction. These differences, along with the increasing number of tools, can be confusing for users; although it is generally admitted that relying on several predictors improves confidence, inconsistent results may make the decision on the correct prediction difficult. These observations led us to build CaPSuLo, a web tool which includes several subcellular location prediction programs, as well as additional information such as protein topology prediction and BLAST homology search (http://biosapiens.isb-sib.ch/capsulo). CaPSuLo consists of three different parts: (i) The query interface allows the user to enter either a protein sequence, or a UniProtKB identifier or accession number. Additional information on the taxonomic group to which the sequence belongs is also required, as a number of programs have only been trained on sequences from a specific range of organisms. The user can select the predictors to run on the sequence from a table, with optional parameters for some programs, a link to the original website of each predictor, and the corresponding PubMed citation. The table is completed by the range of subcellular locations that each program is able to predict, coupled with the taxonomic specificity of these predictions, indicated by a documented color code. In addition, the protein topology can also be predicted using TMHMM. (ii) A set of CGI scripts calls the user-selected programs, via the network, with the required parameters. In addition, we automatically run a BLAST of the query sequence against UniProtKB/Swiss-Prot. (iii) CaPSuLo parses the retrieved results of each selected program and displays them together with their respective confidence measurements. When the user calls the programs with a UniProtKB identifier, CaPSuLo displays the annotation corresponding to the protein's subcellular location if it is available in the UniProtKB entry. Moreover, a link to the complete entry annotation is provided. The best-scoring BLAST results (above 70% identity) can be screened, together with their full UniProtKB description and the corresponding subcellular location annotation, when available. Additional predicted features, e.g. target peptide cleavage sites and transmembrane regions, can be seen directly on the sequence. Predicting the subcellular location of a protein is a step towards its functional characterization, from which further research work can be carried out. We hope CaPSuLo will help experimentalists in such a task by providing an easy way to compare and analyze the results of up to ten different subcellular location prediction programs.
We also plan to improve the UniProtKB annotation by providing the database curators with a decision support system for protein subcellular location prediction.
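
A minimal sketch of the collation idea (running the user-selected predictors and tallying their calls) might look like the following; the predictor callables are stand-ins for the remote tools CaPSuLo actually invokes through CGI scripts, so all names here are assumptions.

# Minimal sketch: dispatch a sequence to selected predictors and tally their
# subcellular-location calls. The predictor functions are stand-ins for the
# remote tools that CaPSuLo calls over the network.
def run_predictors(sequence, selected, registry):
    """registry maps predictor name -> callable(sequence) -> (location, confidence)."""
    results = {}
    for name in selected:
        try:
            results[name] = registry[name](sequence)
        except Exception as err:                 # a remote tool may fail or time out
            results[name] = ("unavailable", str(err))
    return results

def tally(results):
    """Count how many predictors agree on each compartment, most votes first."""
    votes = {}
    for location, _confidence in results.values():
        votes[location] = votes.get(location, 0) + 1
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)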


30. Anat Achiron, Michael Gurevich, Tair Snir, Mathilda Mandel. Multiple Sclerosis Center, Sheba Medical Center, Ramat-Gann, 52621, Israel. [ PDF ]

An Integrative Model for Reconstruction of a Drug-Related Regulatory Pathway

Short Abstract: An integrative computational model based on gene expression, chromosomal location and promoter region sequence data was applied to analyze interferon beta-1a (Rebif) treatment effects in multiple sclerosis. It identified 21 directly regulating genes controlled by 48 transcription factors. This approach enables better evaluation of the molecular mechanisms involved in drug interactions.


Long Abstract: In the past few years, reconstruction of regulatory networks has become an essential part of gene-expression analysis studies. However, the common methods for reconstruction of regulatory networks often result in a high false-positive error rate. In the present study, we present a rich integrated computational model based on a combination of gene expression data, chromosomal location data and promoter region sequence data. We applied this integrated model to analyze interferon beta-1a (IFNb, Rebif) treatment effects in multiple sclerosis (MS). Rebif is an immunomodulatory drug used for treatment of MS. MS is a T-cell mediated autoimmune disease characterized by central nervous system inflammation, demyelination and axonal loss. It is known that binding of IFNs to a specific cell surface receptor results in the activation of multiple intra-cellular signaling cascades. Furthermore, many of the genes involved in this IFNb-inducible cascade are well characterized, yet much of its regulatory pathway in MS remains unknown. Using Affymetrix oligonucleotide microarrays (HG-U133A2 GeneChip), we measured peripheral blood mononuclear cell (PBMC) gene expression in 12 subjects with MS, prior to and 3 months after initiation of Rebif treatment. We classified significantly altered genes into 23 groups according to their chromosomal location. For each group of genes belonging to a single chromosome we reconstructed regulatory pathways using Bayesian networks, and considered only regulatory pathways with robustness > 50%. We specifically focused on genes related to the directly induced regulatory process of Rebif treatment; these genes are stimulated by binding of the drug to specific receptors on the cell membrane. In order to suggest possible interactions between the resulting regulatory pathways (trans-chromosomal and intra-chromosomal), we looked for common motifs of known transcription factors (TRANSFAC) in the promoter regions of the directly affected regulating genes, assuming that a high correlation coefficient between the expression values of two genes implies that they are co-regulated. We used this assumption to score the resulting transcription factors and the genes they possibly bind. For each transcription factor, we calculated the average Pearson correlation between all the genes that this transcription factor can bind, and considered only transcription factors with a correlation of at least 0.7 and p < 0.05. Application of our integrative model resulted in 21 directly regulating genes, regulated by 48 transcription factors. The biological analysis of the results demonstrated variable immunomodulatory effects, including induction of MHC class I and antigen processing, promotion of CD8+ T cells, activation and shaping of natural killer (NK) cell cytotoxicity, suppression of IL12 expression and regulation of dendritic cells.
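
A minimal sketch of the transcription-factor scoring step (average pairwise Pearson correlation among the genes a TF can bind, thresholded at 0.7) is given below; the data structures are assumptions, and the significance test applied by the authors is omitted.

# Minimal sketch: score each transcription factor by the mean pairwise Pearson
# correlation of the expression profiles of the genes carrying its motif, and
# keep TFs above the stated threshold. Significance testing is omitted here.
from itertools import combinations
from scipy.stats import pearsonr

def mean_pairwise_r(gene_ids, expression):
    """expression: dict gene_id -> list of expression values across samples."""
    rs = [pearsonr(expression[a], expression[b])[0]
          for a, b in combinations(gene_ids, 2)]
    return sum(rs) / len(rs) if rs else 0.0

def candidate_tfs(tf_targets, expression, r_min=0.7):
    """tf_targets: dict TF -> genes with its motif in their promoter regions."""
    scored = {tf: mean_pairwise_r(genes, expression)
              for tf, genes in tf_targets.items()}
    return {tf: r for tf, r in scored.items() if r >= r_min}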


31. Bingding Huang, Michael Schroeder. Biotec, Technical University of Dresden, Gerokstr 32, Dresden, 01307, Germany. [ PDF ]

A Novel Enzyme-inhibitor Docking Algorithm Based on the Degree of Burial and Conservation of Surface Residues

Short Abstract: The binding sites of enzyme-inhibitor complexes usually involve a deeply buried pocket and are more conserved than the rest of the surface. In our docking approach, BDOCK, the degree of burial and the conservation of surface residues are taken into account, which significantly improves the performance of the traditional FFT docking method.


Long Abstract: Protein-protein interactions are fundamental, as many proteins mediate their biological function through interactions with other proteins. Most processes in the living cell require molecular recognition and the formation of complexes, which may be stable or transient assemblies of two or more molecules, with one molecule acting on the other, promoting intra- and intercellular communication, or forming permanent oligomeric ensembles. The rapid accumulation of data on protein-protein interactions, sequences and structures calls for the development of computational methods for protein docking. Typically, docking methods attempt to predict the structure of a complex given the structures of its components. Over the past 20 years there have been many computational approaches to dock proteins. These approaches are mostly based on the shape complementarity of the structures and the physicochemical properties of the interfaces. However, these docking approaches are far from perfect and there remains room for improvement. In blind protein-protein docking approaches, it is of great importance that the binding sites are predicted correctly in the first step. Knowing where the binding sites are located on the protein surface can limit the conformational search space and reduce computational time. In the last 10 years, there have been many efforts to predict protein-protein binding sites based on the analysis of protein surface properties. However, only a few approaches integrate interface prediction into docking. The binding sites of enzyme-inhibitor complexes always involve a deeply buried pocket. It is also believed that the binding sites are more conserved than the rest of the surface. In this work, we develop a system called BDOCK, which uses both the degree of burial and conservation to improve docking. First, we propose a novel shape-complementarity scoring function for the initial docking stage, which weights surface residues by their degree of burial. Secondly, in the filter stage, highly conserved surface residues located around a pocket are taken as predicted interface residues, which are then used to calculate the tightness of fit as a scoring function to filter the docking solutions generated in the first step. Our approach is evaluated on the unbound structures of 22 enzyme-inhibitor complexes from the Chen benchmark data set. An interface prediction is defined as a success if more than 50% of the predicted residues are real interface residues. We define a docking solution as a near-native structure (hit) if the RMSD between it and the native complex is below 4.5 angstroms. In the first step, the top 2000 solutions based on shape complementarity are kept for each complex. The Z-score of the tightness of fit is calculated and the threshold for filtering is set to -1.0. As a result, the interface residues are correctly predicted for 19 complexes. Furthermore, a 76% success rate is achieved on a large data set of 102 enzyme-inhibitor complexes derived from the PDB and SCOP databases. In the initial stage of docking, our approach generates near-native complex structures for all 22 cases, with ranks ranging from 1 to 931. In the filter step, the tightness-of-fit scoring function based on the predicted interface residues improves the docking results by factors of 4-61. Moreover, the tightness of fit significantly improves the ranking of the best hit to the top 10 for 12 cases.
BDOCK is implemented in C++ using the BALL library, taking an object-oriented and generic programming approach, which makes it easy to extend and to adapt to new docking algorithms. The source code of BDOCK is available at http://www.biotec.tu-dresden.de/~bhuang/bdock.
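
A minimal sketch of the Z-score filter described above follows; the tightness-of-fit values are treated as given inputs (in BDOCK they come from the predicted interface residues), and the assumption that higher tightness is better is ours.

# Minimal sketch of the filtering stage: standardize the tightness-of-fit scores
# of all candidate poses and discard those below the stated Z-score threshold.
# Assumes higher tightness of fit is better; the scores themselves are inputs.
import numpy as np

def filter_by_tightness(solutions, tightness, z_threshold=-1.0):
    """solutions: list of candidate poses; tightness: parallel list of scores."""
    t = np.asarray(tightness, dtype=float)
    z = (t - t.mean()) / t.std()
    return [s for s, zi in zip(solutions, z) if zi >= z_threshold]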


32. Martin Akerman, Yael Mandel-Gutfreund. Technion- Israel Institute of Technology, Technion, Haifa, 32000, Israel. [ PDF ]

Choosing the End: Regulation of Alternative 3' Splice Sites

Short Abstract: In this study we apply a Support Vector Machine to identify alternative 3' splice site events. We show that we can distinguish between alternative and constitutive events, both for tandem acceptor motifs and for splice sites that are further apart. Finally, we suggest a possible mechanism of splice site selection.


Long Abstract: Alternative splicing (AS) constitutes a major mechanism creating protein diversity in humans. This can result from skipping entire exons or from altering the selection of the splice sites that define the exon borders. Alternative 3' splice sites (A3SS) represent ~18.4% of all AS events conserved between human and mouse. Half of these events involve the NAGNAG motif at the 3' splice site. Though the NAGNAG motif is found frequently in 3' splice sites, only ~4% of these tandem acceptors are confirmed by ESTs to be alternatively spliced in human and mouse, while in 86% of the cases the proximal splice site is constitutively selected (P) and in 10% the distal splice site is chosen (D). We have previously shown that it is possible to distinguish between sequences that undergo alternative versus constitutive splicing (CS) at NAGNAG motifs without relying on EST data. Among the features characteristic of the AS NAGNAG sites are: high sequence conservation of the motif, high conservation of ~30 bp of the intronic regions flanking the 3' splice site, and overabundance of cis-regulatory elements in the flanking intronic and exonic regions. In this study we compute these and other characteristics for a variety of training sets of alternatively and/or constitutively spliced NAGNAG motifs and use a Support Vector Machine (SVM) to automatically classify them. We show that we can discriminate AS events from both P and D events with relatively high accuracy. Interestingly, a high performance is attained when separating P from D events. The most discriminative parameters between the groups are splice site composition, namely the sequence of the NAGNAG motif, intronic conservation near the splice site, and the strength and relative position of the polypyrimidine tract. Based on the assumption that tandem acceptors are a subset of all alternative 3' splice site events, we expand our analysis to further cases in which the acceptor sites are spaced at distances varying from 4 to 100 nucleotides. Using similar parameters, the SVM succeeds in discriminating between A3SS events and a control set of constitutive splicing events containing an AG site at varying distances. In addition, we observe that when using a control set of sequences in which the AG that does not serve as a splice site is not evolutionarily conserved, the SVM performance is considerably higher than when the conservation of the AG dinucleotide is not accounted for. The latter result could suggest that (1) there is contamination of CS ESTs within the AS events, or (2) in some cases an AS-like regulatory process is required in order to avoid the selection of a proximal splice site, which is usually chosen as the default. In order to examine the second assumption we carried out additional tests comparing subsets of CS sequences to the AS datasets. In a subset of sequences in which the distance between the splice site and the nearest AG site is 4-12 nt, we find that when the AG dinucleotide is evolutionarily conserved, SVM performance considerably drops compared to cases in which the AG dinucleotide is not conserved. We interpret this to mean that only in the former group does the sequence environment resemble that of alternative 3' splice site events. However, when the AG dinucleotide is located far from the splice site (30-100 nt), the sequence environment of the CS events does not display the characteristic features of AS, whether or not the AG site is conserved.
Overall, our results imply that regulation is necessary in order to avoid the selection of undesired AG sites. Interestingly, we do not observe this regulation when the AG site is not conserved, perhaps because these sites are still evolving. Moreover, such regulation seems not to be required when an AG is placed far from the real splice site, presumably upstream of the branch point, since this AG will be sequestered in the lariat during the second step of transesterification. In conclusion, our findings indicate that both the selection between two alternative splice sites and the recognition of a constitutive AG site from a range of possible splice sites are regulated processes.
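
As a rough sketch of the classification setup (not the authors' implementation), a scikit-learn SVM over precomputed features might look like the following; the feature extraction step and the specific kernel and parameters here are assumptions.

# Minimal sketch: cross-validated SVM classification of splice-site events from
# precomputed features (motif composition, intronic conservation, polypyrimidine
# tract strength/position, cis-element counts). Kernel and parameters are
# illustrative assumptions, not the authors' settings.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

def classify_events(features, labels, folds=5):
    """features: (n_events, n_features) array; labels: 1 = alternative (AS),
    0 = constitutive (P or D). Returns mean cross-validated accuracy."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    scores = cross_val_score(clf, np.asarray(features), np.asarray(labels), cv=folds)
    return scores.mean()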


33. Markus Weniger, Joerg Schultz. Department of Bioinformatics, University of Wuerzburg, Am Hubland, Wuerzburg, Bavaria 97074, Germany. [ PDF ]

Genome Expression Pathway Analysis Tool - Analysis and Visualization of Microarray Gene Expression Data under Genomic, Proteomic and Metabolic Context

Short Abstract: GEPAT, http://bioapps.biozentrum.uni-wuerzburg.de/GEPAT/, is a web-based platform for annotation, analysis and visualization of microarray gene expression data. Analysis and visualization methods under genomic, proteomic and metabolic context are integrated into an easy-to-use, interactive graphical user interface.


Long Abstract: A typical analysis of microarray gene expression data consists of a number of steps, including normalization, filtering and annotation of the data, followed by various group prediction (unsupervised clustering) or classification (supervised clustering) methods. Although a large number of data analysis tools are available for these steps, most of them lack the ability to help with the interpretation of the results. For deeper insight into the biological meaning, information such as gene function, chromosomal location, affected pathways, protein interactions and literature references provides a useful start for further research. For this reason we developed a web-based tool, GEPAT, offering an integrated analysis of transcriptome data under genomic, proteomic and metabolic context. GEPAT imports various formats of oligonucleotide arrays, cDNA arrays and data tables in CSV format; multiple files can be uploaded at once. Data are stored password-protected in a private user space on our server, allowing access to the data from any computer around the world. A gene annotation database was built based on the UniGene and Ensembl databases, allowing gene identification for the probes on the microarray chip. Following data import, various missing value imputation and normalization methods can be applied to the data, for both oligonucleotide and cDNA arrays. Additional information, such as CGH data for the samples, can also be specified. GEPAT offers various analysis methods for gene expression data. Hierarchical, k-means and PCA clustering methods allow group detection in probes and samples. With these detected groups or a predefined group set, a linear-model-based t-test can be used to identify differences in gene expression between groups. An M/A plot, together with filtering and sorting on fold change, p-value and probe variance, allows quick identification of differentially expressed genes, and a GO category enrichment analysis shows categories with an elevated number of differentially expressed genes. For the interpretation of the results, GEPAT uses data from the Ensembl database and provides information about gene names, chromosomal location, GO categories and enzymatic activity for each probe on the chip. Gene interaction and association data from the STRING database, overlaid with analysis results such as fold-change values, can be used to find functionally related genes. The enzymatic information for the genes is used to overlay analysis results onto KEGG pathway maps, giving an overview of the regulation of metabolic pathways. To check for chromosomal aberrations, a chromosome overview can be generated, showing analysis results and optional CGH data on the chromosome set. For further investigation of gene functions, a tree view of the Gene Ontology graph has been implemented. For easy usage of GEPAT we provide an application-like environment in the web browser: drop-down menus and dialog windows provide the look and feel of a desktop application. Computationally intensive analysis tasks are directed to our cluster network, allowing a large number of simultaneous users. GEPAT is freely accessible for academic or non-profit users at http://bioapps.biozentrum.uni-wuerzburg.de/GEPAT/.
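
To illustrate the group-comparison step in code (GEPAT itself uses a linear-model-based t-test; the Welch t-test and thresholds below are simplifying assumptions), a minimal sketch could be:

# Minimal sketch: per-probe differential expression between two sample groups,
# filtered on fold change and p-value. GEPAT uses a linear-model-based t-test;
# a two-sample Welch t-test is used here for simplicity.
import numpy as np
from scipy import stats

def differential_probes(expr_a, expr_b, min_abs_log2fc=1.0, max_p=0.05):
    """expr_a, expr_b: (n_probes, n_samples) log2 expression matrices for the
    two groups. Returns indices of probes passing both filters."""
    log2fc = expr_a.mean(axis=1) - expr_b.mean(axis=1)
    _, p = stats.ttest_ind(expr_a, expr_b, axis=1, equal_var=False)
    keep = (np.abs(log2fc) >= min_abs_log2fc) & (p <= max_p)
    return np.flatnonzero(keep)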


34. David Orlando, Steven B. Haase, Alexander Hartemink, Edwin S. Iversen, Philip Benfey. Duke University, 1015 Sycamore St, Durham, NC 27707, USA. [ PDF ]

A model for determining cell-cycle distributions in synchrony/release experiments

Short Abstract: Synchrony/release protocols are commonly used to examine dynamic processes during the cell cycle. Quantitative time-series measurements are drawn from populations that lose synchrony over time. Each measurement represents a convolution of values over cell cycle states. We describe a model of dynamic population synchrony loss that helps resolve these effects.


Long Abstract: In order to understand the mechanisms that regulate cell division, the ideal method would be to measure dynamic processes in a single cell as it progresses through the cell division cycle. However, accurate quantitative measurements (such as transcript or protein levels) often require the collection of multiple cells. Thus, large populations of cells are often synchronized in a discrete cell cycle phase, and then released from synchrony. Time-series measurements are then made on the synchronous population of cells as they progress through the cell cycle. Unfortunately, synchronization is an imprecise art; populations are never completely synchronous, they tend to lose synchrony over time, and their temporal progression rarely matches from experiment to experiment. This asynchrony causes two distinct problems. First, any observed value is a convolution of values from cells distributed across different cell cycle phases, and this convolving effect increases as the time course progresses. Second, because the kinetics of progression of populations varies from experiment to experiment, it is very difficult to compare values across time points from multiple experiments. We have developed a mathematical model of a population after release from synchrony to help resolve these two issues. The budding yeast, S. cerevisiae, has been widely used as a model system for modeling cell cycle dynamics in synchronous populations. A variety of methods have been established for synchronizing yeast cell populations in various cell cycle phases. Moreover, the cell cycle position of individual yeast cells can be readily determined by monitoring landmark events, and because these events can be observed for single cells within a population, they are not subject to the convolution effects that stem from population-level measurements. For example, a distinct morphological landmark, bud emergence, correlates with the transition from G1 into S-phase and can be monitored by light microscopy. Cells remain budded until the completion of mitosis and cytokinesis, when cells re-enter G1. Thus, the passage of cells from G1 (unbudded) into S-phase (budded), or from M (budded) into G1 phase (unbudded), can be measured by plotting the percentage of budded cells (budding index) in a population as a function of time. The cell cycle position of individual cells in a population can also be determined by measuring DNA content by flow cytometry. Haploid yeast cells in G1 contain a single copy of the genome (1C). As cells enter S-phase they begin to synthesize a new copy of the genome and thus contain DNA contents between 1C and 2C. Cells that have completed S-phase but have not undergone cytokinesis (G2 and M) contain a 2C DNA content. Thus, by measuring the DNA content of individual cells in a population by flow cytometry, the distribution of cells in G1, S-phase, G2, and M can be determined. Although some information about population distributions can be learned by examining budding and DNA content, the resolving power of these methods is limited. Budding indices can only distinguish whether cells are in G1 (unbudded) or in S-phase, G2, or M (budded). Measuring DNA content can identify cells in all of the different phases, but cannot distinguish between cells in sub-compartments of G1, G2 or M (e.g. early vs. late G1, or G2 vs. M). However, these measurements can be used to develop mathematical models that predict distributions with significantly higher resolution.
Here we present a robust new mathematical model for determining the cell cycle distributions of synchronized populations over time. Our model utilizes bud emergence at the G1/S border to learn the distributions. We are able to capture multiple sources of synchrony loss, including intrinsic cell-cycle rate differences between individual cells in the population and asynchrony effects related to the synchrony procedure and experimental conditions. The model also explicitly describes reproduction within the population via a branching process construction, allowing the determination of cell cycle distributions after cell division as well as providing a mechanism to capture the asynchrony induced by asymmetric cell division. The model is fully parametric, can be expressed in closed form and is conceptually easy to extend or incorporate into hierarchical models that combine data across experiments and/or involve multiple data types. We show that this model can be fit to observed data generated under a variety of conditions. We also show that our model has predictive ability by generating simulated flow cytometry data for a time course fit using budding data. This model can mathematically compensate for some of the shortcomings of synchrony/release experiments by allowing the comparison of data across multiple synchrony/release experiments, and by learning functions that allow true quantitative values to be extracted despite the convolving effects of asynchrony.
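
For intuition only, a simple Monte Carlo sketch of a population losing synchrony and the resulting budding index is given below; this is not the authors' closed-form parametric model, and the distributions, parameters and the omission of cell division are all simplifying assumptions.

# Simple Monte Carlo illustration (not the authors' model): cells start near the
# beginning of G1 with some spread, cycle at slightly different rates, and are
# counted as budded while their cycle position lies in [bud_start, 1).
# Cell division and asymmetric budding are ignored in this sketch.
import numpy as np

def budding_index(times, n_cells=10000, period=90.0, bud_start=0.3,
                  pos_sd=0.05, rate_sd=0.1, seed=0):
    """times: minutes after release; returns the fraction of budded cells."""
    rng = np.random.default_rng(seed)
    pos0 = rng.normal(0.0, pos_sd, n_cells)              # initial cycle position
    rate = rng.normal(1.0, rate_sd, n_cells) / period    # cycles per minute
    return [float(np.mean(((pos0 + rate * t) % 1.0) >= bud_start)) for t in times]

# Example: budding index every 10 minutes over two cell cycles after release.
# print(budding_index(np.arange(0, 181, 10)))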


35. Grácia Maria Soares Rosinha, Fabiane Gonçalves de Souza, Flábio Ribeiro de Araújo, Cleber Oliveira Soares, Cláudio Roberto Madruga, Carina Elisei de Oliveira, Bárbara Csordas. Embrapa Beef Cattle, BR 262 Km 04, Campo Grande, Mato Grosso do Sul 79002-970, Brazil [ PDF ]

Construction of Suicidal Plasmids Used to Knock Out Genes of Brucella abortus

Short Abstract: Brucella spp. are found in association with a large number of wild and domesticated animals. The objective of this work was the construction of suicidal plasmids that could be used to knock out Brucella genes encoding molecules whose functions are related to virulence factors in Brucella abortus.


Long Abstract: Brucella spp. are found in association with a large number of wild and domesticated animal species, in which they can cause persistent infection and abortion. Zoonotic transmission of the organism to humans can lead to a chronic febrile disease known as brucellosis or Malta fever. The Brazilian situation in the control of infectious diseases of animals requires improvements in order to reach international standards and to increase animal production. Some of these diseases, besides putting the health of cattle at risk, can represent a public health problem, as is the case of brucellosis, which is caused by Brucella, a facultative intracellular bacterium. This disease is responsible for estimated losses to Brazilian cattle production of 32 million dollars per year, damages that justify efforts to find an effective vaccine for its control. Recently, the type IV secretion system, encoded by the virB operon, was characterized in intracellular bacteria; it is responsible for molecule transport and is related to virulence factors of Brucella. The objective of this work was the construction of suicidal plasmids that could be used to knock out genes encoding molecules whose functions are related to virulence factors in Brucella abortus. They were constructed by amplifying the virB4 and virB10 genes, interrupting them with a kanamycin (kan) cassette, and ligating them into a plasmid that is suicidal for Brucella. A PCR product of 2496 bp containing the virB4 gene was amplified using primers VirB4EcoF 5' - GGC GAA TTC ATG GGC GCT CAA TCC AAA - 3' and VirB4KpnR 5' - GGC GGT ACC TCA CCT TCC TGT TGA TTT and ligated to pBluescript KS (+) (Stratagene) to generate the plasmid pBlue:virB4. The resulting plasmid was linearized with SalI and ligated to a 1.3 kb SalI fragment containing a kanamycin resistance cassette to generate the plasmid pBlue:virB4::Kan. Another PCR product, of 1167 bp, containing the virB10 gene was amplified using primers VirB10EcoF 5' - GGC GAA TTC ATG ACA CAG GAA AAC ATT - 3' and VirB10HindR 5' - GGC AAG CTT TCA CTT CGG TTT GAC ATC - 3' and ligated to pBluescript KS (+) to generate the plasmid pBlue:virB10. The resulting plasmid was linearized with MfeI and ligated to a 1.3 kb EcoRI fragment containing a kanamycin resistance cassette to generate the plasmid pBlue:virB10:Kan. These genes were chosen based on the fact that their absence from the genome of B. abortus prevents mutant strains from replicating in macrophages and from surviving in mice. Among the possible vaccine approaches studied against B. abortus, a widely used strategy is the generation of genetically attenuated strains by knocking out genes supposedly involved in virulence. The construction of suicidal plasmids is an important molecular biology tool; its application makes possible the development of genetically modified vaccine strains, which provide several advantages in the control of infectious diseases. Supported by FUNDECT.