Background: The detection of regulatory regions in candidate sequences is essential for the understanding of the regulation of a particular gene and the mechanisms involved. This paper proposes a novel methodology based on information theoretic metrics for finding regulatory sequences in promoter regions. Results: This methodology (SIGMA) has been tested on genomic sequence data for Homo sapiens and Mus musculus. SIGMA has been compared with different publicly available alternatives for motif detection, such as MEME/MAST, Biostrings (Bioconductor package), MotifRegressor, and previous work such Qresiduals projections or information theoretic based detectors. Comparative results, in the form of Receiver Operating Characteristic curves, show how, in 70 % of the studied Transcription Factor Binding Sites, the SIGMA detector has a better performance and behaves more robustly than the methods compared, while having a similar computational time. The performance of SIGMA can be explained by its parametric simplicity in the modelling of the non-linear co-variability in the binding motif positions. Conclusions: Sequence Information Gain based Motif Analysis is a generalisation of a non-linear model of the cis-regulatory sequences detection based on Information Theory. This generalisation allows us to detect transcription factor binding sites with maximum performance disregarding the covariability observed in the positions of the training set of sequences. SIGMA is freely available to the public at http://b2slab.upc.edu.
Background: The characterization of proteins in families and subfamilies, at different levels, entails the definition and use of class labels. When the adscription of a protein to a family is uncertain, or even wrong, this becomes an instance of what has come to be known as a label noise problem. Label noise has a potentially negative effect on any quantitative analysis of proteins that depends on label information. This study investigates class C of G protein-coupled receptors, which are cell membrane proteins of relevance both to biology in general and pharmacology in particular. Their supervised classification into different known subtypes, based on primary sequence data, is hampered by label noise. The latter may stem from a combination of expert knowledge limitations and the lack of a clear correspondence between labels that mostly reflect GPCR functionality and the different representations of the protein primary sequences.; Results: In this study, we describe a systematic approach, using Support Vector Machine classifiers, to the analysis of G protein-coupled receptor misclassifications. As a proof of concept, this approach is used to assist the discovery of labeling quality problems in a curated, publicly accessible database of this type of proteins. We also investigate the extent to which physico-chemical transformations of the protein sequences reflect G protein-coupled receptor subtype labeling. The candidate mislabeled cases detected with this approach are externally validated with phylogenetic trees and against further trusted sources such as the National Center for Biotechnology Information, Universal Protein Resource, European Bioinformatics Institute and Ensembl Genome Browser information repositories.; Conclusions: In quantitative classification problems, class labels are often by default assumed to be correct. Label noise, though, is bound to be a pervasive problem in bioinformatics, where labels may be obtained indirectly through complex, many-step similarity modelling processes. In the case of G protein-coupled receptors, methods capable of singling out and characterizing those sequences with consistent misclassification behaviour are required to minimize this problem. A systematic, Support Vector Machine-based method has been proposed in this study for such purpose. The proposed method enables a filtering approach to the label noise problem and might become a support tool for database curators in proteomics.
In the last decade, a great number of methods for reconstructing gene regulatory networks from expression data have been proposed. However, very few tools and datasets allow to evaluate accurately and reproducibly those methods. Hence, we propose here a new tool, able to perform a systematic, yet fully reproducible, evaluation of transcriptional network inference methods.
Our open-source and freely available Bioconductor package aggregates a large set of tools to assess the robustness of network inference algorithms against different simulators, topologies, sample sizes and noise intensities.
The benchmarking framework that uses various datasets highlights the specialization of some methods toward network types and data. As a result, it is possible to identify the techniques that have broad overall performances.
Fosso, B.; Santamaria, M.; Marzano, M.; Alonso-Alemany, D.; Valiente, G.; Donvito, G.; monaco, A.; Notarangelo, P.; Pesole, G. BMC bioinformatics Vol. 16:203, p. 1-11 DOI: 10.1186/s12859-015-0595-z Data de publicació: 2015-07-01 Article en revista
Background: Substantial advances in microbiology, molecular evolution and biodiversity have been carried out in recent years thanks to Metagenomics, which allows to unveil the composition and functions of mixed microbial communities in any environmental niche. If the investigation is aimed only at the microbiome taxonomic structure, a target-based metagenomic approach, here also referred as Meta-barcoding, is generally applied. This approach commonly involves the selective amplification of a species-specific genetic marker (DNA meta-barcode) in the whole taxonomic range of interest and the exploration of its taxon-related variants through High-Throughput Sequencing (HTS) technologies. The accessibility to proper computational systems for the large-scale bioinformatic analysis of HTS data represents, currently, one of the major challenges in advanced Meta-barcoding projects.; Results: BioMaS (Bioinformatic analysis of Metagenomic AmpliconS) is a new bioinformatic pipeline designed to support biomolecular researchers involved in taxonomic studies of environmental microbial communities by a completely automated workflow, comprehensive of all the fundamental steps, from raw sequence data upload and cleaning to final taxonomic identification, that are absolutely required in an appropriately designed Meta-barcoding HTS-based experiment. In its current version, BioMaS allows the analysis of both bacterial and fungal environments starting directly from the raw sequencing data from either Roche 454 or Illumina HTS platforms, following two alternative paths, respectively. BioMaS is implemented into a public web service available at https://recasgateway.ba.infn.it/ and is also available in Galaxy at http://galaxy.cloud.ba.infn.it:8080 (only for Illumina data).; Conclusion: BioMaS is a friendly pipeline for Meta-barcoding HTS data analysis specifically designed for users without particular computing skills. A comparative benchmark, carried out by using a simulated dataset suitably designed to broadly represent the currently known bacterial and fungal world, showed that BioMaS outperforms QIIME and MOTHUR in terms of extent and accuracy of deep taxonomic sequence assignments.
Proteins are the key elements on the path from genetic information to the development of life. The roles played by the different proteins are difficult to uncover experimentally as this process involves complex procedures such as genetic modifications, injection of fluorescent proteins, gene knock-out methods and others. The knowledge learned from each protein is usually annotated in databases through different methods such as the proposed by The Gene Ontology (GO) consortium. Different methods have been proposed in order to predict GO terms from primary structure information, but very few are available for large-scale functional annotation of plants, and reported success rates are much less than the reported by other non-plant predictors. This paper explores the predictability of GO annotations on proteins belonging to the Embryophyta group from a set of features extracted solely from their primary amino acid sequence.
Background: A number of software packages are available to generate DNA multiple sequence alignments (MSAs)
evolved under continuous-time Markov processes on phylogenetic trees. On the other hand, methods of simulating
the DNA MSA directly from the transition matrices do not exist. Moreover, existing software restricts to the
time-reversible models and it is not optimized to generate nonhomogeneous data (i.e. placing distinct substitution
rates at different lineages).
Results: We present the first package designed to generate MSAs evolving under discrete-time Markov processes on
phylogenetic trees, directly from probability substitution matrices. Based on the input model and a phylogenetic tree
in the Newick format (with branch lengths measured as the expected number of substitutions per site), the algorithm
produces DNA alignments of desired length. GenNon-h is publicly available for download.
Conclusion: The software presented here is an efficient tool to generate DNA MSAs on a given phylogenetic tree.
GenNon-h provides the user with the nonstationary or nonhomogeneous phylogenetic data that is well suited for
testing complex biological hypotheses, exploring the limits of the reconstruction algorithms and their robustness to
Background: The Distributed Annotation System (DAS) has proven to be a successful way to publish and share
biological data. Although there are more than 750 active registered servers from around 50 organizations, setting
up a DAS server comprises a fair amount of work, making it difficult for many research groups to share their
biological annotations. Given the clear advantage that the generalized sharing of relevant biological data is for the
research community it would be desirable to facilitate the sharing process.
Results: Here we present easyDAS, a web-based system enabling anyone to publish biological annotations with
just some clicks. The system, available at http://www.ebi.ac.uk/panda-srv/easydas is capable of reading different
standard data file formats, process the data and create a new publicly available DAS source in a completely
automated way. The created sources are hosted on the EBI systems and can take advantage of its high storage
capacity and network connection, freeing the data provider from any network management work. easyDAS is an
open source project under the GNU LGPL license.
Conclusions: easyDAS is an automated DAS source creation system which can help many researchers in sharing
their biological data, potentially increasing the amount of relevant biological data available to the scientific
Background: Typical evolutionary events like recombination, hybridization or gene transfer make necessary the use of phylogenetic networks to properly depict the evolution of DNA and protein sequences. Although several theoretical classes have been proposed to characterize these networks, they make stringent assumptions that will likely not be met by the evolutionary process. We have recently shown that the complexity of simulated networks is a function of the population recombination rate, and that at moderate and large recombination rates the resulting networks cannot be categorized. However, we do not know whether these results extend to networks estimated from real data.
Results: We introduce a web server for the categorization of explicit phylogenetic networks, including the most relevant theoretical classes developed so far. Using this tool, we analyzed statistical parsimony phylogenetic networks estimated from ~5,000 DNA alignments, obtained from the NCBI PopSet and Polymorphix databases. The level of characterization was correlated to nucleotide diversity, and a high proportion of the networks derived from these data
sets could be formally characterized.
Conclusions: We have developed a public web server, NetTest (freely available from the software section at http://
darwin.uvigo.es), to formally characterize the complexity of phylogenetic networks. Using NetTest we found that most statistical parsimony networks estimated with the program TCS could be assigned to a known network class. The level of network characterization was correlated to nucleotide diversity and dependent upon the intra/interspecific levels, although no significant differences were detected among genes. More research on the properties of phylogenetic networks is clearly needed.
Background: Although methods based on highly abstract descriptions of protein structures, such as VAST and TOPS, can perform very fast protein structure comparison, the results can lack a high degree of biological
significance. Previously we have discussed the basic mechanisms of our novel method for structure comparison based on our TOPS+ model (Topological descriptions of Protein Structures Enhanced with Ligand Information). In this paper we show how these results can be significantly improved using parameter optimization, and we call the
resulting optimised TOPS+ method as advanced TOPS+ comparison method i.e. advTOPS+.
Results: We have developed a TOPS+ string model as an improvement to the TOPS [1-3] graph model by considering loops as secondary structure elements (SSEs) in addition to helices and strands, representing ligands as first class objects, and describing interactions between SSEs, and SSEs and ligands, by incoming and outgoing arcs, annotating SSEs with the interaction direction and type. Benchmarking results of an all-against-all pairwise
comparison using a large dataset of 2,620 non-redundant structures from the PDB40 dataset  demonstrate the biological significance, in terms of SCOP classification at the superfamily level, of our TOPS+ comparison method.
Conclusions: Our advanced TOPS+ comparison shows better performance on the PDB40 dataset  compared to our basic TOPS+ method, giving 90% accuracy for SCOP alpha+beta; a 6% increase in accuracy compared to the TOPS and basic TOPS+ methods. It also outperforms the TOPS, basic TOPS+ and SSAP comparison methods on the
Chew-Kedem dataset , achieving 98% accuracy.
Software Availability: The TOPS+ comparison server is available at http://balabio.dcs.gla.ac.uk/mallika/WebTOPS/.
Phylogenetic trees resulting from molecular phylogenetic analysis are available in Newick format from specialized databases but when it comes to phylogenetic networks, which provide an explicit representation of reticulate evolutionary events such as recombination, hybridization or lateral gene transfer, the lack of a standard format for their representation has hindered the publication of explicit phylogenetic networks in the specialized literature and their incorporation in specialized databases. Two different proposals to represent phylogenetic networks exist: as a single Newick string (where each hybrid node is splitted once for each parent) or as a set of Newick strings (one for each hybrid node plus another one for the phylogenetic network).
The standard we advocate as extended Newick format describes a whole phylogenetic network with k hybrid nodes as a single Newick string with k repeated nodes, and this representation is unique once the phylogenetic network is drawn or the ordering among children nodes is fixed. The extended Newick format facilitates phylogenetic data sharing and exchange, and also allows for the practical use of phylogenetic networks in computer programs and scripts. This standard has been recently agreed upon by a number of computational biologists, is already supported by several phylogenetic tools, and avoids the different drawbacks of using an a priori unknown number of Newick strings without any additional mark-up to represent a phylogenetic network.
The adoption of the extended Newick format as a standard for the representation of phylogenetic network is an important step towards the publication of explicit phylogenetic networks in peer-reviewed journals and their incorporation in a future database of published phylogenetic networks.
Phylogenetic networks are a generalization of phylogenetic trees that allow for the representation of evolutionary events acting at the population level, like recombination between genes, hybridization between lineages, and lateral gene transfer. While most phylogenetics tools implement a wide range of algorithms on phylogenetic trees, there exist only a few applications to work with phylogenetic networks, none of which are open-source libraries, and they do not allow for the comparative analysis of phylogenetic networks by computing distances between them or aligning them.
In order to improve this situation, we have developed a Perl package that relies on the BioPerl bundle and implements many algorithms on phylogenetic networks. We have also developed a Java applet that makes use of the aforementioned Perl package and allows the user to make simple experiments with phylogenetic networks without having to develop a program or Perl script by him or herself.
The Perl package is available as part of the BioPerl bundle, and can also be downloaded. A web-based application is also available (see availability and requirements). The Perl package includes full documentation of all its features.
Similarity of sequences is a key mathematical notion for Classification and Phylogenetic studies in Biology. It is currently primarily handled using alignments. However, the alignment methods seem inadequate for post-genomic studies since they do not scale well with data set size and they seem to be confined only to genomic and proteomic sequences. Therefore, alignment-free similarity measures are actively pursued. Among those, USM (Universal Similarity Metric) has gained prominence. It is based on the deep theory of Kolmogorov Complexity and universality is its most novel striking feature. Since it can only be approximated via data compression, USM is a methodology rather than a formula quantifying the similarity of two strings. Three approximations of USM are available, namely UCD (Universal Compression Dissimilarity), NCD (Normalized Compression Dissimilarity) and CD (Compression Dissimilarity). Their applicability and robustness is tested on various data sets yielding a first massive quantitative estimate that the USM methodology and its approximations are of value. Despite the rich theory developed around USM, its experimental assessment has limitations: only a few data compressors have been tested in conjunction with USM and mostly at a qualitative level, no comparison among UCD, NCD and CD is available and no comparison of USM with existing methods, both based on alignments and not, seems to be available.
We experimentally test the USM methodology by using 25 compressors, all three of its known approximations and six data sets of relevance to Molecular Biology. This offers the first systematic and quantitative experimental assessment of this methodology, that naturally complements the many theoretical and the preliminary experimental results available. Moreover, we compare the USM methodology both with methods based on alignments and not. We may group our experiments into two sets. The first one, performed via ROC (Receiver Operating Curve) analysis, aims at assessing the intrinsic ability of the methodology to discriminate and classify biological sequences and structures. A second set of experiments aims at assessing how well two commonly available classification algorithms, UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and NJ (Neighbor Joining), can use the methodology to perform their task, their performance being evaluated against gold standards and with the use of well known statistical indexes, i.e., the F-measure and the partition distance. Based on the experiments, several conclusions can be drawn and, from them, novel valuable guidelines for the use of USM on biological data. The main ones are reported next.
Masseroli, M.; Galati, O.; Manzotti, M.; Gibert, Karina; Pinciroli, G. BMC bioinformatics Vol. 6, num. suppl 4, p. 18-1-18-20 DOI: 10.1186/1471-2105-6-S4-S18 Data de publicació: 2005-11 Article en revista
Analysis of inherited diseases and their associated phenotypes is of great importance to gain knowledge of underlying genetic interactions and could ultimately give clinically useful insights into disease processes, including complex diseases influenced by multiple genetic loci. Nevertheless, to date few computational contributions have been proposed for this purpose, mainly due to lack of controlled clinical information easily accessible and structured for computational genome-wise analyses. To allow performing phenotype analyses of inherited disorder related genes we implemented new original modules within GFINDer http://www.bioinformatics.polimi.it/GFINDer/, a Web system we previously developed that dynamically aggregates functional annotations of user uploaded gene lists and allows performing their statistical analysis and mining.
New GFINDer modules allow annotating large numbers of user classified biomolecular sequence identifiers with morbidity and clinical information, classifying them according to genetic disease phenotypes and their locations of occurrence, and statistically analyzing the obtained classifications. To achieve this we exploited, normalized and structured the information present in textual form in the Clinical Synopsis sections of the Online Mendelian Inheritance in Man (OMIM) databank. Such valuable information delineates numerous signs and symptoms accompanying many genetic diseases and it is divided into phenotype location categories, either by organ system or type of finding.
Supporting phenotype analyses of inherited diseases and biomolecular functional evaluations, GFINDer facilitates a genomic approach to the understanding of fundamental biological processes and complex cellular mechanisms underlying patho-physiological phenotypes.