OpenHelix
NCBI was created in 1988 and has maintained the GenBank database for years. They also provide many computational resources and data retrieval systems for many types of biological data. As such they know all too well how quickly the data that biologists collect has changed and expanded. As uses for various data types have been [...]
Tatusova, T., Karsch-Mizrachi, I., & Ostell, J. (1999) Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 15(7), 536-543. DOI: 10.1093/bioinformatics/15.7.536
Barrett, T., Clark, K., Gevorgyan, R., Gorelenkov, V., Gribov, E., Karsch-Mizrachi, I., Kimelman, M., Pruitt, K., Resenchuk, S., Tatusova, T.... (2011) BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Research. DOI: 10.1093/nar/gkr1163
Sayers, E., Barrett, T., Benson, D., Bolton, E., Bryant, S., Canese, K., Chetvernin, V., Church, D., DiCuccio, M., Federhen, S.... (2011) Database resources of the National Center for Biotechnology Information. Nucleic Acids Research. DOI: 10.1093/nar/gkr1184
In today’s tip I’d like to introduce you to the Cancer Genome Workbench, or CGWB. The workbench gathers cancer information from a wide variety of projects including Johns Hopkins University and GlaxoSmithKline Cancer Cell Line Genomic Profiling Data, NCI’s Therapeutically Applicable Research to Generate Effective Treatment (TARGET), NHGRI’s Tumor Sequencing Project (TSP), The Cancer Genome Atlas (TCGA), and the Sanger Center’s COSMIC initiative and presents the cumulative data as high-level summary visualizations. The CGWB’s genome-browser view is built on a UCSC Genome Browser backbone, for power and flexibility.
I noticed an announcement in the May 7th Nature Signaling Gateway Update email that the NCI-Nature Pathway Interaction Database – May Update was featuring a bioinformatics primer on The Cancer Genome Workbench. The primer is great & goes into much more detail about the Cancer Genome Workbench than I will be able to in this quick tip. I strongly suggest that you check out both the primer and the workbench. When I went over to the workbench to explore, I quite honestly was a bit taken aback by the complexity of the displays – the amount of data presented in their summary visualizations is somewhat intense.
I hope that in my tip movie I will be able to convince you that the small investment you will need to make to get acclimated to their images is well worth the amount of data you will quickly understand how to analyze. The views are so data rich that they take a bit of adjusting to – there is very little labeling (to keep displays as clean as possible) and information is provided via pop-up messages as you scroll over the display. Once I got past the intensity of the displays, I was really amazed by the scope of data visualized in CGWB displays – data on every chromosome & gene over multiple datasets/experiments, in one 2D image. As the NCI primer says, cancer is complex – really complex. Being able to see such ‘big picture’ views as those provided by the Cancer Genome Workbench is a really powerful analysis aid. I for one am impressed with this resource, which is why I’ve chosen to feature it today.
In my 5 minute tip I was only able to show you the briefest of glimpses of the CGWB landscape and heatmap views. I was not able to show you the details of either view, including a hyperlinked list of genes with the highest mutation frequencies. Nor was I able to show you the full scope of other views, which include genome browser views (based on the UCSC Genome Browser, as I mentioned earlier), correlation plots, protein domain views, 3D visualizations, as well as next-gen and trace sequence views. Check out figure 1 of the bioinformatics primer to see examples of those.
I’ve added a citation to the original CGWB publication. It was published in 2007, and so does not cover all the current functions of the workbench. I still think reading it will help give you an idea of the workbench, because it goes into the goals and background of the CGWB in more depth than the primer does; the primer is much more up-to-date and focuses on the workbench’s functionality. In the paper you can also read how the authors used the workbench to analyze three public datasets, and see how it expanded their research findings.
All in all, I think the Cancer Genome Workbench is an amazing resource for cancer research. Be sure to check out the tip movie, the primer, the original CGWB publication and especially the CGWB! Thanks for joining us for this week’s tip.
Zhang, J., Finney, R., Rowe, W., Edmonson, M., Yang, S., Dracheva, T., Jen, J., Struewing, J., & Buetow, K. (2007). Systematic analysis of genetic alterations in tumors using Cancer Genome WorkBench (CGWB). Genome Research, 17(7), 1111-1117. DOI: 10.1101/gr.5963407
More and more disease-causing mutations are being identified in exonic splicing regulatory sequences (ESRs). These disease effects can result from ESR mutations that cause exon skipping in functionally diverse genes. In today’s tip I’d like to introduce you to a tool designed to detect exon variants that modulate splicing. The tool is named SKIPPY and has been developed and is maintained by groups in the Genomic Functional Analysis research section of the NHGRI.
At the end of the post I cite a very well-written paper describing the development of SKIPPY, as well as the background on why the tool was developed. I won’t have time to go into all those details, but if you are interested the paper is freely available from Genome Biology. The site also has nice, clear documentation and example inputs – which I will use as my examples. Splicing can be modulated in a variety of ways, including the loss or gain of exonic splicing enhancers (ESEs) or silencers (ESSs). Variants accomplishing either of those are referred to as splice-affecting genome variants, or SAVs. Not all of the abbreviations are explained on the results page, as you will see in the tip, but all are explained in detail in the SKIPPY publication, and the ‘Methods and Interpretations‘ and ‘Quick Reference and Tutorial‘ areas of the site.
I first found the tool because it was mentioned in a nice review entitled “Using Bioinformatics to predict the functional impact of SNVs“, which is a paper that reviews mechanisms by which point mutations can affect function, describes many of the algorithms and resources available & provides some sage advice. I’ll say more about it in a later post. For now, check out the tip & the SKIPPY resource, and if you use the site please let us know what you think.
Woolfe, A., Mullikin, J., & Elnitski, L. (2010). Genomic features defining exonic variants that modulate splicing. Genome Biology, 11(2). DOI: 10.1186/gb-2010-11-2-r20
Cline, M., & Karchin, R. (2010). Using bioinformatics to predict the functional impact of SNVs. Bioinformatics. DOI: 10.1093/bioinformatics/btq695
Back in April I happened to mention that we (OpenHelix) were writing a paper on informal sources of bioinformatics education (in a Friday SNPets item) and we were asked to announce when the paper came out. Well, we got word late last week that the article has been published. The article appears in a special issue of Briefings in Bioinformatics that is devoted to bioinformatics education. I’m not sure if all the articles in the issue are available yet, but it looks like several are in the journal’s Advanced Access area. Bioinformatics education is an area (obviously) that OpenHelix cares deeply about & we are anxiously awaiting our copies of the full issue so we can read all the articles, but I digress…
The title “OpenHelix: bioinformatics education outside of a different box” was a cool suggestion from one of the article’s reviewers – my original title was much tamer (ok, more boring). Regardless of the final title, what we wanted to do in the article is to discuss informal sources of bioinformatics education. By education we do mean acquiring applicable information that allows a researcher to operate within the field of bioinformatics. By informal we mean outside of traditional, credit based classes and degrees. Essentially we provide a bit of the knowledge and know-how that we’ve gathered over years of working with hundreds of resources, thousands of workshop attendees, and countless online contacts about where a researcher, or librarian, or whoever can turn for various informational needs in the field of bioinformatics.
Our contention is that not everyone needs to program in order to manage and manipulate their biological data these days. There are SO many fine publicly available databases, algorithms, tools and more that it is just a matter of awareness and training for anyone to be able to reformat and analyze their personal data sets. We maintain that:
…bioinformatics education needs to do a minimum of four things:
1. raise awareness of the available resources
2. enable researchers to find and evaluate resource functionality
3. lower the barrier between awareness and use of a resource
4. support the continuing educational needs of regular resource users
In the paper we walk through each of these – we first describe example needs associated with the point, and then cover possible informal resources that meet the needs. The article includes tables of resources with links to them, and many, many references. We really hope that it is a very useful resource in the field of bioinformatics education. I am already looking forward to contributing to the next special education issue, both to hone my writing skills and to extend the information we can provide readers. Please do comment, email, whatever and let us know about the resources that you use, what you learned from the article, etc. Oh, here’s the citation info:
Williams, J., Mangan, M., Perreault-Micale, C., Lathe, S., Sirohi, N., & Lathe, W. (2010). OpenHelix: bioinformatics education outside of a different box. Briefings in Bioinformatics. DOI: 10.1093/bib/bbq026
In this week’s tip I’m going to introduce you to a suite of motif discovery tools, and show you (briefly) how to use one of the tools. The MEME suite is a comprehensive collection of tools for analysis of both protein and DNA motifs. As described on the MEME Suite homepage, or in the citation that I reference below, this set of tools allows one to use as much or as little of the suite as meets their research needs. A user can initially find motifs with either the MEME or GLAM2 algorithm. The original motif discovery algorithm created by the developers, MEME, finds ungapped motifs within DNA or protein sequences. GLAM2 specializes in the discovery of gapped motifs. The motifs found with either of these tools can then flow directly into the downstream tools of the suite for further analysis.
There are three different tools that can be used to search a sequence database for motifs. MAST and FIMO use different algorithms but both use MEME output or ungapped motifs to search sequence databases. GLAM2SCAN uses the gapped motif output from GLAM2 to search sequence databases. There is also the tool, TOMTOM, which allows you to compare your motif to a database of motifs to find any matches. The GOMO tool finds gene ontology terms that are associated with genes regulated by a motif, to add functional information about a motif.
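To make the idea of searching sequences with an ungapped motif a bit more concrete, here is a minimal Python sketch. It is not how MAST or FIMO are implemented (each has its own scoring and statistics); it simply slides a log-odds matrix along a sequence and reports the windows that score above a cutoff. The count matrix, the uniform background, the pseudocount and the threshold are all made-up values for illustration.

```python
import math

# Hypothetical counts for a 4-column ungapped DNA motif (not real MEME output).
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 0, 9, 1],
    "G": [1, 9, 0, 0],
    "T": [0, 0, 0, 7],
}


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn per-column base counts into log2 odds against a uniform background."""
    width = len(counts["A"])
    matrix = []
    for col in range(width):
        total = sum(counts[base][col] for base in "ACGT") + 4 * pseudocount
        column = {}
        for base in "ACGT":
            freq = (counts[base][col] + pseudocount) / total
            column[base] = math.log2(freq / background)
        matrix.append(column)
    return matrix


def scan(sequence, matrix, threshold):
    """Score every window of the sequence and keep those at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with ambiguous characters
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits


MATRIX = log_odds_matrix(COUNTS)
# Each hit is (0-based offset, matched window, score).
print(scan("TTAGCTGG", MATRIX, threshold=4.0))
```

With these toy numbers only the window AGCT at offset 2 clears the cutoff. A real search would also need to handle the reverse strand and attach a significance estimate to each score, which is where the dedicated tools earn their keep.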
All of these tools together create a comprehensive, unified site for the discovery and analysis of sequence motifs. Researchers can begin with unaligned sequences and use the MEME suite of tools to find motifs and obtain: aligned motifs, annotated sequences, or annotated motifs. This suite has been thoughtfully designed to allow you to find motifs with MEME and GLAM2 and then easily – with just a click of a button – perform further analysis with MAST, FIMO, GLAM2SCAN, TOMTOM and GOMO.
I cannot begin to show the utility of the whole suite in this short tip, but if you are doing motif discovery, alignment or analysis I’d suggest that you check out these tools for yourself. If you are interested in further details on MEME, you can check out our MEME Suite tutorials, check out the documentation on the site (it is clear & pretty comprehensive), or check out their paper in the database issue of the journal Nucleic Acids Research. The paper is well written & provides a nice overview of how data can flow through the suite, as well as some details on each specific tool in the suite.
Bailey, T., Boden, M., Buske, F., Frith, M., Grant, C., Clementi, L., Ren, J., Li, W., & Noble, W. (2009). MEME SUITE: tools for motif discovery and searching. Nucleic Acids Research, 37(Web Server). DOI: 10.1093/nar/gkp335
In this week’s tip I’d like to introduce you to CircuitsDB, which describes itself as:
“…a database where transcriptional and post-transcriptional (miRNA mediated) network information is fused together in order to propose and recognize non trivial regulatory combinations. “
I found out about the database from the BioMed Central article “CircuitsDB: a database of mixed microRNA/transcription factor feed-forward regulatory circuits in human and mouse“, which I cite below. I had already been thinking about miRNAs because I am slated to update our miRBase tutorial in the near future and have been reading/catching up on the latest in the field. The CircuitsDB paper by Olivier Friard et al does a really nice job of quickly and clearly laying out the background of the project – how transcription factors have long been studied for their transcriptional regulation of protein-coding genes involved in any manner of pathways, including those of disease. It goes on to describe that the study of microRNAs, or miRNAs, is a newer field studying the post-transcriptional regulatory effects of miRNAs on protein-coding genes and their functions. Current efforts are moving to integrate the two areas of research to create more complete, and admittedly more complex, regulatory views of protein-coding genes and their effects on disease and other pathways.
The developers of CircuitsDB also very clearly describe how they have mined, analyzed and connected data from several top databases – many of which we have tutorials on, such as OMIM, miRBase, Ensembl and others – in order to create feed-forward regulatory loops, or FFLs, of TFs, affected miRNAs and ultimately affected protein-encoding genes. The image at the right is from their original paper: “Genome-wide survey of microRNA–transcription factor feed-forward regulatory circuits in human” (cited below), which reported the development of the computational framework for the mixed miRNA/TF Feed-Forward regulatory circuits that are freely available through the CircuitsDB web interface. This original paper is available for free, with registration to RSC Publishing, and provides a detailed description of their original development, as well as access to several supplemental files.
Essentially networks linking transcription factors and affected genes, miRNAs and affected genes, and transcription factors and miRNAs were painstakingly connected through an ab-initio oligo analysis. Support was then gained for the connections by analyzing enriched GO terms, disease connections, and previously-known connections found in other specialized resources. The CircuitsDB interface offers multiple tools. The main tool (FFL) is what I show in this tip & is the tool that searches for the networks diagrammed above. The MYC FFL is an impressive “curated database of miRNA mediated Feed Forward Loops involving MYC as Master Regulator”, and includes information on the direction of regulation, loop participants, evidence levels and more. The Transcriptional network tool allows a user to search with either a miRNA & find its regulating TF, or search with a TF & find regulated genes or miRNAs. The Post-transcriptional network tool is similar, but allows searches for a miRNA or gene to find regulated genes or regulating miRNA, respectively. So check out the tip & then check out CircuitsDB – enjoy!
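As a rough illustration of what searching for such circuits involves, the sketch below looks for feed-forward loops in three small edge lists. It assumes the simplest reading of a mixed FFL: a TF regulates a miRNA, and the TF and the miRNA both regulate the same target gene. The node names and edges are placeholders, not CircuitsDB data, and the real pipeline adds the oligo analysis and supporting evidence described above.

```python
def find_feed_forward_loops(tf_to_mirna, tf_to_gene, mirna_to_gene):
    """Return sorted (tf, mirna, gene) triples that close a feed-forward loop."""
    tf_gene_edges = set(tf_to_gene)

    # Index each miRNA's target genes for quick lookup.
    targets_by_mirna = {}
    for mirna, gene in mirna_to_gene:
        targets_by_mirna.setdefault(mirna, set()).add(gene)

    loops = []
    for tf, mirna in tf_to_mirna:
        for gene in targets_by_mirna.get(mirna, ()):
            # The loop closes only if the TF also regulates the gene directly.
            if (tf, gene) in tf_gene_edges:
                loops.append((tf, mirna, gene))
    return sorted(loops)


# Placeholder edges for illustration only (regulator, target).
TF_TO_MIRNA = [("TF_A", "miR_x"), ("TF_B", "miR_y")]
TF_TO_GENE = [("TF_A", "gene_1"), ("TF_A", "gene_2"), ("TF_B", "gene_3")]
MIRNA_TO_GENE = [("miR_x", "gene_1"), ("miR_y", "gene_2")]

print(find_feed_forward_loops(TF_TO_MIRNA, TF_TO_GENE, MIRNA_TO_GENE))
```

Here only TF_A, miR_x and gene_1 form a complete loop; TF_B reaches gene_2 through miR_y but has no direct edge to it, so that path is not reported.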
References:
Friard, O., Re, A., Taverna, D., De Bortoli, M., & Corá, D. (2010). CircuitsDB: a database of mixed microRNA/transcription factor feed-forward regulatory circuits in human and mouse. BMC Bioinformatics, 11(1), 435. DOI: 10.1186/1471-2105-11-435
Re, A., Corá, D., Taverna, D., & Caselle, M. (2009). Genome-wide survey of microRNA–transcription factor feed-forward regulatory circuits in human. Molecular BioSystems, 5(8), 854. DOI: 10.1039/B900177H
A few weeks ago a commenter asked me to compare IMG (Integrated Microbial Genomes) to the UCSC Microbial Genome browser. I’ve been exploring & thinking since then & am going to give a very brief comparison of those two resources in today’s tip & I’ll expand the comparison to other resources here in the text of this post.
Each of these resources could (and does in many cases) have an hour long tutorial devoted to it so I will only be able to give the briefest of overviews in this 5 minute tip, but I think (hope) it will be enough to get you thinking and exploring. The way I see things, comparing these two resources is kind of like comparing apples and aardvarks: they start with the same thing – namely microbial genome information from NCBI’s RefSeq database – but after that they are very different organisms.
* The UCSC Microbial Genome browser includes archaea, bacteria and archaeal virus genomes, and is based on a slightly modified version of the UCSC Genome Browser system, which is an amazingly powerful browser that we know and love here at OpenHelix. On their homepage they describe the resource as:
The UCSC Microbial Genome Browser is a window on the biology of more than 300 microbial species from Bacteria and Archaea domains. Basic gene annotation is derived from NCBI Genbank/RefSeq entries, with overlays of sequence conservation across multiple species, nucleotide and protein motifs, non-coding RNA predictions, operon predictions, and other types of bioinformatic analyses. In addition, we display available gene expression data, and soon, high-throughput RNA sequencing. Direct contributions of functional genomic data or bioinformatic analyses are welcome.
Information is presented as ‘tracks’ aligned along the sequence of the genome. These tracks can be hidden, expanded and customized to display exactly the information a researcher is interested in, in the exact format that the researcher would like it. The database contains hundreds of microbial genomes and information from these genomes can be retrieved and analyzed by intersecting datasets using UCSC’s powerful Table Browser resource. We have multiple free tutorials on the general UCSC Genome Browser which would be applicable to using the UCSC Microbial Browser.
* The UCSC Archaeal Genome Browser is another microbial resource based on the general UCSC Genome Browser & displays information very similarly to that of the UCSC Microbial Genome Browser, but the two offer differences in the information tracks and species available. We just created a full tutorial on this resource in July of this year & the homepage has been updated significantly since then, so this group must really be active! (Watch for an announcement of our updated tutorial in the near future…)
* The Integrated Microbial Genomes resource is from the Joint Genome Institute, and also obtains its sequences from NCBI’s RefSeq database, as well as from their own sequencing efforts. It contains sequence data on archaea, bacteria, eukaryotes (for comparative purposes), viruses and plasmids. To quote the IMG homepage:
The Integrated Microbial Genomes (IMG) system ( Nucleic Acids Research, 2010, Vol. 38) serves as a community resource for comparative analysis and annotation of all publicly available genomes from three domains of life in a uniquely integrated context…
… The IMG user interface (see User Interface Map) allows navigating the microbial genome data space along its three key dimensions (genes, genomes, and functions), and groups together the main comparative analysis tools.
However, rather than presenting vast amounts of information aligned along a sequence, this resource aims to provide researchers with state-of-the-art technology for microbial comparative analyses. They provide an array of tools for finding precise sets of genomes, genes or functions according to various characteristics. The researcher is then able to use abundance profilers, functional alignment tools, analysis carts and more to compare the items within the set to one another. Visualizations present the comparative information beautifully and clearly, but are not flexible in the way that UCSC browser displays are through their track controls.
* The Integrated Microbial Genomes with Microbiome samples (IMG/M) resource is a Data Management & Analysis System that is related to IMG (which I bet you guessed) that specializes in the unique issues surrounding metagenomic sequences. It currently contains metagenome data on 133 microbiomes, specialized tools for metagenomic analyses, and all of the IMG data for reference in comparisons.
* The Comprehensive Microbial Resource, or CMR, from the J. Craig Venter Institute is another general microbial resource, with archaeal, bacterial, and viral genomes. Its genome browser and comparative functions put it somewhere between the UCSC Microbial Browser & IMG, but with a real emphasis on the ability to easily download lists of genes, evidence or genomic elements as well as sequences, etc. that result from your analyses.
* The Complete Microbial Genomes database is from NCBI and offers an extensive collection of data, resources and tools for prokaryotic genomic analysis. Data and tools are organized into three major tables, including Organism info, Complete genomes, and Genomes in progress. Sequence information is available for over 1000 archaeal and bacterial genomes and utilizes NCBI’s extensive resources to provide extensive linkout options to additional information. Complete Microbial Genomes is one of NCBI’s many Entrez Genome Projects.
These resources are just a few of the many general microbial resources publicly available to researchers. Then there are species specific resources such as EcoliHub, subject specific resources such as MiST (Microbial Signal Transduction Database), and resources associated with specific projects such as the Human Microbiome Project (HMP) resources: Data Analysis and Coordination Center (DACC) for the Human Microbiome Project (HMP), the NIH Human Microbiome Project (HMP) Roadmap Project, and the HMP NIH Intramural Skin Microbiome Consortium (NISMC) data at dbGaP, which Mary heard about at a recent meeting. I think I’ll leave exploration of these specialized projects to another day though.
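Coming back to the Table Browser for a moment: ‘intersecting datasets’ boils down to asking which features in one track overlap features in another. The sketch below is only an illustration of that idea, not UCSC’s implementation; the feature names and coordinates are invented, and I am assuming 0-based, half-open intervals like those used in BED files.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def intersect_tracks(track_a, track_b):
    """Keep features from track_a that overlap any feature in track_b.

    Each feature is a (start, end, name) tuple on the same sequence.
    """
    kept = []
    for a_start, a_end, a_name in track_a:
        if any(overlaps(a_start, a_end, b_start, b_end)
               for b_start, b_end, _ in track_b):
            kept.append((a_start, a_end, a_name))
    return kept


# Invented example tracks: gene annotations and predicted motif sites.
GENES = [(100, 500, "gene_1"), (800, 1200, "gene_2"), (1500, 1800, "gene_3")]
MOTIFS = [(450, 460, "motif_a"), (1200, 1210, "motif_b")]

print(intersect_tracks(GENES, MOTIFS))
```

Note that motif_b starts exactly where gene_2 ends, so under the half-open convention the two do not overlap and only gene_1 is kept. For genome-sized tracks you would sort the features or use an interval index rather than comparing every pair.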
References:
UCSC Microbial Genome Browser and UCSC Archaeal Browser:
Schneider, K. (2006). The UCSC Archaeal Genome Browser. Nucleic Acids Research, 34(90001). DOI: 10.1093/nar/gkj134
IMG:
Markowitz, V., Chen, I., Palaniappan, K., Chu, K., Szeto, E., Grechkin, Y., Ratner, A., Anderson, I., Lykidis, A., Mavromatis, K.... (2009) The integrated microbial genomes system: an expanding comparative analysis resource. Nucleic Acids Research, 38(Database). DOI: 10.1093/nar/gkp887
Markowitz, V., Ivanova, N., Szeto, E., Palaniappan, K., Chu, K., Dalevi, D., Chen, I., Grechkin, Y., Dubchak, I., Anderson, I.... (2007) IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Research, 36(Database). DOI: 10.1093/nar/gkm869
Davidsen, T., Beck, E., Ganapathy, A., Montgomery, R., Zafar, N., Yang, Q., Madupu, R., Goetz, P., Galinsky, K., White, O.... (2009) The comprehensive microbial resource. Nucleic Acids Research, 38(Database). DOI: 10.1093/nar/gkp912
Sayers, E., Barrett, T., Benson, D., Bolton, E., Bryant, S., Canese, K., Chetvernin, V., Church, D., DiCuccio, M., Federhen, S.... (2009) Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 38(Database). DOI: 10.1093/nar/gkp967
The team here at OpenHelix has recently updated our sponsored tutorials on two excellent structural biology resources, the RCSB Protein Data Bank (PDB) and the PSI-Nature Structural Biology Knowledgebase (PSI SBKB). Because the tutorials are sponsored by these resources they are free for anyone to view and download in full. You can access our training materials for the resources at our RCSB PDB landing page, or our PSI SBKB landing page. I’m very happy with both tutorial suites, so please check them out.
As my personal celebration for these releases I have been reading a variety of articles showing the scope of how far our abilities to analyze protein structures have come. The first article is one that Mary pointed me to a while back, which discusses the infancy of bioinformatics, entitled “The Roots of Bioinformatics in Protein Evolution” by RF Doolittle (cited below, as are all articles mentioned). In this wonderful perspective Dr. Doolittle describes a time when DNA sequencing was unimaginable and protein sequencing was laborious, slow, and yet so new that each day was full of excitement as one more amino acid was identified. It is a revealing glimpse at a research era gone by – to quote Doolittle, “Science as an endeavor thrives on obsolescence.” – and mentions the contributions of Margaret Dayhoff, who Mary has blogged about.
The next historical article that I read was entitled “The Early Years of Retroviral Protease Crystal Structures” by M Miller (freely available on PMC). As you can tell from the title, this covers a time more recent than the Doolittle article, when protein crystallization studies were possible. Dr. Miller traces the X-ray crystal studies of retroviral proteases at NCI-Frederick in the late 1980s and early 1990s, and she describes how chemical synthesis of HIV1-PR was critical to obtaining enough protein for crystallization and how its crystal structure (deposited into the PDB archive and therefore freely available for all researchers to study) was invaluable for the design of inhibitors of HIV1-PR as anti-AIDS drugs.
I’ve also been perusing more recent papers that highlight how protein structures can aid biological investigations. These include: “Structure of mammalian AMPK and its regulation by ADP“, “Bioinformatics analysis of disordered proteins in prokaryotes“, “Crystal structure of inhibitor of κB kinase β” and others. It would also be fun to attend “The 25th Annual Meeting of the Groups Studying the Structures of AIDS-Related Systems and Their Application to Targeted Drug Design” to learn more, but alas I will not be in the area at the time of the meeting. As I’ve posted before, I am a geneticist by education. To me, seeing the development of protein studies (through the historical reviews) and the studies currently occurring in the field of structural biology, combined with the amazing offerings available freely through both the RCSB PDB and the PSI SBKB, really does feel like an appropriate, and enjoyable, celebration for the completion of our tutorial updates. Let me know what you think about them, when you get a chance!
References:
Berman, H. (2000). The Protein Data Bank. Nucleic Acids Research, 28(1), 235-242. DOI: 10.1093/nar/28.1.235
Berman, H., Westbrook, J., Gabanyi, M., Tao, W., Shah, R., Kouranov, A., Schwede, T., Arnold, K., Kiefer, F., Bordoli, L., Kopp, J., Podvinec, M., Adams, P., Carter, L., Minor, W., Nair, R., & Baer, J. (2009). The protein structure initiative structural genomics knowledgebase. Nucleic Acids Research, 37(Database). DOI: 10.1093/nar/gkn790
Doolittle, R. (2010). The Roots of Bioinformatics in Protein Evolution. PLoS Computational Biology, 6(7). DOI: 10.1371/journal.pcbi.1000875
Miller, M. (2010). The early years of retroviral protease crystal structures. Biopolymers, 94(4), 521-529. DOI: 10.1002/bip.21387
In today’s tip I want to make you aware of a tool that I think will help researchers to present their own data and publications in an accurate and universally searchable way. I learned of the resource (UCSDBioLit) through an article in one of my recent BioMed Central article alert emails. This resource allows authors to mark-up their own publications with XML tags AS THEY WRITE their papers. This will allow faster and more accurate semantic searching of their research.
A huge problem in science today is the ability to quickly search the vast literature base and to accurately and efficiently find the data that you are interested in. Here at OpenHelix we focus on ways of effectively and efficiently getting information out of public databases and resources, but at the other end of the process is the ability for scientific knowledge to be curated into those resources. We have featured biocurators and the phenomenal work that they do several times in the past, but it is work that never ends and can be very labor intensive. It often involves an initial triaging of a field’s literature, some level of automatic information gathering, and then careful manual effort on the part of scientists at the resource to gather and present the information through their site. I know from personal experience that the process of reading a paper, clarifying research details with an author, and then presenting that information to the author’s satisfaction can be a very long & labor intensive process, for both the curator AND the original author.
For years there has been discussion of ‘expert curation’ in which experts in the field author review or summary pages in a resource, or community curation jamborees, etc. And there have been fruits from many of these efforts, but in general participation is low. But who is more of an expert on the research being published than the author? If authors could/would mark up their own papers during the publication process, not only could they be assured that it would be accurate but they would help make their research universally searchable without the lag required for searchability through a specific resource. Thus far document mark-up has not been an easy process and has largely been deemed ‘not worth the effort’ for the level of attribution/recognition affiliated with it.
The BioMed Central article does a nice job of outlining and discussing many of these issues. It cites many other efforts and resources, explains their motivation and the implementation of their software. A nice feature of the tool is that there are interoperability features, and a real commitment to conforming with existing standards of practice. The article also presents an appendix of resource addresses of other groups involved in semantic searching and literature publication. I especially like this quote from the paper:
The Word add-in presented here will assist authors in this effort using community standards and by making it possible for the author of the document, the absolute expert on the content, to do so during the authoring process and to provide this information in the original source document.
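To make the idea of author-side mark-up a bit more concrete, here is a minimal Python sketch of wrapping recognized terms in XML tags. It is only an illustration and not how the Word add-in itself works: the `TERMS` table, the `EX:` identifiers, the `<term>` element and the `mark_up` function are all hypothetical names invented for this example, and a real tool would draw on actual ontologies.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical term-to-identifier table (keys must be lowercase).
# A real tool would load terms from an actual ontology instead.
TERMS = {
    "kinase": "EX:0001",
    "protein binding": "EX:0002",
}


def mark_up(text, terms=TERMS):
    """Wrap each known term in a <term id="..."> element; escape everything else."""
    # Try longer terms first so multi-word terms win over shorter overlaps.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Copy the untagged text before this match, XML-escaped.
        parts.append(escape(text[last:match.start()]))
        term_id = terms[match.group(1).lower()]
        parts.append(
            "<term id=%s>%s</term>" % (quoteattr(term_id), escape(match.group(1)))
        )
        last = match.end()
    # Copy whatever follows the final match.
    parts.append(escape(text[last:]))
    return "".join(parts)


print(mark_up("The kinase shows protein binding & more."))
```

The point of the sketch is simply that the tags travel with the text: once terms carry identifiers in the source document, a search tool can match on the identifier rather than guessing from the wording.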
You can also find brief tutorials on using the tool at SciVee: Word Add-in for Ontology Recognition Tutorial (1 of 4): Install Process
As a note, literature mark-up and enabling are currently an active area – Mary found another literature handling resource and paper as well: Check out the tip, the articles & the tools. Tell me what you find/think. Thanks! (OH, and Happy St. Patty’s to ya!)
UCSDBioLit Reference:
Fink, J., Fernicola, P., Chandran, R., Parastatidis, S., Wade, A., Naim, O., Quinn, G., & Bourne, P. (2010) Word add-in for ontology recognition: semantic enrichment of scientific literature. BMC Bioinformatics, 11(1), 103. DOI: 10.1186/1471-2105-11-103
I’ve got a few news items regarding IMG, or Integrated Microbial Genomes, from the DOE Joint Genome Institute. The first item is that their Sept 2010 release occurred this week. IMG is now on version 3.2, has updated features and a bunch of new/revised genomes. I’ve begun updating our tutorial & will let you know when that is released. It’s not the craziest level of tool changes that I’ve seen from this group, but dang, they SURE don’t rest on their laurels! They are constantly changing and improving their interface and database.
If you are involved in microbial research and haven’t already checked out this powerful resource, I strongly suggest that you do. We’ve been training on this resource since 2006 and really believe in its value, which seems to increase with each of their releases. Mary & Trey presented an IMG workshop at NIH recently and it was surprising how many of their researchers were not aware of IMG. We hear that pretty often and it is too bad, because it has so much to offer the microbial community and others as well.
The second item is that IMG has an annotation tool specifically designed for undergraduate education. Iddo Friedberg describes this as ‘Way cool’ in a recent tweet. The program/interface is named the “Integrated Microbial Genomes Annotation Collaboration Toolkit (IMG-ACT)”, and is somewhat associated with the “Interpret a GEBA Genome for Education” project from JGI. “GEBA” stands for Genomic Encyclopedia of Bacteria and Archaea. Both efforts are aimed at encouraging undergraduate research in microbial genome annotation, which might lead to the ‘alternative science career’ as a biocurator!
You can read all about the tool in their PLoS Biology article “Incorporating Genomics and Bioinformatics across the Life Sciences Curriculum”, or see a tour of the program/interface here. The tour makes the interface seem a bit clunky to me, but well thought out with lots of solutions to problems/issues often associated with undergraduate classes. The paper really provides a nice overview of the concept, collaborations, and initial outcomes of the 2008-2009 program.
Sign-ups are occurring for the 2011-2012 version of the program. The time frame is as follows:
Timeline to Participate:
1. Apply to be part of the 2011-2012 team by Monday, November 5, 2010 (download the application)
2. After acceptance, attend the workshop at the JGI (January 2011)
3. Implement in 2011-2012 academic year
This timeline can be seen at the bottom of this page.
IMG-ACT Reference:
Ditty, J., Kvaal, C., Goodner, B., Freyermuth, S., Bailey, C., Britton, R., Gordon, S., Heinhorst, S., Reed, K., Xu, Z., Sanders-Lorenz, E., Axen, S., Kim, E., Johns, M., Scott, K., & Kerfeld, C. (2010) Incorporating Genomics and Bioinformatics across the Life Sciences Curriculum. PLoS Biology, 8(8). DOI: 10.1371/journal.pbio.1000448
Bioinformatics analysis is a powerful technique applicable to a wide variety of fields, and the subject of many a blog post here at OpenHelix. I’ve had two particular bioinformatics articles on my desk for a couple of months now, waiting for me to be able to articulate my thoughts on them. They both offer great [...]... Read more »
Cline, M., & Karchin, R. (2010) Using bioinformatics to predict the functional impact of SNVs. Bioinformatics, 27(4), 441-448. DOI: 10.1093/bioinformatics/btq695
Vincent Shen. (2011) Mistaken identities in proteomics. BioTechniques. info:other/http://www.biotechniques.com/news/biotechniquesNews/biotechniques-312015.html
I’ve got tomatoes on my mind, so summer must be coming. It seems everywhere I turn, I’m being reminded of tomatoes. Not the grocery store/hot house kind, but the fresh farmer’s market/back yard-grown kind with juice and flavor so plentiful that it runs down your arms and onto the sunny porch floor where [...]... Read more »
Ruzicka, D., Barrios-Masias, F., Hausmann, N., Jackson, L., & Schachtman, D. (2010) Tomato root transcriptome response to a nitrogen-enriched soil patch. BMC Plant Biology, 10(1), 75. DOI: 10.1186/1471-2229-10-75
Not that long ago Mary posted on updates that occurred recently at the Allen Institute for Brain Science & hinted that there might be a tip coming about their cool 3D Brain Explorer tool – well, today’s the day! As Mary mentions in her post, the Allen Institute has created some phenomenal tools and detailed datasets for brain
From the Brain Explorer documentation, the Explorer allows users to:
* View a fully interactive version of the Allen Human Brain Atlas in 3D for two donors.
* View gene expression data in 3D: partially-inflated white matter surfaces are colored by gene expression values of nearby samples.... Read more »
Lau, C., Ng, L., Thompson, C., Pathak, S., Kuan, L., Jones, A., & Hawrylycz, M. (2008) Exploration and visualization of gene expression with neuroanatomy in the adult mouse brain. BMC Bioinformatics, 9(1), 153. DOI: 10.1186/1471-2105-9-153
A few years ago I did a tip on the proteomic tools list available from the ExPASy site. You can still get to that list, but it is no longer being updated. Instead the entire ExPASy site has been updated and reorganized and is now the new Swiss Institute of Bioinformatics (SIB) Bioinformatics Resource Portal. [...]... Read more »
Gasteiger, E. (2003) ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Research, 31(13), 3784-3788. DOI: 10.1093/nar/gkg563
It is often beneficial to visit multiple biomedical databases or resources, even if they seem to provide overlapping information, because no two resources focus on the exact same information, or present it in exactly the same way. Instead of duplicating each other’s curation efforts, databases often link out to related information at other resources. You [...]... Read more »
The UniProt Consortium. (2009) The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Research, 38(Database). DOI: 10.1093/nar/gkp846
Rose, P., Beran, B., Bi, C., Bluhm, W., Dimitropoulos, D., Goodsell, D., Prlic, A., Quesada, M., Quinn, G., Westbrook, J.... (2010) The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Research, 39(Database). DOI: 10.1093/nar/gkq1021
Berman, H., Westbrook, J., Gabanyi, M., Tao, W., Shah, R., Kouranov, A., Schwede, T., Arnold, K., Kiefer, F., Bordoli, L.... (2009) The protein structure initiative structural genomics knowledgebase. Nucleic Acids Research, 37(Database). DOI: 10.1093/nar/gkn790
In today’s tip I will feature the data distribution summaries and their drill-down features, which you can see from many RCSB PDB searches. We are in the process of updating our full tutorial sponsored by the RCSB PDB team, and as part of that effort I’ve gotten to know and appreciate this new data presentation format. Over the last five years the RCSB PDB has really been working hard at redesigning their resource to be more easily accessed by a wide variety of users. Below you will find a recent citation from the group explaining all of their updates and the logic behind them. The paper is a good read, both because I will only have time to scratch the surface of the redesign & you’ll get the details there, and because the intro gives a great glimpse into what resources are dealing with in the way of ‘data deluge’. The increase in users AND data that the RCSB PDB has experienced over the last few years is mind boggling!
OK, back to the data distributions. To me these are really elegant ways of helping any user – PDB is by no means just for structural biologists – come to the RCSB PDB & quickly and easily access whole categories of interesting information and then drill down in detailed ways to access the specific structure or data that they are most interested in. For example, I could begin with a keyword search for something as general as ‘kinase’. This search retrieves over 4 thousand hits, which could be quite daunting, but at the top of the report results are displayed under categories such as Organism, Taxonomy, Experimental Method, SCOP classification and more. Subcategories under each of these categories let me know how many hits, for example, are a mixed Polymer type, are human hits, or are alpha and beta proteins. I can mouse over any subcategory title to find out the percent of hits in this category compared to all hits, or click on the title to further drill down the data distribution on just that subcategory of results. The distribution summaries are then updated to focus specifically on the distribution of THOSE data. Using these summaries is much more intuitive than any text description that I can muster.
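For readers who think in code, the counting and drill-down idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the RCSB PDB’s actual implementation or API: the `hits` records, their field names, and the `distribution` and `drill_down` functions are all made up for this example.

```python
from collections import Counter


def distribution(hits, category):
    """Return {subcategory: (count, percent of all hits)} for one category."""
    total = len(hits)
    if total == 0:
        return {}
    counts = Counter(hit[category] for hit in hits if category in hit)
    return {name: (n, 100.0 * n / total) for name, n in counts.items()}


def drill_down(hits, category, subcategory):
    """Keep only the hits that fall in the chosen subcategory."""
    return [hit for hit in hits if hit.get(category) == subcategory]


# Made-up search results; real records would carry many more fields.
hits = [
    {"id": "A", "organism": "Homo sapiens", "method": "X-ray"},
    {"id": "B", "organism": "Homo sapiens", "method": "NMR"},
    {"id": "C", "organism": "Mus musculus", "method": "X-ray"},
    {"id": "D", "organism": "Homo sapiens", "method": "X-ray"},
]

# Summary over all hits, then the same summary recomputed on one subcategory.
print(distribution(hits, "organism"))
human = drill_down(hits, "organism", "Homo sapiens")
print(distribution(human, "method"))
```

Clicking a subcategory title corresponds to the `drill_down` call here: the summaries are simply recomputed on the narrower set of hits, which is why the percentages change after each click.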
My advice? Check out the tip, then check out the data distribution summaries, drill-down utility, and all the other great features of the RCSB PDB & see how easy it is to find information on your favorite gene. Oh yeah, and be watching for us to release our full, free & newly updated tutorial on the RCSB PDB resource soon!
Rose, P., Beran, B., Bi, C., Bluhm, W., Dimitropoulos, D., Goodsell, D., Prlic, A., Quesada, M., Quinn, G., Westbrook, J., Young, J., Yukich, B., Zardecki, C., Berman, H., & Bourne, P. (2010) The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Research, 39(Database). DOI: 10.1093/nar/gkq1021
Most weeks our tip is a five-minute movie that quickly introduces you to a new resource, or a cool new function at an established resource. Occasionally we feature one of our full resource tutorials that is being made freely available through resource sponsorship of our training suite. In this week’s tip we provide access to one of our tutorials that is especially near and dear to our hearts. It is a World Tour of Genomics Resources in which we explore a variety of publicly-available biomedical, bioinformatics and bioscience databases and other resources.
This tutorial is quite different from our usual ones. Generally we focus on a specific software resource and describe step-by-step how to use its functions, such as how to do basic and advanced searches, how to understand and modify displays, and where to find specific types of data such as FASTA sequences. We even provide tips on ‘hidden features’ that even power users find useful and informative. This type of software training is absolutely critical.
But many people need an even earlier step: just the *awareness* that resources are available that might serve their needs. This tutorial fills that niche. We present a sampling of resources, all free to use, from each of 9 categories including: Analysis & Algorithms, Expression, Genome Browsers (for Eukaryotes and for Prokaryotes and Viruses), Genome Variation, Literature, Nucleotides, Pathways and Proteins. After the World Tour, which is the majority of the tutorial, we then describe how to use OpenHelix’s free search and learn portal to find bioscience resources most appropriate for your research needs. From this the tour transitions into a brief discussion of the format of our training materials and how to use them, and then ends with information about other learning resources that we provide.
This tutorial has been wildly popular whenever we’ve done it as a live seminar. At the NIH they actually had to lock the doors because we’d hit the capacity of the room, and people were turned away. In fact, it has been so popular that we decided to produce it as a full tutorial suite and release it as one of our free trainings so that anyone and everyone could learn about the breadth of great public software options available for free use.
In addition to this free tutorial, we have also published a paper entitled “OpenHelix: bioinformatics education outside of a different box” in a special issue of Briefings in Bioinformatics entitled “Special Issue: Education in Bioinformatics”. This paper describes a plethora of places where researchers can find informal educational material on publicly available bioinformatics resources. These come in a wide variety of formats, including lists of resources, journals that regularly feature tool descriptions, and eLearning resources such as the MIT OpenCourseWare effort. If you know of other such resources that aren’t covered in our tour or paper, comment & let us know about them – we love to learn as much as we love to teach!
Quick link to World Tour of Genomics Resources tutorial here.
Williams, J., Mangan, M., Perreault-Micale, C., Lathe, S., Sirohi, N., & Lathe, W. (2010) OpenHelix: bioinformatics education outside of a different box. Briefings in Bioinformatics, 11(6), 598-609. DOI: 10.1093/bib/bbq026
In this week’s tip I’d like to introduce you to VirusMINT. We found VirusMINT during our ‘regularly scheduled’ update of our Introductory tutorial on MINT, or the Molecular INTeraction database. We really like MINT for all the great interaction information they provide on a wide variety of species. When we saw they had a ‘virally focused’ database, we had to check it out.
It turns out that VirusMINT is really unique in that it shows interactions BETWEEN human and viral proteins, all on the same interaction map. PLUS, from the VirusMINT homepage description:
VirusMINT uses the PSI-MI standard and is fully integrated with the MINT database.
You can either search for any viral or human protein by entering either common names or database identifiers in the form in the left frame or display a complete viral interactome by pressing the corresponding button in the frame below.
I only had time to show you the most basic VirusMINT features in this short movie. After you watch it, be sure to head over to MINT & check out all their great features, which currently include 4 sister databases: MINT, HomoMINT (an inferred human network), Domino (a domain peptide interactions database) and VirusMINT. These four databases are a really nice protein interaction resource because each offers a clean set of information on important areas of protein interactions and all are integrated with one another. The MINT databases use PSI-MI standard formatting to capture curated protein interaction information from literature & direct user submissions. Not only are data integrated across the four databases, but each database also provides interactive viewers for visually displaying the data. Outputting and downloading the data is also possible.
For more details of their full functionality, consider checking out our full MINT tutorial (available through a subscription to our full database of training materials, currently on sale), purchasing individual access to the tutorial, or checking out the references listed below.
References:
Chatr-aryamontri, A., Ceol, A., Peluso, D., Nardozza, A., Panni, S., Sacco, F., Tinti, M., Smolyar, A., Castagnoli, L., Vidal, M., Cusick, M., & Cesareni, G. (2009) VirusMINT: a viral protein interaction database. Nucleic Acids Research, 37(Database). DOI: 10.1093/nar/gkn739
Chatr-aryamontri, A., Ceol, A., Palazzi, L., Nardelli, G., Schneider, M., Castagnoli, L., & Cesareni, G. (2007) MINT: the Molecular INTeraction database. Nucleic Acids Research, 35(Database). DOI: 10.1093/nar/gkl950
Ceol, A., Chatr Aryamontri, A., Licata, L., Peluso, D., Briganti, L., Perfetto, L., Castagnoli, L., & Cesareni, G. (2009) MINT, the molecular interaction database: 2009 update. Nucleic Acids Research, 38(Database). DOI: 10.1093/nar/gkp983
In today’s tip I will briefly introduce you to the beta version of the updated DGV resource. The Database of Genomic Variants, or DGV, was created in 2004 at a time early in the understanding of human structural variation, or SV, which is defined by DGV as genomic variation larger than 50bp. DGV has historically [...]... Read more »
Church, D., Lappalainen, I., Sneddon, T., Hinton, J., Maguire, M., Lopez, J., Garner, J., Paschall, J., DiCuccio, M., Yaschenko, E.... (2010) Public data archives for genomic structural variation. Nature Genetics, 42(10), 813-814. DOI: 10.1038/ng1010-813
In today’s tip I am going to feature a resource that I found recently. I’ve been updating our dbSNP tutorial, which Mary & Trey will be presenting at workshops in Morocco, and also our free PDB tutorial, which is sponsored by the RCSB PDB team. I have therefore been thinking about protein structures and small [...]... Read more »
Yang, J., Oh, S., Ko, G., Park, S., Kim, W., Lee, B., & Lee, S. (2010) VnD: a structure-centric database of disease-related SNPs and drugs. Nucleic Acids Research, 39(Database). DOI: 10.1093/nar/gkq957