On the OpenHelix blog you will find a genomics resources news portal with daily postings about genomics and bioinformatics resources, genomics news and research, science and more. Our goal is to keep you, the researcher, informed about the overwhelming amount of genomics data out there and how to access it through the tools, databases and resources that are publicly available to you.
The Lancet paper, Clinical assessment incorporating a personal genome, has held my fascination this weekend (yes, I read it at the beach). Mary posted Friday and again Saturday on the paper and related NPR segment. It feels to me to be a seminal paper, though I do agree with Daniel at Genetic Future that there is a lot there we still don’t know. A large portion of the variation is in non-coding regions, and thus predictions and propensities are hard to come by with the available analysis. In fact, as he pointed out, many of the coding region variations have little information as to their effect on disease. I would add also that even if we get to that holy grail of $1,000 to sequence a personal genome, this kind of extensive analysis would still be time- and cost-prohibitive for the vast majority of sequenced genomes.
Yet, as with all early steps in science and medicine, there are missing pieces, large gaps and huge efforts (think “space travel,” “computers,” “microwave ovens,” “internet”) that over time become inexpensive and commonplace (ok, so the first isn’t necessarily “inexpensive”). Sequencing genomes will become inexpensive before the analysis does, but both will come. And I think this paper is pointing to that future.
The other hurdle to large scale personal genomics I see (of course) is the understanding and use of the genomics and data resources. The authors use a large (and excellent, in my opinion) suite of genomics resources to obtain data and do their analysis. I’ll list them here with links in alphabetical order:
dbSNP (T)
GVS (T)
HapMap (T)
HGMD
OMIM (T)
PharmGKB
PolyPhen
PubMed (T)
SIFT
UniProt (T)
All of these resources have a wealth of data, but even then, that is a lot of analysis and familiarization that is needed with each tool. Each tool does have documentation and tutorials, and of course OpenHelix has tutorials on many of the ones mentioned (those with linked “T”s after the name). Still, this one analysis took a large number of tools and familiarization.
The paper does have a pretty good figure (figure 1) outlining the analysis process. For example, they SIFTed the genome to find gene-associated, non-synonymous, rare and novel, and disease-associated variations, and then examined those using dbSNP, HGMD, OMIM and PubMed to assess something like HFE2, which might have an association with Haemochromatosis. One of my quibbles with the paper, as is often the case with these papers, is that there isn’t a good methods ‘walk-through’ using something like Galaxy or Taverna, in a history or workflow, that would help reproduce the analysis.
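To make the filtering step above a little more concrete, here is a minimal Python sketch of that kind of variant triage. It is not the authors' pipeline: the `Variant` fields, the 5% frequency cutoff, and the gene names are all assumptions made up for illustration, and the rule (gene-associated and non-synonymous, then rare/novel or disease-associated) is just one reading of the figure.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str            # gene symbol, or "" if the variant is not in a gene
    consequence: str     # e.g. "synonymous" or "non-synonymous"
    allele_freq: float   # population allele frequency; 0.0 means novel
    known_disease: bool  # True if a disease database lists the variant


def prioritize(variants, max_freq=0.05):
    """Keep gene-associated, non-synonymous variants that are either
    rare/novel (frequency <= max_freq) or already disease-associated.
    The max_freq cutoff is an illustrative assumption."""
    kept = []
    for v in variants:
        if not v.gene:
            continue  # skip variants outside genes
        if v.consequence != "non-synonymous":
            continue  # skip changes that do not alter the protein
        if v.allele_freq <= max_freq or v.known_disease:
            kept.append(v)
    return kept


# Hypothetical toy data, not taken from the paper.
variants = [
    Variant("GENE_A", "non-synonymous", 0.001, False),  # rare
    Variant("GENE_B", "synonymous", 0.001, False),      # silent change
    Variant("", "non-synonymous", 0.0, False),          # not in a gene
    Variant("GENE_C", "non-synonymous", 0.30, True),    # common but disease-listed
    Variant("GENE_D", "non-synonymous", 0.30, False),   # common, no disease link
]

print([v.gene for v in prioritize(variants)])  # → ['GENE_A', 'GENE_C']
```

The variants that survive the filter are the ones you would then look up by hand in resources like dbSNP, HGMD, OMIM and PubMed.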
We also have a tutorial I’d like to point you to, one that walks through a similar process and teaches users the basics of that process. You can find this tutorial here; it’s free and publicly available. The tutorial walks the user through the analysis of a gene variation, in this case in CYP2C9, that affects an individual’s response to Warfarin. There is a similar variation (different gene, affects same drug response) in the paper. The tutorial uses the NIEHS SNPs site to get an overview of the variation including SIFT and PolyPhen predictions, then goes to the UCSC Genome Browser to find an overview of the region, walks through the dbSNP information and does a quick tag SNP analysis using GVS. That tutorial is only one very small step in what will have to be an immense education into genomics and genomics resources.
That is all to point out that the paper is a fascinating first step, and as a first step suggests the gaping holes we will have in bringing personal genomics to medicine.
Ashley, E., Butte, A., Wheeler, M., Chen, R., Klein, T., Dewey, F., Dudley, J., Ormond, K., Pavlovic, A., & Morgan, A. (2010) Clinical assessment incorporating a personal genome. The Lancet, 375(9725), 1525-1535. DOI: 10.1016/S0140-6736(10)60452-7
PLoS Biology reports today on WikiPathways. The paper, entitled “WikiPathways: Pathway editing for the people,” announces a new wiki for the ‘public curation’ of pathway data. The authors argue that
 The exponential growth of diverse types of biological data presents the research community with an unprecedented challenge to keep the flood of biological data as accessible, ...
Alexander Pico, Thomas Kelder, Martijn P van Iersel, Kristina Hanspers, Bruce R Conklin, & Chris Evelo. (2008) WikiPathways: Pathway Editing for the People. PLoS Biology, 6(7). DOI: 10.1371/journal.pbio.0060184
In most of software and database development the changes that are coming along all the time seem to be tweaks and polishes on the existing strategies. Every so often, though, there’s a big shift in the strategy or mechanism. This week the JBrowse paper I read made me realize that one is now firmly underway. Today’s tip of the week will introduce JBrowse, and here I’ll describe some of the reasons this is a game changer.
Skinner, M., Uzilov, A., Stein, L., Mungall, C., & Holmes, I. (2009) JBrowse: A next-generation genome browser. Genome Research, 19(9), 1630-1638. DOI: 10.1101/gr.094607.109
Today’s tip is on Genomicus. Genomicus is a great tool to visualize gene duplication, synteny and genome evolution. The search and display interfaces are quite straightforward, and there are lots of great features (viewing ancestral gene information, links out to resources, different views of phylogenies, etc.) in the tool. This video is only a short introduction. You can delve deeper into the tool with the help and documentation, including an 11-minute video.
There is also a recent (advance access) paper in the journal “Bioinformatics” that will give you a lot more detail on how the database and tool works and what is there.
Muffato, M., Louis, A., Poisnel, C., & Roest Crollius, H. (2010) Genomicus: a database and a browser to study gene synteny in modern and ancestral genomes. Bioinformatics. DOI: 10.1093/bioinformatics/btq079
You will also notice today the video is a SciVee embed. We are trying out a new way to post and share our tips. SciVee allows us to not only post on our blog, but for you to share the tip with others and also for scientists in the SciVee community to view the tips. This is only a test. We will be working with this for the next couple of weeks to find the best way to post and share. Soon, we hope to share these on Facebook and YouTube also. If the video is not high enough quality for you (SciVee and other video sharing sites by necessity reduce size), you can try out the entire mpeg4 version at this link.
by Mary in OpenHelix
A question on the blog last week got me going through my old posts, because I was sure that I had done one on a database of SNP effects on gene expression. But it turned out that it was only in my memory, and still in the draft posts for the blog….
I had come across the work [...]
Heinzen, E., Ge, D., Cronin, K., Maia, J., Shianna, K., Gabriel, W., Welsh-Bohmer, K., Hulette, C., Denny, T., & Goldstein, D. (2008) Tissue-Specific Genetic Control of Splicing: Implications for the Study of Complex Traits. PLoS Biology, 6(12). DOI: 10.1371/journal.pbio.1000001
NCBI was created in 1988 and has maintained the GenBank database for years. They also provide many computational resources and data retrieval systems for many types of biological data. As such they know all too well how quickly the data that biologists collect has changed and expanded. As uses for various data types have been [...]
Tatusova, T., Karsch-Mizrachi, I., & Ostell, J. (1999) Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 15(7), 536-543. DOI: 10.1093/bioinformatics/15.7.536
Barrett, T., Clark, K., Gevorgyan, R., Gorelenkov, V., Gribov, E., Karsch-Mizrachi, I., Kimelman, M., Pruitt, K., Resenchuk, S., Tatusova, T.... (2011) BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Research. DOI: 10.1093/nar/gkr1163
Sayers, E., Barrett, T., Benson, D., Bolton, E., Bryant, S., Canese, K., Chetvernin, V., Church, D., DiCuccio, M., Federhen, S.... (2011) Database resources of the National Center for Biotechnology Information. Nucleic Acids Research. DOI: 10.1093/nar/gkr1184
In today’s tip I’d like to introduce you to the Cancer Genome Workbench, or CGWB. The workbench gathers cancer information from a wide variety of projects including Johns Hopkins University and GlaxoSmithKline Cancer Cell Line Genomic Profiling Data, NCI’s Therapeutically Applicable Research to Generate Effective Treatment (TARGET), NHGRI’s Tumor Sequencing Project (TSP), The Cancer Genome Atlas (TCGA), and the Sanger Center’s COSMIC initiative and presents the cumulative data as high-level summary visualizations. The CGWB’s genome-browser view is built on a UCSC Genome Browser backbone, for power and flexibility.
I noticed an announcement in the May 7th Nature Signaling Gateway Update email that the NCI-Nature Pathway Interaction Database – May Update was featuring a bioinformatics primer on The Cancer Genome Workbench. The primer is great & goes into much more detail about the Cancer Genome Workbench than I will be able to in this quick tip. I strongly suggest you check out the primer and the workbench. When I went over to the workbench to explore, I quite honestly was a bit taken aback by the complexity of the displays – the amount of data presented in their summary visualizations is somewhat intense.
I hope that in my tip movie I will be able to convince you that the small investment you will need to make to get acclimated to their images is well worth it for the amount of data you will quickly learn to analyze. The views are so data rich that they take a bit of adjusting to – there is very little labeling (to keep displays as clean as possible) and information is provided via pop-up messages as you scroll over the display. Once I got past the intensity of the displays, I was really amazed by the scope of data visualized in CGWB displays – data on every chromosome & gene over multiple datasets/experiments, in one 2D image. As the NCI primer says, cancer is complex – really complex. Being able to see such ‘big picture’ views as those provided by the Cancer Genome Workbench is a really powerful analysis aid. I for one am impressed with this resource, which is why I’ve chosen to feature it today.
In my 5 minute tip I was only able to show you the briefest of glimpses of the CGWB landscape and heatmap views. I was not able to show you the details of either view, including a hyperlinked list of genes with the highest mutation frequencies. Nor was I able to show you the full scope of other views which include genome browser views (based on the UCSC Genome Browser, as I mentioned earlier), correlation plots, protein domain views, 3D visualizations, as well as next-gen and trace sequence views. Check out figure 1 of the bioinformatics primer to see examples of those.
I’ve added a citation to the original CGWB publication. It was published in 2007, and so does not cover all the current functions of the workbench, but I think reading it might help give you an idea of the workbench: it goes into the goals and background of the CGWB more than the primer does, while the primer is much more up-to-date and focuses on the functionality of the workbench. In this paper you can also read how the authors utilized the workbench to analyze three public datasets, and see how it expanded their research findings.
All in all, I think the Cancer Genome Workbench is an amazing resource for cancer research. Be sure to check out the tip movie, the primer, the original CGWB publication and especially the CGWB! Thanks for joining us for this week’s tip.
Zhang, J., Finney, R., Rowe, W., Edmonson, M., Yang, S., Dracheva, T., Jen, J., Struewing, J., & Buetow, K. (2007) Systematic analysis of genetic alterations in tumors using Cancer Genome WorkBench (CGWB). Genome Research, 17(7), 1111-1117. DOI: 10.1101/gr.5963407
by Jennifer in OpenHelix
For today’s tip, I would like to introduce you to the TDR Targets Database, which seeks “… to exploit the availability of diverse datasets to facilitate the identification and prioritization of drug targets in pathogens causing neglected diseases.” I found out about this database this past weekend as I was catching up on my [...]
Agüero, F., Al-Lazikani, B., Aslett, M., Berriman, M., Buckner, F., Campbell, R., Carmona, S., Carruthers, I., Chan, A., Chen, F.... (2008) Genomic-scale prioritization of drug targets: the TDR Targets database. Nature Reviews Drug Discovery, 7(11), 900-907. DOI: 10.1038/nrd2684
Yesterday a tweet to a great post came across the ethers, and ever since I read it I knew I had to write this post. Here’s the original nugget:
RT @ctitusbrown: (my) thoughts on data intensive science & workflows: http://bit.ly/tWXSnx
It is a post about why end users are not adopting workflows, which could really help them in this eScience world we find ourselves in as we keep moving forward with giant data sets and “big data” projects. It also makes some other points about what we need in workflows. We’re big fans of workflows and have talked about them in the past...
Goecks, J., Nekrutenko, A., Taylor, J., & The Galaxy Team. (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology, 11(8). DOI: 10.1186/gb-2010-11-8-r86
microRNAs have become a rich source of research as they probably have a huge effect on gene expression and disease. The human genome may encode over 1,000 miRNAs that target over half of our genes. They might be implicated in a lot of common diseases (which have not yet been picked up in GWAS studies?). They are a fascinating area of biology that has only come into its own in the last decade. As such, the number of databases to catalog miRNAs is large. Today’s tip is on a new one, RepTar, which is reported in the upcoming NAR database issue. The niche RepTar is attempting to fill is to make miRNA target predictions more comprehensive by including new research in the algorithm. This new research suggests there are more possible target sites than previously thought. As mentioned in the article,
Recently, the miRNA binding options were expanded further with the identification of ‘centered sites’, functional miRNA target sites that lack both perfect seed pairing and 3′-compensatory pairing and instead exhibit pairing with the target along 11–12 contiguous pairs at the center of the miRNA (4). While some algorithms relaxed the evolutionary conservation criterion (5–11) and/or offer also predictions of 3′-compensatory sites [e.g. (6,12,13)], few databases offer predictions of the whole repertoire of miRNA targeting patterns. Furthermore to date, no database lists genome-wide prediction of cellular targets of viral miRNAs. These miRNAs lack significant evolutionary conservation and their targets are not necessarily expected to be evolutionarily conserved. In addition, the few identified viral miRNA targets have shown both conventional seed binding and 3′-compensatory binding [e.g. (3,14)].
Here we present a database of genome-wide miRNA target predictions for mouse and human genes, based on the predictions of our novel target prediction algorithm, RepTar
I’ll leave the predictive value up to miRNA researchers, but I thought I’d introduce the site.
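For readers new to the idea, here is a small Python sketch of the conventional “seed” matching that the quote contrasts with the newer site types. It is not RepTar's algorithm, and it ignores centered sites and 3′-compensatory pairing entirely. The seed window (miRNA positions 2 to 8) is a common convention that I am assuming here, and both sequences are toy examples.

```python
def reverse_complement(rna):
    """Return the reverse complement of an RNA string (A, C, G, U only)."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(rna))


def seed_sites(mirna, utr, start=2, end=8):
    """Find 0-based positions in `utr` that pair perfectly with the miRNA seed.

    The seed is taken as miRNA positions start..end (1-based, inclusive).
    The 2..8 default is an assumed convention, not a RepTar setting.
    """
    seed = mirna[start - 1:end]
    target = reverse_complement(seed)  # what a perfect seed match looks like in the mRNA
    hits = []
    pos = utr.find(target)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(target, pos + 1)
    return hits


# Toy sequences for illustration only.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
utr = "AAACUACCUCAGGGCUACCUCUU"

print(seed_sites(mirna, utr))  # → [3, 14]
```

A tool aiming at the “whole repertoire” of targeting patterns would have to go well beyond this exact-match search, which is the point the authors are making.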
While I’m at it, allow me to list a few other miRNA sites from labs and institutes as far flung as China, Italy, Israel, Canada and the U.S. Perhaps someday I’ll do a comparison.
CircuitsDB, which Jennifer did a great tip of the week tutorial on.
miRBase, which we have a full-length tutorial on.
microRNA.org
HMDD
miRDB
tarBase
miRecords
PicTar, they have an annotation track for UCSC Genome Browser
miRNA2Disease
PuTmiR (in relation to transcription factors)
microRNAdb
two lists to catch some others: http://mirnablog.com/microrna-target-prediction-tools/ and http://www.ncrna.org/KnowledgeBase/link-database/mirna_target_database
Elefant, N., Berger, A., Shein, H., Hofree, M., Margalit, H., & Altuvia, Y. (2010) RepTar: a database of predicted cellular targets of host and viral miRNAs. Nucleic Acids Research. DOI: 10.1093/nar/gkq1233
by Mary in OpenHelix
So I’m all excited about the genome festival that I’m seeing, related to the publication of the new sequence version of corn. You can access the main paper in Science, and there’s a very neat diagram in figure 1 that is like looking across time at the sequence data and into the corn nebula. But the thing that cracked me up was this line from the abstract:
Nearly 85% of the genome is composed of hundreds of families of transposable elements, dispersed nonuniformly across the genome.
That means 85% of corn isn’t corn!! And what business do those elements have messing with the genomes?? I am told all the time that messing with plant genomes is wrong and unnatural. Heh.
For full coverage of the big news today I’ll point you to James and the Giant Corn (appropriately enough) who seems to be the CNN (Corn News Network) of 24-hour coverage of many aspects of the work.
I spent my morning looking over the PLoS Maize Special Collection papers, including the intriguing appetizer: 10 Reasons to be Tantalized by the B73 Maize Genome. But I spent longer looking at the CNVs and PAVs paper. I’ve been thinking about CNVs a lot lately, and was interested to see this covered in a non-mammalian species.
Figure 1 is a nice example of how to use VISTA for effective displays in comparative genomics. (If you haven’t used VISTA before you might check out our sponsored free tutorial on that–we are currently working with the VISTA team to update that with their new features too.)
There’s a really striking segment of chromosome 6 that appears to be present in one of the strains they examine and absent in the other (illustrated in figure 4). And it looks like it has genes that are expressed and active in the B73 strain. The ongoing investigation of that is pretty intriguing as well.
The structural variations are not evenly distributed across the genomes. Some places have large occurrences, and some are untouched. It’s clear that just in these two strains there’s a lot more structural diversity than in other species that have been examined:
In the human, rat, dog, mouse, macaque and chimpanzee genomes the average number of CNVs between two individuals is between 15 and 75 [43]–[48]. A high resolution study of eight human genomes [49] revealed only several hundred insertions and deletions, including CNV and PAV sequences, in the comparison of any two human genomes. In contrast, even after very stringent filtering we identified >3,700 CNV or PAV sequences that represent at least 2,000 events between these two maize genomes.
Emphasis mine. Plants are so much more flexible, apparently….
This is going to lead to some neat clues on heterosis (or hybrid vigor) as the research proceeds with these new tools. What a great time to be a plant scientist. There are some very exciting projects coming along with the tools of genomics.
What I couldn’t locate was any reference to a CNV database (like DGV or CHOP CNV) where you can examine the whole set. I’ll dig through the supplemental data to see if I can find out more on that. But I wanted to get this post out to celebrate the very nice work and collection of papers on this project. Congrats to the teams involved!
References:
Springer, N., Ying, K., Fu, Y., Ji, T., Yeh, C., Jia, Y., Wu, W., Richmond, T., Kitzman, J., Rosenbaum, H., Iniguez, A., Barbazuk, W., Jeddeloh, J., Nettleton, D., & Schnable, P. (2009) Maize Inbreds Exhibit High Levels of Copy Number Variation (CNV) and Presence/Absence Variation (PAV) in Genome Content. PLoS Genetics, 5(11). DOI: 10.1371/journal.pgen.1000734
Schnable, P., Ware, D., Fulton, R., Stein, J., Wei, F., Pasternak, S., Liang, C., Zhang, J., Fulton, L., Graves, T.... (2009) The B73 Maize Genome: Complexity, Diversity, and Dynamics. Science, 326(5956), 1112-1115. DOI: 10.1126/science.1178534
by Mary in OpenHelix
One of the most frequently asked questions we get when we are out doing workshops is: how do I find motifs in promoters, and what can I do with them to find more information? Just last Friday we were asked this again at the workshops we did at USC. So for this week’s tip of the week I’m going to show one of the tools I recommend for that purpose–Melina II. (I also recommended the MEME Suite and VISTA’s rVISTA features, but for this tip I’ll focus on Melina.)
Melina II is not a new tool; it’s been around for a while. But it’s been one of my favorites because of the way it combines several tools that I would otherwise have to access separately. And I like the graphical representation that it delivers for the motifs that are discovered.
As they say on their homepage, it’s a straightforward 3-step process: put in some sequences, choose motif finders and set parameters, and then run to see your results. And it is just that easy. You can go in and tweak all of the motif finder parameters if you like, but a default setting search will quickly get you started finding motifs from an input set of sequences.
You’ll have a graphical display of the motif location in the sequence panel at the top, but you can click on any of the colored discovered motifs to display the alignments, the sequence logo, or the weight matrix at the bottom. And from there you could also do a couple of searches of other resources as well to locate additional promoters that may carry your motif.
It’s such a quick and slick way to look for motifs it has long been one of my first choices for this kind of analysis. You can access each of the individual motif finder tools at their home sites as well, and there may be more features over there. But to get started this is a very nice choice.
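If the “weight matrix” view mentioned above is unfamiliar, this short Python sketch shows the basic idea behind it: count the bases at each position across a set of aligned motif instances, then read off a consensus. This is not what Melina II or its motif finders run internally, and the aligned sites are made-up examples.

```python
from collections import Counter


def count_matrix(sites):
    """Build a per-position base count table from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    # One Counter per alignment column; missing bases count as 0.
    return [Counter(site[i] for site in sites) for i in range(length)]


def consensus(matrix):
    """Pick the most frequent base in each column (ties go to A, C, G, T order)."""
    return "".join(max("ACGT", key=lambda base: column[base]) for column in matrix)


# Hypothetical aligned motif instances, for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]

matrix = count_matrix(sites)
print(matrix[2]["T"], matrix[2]["C"])  # → 3 1
print(consensus(matrix))               # → TATAAT
```

Real motif finders work the other way around, discovering the sites from unaligned promoters, but the matrix they report at the end summarizes the hits in roughly this form.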
Note: A couple of weeks ago the Melina II tool was down, because of server issues. We talked with the team and although it’s up and running on a backup server, some of the settings aren’t quite right yet. But I still wanted to discuss it because of answering that question from the workshop. So you can try it out, but check back at a later date for the full server with the correct settings.
Find Melina II here: http://melina2.hgc.jp/
Reference:
Okumura, T., Makiguchi, H., Makita, Y., Yamashita, R., & Nakai, K. (2007) Melina II: a web tool for comparisons among several predictive algorithms to find potential motifs from promoter regions. Nucleic Acids Research, 35(Web Server). DOI: 10.1093/nar/gkm362
++++++++++++++++++++++++++++++++
(We have just updated our full tutorial version of the Melina II tool, which is available for individual purchase or by subscription here.)
Galaxy started out as a very useful tool to do genomics research that was reproducible and sharable. One of my pet peeves in reading research papers that use genomic analysis or online genomics resources is the materials and methods sections. Often the methods and parameters used are mentioned only in a very cursory manner, if [...]
Goecks, J., Nekrutenko, A., Taylor, J., & The Galaxy Team. (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology, 11(8). DOI: 10.1186/gb-2010-11-8-r86
by Mary in OpenHelix
There is plenty of buzz out there for the big data biology projects–but usually the focus is the human data (with a few token model organisms thrown in). But this week plant researchers renewed the call for big plant data. I’m totally on board with that.
The 1000 Genomes project to obtain more human variation information [...]
Clark, R., Schweikert, G., Toomajian, C., Ossowski, S., Zeller, G., Shinn, P., Warthmann, N., Hu, T., Fu, G., Hinds, D.... (2007) Common Sequence Polymorphisms Shaping Genetic Diversity in Arabidopsis thaliana. Science, 317(5836), 338-342. DOI: 10.1126/science.1138632
Ossowski, S., Schneeberger, K., Clark, R., Lanz, C., Warthmann, N., & Weigel, D. (2008) Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Research, 18(12), 2024-2033. DOI: 10.1101/gr.080200.108
Weigel, D., & Mott, R. (2009) The 1001 Genomes Project for Arabidopsis thaliana. Genome Biology, 10(5), 107. DOI: 10.1186/gb-2009-10-5-107
by Mary in OpenHelix
Perusing my copy of Nature Genetics last week, I was flipping through the pages and noticed an unusual graphic. I looked at it a little closer and was convinced it was one of the Spirographs that I used to make as a kid. (Remember those? I always liked that….) I looked a little bit closer and realized it was somewhat more informative than the Spirographs I used to draw. This represented the relationships between genes, based on the literature. Hmmm….how did they do this, exactly?
The paper I was reading was Genetic variants at CD28, PRDM1 and CD2/CD58 are associated with rheumatoid arthritis risk by Raychaudhuri et al, which was interesting enough. I like to read the GWAS papers to see what the current techniques and strategies are, not only for the specific genes themselves. And this paper reported the strategy that they used to prioritize their SNPs, and that they used GRAIL to generate the data for this graphic of gene relationships. Check out Figure 1 for the strategy.
When I saw the name GRAIL I thought–huh….GRAIL is back with a new use? I thought that was…ah…retired…at this point. But this isn’t that GRAIL (http://compbio.ornl.gov/Grail-1.3/, Gene Recognition and Assembly Internet Link). This is a different GRAIL–the new one is Gene Relationships Among Implicated Loci. So I had to go and read that paper, which is Identifying Relationships among Genomic Disease Regions: Predicting Genes at Pathogenic SNP Associations and Rare Deletions by Raychaudhuri et al.
This new GRAIL is all about text mining. It is a tool that relies on statistical text mining of the literature for genes in a region and examines the relationships among those genes in the text. The focus in their case is disease regions, but there’s no reason that you couldn’t use it for a variety of other topics. As the authors state:
Given only a collection of disease regions, GRAIL uses our text-based definition of relatedness (or alternative metrics of relatedness) to identify a subset of genes, more highly related than by chance; it also assigns a select set of keywords that suggest putative biological pathways.
So you pull a set of genes out of the literature based on SNPs or locations of interest, and you can begin to assess what’s interesting in the set. Now, the tool makes a lot of assumptions that you should be aware of if you are going to use it. It assumes each region contains a single pathogenic gene. I’m not sure that’s always going to be the case, but for this tool as long as you know that, that’s a fair assumption. They suggest this helps to keep multigenic regions from dominating the analysis. Fair enough, but…what if that is the interesting aspect? Still–that’s ok as long as you know.
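To give a feel for what “text-based relatedness” with one gene per region could look like, here is a toy Python sketch. It is emphatically not GRAIL's method: GRAIL uses its own statistical text mining, while this uses plain word-count cosine similarity as a stand-in. The gene names, region names and “abstract” texts are all invented.

```python
import math
from collections import Counter


def word_vector(text):
    """Turn a text into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_gene_per_region(regions, texts):
    """For each region, pick the single gene whose text is most similar
    to the texts of genes in the *other* regions."""
    vectors = {gene: word_vector(text) for gene, text in texts.items()}
    picks = {}
    for region, genes in regions.items():
        others = [g for r, gs in regions.items() if r != region for g in gs]

        def score(gene):
            return sum(cosine(vectors[gene], vectors[other]) for other in others)

        picks[region] = max(genes, key=score)
    return picks


# Invented toy data: two regions, two candidate genes each.
texts = {
    "GENE_A": "cholesterol synthesis enzyme liver",
    "GENE_B": "olfactory receptor smell",
    "GENE_C": "cholesterol transport liver lipid",
    "GENE_D": "keratin hair structure",
}
regions = {"region1": ["GENE_A", "GENE_B"], "region2": ["GENE_C", "GENE_D"]}

print(best_gene_per_region(regions, texts))
# → {'region1': 'GENE_A', 'region2': 'GENE_C'}
```

Picking exactly one gene per region is where the single-pathogenic-gene assumption shows up: a region with two genuinely relevant genes would still only contribute one.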
In the paper they use validated SNPs from 4 different research areas:
SNPs associated with serum lipid levels: GRAIL finds genes in the cholesterol biosynthesis pathway.
SNPs associated with height; they identify pathways they consider plausible.
Crohn’s disease; they confirm associations that have been seen.
Schizophrenia–and here they used rare deletions as the items of interest; they find related genes, many highly enriched in the CNS. So this suggests the strategy may be useful not only for SNPs but for CNVs as well.
Their Figure 1 nicely summarizes the strategy.
One curious tweak of the data analysis was that they used the literature prior to December 2006, because right after that there was an onslaught of GWAS papers that would list a whole bunch of genes associated with regions that might be more tenuous still. I understand this in theory, but I imagine it also eliminates more current research on genes of interest from other methods too. I saw in the tool you could choose either pre-Dec 06 or a more up-to-date literature set. It would be useful to try both if you use GRAIL and keep that in mind.
Another point to keep in mind: some genes are just not found in the abstracts, and they mention that is an issue. So the genes you can examine are those that were in the abstracts, and were identified properly with nomenclature, spelling, etc. Text mining is cool, but has a lot of limitations around those aspects, and around the use of synonyms in general. It’s not just an issue for GRAIL, but for all text mining tools at this point.
They also devise a way to use Gene Ontology (GO) and some expression data in GRAIL as other “relatedness” metrics. You’ll find those available from the GRAIL tool as well.
They don’t show any spirographs in their figures in this first GRAIL paper. That one that drew me in was Figure 2 in the arthritis paper. So I went over to the software to try to generate these myself. The outcome at this point is a web page with text and links to UCSC Genome Browser, and Entrez Gene (from the individual genes and from the keyword list–keywords collect multiple Entrez Genes). I was a little surprised that the keyword link wasn’t to PubMed as well. Currently it doesn’t provide the graphic, but maybe that will come along over time. If it does I’ll be sure to mention it on the blog.
One final note on the paper: in the supplemental section they compare GRAIL to other tools in this arena. If you are interested in tools like we are here you may find some of them interesting as well. The tools are listed with URLs in Table S5, and the comparison outcome is in Text S1:
Prioritizer [2], Gene2Disease (G2D) [3,4,5], Commonality of Functional Annotation (CFA) [6], and Prospectr [7]. There were five supervised tools: Endeavour [8], GeneSeeker [9], SUSPECTS [10], TOM [11], and CANDID [12]
So check out GRAIL and see if you find gene relationships. But don’t forget those caveats about the genes not listed in the abstracts, or the literature coverage dates. The software can be found here: http://www.broad.mit.edu/mpg/grail/
I know it’s a beta. But I think it has a lot of potential to help people sift through the results they are getting from a variety of techniques. Check it out.
NOTE: you may find periods that you can’t run GRAIL because it puts a burden on the servers. You should try again during off hours if you are seeing problems with getting it to run. This happened to me during my testing of it last week.
The list of GWAS data I used to test GRAIL came from the NHGRI catalog, which we discussed here: List of GWAS studies. I tried the straight hair SNP list, and got a pretty interesting set of results that certainly included “epidermis” and “skin” as keywords, among other things.
++++++++++++ Citations ++++++++++++
Raychaudhuri, S., Plenge, R., Rossin, E., Ng, A., International Schizophrenia Consortium, Purcell, S., Sklar, P., Scolnick, E., Xavier, R., Altshuler, D.... (2009) Identifying Relationships among Genomic Disease Regions: Predicting Genes at Pathogenic SNP Associations and Rare Deletions. PLoS Genetics, 5(6). DOI: 10.1371/journal.pgen.1000534
Raychaudhuri, S., Thomson, B., Remmers, E., Eyre, S., Hinks, A., Guiducci, C., Catanese, J., Xie, G., Stahl, E., Chen, R.... (2009) Genetic variants at CD28, PRDM1 and CD2/CD58 are associated with rheumatoid arthritis risk. Nature Genetics, 41(12), 1313-1318. DOI: 10.1038/ng.479
Medland, S., Nyholt, D., Painter, J., McEvoy, B., McRae, A., Zhu, G., Gordon, S., Ferreira, M., Wright, M., & Henders, A. (2009) Common Variants in the Trichohyalin Gene Are Associated with Straight Hair in Europeans. The American Journal of Human Genetics, 85(5), 750-755. DOI: 10.1016/j.ajhg.2009.10.009
by Mary in OpenHelix
I’ve always been a fan of clever graphical displays that convey key data points. They can be so effective when done well. And from way back when I was first exposed to the logo-style histogram displays for conserved promoter sequences they’ve always struck me as particularly suited to this method. Usually for ease-of-use I’ve gone to WebLogo. It’s easy, and quick to use. But recently I learned of another tool that provides the logos and has a bit more to it. The iceLogo tool seems to be a useful option for these now as well, particularly suited to protein motif display.
A brief correspondence in a recent issue of Nature Methods introduced me to iceLogo. When I first tried it out, I had to download it and run it locally (and you can still do that). But there is also a web interface for this tool now, and that will be the focus of my tip of the week.
You can enter a set of sequences as a multiple sequence alignment, then you can choose the reference set of everything in Swiss-Prot or set your own background, and then visualize the sequence data with conserved positions in a couple of ways. You can see the larger letter sizes using several logo styles, or you can access a heat-map style display. You can download the images (although I often just screen shot things like this). I wish it had a clear button to easily swap out my items of interest, but that’s not a huge problem really. In this tip of the week movie I run their sample sequence to show you how it displays the data.
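For readers who want a feel for what sits behind the letter heights in a classic sequence logo, here is a minimal sketch of per-position information content in bits. This is the standard Shannon-entropy calculation used for traditional logos, not iceLogo's own statistic, which compares against a reference set and is described in their manual. The function name and the sample peptides are made up, and no small-sample correction is applied.

```python
import math
from collections import Counter

def column_information(sequences, alphabet_size=20):
    """Per-column information content in bits: log2(alphabet_size) - entropy."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    max_bits = math.log2(alphabet_size)
    info = []
    for i in range(length):
        # Count how often each residue occurs in this column.
        counts = Counter(s[i] for s in sequences)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        info.append(max_bits - entropy)
    return info

# Made-up aligned peptides: columns 1-2 are fully conserved, 3-4 are split evenly.
peptides = ["ARDK", "ARDE", "ARNK", "ARNE"]
info = column_information(peptides)
for position, bits in enumerate(info, start=1):
    print(f"position {position}: {bits:.2f} bits")
```

Fully conserved columns reach the maximum for a 20-letter alphabet, while a column split evenly between two residues loses exactly one bit.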
Their manual (PDF) goes into detail on the statistics and the theory. The Nature Methods paper is quite short–the real meat is in the manual. And there’s actually a neat feature to the manual PDF: you can click on Figure 1.4 to directly load up that example in the web interface and change it around if you want.
Check out iceLogo if you want to explore these types of visualizations for your data. Visit the iceLogo web interface here.
Just a final note: there is also a SOAP server that you could access with other tools as a web service that might be handy if you are setting up analysis pipelines in various situations.
++++++++++
Colaert, N., Helsens, K., Martens, L., Vandekerckhove, J., & Gevaert, K. (2009). Improved visualization of protein consensus sequences by iceLogo. Nature Methods, 6(11), 786-787. DOI: 10.1038/nmeth1109-786
by Mary in OpenHelix
At the recent (and excellent) Beyond the Genome 2010 conference, Len Pennacchio gave a talk about the VISTA Enhancer Browser that reminded me how much I have always liked this project. It’s the kind of project I’d do if I had a lab: it takes the computational data we’ve been accumulating + developmental biology bench techniques = cool new insights into the function of conserved regions of the genome that we previously didn’t know much about.
The foundation of the project is that we’ve got a number of species genomic sequences that we can compare–and the VISTA suite offers a number of ways to perform these types of comparative genomics analyses and provides really nice visualizations of the data (we’ve got a free tutorial sponsored by VISTA that you can watch to see how it works). You can see peaks of high conservation across multiple species, which suggest there’s something important going on in that region. But when they are outside of the gene region per se, it’s not always obvious what the sequence represents–but the idea is that they may be cis-regulatory elements. So the Enhancer Browser team clones out those regions, and hooks them to reporter constructs. The constructs are placed into mouse oocytes and then put into pseudopregnant mice, and the embryos are examined on day 11 to see if there is an interesting pattern of expression of the reporter construct. Now, these are subject to limitations: it’s one time point they are examining so earlier or later activity is not known. And it’s possible that integration of the construct has affected expression (in positive or negative ways). But they examine multiple embryos for each construct to work around that location effect.
This data is accumulated and becomes available in the Enhancer Browser. You can search by genes of interest to see if a region near your favorite gene has been examined. Or you can examine them by tissue/localization pattern if there’s a developmental time point you may be interested in. To get a quick sense of the kind of things you can find take a look at the handy Gallery set of images. There are various ways to search or browse the data. That’s what I’ll be introducing in the Tip movie this week.
But they also “enhanced” this project by adding another technique to the process. Beyond the computational identification of conserved regions, they also began to do ChIP-Seq to pull down sequences that are bound to the p300 protein in various tissues of the embryo. That’s illustrated nicely in Figure 1 of the second paper. They obtain the sequence of those pieces and put those into eggs as well, and the rest of the process is similar. So the starting point is different: these are protein-bound sequences to start with, from a given tissue. But it also seems to be identifying working elements that can influence spatial and temporal expression of the reporter constructs. They say it has increased their success in finding working elements by 5x to 16x.
So I think this is a great way to use computational techniques and bench work in a pretty-big-data way. It’s not easy to do the mouse benchwork part so it’s not quite as big as a pure sequencing foray. But it’s exactly the kind of project I’d design if I had access to a lab. I have a different topic I’d be interested in, but the same kinds of strategies would be useful for that as well.
Anyway–explore the Enhancer Browser to learn more about these possible regulatory elements.
References:
Visel, A., Blow, M., Li, Z., Zhang, T., Akiyama, J., Holt, A., Plajzer-Frick, I., Shoukry, M., Wright, C., Chen, F., Afzal, V., Ren, B., Rubin, E., & Pennacchio, L. (2009). ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature, 457(7231), 854-858. DOI: 10.1038/nature07730
Visel, A., Minovitsky, S., Dubchak, I., & Pennacchio, L. (2007). VISTA Enhancer Browser–a database of tissue-specific human enhancers. Nucleic Acids Research, 35(Database). DOI: 10.1093/nar/gkl822
As my family and I await our 23andMe kit to scan our genomes, family history has come back to the forefront of my thoughts. I used to be very fascinated by my own genealogy, and with adopted children, the concepts of descent, biology and culture have taken on adjusted meanings for me. It’s why we have a ‘family map’ instead of a ‘family tree’. The difference between our cultural genealogy and our genetic genealogy has become quite clear to me. Obtaining our family ancestry through these tests will bring a lot of these issues back into focus.
But there is a specific issue that is directly related to genomics, genomics tools and my family: same-gender headed household representation in pedigree and genealogy software. It’s non-existent or takes a difficult workaround to make it happen.
With the rising use of personal genomics data, there is a corresponding rise in the use of pedigree software for medical purposes and genealogy software for family history purposes. Neither of these handles non-traditional family structures well. I use ‘non-traditional’ lightly here, though, because even though same-gender headed households might be relatively new as a recognized family structure, the concept of family can be quite fluid across time and cultures. What is considered traditional and the ‘norm’ for ‘family’ in US culture today (nuclear families of two genders with children born to them) was obviously not always the case in the past, nor is it the case in contemporary cultures in other parts of the world.
A paper published last year entitled When Family Means More (or Less) Than Genetics, by Burns McGrath and Edwards, focuses on this inability of current tools to model family histories that aren’t within this norm. As they state:
One challenge in using family history as a health technology is that the geneticist or clinician defines family based on biology, whereas individuals often include those linked socially.
Genetic heritage and history is indeed important in determining disease susceptibilities, but ignoring or misunderstanding socially-defined kinship can lead to misdiagnosis, the lack of understanding of environmental influences and worse. Tools for modeling pedigrees must be able to flexibly model these family structures in order to be useful.
The researchers look at two groups and conclude that current tools are inadequate to model their family structures. Samoans were one group (Japanese-Americans the other):
When Samoan American participants were asked, “tell me about your family,” persons fulfilling social roles were described by that relationship. For example, an individual raised as a brother was identified as a brother whether or not there was a biological basis to the relationship. Similarly, individuals adopted in to or out of a family were described as the children of the family in which they were raised, not as offspring of the biological family. When further questioned, the participants could identify the biological link. But even when the biological relationship was known, the Samoan Americans reported family relationships based on social rather than biological ties.
They go into good detail on why this is a problem. They also, early in the paper, suggest modern American society is changing. America is already one of the most ‘adopting’ nations in the world. And, as the authors note, our family structures are becoming more fluid (perhaps converging with Samoan concepts in some ways?):
For example, the Western postmodern family has looser kinship ties than in the past, with relationships that are diverse and fluid (Stacey, 1998). Blended, adoptive, and gay families, as well as those resulting from a variety of assisted reproductive technologies, place an emphasis on choice rather than genetics. For many, family is about social relationships and not solely concerned with the transfer of genes from one generation to the next (Finkler, 2001;Lévi-Strauss, 1969; Peletz, 1995). Nonbiological social factors, such as role behavior, determine family membership, so that a mother’s sister’s son who has been raised with you is your brother (Finkler, 2001). Both formal and informal adoptions are traditional practices and very common in certain societies: Polynesia often being presented as the exemplar (Brady, 1976; Carroll, 1970; Levy, 1973).
So, let me sidestep adoption or other non-genetic descent issues for a moment, and home in on gay families and their representation in the pedigree tools currently available. Though the Recommendations for Standardized Human Pedigree Nomenclature (pdf) mentions it in passing (“For example, information that is commonly recorded on a pedigree (e.g., same-sex relationships…)”), there is no standard suggested. In my and my colleague’s research so far, we have yet to find a software package or online medical pedigree tool that easily accepts same-gender parental groups, or represents them well.
I took a look at one excellent online tool, Madeline 2.0. If one enters a parent, entering a second parent automatically forces an opposite gender. Though there is the ability to model adoptive relationships, there is as yet no way to model same-gender couples. I wrote the developers of the tool and received a thoughtful reply. No, there was no ability to do this, but considering that adopt-in and adopt-out relationships are modeled, it would make sense to include same-gender couples. They suggested they will indeed consider implementing this. Of course, as with all software and online tools, I know funding, timing and priorities will be an issue. I’ll definitely keep an eye on developments. So as to not single Madeline out, no other tools that we know of (see here, here and here) allow for same-gender couples or headed families.
When going to family history modeling software for genealogy, the omission is as stark. Every individual has two family trees: a cultural/historical one and a genetic one. For most individuals, those histories overlap. The culture you received from your parents and they from theirs is pretty close to the genetic descent. Even then, it’s not a perfect overlap. What is important to who you are from a cultural or historical perspective might not at all be related to who you are from a genetic one, and who you are is as much cultural as it is genetic. I am as interested in where I got my cultural ancestry as where I got my genetic one; this has become quite clear to me as we’ve adopted children.
And in the future, descendants will look at their family genealogies and it will be very important to them that one of their ancestors was raised by two men, or two women whether adopted or biological from one parent. As these genealogies are built, those relationships which are very important to their family culture and histories should be represented. I know I personally will hope that this will be the case for our family history in the years to follow.
Yet with the software available, allowing for same-gender parents in the representation is impossible, awkward, or requires a complicated workaround (not to mention paper family trees!). GEDCOM is the de facto standard for exchanging genealogical information. There is no simple standard in GEDCOM for including same-sex parents. That it was developed by the Mormon Church probably has something to do with that ‘oversight’, but frankly given the oversight across the board in pedigree and genealogy standards and software, I doubt that was a deliberate one.
So far I have found software that requires complicated workarounds, like Legacy, or where it’s not easy to figure out (though once you do, it’s simple). Of the many I’ve tried, none even allow it.
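To show that the modeling problem itself is not hard, here is a minimal sketch of a ‘family map’ data structure that records biological and social parentage separately and places no gender constraint on parents. This is purely illustrative: it is not GEDCOM, Madeline, or any existing tool's format, and every class, field, and person name below is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Person:
    name: str
    gender: str  # free text; nothing below depends on its value

@dataclass
class ParentLink:
    parent: Person
    child: Person
    biological: bool  # genetic descent
    social: bool      # raised by / recognized as a parent

@dataclass
class FamilyMap:
    links: list = field(default_factory=list)

    def add_parent(self, parent, child, *, biological, social):
        """Record a parent-child link; any number and gender of parents is allowed."""
        self.links.append(ParentLink(parent, child, biological, social))

    def parents(self, child, *, kind):
        """Return parent names for one view of the family: 'biological' or 'social'."""
        if kind not in ("biological", "social"):
            raise ValueError("kind must be 'biological' or 'social'")
        return [link.parent.name for link in self.links
                if link.child is child and getattr(link, kind)]

# Hypothetical family: two fathers raising an adopted child.
alex = Person("Alex", "male")
sam = Person("Sam", "male")
birth_mother = Person("Birth mother", "female")
jo = Person("Jo", "female")

family = FamilyMap()
family.add_parent(alex, jo, biological=False, social=True)
family.add_parent(sam, jo, biological=False, social=True)
family.add_parent(birth_mother, jo, biological=True, social=False)

print("social parents:", family.parents(jo, kind="social"))
print("biological parents:", family.parents(jo, kind="biological"))
```

Keeping the two kinds of link on the same record means the medical pedigree and the cultural genealogy can both be read out of one family map, rather than forcing a choice between them.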
In a world where the number of same-sex couples is increasing annually (not to mention adoption, blended families and many other types of structures) and increased interest in family history through both genomics and culture and history, I look forward to seeing the software catch up to the ability to model my family for future researchers and historians.
Burns McGrath, B., & Edwards, K. (2009) When Family Means More (or Less) Than Genetics: The Intersection of Culture, Family, and Genomics. Journal of Transcultural Nursing, 20(3), 270-277. DOI: 10.1177/1043659609334931
More and more disease-causing mutations are being identified in exonic splicing regulatory sequences (ESRs). These disease effects can result from ESR mutations that cause exon skipping in functionally diverse genes. In today’s tip I’d like to introduce you to a tool designed to detect exon variants that modulate splicing. The tool is named SKIPPY and has been developed and is maintained by groups in the Genomic Functional Analysis research section of the NHGRI.
At the end of the post I cite a very well-written paper describing the development of SKIPPY, as well as the background on why the tool was developed. I won’t have time to go into all those details, but if you are interested the paper is freely available from Genome Biology. The site also has nice, clear documentation and example inputs – which I will use as my examples. Splicing can be modulated in a variety of ways, including the loss or gain of exonic splicing enhancers (ESEs) or silencers (ESSs). Variants accomplishing either of those are referred to as splice-affecting genome variants, or SAVs. Not all of the abbreviations are explained on the results page, as you will see in the tip, but all are explained in detail in the SKIPPY publication, and the ‘Methods and Interpretations‘ and ‘Quick Reference and Tutorial‘ areas of the site.
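As a toy illustration of what ‘loss or gain’ of a splicing element means at the sequence level, the sketch below checks which motifs overlapping a variant position disappear or appear when a single base changes. This is not SKIPPY's algorithm or scoring; the motif set, the exon sequence, and the function names are placeholders made up for the example.

```python
def kmers_covering(seq, pos, k=6):
    """All k-mers of seq that include the 0-based position pos."""
    start = max(0, pos - k + 1)
    end = min(pos, len(seq) - k)
    return {seq[i:i + k] for i in range(start, end + 1)}

def motif_changes(ref, pos, alt_base, motifs, k=6):
    """Compare motifs overlapping pos before and after substituting alt_base."""
    alt = ref[:pos] + alt_base + ref[pos + 1:]
    before = kmers_covering(ref, pos, k) & motifs
    after = kmers_covering(alt, pos, k) & motifs
    return {"lost": before - after, "gained": after - before}

# Placeholder enhancer hexamer set and made-up exon fragment (illustration only).
HYPOTHETICAL_ESE = {"GAAGAA"}
exon = "CCGAAGAACC"

# Substitute G -> T at 0-based position 5, inside the placeholder motif.
result = motif_changes(exon, 5, "T", HYPOTHETICAL_ESE)
print(result)
```

Here the substitution destroys the only placeholder enhancer hexamer that overlaps the variant, which is the kind of change the post describes as the loss of an ESE.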
I first found the tool because it was mentioned in a nice review entitled “Using Bioinformatics to predict the functional impact of SNVs“, which is a paper that reviews mechanisms by which point mutations can affect function, describes many of the algorithms and resources available & provides some sage advice. I’ll post more on it in a later post. For now, check out the tip & the SKIPPY resource, and if you use the site please let us know what you think.
Woolfe, A., Mullikin, J., & Elnitski, L. (2010). Genomic features defining exonic variants that modulate splicing. Genome Biology, 11(2). DOI: 10.1186/gb-2010-11-2-r20
Cline, M., & Karchin, R. (2010). Using bioinformatics to predict the functional impact of SNVs. Bioinformatics. DOI: 10.1093/bioinformatics/btq695
Back in April I happened to mention that we (OpenHelix) were writing a paper on informal sources of bioinformatics education (in a Friday SNPets item) and we were asked to announce when the paper came out. Well, we got word late last week that the article has been published. The article appears in a special issue of Briefings in Bioinformatics that is devoted to bioinformatics education. I’m not sure if all the articles in the issue are available yet, but it looks like several are in the journal’s Advanced Access area. Bioinformatics education is an area (obviously) that OpenHelix cares deeply about & we are anxiously awaiting our copies of the full issue so we can read all the articles, but I digress…
The title “OpenHelix: bioinformatics education outside of a different box” was a cool suggestion from one of the article’s reviewers – my original title was much tamer (ok, more boring). Regardless of the final title, what we wanted to do in the article is to discuss informal sources of bioinformatics education. By education we do mean acquiring applicable information that allows a researcher to operate within the field of bioinformatics. By informal we mean outside of traditional, credit based classes and degrees. Essentially we provide a bit of the knowledge and know-how that we’ve gathered over years of working with hundreds of resources, thousands of workshop attendees, and countless online contacts about where a researcher, or librarian, or whoever can turn for various informational needs in the field of bioinformatics.
Our contention is that not everyone needs to program in order to manage and manipulate their biological data these days. There are SO many fine publicly available databases, algorithms, tools and more, it is just a matter of awareness and training for anyone to be able to reformat and analyze their personal data sets. We maintain that:
…bioinformatics education needs to do a minimum of four things:
1. raise awareness of the available resources
2. enable researchers to find and evaluate resource functionality
3. lower the barrier between awareness and use of a resource
4. support the continuing educational needs of regular resource users
In the paper we walk through each of these – we first describe example needs associated with the point, and then cover possible informal resources that meet the needs. The article includes tables of resources and links to them and many, many references. We really hope that it is a very useful resource in the field of bioinformatics education. I am already looking forward to contributing to the next special education issue, both to hone my writing skills and to extend the information we can provide readers. Please do comment, email, whatever and let us know about the resources that you use, what you learned from the article, etc. Oh, here’s the citation info:
Williams, J., Mangan, M., Perreault-Micale, C., Lathe, S., Sirohi, N., & Lathe, W. (2010). OpenHelix: bioinformatics education outside of a different box. Briefings in Bioinformatics. DOI: 10.1093/bib/bbq026