Pages

Sunday, November 12, 2017

New paper: "WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research"


Focus on metabolic pathways increases the
number of annotated metabolites, further improving
the usability in metabolomics. Image: CC-BY.
TL;DR: the WikiPathways project (many developers in the USA and Europe, contributors from around the world, and many people curating content, etc) has published a new paper (doi:10.1093/nar/gkx1064/4612963), with a slight focus on metabolism. 

Full story
Almost six years ago my family and I moved back to The Netherlands for personal reasons. Workwise, I had a great time in Stockholm and Uppsala (two wonderful universities; thanks to Ola Spjuth, Bengt Fadeel, and Roland Grafström), but being immigrant in another country is not easy, not even for a western immigrant in a western country. ("There is evil among us.")

We had decided to return to our home country, The Netherlands. By sheer coincidence, I spoke with Chris Evelo in the week directly following that weekend. I had visited his group in March that year, while attending a COST-action about NanoQSAR in Maastricht. I had never been to Maastricht University yet, and this group, with their Open Source and Open Data projects, particularly WikiPathways, would give us enough to talk about. Chris had a position on the Open PHACTS project open. I was interested, applied, and ended up in the European WikiPathways group led by Martina Kutmon (the USA node is the group of Alex Pico).

Fast forward to now. It was clear to me that biological text book knowledge was unusable for any kind of computation or machine learning. It was hidden, wrongly represented, and horribly badly annotated. In fact, it still is a total mess. WikiPathways offered machine readable text book knowledge. Just what I needed to link the chemical and biological worlds. The more accurate biological annotation we put in these pathways, or semantically link to these pathways, the more precise our knowledge becomes and the better computational approaches can find and learn patterns not obvious to the human eye (it goes both ways, of course! Just read my PhD thesis.)

Over the past 5-6 years I got more and more involved in the project. Our Open PHACTS tasks did involve WikiPathways RDF (doi:10.1371/journal.pcbi.1004989), but Andra Waagmeester (now Micelio) was the lead on that. I focused on the Identifier Mapping Service, based on BridgeDb (together with great work from Carole Goble's lab, e.g. Alasdair and Christian). I focused on metabolomics.

Metabolomics
Indeed, there was plenty to be done in terms of metabolic pathways in WikiPathways. The current database had a strong focus on the genetics and proteins aspects of the pathways. In fact, many metabolites were not datanodes and therefore did not have identifiers. And without identifiers, we cannot map metabolomics data to these pathways. I started working on improving these pathways, and we did some projects using it for metabolomics data (e.g. a DTL Hotel Call project led by Lars Eijssen).

The point of this long introductions is, I am standing on the shoulders of giants. The top right figure shows, besides WikiPathways itself, and the people I just mentioned, more giants. This includes Wikidata, which we previously envisioned as hub of metabolite information (see our Enabling Open Science: Wikidata for Research (Wiki4R) proposal). Wikidata allows me to solve the problem that CAS registry numbers are hard to link to chemical structures (SMILES): it has some 70 thousand CAS numbers.


SPARQL query that lists all CAS registry numbers in Wikidata, along with the matching
SMILES (canonical and isomeric), database entry, and name of the compound. Try it.
A lot more about CAS registry numbers is found in my blog.
Finally, but certainly not least, is Denise Slenter, who started this spring in our group. She picked up things I and others were doing very quickly (for example this great work from Maastricht Science Programme students), gave those her own twist, and is now leading the practical work in taking this to the next level. This new WikiPathways paper shows the fruits of her work.

Metabolism
Of course, there are plenty of other pathways database. KEGG is still the gold standard for many. And there is the great work of Reactome, RECON, and many others (see references in the NAR article). Not to mention the important resources that integrate pathways resources. To me, unique strengths of WikiPathways include the community approach, very liberal licence (CCZero), many collaborations (do we have a slide on that?), and, importantly, its expressiveness. The latter allows our group to do the systems biology work that we do, analyzing microRNA/RNASeq data, studying diseases at a molecular interaction level, see the effects of personal genetics (SNPs, GWAS), and visually integrate and summarize the combination of experimental data and text book knowledge.

OK, this post is now already long enough. And seeing from the length, you can see how much I am impressed with WikiPathways and where it goes. Clearly, there is still a lot left to do. And I am just another person contributing to the project and honored that we could give this WikiPathways paper a metabolomics spin. HT to Alex, Tina, and Chris for that!

Slenter, D. N., Kutmon, M., Hanspers, K., Riutta, A., Windsor, J., Nunes, N., Mélius, J., Cirillo, E., Coort, S. L., Digles, D., Ehrhart, F., Giesbertz, P., Kalafati, M., Martens, M., Miller, R., Nishida, K., Rieswijk, L., Waagmeester, A., Eijssen, L. M. T., Evelo, C. T., Pico, A. R., Willighagen, E. L., Nov. 2017. WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Research. http://dx.doi.org/10.1093/nar/gkx1064

Sunday, October 29, 2017

Happy Birthday, Wikidata!

CCZero.
Wikidata celebrates their 5th birthday with a great WikidataCon in Berlin. Sadly, I could not join in person, so I assuming it is a great meeting, following the #WikidataCon hash tag and occasionally the live stream.

Happy Birthday, Wikidata!

My first encounter was soon after they started, and was particularly impressed by the presentation by Lydia Pintscher at the Dutch Wikimedia Conferentie 2012. I had played with DBPedia occasionally but always disappointed by the number of issues with extracting chemistry from the ChemBox infobox, but that's of course the general problem with data that has been mangled into something that looks nice. We know that problem from text mining from PDFs too. Of course, if you start with something machine readable in the first place, your odds for success are much higher.

Yesterday, Lydia shows the State of Wikidata and I think they delivered on their promise.

I did not create my Wikidata account until a year later but did not use the account much in the first two years. But the Wikidata team did a lot of great work in their first three years, and somewhere in 2015 I wrote my first blog post about Wikidata. That year Daniel Mietchen also asked me to join the writing of a project proposal (later published in RIO Journal). The reason for more active adoption of Wikidata and joining Daniel's writing team, was the CCZero license and that chemical identifiers had really picked up. Indeed, free CAS numbers was an important boon. Since then, I have been using Wikidata as data source for our BridgeDb project and for WikiPathways (together with Denise Slenter). I also have to mention the work by Andra Waagmeester and the rest of the Andrew Su team gave me extra support to push Wikidata in our local research agenda around FAIR data.

The Wikidata RDF export and SPARQL end point was an important tipping point. This makes reuse of Wikidata so much easier. Integrating slices of data with curl is trivial and easy to integrate into other projects, as I do for BridgeDb. Someone in the education breakout session mentioned that you can use the interactive SPARQL end point even with people with zero programming experience. I wholeheartedly agree. That is exactly what I did last Thursday at the Surf Verder bouwen aan Open Science seminar. The learning curve with all the example queries is so shallow, it is generally applicable.

And then there is Scholia. What do I need to say? Impressive project by Finn Nielsen to which I am happy to contribute. Check out his WikidataCon talk. Here I am contributing to the biology corner and working on RSS feeds. It makes a marvelous tool to systematically analyze literature, e.g. for the Rett Syndrome as disease or as topic.

Wikidata has evolved to a tremendously useful resource in my biology research and I cannot imagine where we will be next year, at the sixth Wikidata birthday. But it will be huge!

Sunday, October 15, 2017

Two conference proceedings: nanopublications and Scholia


The nanopublication conference article in
Scholia.
It takes effort to move scholarly publishing forward. And the traditional publishers have not all shown to be good at that: we're still basically stuck with machine-broken channels like PDFs and ReadCubes. They seem to all love text mining, but only if they can do it themselves.

Fortunately, there are plenty of people who do like to make a difference and like to innovate. I find this important, because if we do not do it, who will. Two people who make an effort are two researchers who recently published their work as conference proceedings: Tobias Kuhn and Finn Nielsen. And I am happy to have been able to contribute to both efforts.

Nanopublications
Tobias works on nanopublications which innovates how we make knowledge machine readable. And I have stressed how important this is in my blog for years. Nanopublications describe how knowledge is captures, makes it FAIR, but importantly, it links the knowledge to the research that led to the knowledge. His recent conference proceedings details how nanopublications can be used to establish incremental knowledge. That is, given two sets of nanopubblications, it determines which have been removed, added, and changed. The paper continues outlining how that can be used to reduce, for example, download sizes and how it can help establish an efficient change history.

Scholia
And Finn developed Scholia, an interface not unlike Web-of-Science. But then based on Wikidata and therefore fully on CCZero data. And, with a community actively adding the full history of scholarly literature and the citations between papers, courtesy to the Initiative for Open Citations. This is opening up a lot of possibilities: from keeping track of articles citing your work, to get alerts of articles publishing new data on your favorite gene or metabolite.

Kuhn T, Willighagen E, Evelo C, Queralt-Rosinach N, Centeno E, Furlong L. Reliable Granular References to Changing Linked Data. In: d'Amato C, Fernandez M, Tamma V, Lecue F, Cudré-Mauroux P, Sequeda J, et al., editors. The Semantic Web – ISWC 2017. vol. 10587 of Lecture Notes in Computer Science. Springer International Publishing; 2017. p. 436-451. doi:10.1007/978-3-319-68288-4_26


Nielsen FÃ, Mietchen D, Willighagen E. Scholia and scientometrics with Wikidata. arXiv.org; 2017. Available from: http://arxiv.org/abs/1703.04222.

Sunday, October 08, 2017

CDK used in SIRIUS 3: metabolomics tools from Germany

Screenshot from the SIRIUS 3 Documentation.
License: unknown.
It has been ages I blogged about work I heard about and think should receive more attention. So, I'll try to pick up that habit again.

After my PhD research (about machine learning (chemometrics, mostly), crystallography, QSAR) I first went into the field metabolomics. Because is combines core chemistry with the complexity biology. My first position was with Chris Steinbeck, in Cologne, within the bioinformatics institute led by Prof. Schomburg (of the BRENDA database). During that year, I worked in a group that worked on NMR data (NMRShiftDb, dr. Stefan Kuhn), Bioclipse (collaboration with Ola Spjuth), and, of course, the Chemistry Development Kit (see our new paper).

This new paper, actually, introduces functionality that was developed in that year, for example, work started by Miquel Rojas-Cheró. This includes the work on atom types, which we needed to handle radicals, lone pairs, etc, for delocalisation. It also includes work around handling molecular formula and calculating molecular formulas from (accurate) molecular masses. For the latter, more recent work even further improved on earlier work.

So, whenever metabolomics work is published and they use the CDK, I realize that what the CDK does has impact. This week Google Scholar alerted me about a user guidance document for SIRIUS 3 (see the screenshot). Seems really nice (great) work from Sebastian Böcker et al.!

It also makes me happy, as our Faculty of Heath, Medicine, and Life Sciences (FHML) is now part of the Netherlands Metabolomics Center, and that we published the recent article our vision of a stronger, more FAIR European metabolomics community.

Wednesday, October 04, 2017

new paper: "The future of metabolomics in ELIXIR"

CC-BY from F1000 article.
This spring I attended a meeting organized by researchers from the European metabolomics community, including from PhenoMeNal to talk about proposing a use case to ELIXIR. Doing research in metabolomics and being part of ELIXIR, I was happy that meeting happened. During the meeting I presented the work from our BiGCaT group (e.g. WikiPathways, see doi:10.1093/nar/gkv1024).

During the meeting various metabolomics topics were discussed, and I pushed for interoperability of chemical (metabolic) structures, which requires structure normalization, equivalence testing, etc. You know, the kind of work that partners in Open PHACTS did, and that we're now trying to bootstrap with ChemStructMaps. It did not make it, but ideas are included in the selected topic.

All this you can read in this meeting write up, peer-reviewed in F1000Research (doi:10.12688/f1000research.12342.1). I am happy to have been given the opportunity to contribute to this work. The work in our group (e.g. from our PhD student Denise) can surely contribute to this community effort.

 Van Rijswijk M, Beirnaert C, Caron C, Cascante M, Dominguez V, Dunn WB, et al. The future of metabolomics in ELIXIR. F1000Research. 2017 Sep;6:1649+. 10.12688/f1000research.12342.1.