Friday, February 27, 2009

Solubility Data in Bioclipse #3: Finding ChEBI IDs

With the RDF functionality set up in Bioclipse (see Solubility Data in Bioclipse #2: handling RDF), we can start mining the Chemical RDF space. Check out this mashup:
var ons = rdf.createStore()
// output: RDFStore: 0 triples

rdf.importURL(ons,
  "http://github.com/egonw/onssolubility/raw/master/ons.solubility.rdf/ons.rdf")
// output: RDFStore: 1206 triples

var results = rdf.sparql(ons,
    "PREFIX owl: <http://www.w3.org/2002/07/owl#> " +
    "PREFIX ons: <http://spreadsheet.google.com/plwwufp30hfq0udnEmRD1aQ/onto#> " +
    "SELECT DISTINCT ?same WHERE { " +
    " ?solvent a ons:Solvent . " +
    " ?solvent owl:sameAs ?same" +
    "}")

for (i=0; i<results.size(); i++) {
  var row = results.get(i);
  for (j=0; j<row.size(); j++) {
    // use the owl:sameAs to find more triples
    var uri = row.get(j);
    if (uri.startsWith("http://rdf.openmolecules.net/?")) {
      print("Added " + uri + "...\n");
      rdf.importURL(ons, uri);
    }
  }
}

rdf.sparql(ons,
    "PREFIX owl: <http://www.w3.org/2002/07/owl#> " +
    "PREFIX ons: <http://spreadsheet.google.com/plwwufp30hfq0udnEmRD1aQ/onto#> " +
    "PREFIX rdfonm: <http://rdf.openmolecules.net/#> " +
    "PREFIX dc: <http://purl.org/dc/elements/1.1/> " +
    "SELECT DISTINCT ?title ?chebi WHERE { " +
    " ?solvent a ons:Solvent . " +
    " ?solvent dc:title ?title . " +
    " ?solvent owl:sameAs ?same ." +
    " ?same rdfonm:chebiid ?chebi" +
    "}")

What happens in this script is the following:
1. Load the ONS Solubility data (lines 4-5).
2. Ask for all owl:sameAs relations to navigate (lines 8-14).
3. Load the RDF for the rdf.openmolecules.net resources (lines 16-26).
4. Query for all solvents which have a ChEBI identifier (lines 28-38).
The output will look like the following (in the future this will be opened as spreadsheet in Bioclipse):
[[ethanol 40C, CHEBI:16236], [acetonitrile, CHEBI:38472], [chloroform, CHEBI:35255], [methanol 30C, CHEBI:17790], [THF, CHEBI:26911], [ethanol, CHEBI:16236], [ethanol 30C, CHEBI:16236], [methanol 40C, CHEBI:17790], [methanol, CHEBI:17790]]
Now, this example shows a simple yet powerful feature of how RDF is used nowadays: the ChEBI identifier was not part of the original Solubility spreadsheet at Google Docs. But, taking advantage of the unique and resolvable URIs for molecules, we can simply look them up.
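For readers without Bioclipse at hand, the same idea can be sketched in plain JavaScript over in-memory triples. This is only an illustration under stated assumptions: the example.org URIs, the `remoteDocuments` lookup table (a stand-in for resolving URIs over the web) and the helper names `followSameAs` and `findChebiIds` are all made up here. Only the owl:sameAs and rdfonm:chebiid predicates and the two ChEBI IDs come from the script and output above.

```javascript
// Plain JavaScript sketch; runs without Bioclipse.
// All resource URIs below are hypothetical placeholders.
var SAME_AS = "http://www.w3.org/2002/07/owl#sameAs";
var CHEBI_ID = "http://rdf.openmolecules.net/#chebiid";

// Local data: [subject, predicate, object] triples that know nothing about ChEBI.
var localTriples = [
  ["http://example.org/solvent/ethanol", SAME_AS, "http://example.org/remote/ethanol"],
  ["http://example.org/solvent/methanol", SAME_AS, "http://example.org/remote/methanol"]
];

// Stand-in for the web: the triples each remote URI would resolve to.
var remoteDocuments = {
  "http://example.org/remote/ethanol": [
    ["http://example.org/remote/ethanol", CHEBI_ID, "CHEBI:16236"]
  ],
  "http://example.org/remote/methanol": [
    ["http://example.org/remote/methanol", CHEBI_ID, "CHEBI:17790"]
  ]
};

// Follow every owl:sameAs link and merge in what the target resolves to.
function followSameAs(triples, resolve) {
  var merged = triples.slice();
  for (var i = 0; i < triples.length; i++) {
    if (triples[i][1] === SAME_AS) {
      merged = merged.concat(resolve(triples[i][2]) || []);
    }
  }
  return merged;
}

// Join local subjects to ChEBI IDs through their owl:sameAs targets.
function findChebiIds(triples) {
  var ids = {};
  for (var i = 0; i < triples.length; i++) {
    if (triples[i][1] !== SAME_AS) continue;
    for (var j = 0; j < triples.length; j++) {
      if (triples[j][0] === triples[i][2] && triples[j][1] === CHEBI_ID) {
        ids[triples[i][0]] = triples[j][2];
      }
    }
  }
  return ids;
}

var merged = followSameAs(localTriples, function (uri) {
  return remoteDocuments[uri];
});
console.log(findChebiIds(merged));
```

The nested loop in `findChebiIds` plays the role of the second SPARQL query: it only finds an identifier once the linked triples have been merged into the same store.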

Nice, isn't it?

Update: the embedded gist did not show up nicely, so I replaced it with a pre block.

Wednesday, February 25, 2009

RDF for chemistry

C-SHALS 2009 (Conference on Semantics in Healthcare and Life Sciences) has just started, and has coverage in a blog and in a FriendFeed room. It nicely coincides with Rich's blog post on What the Heck is the Semantic Web?, and the RDF work I have recently done on rdf.openmolecules.net and Bioclipse. (Oh, do I wish I could have attended that conference.)

Anyway, I was proudly surprised to see sechemtic show up in the Semantic Web technologies: Introduction and Survey tutorial by Lee Feigenbaum of Cambridge Semantics:

Use? Rich was asking what could be done with RDF for chemistry... here is a nice mashup by Phil Ashworth: a Google Map showing the locations where certain chemicals can be bought:

Sunday, February 22, 2009

Solubility Data in Bioclipse #2: handling RDF

RDF is swiftly becoming the lingua franca of life sciences (see for example [1,2]). Bioclipse is an excellent platform to visualize results from analysis of the network, both for graph visualization (see [3]) and for visualization of domain-specific data types (e.g. sequences, molecules, ...).

Yesterday I uploaded a Bioclipse feature that adds an rdf manager to handle RDF content, which includes SPARQL support. The below snippet shows its application to the solubility data [3]:

Saturday, February 21, 2009

Bioclipse2 Scripting #2: searching PubChem

This week I have been porting the PubChem plugin for Bioclipse 1.2 to the new manager-based architecture. While still working on the Wizards, you can run the following JavaScript in Bioclipse2 from SVN and from the next beta (*):

*) There was some confusion on the two beta Bioclipse2 releases so far. Some people expected a release without any bugs left. That release is what we planned to call a Release Candidate. We agree that the first two betas at least turned out to be more alpha than we actually hoped, and we thank everyone who has given these releases a go. Those who tried several development releases of Bioclipse2 saw a lot of ongoing development, and we are fixing any bug reported on these releases. So, do not hesitate to report bugs!

Earlier in this series:

Tuesday, February 17, 2009

DBPedia enters rdf.openmolecules.net

As of tonight, rdf.openmolecules.net links to the chemistry in DBPedia (1816 chemical compounds), for which I used the SPARQL given in DBPedia: lookup and autocomplete of chemistry. It's the first of several steps to extend rdf.openmolecules.net to link up various chemistry databases. The below figure shows the current state, where the green nodes are fully RDF-ied:

Drugs are still missing, but I will add those too, and since not all entries had InChIs, SMILES were converted using CDK 1.1.5.

Sunday, February 15, 2009

Bioclipse for CDK Developers #1

Ola has released the second beta for Bioclipse 2.0. Things are getting along, and I will not go into details on the molecules table Arvid is working on, the 1GB+ SD file support, the validating CML editor, the support for XMPP services, or the brand new welcome page which will guide new users around in what Bioclipse has to offer.

This blog will focus on what Bioclipse has to offer CDK developers.

While Bioclipse 1.x (doi:10.1186/1471-2105-8-59) was a prototype that showed the power of integrating different bio- and cheminformatics tools, Bioclipse2 was designed from scratch, taking advantage of the latest Eclipse RCP technologies. More importantly, the team in Uppsala decided to have all functionality work via managers, which allows all actions to be recorded and makes Bioclipse scriptable. I blogged earlier about scripting JChemPaint, and creating UFF optimized 3D structures from SMILES. Example scripts can be found on GitHub (this is their coverage), and are indexed on Delicious.

R for cheminformatics
The fact that we can script everything makes Bioclipse an ideal platform for doing cheminformatics: we have access to a variety of cheminformatics libraries, and the means to visualize results via JChemPaint and Jmol. It is like R for cheminformatics: Bioclipse being the R command line, Bioclipse plugins the R packages. Eclipse provides a mechanism called Update Sites, which makes something like CRAN redundant. Back to the Chemistry Development Kit.

Over the next weeks, I will blog about scripts aimed at CDK developers and people who want to learn more on how the CDK internals work. This series assumes Bioclipse 2.0 beta2 (or better) and the CDK Feature installed. I'll be using the Gist widget to embed scripts in this blog, but you can always download the Gist directly into Bioclipse, with the GUI as described here.

Bioclipse uses JavaScript; other scripting languages may follow in the future. (File a wishlist report in the Bioclipse bug tracking system if you would like to see support for Jython, BeanShell or others.) Bioclipse managers are visible using special variables, such as:

Bioclipse Feature
  ui    Bioclipse UI interaction
Cheminformatics Feature
  cdk   CDK functionality
  jmol  Jmol functionality
CDK Feature
  cdx   CDK Developer functionality

Bioclipse scripting has TAB completion support, so you can type cdk. (notice the dot at the end) to see which methods the cdk manager provides.

Debugging CDK's Atom Type
As I wrote last week with the email on the first CDK 1.2 release candidate, the new CDK atom typer is a core component of the new CDK. The new implementation covers all atom types used in CDK 1.0, and many more. In particular, Miguel boosted support for charged and radical atom types.

However, the atom types in your data set may not be covered, or perception may fail otherwise. That happens. Bioclipse2 makes debugging of this important step in cheminformatics quite insightful. The following script reads a molecule from SMILES, visualizes a 2D diagram in JChemPaint, and perceives atom types:

The atom type perception results are returned to the JavaScript console, and if there are nulls given, then the CDK algorithm did not find a matching atom type for that atom. If you are sure your cheminformatics representation is in order, I welcome a bug report here.

CDK developers can take advantage of this functionality to eliminate possible causes why a certain algorithm fails. CDK atom typing is used for a variety of algorithms, including counting implicit hydrogens, which many other algorithms need to know.
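Checking for nulls by eye gets tedious for larger molecules, so here is a minimal sketch in plain JavaScript of how such a result could be scanned. It assumes the perception result is available as a plain array with one entry per atom and null where no type matched. That array shape, the `findUntypedAtoms` helper and the sample values are illustrative assumptions, not the actual return value of the Bioclipse manager.

```javascript
// Return the indices of atoms for which no atom type was perceived.
// Assumes one entry per atom, with null (or undefined) for a failure.
function findUntypedAtoms(atomTypes) {
  var missing = [];
  for (var i = 0; i < atomTypes.length; i++) {
    if (atomTypes[i] === null || atomTypes[i] === undefined) {
      missing.push(i);
    }
  }
  return missing;
}

// Hypothetical result for a four-atom molecule; the last atom failed.
var perceived = ["C.sp3", "C.sp2", "O.sp2", null];
var untyped = findUntypedAtoms(perceived);
if (untyped.length > 0) {
  console.log("No atom type perceived for atom(s) at index: " + untyped.join(", "));
}
```

Reporting the atom index rather than just a count makes it easier to go back to the 2D diagram and see which atom is the problem.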

How does the CDK read a SMILES
A use case for people who want to know if a particular SMILES feature is read, or who want to make sure it is read correctly:

This script uses the diff functionality introduced in CDK 1.2, and shows two aspects of the SMILES specification: 1. it picked up the isotope information given in the second SMILES; 2. the second SMILES does not include the implicit hydrogen count, which the SMILES specification then defaults to zero.
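The two SMILES rules mentioned here can be illustrated without the CDK. The following plain JavaScript sketch parses a single bracket atom and is not how the CDK SMILES parser is implemented: `parseBracketAtom` is a hypothetical helper that only understands an optional isotope, a capitalized element symbol and an optional hydrogen count (no charges, chirality or aromatic symbols), and the example inputs are my own rather than the ones from the script.

```javascript
// Parse a simple SMILES bracket atom such as "[13CH4]".
// Returns null for anything outside this deliberately small subset.
function parseBracketAtom(token) {
  var match = /^\[(\d+)?([A-Z][a-z]?)(?:H(\d*))?\]$/.exec(token);
  if (match === null) {
    return null;
  }
  // Inside brackets, a missing hydrogen count means zero hydrogens;
  // a bare "H" means exactly one.
  var hydrogenCount = 0;
  if (match[3] !== undefined) {
    hydrogenCount = match[3] === "" ? 1 : parseInt(match[3], 10);
  }
  return {
    isotope: match[1] === undefined ? null : parseInt(match[1], 10),
    symbol: match[2],
    hydrogenCount: hydrogenCount
  };
}

console.log(parseBracketAtom("[CH4]"));
console.log(parseBracketAtom("[13C]"));
```

Comparing the two results by hand shows the same kind of difference the diff reports: the second atom carries an isotope and, because no H count is written inside the brackets, has zero hydrogens.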

Summary
The CDK managers in Bioclipse (cdk and cdx) expose functionality of the CDK, and allow using it in Bioclipse's rich visual workbench environment.

Thursday, February 12, 2009

Did others notice this too? The blogger.com Links to this post functionality seems seriously broken... once a rather useful feature, it has now degraded to a useless state:

I'm quite sure a post of last October cannot link to the Substructure searching on ONS solubility data item Jean-Claude posted today.

Wednesday, February 11, 2009

DBPedia: lookup and autocomplete of chemistry

On the DBPedia discussion mailing list there was a post on a nice web page which allows you to look up things, and which features an autocomplete edit field. The below screenshot shows the lookup of molecular structures:

If you are not aware of this, adding content to DBPedia is as easy as adding something to WikiPedia. Literally: DBPedia is the RDF flavour of WikiPedia. It extracts the information from the info boxes, as I discussed before (see Molecules in Wikipedia).

BTW, one can take advantage of DBPedia to see what WikiPedia has to offer in terms of chemistry. For example, to list all molecules which have a SMILES, one can use this simple SPARQL query:
Or, to list those which have an InChI:
And this is actually quite useful, e.g. it can be used in quality control. Running the above queries will show up several broken SMILES and InChIs. I have not had time to fix those yet, so please go ahead and beat me to those fixes, and get some WikiPedia Fame :) Alternatively, invert the queries and add missing InChIs, PubChem CIDs or SMILES. When I have a bit more free time again, after the new stable CDK and Bioclipse releases, I'll run these analyses again, and summarize them in a web page.
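As a first-pass filter over such query results, a few cheap string checks already catch the most obvious breakage. The sketch below is plain JavaScript and only a heuristic: `hasBalancedBrackets` and `looksLikeInChI` are hypothetical helper names, and passing these checks does not mean a SMILES or InChI is valid. For a real verdict, a cheminformatics toolkit such as the CDK would have to parse the value.

```javascript
// Heuristic: parentheses must balance and square brackets must pair up
// without nesting. Catches truncated SMILES, nothing more.
function hasBalancedBrackets(smiles) {
  var depth = 0;
  var inBracket = false;
  for (var i = 0; i < smiles.length; i++) {
    var ch = smiles.charAt(i);
    if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth < 0) return false;
    } else if (ch === "[") {
      if (inBracket) return false;
      inBracket = true;
    } else if (ch === "]") {
      if (!inBracket) return false;
      inBracket = false;
    }
  }
  return depth === 0 && !inBracket;
}

// Heuristic: an InChI string starts with "InChI=1/" or, for a
// standard InChI, "InChI=1S/".
function looksLikeInChI(value) {
  return /^InChI=1S?\//.test(value);
}

console.log(hasBalancedBrackets("CC(=O)O"));
console.log(looksLikeInChI("InChI=1S/CH4/h1H4"));
```

Values that fail either check are good candidates for a manual look at the corresponding WikiPedia info box.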

Tuesday, February 10, 2009

CDK 1.2 Release Candidate

I released CDK 1.1.5 today. Below is the email I sent to the cdk-user mailing list:

Hi all,

I am happy to be able to announce the first Release Candidate for CDK 1.2.

Everyone using CDK 1.0 is advised to upgrade to this release, which has fewer bugs, is much better tested, and is faster too. It also comes with API changes, and a full changelog is not available (yet). However, the CDK developers are available on this mailing list and on IRC to help you port CDK 1.0 applications to CDK 1.2. Two differences in particular I would like to point out at this moment:

1. explicit atom typing

CDK 1.0 did atom typing at various places to perform its function, leading to inconsistencies and bugs. CDK 1.2 introduces a new atom typing module which isolates atom typing from other algorithms. Consequently, the CDK will be more critical on your code and your data: where the old code might have silently eaten incorrect input, the new implementation complains: expect exceptions! The actual atom type list used in CDK 1.2 is more complete than the ones used in CDK 1.0; however, it is not unlikely that you will find no atom type perceived for a clearly valid atom type. Please report such cases. And I really want to stress this: in every instance where CDK 1.2 fails, CDK 1.0 would have failed too, though it might not have complained about it.

2. no rendering functionality

The new renderer under development (see the cdk-jchempaint mailing list) has not made the CDK 1.2.0 release. However, it is expected to be available in a later CDK 1.2.x release. If you really need the graphics functionality, please contact me. Bioclipse2 is an example project which combines CDK 1.2 with the new rendering code.

Contributions
-------------------

This release features contributions from a larger developer group than ever before. In particular, I would like to welcome those who have picked up JuniorJobs, and provided other smaller patches! A full list of authors is available from:

  http://cdk.svn.sourceforge.net/viewvc/cdk/cdk/tags/cdk-1.1.5/AUTHORS

(If you see your name missing (sorry!), please just email me)

If you like to contribute too, there are many ways. The JuniorJobs are just an example and are available from:

  http://sourceforge.net/tracker/?group_id=20024&atid=997721

Download
--------------

CDK 1.2 RC1 is available from SourceForge as CDK 1.1.5:

  http://sourceforge.net/project/showfiles.php?group_id=20024&package_id=57806

Alternatively, you can download the release from SVN:

  http://cdk.svn.sourceforge.net/viewvc/cdk/cdk/tags/cdk-1.1.5/

Bugs
--------

As said, this CDK release is the most tested CDK release ever, with more than 10 thousand unit tests! However, there are open (minor) issues, which you can see reported at Nightly:

  http://pele.farmbio.uu.se/nightly-1.2.x/

The number of failing unit tests is below 1%, and in the same range as the number of failing tests for CDK 1.0. Importantly, these are typically failures of unit tests which are not available in the CDK 1.0 unit test suite; that is, many of the failing unit tests in CDK 1.0 are *not* failing in CDK 1.2 (it really is rewarding to upgrade!)

However, if you find additional bugs (or just have wishlists), you can report these with our SourceForge bug tracker at:

  http://sourceforge.net/tracker/?group_id=20024&atid=120024

Documentation
----------------------

Over the next weeks I hope to compose a somewhat useful list of changes. I have not made up my mind yet how that will take shape, maybe as a list of blogs, which I'll aggregate later. Dunno yet. Suggestions and contributions welcome :)

JavaDoc for the release is not yet available on SF for download (working on that), but available for the cdk1.2.x / branch at:

  http://pele.farmbio.uu.se/nightly-1.2.x/api/

OK, that wraps it up for now. Just reply if you have questions.

Egon

Thursday, February 05, 2009

Where can I host my experimental data? Open Submission Chemistry Databases #1

Rich just posted an interesting read on Web-Centric Science, after a gauntlet thrown down by The Realm of Organic Synthesis (TROS).

I agree that this still is a problem: where can (organic) chemists host their data? TROS hints at Wikipedia, but an encyclopedia is not always the most suited place for cutting edge chemistry (articles can easily be biased, contain (science) political views, etc...). I would suggest a blog would be a good start, and if proper markup were used, services like Chemical blogspace would automatically aggregate it.

However, something less volatile might be interesting. So, what we need is an overview of web databases where experimental chemistry data can be hosted. I'll start one, and annotate resources with license, on delicious.com, using the tags chemistry +web +database +open +submission, and regularly summarize things here.

In the below table, the last column indicates the most liberal license you can use to host your data:

database    data type                             license
NMRShiftDB  NMR spectra                           GNU FDL
ChemSpider  Structures, links to papers, spectra  open data
SORD        Organic Reactions                     ?

There are some obvious gaps here, if you consider a typical experimental section. What to do with a measured melting point, IR spectra, mass spectral information, and measured elemental composition?

Why can't SourceForge just remember me?

I do not typically make complaints in my blog, so consider this a request for advice in good practices ;)

My problem is that I have to log in on SourceForge every day, even if I tick the 'Remember me' switch. I do understand that account logins do need some timeout... but less than a day? One cause of problems seems to be if I connect via a different network, but cookies should not be affected by that? Am I doing something wrong here, or does SourceForge?

Are others having the same problems?