## Sunday, May 28, 2006

### Blue Obelisk in Obernai at Chemoinformatics in Europe

Together with Christoph, Christian and Jerome, I will be representing the Blue Obelisk movement at the First Workshop on Chemoinformatics in Europe, with the topic Research and Teaching. Though I wonder what this excludes? Development? I can't imagine that commercial companies will not be represented as usual. Moreover, it will likely include some bioinformatics too, unless you consider that to deal with sequences only.

I have my laptop with me, and, of course, the Blue Obelisk Live CD 2 on which the mouse now actually works. Bioclipse 0.9.1 does not work, though; will report that bug later.

My work schedule for the train ride:

• Work on my manuscript

• Integrate Todd Martin's SMILES and QSAR work

• Work on the next CDK News

• Think about InChI creation in Bioclipse, using OpenBabel

## Friday, May 26, 2006

### Molecular indexing on the KDE and OS/X desktops

Geoff Hutchinson blogged about his OS/X ChemSpotLight, an indexing tool for chemistry documents. It is like, but more advanced than, the kfile_chemical and Kat tools I have been working on (with others) for the KDE desktop (see earlier blog items).

ChemSpotLight currently does more than the KDE tools: it adds Spotlight comments. I assume these are like the Linux extended attributes, as used by Beagle. For example, a file indexed by Beagle will have extended attributes like:

    # file: home/egonw/m43.jpg
    user.Beagle.AttrTime="20060509071950"
    user.Beagle.Filter="003 Beagle.Filters.FilterJpeg"
    user.Beagle.Fingerprint="02 xHn5Yi58x0eoI8ityBYkUw"
    user.Beagle.MTime="20031225151016"
    user.Beagle.Uid="YcIW72RWyk+K5FbGnpv4iA"

This is very suitable for adding metadata, like comments as in ChemSpotLight. Geoff's program adds metadata like the number of atoms and bonds, but it calculates the SMILES and InChI on the fly too. The last is especially good for indexing purposes, as it is a really unique identifier for molecular structures, and even works for proteins.

Now, kfile_chemical is a kfile plugin. These kfile plugins only extract metadata from files, and have little to do with calculated metadata. Kat, on the other hand, is an indexing application and might be expected to add additional, derived or calculated, metadata as extended attributes, just like Beagle does. And then InChI and SMILES are good candidates.
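As a rough illustration of what an indexer could store, here is a minimal Python sketch that maps calculated metadata onto extended attributes. The `user.chem.` prefix and the function names are my own assumptions, not something Kat, Beagle or ChemSpotLight defines, and the SMILES and InChI strings are assumed to come from a toolkit such as OpenBabel rather than being computed here.

```python
import os

# Hypothetical attribute prefix; neither Kat nor Beagle defines these names.
PREFIX = "user.chem."


def chem_attributes(smiles, inchi, atom_count, bond_count):
    """Map derived chemical metadata to extended-attribute names and byte values."""
    values = {
        "smiles": smiles,
        "inchi": inchi,
        "atoms": str(atom_count),
        "bonds": str(bond_count),
    }
    return {PREFIX + key: value.encode("utf-8") for key, value in values.items()}


def write_attributes(path, attributes):
    """Attach the attributes to a file (Linux only; the filesystem must allow user xattrs)."""
    for name, value in attributes.items():
        os.setxattr(path, name, value)


# Methane, counting hydrogens; the identifiers are supplied by hand for this example.
attrs = chem_attributes("C", "InChI=1/CH4/h1H4", 5, 4)
```

Calling `write_attributes("methane.cml", attrs)` would then store the four values on a (hypothetical) file, where a desktop search tool could pick them up.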

## Wednesday, May 24, 2006

### XML validation on Eclipse with Web Tools Platform

Yesterday I installed the Eclipse Web Tools Platform again, this time successfully, using the Eclipse update mechanism on my Kubuntu dapper Eclipse install. I wanted it because it has a validating XML editor, the one last thing I still needed jEdit for. (I do miss the vertical selection feature of jEdit, though.) It flags errors and offers autocompletion.

Now I can validate all the Chemical Markup Language files I have around, which is very useful for those I use to make sure CDK and Bioclipse are working properly. I just need to make sure I use the http://www.w3.org/2001/XMLSchema-instance namespace, as in this example from CDK SVN:

    <cml title="Regression tests for valid XML Schema documents for CML 2.3"
         xmlns="http://www.xml-cml.org/schema"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.xml-cml.org/schema ../../../io/cml/data/cml23.xsd">
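The `xsi:schemaLocation` attribute holds whitespace-separated pairs of a namespace and the location of its schema. The Python sketch below only shows how those pairs can be read from a document root; it is my own illustration, not how WTP resolves schemas, and the function name is made up for the example.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def schema_locations(xml_text):
    """Return {namespace: schema location} from the root's xsi:schemaLocation."""
    root = ET.fromstring(xml_text)
    tokens = root.get("{%s}schemaLocation" % XSI, "").split()
    # The attribute holds whitespace-separated (namespace, location) pairs.
    return dict(zip(tokens[0::2], tokens[1::2]))


doc = """<cml xmlns="http://www.xml-cml.org/schema"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.xml-cml.org/schema ../../../io/cml/data/cml23.xsd"/>"""

locations = schema_locations(doc)
```

Here `locations` maps the CML namespace to the relative path of `cml23.xsd`, which is the file a validating editor would need to load.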

Now, I do have some questions. Firstly, does WTP allow recycling of the XML editor? That is, can I use their validating XML editor in, for example, Bioclipse? Would I just depend on the right plugin jars from WTP, or is it more complicated? Alternatively, since in RCP everything is a plugin, can WTP be installed as a plugin in Bioclipse directly?

Secondly, does Kubuntu or Debian sid have binary packages for WTP? I seem to remember having read something about this, in relation to splitting up the WTP into smaller, more specific plugins. Anyone?

## Monday, May 22, 2006

### A live life-sciences CD

In November last year, I reported my plans to develop a live CD with all our favorite chemo- and bioinformatics software. Bioclipse requires Java 5 and sort of still depends on the Sun JVM (I will experiment with classpath-generics later), but is now distributable with operating systems. So, I made a Kubuntu-derived operating system with OpenBabel, Jmol, PyMOL, Bioclipse, and, at the system level, the chemical MIMEs and kfile_chemical, which extends the desktop with chemistry awareness. In addition, I added the Blue Obelisk Data Repository, all CDK News issues, and the full NMRShiftDB data in CML format.

## Thursday, May 18, 2006

### Taverna runs with Classpath 0.91

Classpath 0.91 has been released with 1.45 million lines of code and with 98.96% coverage of Java 1.4.2, and 99.82% of javax.swing. Or, as Dave calls it: 0.91 rocks! JChemPaint runs again (they fixed the XML parsing problem), and Jmol still runs, but slowly. I also tested Taverna, which now also starts up, but has an XML parsing error too:

    Exception occured whilst loading RDFS! Error on line 2: required string: "?>"
    org.jdom.input.JDOMParseException: Error on line 2: required string: "?>"
       at org.jdom.input.SAXBuilder.build(SAXBuilder.java:468)
       at org.jdom.input.SAXBuilder.build(SAXBuilder.java:851)
       at org.embl.ebi.escience.scufl.semantics.RDFSParser.loadRDFSDocument(RDFSParser.java:70)
       at org.embl.ebi.escience.scuflui.workbench.Workbench.main(Workbench.java:128)
       at java.lang.reflect.Method.invokeNative(Native Method)
       at java.lang.reflect.Method.invoke(Method.java:355)
       at org.embl.ebi.escience.scuflui.workbench.WorkbenchLauncher.main(WorkbenchLauncher.java:40)

Oh, and rumours go that gcjwebplugin can run the Jmol applet now, except for the JavaScript interaction, that is.

## Thursday, May 11, 2006

### New open access journal Source Code for Biology and Medicine

BioMed Central is setting up a new peer-reviewed, open access journal Source Code for Biology and Medicine. It will "encompass all aspects of workflow for information systems, decision support systems, client user networks, database management, and data mining". Basically, anything that fits into chem-bla-ics. (Thanx to Werner, for pointing me to the website!)

The 'source code' aspect is the interesting thing of this new journal. The editorial board set the aim to publish source code for distribution and use in the public domain in order to advance biological and medical research. And, in a bit more detail, they list the following goals:

• increase productivity

• reduce discovery times

• reduce search times for source code

• provide a historical reflection of source code applied

• serve as a repository

This comes close to what open source is trying to achieve too, but I do note differences. For example, the announcement mentions the public domain (see the Wikipedia entry). I tend to be a bit confused by the use of this term: to me the public domain is where things end up after copyright claims have ended, and everyone is free to do with them whatever they want. Very important in this case, open source software is not in the public domain. Do they mean that they will not allow open source in the new journal?

I also wonder whether we need a journal like this. Open source projects often have other resources available that serve as a repository (e.g. SourceForge), and the use of version control systems as repositories (like CVS, Subversion) is widespread too, which takes care of the historical reflection. Indeed, much open source software is already published in other journals.

The process of picking the journal to submit to often involves looking up the journal's impact factor. Is this new journal expected to get a high impact factor? How many people will regularly read the journal? Will it be read by the right audience, or just by fellow bioinformaticians?

Though I have my doubts about the success of this journal, I am looking forward to the first issue!

Update: Pedro pointed me to the About page of the SCFBM, giving details on the types of articles taken into consideration.

## Sunday, May 07, 2006

### Open Text Mining Interface and Bioclipse

Timo Hannay blogged in Nature's Nascent blog about the Open Text Mining Interface (OTMI), which is "a suggestion from Nature about how we might achieve text-mining and indexing purposes". The idea is that each article has a link pointing to a machine-readable file containing raw data about (and from?) the article. The standing example uses Atom 1.0 as a container, allowing raw data to be included using foreign namespaces, such as Dublin Core (for metadata) and Prism (for bibliographic data), and the OTMI text mining statistics use a namespace too.

In a comment, Henry Rzepa proposed inclusion of CML, and referred to earlier work on CMLRSS, where Chemical Markup Language is embedded in RSS news feeds, for which I wrote readers for Jmol and JChemPaint (DOI: 10.1021/ci034244p).

As readers of my blog know, the Bioclipse project has been working hard on an integrated (bio)chemistry workbench, and the latest release includes a CMLRSS reader plugin too, which supports CML embedded in Atom 0.3/1.0 and RSS 1.0/2.0 feeds. Now, adding support for other embedded namespaces is trivial, and this morning I hacked in support for OTMI:

This screenshot shows the original OTMI example with the Atom 1.0 entry now wrapped in an Atom 1.0 <feed> element. There is no nice OTMI icon for the OTMI content in the Atom 1.0 entry, nor did I yet make a 'view' showing the actual vectors or the snippets, but that's a piece of cake too.

Now, the nice thing about this is that the Bioclipse code for the Atom and RSS feeds just greps through the feed entry and shows whatever CML or OTMI content is present. When Nature decides to include CML in these OTMI files too, I will not have to update the current code.
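To give an idea of that approach, here is a minimal Python sketch, not the actual Bioclipse code (which is Java). It walks the direct children of each Atom entry and reports those in a known foreign namespace. The OTMI namespace URI and the `vectors` element name below are placeholders I made up for the example, not the real OTMI vocabulary.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
CML = "http://www.xml-cml.org/schema"
OTMI = "http://example.org/otmi"  # placeholder, not the real OTMI namespace

KNOWN = {CML: "CML", OTMI: "OTMI"}


def embedded_content(feed_text):
    """List (entry title, content kind, element name) for known foreign elements."""
    found = []
    root = ET.fromstring(feed_text)
    for entry in root.iter("{%s}entry" % ATOM):
        title = entry.findtext("{%s}title" % ATOM, default="")
        for child in entry:
            if not child.tag.startswith("{"):
                continue  # element without a namespace
            namespace, local = child.tag[1:].split("}", 1)
            if namespace in KNOWN:
                found.append((title, KNOWN[namespace], local))
    return found


feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Example article</title>
    <molecule xmlns="http://www.xml-cml.org/schema" id="m1"/>
    <vectors xmlns="http://example.org/otmi"/>
  </entry>
</feed>"""

hits = embedded_content(feed)
```

Supporting another embedded format then only means adding its namespace to `KNOWN`; the loop over the entries stays the same.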

## Wednesday, May 03, 2006

### Four graph mining methods integrated in ParMol

Joerg Wegner mentioned in his blog the graph mining program ParMol which integrates four mining algorithms: MoSS (aka MoFa) and Gaston, which I mentioned in November last year, and FFSM and gSpan, which I did not know about yet. ParMol provides a common interface to the four different algorithms and is, like the four mining modules, licensed GPL. An interesting aspect is that Gaston was originally written in C++.

## Monday, May 01, 2006

### Nightly CDK builds now available

Rajarshi Guha has set up a nightly build service for the Chemistry Development Kit (CDK). The output is pretty, but information rich: it includes results for the JUnit tests, DocCheck, and PMD. The compiled jar and the corresponding JavaDoc can be downloaded, offering a cutting edge distribution for users.