Pages

Saturday, December 29, 2012

CTR #1: Heavy atom counts from an SD file

The first Chemistry Toolkit Rosetta task is to count the number of heavy atoms in the structures given in a MDL SD file. This tasks starts with an SD file and counts for each structure in the file the number of heavy atoms (non-hydrogen atoms). Because we simply handle the structures one by one, the solution uses the IteratingMDLReader reader. The input file (benzodiazepine.sdf.gz) is a gziped file, which we handle by using a GZIPInputStream. Because we want to make sure the input file does not have any unexpected content, we use the STRICT mode. The input file turns out to do not have non-standard features, so that we do not have to worry about D and T element symbols.

The solution lists all heavy atom counts:

import org.openscience.cdk.interfaces.*;
import org.openscience.cdk.io.*;
import org.openscience.cdk.io.iterator.*;
import org.openscience.cdk.silent.*;
import org.openscience.cdk.tools.manipulator.*;
import java.util.zip.GZIPInputStream;

iterator = new IteratingMDLReader(
  new GZIPInputStream(
    new File("benzodiazepine.sdf.gz")
      .newInputStream()
  ),
  SilentChemObjectBuilder.getInstance()
)
iterator.setReaderMode(
  IChemObjectReader.Mode.STRICT
)
while (iterator.hasNext()) {
  mol = iterator.next()
  heavyAtomCount = 0
  for (atom in mol.atoms()) {
    if (1 == atom.atomicNumber ||
        "H".equals(atom.symbol)) {
      // do not count hydrogens
    } else {
      heavyAtomCount++
    }
  }
  println heavyAtomCount
}