I am a computational biologist who puts equal emphasis on developing algorithms to allow scientists to grok large, complex, high-dimensional datasets and on creating visualizations to present, support, and validate the results of those algorithms. Find below a sampling of projects I have worked on. For more information please view my resume or contact me.
Summary:
In 2007, I led a team of software developers to produce Swift, a web application that provides mass spectrometrists and proteomics researcher-physicians with more accurate and comprehensive protein identifications by automatically analyzing mass spectrometry data with a number of popular commercial and open source tools. Swift presents an easy-to-use web interface that, after a few clicks and without further user intervention, can process data through the commercial protein search tools Mascot and Sequest and the open source tool X!Tandem, and combine the results using Scaffold. I was responsible for the design and planning of this tool; my Perl prototypes (developed for RAAMS) informed the initial implementation. I designed a FASTA protein sequence database creation tool, and then mentored a junior programmer to implement it. In addition to planning, designing and mentoring for Swift as a whole, I designed and implemented a parameters editor that creates parameter files for all three search engines from one user interface; in the process, I taught myself Groovy and Google Web Toolkit (GWT). In addition to Mayo, Swift has been successfully deployed at the University of Minnesota Supercomputing Institute.
SpecView was a Java mass spectrum viewer client I designed and implemented from mid-2001 to early 2002 for viewing the results of Sequel Genetics' (now Spectra Genetics') proprietary Peptide Mass Signature Genotyping technology. SpecView used CORBA (specifically, JacORB and MICO) to communicate between the C++ server responsible for spectral processing and the Java Swing client; this architecture was driven, among other factors, by a business goal to make data analysis a sustainable revenue stream. In addition to architecting the CORBA interfaces (IDL), I co-authored the C++ backend and implemented the client in its entirety. I taught myself both Swing and CORBA in the process of implementing this visualization tool, the first version of which I had working in just over a month (about 15,000 of an eventual 40,000 lines of Java). SpecView was the first major user-facing software I had designed, and I devoted a significant amount of effort to usability: SpecView employed extensive drag and drop, made restrained use of transparency/alpha for indicating selection of spectra and molecules, and was fully threaded, allowing rapid, highly interactive browsing of spectra even over slow network connections. Moreover, SpecView incorporated several novel algorithms, including a quadtree-based scheme for non-overlapping label placement and a sophisticated client-side caching architecture that combined a multi-level binary search tree with a soft-reference cache. See an early prototype sketch (pdf), an overall screenshot and one showing label placement.
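The idea behind quadtree-based label placement can be illustrated with a minimal sketch. This is not SpecView's code (which was Java and is not reproduced here); it is a generic greedy scheme written in Python for brevity, and the names `QuadTree`, `place_labels`, and the rectangle layout `(x, y, width, height)` are assumptions made for illustration. Labels are offered in priority order, and each one is kept only if its bounding box does not overlap a label already placed; the quadtree limits how many existing boxes each check has to visit.

```python
def overlaps(a, b):
    """True if rectangles a and b, given as (x, y, w, h), share interior area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def contains(outer, inner):
    """True if rectangle inner lies entirely within rectangle outer."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh


class QuadTree:
    """Stores rectangles; a rectangle lives in the smallest node that contains it."""

    def __init__(self, bounds, depth=0, max_depth=6, capacity=4):
        self.bounds = bounds
        self.depth = depth
        self.max_depth = max_depth
        self.capacity = capacity
        self.rects = []
        self.children = []

    def _child_containing(self, rect):
        for child in self.children:
            if contains(child.bounds, rect):
                return child
        return None

    def _split(self):
        x, y, w, h = self.bounds
        hw, hh = w / 2, h / 2
        self.children = [
            QuadTree((cx, cy, hw, hh), self.depth + 1, self.max_depth, self.capacity)
            for cx in (x, x + hw)
            for cy in (y, y + hh)
        ]
        kept = []
        for rect in self.rects:
            child = self._child_containing(rect)
            if child is not None:
                child.insert(rect)
            else:
                kept.append(rect)  # straddles a boundary, so it stays here
        self.rects = kept

    def insert(self, rect):
        if self.children:
            child = self._child_containing(rect)
            if child is not None:
                child.insert(rect)
                return
        self.rects.append(rect)
        if (not self.children and len(self.rects) > self.capacity
                and self.depth < self.max_depth):
            self._split()

    def collides(self, rect):
        """True if rect overlaps any rectangle stored in this subtree."""
        if any(overlaps(r, rect) for r in self.rects):
            return True
        return any(
            overlaps(child.bounds, rect) and child.collides(rect)
            for child in self.children
        )


def place_labels(candidates, canvas):
    """Greedily keep labels, in the given priority order, that do not overlap."""
    tree = QuadTree(canvas)
    placed = []
    for text, rect in candidates:
        if not tree.collides(rect):
            tree.insert(rect)
            placed.append(text)
    return placed
```

For example, offering labels "A" at `(10, 10, 20, 5)`, "B" at `(15, 12, 20, 5)` and "C" at `(50, 50, 20, 5)` on a `(0, 0, 100, 100)` canvas keeps "A" and "C" and drops "B", whose box overlaps "A". A real viewer would likely try alternative positions for a rejected label before giving up; that refinement is omitted here.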
Following a suggestion by Terry Therneau, in late 2005 and early 2006 I designed and implemented an algorithm to interpret ¹⁶O/¹⁸O differential proteomics mass spec data created in the Mayo Proteomics Research Center (MPRC). This labeling technique allows two complex protein samples (for example, one from a patient with pancreatic cancer and one from an age-, sex-, and diabetes status-matched control patient) to be compared, in order to find differences in the expression of proteins that might serve as diagnostic or prognostic biomarkers. The technique works by digesting the samples with the endoprotease trypsin (cutting the large proteins into smaller "peptides") in the presence of H₂¹⁸O (water in which the oxygen atom has been replaced by a non-radioactive isotope that is two daltons heavier). The natural mechanism of trypsin causes the incorporation of up to two of these heavier oxygen atoms, and the difference in mass results in complex mass spectra that the algorithm can automatically interpret to give the relative amounts of protein in each sample. The algorithm, based on linear regression, is the subject of a paper I wrote that was accepted by Molecular and Cellular Proteomics, as well as an article in Bioinformatics by our biostatistics collaborators. The C++ code implementing the algorithm has been released as open source.
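To give a feel for how linear regression can untangle such spectra, here is a minimal sketch under simplifying assumptions. It is not the published algorithm or the released C++ code; it is a generic illustration in Python. It assumes the observed isotope cluster is a weighted sum of one known natural isotope envelope shifted by 0, 2 and 4 daltons (no, one, or two heavy oxygens incorporated), assumes a singly charged peptide so that one dalton is one position in the list, and solves for the three weights by ordinary least squares. The function names and the example envelope are hypothetical.

```python
def shifted(envelope, shift, length):
    """Place the isotope envelope into a zero vector, moved right by `shift`."""
    column = [0.0] * length
    for i, value in enumerate(envelope):
        if i + shift < length:
            column[i + shift] = value
    return column


def solve_linear(a, b):
    """Solve a small square system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_label_states(observed, envelope):
    """Least-squares weights for the +0, +2 and +4 dalton components."""
    n = len(observed)
    columns = [shifted(envelope, s, n) for s in (0, 2, 4)]
    # Normal equations: (A^T A) x = A^T y
    ata = [[sum(ci[k] * cj[k] for k in range(n)) for cj in columns]
           for ci in columns]
    aty = [sum(ci[k] * observed[k] for k in range(n)) for ci in columns]
    return solve_linear(ata, aty)


# Synthetic example: weights 10, 2 and 8 on a made-up three-peak envelope.
envelope = [0.6, 0.3, 0.1]
parts = [shifted(envelope, s, 7) for s in (0, 2, 4)]
observed = [10 * a + 2 * b + 8 * c for a, b, c in zip(*parts)]
unlabeled, one_heavy, two_heavy = fit_label_states(observed, envelope)
ratio = (one_heavy + two_heavy) / unlabeled
```

On this noise-free synthetic cluster the fit recovers the three weights, and `ratio` treats the +2 and +4 components together as the labeled sample. Real data would also need noise handling, charge-state handling, and a per-peptide envelope, none of which is attempted here.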
To validate the RAAMS algorithm and to display its results to users, I mentored a summer student to produce a DHTML spectrum viewer. The viewer is very similar in concept to Google Maps: the server side (C++) produces image tiles (using ImageMagick and libpng) that are sent to the client (JavaScript), which places them, absolutely positioned, in a <div>. In response to user input to pan or zoom, the client fetches additional tiles as well as JSON-formatted data, which is overlaid on top of the tiles. Currently both raw spectrum data (samples) and peaks are supported, with better support for peaks. We would like to move to using the <canvas> tag for drawing both the peaks and annotations (squares). We also embedded this web application inside Spotfire DecisionSite (which itself embeds Internet Explorer), shown displaying all the peptide expression ratios (y axis, > 0 is upregulated in diseased sample) for a given protein.
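The tile bookkeeping behind this kind of viewer can be sketched briefly. The following is an illustration in Python, not the viewer's actual JavaScript; the 256-pixel tile size, the `(zoom, column, row)` tile key, and all function names are assumptions. Given the viewport's position within the full image at the current zoom level, it works out which tiles intersect the viewport, which of those still need to be fetched, and where each tile should be absolutely positioned inside the containing element.

```python
TILE_SIZE = 256  # assumed tile edge length in pixels


def visible_tiles(view_x, view_y, view_w, view_h, zoom, tile_size=TILE_SIZE):
    """Keys (zoom, col, row) of every tile that intersects the viewport.

    view_x and view_y are the pixel coordinates of the viewport's top-left
    corner within the full image rendered at this zoom level.
    """
    first_col = view_x // tile_size
    first_row = view_y // tile_size
    last_col = (view_x + view_w - 1) // tile_size
    last_row = (view_y + view_h - 1) // tile_size
    return [
        (zoom, col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


def tiles_to_fetch(needed, cache):
    """Tiles that are visible but not yet downloaded."""
    return [key for key in needed if key not in cache]


def tile_offset(key, view_x, view_y, tile_size=TILE_SIZE):
    """(left, top) pixel position of a tile relative to the viewport."""
    _, col, row = key
    return (col * tile_size - view_x, row * tile_size - view_y)


# A 512x256 viewport panned 300 pixels to the right at zoom level 2.
needed = visible_tiles(300, 0, 512, 256, zoom=2)
missing = tiles_to_fetch(needed, cache={(2, 1, 0)})
```

Here the viewport spans tile columns 1 through 3 of row 0; with column 1 already cached, only columns 2 and 3 need to be requested, and the first tile is positioned 44 pixels to the left of the viewport edge. The same keys could be used to request the matching JSON overlay data.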
Over a cold weekend in late 2005, I taught myself OpenGL and implemented, in C++ and GLUT, an interactive three-dimensional Liquid Chromatogram Mass Spec (LC-MS) viewer. The viewer made use of modern graphics card features such as vertex buffer objects, which allow geometry to be transferred to the graphics card by DMA without involving the host CPU. I plan to eventually return to this proof-of-concept project to make it threaded and client-server, and to use a terrain decimation level-of-detail algorithm to display the chromatogram as a dynamically refined surface. In the screenshot, retention time runs diagonally from bottom left to top right, mass-to-charge runs from middle left to bottom right, and abundance runs from bottom to top; each line is a peak, and the red lines are those peaks that RAAMS has determined to be part of an isotope cluster.