Discovery Stage

Figure captions for software pipelines: discovery
One of the major outputs from the Clinical Proteomic Technologies for Cancer initiative
has been the development of software analysis tools. Analysis of mass
spectrometry data for protein identification includes a number of steps, as
depicted above. Briefly, the raw mass spectra are first processed to
improve the quality of the spectra. Poor spectra are discarded.
Next, peptides are identified from the spectra. If one is searching
for post-translational modifications, those would also be identified at this
point. After that, protein identities are inferred from the identified
peptides. Finally, a number of quantitative and semi-quantitative methods
are available to differentiate proteins upregulated in specific disease states.
These disease-linked proteins may then comprise a biomarker candidate list.
See below for further descriptions of these tools.
Data Pre-processing
- ScanSifter: ScanSifter, software developed in-house at Vanderbilt, assesses the quality of each raw spectrum and discards poor-quality spectra, which streamlines data analysis. In detail, the algorithm reads tandem mass spectra stored as centroided peak lists from Thermo RAW files and transcodes them to text-based files. Spectra that contain fewer than six peaks or that have fewer than 20 measured peaks in the total ion chromatogram are not transcoded. If 90% of the intensity of a tandem mass spectrum appears at a lower m/z than the precursor ion, a single precursor charge is assumed; otherwise, the spectrum is processed under both double and triple precursor charge assumptions.
- MAZIE (Mass and Charge (Z) Interface Engine): This software improves identification of peptide ion mass and charge based on the isotopic distribution of peptide ion envelopes. It will be distributed freely to the research community upon publication. MAZIE was written to enhance the fidelity of ion clustering, but it is also suitable for general use to enhance database searching for mid-resolution mass spectrometers.
- ProteoWizard: ProteoWizard provides modular, open-source, cross-platform tools and libraries. The tools perform proteomic data analysis, while the libraries enable rapid tool creation by providing a robust, pluggable framework that simplifies and unifies data file access and performs standard proteomics and LC-MS dataset computations.
- MassQC: MassQC is a software package that helps diagnose the performance of mass spectrometry instrument hardware. Using data from CPTAC inter-lab studies, NIST developed a number of metrics to assess instrument performance and, through careful examination, related specific metrics to aspects of the measurement process. For instance, a decrease in the chromatographic elution time for a sample may indicate that a column should be replaced. Proteome Software, a small software company in Portland, Oregon, built a graphical user interface over the NIST metrics. The resulting software package, MassQC, was released June 5, 2009.
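The ScanSifter peak-count filter and precursor-charge rule described above can be sketched as follows. This is a minimal illustration, not the Vanderbilt implementation: the `Spectrum` container, the function names, and the choice to treat "90%" as "at least 90%" are assumptions. The second filter (the total ion chromatogram criterion) is left out because the text does not specify it precisely enough to code.

```python
from dataclasses import dataclass
from typing import List, Tuple

MIN_PEAKS = 6            # spectra with fewer peaks are not transcoded (from the text)
LOW_MZ_FRACTION = 0.90   # intensity share below the precursor m/z that implies 1+


@dataclass
class Spectrum:
    """Hypothetical container for one centroided MS/MS peak list."""
    precursor_mz: float
    peaks: List[Tuple[float, float]]  # (m/z, intensity) pairs


def passes_peak_count(spectrum: Spectrum) -> bool:
    """Keep only spectra with at least MIN_PEAKS peaks."""
    return len(spectrum.peaks) >= MIN_PEAKS


def candidate_charges(spectrum: Spectrum) -> List[int]:
    """Return the precursor charge states under which to process a spectrum."""
    total = sum(intensity for _, intensity in spectrum.peaks)
    if total <= 0:
        return []  # no usable intensity, nothing to decide
    below = sum(
        intensity for mz, intensity in spectrum.peaks
        if mz < spectrum.precursor_mz
    )
    # Assumption: "90% of the intensity" is read as "at least 90%".
    if below / total >= LOW_MZ_FRACTION:
        return [1]
    return [2, 3]


# Usage with made-up peak lists.
mostly_low = Spectrum(500.0, [(100.0, 10.0), (200.0, 20.0), (300.0, 30.0),
                              (400.0, 25.0), (450.0, 10.0), (600.0, 5.0)])
spread_out = Spectrum(500.0, [(100.0, 10.0), (200.0, 10.0), (300.0, 10.0),
                              (600.0, 30.0), (700.0, 30.0), (800.0, 10.0)])
print(candidate_charges(mostly_low))   # 95% of intensity lies below m/z 500
print(candidate_charges(spread_out))   # only 30% lies below m/z 500
```

A spectrum from a singly charged precursor cannot produce fragments above the precursor m/z, which is why the intensity distribution is a usable hint for the charge state.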
Peptide ID
- DirecTag: DirecTag is a tag-based identification tool. Sequence tagging has been shown to be an accurate way to recover sequences from tandem mass spectra, and DirecTag identifies peptides through automated sequence tag inference. The algorithm has been evaluated on a diversity of MS instruments, from TOF/TOF to quadrupole ion trap. DirecTag has been released with source code to the research community.
- HMMatch: The HMMatch tool demonstrates the ability to confidently assign more peptide identifications than is possible with a single search engine score, with no loss of statistical significance. The increased number of peptide identifications improves protein coverage and the ability to discern protein isoforms. A statistical significance model permits HMMatch scores to be compared with each other, and with other peptide identification tools, on a unified scale. It is possible to extrapolate HMMatch models beyond a single peptide's training spectra to the spectra of related peptides, expanding the application of spectral matching techniques beyond the set of peptides previously observed. In contrast to sequence-based tandem mass spectrometry search engines used for peptides, spectral matching can make use of the intensities of fragment peaks in library spectra to assess the quality of a match.
- MyriMatch: MyriMatch makes more effective use of fragment ion intensity than X!Tandem Expect and Sequest XCorr and is robust against noise peaks. MyriMatch has been selected as the standard search engine for processing the data sets of the CPTAC Unbiased Working Group.
- TagRecon: Tag reconciliation matches the tag sequences inferred for an MS/MS spectrum against the protein sequences in the database, and it can allow amino acid changes on either side of the inferred sequence. The TagRecon software conducts this process using the same scoring algorithm as MyriMatch. Because both tools share a scoring algorithm, combining TagRecon and MyriMatch search results increases the number of confident peptide identifications.
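To make the idea of sequence tag inference concrete, the sketch below reads short tags from the gaps between fragment peaks. It is a generic illustration, not DirecTag's algorithm or scoring: the function names, the tag length of 3, the 0.02 m/z tolerance, and the assumption of singly charged fragments from one ion series are all choices made for this example. The residue table is a subset of the standard monoisotopic residue masses (L also stands for the isobaric I).

```python
from typing import Dict, List, Optional

# Standard monoisotopic residue masses (subset; L also covers isobaric I).
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333,
    "W": 186.07931,
}
MAX_RESIDUE = max(RESIDUE_MASS.values())


def residue_for_gap(gap: float, tol: float) -> Optional[str]:
    """Return the residue whose mass matches an m/z gap, if any."""
    for residue, mass in RESIDUE_MASS.items():
        if abs(gap - mass) <= tol:
            return residue
    return None


def infer_tags(mzs: List[float], tag_length: int = 3,
               tol: float = 0.02) -> List[str]:
    """Enumerate tags spelled by consecutive peak-to-peak gaps.

    Assumes singly charged fragments, so an m/z gap equals a mass gap.
    """
    peaks = sorted(mzs)
    tags = set()

    def extend(index: int, tag: str) -> None:
        if len(tag) == tag_length:
            tags.add(tag)
            return
        for nxt in range(index + 1, len(peaks)):
            gap = peaks[nxt] - peaks[index]
            if gap > MAX_RESIDUE + tol:
                break  # peaks are sorted, so later gaps only grow
            residue = residue_for_gap(gap, tol)
            if residue is not None:
                extend(nxt, tag + residue)

    for start in range(len(peaks)):
        extend(start, "")
    return sorted(tags)


# Usage: a ladder spelling A, L, G plus one unrelated noise peak at 300.5.
ladder = [200.0, 271.03711, 300.5, 384.12117, 441.14263]
print(infer_tags(ladder))
```

A real tagging tool also has to score competing tags and record the flanking masses on either side of each tag, which is the information a reconciliation step such as TagRecon's matches against database sequences.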
PTM Assignment
- PepCyber: The PepCyber database focuses primarily on the interactions between phosphoprotein-binding domains (PPBDs) in phosphoprotein-binding proteins (PPBPs) and phosphopeptides (PPEPs). A procedure has been established for database and literature curation to populate PepCyber with information regarding the interactions between PPBDs and their PPEP substrates that are mediated through SH2 domain binding.
- MonsterMod: MonsterMod, a successor to the P-mod algorithm, matches a user-supplied list of peptide or protein sequences to a collection of tandem mass spectra via the MVH scorer (first developed for MyriMatch). This functionality will be separated into the "MassRecon" data analysis tool and a PTM explorer tool for visualizing and interacting with the resulting identifications.
Protein ID
- IDPicker: The IDPicker tool enables users to organize experimental data into complex hierarchies. It was developed for protein assembly and has proven to be invaluable in generating tables of spectral counts that can be used for identifying candidate biomarkers in large cancer data sets. It has also been instrumental in organizing the complex datasets from the Unbiased Discovery inter-laboratory studies.
- iProphet: iProphet allows more precise integration of the information supporting the identification of each unique peptide sequence from multiple MS/MS spectra. iProphet takes as input PeptideProphet spectrum-level results from multiple LC-MS/MS runs and then computes a new probability at the level of the unique peptide sequence. The new framework allows results from multiple search tools to be combined, and it also takes into account other supporting factors, including the number of sibling experiments identifying the same peptide ions, the number of replicate ion identifications, sibling ions, and sibling modification states.
- PepArML: The PepArML Meta-Search engine provides access to large-scale MS/MS sequence database searching infrastructure to researchers and labs without the computational resources or personnel to implement a distributed computing strategy in-house. Furthermore, this infrastructure provides a mechanism for search engines not intended or designed to run in a distributed or parallel fashion to be used in a distributed environment without modifying the individual search engines, which eliminates the need for the many ad hoc distributed computing solutions embedded in each individual search engine. Lastly, the meta-search engine is designed to be self-contained and platform-independent and to require minimal operating system support, making it suitable for installation in small to medium-size labs with little distributed computing expertise.
- Scaffold: Scaffold is a program that integrates search results from three algorithms (Sequest, X! Tandem and Mascot) to generate peptide identification and protein identification probabilities. Scaffold also displays an overview of the protein identifications, which can be validated by their probability scores. Protein information can also be used to detect false positives and to examine the peptide and spectral evidence used for identification.
ID-based Differentiation
- QSpec: QSpec is a method for analyzing data generated by spectral counting. Spectral counting has become an accepted method for label-free quantitation in proteomics and is comparable to measurements of extracted ion-chromatographic intensities. It has the advantage of being applicable to shotgun proteomics data from medium- or even low-resolution mass spectrometers, but comparison of proteins between two conditions is restricted to those that are identified by MS/MS scans in both conditions.
- SASPECT: SASPECT provides a function for identifying differentially expressed proteins between two sample groups using spectral counts from LC-MS/MS experiments. SASPECT employs the commonly used “spectral-count” assumption: the probability of a protein being observed in one LC-MS/MS experiment is proportional to its abundance in the complex sample. However, in contrast to spectral counting, SASPECT uses the Boolean value of whether the spectral count is greater than zero instead of the raw spectral counts, because the latter are more subject to changes in various experimental factors. In addition, by properly controlling the false discovery rate (FDR), SASPECT provides quantitative guidance in peptide and protein selection.
- QuasiProto: QuasiProto is designed for spectral count differentiation in complex proteomic data sets. The software works from IDPicker tables that report the numbers of spectra matched to each protein. QuasiProto computes q-values for proteins by means of a quasi-likelihood model based on these spectral counts. This model enables the incorporation of features such as differences in instrument performance over time.
- VIBE-MS: The recently developed VIBE Toolkit for Mass Spectrometry gives users access to an integrated, modular environment for mass spectrometry data classification. The software provides an extensible ‘drag-and-drop’ graphical interface for creating workflows, an environment well suited to efficiently evaluating and optimizing mass spectrometry analysis pipelines. The software provides the required flexibility in the selection, comparison, and optimization of these analysis methods, as well as in the optimization of the entire analysis pipeline. The ease with which modules can be removed from a pipeline, replaced with an alternate module, re-parameterized, and reused reduces the amount of time and effort required for a researcher to arrive at the optimal analysis protocol for a given problem.
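The presence/absence idea described for SASPECT above can be illustrated with a small sketch. The source does not give SASPECT's actual statistic, so this example substitutes two standard techniques and should not be read as SASPECT's method: a two-sided Fisher exact test on the detected/not-detected table for each protein, followed by Benjamini-Hochberg adjustment to control the FDR. All function names and the example counts are hypothetical.

```python
from math import comb
from typing import List, Sequence


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x detections in the first group.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= observed * (1 + 1e-9))
    return min(1.0, total)


def presence_pvalue(counts_a: Sequence[int], counts_b: Sequence[int]) -> float:
    """Compare how often a protein is detected (count > 0) in two groups."""
    seen_a = sum(1 for count in counts_a if count > 0)
    seen_b = sum(1 for count in counts_b if count > 0)
    return fisher_exact_two_sided(seen_a, len(counts_a) - seen_a,
                                  seen_b, len(counts_b) - seen_b)


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Usage: spectral counts per run for two made-up proteins in two groups.
proteins = {
    "protein_X": ([5, 3, 8, 2, 4], [0, 0, 0, 0, 0]),  # seen only in group A
    "protein_Y": ([2, 0, 1, 3, 0], [1, 2, 0, 0, 4]),  # seen equally often
}
pvals = [presence_pvalue(a, b) for a, b in proteins.values()]
for name, q in zip(proteins, benjamini_hochberg(pvals)):
    print(name, round(q, 4))
```

Reducing counts to detected/not detected discards magnitude information, which is the trade-off the SASPECT description accepts in exchange for robustness to run-to-run variation in the raw counts.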
Intensity-based Differentiation
- PICquant: A new stable isotope mass tag (13C phenylisocyanate, PIC) with several advantages for protein quantification in complex mixtures has been developed. The platform built around it includes: 1) custom-designed software (PICquant) to automatically quantify labeled peaks, 2) a spectrum-comparison algorithm that groups spectra into a registry of spectra representing unique peptide families, and 3) enhanced peptide sequencing by distinguishing b- and y-ion series in CID spectra. Completing the platform is a clinical registry that links acquired specimens to current and prospective clinical information, including outcomes, and that enables multivariate clustering of disease states with quantified protein families. The mature PICquant platform will provide almost fully automated data analysis, allowing assembly of numerous patient samples into complete protein abundance profiles akin to gene expression array data. Additionally, the expanding registry database will be of use to other investigators processing either PIC-labeled or unlabeled peptide spectra.
- PEPPeR: PEPPeR, a Platform for Experimental Proteomic Pattern Recognition, uses high-resolution, high-mass-accuracy LC-MS data from state-of-the-art mass spectrometers and combines pattern-based (unidentified peptide peaks) and identity-based (peptides sequenced via MS/MS) information to generate peptide quantitation, thereby extending biomarker discovery to all charge-identified MS1 peaks. From a computational perspective, the uniqueness of this approach arises from: (i) the use of identified peptides to guide alignment of unidentified peaks; (ii) the matching of unidentified peaks across multiple samples using mixture-model-based peak matching; and (iii) adaptive matching tolerances automatically calculated for each experiment.
- fPEPPeR: A recently developed extension of PEPPeR, named fPEPPeR, incorporates the first methodology for processing and computationally reassembling peptide fractions from multidimensional fractionation to facilitate data analysis at the sample level. The method works well despite imprecise fraction boundaries or other variations during fractionation. In fPEPPeR, the PEPPeR peak-matching algorithm has been adapted to identify the same peptide species (peak) not only across multiple samples but also across different fractions. fPEPPeR outputs intensity measurements for a common set of peaks spanning all the fractions under consideration. A sample is then computationally reassembled (i.e., defractionated) by summing the intensity measurements for each matched peak across all fractions from that sample. The defractionated samples can then be subjected to biomarker discovery, class prediction, clustering, and other pattern recognition algorithms without regard to the fractionation or any variations therein. This software is freely available as a GenePattern module.
- XAlign: XAlign is a two-step alignment algorithm. The first step is to detect significant peaks that are common to all samples. In the second step, all samples are aligned to the median sample using refined m/z and retention time variation values, where pattern recognition is applied as needed.
- Availability: XAlign software is available upon request from the author.
- Contact: xiang.zhang@louisville.edu
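The defractionation step described for fPEPPeR above, summing the intensity of each matched peak across all fractions of a sample, can be sketched in a few lines. This is an illustration only, not the GenePattern module: the record layout and all identifiers are hypothetical, and the sketch assumes the peak-matching step has already assigned a common peak ID to the same peptide species across samples and fractions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (sample id, fraction id, matched peak id, intensity).
Measurement = Tuple[str, str, str, float]


def defractionate(
    measurements: Iterable[Measurement],
) -> Dict[str, Dict[str, float]]:
    """Sum each matched peak's intensity over all fractions of a sample."""
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: defaultdict(float)
    )
    for sample, _fraction, peak, intensity in measurements:
        # The fraction label is ignored on purpose: a peak that drifts
        # across a fraction boundary still lands in the same per-sample sum.
        totals[sample][peak] += intensity
    return {sample: dict(peaks) for sample, peaks in totals.items()}


# Usage with made-up values: pk1 of sample S1 is split across two fractions.
rows = [
    ("S1", "F1", "pk1", 10.0),
    ("S1", "F2", "pk1", 20.0),
    ("S1", "F2", "pk2", 5.0),
    ("S2", "F1", "pk1", 7.0),
]
print(defractionate(rows))
```

Because the sum runs over fractions within a sample, the result has one intensity per peak per sample, which is the shape downstream pattern recognition methods expect.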