
Learning the Grass, Not the Cow: How Overfitting Can Mislead Proteomics Searches

By Max Burq | March 12, 2024

Mass spectrometry proteomics has revolutionized how we study biology, allowing us to identify and quantify thousands of proteins in complex samples. At the heart of this process lie powerful search engines that match the complex spectral data generated by the mass spectrometer to peptide sequences in a database. To ensure we trust these identifications, we rely heavily on methods like the Target-Decoy Approach (TDA) to estimate the False Discovery Rate (FDR). But what happens when the sophisticated machine learning tools we use to improve these identifications start learning the wrong lessons?

This is where the concept of overfitting comes in, a potential pitfall that can subtly undermine the reliability of our results, particularly when using popular post-processing tools like Percolator.

The Confidence Game: Target-Decoy Explained


Imagine you’re searching for specific faces (target peptides) in a massive crowd photo (your experimental spectra). It’s easy to mistakenly identify random patterns as faces. To estimate how often you make mistakes, you cleverly add known non-faces (decoy peptides) to your search. Decoys are typically generated by reversing or shuffling real protein sequences – they look like peptides but shouldn’t actually exist in your sample.
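For the curious, decoy generation really can be this simple at its core. Here's a minimal sketch of the reversal and shuffling ideas; real pipelines add refinements this omits (for example, many keep the C-terminal residue fixed so tryptic cleavage behavior is preserved):

```python
# Minimal sketch of decoy generation by reversal or shuffling.
# Real tools add care (e.g., often fixing the C-terminal residue);
# this illustrates only the core idea.

import random

def reversed_decoy(peptide: str) -> str:
    """Reverse the peptide sequence to create a decoy."""
    return peptide[::-1]

def shuffled_decoy(peptide: str, seed: int = 0) -> str:
    """Randomly shuffle the residues to create a decoy."""
    residues = list(peptide)
    random.Random(seed).shuffle(residues)
    return "".join(residues)

print(reversed_decoy("PEPTIDE"))   # EDITPEP
print(shuffled_decoy("PEPTIDE"))   # a random permutation of PEPTIDE
```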

You then search your crowd photo for both real faces (targets) and non-faces (decoys). By seeing how many decoys you accidentally “identify” at a certain confidence score, you can estimate how many real faces you might also be misidentifying at that same score. This gives you the FDR – typically set at 1% or 5%.
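In code, that decoy-counting logic is just a ratio: the decoy hits above a threshold stand in for the false target hits we can't observe directly. Here's a minimal sketch with made-up scores; real pipelines work with thousands of PSMs and often apply corrections, but the counting idea is the same:

```python
# Minimal sketch of target-decoy FDR estimation. The scores below are
# invented for illustration; only the counting rule matters.

def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR at a threshold: each decoy hit above it suggests
    roughly one false target hit lurking above it too."""
    targets_above = sum(s >= threshold for s in target_scores)
    decoys_above = sum(s >= threshold for s in decoy_scores)
    if targets_above == 0:
        return 0.0
    return decoys_above / targets_above

target_scores = [9.1, 8.7, 8.2, 7.9, 7.5, 6.8, 6.1, 5.4, 4.9, 4.2]
decoy_scores  = [5.1, 4.6, 4.0, 3.8, 3.1, 2.9, 2.4, 2.0, 1.7, 1.2]

for t in (7.0, 5.0, 3.0):
    fdr = estimate_fdr(target_scores, decoy_scores, t)
    print(f"threshold {t}: estimated FDR = {fdr:.1%}")
```

Lowering the threshold accepts more targets but lets more decoys through, so the estimated FDR climbs – exactly the trade-off the 1% or 5% cutoff controls.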

Boosting Power with Machine Learning: Enter Percolator


Basic search engines assign scores to peptide-spectrum matches (PSMs). Tools like Percolator use machine learning (ML) to improve this process. They look at various features of a PSM (like the number of matched ions, intensity correlations, peptide length, retention time, etc.) and learn to distinguish between likely true matches (high-scoring targets) and likely false matches (decoys). By learning these patterns, Percolator can re-score the PSMs, often allowing us to identify more peptides at the same FDR threshold – a win for sensitivity!
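To make this concrete, here's a heavily simplified stand-in for what a re-scorer does. Percolator itself uses an iterative, semi-supervised SVM with internal cross-validation; in this sketch a plain logistic regression plays that role, the PSM "features" are randomly generated for illustration, and scikit-learn is assumed to be installed:

```python
# Simplified sketch of Percolator-style re-scoring (NOT Percolator's
# actual algorithm): learn to separate targets from decoys using PSM
# features, then use the model's output as a new score.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Toy PSM feature matrix: columns might be matched-ion count, intensity
# correlation, peptide length, retention-time error, ...
n_targets, n_decoys, n_features = 500, 500, 6
X_targets = rng.normal(loc=0.5, scale=1.0, size=(n_targets, n_features))
X_decoys = rng.normal(loc=-0.5, scale=1.0, size=(n_decoys, n_features))

X = np.vstack([X_targets, X_decoys])
y = np.concatenate([np.ones(n_targets), np.zeros(n_decoys)])  # 1 = target

# Cross-validated predictions: each PSM is scored by a model that never
# saw it during training, which is one guard against overfitting.
model = LogisticRegression(max_iter=1000)
rescored = cross_val_predict(model, X, y, cv=3, method="predict_proba")[:, 1]

print("mean re-score, targets:", rescored[y == 1].mean().round(3))
print("mean re-score, decoys: ", rescored[y == 0].mean().round(3))
```

The cross-validation step is worth noticing: it exists precisely because of the problem this post is about.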

Here’s the catch. Machine learning models learn from the data they are given. Overfitting occurs when the model learns the training data too specifically, including its noise and random quirks, rather than learning the underlying general patterns.

Let’s use an analogy. Imagine you’re training a computer to distinguish pictures of cows from pictures of dogs. You feed it thousands of labeled images. But, by chance, most of your cow pictures were taken in grassy fields, while most dog pictures were taken indoors or on pavement. An overfitted model might not learn the features of a cow (shape, horns, udder) versus a dog (snout, tail wag). Instead, it might learn a simpler, but wrong, rule: “If there’s a lot of green (grass) in the picture, it’s a cow.” This model works perfectly on your training images! But show it a picture of a cow on rocky terrain or a dog in a park, and it will fail miserably because it learned an association (cows with grass) rather than the defining features of the animal itself.

[Images: example photos of cows and dogs]

How does this apply to Percolator and TDA? The ML model in Percolator is trained using high-scoring target PSMs (assumed mostly true) and decoy PSMs (assumed false). If the model overfits, it might learn subtle, non-robust features that happen to perfectly separate these specific training decoys from these specific training targets, rather than learning the fundamental characteristics of good vs. bad peptide-spectrum matches.

For example, the decoy generation method itself might introduce a tiny, systematic artifact (like a slight bias in predicted fragment ions) that doesn’t exist in real peptides. An overfitted model could latch onto this artifact as a “tell-tale sign” of a decoy. It becomes exceptionally good at spotting these specific decoys used during training, but it hasn’t learned to spot genuinely bad matches among the targets that don’t share this artifact.

If the ML model overfits to the decoys, it becomes overly confident in its ability to distinguish true from false. It thinks it’s separating targets and decoys brilliantly based on the training data. This leads to an underestimation of the actual FDR. You might set your threshold for a 1% FDR, but because the model is overconfident from overfitting, the true FDR among your accepted target peptides could be significantly higher – perhaps 5% or even 10%.
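A deliberately exaggerated toy simulation makes this failure mode visible. Everything below is invented – the “quality” and “artifact” features, the class proportions – and the model is trained and scored on the very same PSMs with no cross-validation, so the gap between estimated and true FDR is far larger than you would see in practice. But the direction of the error is the point:

```python
# Toy simulation of FDR underestimation from overfitting, assuming
# scikit-learn. All distributions are invented; real decoy artifacts
# would be far subtler than this.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 2000

# Targets: ~70% genuinely true matches (higher "quality"), rest false.
is_true = rng.random(n) < 0.7
quality_t = np.where(is_true, rng.normal(2.0, 1.0, n), rng.normal(0.0, 1.0, n))
artifact_t = rng.normal(0.0, 1.0, n)           # targets carry no artifact

# Decoys: all false, but marked by a subtle decoy-generation artifact.
quality_d = rng.normal(0.0, 1.0, n)
artifact_d = rng.normal(0.8, 1.0, n)           # the tell-tale sign

X = np.vstack([np.column_stack([quality_t, artifact_t]),
               np.column_stack([quality_d, artifact_d])])
y = np.concatenate([np.ones(n), np.zeros(n)])   # 1 = target, 0 = decoy

# Deliberately overfit: a flexible model, trained and scored on the
# same PSMs, with no cross-validation.
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
scores = model.predict_proba(X)[:, 1]
t_scores, d_scores = scores[:n], scores[n:]

# Find the loosest threshold where the decoy-based estimate reads <= 1%.
for thr in np.sort(t_scores):
    est_fdr = (d_scores >= thr).sum() / (t_scores >= thr).sum()
    if est_fdr <= 0.01:
        break

accepted = t_scores >= thr
true_fdr = (~is_true[accepted]).mean()  # knowable only in a simulation
print(f"estimated FDR: {est_fdr:.1%}   true FDR: {true_fdr:.1%}")
```

The overfit model memorizes its training decoys via the artifact feature, so the decoy-based estimate looks excellent while the false matches hiding among the targets sail through unnoticed.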

This means more false positives – peptides and proteins you think are confidently identified but aren’t actually there. Building biological conclusions on shaky identifications can lead research down the wrong path.

Overfitting is a known challenge in machine learning, and researchers are actively developing methods to detect and mitigate it in proteomics pipelines. This doesn’t mean we should abandon powerful tools like Percolator, which have greatly advanced the field. However, it does mean we need to be aware of the potential pitfalls.

  • Be critical: Don’t treat FDR values as absolute truth, especially when pushing the boundaries of sensitivity.
  • Stay informed: Pay attention to developments in search engine algorithms and validation methods.
  • Consider controls: Orthogonal validation methods (like checking against known protein complex members or using synthetic peptides) can add layers of confidence.

Machine learning offers immense power for biological discovery, but like any powerful tool, it requires careful handling and a healthy dose of skepticism. Understanding potential issues like overfitting helps us use these tools wisely and ensure the robustness of our proteomic findings.