Automated Proteomics Pipeline for Neurodegeneration Biomarker Discovery

Automated R pipeline from MaxQuant output to differential expression and ML-based biomarker discovery. Reduced data cleaning time by 70%; deployed at the LCSB (University of Luxembourg).

Tags: Proteomics · R · MaxQuant · Biomarker Discovery · Neurodegeneration · Reproducibility · limma

Stack: R · DEP · limma · tidyverse · ggplot2 · MaxQuant · Python · Git
Repository: github.com/SLopezBegines/Proteomics


Problem

Large-scale proteomics datasets from neurodegeneration studies required extensive manual curation before statistical analysis, creating bottlenecks and reproducibility risks. Processing a single dataset could take days of repetitive cleaning and validation work.

Solution

Built an automated R pipeline for MaxQuant label-free quantification output — covering data cleaning, normalization (VSN), differential expression (DEP/limma), and visualization. Integrated cross-validation frameworks for ML-based biomarker discovery. Each analysis is configured through a single RMarkdown file; modular scripts are reused without modification across datasets.

Result

70% reduction in data cleaning time. Pipeline deployed at LCSB (University of Luxembourg) across multiple neurodegeneration datasets. Contributed to peer-reviewed publications in high-impact journals.


Overview

A modular, reproducible R pipeline for analyzing label-free quantitative (LFQ) proteomics data acquired on Orbitrap instruments, including the Q Exactive series. The pipeline takes MaxQuant output through a complete analytical workflow: data cleaning, VSN normalization, mixed imputation, differential expression analysis, visualization, and functional enrichment (ORA, GSEA, and protein interaction networks).

Each analysis is configured through a single RMarkdown file that defines organism parameters and experimental design, then calls reusable modular scripts. This architecture allows rapid deployment on new datasets without code modification.
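
To make the idea of a single configuration point concrete, here is a minimal sketch in Python (the pipeline itself does this in an RMarkdown file, so this is only an illustration of the structure). The class `AnalysisConfig` and its field names are hypothetical, apart from `fraction_NA`, which is one of the pipeline's parameters. The default value shown and the four replicates per condition are assumptions, chosen to match the 12-sample CTRL/WT/KO example.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List


@dataclass
class AnalysisConfig:
    """Hypothetical stand-in for the parameters set in the per-analysis file."""
    organism: str                 # "human", "mouse" or "zebrafish"
    conditions: List[str]         # experimental groups, reference first
    replicates: int               # replicates per condition (assumed equal)
    fraction_NA: float = 0.5      # illustrative default, not the pipeline's
    contrasts: List[str] = field(default_factory=list)

    def __post_init__(self):
        if self.organism not in {"human", "mouse", "zebrafish"}:
            raise ValueError(f"unsupported organism: {self.organism}")
        if not self.contrasts:
            # Default to every pairwise comparison, written "B_vs_A".
            self.contrasts = [
                f"{b}_vs_{a}" for a, b in combinations(self.conditions, 2)
            ]

    def sample_names(self):
        """One label per sample, e.g. CTRL_1 ... KO_4."""
        return [
            f"{c}_{r}"
            for c in self.conditions
            for r in range(1, self.replicates + 1)
        ]


config = AnalysisConfig(
    organism="human", conditions=["CTRL", "WT", "KO"], replicates=4
)
print(config.contrasts)
print(len(config.sample_names()), "samples")
```

Keeping organism, design, and contrasts in one object means downstream steps only read from it, which is what allows the same scripts to run unchanged on a new dataset.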

Technical Approach

Proteomics experiments generate complex datasets with systematic missing values, batch effects, and thousands of protein measurements across conditions. Standard tools handle individual steps but lack integration. This pipeline addresses:

  • Missing value heterogeneity — Implements a mixed imputation strategy that distinguishes MNAR (below detection limit) from MAR (randomly absent) proteins, applying appropriate methods for each type.
  • Reproducibility — Modular scripts with centralized parameter control ensure consistent processing across datasets.
  • Multi-organism support — Configurable for human, mouse, and zebrafish without pipeline modification.
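
The MNAR/MAR split can be sketched as follows. This is a Python illustration under a stated assumption, not the pipeline's R code: a protein is treated as MNAR when it is missing in every replicate of at least one condition, and as MAR when its gaps are scattered. The function name, protein IDs, and intensity values are made up for the example.

```python
def classify_missingness(intensities, design):
    """Label each protein "MNAR", "MAR" or "complete".

    intensities: dict of protein -> list of values (None = missing)
    design:      list of condition labels, in the same sample order
    Assumed rule: missing in all replicates of any condition -> MNAR;
    any other protein with gaps -> MAR.
    """
    conditions = sorted(set(design))
    labels = {}
    for protein, values in intensities.items():
        if all(v is not None for v in values):
            labels[protein] = "complete"
            continue
        absent_in_a_condition = any(
            all(v is None for v, c in zip(values, design) if c == cond)
            for cond in conditions
        )
        labels[protein] = "MNAR" if absent_in_a_condition else "MAR"
    return labels


# Toy log2 intensities for three hypothetical proteins.
design = ["WT", "WT", "WT", "KO", "KO", "KO"]
intensities = {
    "P1": [24.1, 23.8, 24.3, 25.0, 24.7, 24.9],  # fully observed
    "P2": [22.5, 22.9, 22.7, None, None, None],  # absent in all KO replicates
    "P3": [26.0, None, 26.2, 25.8, 26.1, 25.9],  # one scattered gap
}
labels = classify_missingness(intensities, design)
print(labels)
```

With labels like these, the MNAR group would go to a left-censored method (zero, MinProb, or QRILC) and the MAR group to kNN, as in the workflow.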

Analytical Workflow

flowchart TD
    A["📥 MaxQuant output · proteinGroups.txt / .xlsx"] --> B

    subgraph QC ["1 · QC & Preprocessing"]
        B["Load & standardise columns · Remove contaminants"]
        B --> C["Define experiment design · conditions · replicates · contrasts"]
        C --> D["Filter missing values · fraction_NA threshold per condition"]
        D --> E["VSN normalisation"]
        E --> F["Mixed imputation · MNAR → zero/MinProb/QRILC · MAR → kNN"]
    end

    subgraph DE ["2 · Differential Expression"]
        F --> G["limma · empirical Bayes · ~0 + condition · manual contrasts"]
        G --> H["Log2FC · p-value · BH-adjusted p · UP / DOWN / NO per comparison"]
    end

    subgraph VIZ ["3 · Visualisation"]
        H --> I["Volcano plots · Heatmaps · PCA · UpSet"]
    end

    subgraph ENRICH ["4 · Functional Enrichment"]
        H --> J["ORA — enrichGO · GSEA — gseGO · gseKEGG · pathview"]
        H --> K["STRING PPI networks · PANTHER · EnrichR"]
    end

    subgraph SUMM ["5 · Summary"]
        I & J & K --> L["Statistics tables · DE counts · effect sizes"]
    end

    style QC fill:#1e3a5f,color:#fff,stroke:#1a7a7a
    style DE fill:#1e3a1e,color:#fff,stroke:#22c55e
    style VIZ fill:#3a1e1e,color:#fff,stroke:#ef4444
    style ENRICH fill:#3a2a1e,color:#fff,stroke:#f59e0b
    style SUMM fill:#2a1e3a,color:#fff,stroke:#8b5cf6
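
The last step of the differential expression stage, BH adjustment followed by an UP / DOWN / NO call, can be sketched in a few lines of Python. This is an illustration of the logic only; in the pipeline it is handled by limma and DEP. The cut-offs `alpha=0.05` and `lfc=1.0`, and the toy p-values and fold changes, are assumptions made for the example.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_direction(log2fc, adj_p, alpha=0.05, lfc=1.0):
    """UP / DOWN / NO label; alpha and lfc are illustrative cut-offs."""
    if adj_p < alpha and log2fc >= lfc:
        return "UP"
    if adj_p < alpha and log2fc <= -lfc:
        return "DOWN"
    return "NO"


# Toy results for four hypothetical proteins in one contrast.
pvals = [0.001, 0.01, 0.03, 0.5]
log2fcs = [2.1, -1.4, 0.3, -0.2]
adj = bh_adjust(pvals)
calls = [call_direction(fc, p) for fc, p in zip(log2fcs, adj)]
print(calls)
```

Note that the third protein is significant after adjustment but stays "NO" because its fold change is below the threshold; both criteria have to hold.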

Key Technical Details

  • Differential expression via DEP::analyze_dep() wrapping limma with flexible manual contrasts
  • Configurable imputation: fraction_NA, factor_SD_impute, and MNAR method selection
  • Automated directory structure and sequential figure numbering for reproducible outputs
  • Dual export (TIFF raster + PDF vector) for all figures
  • Gene identifier mapping through biomaRt and AnnotationDbi (UNIPROT → ENSEMBL/ENTREZ)
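
As an illustration of how a `fraction_NA` threshold per condition might act, the Python sketch below keeps a protein when at least one condition has a missing fraction at or below the threshold. That reading of the parameter is an assumption, as are the function name and the toy values; the pipeline's own R implementation may differ.

```python
def passes_na_filter(values, design, fraction_na):
    """True if any condition has a missing fraction <= fraction_na.

    values: list of intensities (None = missing), one per sample
    design: list of condition labels, in the same sample order
    Assumed semantics for fraction_NA, for illustration only.
    """
    for cond in set(design):
        group = [v for v, c in zip(values, design) if c == cond]
        missing = sum(v is None for v in group) / len(group)
        if missing <= fraction_na:
            return True
    return False


design = ["WT", "WT", "WT", "KO", "KO", "KO"]
well_covered = [20.0, 21.0, None, None, None, None]   # WT: 1 of 3 missing
sparse = [20.0, None, None, None, None, 21.0]         # 2 of 3 missing in both

print(passes_na_filter(well_covered, design, fraction_na=0.34))
print(passes_na_filter(sparse, design, fraction_na=0.34))
```

Filtering per condition rather than across all samples retains proteins that are consistently detected in one group and absent in another, which are often the most interesting candidates.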

Example Application — CLN3 Lysosomal Interactome

The repository includes a complete analysis of the CLN3 lysosomal interactome in human cell lines (ProteomeXchange PXD031582), comparing CTRL vs WT vs KO conditions across 12 samples and 3 pairwise contrasts.

Calcagni’ et al. Loss of the batten disease protein CLN3 leads to mis-trafficking of M6PR and defective autophagic-lysosomal reformation. Nat Commun 14, 3911 (2023). doi:10.1038/s41467-023-39643-7


Quality Control & Normalisation

[Figure: QC overview, protein identification and coverage]
[Figure: VSN normalisation diagnostics]
[Figure: SD before vs after imputation, 6 methods compared]
[Figure: Intensity distribution, imputation method comparison]

Dimensionality Reduction & Differential Expression

[Figure: PCA, mixed imputation]
[Figure: Volcano plot, KO vs WT]

Clustering & Functional Enrichment

[Figure: Heatmap, significant proteins across all comparisons]
[Figure: GO lollipop plot, KO vs WT upregulated terms]