Automated Proteomics Pipeline for Neurodegeneration Biomarker Discovery

Automated R pipeline from MaxQuant output to differential expression and ML-based biomarker discovery. 70% reduction in data cleaning time; deployed at LCSB (University of Luxembourg).

Proteomics R MaxQuant Biomarker Discovery Neurodegeneration Reproducibility limma

Stack: R · DEP · limma · tidyverse · ggplot2 · MaxQuant · Python · Git

Repository: github.com/SLopezBegines/Proteomics

Problem

Large-scale proteomics datasets from neurodegeneration studies required extensive manual curation before statistical analysis, creating bottlenecks and reproducibility risks. Processing a single dataset could take days of repetitive cleaning and validation work.

Solution

Built an automated R pipeline for MaxQuant label-free quantification output — covering data cleaning, normalization (VSN), differential expression (DEP/limma), and visualization. Integrated cross-validation frameworks for ML-based biomarker discovery. Each analysis is configured through a single RMarkdown file; modular scripts are reused without modification across datasets.

Result

70% reduction in data cleaning time. Pipeline deployed at LCSB (University of Luxembourg) across multiple neurodegeneration datasets. Contributed to peer-reviewed publications in high-impact journals.

Overview

A modular and reproducible R pipeline for analyzing label-free quantitative (LFQ) proteomics data from Orbitrap and Q-Exactive mass spectrometers. The pipeline processes MaxQuant output through a complete analytical workflow: data cleaning, mixed imputation, differential expression analysis, and multi-layered functional enrichment.

Each analysis is configured through a single RMarkdown file that defines organism parameters and experimental design, then calls reusable modular scripts. This architecture allows rapid deployment on new datasets without code modification.
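
As a rough illustration of this layout, the setup chunk of such an RMarkdown file might look like the sketch below. Every name in it (`params_analysis`, `validate_params()`, the contrast labels, the script path) is hypothetical and not taken from the repository; the point is only to show parameters being declared once and checked before any modular script runs.

```r
# Hypothetical per-analysis configuration, as it might appear in the
# setup chunk of the RMarkdown file. Names and values are illustrative.
params_analysis <- list(
  organism    = "human",               # one of: human, mouse, zebrafish
  input_file  = "ProteinGroups.txt",   # MaxQuant output
  conditions  = c("CTRL", "WT", "KO"),
  contrasts   = c("WT_vs_CTRL", "KO_vs_CTRL", "KO_vs_WT"),
  fraction_NA = 0.5                    # assumed meaning: max NA fraction per condition
)

# Fail early if the configuration is incomplete or inconsistent.
validate_params <- function(p) {
  supported <- c("human", "mouse", "zebrafish")
  stopifnot(
    is.character(p$organism), length(p$organism) == 1,
    p$organism %in% supported,
    length(p$conditions) >= 2,
    p$fraction_NA >= 0, p$fraction_NA <= 1
  )
  # Every contrast must reference two configured conditions.
  parts <- strsplit(p$contrasts, "_vs_", fixed = TRUE)
  ok <- vapply(parts,
               function(x) length(x) == 2 && all(x %in% p$conditions),
               logical(1))
  if (!all(ok)) {
    stop("Unknown condition in contrasts: ",
         paste(p$contrasts[!ok], collapse = ", "))
  }
  invisible(p)
}

validate_params(params_analysis)

# The modular scripts would then be sourced unchanged, for example:
# source("scripts/01_preprocessing.R")   # hypothetical path
```

Keeping all dataset-specific values in one list means a new dataset only needs a new configuration, which is the property the architecture above relies on.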

Technical Approach

Proteomics experiments generate complex datasets with systematic missing values, batch effects, and thousands of protein measurements across conditions. Standard tools handle individual steps but lack integration. This pipeline addresses:

Missing value heterogeneity — Implements a mixed imputation strategy that distinguishes MNAR (below detection limit) from MAR (randomly absent) proteins, applying appropriate methods for each type.
Reproducibility — Modular scripts with centralized parameter control ensure consistent processing across datasets.
Multi-organism support — Configurable for human, mouse, and zebrafish without pipeline modification.
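
The mixed imputation idea can be sketched in base R as follows. This is not the pipeline's implementation: the function `impute_mixed()` and its `shift` and `scale` arguments are hypothetical, the rule used to flag MNAR proteins (completely missing in at least one condition) is an assumption, a down-shifted normal draw stands in for the zero/MinProb/QRILC options, and a within-condition mean stands in for kNN.

```r
# Sketch of a mixed imputation on a log2 intensity matrix
# (rows = proteins, columns = samples). Illustrative only.
set.seed(1)

impute_mixed <- function(x, condition, shift = 1.8, scale = 0.3) {
  stopifnot(is.matrix(x), length(condition) == ncol(x))
  condition <- factor(condition)

  # Assumed rule: a protein is MNAR if it has no observed value in at
  # least one condition; all other missing values are treated as MAR.
  is_mnar <- rep(FALSE, nrow(x))
  for (cl in levels(condition)) {
    observed <- rowSums(!is.na(x[, condition == cl, drop = FALSE]))
    is_mnar <- is_mnar | observed == 0
  }

  out <- x

  # MNAR proteins: draw from a narrow normal shifted below the
  # observed distribution of each sample (stand-in for MinProb/QRILC).
  for (j in seq_len(ncol(x))) {
    miss <- is.na(x[, j]) & is_mnar
    mu <- mean(x[, j], na.rm = TRUE)
    s  <- sd(x[, j], na.rm = TRUE)
    out[miss, j] <- rnorm(sum(miss), mean = mu - shift * s, sd = scale * s)
  }

  # MAR proteins: mean of the observed replicates in the same
  # condition (simple stand-in for kNN).
  for (cl in levels(condition)) {
    cols <- which(condition == cl)
    cond_mean <- rowMeans(x[, cols, drop = FALSE], na.rm = TRUE)
    for (j in cols) {
      miss <- is.na(x[, j]) & !is_mnar
      out[miss, j] <- cond_mean[miss]
    }
  }
  out
}

x <- rbind(
  P1 = c(25.1, 24.8, 25.3, 26.0, 25.7, 25.9),  # complete
  P2 = c(NA,   NA,   NA,   22.4, 22.9, 22.6),  # absent in condition A: MNAR
  P3 = c(23.5, NA,   23.9, 24.1, 24.4, 24.0),  # single gap: MAR
  P4 = c(21.0, 21.3, 20.9, 21.1, 21.4, 21.2)   # complete
)
colnames(x) <- c("A_1", "A_2", "A_3", "B_1", "B_2", "B_3")
condition <- c("A", "A", "A", "B", "B", "B")

imputed <- impute_mixed(x, condition)
round(imputed, 2)
```

In this toy matrix, P2 receives low-intensity draws in condition A, while the single gap in P3 is filled from its two observed replicates.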

Analytical Workflow

flowchart TD
    A["📥 MaxQuant output · ProteinGroups.txt / .xlsx"] --> B

    subgraph QC ["1 · QC & Preprocessing"]
        B["Load & standardise columns · Remove contaminants"]
        B --> C["Define experiment design · conditions · replicates · contrasts"]
        C --> D["Filter missing values · fraction_NA threshold per condition"]
        D --> E["VSN normalisation"]
        E --> F["Mixed imputation · MNAR → zero/MinProb/QRILC · MAR → kNN"]
    end

    subgraph DE ["2 · Differential Expression"]
        F --> G["limma · empirical Bayes · ~0 + condition · manual contrasts"]
        G --> H["Log2FC · p-value · BH-adjusted p · UP / DOWN / NO per comparison"]
    end

    subgraph VIZ ["3 · Visualisation"]
        H --> I["Volcano plots · Heatmaps · PCA · UpSet"]
    end

    subgraph ENRICH ["4 · Functional Enrichment"]
        H --> J["ORA — enrichGO · GSEA — gseGO · gseKEGG · pathview"]
        H --> K["STRING PPI networks · PANTHER · EnrichR"]
    end

    subgraph SUMM ["5 · Summary"]
        I & J & K --> L["Statistics tables · DE counts · effect sizes"]
    end

    style QC fill:#1e3a5f,color:#fff,stroke:#1a7a7a
    style DE fill:#1e3a1e,color:#fff,stroke:#22c55e
    style VIZ fill:#3a1e1e,color:#fff,stroke:#ef4444
    style ENRICH fill:#3a2a1e,color:#fff,stroke:#f59e0b
    style SUMM fill:#2a1e3a,color:#fff,stroke:#8b5cf6
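
The last step of the differential expression stage in the flowchart (BH adjustment followed by an UP / DOWN / NO call per comparison) can be sketched in base R. The helper `classify_de()` is hypothetical, and the thresholds (`alpha = 0.05`, `lfc = 1`) are assumptions, since the text above does not state the cutoffs the pipeline uses.

```r
# Classify proteins as UP / DOWN / NO for a single comparison.
# Thresholds are assumed defaults, not values taken from the pipeline.
classify_de <- function(log2fc, p_value, alpha = 0.05, lfc = 1) {
  stopifnot(length(log2fc) == length(p_value))
  p_adj <- p.adjust(p_value, method = "BH")  # Benjamini-Hochberg
  call <- rep("NO", length(log2fc))
  call[which(p_adj < alpha & log2fc >=  lfc)] <- "UP"
  call[which(p_adj < alpha & log2fc <= -lfc)] <- "DOWN"
  data.frame(
    log2FC     = log2fc,
    p_value    = p_value,
    p_adj      = p_adj,
    regulation = factor(call, levels = c("UP", "DOWN", "NO"))
  )
}

res <- classify_de(
  log2fc  = c(2.3, -1.8, 0.4, 1.5, -0.2),
  p_value = c(0.0004, 0.003, 0.60, 0.20, 0.85)
)
res
table(res$regulation)
```

Note that the fourth protein has a large fold change but is still called NO, because its adjusted p-value does not pass the significance threshold.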

Key Technical Details

Differential expression via DEP::analyze_dep() wrapping limma with flexible manual contrasts
Configurable imputation: fraction_NA, factor_SD_impute, and MNAR method selection
Automated directory structure and sequential figure numbering for reproducible outputs
Dual export (TIFF raster + PDF vector) for all figures
Gene identifier mapping through biomaRt and AnnotationDbi (UNIPROT → ENSEMBL/ENTREZ)
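
Sequential figure numbering with dual export could be organised roughly as below. This is a minimal sketch using base `grDevices`; the function names and the file-name pattern are hypothetical, and the actual pipeline may well rely on `ggplot2::ggsave()` or a different naming scheme.

```r
# Sketch of sequential figure numbering with TIFF + PDF export.
# Function names and the file-name pattern are hypothetical.
make_figure_counter <- function(out_dir, prefix = "Figure") {
  n <- 0L
  function(label) {
    n <<- n + 1L  # counter lives in the enclosing environment
    stem <- sprintf("%s_%02d_%s", prefix, n,
                    gsub("[^A-Za-z0-9]+", "_", label))
    list(tiff = file.path(out_dir, paste0(stem, ".tiff")),
         pdf  = file.path(out_dir, paste0(stem, ".pdf")))
  }
}

# Draw the same plot twice: once to a vector device, once to a raster one.
save_figure_dual <- function(plot_fun, paths, width = 7, height = 5, res = 300) {
  dir.create(dirname(paths$pdf), recursive = TRUE, showWarnings = FALSE)
  grDevices::pdf(paths$pdf, width = width, height = height)
  plot_fun()
  grDevices::dev.off()
  grDevices::tiff(paths$tiff, width = width, height = height,
                  units = "in", res = res, compression = "lzw")
  plot_fun()
  grDevices::dev.off()
  invisible(paths)
}

next_figure <- make_figure_counter(file.path(tempdir(), "figures"))
p1 <- next_figure("PCA mixed imputation")
p2 <- next_figure("Volcano KO vs WT")
basename(p1$pdf)
basename(p2$tiff)

# Example call (requires a TIFF-capable graphics device):
# save_figure_dual(function() plot(1:10), p1)
```

Because the counter is held in a closure, figures are numbered in the order they are produced, so re-running the same analysis yields the same file names.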

Example Application — CLN3 Lysosomal Interactome

The repository includes a complete analysis of the CLN3 lysosomal interactome in human cell lines (ProteomeXchange PXD031582), comparing CTRL vs WT vs KO conditions across 12 samples and 3 pairwise contrasts.

Calcagnì et al. Loss of the batten disease protein CLN3 leads to mis-trafficking of M6PR and defective autophagic-lysosomal reformation. Nat Commun 14, 3911 (2023). doi:10.1038/s41467-023-39643-7
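
For a three-condition design like this one, the `~0 + condition` model and its pairwise contrasts can be written out in base R as below. The sketch assumes a balanced layout of 4 replicates per condition (the text only states 12 samples in total), and the sample names and contrast directions are illustrative rather than taken from the repository.

```r
# Assumed balanced layout: 12 samples = 3 conditions x 4 replicates.
condition <- factor(rep(c("CTRL", "WT", "KO"), each = 4),
                    levels = c("CTRL", "WT", "KO"))
samples <- paste(condition, rep(1:4, times = 3), sep = "_")

# Means model without intercept: one coefficient per condition.
design <- model.matrix(~ 0 + condition)
colnames(design) <- levels(condition)
rownames(design) <- samples

# Three pairwise contrasts, written as weights on the condition means.
pairs <- combn(levels(condition), 2)
contrast_matrix <- apply(pairs, 2, function(p) {
  w <- setNames(numeric(nlevels(condition)), levels(condition))
  w[p[2]] <- 1    # numerator of the comparison
  w[p[1]] <- -1   # reference of the comparison
  w
})
colnames(contrast_matrix) <- paste(pairs[2, ], pairs[1, ], sep = "_vs_")

design
contrast_matrix
```

Matrices of this shape are what limma's `lmFit()`, `contrasts.fit()`, and `eBayes()` take as input; in limma itself the contrast matrix is usually built with `makeContrasts()` rather than by hand.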

Output Gallery

Quality Control & Normalisation

QC overview: protein identification and coverage

SD before vs after imputation: 6 methods compared

Intensity distribution: imputation method comparison

Dimensionality Reduction & Differential Expression

PCA: mixed imputation

Volcano plot: KO vs WT

Clustering & Functional Enrichment

Heatmap: significant proteins across all comparisons

GO lollipop plot: KO vs WT upregulated terms