  • Boston, MA

rbr7/README.md

Hi, I'm Rohan 👋

Scalable Machine Learning · Foundation models · Explainable AI · Clinical NLP & Text Mining · Healthcare Data Quality

LLMs | OpenClaw | RL | Python | R | Scala | Spark

Senior Data Scientist and applied AI researcher at Massachusetts General Hospital (under MGB Inc.), turning messy, multi-modal, cross-site healthcare and multi-omics data into validated, model-ready assets, then building interpretable ML and LLM systems that domain experts can trust.

LinkedIn · Google Scholar · GitHub · Email


About Me 👋

Hi, thanks for stopping by! I'm a senior data scientist working at the intersection of machine learning, messy healthcare data, and translational AI. My work at Massachusetts General Hospital builds on prior applied ML roles at the Broad Institute, NIH (via ASRT Inc.), and Bristol Myers Squibb. It spans predictive modeling, clinical NLP, explainable AI, foundation-model development, knowledge graphs, and large-scale data pipelines that turn fragmented clinical, EHR, biomedical, and multi-omic data into reliable, interpretable insight.

I care about healthcare ML that works outside notebooks: robust data pipelines, auditable model decisions, leakage-safe evaluation, clear explanations, and software that scales from local prototypes to large clinical datasets. The goal is work that is rigorous, reproducible, equitable across diverse populations, and explainable enough for domain experts to trust.

🔭 Currently focused on: healthcare data-quality systems, entity resolution, clinical NLP, explainable predictive modeling, multimodal foundation models, and scalable ML pipelines.


Current Projects & Healthcare AI Work 🚀

  • AI for Alzheimer's : OpenAI Foundation initiative. Data Science Lead (MGH) within the OpenAI Foundation's AI for Alzheimer's initiative, a $100M+ program across six institutions, supporting the Mass General Brigham collaboration with the Institute for Protein Design (University of Washington). I build scalable, production-grade ML pipelines and active-learning loops that reason jointly across patient phenotypes, molecular biomarkers, and high-throughput screens to surface mechanistically interpretable, causally grounded drug targets rather than black-box predictions.

  • RxFM ( Microsoft ) : Multimodal Foundation Model for ALS Therapeutic Science. Building RxFM at MGH (home development site) on Microsoft's Discovery agentic-AI platform (Azure). RxFM is a multimodal therapeutic foundation model that integrates multi-omic, single-cell, metabolomic, clinical, and drug-perturbation data into a unified framework for ALS drug discovery and repurposing. By framing the disease's heterogeneous pathophysiology as a network-perturbation problem, the model is designed to identify interventions that reverse complex disease signatures and to be fine-tuned for zero-shot discovery in diseases with little known biology.

  • DRIAD-FL ( US, UK, Israel ) : Federated Learning across fragmented healthcare data. Contributing data-science and federated-learning methods development for a six-site, cross-country effort (UK, US, and Israel / Clalit): building privacy-preserving FL methods for target-trial emulation, scaling across 100M+ healthcare records, and developing reusable pipelines for learning across fragmented, multi-source healthcare data without centralizing it.

    Directly relevant to healthcare data quality at scale: privacy-preserving learning, cross-site harmonization, and reusable pipelines over high-volume, multi-source records.

  • Interpretable Multi-modal Patient Stratification ( Google DeepMind collab ) : Built an interpretable, multimodal deep-learning pipeline that fuses high-dimensional omics, rare pathogenic variants, and longitudinal clinical features to resolve a clinically heterogeneous population into molecularly defined subgroups. A variational autoencoder compresses these noisy signals into a shared latent space; explainability methods (SHAP, latent-space traversal) tie each latent dimension to pathway dysregulation and to outcomes such as functional decline and survival, surfacing distinct molecular subtypes (oxidative/proteotoxic stress, TDP-43/transposable-element, and neuroinflammatory). The fused model improved subgroup separation and outcome prediction over single-omic baselines (94.1% AUROC for the fused model vs. single-modality baselines), yielding transparent, reproducible strata for clinical decision support and trial enrichment. Built in Python (PyTorch, scikit-learn) as a modular, production-style pipeline over large clinical-omics cohorts (Northeast US, MGB and MA, NYGC, and others), with mentorship and resources from members of the Google DeepMind team (Zurich & London).
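To make the "learning without centralizing" idea from DRIAD-FL concrete, here is a minimal sketch of sample-size-weighted parameter averaging (FedAvg-style). It is an illustration under assumptions, not the project's actual method: the function name `federated_average`, the site sizes, and the weight vectors are all hypothetical, and a privacy-preserving deployment would likely add secure aggregation or differential privacy on top of this step.

```python
from typing import List, Sequence, Tuple


def federated_average(site_updates: Sequence[Tuple[int, Sequence[float]]]) -> List[float]:
    """Average model weights across sites, weighted by each site's record count.

    Each site sends only (n_records, weights); raw records never leave the site.
    Hypothetical helper for illustration only.
    """
    total = sum(n for n, _ in site_updates)
    if total <= 0:
        raise ValueError("at least one site must contribute records")
    dim = len(site_updates[0][1])
    averaged = [0.0] * dim
    for n_records, weights in site_updates:
        if len(weights) != dim:
            raise ValueError("all sites must send weight vectors of equal length")
        for i, value in enumerate(weights):
            averaged[i] += (n_records / total) * value
    return averaged


# Two made-up sites with different cohort sizes.
updates = [(100, [1.0, 0.0]), (300, [0.0, 1.0])]
global_weights = federated_average(updates)
print(global_weights)
```

The larger site pulls the global weights toward its local solution, which is the intended behavior when sites differ widely in size.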


What I Build 🧩

I build ML systems for high-noise, high-stakes healthcare data:

| Healthcare ML problem | What I work on |
| --- | --- |
| Messy healthcare records | Data profiling, validation, schema-drift detection, missingness handling, normalization, and quality scoring |
| Entity resolution | Deduplication and cross-source linkage for patient / provider / member / drug master data |
| Clinical NLP & text mining | Entity extraction, ontology normalization, summarization, search, and weak-supervision workflows |
| Predictive modeling | Risk, outcome, disease-progression, anomaly, and data-quality models with leakage-safe validation |
| Explainable AI | InterpretML, SHAP / LIME / coefficient-level explanations, interpretable features, audit trails, and model cards |
| Scalable pipelines | Python, R, Scala/Spark, PySpark, SQL, Docker, cloud/HPC workflows, and production-minded ML engineering |
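
One common ingredient of the leakage-safe validation mentioned above is splitting by patient rather than by record, so the same patient never appears in both train and test. The sketch below shows one way to do that with a deterministic hash; the names `assign_split` and `split_records`, the salt, and the toy records are hypothetical and not taken from any of the repositories.

```python
import hashlib


def assign_split(patient_id: str, test_fraction: float = 0.2, salt: str = "demo-salt") -> str:
    """Deterministically map a patient to 'train' or 'test' by hashing the ID."""
    digest = hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_records(records, test_fraction: float = 0.2):
    """Route every record of a patient to the same side of the split."""
    train_rows, test_rows = [], []
    for rec in records:
        side = assign_split(rec["patient_id"], test_fraction)
        (test_rows if side == "test" else train_rows).append(rec)
    return train_rows, test_rows


# Toy longitudinal data: several visits per patient.
records = [
    {"patient_id": f"P{i:03d}", "visit": v}
    for i in range(50)
    for v in range(3)
]
train_rows, test_rows = split_records(records)
train_ids = {r["patient_id"] for r in train_rows}
test_ids = {r["patient_id"] for r in test_rows}
print(len(train_ids), len(test_ids), train_ids.isdisjoint(test_ids))
```

Hashing keeps the assignment stable across reruns and across new data pulls, which a random shuffle does not guarantee.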

Featured Projects 🔬

DrugMesh : Scala/Spark Entity Resolution for Biomedical Master Data

A Spark-scale, explainable data-quality & entity-resolution engine that reconciles drug records across 10+ public biomedical databases.

DrugMesh ingests messy public drug sources, resolves which records refer to the same real-world entity despite missing identifiers and naming variation, flags inconsistent/bad records with confidence scores, and emits an auditable reason for every decision.

Why it matters: this is the same master-data problem as cleaning provider directories, payer rosters, formulary and member records, and clinical reference tables.

  • Scala + Apache Spark pipelines (typed Dataset DAG, broadcast joins, Adaptive Query Execution) for distributed entity resolution
  • Probabilistic matching (blocking → string-similarity features → a Fellegi-Sunter matcher whose per-field log-odds are the explanation) plus an MLlib GBT trained on Snorkel weak-supervision labels
  • Distributed anomaly detection (Isolation Forest over data-quality signals), biomedical NER (Spark NLP), BioBERT embeddings, and Elasticsearch search
  • Explainable match scoring with per-field evidence + SHAP/LIME, with MUnit/ScalaCheck tests and a GitHub Actions CI matrix (Spark 3.5 & 4.0)
  • A Scala/Spark re-implementation and extension of an open-source drug-mapping toolkit (attributed in the repo).

Scala · Apache Spark · Entity Resolution · Data Quality · Biomedical NLP · Elasticsearch · Explainability
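
The idea that a Fellegi-Sunter matcher's per-field log-odds double as the explanation can be sketched in a few lines. This is a Python illustration rather than DrugMesh's Scala code, and the field names and m/u probabilities are made-up assumptions; treating a missing field as zero evidence is also an assumption of the sketch.

```python
import math

# Hypothetical parameters: m = P(field agrees | true match),
# u = P(field agrees | non-match). Real values would be estimated from data.
FIELD_PARAMS = {
    "name": {"m": 0.95, "u": 0.01},
    "cas_number": {"m": 0.99, "u": 0.001},
    "manufacturer": {"m": 0.80, "u": 0.10},
}


def field_weight(field: str, agrees):
    """Log2 likelihood ratio for one field; None means missing (no evidence)."""
    if agrees is None:
        return 0.0
    m, u = FIELD_PARAMS[field]["m"], FIELD_PARAMS[field]["u"]
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def score_pair(agreement: dict):
    """Return the total match weight plus the per-field evidence behind it."""
    evidence = {field: field_weight(field, a) for field, a in agreement.items()}
    return sum(evidence.values()), evidence


total, evidence = score_pair({"name": True, "cas_number": None, "manufacturer": False})
for field, weight in evidence.items():
    print(f"{field:>12}: {weight:+.2f}")
print(f"{'total':>12}: {total:+.2f}")
```

Because the total is a plain sum, each field's signed contribution can be shown to a reviewer as the reason a pair was linked or rejected.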


ClinClaw : Healthcare Data Quality + Explainable Predictive Modeling

A medical-AI harness for healthcare data quality, clinical NLP, and interpretable ML workflows.

ClinClaw wraps ML/LLM components in a reusable execution harness (validation checkpoints, structured context, result validation, rollback, failure recovery). Its data-quality toolkit is built for records that are incomplete, inconsistent, duplicated, or text-heavy.

  • Record profiling across completeness, validity (NPI Luhn, ICD-10, ICD-10-CM, OMOP, ZIP, date, state), uniqueness, and consistency, with record-level quality scoring and remediation recommendations
  • Multivariate anomaly detection (IQR + scikit-learn Isolation Forest)
  • Clinical entity extraction with synonym→canonical ontology normalization (spaCy/scispaCy)
  • Interpretable from-scratch L2-regularized logistic regression (coefficients = feature importances; reports AUC/F1/precision/recall)
  • A/B testing utilities and a PySpark data-quality pipeline (load → standardize → dedup → validity/anomaly flags → write) for record-scale claims data, with a runnable offline demo

Python · PySpark · scikit-learn · Clinical NLP · Anomaly Detection · A/B Testing · Explainable ML
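
As an example of the validity checks listed above, an NPI Luhn check can be written with the standard library alone. This is a sketch, not ClinClaw's code: the function names are hypothetical. It relies on the published NPI rule that the check digit is a Luhn digit computed over the prefix `80840` followed by the identifier.

```python
def luhn_valid(digits: str) -> bool:
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9  # same as summing the two digits
        total += value
    return total % 10 == 0


def npi_valid(npi: str) -> bool:
    """Validate a 10-digit NPI by running Luhn over '80840' + NPI."""
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    return luhn_valid("80840" + npi)


for candidate in ["1234567893", "1234567890", "12345", "12345678AB"]:
    print(candidate, npi_valid(candidate))
```

A check like this catches typos and transposed digits, but it does not confirm that the NPI is actually assigned to a provider.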


MedClawMini : Healthcare Data-Science Skill Library

A curated library of clinical-AI and healthcare data-science workflows spanning data quality, NLP, scalable ML, explainability, drug safety, and governance — codified as auditable, reusable skill modules.

Highlighted workflows: healthcare data-quality profiling · patient/provider/member entity resolution · claims anomaly detection · ICD/CPT/SNOMED/RxNorm/LOINC/UMLS mapping · clinical NER & summarization · ELK/OpenSearch clinical search · Snorkel weak supervision · Spark/Scala healthcare ETL · explainable ML & subgroup audits · A/B testing & model validation.

Healthcare Data Quality · Clinical NLP · Snorkel · Spark/Scala · ELK/OpenSearch · XAI · Model Governance


MedSift : Local Healthcare ML Workbench

A privacy-first healthcare-ML workbench that structures health records, then runs data-science workflows over them: clinical NER and concept tagging, data-quality normalization, statistical + Isolation-Forest anomaly detection, SHAP-style explanations, weak-supervision labeling, BM25 + dense hybrid search, and experiment evaluation (calibration, A/B, significance tests).

Python · Clinical NLP · Data Quality · Anomaly Detection · SHAP · Weak Supervision · Hybrid Search
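
For the BM25 + dense hybrid search mentioned above, one simple way to merge the two result lists is reciprocal rank fusion. The description does not say which fusion method MedSift uses, so RRF is an assumption here, as are the function name, the constant `k=60`, and the document IDs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked lists of document IDs: score(d) = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a lexical (BM25) and a dense retriever.
bm25_ranking = ["note_3", "note_1", "note_7"]
dense_ranking = ["note_1", "note_9", "note_3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['note_1', 'note_3', 'note_9', 'note_7']
```

RRF uses only ranks, so it avoids having to put BM25 scores and cosine similarities on a common scale.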


More on GitHub

| Project | What it does | Focus |
| --- | --- | --- |
| PB-Dataset-Recommender_Engine | Biomedical-NLP recommender that searches & ranks datasets across NCBI, SRA & EBI | Bio-NLP · search/ranking |
| Human Protein Atlas : Kaggle 2021 | Weakly-supervised, multi-label classification of single-cell protein localization | weak supervision · multi-label DL |
| scRNASeq Workflow + Cell Predictions | Single-cell RNA-seq analysis on the Geneformer transformer foundation model | transformers · embeddings |
| scMultiOmics Integration Tool | Reproducible pipeline integrating single-cell ATAC-seq + RNA-seq | multi-omics · pipeline engineering |
| corral | Terminal dashboard that corrals tmux sessions & Slurm jobs across SSH/compute nodes | software engineering · MLOps/HPC |
| RxFM | Foundation model | MLOps/HPC |
| CLIO | Clinical AI model | Clinical AI/HPC/Healthcare |
| PRISM | HGT model | LLMs/HPC |

Experience 💼

  • Massachusetts General Hospital : Senior Data Scientist · Boston, MA
  • Broad Institute : Data Scientist II · Cambridge, MA
  • NIMHD (NIH) via ASRT Inc. : Bioinformatics Scientist · Atlanta, GA
  • Bristol Myers Squibb : Applied ML Intern (Applied Bioinformatics) · on-site
  • Georgia Tech, Computational Genomics Lab : Graduate Research Assistant · Atlanta, GA
  • GATK Team, Broad : Open-Source Developer · remote
  • Quadrical.ai : Software Developer & Data Scientist · Gurugram, India
  • Elucidata : Data Science & Analytics · New Delhi, India

Technical Toolbox 🧰

Languages

Python · R · Scala · SQL · C++ · Bash

Machine Learning & Deep Learning

PyTorch · TensorFlow · scikit-learn · XGBoost · Hugging Face

NLP & Text Mining

spaCy · Snorkel · Elasticsearch

Explainable AI

SHAP · Captum · LIME

Big Data, Pipelines & MLOps

Apache Spark · Hadoop · Docker · Kubernetes · AWS · GCP · Nextflow · Git

Data Systems: PostgreSQL · MongoDB · BigQuery · ETL · schema design · data validation

Biomedical Data & Cohorts: EHR-linked cohorts · Kaiser Permanente, RPDR, CPRD, Clalit · UK Biobank · All of Us · MGB Biobank · CELLxGENE · GTEx · gnomAD · ClinVar · TCGA


Service & Community 🤝

  • Reviewer, Machine Learning for Health (ML4H) and Conference on Health, Inference, and Learning (CHIL)
  • Poster presentations at ML4H and CHIL: CLIO (Continuous-time clinical LLM-Informed Ordinal imputation, 2025) and PRISM (Phenotype-Resolved Inference for Spectrum-wide Matching, 2026)

Selected Publications 📄

A few representative papers; the full list is on Google Scholar.


How I Work 🧭

I build systems where the model is only one part of the product. Good healthcare ML needs clean, validated, documented data · explicit assumptions and leakage checks · reproducible experiments · interpretable outputs for clinical and business stakeholders · failure handling and monitoring · domain-expert feedback loops · and production-quality code a team can maintain. That's the kind of healthcare AI I build.


Open to conversations on healthcare data quality, explainable ML, clinical NLP, scalable ML systems, and foundation models for human health.

Pinned

  1. ClinClaw Public

    Medical AI agent orchestration framework built on Harness Theory with pluggable LLMs + MCP tools for clinical diagnosis, drug discovery, and health management.

    Python

  2. DrugMesh Public

    Explainable, Spark-scale data-quality engine that reconciles drug records across 10+ public databases for entity resolution, anomaly detection & biomedical NER in Scala.

    Scala

  3. ANNEALER Public

    Continual, training-free prompt optimization for clinical LLMs which re-anneals system prompts with textual gradients as new medical evidence lands, so generalist models stay current without fine-t…

    Python

  4. PB-Dataset-Recommender_Engine Public

    Algorithms for NCBI, SRA, EBI datasets recommendation and how to get around with comparing your own Datasets. Recommender systems | Bio-NLP

    Jupyter Notebook 2 1

  5. MedClawMini Public

    A focused, production-minded library of 197 clinical-AI and healthcare data-science skills for the OpenClaw agent platform featuring data quality, clinical NLP, big-data ML, explainable AI, drug sa…

    Jupyter Notebook

  6. MedSift Public

    Privacy-first health AI assistant for individuals and families, with a local ML workbench for clinical text mining, interpretable risk modeling, and anomaly detection.

    Python