Scalable Machine Learning · Foundation models · Explainable AI · Clinical NLP & Text Mining · Healthcare Data Quality
LLMs | OpenClaw | RL | Python | R | Scala | Spark
Senior Data Scientist and applied AI researcher at Massachusetts General Hospital (under MGB Inc.), turning messy, multi-modal, cross-site healthcare and multi-omics data into validated, model-ready assets, then building interpretable ML and LLM systems that domain experts can trust.
Hi, thanks for stopping by! I'm a senior data scientist working at the intersection of machine learning, messy healthcare data, and translational AI. My work at Massachusetts General Hospital, building on prior applied ML roles at the Broad Institute, NIH (via ASRT Inc.), and Bristol Myers Squibb, spans predictive modeling, clinical NLP, explainable AI, foundation-model development, knowledge graphs, and large-scale data pipelines that turn fragmented clinical, EHR, biomedical, and multi-omic data into reliable, interpretable insight.
I care about healthcare ML that works outside notebooks: robust data pipelines, auditable model decisions, leakage-safe evaluation, clear explanations, and software that scales from local prototypes to large clinical datasets. The goal is work that is rigorous, reproducible, equitable across diverse populations, and explainable enough for domain experts to trust.
🔭 Currently focused on: healthcare data-quality systems, entity resolution, clinical NLP, explainable predictive modeling, multimodal foundation models, and scalable ML pipelines.
- AI for Alzheimer's : OpenAI Foundation initiative. Data Science Lead (MGH) within the OpenAI Foundation's AI for Alzheimer's initiative, a $100M+ program across six institutions, supporting the Mass General Brigham collaboration with the Institute for Protein Design (University of Washington). I build scalable, production-grade ML pipelines and active-learning loops that reason jointly across patient phenotypes, molecular biomarkers, and high-throughput screens to surface mechanistically interpretable, causally grounded drug targets rather than black-box predictions.
- RxFM ( Microsoft ) : Multimodal Foundation Model for ALS Therapeutic Science. Building RxFM at MGH (home development site) on Microsoft's Discovery agentic-AI platform (Azure). RxFM is a multimodal therapeutic foundation model that integrates multi-omic, single-cell, metabolomic, clinical, and drug-perturbation data into a unified framework for ALS drug discovery and repurposing. By framing the disease's heterogeneous pathophysiology as a network-perturbation problem, the model is designed to identify interventions that reverse complex disease signatures, and to be fine-tuned for zero-shot discovery in diseases with little known biology.
- DRIAD-FL ( US, UK, Israel ) : Federated Learning across fragmented healthcare data. Contributing data-science and federated-learning methods development to a six-site, cross-country effort (UK, US, and Israel / Clalit): building privacy-preserving FL methods for target-trial emulation, scaling across 100M+ healthcare records, and developing reusable pipelines for learning across fragmented, multi-source healthcare data without centralizing it.
Directly relevant to healthcare data quality at scale: privacy-preserving learning, cross-site harmonization, and reusable pipelines over high-volume, multi-source records.
- Interpretable Multi-modal Patient Stratification ( Google DeepMind collab ) : Built an interpretable, multimodal deep-learning pipeline that fuses high-dimensional omics, rare pathogenic variants, and longitudinal clinical features to resolve a clinically heterogeneous population into molecularly defined subgroups. A variational autoencoder compresses these noisy signals into a shared latent space; explainability methods (SHAP, latent-space traversal) tie each latent dimension to pathway dysregulation and to outcomes such as functional decline and survival, surfacing distinct molecular subtypes (oxidative/proteotoxic stress, TDP-43/transposable-element, and neuroinflammatory). The fused model improved subgroup separation and outcome prediction over single-omic baselines (94.1% AUROC for the fused model), yielding transparent, reproducible strata for clinical decision support and trial enrichment. Built in Python (PyTorch, scikit-learn) as a modular, production-style pipeline over large clinical-omics cohorts (Northeast US, MGB and MA, NYGC, and others), with mentorship and resources from members of the Google DeepMind team (Zurich & London).
I build ML systems for high-noise, high-stakes healthcare data:
| Healthcare ML problem | What I work on |
|---|---|
| Messy healthcare records | Data profiling, validation, schema-drift detection, missingness handling, normalization, and quality scoring |
| Entity resolution | Deduplication and cross-source linkage for patient / provider / member / drug master data |
| Clinical NLP & text mining | Entity extraction, ontology normalization, summarization, search, and weak-supervision workflows |
| Predictive modeling | Risk, outcome, disease-progression, anomaly, and data-quality models with leakage-safe validation |
| Explainable AI | InterpretML, SHAP / LIME / coefficient-level explanations, interpretable features, audit trails, and model cards |
| Scalable pipelines | Python, R, Scala/Spark, PySpark, SQL, Docker, cloud/HPC workflows, and production-minded ML engineering |
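"Leakage-safe validation" in the table above can cover several practices. One common ingredient is splitting by patient rather than by record, so that no patient contributes rows to both the training and the test set. The snippet below is a minimal sketch of that idea in plain Python; the function name `group_split`, the `patient_id` field, and the hash-bucket approach are illustrative assumptions, not code taken from any project listed here.

```python
import hashlib

def group_split(records, group_key="patient_id", test_fraction=0.2):
    """Assign every record of a group to the same side of the split."""
    train, test = [], []
    for rec in records:
        # Hash the group id so the assignment is deterministic and does
        # not depend on record order or on which other groups are present.
        digest = hashlib.sha256(str(rec[group_key]).encode("utf-8")).hexdigest()
        bucket = (int(digest, 16) % 1000) / 1000.0
        (test if bucket < test_fraction else train).append(rec)
    return train, test

# Toy records (made up): 8 patients with 3 visits each.
records = [
    {"patient_id": pid, "visit": v}
    for pid in ("p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8")
    for v in range(3)
]
train, test = group_split(records, test_fraction=0.25)
train_ids = {r["patient_id"] for r in train}
test_ids = {r["patient_id"] for r in test}
print("patients in both splits:", train_ids & test_ids)
```

Because assignment depends only on the group id, the test fraction is approximate on small cohorts, but re-running the split on a refreshed extract keeps each patient on the same side.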
DrugMesh : Scala/Spark Entity Resolution for Biomedical Master Data
A Spark-scale, explainable data-quality & entity-resolution engine that reconciles drug records across 10+ public biomedical databases.
DrugMesh ingests messy public drug sources, resolves which records refer to the same real-world entity despite missing identifiers and naming variation, flags inconsistent/bad records with confidence scores, and emits an auditable reason for every decision.
Why it matters: this is the same master-data problem as cleaning provider directories, payer rosters, formulary and member records, and clinical reference tables.
- Scala + Apache Spark pipelines (typed Dataset DAG, broadcast joins, Adaptive Query Execution) for distributed entity resolution
- Probabilistic matching (blocking → string-similarity features → a Fellegi-Sunter matcher whose per-field log-odds are the explanation) plus an MLlib GBT trained on Snorkel weak-supervision labels
- Distributed anomaly detection (Isolation Forest over data-quality signals), biomedical NER (Spark NLP), BioBERT embeddings, and Elasticsearch search
- Explainable match scoring with per-field evidence + SHAP/LIME, with MUnit/ScalaCheck tests and a GitHub Actions CI matrix (Spark 3.5 & 4.0)
- A Scala/Spark re-implementation and extension of an open-source drug-mapping toolkit (attributed in the repo).
Scala · Apache Spark · Entity Resolution · Data Quality · Biomedical NLP · Elasticsearch · Explainability
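To illustrate how a Fellegi-Sunter matcher can double as its own explanation, here is a minimal sketch in Python (not the Scala code from DrugMesh). Each compared field contributes a log-likelihood-ratio weight, and the list of per-field weights is the evidence trail for the decision. The field names and the m/u probabilities are hypothetical values chosen for illustration only.

```python
import math

# Hypothetical per-field parameters (not taken from DrugMesh):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "cas_number": (0.90, 0.001),
    "formula": (0.85, 0.05),
}

def field_weight(m, u, agrees):
    """Log2 likelihood ratio contributed by one field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1.0 - m) / (1.0 - u))

def score_pair(agreement):
    """Return (total weight, per-field weights) for one candidate pair."""
    evidence = {
        field: field_weight(m, u, agreement[field])
        for field, (m, u) in FIELD_PARAMS.items()
    }
    return sum(evidence.values()), evidence

# One candidate pair: name and CAS number agree, formula disagrees.
total, evidence = score_pair({"name": True, "cas_number": True, "formula": False})
for field, weight in evidence.items():
    print(f"{field:12s} {weight:+.2f}")
print(f"{'total':12s} {total:+.2f}")
```

Agreement on a rarely coinciding field (low u) earns a large positive weight, and disagreement on a usually reliable field (high m) earns a negative one. Comparing the total against upper and lower thresholds gives the match / review / non-match decision.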
ClinClaw : Healthcare Data Quality + Explainable Predictive Modeling
A medical-AI harness for healthcare data quality, clinical NLP, and interpretable ML workflows.
ClinClaw wraps ML/LLM components in a reusable execution harness (validation checkpoints, structured context, result validation, rollback, failure recovery). Its data-quality toolkit is built for records that are incomplete, inconsistent, duplicated, or text-heavy.
- Record profiling across completeness, validity (NPI Luhn, ICD-10, ICD-10-CM, OMOP, ZIP, date, state), uniqueness, and consistency, with record-level quality scoring and remediation recommendations
- Multivariate anomaly detection (IQR + scikit-learn Isolation Forest)
- Clinical entity extraction with synonym→canonical ontology normalization (spaCy/scispaCy)
- Interpretable from-scratch L2-regularized logistic regression (coefficients = feature importances; reports AUC/F1/precision/recall)
- A/B testing utilities and a PySpark data-quality pipeline (load → standardize → dedup → validity/anomaly flags → write) for record-scale claims data, with a runnable offline demo
Python · PySpark · scikit-learn · Clinical NLP · Anomaly Detection · A/B Testing · Explainable ML
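As an example of the validity checks listed above, an NPI can be verified with the Luhn algorithm applied to the 10-digit identifier prefixed with `80840`. The sketch below is an independent illustration, not ClinClaw's implementation; the function names are assumptions.

```python
def luhn_valid(digits: str) -> bool:
    """Standard Luhn check over a string of decimal digits."""
    total = 0
    # Walk right to left, doubling every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def npi_valid(npi: str) -> bool:
    """Validate a 10-digit NPI by running Luhn over '80840' + NPI."""
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    return luhn_valid("80840" + npi)

print(npi_valid("1234567893"))  # commonly used example NPI with a valid check digit
print(npi_valid("1234567890"))  # same prefix, wrong check digit
```

A check like this catches typos and truncated identifiers cheaply. It does not confirm that the NPI is actually assigned to a provider, which would need a registry lookup.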
MedClawMini : Healthcare Data-Science Skill Library
A curated library of clinical-AI and healthcare data-science workflows spanning data quality, NLP, scalable ML, explainability, drug safety, and governance — codified as auditable, reusable skill modules.
Highlighted workflows: healthcare data-quality profiling · patient/provider/member entity resolution · claims anomaly detection · ICD/CPT/SNOMED/RxNorm/LOINC/UMLS mapping · clinical NER & summarization · ELK/OpenSearch clinical search · Snorkel weak supervision · Spark/Scala healthcare ETL · explainable ML & subgroup audits · A/B testing & model validation.
Healthcare Data Quality · Clinical NLP · Snorkel · Spark/Scala · ELK/OpenSearch · XAI · Model Governance
MedSift : Local Healthcare ML Workbench
A privacy-first healthcare-ML workbench that structures health records, then runs data-science workflows over them: clinical NER and concept tagging, data-quality normalization, statistical + Isolation-Forest anomaly detection, SHAP-style explanations, weak-supervision labeling, BM25 + dense hybrid search, and experiment evaluation (calibration, A/B, significance tests).
Python · Clinical NLP · Data Quality · Anomaly Detection · SHAP · Weak Supervision · Hybrid Search
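The sparse half of a BM25 + dense hybrid search can be sketched in a few lines of plain Python. This is an illustrative implementation of Okapi BM25 with a non-negative IDF variant, not MedSift's code; the function name, the defaults `k1=1.5` and `b=0.75`, and the toy notes are assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

# Toy notes (made up) to show the ranking behavior.
docs = [
    "patient reports chest pain and shortness of breath".split(),
    "follow up visit for type 2 diabetes".split(),
    "chest x-ray shows no acute findings".split(),
]
scores = bm25_scores("chest pain".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

In a hybrid setup, these lexical scores would typically be normalized and combined with embedding similarity, for example by a weighted sum or reciprocal rank fusion.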
| Project | What it does | Focus |
|---|---|---|
| PB-Dataset-Recommender_Engine | Biomedical-NLP recommender that searches & ranks datasets across NCBI, SRA & EBI | Bio-NLP · search/ranking |
| Human Protein Atlas : Kaggle 2021 | Weakly-supervised, multi-label classification of single-cell protein localization | weak supervision · multi-label DL |
| scRNASeq Workflow + Cell Predictions | Single-cell RNA-seq analysis on the Geneformer transformer foundation model | transformers · embeddings |
| scMultiOmics Integration Tool | Reproducible pipeline integrating single-cell ATAC-seq + RNA-seq | multi-omics · pipeline engineering |
| corral | Terminal dashboard that corrals tmux sessions & Slurm jobs across SSH/compute nodes | software engineering · MLOps/HPC |
| RxFM | Foundation model | MLOps/HPC |
| CLIO | Clinical AI model | Clinical AI/HPC/Healthcare |
| PRISM | HGT model | LLMs/HPC |
- Massachusetts General Hospital : Senior Data Scientist · Boston, MA
- Broad Institute : Data Scientist II · Cambridge, MA
- NIMHD (NIH) via ASRT Inc. : Bioinformatics Scientist · Atlanta, GA
- Bristol Myers Squibb : Applied ML Intern (Applied Bioinformatics) · on-site
- Georgia Tech, Computational Genomics Lab : Graduate Research Assistant · Atlanta, GA
- GATK Team, Broad : Open-Source Developer · remote
- Quadrical.ai : Software Developer & Data Scientist · Gurugram, India
- Elucidata : Data Science & Analytics · New Delhi, India
Languages
Machine Learning & Deep Learning
NLP & Text Mining
Explainable AI
Big Data, Pipelines & MLOps
Data Systems: PostgreSQL · MongoDB · BigQuery · ETL · schema design · data validation
Biomedical Data & Cohorts: EHR-linked cohorts · Kaiser Permanente, RPDR, CPRD, Clalit · UK Biobank · All of Us · MGB Biobank · CELLxGENE · GTEx · gnomAD · ClinVar · TCGA
- Reviewer, Machine Learning for Health (ML4H) and Conference on Health, Inference, and Learning (CHIL)
- Poster presentations at ML4H and CHIL: CLIO (Continuous-time clinical LLM-Informed Ordinal imputation, 2025) and PRISM (Phenotype-Resolved Inference for Spectrum-wide Matching, 2026)
A few representative papers; the full list is on Google Scholar.
I build systems where the model is only one part of the product. Good healthcare ML needs clean, validated, documented data · explicit assumptions and leakage checks · reproducible experiments · interpretable outputs for clinical and business stakeholders · failure handling and monitoring · domain-expert feedback loops · and production-quality code a team can maintain. That's the kind of healthcare AI I build.
Open to conversations on healthcare data quality, explainable ML, clinical NLP, scalable ML systems, and foundation models for human health.

