PyCampES/MedNERDS

 
 


MedNERDS

MedNERds (Medical Named Entity Recognition for Data Structuring) is a clinical natural language processing (NLP) project that converts free-text clinical narratives into structured, analyzable data. The system uses a pre-trained NER model based on the BioBERT transformer to identify relevant entities (e.g., medical history, symptoms, medications) and augments these predictions with post-processing steps such as negation detection.

Features

  • The Streamlit app allows the user to input a medical note in free-text format, along with a patient identifier.
  • When the user clicks the “Analyze” button, the text is passed to a BioBERT-based NER model (d4data/biomedical-ner-all), which detects entities such as patient age, sex, medical history, symptoms, and medications. The NegSpaCy library is then used to flag detected entities that are negated in the text.
  • The entities identified by the model can be visualized in the app as highlighted spans within the input text.
  • The output of the NER model is then converted into a structured table, which is also displayed in the app.
  • The output table is automatically appended to a SQLite database, along with the date and time of the record.
  • In the “Find” tab, the user can search for records of a specific patient by entering the patient identifier.
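
The structuring, storage, and lookup steps listed above can be sketched roughly as follows. This is a minimal illustration and not the code in app.py: the table name `records`, its columns, the helper names, and the sample entities are all assumptions. The entity dictionaries mimic the shape returned by a Hugging Face token-classification pipeline with `aggregation_strategy="simple"`, and the `negated` field stands in for the negation-detection result.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical NER output: shaped like aggregated Hugging Face pipeline results,
# with an illustrative "negated" flag added by a negation-detection step.
entities = [
    {"entity_group": "Age", "word": "54-year-old", "start": 2, "end": 13, "negated": False},
    {"entity_group": "Sign_symptom", "word": "fever", "start": 27, "end": 32, "negated": True},
]


def to_rows(patient_id, entities, recorded_at):
    """Flatten the entity list into one table row per entity."""
    return [
        (patient_id, recorded_at, e["entity_group"], e["word"],
         e["start"], e["end"], int(e["negated"]))
        for e in entities
    ]


def save_rows(conn, rows):
    """Create the (assumed) records table if needed and append the rows."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS records (
               patient_id TEXT, recorded_at TEXT, label TEXT, text TEXT,
               start_char INTEGER, end_char INTEGER, negated INTEGER)"""
    )
    conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def find_patient(conn, patient_id):
    """Return (label, text, negated) for every stored entity of one patient."""
    cur = conn.execute(
        "SELECT label, text, negated FROM records WHERE patient_id = ? ORDER BY rowid",
        (patient_id,),
    )
    return cur.fetchall()


# In-memory database for the example; the app would use a file on disk.
conn = sqlite3.connect(":memory:")
timestamp = datetime.now(timezone.utc).isoformat()
save_rows(conn, to_rows("P001", entities, timestamp))
found = find_patient(conn, "P001")
```

Storing one row per entity, rather than one row per note, keeps the patient lookup a simple filtered query and leaves room for extra columns later.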

Streamlit App

How to run the app

Install dependencies with uv:

uv sync

Run the Streamlit app (note: models are downloaded on first launch, which may take a few minutes):

uv run streamlit run app.py

Future steps

  • Fine-tune a custom NER model. Adapt a transformer model such as ClinicalBERT to a specific use case by training on a curated corpus of EHR data. This would make it possible to define custom entity types (e.g., comorbidities, lab values, procedures) and improve performance on domain-specific language compared to general biomedical models.
  • Standardize extracted entities. Map detected entities to controlled vocabularies such as SNOMED CT or UMLS. This enables consistent representation of clinical concepts, facilitates interoperability, and allows downstream analysis (e.g., grouping synonymous terms under the same concept ID).
  • Integrate PoS tagging and relation extraction. Use part-of-speech tagging and relation extraction techniques to capture relationships between entities (e.g., medication–dosage, symptom–anatomical location). This moves the system beyond entity detection toward structured clinical understanding.
  • Add OCR for document ingestion. Incorporate optical character recognition using tools like Tesseract OCR to process scanned PDFs or images. This would allow the pipeline to handle non-digital clinical documents and expand input sources.
  • Add multi-language support.
  • Leverage large language models (LLMs). Explore using LLMs for entity extraction, normalization, or validation. Models such as GPT-4 can complement traditional NER by handling ambiguous cases, improving recall, or assisting in tasks like relation extraction and summarization.
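
As a rough illustration of the standardization step proposed above, the sketch below groups synonymous surface forms under a shared concept ID using a plain dictionary lookup. The vocabulary, the `CONCEPT-…` identifiers, and the function names are hypothetical placeholders, not real SNOMED CT or UMLS codes; a real implementation would query a licensed terminology resource instead.

```python
# Toy vocabulary: normalized surface form -> placeholder concept ID.
# These IDs are invented for the example and are NOT real SNOMED CT or UMLS codes.
TOY_VOCABULARY = {
    "hypertension": "CONCEPT-0001",
    "high blood pressure": "CONCEPT-0001",
    "htn": "CONCEPT-0001",
    "myocardial infarction": "CONCEPT-0002",
    "heart attack": "CONCEPT-0002",
}


def normalize(term):
    """Lowercase the term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def map_to_concept(term, vocabulary=TOY_VOCABULARY):
    """Return the concept ID for a term, or None if it is not in the vocabulary."""
    return vocabulary.get(normalize(term))


def group_by_concept(terms):
    """Group the original terms by concept ID; unmapped terms fall under None."""
    groups = {}
    for term in terms:
        groups.setdefault(map_to_concept(term), []).append(term)
    return groups


grouped = group_by_concept(
    ["Hypertension", "high  blood pressure", "Heart attack", "dizziness"]
)
```

Keeping unmapped terms under `None` instead of discarding them makes gaps in the vocabulary visible for later review.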

