ProQuest: Protein Querying Using Semantic Technology for developing the UniProtKB LLM Query Interface

HUBioDataLab/PROQUEST

 
 

Abstract

Efficiently accessing and analyzing comprehensive biological datasets remains challenging because traditional query interfaces are complex. To address this, we developed an intuitive, scalable query interface built on large language models (LLMs). The system lets users, regardless of computational expertise, formulate natural-language queries that are automatically translated into precise Solr database searches, which significantly simplifies interaction with UniProtKB. We also implemented a semantic vector search for rapid protein similarity analyses, using protein embeddings generated by the ProtT5 protein language model within an optimized approximate nearest-neighbor search framework (Annoy). This method significantly outperforms conventional BLAST searches in speed, running up to ten times faster on GPU hardware. Integrated Gene Ontology analyses enrich the results with functional insights, giving similarity searches biologically meaningful context. We are currently expanding the system with Retrieval-Augmented Generation (RAG), integrating real-time annotations from UniProt flat files to improve the contextual relevance and accuracy of generated responses. Evaluations on diverse biological queries demonstrated the robustness of the interface and its ability to mitigate the intrinsic variability of LLM outputs through controlled prompt engineering and query retry mechanisms. Overall, the project substantially streamlines retrieval, enabling quicker and more accurate exploration of protein functions, evolutionary relationships, and annotations.
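To make the similarity search concrete, here is a minimal sketch of the underlying idea: rank stored embeddings by cosine similarity to a query embedding. It is an exact, brute-force stand-in written with the standard library only, not the project's implementation; Annoy approximates this ranking with a tree-based index so that it scales to large collections. The protein identifiers and the three-dimensional vectors are made up for illustration, and real ProtT5 embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_proteins(query, embeddings, k=3):
    """Return the k (protein_id, similarity) pairs most similar to query."""
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy data: identifiers and vectors are illustrative only.
toy_embeddings = {
    "PROT_A": [1.0, 0.0, 0.0],
    "PROT_B": [0.9, 0.1, 0.0],
    "PROT_C": [0.0, 1.0, 0.0],
}
hits = nearest_proteins([1.0, 0.0, 0.0], toy_embeddings, k=2)
```

The brute-force scan compares the query against every stored vector, which is the cost an approximate index such as Annoy is designed to avoid.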

Application Setup and Run Instructions

Follow these steps to set up and run the application on Windows, macOS, and Linux systems.

Table of Contents

  1. Prerequisites
  2. Requirements
  3. Installation
  4. Running the Program
  5. Obtaining Free API Keys

Prerequisites

  • Python 3.x installed on your system.
  • Streamlit library installed.

Ensure you have the necessary permissions to run scripts and install packages on your machine.
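The prerequisites above can be checked before going further. The following is a small sketch, not part of the repository: the function name `check_prerequisites` is hypothetical, and it only tests that the interpreter is Python 3 and that the `streamlit` module can be found, without importing it.

```python
import importlib.util
import sys


def check_prerequisites(required_modules=("streamlit",), min_python=(3, 0)):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required, found {found}")
    for name in required_modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing module: {name}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

Running the script prints nothing when both prerequisites are met.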


Requirements

Before installing and running the application, ensure you have the following databases:

  • asset tables in protein_index.db: Generated by executing ./Uniprot-LLM/config/setUpDatabase.py.
  • protein_embeddings.ann and protein_index.db: Generated by executing ./Uniprot-LLM/config/implementVectorDatabase.py.

NOTE: protein_embeddings.ann can also be downloaded from: https://drive.google.com/drive/folders/1aJTXtk0QiqiCiH2a-CK0mnund_4JwlGh
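One way to confirm that the required files are in place is a short preflight check. This is a sketch under assumptions: the helper names are hypothetical, and the directory that holds the files is passed in by the caller because the README does not state where they must live. The table listing relies only on SQLite's built-in `sqlite_master` catalog, so it makes no assumption about the asset table names.

```python
import sqlite3
from contextlib import closing
from pathlib import Path

# File names taken from the Requirements list above.
REQUIRED_FILES = ("protein_index.db", "protein_embeddings.ann")


def missing_files(directory, names=REQUIRED_FILES):
    """Return the required file names that are absent from directory."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


def list_tables(db_path):
    """Return the table names stored in an existing SQLite database file."""
    path = Path(db_path)
    if not path.is_file():
        # sqlite3.connect would silently create an empty database otherwise.
        raise FileNotFoundError(path)
    with closing(sqlite3.connect(str(path))) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [row[0] for row in rows]
```

For example, `missing_files(".")` run from the directory that should contain the files returns an empty list once both scripts have completed or the download has been placed there.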


Installation

1. Create the Asset Database

Open your terminal or command prompt and execute the following commands:

Windows:

python .\config\setupdatabase.py

macOS and Linux:

python ./config/setupdatabase.py

2. Create the Working Environment

Install the required packages by running:

Windows:

python .\config\installpackages.py

macOS and Linux:

python ./config/installpackages.py
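The commands in steps 1 and 2 differ between platforms only in the path separator. As a sketch, the same invocations can be built in a platform-independent way with `pathlib` and the current interpreter. The helper names are hypothetical, and the script names and the `config` directory are taken from the commands above; run it from the directory that contains `config`.

```python
import subprocess
import sys
from pathlib import Path


def config_script_command(script_name, config_dir="config"):
    """Build the command list for running a script in the config directory."""
    # sys.executable ensures the script runs under the same Python interpreter.
    return [sys.executable, str(Path(config_dir) / script_name)]


def run_config_script(script_name, config_dir="config"):
    """Run a config script and raise CalledProcessError if it fails."""
    return subprocess.run(config_script_command(script_name, config_dir), check=True)
```

Calling `run_config_script("setupdatabase.py")` followed by `run_config_script("installpackages.py")` mirrors the two steps above.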

3. (Optional) Delete Previously Installed Libraries

If you encounter issues and need to uninstall all previously installed libraries, use the following commands:

Windows (PowerShell):

pip freeze | %{$_.split('==')[0]} | %{pip uninstall -y $_}

macOS:

pip freeze | awk -F'==' '{print $1}' | xargs pip uninstall -y

Linux:

1. Filter and Save Required Libraries to a File: Generate a list of specific libraries you wish to uninstall and save them into a requirements.txt file using the command:

pip freeze | grep -E 'streamlit|requests|langchain_openai|langchain_google_genai|langchain_anthropic|langchain_nvidia_ai_endpoints|langchain|scikit-learn|langchain_mistralai|openpyxl|matplotlib|annoy|h5py' > requirements.txt

2. Inspect the File: Review the list of libraries to be uninstalled to ensure accuracy:

cat requirements.txt

Check for Local Packages: If there are any lines containing @ file:///, indicating a locally installed package, you will need to manually find the correct version to replace the path. Use the following command to find the version of a problematic module (replace <module_name> with the actual name of the module):

pip show <module_name>

After identifying the correct version, use a text editor like vim to manually replace the @ file:/// path in requirements.txt with the correct version number.

vim requirements.txt

3. Uninstall the Libraries: Uninstall the libraries listed in the requirements.txt file using:

cat requirements.txt | xargs -n1 pip uninstall -y
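The pipelines above all reduce `pip freeze` output to package names. The sketch below shows that parsing step in Python, covering both the `name==version` form and the `name @ file:///...` form that local installs produce. The function name `package_names` is hypothetical, and skipping editable (`-e`) entries is a choice made for this example.

```python
import re

# Split at the first "==" (or "===") or "@" separator, ignoring spaces around it.
_SEPARATOR = re.compile(r"\s*(?:===?|@)\s*")


def package_names(freeze_lines):
    """Extract bare package names from lines of `pip freeze` output."""
    names = []
    for line in freeze_lines:
        line = line.strip()
        # Skip blanks, comments, and editable installs.
        if not line or line.startswith("#") or line.startswith("-e "):
            continue
        names.append(_SEPARATOR.split(line, maxsplit=1)[0])
    return names
```

Uninstalling by bare name does not depend on the version string, so extracting names this way may make the manual edit of `@ file:///` lines unnecessary.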

Running the Program

Start the application by executing:

Windows, macOS and Linux:

python -m streamlit run main.py

Obtaining Free API Keys

To utilize all features of the application, obtain free API keys from the following providers:
