ProQuest: Protein Querying Using Semantic Technology for developing the UniProtKB LLM Query Interface
Efficiently accessing and analyzing comprehensive biological datasets remains challenging because of the complexity of traditional query interfaces. To address this, we developed an intuitive, scalable query interface built on advanced large language models. Our system enables users, regardless of computational expertise, to formulate natural-language queries that are automatically translated into precise Solr database searches, significantly simplifying interaction with UniProtKB. Additionally, we implemented a semantic vector search for rapid protein similarity analyses, using protein embeddings generated by the ProtT5 protein language model within an optimized approximate nearest-neighbor search framework (Annoy). This method significantly outperforms conventional BLAST searches, offering a speed increase of up to tenfold on GPU hardware. Functional insights are further enriched through integrated Gene Ontology analyses, providing biologically meaningful context to similarity searches. We are currently expanding the system with Retrieval-Augmented Generation, integrating real-time annotations from UniProt flat files to enhance the contextual relevance and accuracy of generated responses. Evaluations using diverse biological queries demonstrated the robustness of our interface, highlighting its ability to mitigate intrinsic variability in LLM outputs through controlled prompt engineering and query retry mechanisms. Overall, our project substantially streamlines the retrieval process, facilitating quicker, more accurate exploration of protein functions, evolutionary relationships, and annotations.
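As an illustration of the similarity-search component described above, the following minimal Python sketch queries a prebuilt Annoy index and resolves the returned item ids through protein_index.db. The 1024-dimensional embedding size, the table name, and the column names are assumptions made for illustration only, not the project's actual schema:

from annoy import AnnoyIndex
import sqlite3

EMBEDDING_DIM = 1024  # assumed size of ProtT5 per-protein embeddings

# Load the prebuilt approximate nearest-neighbor index of protein embeddings
index = AnnoyIndex(EMBEDDING_DIM, "angular")
index.load("protein_embeddings.ann")

# protein_index.db is assumed to map Annoy item ids to UniProt accessions
db = sqlite3.connect("protein_index.db")

def similar_proteins(item_id, n=10):
    # Fetch the n nearest neighbors of an already-indexed protein
    ids, distances = index.get_nns_by_item(item_id, n, include_distances=True)
    hits = []
    for i, dist in zip(ids, distances):
        # Hypothetical schema: proteins(annoy_id INTEGER, accession TEXT)
        row = db.execute("SELECT accession FROM proteins WHERE annoy_id = ?", (i,)).fetchone()
        hits.append((row[0] if row else None, dist))
    return hits

print(similar_proteins(0))

The angular metric corresponds to cosine similarity between embeddings, a common choice for comparing protein language-model representations.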
Follow these steps to set up and run the application on Windows, macOS, and Linux systems.
- Python 3.x installed on your system.
- Streamlit library installed.
Ensure you have the necessary permissions to run scripts and install packages on your machine.
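You can quickly confirm both prerequisites from a terminal, for example:
python --version
python -c "import streamlit; print(streamlit.__version__)"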
Before installing and running the application, ensure you have the following databases:
- Asset tables in protein_index.db: generated by executing ./Uniprot-LLM/config/setUpDatabase.py
- protein_embeddings.ann and protein_index.db: generated by executing ./Uniprot-LLM/config/implementVectorDatabase.py
NOTE: protein_embeddings.ann can also be downloaded from: https://drive.google.com/drive/folders/1aJTXtk0QiqiCiH2a-CK0mnund_4JwlGh
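To verify that both files are present and readable before starting the app, a short Python check such as the following can be used; the embedding dimension of 1024 is an assumption (matching ProtT5 per-protein embeddings) and must match the dimension used when the index was built:

import sqlite3
from annoy import AnnoyIndex

# List the tables created in protein_index.db
db = sqlite3.connect("protein_index.db")
print(db.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())

# Load the Annoy index; the dimension must match the one used to build it
index = AnnoyIndex(1024, "angular")
index.load("protein_embeddings.ann")
print("indexed proteins:", index.get_n_items())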
Open your terminal or command prompt and execute the following commands:
Windows:
python .\config\setUpDatabase.py
macOS:
python ./config/setUpDatabase.py
Install the required packages by running:
Windows:
python .\config\installpackages.py
macOS:
python ./config/installpackages.py
If you encounter issues and need to uninstall all previously installed libraries, use the following commands:
Windows (PowerShell):
pip freeze | %{$_.split('==')[0]} | %{pip uninstall -y $_}
macOS:
pip freeze | awk -F'==' '{print $1}' | xargs pip uninstall -y
Linux:
1. Filter and Save Required Libraries to a File:
Generate a list of specific libraries you wish to uninstall and save them into a requirements.txt
file using the command:
pip freeze | grep -E 'streamlit|requests|langchain_openai|langchain_google_genai|langchain_anthropic|langchain_nvidia_ai_endpoints|langchain|scikit-learn|langchain_mistralai|openpyxl|matplotlib|annoy|h5py' > requirements.txt
2. Inspect the File: Review the list of libraries to be uninstalled to ensure accuracy:
cat requirements.txt
Check for Local Packages: If there are any lines containing @ file:///, indicating a locally installed package, you will need to manually find the correct version to replace the path. Use the following command to find the version of a problematic module (replace <module_name> with the actual name of the module):
pip show <module_name>
After identifying the correct version, use a text editor such as vim to manually replace the @ file:/// path in requirements.txt with the correct version number (see the example after step 3).
vim requirements.txt
3. Uninstall the Libraries: Uninstall the libraries listed in the requirements.txt file using:
cat requirements.txt | xargs -n1 pip uninstall -y
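For example, a locally installed entry in requirements.txt (the package name, path, and version below are purely illustrative) would be changed from:
scikit-learn @ file:///tmp/build/scikit-learn
to:
scikit-learn==1.4.2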
Start the application by executing:
Windows, macOS and Linux:
python -m streamlit run main.py
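Streamlit serves the app on port 8501 by default; if that port is already in use, you can pass Streamlit's standard --server.port option, for example:
python -m streamlit run main.py --server.port 8502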
To utilize all features of the application, obtain free API keys from the following providers:
- Google Generative AI Models
  Get your API key at:
- Chat NVIDIA Models
  Get your API key at: https://build.nvidia.com/ibm/granite-3_0-8b-instruct?snippet_tab=LangChain
- Chat Mistral AI Models
  Get your API key at:
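How the keys are passed to the application depends on its configuration (for example, an input field in the UI); if you prefer environment variables, the LangChain integrations listed among the required packages conventionally read them from variables such as the following, shown here only as an illustration:
export GOOGLE_API_KEY="your-google-key"
export NVIDIA_API_KEY="your-nvidia-key"
export MISTRAL_API_KEY="your-mistral-key"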