ProQuest: Protein Querying Using Semantic Technology for developing the UniProtKB LLM Query Interface
Efficiently accessing and analyzing comprehensive biological datasets remains challenging because of the complexity of traditional query interfaces. To address this, we developed an intuitive, scalable query interface built on advanced large language models. Our system enables users, regardless of computational expertise, to formulate natural-language queries that are automatically translated into precise Solr database searches, significantly simplifying interaction with UniProtKB. Additionally, we implemented a semantic vector search for rapid protein similarity analyses, using protein embeddings generated by the ProtT5 protein language model within an optimized approximate nearest-neighbor search framework (Annoy). This method significantly outperforms conventional BLAST searches, offering a speed increase of up to tenfold on GPU hardware. Functional insights are further enriched through integrated Gene Ontology analyses, providing biologically meaningful context to similarity searches. We are currently expanding the system with Retrieval-Augmented Generation, integrating real-time annotations from UniProt flat files to enhance the contextual relevance and accuracy of generated responses. Evaluations using diverse biological queries demonstrated the robustness of our interface, highlighting its ability to mitigate intrinsic variability in LLM outputs through controlled prompt engineering and query retry mechanisms. Overall, our project substantially streamlines the retrieval process, facilitating quicker, more accurate exploration of protein functions, evolutionary relationships, and annotations.
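As an illustration of the similarity-search component described above, the following minimal Python sketch queries a prebuilt Annoy index and resolves the returned item ids through protein_index.db. The 1024-dimensional embedding size, the table name, and the column names are assumptions made for illustration only, not the project's actual schema:

from annoy import AnnoyIndex
import sqlite3

EMBEDDING_DIM = 1024  # assumed size of ProtT5 per-protein embeddings

# Load the prebuilt approximate nearest-neighbor index of protein embeddings
index = AnnoyIndex(EMBEDDING_DIM, "angular")
index.load("protein_embeddings.ann")

# protein_index.db is assumed to map Annoy item ids to UniProt accessions
db = sqlite3.connect("protein_index.db")

def similar_proteins(item_id, n=10):
    # Fetch the n nearest neighbors of an already-indexed protein
    ids, distances = index.get_nns_by_item(item_id, n, include_distances=True)
    hits = []
    for i, dist in zip(ids, distances):
        # Hypothetical schema: proteins(annoy_id INTEGER, accession TEXT)
        row = db.execute("SELECT accession FROM proteins WHERE annoy_id = ?", (i,)).fetchone()
        hits.append((row[0] if row else None, dist))
    return hits

print(similar_proteins(0))

The angular metric corresponds to cosine similarity between embeddings, a common choice for comparing protein language-model representations.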
Follow these steps to set up and run the application on Windows, macOS, and Linux systems.
- Python 3.x installed on your system.
- Streamlit library installed.
Ensure you have the necessary permissions to run scripts and install packages on your machine.
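You can quickly confirm both prerequisites from a terminal, for example:
python --version
python -c "import streamlit; print(streamlit.__version__)"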
Before installing and running the application, ensure you have the following databases:
- Asset tables in protein_index.db: generated by executing ./Uniprot-LLM/config/setUpDatabase.py
- protein_embeddings.ann and protein_index.db: generated by executing ./Uniprot-LLM/config/implementVectorDatabase.py
NOTE: protein_embeddings.ann can also be downloaded from: https://drive.google.com/drive/folders/1aJTXtk0QiqiCiH2a-CK0mnund_4JwlGh
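To verify that both files are present and readable before starting the app, a short Python check such as the following can be used; the embedding dimension of 1024 is an assumption (matching ProtT5 per-protein embeddings) and must match the dimension used when the index was built:

import sqlite3
from annoy import AnnoyIndex

# List the tables created in protein_index.db
db = sqlite3.connect("protein_index.db")
print(db.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())

# Load the Annoy index; the dimension must match the one used to build it
index = AnnoyIndex(1024, "angular")
index.load("protein_embeddings.ann")
print("indexed proteins:", index.get_n_items())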
Open your terminal or command prompt and execute the following commands:
Windows:
python .\config\setUpDatabase.py
macOS:
python ./config/setUpDatabase.py
Install the required packages by running:
Windows:
python .\config\installpackages.py
macOS:
python ./config/installpackages.py
If you encounter issues and need to uninstall all previously installed libraries, use the following commands:
Windows (PowerShell):
pip freeze | %{$_.split('==')[0]} | %{pip uninstall -y $_}
macOS:
pip freeze | awk -F'==' '{print $1}' | xargs pip uninstall -y
Linux:
1. Filter and Save Required Libraries to a File:
Generate a list of specific libraries you wish to uninstall and save them into a requirements.txt
file using the command:
pip freeze | grep -E 'streamlit|requests|langchain_openai|langchain_google_genai|langchain_anthropic|langchain_nvidia_ai_endpoints|langchain|scikit-learn|langchain_mistralai|openpyxl|matplotlib|annoy|h5py' > requirements.txt
2. Inspect the File: Review the list of libraries to be uninstalled to ensure accuracy:
cat requirements.txt
Check for Local Packages: If there are any lines containing @ file:///, indicating a locally installed package, you will need to manually find the correct version to replace the path. Use the following command to find the version of a problematic module (replace <module_name> with the actual name of the module):
pip show <module_name>
After identifying the correct version, use a text editor such as vim to manually replace the @ file:/// path in requirements.txt with the correct version number (see the example after step 3).
vim requirements.txt
3. Uninstall the Libraries: Uninstall the libraries listed in the requirements.txt file using:
cat requirements.txt | xargs -n1 pip uninstall -y
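For example, a locally installed entry in requirements.txt (the package name, path, and version below are purely illustrative) would be changed from:
scikit-learn @ file:///tmp/build/scikit-learn
to:
scikit-learn==1.4.2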
Start the application by executing:
Windows, macOS and Linux:
python -m streamlit run main.py
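Streamlit serves the app on port 8501 by default; if that port is already in use, you can pass Streamlit's standard --server.port option, for example:
python -m streamlit run main.py --server.port 8502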
To utilize all features of the application, obtain free API keys from the following providers:
- Google Generative AI Models
  Get your API key at:
- Chat NVIDIA Models
  Get your API key at: https://build.nvidia.com/ibm/granite-3_0-8b-instruct?snippet_tab=LangChain
- Chat Mistral AI Models
  Get your API key at:
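How the keys are passed to the application depends on its configuration (for example, an input field in the UI); if you prefer environment variables, the LangChain integrations listed among the required packages conventionally read them from variables such as the following, shown here only as an illustration:
export GOOGLE_API_KEY="your-google-key"
export NVIDIA_API_KEY="your-nvidia-key"
export MISTRAL_API_KEY="your-mistral-key"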