This repository contains an OCR application that stores temporary images, reads receipts, and writes proposed records to a database as part of the Be Part Of the Event application.
The purpose of this project is to build an OCR microservice.
- S3 object storage integration,
- basic OCR via open-source Python libraries,
- modular design that allows future integration of AI-based OCR models,
- MongoDB cluster for data persistence,
- accessible only via a gateway connection.
- Python 3.14+ with UV package manager
- Docker Desktop / Docker + Compose
- just task runner (optional, but recommended)
- Clone the repository:
git clone https://github.com/Cybernetic-Ransomware/bpoe-ocr.git
- Set .env file based on the template.
- Run using Docker:
docker-compose -f .\docker\docker-compose.yml up --build -d
- Clone the repository:
git clone https://github.com/Cybernetic-Ransomware/bpoe-ocr.git
- Set .env file based on the template.
- Create a directory:
/temp/minio_data
- Provide access to a MinIO/S3 instance with a bucket and writer/reader users that match the .env.template file.
- The writer should have both policies: readwrite and writeonly.
- Provide access to a MongoDB instance that matches the .env.template file.
- Install UV:
pip install uv
- Install dependencies:
uv sync
- Install pre-commit hooks:
uv run pre-commit install
uv run pre-commit autoupdate
uv run pre-commit run --all-files
- Run the application locally:
uv run uvicorn src.main:app --host 0.0.0.0 --port 8080 --reload
- The repository will include a Postman collection with ready-to-import webhook mockers
Common tasks are available via the just task runner:
just test # unit tests
just test-integration # integration tests (requires Docker)
just lint # ruff + ty + codespell + bandit
just format # ruff format
just up / just down # Docker stack

uv sync
uv run pytest

uv sync
uv run ruff check src/
uv run ty check src/

docker run -p 9000:9000 -p 9001:9001 \
quay.io/minio/minio server /data --console-address ":9001"
- mounted by default on WSL, e.g.
docker-desktop->/var/lib/docker/volumes/minio_minio_data/_data
To connect to the MongoDB cluster with MongoDB Compass:
- Open MongoDB Compass
- Use the connection string, by default:
mongodb://localhost:27017/
- Click "Connect"
To verify if sharding is enabled for a collection:
- Open the MongoDB Shell in Compass and check the sharding status:
sh.status()
- Look for information about a sharded collection, for example:
sh.shardCollection("ocr.ocr_images", { _id: 1 })
- If the collections section is empty, the collection is not sharded yet:
"ocr": { primary: 'rs-shard02', collections: {} }
- To enable sharding, run the following commands:
sh.enableSharding("ocr")
sh.shardCollection("ocr.ocr_images", { _id: 1 })
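The same two shell commands can also be issued from Python. The following is a minimal sketch, not part of this repository: the function name enable_collection_sharding is hypothetical, and the client argument is assumed to behave like a pymongo MongoClient connected to the mongos router, where admin.command sends a raw database command.

```python
from typing import Any


def enable_collection_sharding(
    client: Any, database: str, collection: str, key: dict[str, Any]
) -> list[Any]:
    """Issue the admin commands behind sh.enableSharding / sh.shardCollection.

    `client` is assumed to expose `client.admin.command(...)` the way
    pymongo.MongoClient does. Returns the raw replies of both commands.
    """
    namespace = f"{database}.{collection}"
    replies = [
        # Equivalent of: sh.enableSharding("ocr")
        client.admin.command("enableSharding", database),
        # Equivalent of: sh.shardCollection("ocr.ocr_images", { _id: 1 })
        client.admin.command("shardCollection", namespace, key=key),
    ]
    return replies


# Usage (assumes pymongo is installed and mongos listens on the default URI):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017/")
#   enable_collection_sharding(client, "ocr", "ocr_images", {"_id": 1})
```

Afterwards, sh.status() in the Compass shell should list ocr.ocr_images under the collections of the ocr database.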
- Orphaned Mongo records after S3 delete failure: when process_ocr succeeds but the subsequent S3 cleanup fails, the Mongo record is retained and the file remains in the bucket. A scheduled cleanup job (or a TTL index on the collection) should identify and remove records whose corresponding S3 objects no longer exist, or vice versa.
- Readiness probe: GET /healthz is a liveness check only (process is alive). A /readyz endpoint performing lightweight Mongo + S3 pings is needed for orchestrators to distinguish a live-but-not-ready container from a healthy one.
- MongoDB authentication: the MongoDB cluster runs without --auth, giving any service on mongonetwork full R/W access. Requires enabling auth in mongo-init.sh (keyfile between replica members, admin/clusterAdmin/writer users), updating all mongod/mongos commands, and injecting credentials via MONGO_WRITER_URI/MONGO_ADMIN_URI in .env. Also: MONGODB_URI in the environment: block of docker-compose.yml appears unused; the application reads MONGO_WRITER_URI and MONGO_ADMIN_URI from env_file.
- Inter-service authorization: endpoints are currently accessible to any caller that can reach the service. Requests from the API gateway should be authenticated (e.g. shared secret header, mTLS, or a service token) to prevent unauthorized access to OCR and storage operations.
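The readiness probe item can be sketched as a small, framework-independent helper. This is only an illustration under assumptions: the readiness function and the shape of its report are hypothetical, not existing code in this repository. Each dependency is passed in as a zero-argument callable that raises on failure, so the helper itself needs no Mongo or S3 client.

```python
from collections.abc import Callable


def readiness(checks: dict[str, Callable[[], object]]) -> tuple[int, dict[str, str]]:
    """Run every dependency check and build a /readyz-style result.

    Each check is a zero-argument callable that raises on failure.
    Returns (HTTP status code, per-dependency report).
    """
    report: dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            report[name] = "ok"
        except Exception as exc:  # any failure marks this dependency as not ready
            report[name] = f"error: {type(exc).__name__}"
    # 200 only when every dependency answered; 503 tells the orchestrator to wait.
    status = 200 if all(value == "ok" for value in report.values()) else 503
    return status, report


# Possible wiring inside the service (client and bucket names are assumptions):
#   readiness({
#       "mongo": lambda: mongo_client.admin.command("ping"),
#       "s3": lambda: s3_client.head_bucket(Bucket=bucket_name),
#   })
```

Keeping the checks injectable means the /readyz handler stays a thin wrapper that returns the status code and report, while GET /healthz can remain a pure liveness check.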
- Boto3 examples: Amazon doc
- MinIO Docker image: DockerHub
- AsyncBoto3 for further refactoring: PyPI
- Pytesseract configs: pyimagesearch
- MongoDB Compass winget command: winget