One-time GPU training pipeline that produces a frozen LoRA adapter and deploys that same artifact everywhere: backtests, paper trading, live bot, and the research API. No retraining unless explicitly triggered.
+------------------------------+
| Data Ingestion + FeatureStore|
| (OHLCV, fundamentals, news,  |
| transcripts, macro, social)  |
+---------------+--------------+
|
v
+------------------------+
| Training Corpus Builder|
| - Tabular (parquet) |
| - Text (jsonl prompts) |
+-----------+------------+
|
+---------------+---------------+
| |
v v
+----------------------+ +----------------------+
| Tabular Model (CPU) | | LoRA LLM (GPU) |
| XGBoost + Optuna | | L40S or A100 |
+-----------+----------+ +-----------+----------+
| |
+---------------+---------------+
v
+----------------------+
| Ensemble Scoring |
+----------+-----------+
|
+---------------------+---------------------+
| | |
v v v
Backtester (CPU) Paper Bot (CPU) FastAPI (CPU)
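The ensemble step in the diagram can be sketched as a simple weighted blend. The weights (0.7/0.3) and the two inputs (a tabular probability and an LLM sentiment score) are illustrative assumptions, not the repo's actual scoring rule:

```python
def ensemble_score(tabular_prob: float, llm_score: float,
                   w_tabular: float = 0.7, w_llm: float = 0.3) -> float:
    """Blend the tabular model's probability with the LLM's [-1, 1]
    sentiment score (mapped to [0, 1]) into a single signal in [0, 1].
    Weights are illustrative defaults, not the pipeline's real values."""
    llm_prob = (llm_score + 1.0) / 2.0  # map [-1, 1] -> [0, 1]
    return w_tabular * tabular_prob + w_llm * llm_prob

signal = ensemble_score(tabular_prob=0.8, llm_score=0.5)
print(round(signal, 3))  # 0.7*0.8 + 0.3*0.75 = 0.785
```

Because the blend is computed at scoring time from the two frozen models' outputs, the backtester, paper bot, and API all see identical signals for identical inputs.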
- Train ONCE on GPU, save a LoRA adapter artifact.
- All downstream uses load the same frozen adapter and tabular model.
- No GPU usage after training. Inference uses CPU or NVIDIA NIM.
- Backtests are the source of truth. No alpha claims without metrics.
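One way to make "all downstream uses load the same frozen artifact" enforceable is to record a content hash at train time and verify it at load time. A stdlib-only sketch, assuming a `training_metadata.json` with an `artifact_sha256` field (the repo's actual metadata schema may differ):

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Content hash of one artifact file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_artifact(artifact: Path, metadata_file: Path) -> bool:
    """Check that the artifact on disk matches the hash recorded at
    training time, so every consumer provably loads the same bytes."""
    meta = json.loads(metadata_file.read_text())
    return meta["artifact_sha256"] == sha256_of(artifact)

if __name__ == "__main__":
    # Demo with temporary files standing in for artifacts/.
    import tempfile
    with tempfile.TemporaryDirectory() as d:
        art = Path(d) / "tabular_model.ubj"
        art.write_bytes(b"frozen-model-bytes")
        meta = Path(d) / "training_metadata.json"
        meta.write_text(json.dumps({"artifact_sha256": sha256_of(art)}))
        print(verify_artifact(art, meta))  # True
```

The same check could run at startup in the backtester, paper bot, and API to refuse a drifted or partially extracted `artifacts/` directory.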
- OHLCV: `yfinance` (daily + limited intraday), optional Polygon/Alpha Vantage.
- Fundamentals: `yfinance` metadata + SEC Financial Statement Data Sets (FSDS).
- FSDS: bulk quarterly 10-Q/10-K numeric data from the SEC (10 GB+ when spanning 2010-2024).
- News: NewsAPI (free tier), Yahoo RSS.
- Transcripts: local JSON or SEC 8-K ingestion (stubbed).
- Macro: FRED API + `yfinance` (VIX, DXY, gold, crude).
- Sentiment: placeholders for Reddit/Twitter/short interest.
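As a concrete illustration of what the corpus builder consumes, here is a minimal sketch that turns a short series of closes plus one macro value into a single tabular feature row. The field names (`ret_1d`, `ret_5d`, `vix`) are illustrative assumptions, not the pipeline's real schema:

```python
def feature_row(closes: list, vix: float) -> dict:
    """One tabular row: 1-day return, 5-day return, and a macro feature.
    Field names are illustrative, not the pipeline's real schema."""
    if len(closes) < 6:
        raise ValueError("need at least 6 closes for a 5-day return")
    return {
        "ret_1d": closes[-1] / closes[-2] - 1.0,
        "ret_5d": closes[-1] / closes[-6] - 1.0,
        "vix": vix,
    }

row = feature_row([100, 101, 102, 103, 104, 105], vix=14.2)
print(row["ret_1d"])
```

Rows like this would be stacked per ticker/date into the parquet side of the corpus, while the text side (news, transcripts) goes into the JSONL prompts.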
- Default GPU: NVIDIA L40S (lower cost, strong throughput).
- Optional: A100-40GB if you want extra headroom.
- Training budget target: 4 hours.
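The 4-hour budget can be enforced inside the training loop itself rather than left to the scheduler; a sketch with a hypothetical per-step callable standing in for one optimizer step:

```python
import time

def train_with_budget(train_step, max_seconds: float) -> int:
    """Run training steps until the wall-clock budget (e.g. 4 h = 14,400 s)
    is exhausted, then stop cleanly so the adapter can still be saved."""
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < max_seconds:
        train_step()  # hypothetical: one optimizer step
        steps += 1
    return steps

# Tiny demo: a no-op step under a 0.05-second budget.
n = train_with_budget(lambda: None, max_seconds=0.05)
print(n > 0)  # True
```

Stopping on wall-clock time rather than step count keeps a Modal run from overshooting the budget when throughput on the chosen GPU is lower than expected.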
src/
data/
training/
model/
backtest/
bot/
api/
monitoring/
configs/
artifacts/
data/
reports/
.github/workflows/
- Create repo secrets: `MODAL_TOKEN_ID`, `MODAL_TOKEN_SECRET`, `NVIDIA_NIM_API_KEY`, `SEC_USER_AGENT` (required for SEC downloads).
  - Optional: `NEWSAPI_KEY`, `FRED_API_KEY`, `WANDB_API_KEY`
- Trigger training workflow: GitHub Actions → Train LoRA (Modal GPU).
- The training workflow uploads `artifacts.tar.gz`. Extract it into `artifacts/` before CPU runs.
- Run CPU backtest: GitHub Actions → Backtest CPU.
The default configs already target these tickers for the 2023-2024 test period.
python -m src.cli build-data --config configs/data.yaml
python -m src.cli build-corpus --config configs/data.yaml
python -m src.training.train_lora --config configs/training.yaml
python -m src.backtest.engine --config configs/backtest.yaml
In GitHub Actions, run Backtest CPU after training completes.
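The `python -m src.cli …` commands above imply a small subcommand dispatcher; a minimal `argparse` sketch of that shape (the real `src/cli.py` may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Subcommand layout matching the quickstart commands:
    build-data / build-corpus, each taking a required --config path."""
    parser = argparse.ArgumentParser(prog="src.cli")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("build-data", "build-corpus"):
        cmd = sub.add_parser(name)
        cmd.add_argument("--config", required=True)
    return parser

args = build_parser().parse_args(["build-data", "--config", "configs/data.yaml"])
print(args.command, args.config)  # build-data configs/data.yaml
```

Keeping every subcommand behind one `--config` argument means GitHub Actions jobs only vary by the YAML file they pass in.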
- Free APIs have rate limits and partial history for intraday data.
- Full multimodal corpus at scale needs substantial CPU memory for preprocessing.
- Transcript/news ingestion is stubbed unless you provide sources.
- NIM inference requires a valid API key and model access.
- GPU: `src/training/train_lora.py` via Modal (L40S or A100).
- CPU: feature engineering, corpus build, backtests, bot, API.
- `artifacts/tabular_model.ubj`
- `artifacts/lora_adapter/`
- `artifacts/training_metadata.json`
- `reports/backtest/*`
TradingView proprietary data is not publicly accessible. The pipeline includes TradingView-equivalent indicators and can ingest TradingView CSV exports if you provide `TRADINGVIEW_CSV_PATH` or `TRADINGVIEW_CSV_URL` as a secret/environment variable.
Use the Build Large Corpus (Modal CPU) workflow to generate a large dataset in the Modal volume. It outputs a small `corpus_summary.json` artifact with row counts and sizes.
If you want Lightning.ai runs to survive interruptions without chaining free interactive sessions indefinitely, use the included Lightning run workflow:
- Configure `lightning_run.yaml`
- Add GitHub secrets: `LIGHTNING_USERNAME`, `LIGHTNING_API_KEY`
- Run the Launch Lightning Auto-Resume Run workflow
- Let Lightning Progress Snapshot archive status and checkpoint manifests every 4 hours
Details: lightning_autoresume.md
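Auto-resume hinges on picking the newest checkpoint out of the archived manifests. A stdlib sketch of that step, assuming an illustrative `ckpt-step-NNNNNN` naming scheme (the real manifest format lives in the workflow):

```python
import re

def latest_checkpoint(names):
    """Pick the highest-step checkpoint from a manifest listing.
    Assumes names like 'ckpt-step-001200.pt' (illustrative scheme)."""
    best, best_step = None, -1
    for name in names:
        m = re.search(r"step-(\d+)", name)
        if m and int(m.group(1)) > best_step:
            best, best_step = name, int(m.group(1))
    return best

print(latest_checkpoint(["ckpt-step-000400.pt", "ckpt-step-001200.pt"]))
# ckpt-step-001200.pt
```

Parsing the step out of the filename, rather than trusting file modification times, keeps resume deterministic after an archive/restore round trip.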
If you want to avoid Modal for CPU, use the Build Corpus Chunk (GitHub CPU) workflow. It writes each chunk to external S3-compatible storage (OCI Object Storage works) and is limited by GitHub’s 6-hour runner cap, so keep chunk_size small.
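Keeping `chunk_size` under the 6-hour runner cap is a throughput calculation; a sketch with assumed numbers (the 50 rows/s figure and 20% safety margin are illustrative, not measured):

```python
def max_chunk_size(rows_per_second: float, cap_hours: float = 6.0,
                   safety: float = 0.8) -> int:
    """Largest chunk a single GitHub runner can finish inside the
    6-hour cap, with a safety margin for setup and upload time."""
    return int(rows_per_second * cap_hours * 3600 * safety)

# Assumed throughput of 50 rows/s leaves room for ~864,000 rows per chunk.
print(max_chunk_size(50.0))  # 864000
```

Measure your actual rows/s on one small run first, then size chunks from that; a chunk killed at the 6-hour mark wastes the whole runner's work unless the chunk itself checkpoints.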
If you prefer OCI CPU, use the Launch OCI CPU VM (Time-Boxed) workflow. It launches a VM with a strict auto-shutdown window and provides an instance id artifact. Terminate any VM with Terminate OCI VM.
Required GitHub secrets: `OCI_TENANCY_OCID`, `OCI_USER_OCID`, `OCI_FINGERPRINT`, `OCI_REGION`, `OCI_PRIVATE_KEY`, `OCI_AD`, `OCI_COMPARTMENT_OCID`, `OCI_SUBNET_OCID`, `OCI_IMAGE_OCID`
If you might switch Modal accounts mid-run, enable external checkpointing to S3-compatible storage. Each chunk uploads to a bucket so a new Modal account can continue and the merge step can download chunks.
Add these GitHub Secrets (optional):

- `CHECKPOINT_S3_BUCKET`
- `CHECKPOINT_S3_ACCESS_KEY`
- `CHECKPOINT_S3_SECRET_KEY`
- `CHECKPOINT_S3_REGION` (default `us-east-1` if omitted)
- `CHECKPOINT_S3_ENDPOINT` (for R2/MinIO)
- `CHECKPOINT_S3_PREFIX` (default `train-once`)
- `CHECKPOINT_S3_USE_PATH_STYLE` (`true` for MinIO)
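The secrets above map onto a client configuration roughly like this; a stdlib-only sketch that mirrors the documented defaults (the actual workflow code may assemble its S3 client differently):

```python
def s3_checkpoint_config(env: dict) -> dict:
    """Build an S3-compatible client config from the CHECKPOINT_* secrets,
    applying the documented defaults for region and prefix."""
    return {
        "bucket": env["CHECKPOINT_S3_BUCKET"],
        "access_key": env["CHECKPOINT_S3_ACCESS_KEY"],
        "secret_key": env["CHECKPOINT_S3_SECRET_KEY"],
        "region": env.get("CHECKPOINT_S3_REGION", "us-east-1"),
        "endpoint": env.get("CHECKPOINT_S3_ENDPOINT"),  # set for R2/MinIO
        "prefix": env.get("CHECKPOINT_S3_PREFIX", "train-once"),
        # MinIO typically needs path-style addressing.
        "path_style": env.get("CHECKPOINT_S3_USE_PATH_STYLE", "false") == "true",
    }

cfg = s3_checkpoint_config({
    "CHECKPOINT_S3_BUCKET": "ckpts",
    "CHECKPOINT_S3_ACCESS_KEY": "AK",
    "CHECKPOINT_S3_SECRET_KEY": "SK",
})
print(cfg["region"], cfg["prefix"])  # us-east-1 train-once
```

Because every field comes from secrets, a new Modal account only needs these same GitHub secrets set to pick up and merge chunks left by the old one.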
See docs/pine_sources.md for open-source Pine Script indicator references and licenses.