Skip to content

Latest commit

 

History

History
749 lines (566 loc) · 17.2 KB

File metadata and controls

749 lines (566 loc) · 17.2 KB

Code Guidelines

This document outlines coding standards, patterns, and best practices for contributing to Transcript Create.

Table of Contents

General Principles

Code Quality

  • Clarity over cleverness: Write code that's easy to understand
  • Consistency: Follow existing patterns in the codebase
  • DRY (Don't Repeat Yourself): Extract common logic into reusable functions
  • YAGNI (You Aren't Gonna Need It): Don't add features or abstractions prematurely
  • Test your code: Write tests that verify your changes work correctly

Performance

  • Optimize when necessary: Profile before optimizing
  • Async when beneficial: Use async/await for I/O-bound operations
  • Database efficiency: Use appropriate indexes, batch operations, avoid N+1 queries
  • Minimize dependencies: Only add dependencies that provide significant value

Maintainability

  • Keep functions small: Each function should do one thing well
  • Limit nesting: Max 3 levels of indentation
  • Meaningful variable names: Prefer user_email over ue
  • Comment when needed: Explain "why", not "what"

Python Style Guide

PEP 8 Compliance

We follow PEP 8 with the following specifics:

# Line length: 120 characters (configured in pyproject.toml)
# Indentation: 4 spaces (never tabs)
# String quotes: Double quotes " preferred (configurable)

Type Hints

Always use type hints for function signatures:

# Good
def get_video_by_id(video_id: int) -> Optional[Video]:
    """Fetch a video by its ID."""
    return db.query(Video).filter(Video.id == video_id).first()

# Bad - missing type hints
def get_video_by_id(video_id):
    return db.query(Video).filter(Video.id == video_id).first()

Use modern type syntax (Python 3.10+):

from typing import Optional, Union, Any

# Good - Python 3.10+ syntax
def process_video(video_id: int | None = None) -> dict[str, Any]:
    pass

# Also acceptable - backwards compatible
def process_video(video_id: Optional[int] = None) -> dict:
    pass

Imports

Order and style:

# 1. Standard library imports
import os
import sys
from datetime import datetime
from pathlib import Path

# 2. Third-party imports
import sqlalchemy
from fastapi import APIRouter, Depends, HTTPException
from pydantic import BaseModel

# 3. Local application imports
from app.crud import get_video
from app.db import get_db
from app.settings import settings

# Tools: isort automatically organizes imports

Function Definitions

Docstrings for public functions:

def transcribe_chunk(
    audio_path: Path,
    model,
    start_offset: float = 0.0
) -> list[dict]:
    """
    Transcribe an audio chunk using Whisper.

    Args:
        audio_path: Path to the audio file
        model: Loaded Whisper model instance
        start_offset: Time offset in seconds to add to timestamps

    Returns:
        List of segment dictionaries with text, start, end, and confidence

    Raises:
        RuntimeError: If transcription fails
        FileNotFoundError: If audio file doesn't exist
    """
    # Implementation...

Keep functions focused:

# Good - single responsibility
def download_audio(video_url: str) -> Path:
    """Download audio from YouTube."""
    # Download logic only
    pass

def transcode_audio(input_path: Path, output_path: Path) -> None:
    """Convert audio to 16kHz mono WAV."""
    # Transcoding logic only
    pass

# Bad - doing too much
def download_and_process_audio(video_url: str) -> Path:
    # Download, transcode, chunk, everything...
    pass

Error Handling

Use specific exceptions:

# Good - specific exceptions
def get_video_or_404(video_id: int) -> Video:
    video = db.query(Video).filter(Video.id == video_id).first()
    if not video:
        raise HTTPException(status_code=404, detail=f"Video {video_id} not found")
    return video

# Bad - generic exceptions
def get_video(video_id: int) -> Video:
    video = db.query(Video).filter(Video.id == video_id).first()
    if not video:
        raise Exception("Not found")
    return video

Context managers for resource cleanup:

# Good
with open(audio_path, "rb") as f:
    data = f.read()

with engine.begin() as conn:
    result = conn.execute(text("SELECT * FROM videos"))

# Bad
f = open(audio_path, "rb")
data = f.read()
f.close()  # May not execute if exception occurs

Database Access

Use parameterized queries:

# Good - parameterized
query = text("SELECT * FROM videos WHERE youtube_id = :youtube_id")
result = conn.execute(query, {"youtube_id": youtube_id})

# Bad - SQL injection risk
query = f"SELECT * FROM videos WHERE youtube_id = '{youtube_id}'"
result = conn.execute(text(query))

Use transactions:

# Good - explicit transaction
with engine.begin() as conn:
    conn.execute(text("UPDATE videos SET state = 'processing' WHERE id = :id"), {"id": video_id})
    conn.execute(text("INSERT INTO transcripts ..."))
    # Automatically commits on success, rolls back on exception

# Bad - autocommit mode with multiple writes
conn.execute(text("UPDATE videos ..."))
conn.execute(text("INSERT INTO transcripts ..."))  # May fail, leaving inconsistent state

Configuration

Use settings, not hardcoded values:

# Good - configurable
from app.settings import settings

chunk_seconds = settings.CHUNK_SECONDS
whisper_model = settings.WHISPER_MODEL

# Bad - hardcoded
chunk_seconds = 900
whisper_model = "large-v3"

Logging

Use structured logging:

import logging

logger = logging.getLogger(__name__)

# Good - structured with context
logger.info("Starting transcription", extra={
    "video_id": video_id,
    "model": model_name,
    "duration_seconds": duration
})

# Also good - f-string for readability
logger.info(f"Transcribing video {video_id} with {model_name}")

# Bad - concatenation
logger.info("Transcribing video " + str(video_id) + " with " + model_name)

TypeScript / React Style Guide

TypeScript Basics

Always use TypeScript, avoid any:

// Good - explicit types
interface Video {
  id: number;
  title: string;
  duration: number;
}

function displayVideo(video: Video): void {
  console.log(video.title);
}

// Bad - any
function displayVideo(video: any) {
  console.log(video.title);
}

Use interfaces for objects:

// Good
interface SearchParams {
  query: string;
  source: 'native' | 'youtube';
  page?: number;
}

// Also acceptable for simple types
type SearchSource = 'native' | 'youtube';

React Components

Functional components with hooks:

// Good - functional component
import React, { useState, useEffect } from 'react';

interface VideoPlayerProps {
  videoId: number;
  autoplay?: boolean;
}

export const VideoPlayer: React.FC<VideoPlayerProps> = ({ videoId, autoplay = false }) => {
  const [isPlaying, setIsPlaying] = useState(autoplay);
  
  useEffect(() => {
    // Effect logic
  }, [videoId]);
  
  return (
    <div className="video-player">
      {/* JSX */}
    </div>
  );
};

// Bad - class component (unless needed)
class VideoPlayer extends React.Component {
  // Avoid for new components
}

Props destructuring:

// Good
export const SearchBar: React.FC<SearchBarProps> = ({ 
  query, 
  onSearch, 
  placeholder = 'Search transcripts...' 
}) => {
  // Use props directly
  return <input value={query} onChange={onSearch} placeholder={placeholder} />;
};

// Bad
export const SearchBar: React.FC<SearchBarProps> = (props) => {
  return <input value={props.query} onChange={props.onSearch} placeholder={props.placeholder} />;
};

State Management

Use appropriate hooks:

// Good - useState for simple state
const [searchQuery, setSearchQuery] = useState('');

// Good - useReducer for complex state
const [state, dispatch] = useReducer(searchReducer, initialState);

// Good - useContext for shared state
const { user } = useContext(UserContext);

API Calls

Use async/await with proper error handling:

// Good
const fetchVideo = async (videoId: number): Promise<Video> => {
  try {
    const response = await axios.get<Video>(`/api/videos/${videoId}`);
    return response.data;
  } catch (error) {
    if (axios.isAxiosError(error)) {
      throw new Error(`Failed to fetch video: ${error.message}`);
    }
    throw error;
  }
};

// Bad
const fetchVideo = (videoId: number) => {
  return axios.get(`/api/videos/${videoId}`)
    .then(res => res.data)
    .catch(err => console.error(err));  // Swallows error
};

Styling

Use Tailwind utility classes:

// Good - Tailwind utilities
<div className="flex items-center space-x-4 p-6 bg-white rounded-lg shadow-md">
  <h2 className="text-xl font-semibold text-gray-800">{title}</h2>
</div>

// Avoid inline styles (unless dynamic)
<div style={{ display: 'flex', padding: '24px' }}>
  <h2 style={{ fontSize: '20px' }}>{title}</h2>
</div>

Naming Conventions

Python

# Variables and functions: snake_case
video_id = 123
def process_video(video_id: int) -> None:
    pass

# Classes: PascalCase
class VideoProcessor:
    pass

# Constants: UPPER_SNAKE_CASE
MAX_CHUNK_SECONDS = 900
DEFAULT_MODEL = "large-v3"

# Private methods: _leading_underscore
def _internal_helper():
    pass

# File names: snake_case.py
# video_processor.py, audio_utils.py

TypeScript / React

// Variables and functions: camelCase
const videoId = 123;
function processVideo(videoId: number): void {
  // ...
}

// React components: PascalCase
const VideoPlayer = () => { /* ... */ };
export const SearchBar: React.FC<Props> = () => { /* ... */ };

// Interfaces and Types: PascalCase
interface Video {
  // ...
}
type SearchSource = 'native' | 'youtube';

// Constants: UPPER_SNAKE_CASE
const MAX_RESULTS_PER_PAGE = 50;
const API_BASE_URL = 'http://localhost:8000';

// File names: PascalCase for components, camelCase for utilities
// VideoPlayer.tsx, SearchBar.tsx
// apiClient.ts, searchUtils.ts

Database

-- Tables: snake_case, plural
CREATE TABLE videos (
  id SERIAL PRIMARY KEY,
  youtube_id TEXT NOT NULL
);

-- Columns: snake_case
CREATE TABLE transcripts (
  video_id INTEGER REFERENCES videos(id),
  created_at TIMESTAMPTZ DEFAULT now()
);

-- Indexes: idx_table_column
CREATE INDEX idx_videos_youtube_id ON videos(youtube_id);

-- Enums: snake_case, singular
CREATE TYPE job_state AS ENUM ('pending', 'processing', 'completed', 'failed');

Error Handling

Python

Be specific about exceptions:

# Good
try:
    audio_path = download_audio(url)
except HTTPError as e:
    logger.error(f"Failed to download audio: {e}")
    raise
except IOError as e:
    logger.error(f"Failed to save audio: {e}")
    raise

# Bad
try:
    audio_path = download_audio(url)
except Exception as e:
    pass  # Swallows all errors

Provide context in exceptions:

# Good
if not audio_path.exists():
    raise FileNotFoundError(f"Audio file not found: {audio_path}")

# Bad
if not audio_path.exists():
    raise FileNotFoundError()

TypeScript

Handle promise rejections:

// Good
async function fetchTranscript(videoId: number): Promise<Transcript> {
  try {
    const response = await api.get(`/videos/${videoId}/transcript`);
    return response.data;
  } catch (error) {
    logger.error('Failed to fetch transcript', { videoId, error });
    throw new Error(`Failed to fetch transcript for video ${videoId}`);
  }
}

// Bad
async function fetchTranscript(videoId: number) {
  const response = await api.get(`/videos/${videoId}/transcript`);
  return response.data;  // Unhandled rejection
}

Testing Requirements

Test Coverage

  • Aim for 70%+ coverage for backend code
  • Test critical paths thoroughly
  • Test edge cases and error conditions

Backend Tests (pytest)

# tests/test_video_processing.py

def test_download_audio_success(tmp_path):
    """Test successful audio download."""
    url = "https://youtube.com/watch?v=test"
    audio_path = download_audio(url, output_dir=tmp_path)
    
    assert audio_path.exists()
    assert audio_path.suffix == ".m4a"

def test_download_audio_invalid_url():
    """Test audio download with invalid URL."""
    with pytest.raises(ValueError, match="Invalid YouTube URL"):
        download_audio("not-a-url")

def test_transcode_audio_creates_wav(tmp_path):
    """Test audio transcoding to WAV format."""
    input_path = tmp_path / "input.m4a"
    output_path = tmp_path / "output.wav"
    
    # Setup: create test audio file
    create_test_audio(input_path)
    
    transcode_audio(input_path, output_path)
    
    assert output_path.exists()
    # Verify WAV format properties
    assert get_audio_format(output_path) == "wav"
    assert get_sample_rate(output_path) == 16000

Frontend Tests (Vitest)

// tests/SearchBar.test.tsx

import { render, screen, fireEvent } from '@testing-library/react';
import { SearchBar } from './SearchBar';

describe('SearchBar', () => {
  it('renders with placeholder', () => {
    render(<SearchBar query="" onSearch={() => {}} />);
    expect(screen.getByPlaceholderText('Search transcripts...')).toBeInTheDocument();
  });

  it('calls onSearch when input changes', () => {
    const onSearch = vi.fn();
    render(<SearchBar query="" onSearch={onSearch} />);
    
    const input = screen.getByRole('textbox');
    fireEvent.change(input, { target: { value: 'test query' } });
    
    expect(onSearch).toHaveBeenCalledWith('test query');
  });

  it('displays current query value', () => {
    render(<SearchBar query="existing query" onSearch={() => {}} />);
    expect(screen.getByDisplayValue('existing query')).toBeInTheDocument();
  });
});

Documentation Expectations

Code Comments

When to comment:

  • Complex algorithms or business logic
  • Non-obvious design decisions
  • Workarounds for bugs or limitations
  • Public API functions

When NOT to comment:

  • Obvious code (i++ # increment i)
  • Variable declarations that are self-explanatory
  • Restating what the code does
# Good - explains why
def calculate_chunk_size(duration: float) -> int:
    """
    Calculate optimal chunk size for audio processing.
    
    We use 15-minute chunks to balance memory usage with processing
    efficiency. Smaller chunks increase overhead, larger chunks risk
    OOM errors with large models.
    """
    return min(duration, settings.CHUNK_SECONDS)

# Bad - restates the obvious
def add_numbers(a: int, b: int) -> int:
    """Add two numbers together."""
    # Add a and b
    result = a + b  # Store result
    return result  # Return the result

README and Documentation

  • Keep README up to date with setup instructions
  • Document API endpoints in docstrings (OpenAPI auto-generates from these)
  • Add migration guides for breaking changes
  • Update CHANGELOG.md for user-facing changes

Security Guidelines

Input Validation

# Good - validate and sanitize
from pydantic import BaseModel, validator

class JobCreate(BaseModel):
    url: str
    kind: Literal["single", "channel"]
    
    @validator("url")
    def validate_youtube_url(cls, v):
        if not v.startswith("https://youtube.com") and not v.startswith("https://www.youtube.com"):
            raise ValueError("Must be a valid YouTube URL")
        return v

SQL Injection Prevention

# Good - parameterized queries
query = text("SELECT * FROM videos WHERE youtube_id = :youtube_id")
result = conn.execute(query, {"youtube_id": youtube_id})

# Bad - string concatenation
query = f"SELECT * FROM videos WHERE youtube_id = '{youtube_id}'"

Secrets Management

# Good - environment variables
from app.settings import settings
api_key = settings.STRIPE_API_KEY

# Bad - hardcoded
api_key = "sk_live_abc123"  # Never commit secrets!

Authentication

# Good - require authentication
from app.auth import get_current_user

@router.get("/admin/users")
async def list_users(user: User = Depends(get_current_user)):
    if not user.is_admin:
        raise HTTPException(status_code=403, detail="Admin access required")
    return get_all_users()

Tools and Automation

Linting and Formatting

Run before committing:

# Python
ruff check app/ worker/ scripts/
black app/ worker/ scripts/
isort app/ worker/ scripts/
mypy app/ worker/

# TypeScript
cd frontend
npm run lint
npm run format
npx tsc --noEmit

Pre-commit Hooks

Install once:

pre-commit install

Runs automatically on git commit:

  • Linting (ruff)
  • Formatting (black, prettier)
  • Import sorting (isort)
  • Secret detection (gitleaks)
  • Trailing whitespace removal

Questions?

Happy coding! 🚀