feat(sparql-anything): add SparqlAnythingConverter for chunked non-RDF to RDF conversion #511
Open
ddeboer wants to merge 2 commits into
…F to RDF conversion
- Run the SPARQL Anything CLI once per input chunk (via an @lde/task-runner)
to bound memory use, then stream-concatenate the per-chunk N-Triples into
one file.
- Substitute each chunk's path into the query's `{SOURCE}` placeholder through
a temporary `-q` file, avoiding shell-escaping a large inline SPARQL query.
- Abort the whole conversion when any chunk's process exits non-zero, so a
crashed chunk can never be silently dropped from the output.
- Scaffold the @lde/sparql-anything package (0.1.0) and list it in the root
README packages table and architecture diagram.
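The substitution and concatenation steps described in the commit message can be sketched roughly as follows. This is a minimal illustration using only Node.js built-ins, not the code in this PR: the helper names `substituteSource`, `writeChunkQuery`, and `concatenateNTriples` and the `query.rq` file name are hypothetical, and the sketch assumes every per-chunk N-Triples file ends with a newline.

```typescript
import { once } from 'node:events';
import { createReadStream, createWriteStream } from 'node:fs';
import { mkdtemp, readFile, writeFile } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';
import { finished } from 'node:stream/promises';

// Hypothetical helper: replace every {SOURCE} placeholder with the chunk path.
function substituteSource(query: string, chunkPath: string): string {
  return query.split('{SOURCE}').join(chunkPath);
}

// Hypothetical helper: write the substituted query to a temporary file whose
// path can be handed to the CLI's -q option, so the query text itself never
// has to be shell-escaped.
async function writeChunkQuery(
  queryFile: string,
  chunkPath: string,
): Promise<string> {
  const template = await readFile(queryFile, 'utf8');
  const dir = await mkdtemp(join(tmpdir(), 'sparql-anything-'));
  const chunkQueryFile = join(dir, 'query.rq');
  await writeFile(chunkQueryFile, substituteSource(template, chunkPath), 'utf8');
  return chunkQueryFile;
}

// Hypothetical helper: append each per-chunk N-Triples file to the output,
// in order, without loading a whole file into memory.
// Assumes each part ends with a newline, so statements never run together.
async function concatenateNTriples(
  partPaths: string[],
  outputPath: string,
): Promise<void> {
  const output = createWriteStream(outputPath);
  for (const partPath of partPaths) {
    for await (const data of createReadStream(partPath)) {
      // Respect backpressure: wait for the writer to drain before continuing.
      if (!output.write(data)) {
        await once(output, 'drain');
      }
    }
  }
  output.end();
  await finished(output);
}
```

Writing the parts strictly in input order keeps the output deterministic, which is what makes a concatenation-order test meaningful.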
80488b3 to
9e59373
Summary
Adds the `@lde/sparql-anything` package with a `SparqlAnythingConverter`, the first LDE piece of the geonames-rdf migration. It ports `geonames-rdf`'s `map.sh` convert loop to TypeScript on top of an `@lde/task-runner`. Convert-only scope (no selector, transform, validation, or chaining).

What it does
`new SparqlAnythingConverter({ queryFile, jarPath, adminCodesFile, taskRunner }).convert(chunkPaths, outputPath)`:
- Runs the SPARQL Anything CLI once per input chunk to bound memory use (`fx:ondisk` plus per-process isolation), then stream-concatenates the per-chunk N-Triples into a single output file. Streaming keeps multi-gigabyte outputs out of memory; N-Triples has no prefixes or document structure, so plain concatenation is always valid.
- Substitutes each chunk's path into the query's `{SOURCE}` placeholder through a temporary `-q` file, rather than `map.sh`'s inline `--query "$(sed …)"`. A temp file sidesteps shell-escaping a large SPARQL query passed to a `shell: true` runner.
- Aborts the whole conversion when any chunk's process exits non-zero (`wait()` rejects), so a crashed chunk can never be silently dropped from the output. This is the file-output equivalent of `map.sh`'s `set -e`.

It is not a pipeline `Executor`: it produces a file, not quads, so it sits standalone rather than inside `Pipeline → Stage → Executor`. Chunking stays in the caller (geonames `download.sh`); the converter consumes pre-split chunks.

Design notes
- Delegates process execution to the injected `TaskRunner` rather than spawning processes itself, so it runs unchanged on the host (`NativeTaskRunner`), in Docker, or anywhere else.
- The converter lives in `src/sparql-anything-converter.ts` (re-exported from `index.ts`) so V8 coverage measures it; the base config excludes `**/index.ts`.

Tests
Unit tests drive the converter through a stubbed `TaskRunner` (vertical TDD slices): the per-chunk CLI contract (`-q` temp file, `--load`, `--format NT`, `--output`), `{SOURCE}` substitution, multi-chunk concatenation order, and the fail-fast abort contract. 100% coverage.

The real-jar integration test (actual SPARQL Anything jar over a tiny CSV fixture) and the geonames-rdf consumption bridge (how the shell/Docker repo invokes a TS converter) are deferred follow-ups.
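
For readers unfamiliar with the fail-fast contract, the sketch below shows one way a stubbed runner can exercise it. Everything here is an assumption made for illustration: the `SketchTaskRunner` interface (a `run` that returns a task handle and a `wait` that rejects on failure), the `convertChunks` loop, the `stubRunner` helper, and the placeholder command string are hypothetical and may not match the actual `@lde/task-runner` API or the tests in this PR.

```typescript
// Assumed runner shape, for illustration only; the real @lde/task-runner
// interface may differ.
interface SketchTaskRunner {
  run(command: string): Promise<number>; // resolves to a task handle
  wait(task: number): Promise<void>; // rejects when the process fails
}

// Hypothetical convert loop: process chunks in order and let a rejected
// wait() propagate, so no later chunk runs after a failure.
async function convertChunks(
  runner: SketchTaskRunner,
  chunkPaths: string[],
  commandFor: (chunkPath: string) => string,
): Promise<string[]> {
  const converted: string[] = [];
  for (const chunkPath of chunkPaths) {
    const task = await runner.run(commandFor(chunkPath));
    await runner.wait(task); // a rejection here aborts the whole conversion
    converted.push(chunkPath);
  }
  return converted;
}

// Hypothetical stub: records every command and fails the one that mentions
// the given chunk, standing in for a process that exits non-zero.
function stubRunner(failOn: string): SketchTaskRunner & { commands: string[] } {
  const commands: string[] = [];
  return {
    commands,
    async run(command: string): Promise<number> {
      commands.push(command);
      return commands.length - 1;
    },
    async wait(task: number): Promise<void> {
      if (commands[task].includes(failOn)) {
        throw new Error(`process failed: ${commands[task]}`);
      }
    },
  };
}
```

A test built on this stub can assert two things: the conversion promise rejects, and the recorded command list stops at the failing chunk.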