This repo ships two packages that together give BioFSharp users an INSDC (International Nucleotide Sequence Database Collaboration) read/write surface:
- `BioFSharp.FileFormats.INSDC`: a C# library whose types are auto-generated from the official ENA XSD schemas via `dotnet-xscgen` (the `XmlSchemaClassGenerator` CLI). It is C# rather than F# because there is no F# equivalent of `XmlSchemaClassGenerator`.
- `BioFSharp.IO.INSDC`: an F# library that wraps the generated C# type model with idiomatic `read` / `readString` / `write` / `writeString` functions per INSDC entity.
The naming mirrors the base BioFSharp namespace layout so both packages can be picked up directly as dependencies of BioFSharp. Both target netstandard2.0 for parity with the rest of the BioFSharp ecosystem.
The repo was scaffolded from the BioFSharp.XYZ template and still carries placeholder content that must be cleared before the real work lands:
- `README.md`: replace the template content with an INSDC-specific README.
- `RELEASE_NOTES.md`: seed an initial entry, e.g. `0.0.0-preview.1`.
- Both `src/**/*.fsproj` files contain `BioFSharp.XYZ` in `PackageProjectUrl`, `RepositoryUrl`, `FsDocsLicenseLink`, and `FsDocsReleaseNotesLink`; retarget every URL to `BioFSharp.INSDC`. Fill in `Authors`, `Description`, `Summary`.
- Delete the template `Library.fs` in both `src/` projects.
- `tests/BioFSharp.INSDC.Tests/Tests.fs` still references `BioFSharp.XYZ.BioTalk`; drop the test and replace it with INSDC tests (section 5).
- `docs/index.fsx` is left as-is (no docs site for this repo). Confirm `docs/img/icon.png` exists so `pack` does not fail.
- Rename `src/BioFSharp.FileFormats.INSDC/BioFSharp.FileFormats.INSDC.fsproj` → `.csproj`.
- Switch SDK config to a standard C# project: `<Project Sdk="Microsoft.NET.Sdk">`, drop `<Compile Include="Library.fs" />`.
- Keep `<TargetFramework>netstandard2.0</TargetFramework>`, `<GenerateDocumentationFile>true</GenerateDocumentationFile>`, the symbol package settings, and the icon `<None Include="..\..\docs\img\icon.png" .../>`.
- Update `BioFSharp.INSDC.slnx` to reference the new `.csproj` path.
The generator is the dotnet-xscgen package. Install it locally so the version is pinned and reproducible:
```shell
dotnet new tool-manifest   # one-time, creates .config/dotnet-tools.json
dotnet tool install dotnet-xscgen --version 3.0.1270
```

Commit `.config/dotnet-tools.json`. Contributors run `dotnet tool restore` once after clone. Invoke the tool as `dotnet xscgen ...` from the repo root.
Download the XSDs from https://ftp.ebi.ac.uk/pub/databases/ena/doc/xsd/sra_1_5/ (latest available — sra_1_6 does not exist). Commit them under src/BioFSharp.FileFormats.INSDC/schemas/ so generation is reproducible offline.
All schemas present in sra_1_5/ are in scope for v1. At minimum:
- `SRA.project.xsd` (BioProject)
- `SRA.study.xsd`
- `SRA.sample.xsd` (Sample / BioSample)
- `SRA.experiment.xsd`
- `SRA.run.xsd`
- `SRA.analysis.xsd`
- `SRA.submission.xsd`
- `SRA.receipt.xsd`
- `SRA.common.xsd` (shared include)
Pull the whole directory rather than cherry-picking, so any siblings (e.g. EGA-specific schemas) are also covered.
Do not wire generation into every build — the generated .cs files are committed and compile on their own. Instead add a single FAKE target regenerateInsdcTypes (in build/BasicTasks.fs, or a new build/GeneratorTasks.fs) that:
- wipes `src/BioFSharp.FileFormats.INSDC/Generated/`
- invokes `dotnet xscgen` against every `.xsd` in `src/BioFSharp.FileFormats.INSDC/schemas/`
- outputs into `src/BioFSharp.FileFormats.INSDC/Generated/`
- maps every xsd → namespace `BioFSharp.FileFormats.INSDC`
- documents the chosen generator flags in a comment on the target (e.g. `--integer-type=System.Int64`, nullable-element handling)
Contributors run build.cmd regenerateInsdcTypes (or build.sh regenerateInsdcTypes) only when schemas change. buildSolution does not depend on this target.
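The steps listed for the target can be sketched in plain F#, independent of the FAKE wiring. This is a minimal sketch, not the repo's actual target: the function names `xscgenArgs` and `regenerateInsdcTypes` (as a plain function) are hypothetical, and it assumes only the `-n` (namespace mapping) and `-o` (output folder) options described in the XmlSchemaClassGenerator README. The exact mapping strings depend on the target namespaces declared in the ENA schemas, so they are passed in rather than hard-coded.

```fsharp
open System.Diagnostics
open System.IO

/// Builds the argument list for `dotnet xscgen`.
/// Assumption: each `-n` value is either `xmlNamespace=CSharpNamespace` or a
/// bare C# namespace (mapped to the empty XML namespace), and `-o` sets the
/// output folder. Check `dotnet xscgen --help` for the pinned version.
let xscgenArgs (namespaceMappings: string list) (outputDir: string) (xsdFiles: string list) : string list =
    [ yield "xscgen"
      for mapping in namespaceMappings do
          yield "-n"
          yield mapping
      yield "-o"
      yield outputDir
      yield! xsdFiles ]

/// Hypothetical body for the regeneration target: wipe Generated/, then run
/// the generator over every .xsd under schemas/.
let regenerateInsdcTypes (projectDir: string) (namespaceMappings: string list) : unit =
    let schemasDir = Path.Combine(projectDir, "schemas")
    let generatedDir = Path.Combine(projectDir, "Generated")

    // Wipe previous output so stale files never survive a schema change.
    if Directory.Exists generatedDir then
        Directory.Delete(generatedDir, true)
    Directory.CreateDirectory generatedDir |> ignore

    // Sort for a deterministic invocation order.
    let xsdFiles =
        Directory.GetFiles(schemasDir, "*.xsd") |> Array.sort |> Array.toList

    let psi = ProcessStartInfo("dotnet")
    psi.UseShellExecute <- false
    for arg in xscgenArgs namespaceMappings generatedDir xsdFiles do
        psi.ArgumentList.Add arg

    use proc = Process.Start psi
    proc.WaitForExit()
    if proc.ExitCode <> 0 then
        failwithf "dotnet xscgen exited with code %d" proc.ExitCode
```

Keeping the argument construction in a pure function makes the chosen flags easy to document and to check without running the generator.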
Replace the template Library.fs with one F# source file per INSDC entity:
- `BioProject.fs`
- `Study.fs`
- `Sample.fs`
- `Experiment.fs`
- `Run.fs`
- `Analysis.fs`
- `Submission.fs`
- `Receipt.fs`
- `Internal/XmlSerializer.fs`: shared helper wrapping `System.Xml.Serialization.XmlSerializer` for generic `read` / `readString` / `write` / `writeString`.
Namespace: BioFSharp.IO.INSDC. INSDC files are XML, so there is no readLines (XML is not line-based). Every entity module exposes the same four functions:
```fsharp
module BioProject =
    /// Read INSDC project XML records from disk.
    val read : filePath: string -> seq<ProjectType>
    /// Parse INSDC project XML records from an in-memory string.
    val readString : xml: string -> seq<ProjectType>
    /// Write an INSDC project to disk as XML.
    val write : filePath: string -> project: ProjectType -> unit
    /// Serialize an INSDC project to an XML string.
    val writeString : project: ProjectType -> string
```

`ProjectType` is the C# type emitted by the generator, re-exported via a type alias so consumers do not need to open `BioFSharp.FileFormats.INSDC`. Same shape for set-backed entities (`BioProject`, `Study`, `Sample`, `Experiment`, `Run`, `Analysis`, `Submission`): `read` and `readString` return `seq<_>` because ENA responses commonly use `*_SET` roots. `Receipt` remains single-record because there is no generated `ReceiptSet` type.
- All public members carry `///` XML doc comments.
- `BioFSharp.IO.INSDC.fsproj` adds a `<ProjectReference>` to `BioFSharp.FileFormats.INSDC.csproj`.
- Project metadata (`Authors`, `Description`, repo URLs) retargeted to `BioFSharp.INSDC`.
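The shared helper in `Internal/XmlSerializer.fs` could look roughly like the following. This is a minimal sketch under stated assumptions, not the repo's actual code: the module name `XmlSerializerHelper` and the `DemoProject` type are hypothetical stand-ins (the real types come from `dotnet-xscgen`), and the helper handles a single root object. A per-entity module would still need to unwrap the generated `*_SET` root into a `seq<_>`.

```fsharp
open System.IO
open System.Xml.Serialization

/// Hypothetical stand-in for a generated type, used only to make the sketch runnable.
[<XmlRoot("PROJECT")>]
type DemoProject() =
    [<XmlAttribute("accession")>]
    member val Accession : string = null with get, set
    [<XmlElement("TITLE")>]
    member val Title : string = null with get, set

/// Hypothetical generic helper wrapping System.Xml.Serialization.XmlSerializer.
module XmlSerializerHelper =

    /// Deserialize a single root object of type 'T from an XML string.
    let readString<'T> (xml: string) : 'T =
        let serializer = XmlSerializer(typeof<'T>)
        use reader = new StringReader(xml)
        serializer.Deserialize(reader) :?> 'T

    /// Deserialize a single root object of type 'T from a file.
    let read<'T> (filePath: string) : 'T =
        let serializer = XmlSerializer(typeof<'T>)
        use stream = File.OpenRead filePath
        serializer.Deserialize(stream) :?> 'T

    /// Serialize a value to an XML string.
    /// Note: StringWriter reports UTF-16, so the XML declaration says utf-16.
    let writeString<'T> (value: 'T) : string =
        let serializer = XmlSerializer(typeof<'T>)
        use writer = new StringWriter()
        serializer.Serialize(writer, box value)
        writer.ToString()

    /// Serialize a value to a file, overwriting any existing content.
    let write<'T> (filePath: string) (value: 'T) : unit =
        let serializer = XmlSerializer(typeof<'T>)
        use stream = File.Create filePath
        serializer.Serialize(stream, box value)

// Usage: serialize a demo value and parse it back (the title is a placeholder).
let demoXml =
    XmlSerializerHelper.writeString (DemoProject(Accession = "PRJDB5192", Title = "placeholder title"))
let demoParsed = XmlSerializerHelper.readString<DemoProject> demoXml
```

Each entity module can then stay thin: it fixes `'T` to the generated type and adds the doc comments.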
State: fixture-based coverage is complete for every IO module. The old smoke test was replaced with deep object-graph roundtrip tests against committed ENA fixtures.
One test module per IO module:
- Done in `Tests.fs`: `BioProject`, `Study`, `BioSample`, `Experiment`, `Run`, `Analysis`, `Submission`, `Receipt`.
- Future cleanup: split the current `Tests.fs` into one test module/file per IO module if desired.
Each module covers three cases:
- Read: load a committed ENA XML fixture and assert a handful of key field values.
- Roundtrip: `read >> write >> read` produces a structurally equal value. Pick one comparison strategy (deep object-graph equality is simpler than XML canonicalisation) and document it inline.
- `read` / `readString` parity: parsing the same XML via file vs. string produces equal results.
Keep xunit (already wired). Drop the existing BioTalk test.
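Generated C# classes do not get structural equality for free, so the roundtrip and parity cases need an explicit comparison. One possible shape for a deep object-graph comparison is sketched below. The `deepEquals` function and the `DemoRun` type are hypothetical and not taken from the repo's tests. The sketch assumes an acyclic object graph with public readable properties, which is typical for XSD-generated models but has not been verified against every generated type.

```fsharp
open System
open System.Collections

/// Hypothetical deep comparison: value types and strings by Equals,
/// collections element by element, everything else property by property.
/// Assumes the object graph has no cycles.
let rec deepEquals (a: obj) (b: obj) : bool =
    if isNull a || isNull b then
        isNull a && isNull b
    else
        let t = a.GetType()
        if t <> b.GetType() then
            false
        elif t.IsValueType || t = typeof<string> then
            a.Equals(b)
        else
            match a, b with
            | (:? IEnumerable as xs), (:? IEnumerable as ys) ->
                let left = xs |> Seq.cast<obj> |> Seq.toList
                let right = ys |> Seq.cast<obj> |> Seq.toList
                // Length check first: List.forall2 throws on unequal lengths.
                left.Length = right.Length && List.forall2 deepEquals left right
            | _ ->
                t.GetProperties()
                |> Array.filter (fun p -> p.CanRead && p.GetIndexParameters().Length = 0)
                |> Array.forall (fun p -> deepEquals (p.GetValue(a)) (p.GetValue(b)))

/// Hypothetical stand-in for a generated type with a nested collection.
type DemoRun() =
    member val Accession : string = null with get, set
    member val Files : ResizeArray<string> = ResizeArray<string>() with get, set

let runA = DemoRun(Accession = "DRR072834")
runA.Files.Add "reads_1.fastq"
let runB = DemoRun(Accession = "DRR072834")
runB.Files.Add "reads_1.fastq"
let sameGraph = deepEquals (box runA) (box runB)
```

In a test, the result would be passed to an xunit assertion such as `Assert.True`. The file name in the demo is a placeholder, not data from the fixture.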
Real ENA records, downloaded once and committed:
- Current path: `tests/fixtures/<accession>.xml`
- Source URL pattern: `https://www.ebi.ac.uk/ena/browser/api/xml/<ACCESSION>` (download by hand; do not fetch at test time)
- Committed accessions: `PRJDB5192`, `DRP003416`, `SAMD00064197`, `DRX066772`, `DRR072834`, `ERZ496533`, `DRA005154`.
- `receipt-sample.xml` is hand-crafted because `RECEIPT` is a submission-API response, not a stored record. Shape mirrors the ENA programmatic submission guide example.
- Source URLs and download date are recorded in `tests/fixtures/README.md`.
Tests load fixtures off disk via a small relative-path helper. No network at test time.
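A relative-path helper of this kind might be sketched as follows. The names `fixturePathFrom` and `fixturePath` are hypothetical, and the sketch assumes the test binaries run from somewhere below `tests/`, so that walking up from the binary directory eventually reaches the directory containing `fixtures/`. The repo's actual helper may resolve the path differently.

```fsharp
open System
open System.IO

/// Walks up from `startDir` until a directory containing `fixtures` is found,
/// then returns the path of `fileName` inside that `fixtures` directory.
/// Fails loudly if no such directory exists, rather than returning a bad path.
let fixturePathFrom (startDir: string) (fileName: string) : string =
    let rec search (dir: DirectoryInfo) =
        if isNull dir then
            failwithf "No fixtures directory found above %s" startDir
        else
            let candidate = Path.Combine(dir.FullName, "fixtures")
            if Directory.Exists candidate then
                Path.Combine(candidate, fileName)
            else
                search dir.Parent
    search (DirectoryInfo startDir)

/// Resolves a fixture relative to the running test assembly.
let fixturePath (fileName: string) : string =
    fixturePathFrom AppContext.BaseDirectory fileName
```

A test would then call something like `fixturePath "PRJDB5192.xml"` and hand the result to the module's `read` function, with no network access involved.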
- Confirm `build/ProjectInfo.fs` `solutionFile` resolves (already `BioFSharp.INSDC.slnx`).
- Verify `build.cmd` / `build.sh` entry points still work after the `.fsproj` → `.csproj` swap in the slnx.
- Add the `regenerateInsdcTypes` target. It must not be chained into the default build: generated code is committed precisely so contributors can build without the tool restored.
- Audit `.github/workflows/build-and-test.yml` for template residue: SDK pin and the Codecov slug.
The regenerateInsdcTypes target lives in build/BasicTasks.fs. It is standalone (no dependencies on clean / buildSolution) and is not referenced by any release pipeline.
CI uses global-json-file: global.json so the workflow tracks the SDK version pinned at the repo root (currently 10.0.100). The Codecov slug is retargeted to BioFSharp/BioFSharp.INSDC.
Each step gates the next:
- `dotnet tool restore` succeeds (verified 2026-05-23 in devcontainer; restores `dotnet-xscgen` 3.0.1270 and `fsdocs-tool` 20.0.1).
- `build.sh regenerateInsdcTypes` produces `.cs` files under `src/BioFSharp.FileFormats.INSDC/Generated/` (verified 2026-05-23; only diff vs. committed output is the absolute-path noise in the generator-header comment).
- `build.sh` (default `buildSolution`) succeeds with zero `CS1591` (missing-XML-doc) warnings (verified 2026-05-23; `0 Error(s)`).
- `bash build.sh runtests` passes locally in the devcontainer (`24/24` tests, re-confirmed 2026-05-23).
- `build.sh pack` produces both nupkgs with non-template metadata (verified 2026-05-23: `pkg/BioFSharp.FileFormats.INSDC.0.1.0.nupkg` and `pkg/BioFSharp.IO.INSDC.0.1.0.nupkg`). The `Pack` target uses an interactive `Y/n` confirmation prompt; pipe `echo Y |` when invoking non-interactively.
- No fsdocs site for this repo; usage examples live in the base `BioFSharp` docs.
- ALWAYS add XML documentation to all public types and functions in both projects.
- ALWAYS run the FAKE build targets (`build.cmd` / `build.sh`) for solution-wide work; they are the source of truth for build/test/pack configuration. Use raw `dotnet build` / `dotnet test` only when iterating on a single project.
- Do not hand-edit anything under `src/BioFSharp.FileFormats.INSDC/Generated/`; it is regenerated from XSDs.
- Do not chain `regenerateInsdcTypes` into the default build.
- Do not add an fsdocs site to this repo.