Managing a 1,500-Book Digital Archive: Lessons from the Field
How we built systems to manage, process, and publish from a personal archive of 1,537 books totalling 44+ million words.
The Scale of the Problem
Atharva Inamdar's book archive contains:
- 1,537 books written between 2007 and 2026
- 44+ million words of content
- 19 years of continuous writing output
- 13 genres from literary fiction to quantum spirituality
Managing this archive is not a creative problem. It is an engineering problem.
The Systems We Built
1. Manuscript Processor
A Node.js script that ingests raw book files (Markdown, DOCX) and outputs structured JSON with:
- Chapter segmentation (splitting long documents into chapters)
- Word count calculation at book and chapter level
- Metadata extraction (title, genre, themes, settings)
- Quality classification (Hero, Support, Archive tiers)
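The chapter-segmentation and word-count steps can be sketched as follows. This is a minimal illustration, assuming chapters are marked with level-2 Markdown headings (`## Chapter ...`); the function name `segmentChapters` is illustrative, not the processor's actual API.

```javascript
// Split a Markdown manuscript into chapters and count words per chapter.
// Assumes "## " headings delimit chapters (an assumption for this sketch).
function segmentChapters(markdown) {
  const chapters = [];
  let current = null;
  for (const line of markdown.split("\n")) {
    if (line.startsWith("## ")) {
      if (current) chapters.push(current);
      current = { title: line.slice(3).trim(), body: [] };
    } else if (current) {
      current.body.push(line);
    }
  }
  if (current) chapters.push(current);
  return chapters.map((ch) => {
    const text = ch.body.join("\n").trim();
    return {
      title: ch.title,
      text,
      // Count non-empty whitespace-separated tokens as words
      wordCount: text.split(/\s+/).filter(Boolean).length,
    };
  });
}

const book = "## Chapter One\nIt was a dark night.\n\n## Chapter Two\nThe rain stopped.";
const chapters = segmentChapters(book);
console.log(chapters.length);       // 2
console.log(chapters[0].wordCount); // 5
```

Book-level word count is then just the sum over chapters, which keeps the two figures consistent by construction.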
2. Quality Audit System
Automated quality scoring based on:
- Book completeness (is it a finished work?)
- Prose quality indicators (sentence variety, dialogue ratio)
- Genre classification accuracy
- Content warning detection
Output: A quality report for each book with actionable editorial notes.
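Two of the prose-quality indicators named above, sentence variety and dialogue ratio, can be computed with simple heuristics. This sketch assumes dialogue lines open with a double quote and sentences end with `.`, `!`, or `?`; the real audit system's rules and thresholds may differ.

```javascript
// Heuristic prose-quality indicators (a sketch, not the production scorer).
function proseIndicators(text) {
  // Sentence variety: standard deviation of sentence length in words
  const sentences = text.split(/[.!?]+/).map((s) => s.trim()).filter(Boolean);
  const lengths = sentences.map((s) => s.split(/\s+/).length);
  const mean = lengths.reduce((a, b) => a + b, 0) / lengths.length;
  const variance =
    lengths.reduce((a, l) => a + (l - mean) ** 2, 0) / lengths.length;

  // Dialogue ratio: share of non-empty lines that start with a double quote
  const lines = text.split("\n").filter((l) => l.trim());
  const dialogueLines = lines.filter((l) => l.trim().startsWith('"')).length;

  return {
    sentenceVariety: Math.sqrt(variance),
    dialogueRatio: dialogueLines / lines.length,
  };
}

const sample = '"Hello," she said.\nHe waited. The clock ticked on and on.';
const scores = proseIndicators(sample);
console.log(scores.dialogueRatio); // 0.5
```

A flat `sentenceVariety` near zero flags monotonous prose; an extreme `dialogueRatio` in either direction is a cue for an editorial note rather than a hard failure.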
3. Duplicate Detection
With 1,537 books spanning 19 years, duplication is inevitable — revised versions, renamed titles, partial rewrites. Our detection system uses:
- Title similarity matching (Levenshtein distance)
- Opening paragraph fingerprinting
- Word count clustering (books within 10% of each other)
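The title-similarity check is standard Levenshtein distance, normalized so that 1 means identical and 0 means entirely different. The normalization step (lowercasing, stripping punctuation) is an assumption about how titles would be compared in practice.

```javascript
// Classic dynamic-programming Levenshtein edit distance.
function levenshtein(a, b) {
  const m = a.length, n = b.length;
  const dp = Array.from({ length: m + 1 }, (_, i) => [i, ...Array(n).fill(0)]);
  for (let j = 0; j <= n; j++) dp[0][j] = j;
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[m][n];
}

// Normalized similarity in [0, 1] after case/punctuation folding.
function titleSimilarity(t1, t2) {
  const norm = (t) => t.toLowerCase().replace(/[^a-z0-9 ]/g, "").trim();
  const [a, b] = [norm(t1), norm(t2)];
  if (!a && !b) return 1;
  return 1 - levenshtein(a, b) / Math.max(a.length, b.length);
}

console.log(levenshtein("kitten", "sitting")); // 3
console.log(titleSimilarity("The Winter Door", "the winter door!")); // 1
```

Pairs above a similarity threshold go on to the cheaper checks (opening-paragraph fingerprint, word-count cluster) before anything is flagged as a duplicate.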
4. Publishing Pipeline
From raw manuscript to published book:
- Raw book → Markdown conversion
- Chapter segmentation → Individual chapter files
- Metadata extraction → JSON catalog entry
- Quality audit → Editorial classification
- ISBN assignment → Catalog integration
- Page generation → Individual reader pages
- Export generation → EPUB, PDF, BibTeX
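The steps above compose naturally as an ordered list of stage functions, each taking and returning a book record. In this sketch the stage names mirror the pipeline, but the bodies are stubs standing in for the real conversion, segmentation, and export logic.

```javascript
// Pipeline as data: an ordered list of [name, stageFn] pairs.
// Stage bodies here are illustrative stubs, not the production code.
const stages = [
  ["convert", (b) => ({ ...b, format: "markdown" })],
  ["segment", (b) => ({ ...b, chapters: b.text.split(/\n## /).length })],
  ["catalog", (b) => ({ ...b, catalogued: true })],
  ["audit",   (b) => ({ ...b, tier: b.chapters > 1 ? "Hero" : "Archive" })],
  ["export",  (b) => ({ ...b, outputs: ["epub", "pdf", "bibtex"] })],
];

// Run every stage in order, threading the record through.
function runPipeline(book) {
  return stages.reduce((record, [, stage]) => stage(record), book);
}

const result = runPipeline({ title: "Example", text: "Intro\n## One\n## Two" });
console.log(result.tier);    // "Hero"
console.log(result.outputs); // ["epub", "pdf", "bibtex"]
```

Keeping the pipeline as plain data makes it easy to time, skip, or reorder stages, which is part of why a full 68-book run stays under 30 seconds.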
This pipeline runs in under 30 seconds for the entire 68-book published catalog.
5. Editorial Content Generator
The archive doesn't just produce books — it produces editorial content:
- Daily Pages: One passage per day from the archive (204+ pre-generated)
- First Lines: Opening sentence of every published book (68 entries)
- Revision Theater: Draft vs. published comparisons (15 examples)
- Emotional Map: 19-year timeline of writing output by year
- Reading Guides: Curated pathways through the archive
All generated programmatically from the book data. No manual content creation.
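One way such a feed stays stable without manual curation is deterministic selection: hash the date string so the same day always yields the same passage. This is a sketch of the idea, not the generator's actual mechanism, and the hash choice (djb2) is an assumption.

```javascript
// Deterministic daily selection: the same ISO date always maps to the
// same passage, so pages can be pre-generated or computed on demand.
function dailyPage(passages, isoDate) {
  let h = 5381;
  for (const ch of isoDate) h = (h * 33 + ch.charCodeAt(0)) >>> 0; // djb2 hash
  return passages[h % passages.length];
}

const passages = ["Passage A", "Passage B", "Passage C"];
console.log(
  dailyPage(passages, "2025-03-14") === dailyPage(passages, "2025-03-14"),
); // true
```

The same pattern generalizes to First Lines and Reading Guides: derive, don't author.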
Lessons Learned
- Treat content as data: Books are not just creative works. They are data that can be processed, analyzed, and transformed.
- Build for scale: Systems designed for 68 books should work for 680. And 6,800.
- Automate editorial: If content can be derived from existing data, it should be generated, not written.
- Version everything: Every book, every script, every configuration file lives in Git.
BogaDoga Ltd
Publishing & Digital Innovation, London