Technology · 2 min read

Managing a 1,500-Book Digital Archive: Lessons from the Field

How we built systems to manage, process, and publish from a personal archive of 1,537 books totalling 44+ million words.

The Scale of the Problem

Atharva Inamdar's book archive contains:

  • 1,537 books written between 2007 and 2026
  • 44+ million words of content
  • 19 years of continuous writing output
  • 13 genres from literary fiction to quantum spirituality

Managing this archive is not a creative problem. It is an engineering problem.

The Systems We Built

1. Manuscript Processor

A Node.js script that ingests raw book files (Markdown, DOCX) and outputs structured JSON with:

  • Chapter segmentation (splitting long documents into chapters)
  • Word count calculation at book and chapter level
  • Metadata extraction (title, genre, themes, settings)
  • Quality classification (Hero, Support, Archive tiers)
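The segmentation and word-count steps can be sketched in a few lines of Node.js. This is a minimal illustration, not the actual processor: it assumes chapters are delimited by `## ` Markdown headings, and the function and field names are invented for the example.

```javascript
// Split a Markdown manuscript into chapters on "## " headings
// and compute word counts at the chapter and book level.
function processManuscript(markdown, title) {
  const chapters = [];
  let current = { title: "Front Matter", lines: [] };
  for (const line of markdown.split("\n")) {
    if (line.startsWith("## ")) {
      if (current.lines.length) chapters.push(current);
      current = { title: line.slice(3).trim(), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  chapters.push(current);

  const countWords = (text) => (text.match(/\S+/g) || []).length;
  const chapterData = chapters.map((c) => ({
    title: c.title,
    wordCount: countWords(c.lines.join(" ")),
  }));

  return {
    title,
    wordCount: chapterData.reduce((sum, c) => sum + c.wordCount, 0),
    chapters: chapterData,
  };
}
```

The same structured-JSON output then feeds every downstream system, which is what makes the rest of the tooling possible.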

2. Quality Audit System

Automated quality scoring based on:

  • Book completeness (is it a finished work?)
  • Prose quality indicators (sentence variety, dialogue ratio)
  • Genre classification accuracy
  • Content warning detection

Output: A quality report for each book with actionable editorial notes.
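One indicator from the list above, dialogue ratio, is simple enough to sketch: the share of paragraphs containing quoted speech. The detection heuristic, the tier cutoffs, and the function names below are all assumptions for illustration; only the tier names (Hero, Support, Archive) come from our actual system.

```javascript
// Dialogue ratio: fraction of paragraphs containing quoted speech.
// Matches straight (") or curly (\u201C) opening quotes.
function dialogueRatio(text) {
  const paragraphs = text.split(/\n\s*\n/).filter((p) => p.trim());
  if (!paragraphs.length) return 0;
  const withDialogue = paragraphs.filter((p) => /["\u201C]/.test(p)).length;
  return withDialogue / paragraphs.length;
}

// Map an aggregate quality score to an editorial tier.
// Cutoff values here are invented, not the production thresholds.
function classifyTier(score) {
  if (score >= 0.8) return "Hero";
  if (score >= 0.5) return "Support";
  return "Archive";
}
```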

3. Duplicate Detection

With 1,537 books spanning 19 years, duplication is inevitable — revised versions, renamed titles, partial rewrites. Our detection system uses:

  • Title similarity matching (Levenshtein distance)
  • Opening paragraph fingerprinting
  • Word count clustering (books within 10% of each other)
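Title similarity is the standard Levenshtein edit distance, normalized to a 0 to 1 score. A sketch of that first check (the similarity threshold you would flag on is a tuning choice, not a value from our system):

```javascript
// Levenshtein distance via the classic dynamic-programming table.
function levenshtein(a, b) {
  const m = a.length, n = b.length;
  const dp = Array.from({ length: m + 1 }, (_, i) => [i, ...Array(n).fill(0)]);
  for (let j = 0; j <= n; j++) dp[0][j] = j;
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                    // deletion
        dp[i][j - 1] + 1,                                    // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)   // substitution
      );
    }
  }
  return dp[m][n];
}

// Normalize to a similarity score: 1 = identical, 0 = nothing shared.
function titleSimilarity(a, b) {
  const dist = levenshtein(a.toLowerCase(), b.toLowerCase());
  return 1 - dist / Math.max(a.length, b.length, 1);
}
```

Pairs that pass the title check then go through the cheaper fingerprint and word-count filters before a human reviews the match.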

4. Publishing Pipeline

From raw manuscript to published book:

  1. Raw book → Markdown conversion
  2. Chapter segmentation → Individual chapter files
  3. Metadata extraction → JSON catalog entry
  4. Quality audit → Editorial classification
  5. ISBN assignment → Catalog integration
  6. Page generation → Individual reader pages
  7. Export generation → EPUB, PDF, BibTeX

This pipeline runs in under 30 seconds for the entire 68-book published catalog.
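The seven steps above can be modeled as a list of stage functions applied in order, each one transforming the book's state and recording its name. The stage bodies below are stubs; in the real pipeline each performs the conversion its name describes.

```javascript
// Run a book through an ordered list of [name, fn] stages,
// threading the state and accumulating a log of completed stages.
function runPipeline(book, stages) {
  return stages.reduce((state, [name, fn]) => {
    const next = fn(state);
    return { ...next, log: [...(state.log || []), name] };
  }, book);
}

// Stage names mirror the published pipeline; bodies are placeholders.
const stages = [
  ["markdown-conversion", (b) => b],
  ["chapter-segmentation", (b) => b],
  ["metadata-extraction", (b) => b],
  ["quality-audit", (b) => b],
  ["isbn-assignment", (b) => b],
  ["page-generation", (b) => b],
  ["export-generation", (b) => b],
];
```

Keeping each stage a pure function of the previous stage's output is what lets the whole catalog rebuild from scratch in seconds.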

5. Editorial Content Generator

The archive doesn't just produce books — it produces editorial content:

  • Daily Pages: One passage per day from the archive (204+ pre-generated)
  • First Lines: Opening sentence of every book (68 entries)
  • Revision Theater: Draft vs. published comparisons (15 examples)
  • Emotional Map: 19-year timeline of writing output by year
  • Reading Guides: Curated pathways through the archive

All generated programmatically from the book data. No manual content creation.
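As one example of that programmatic generation, a Daily Page can be selected deterministically by hashing the date into an index over the pre-generated passages, so the same date always serves the same passage with no scheduling state. The hashing scheme here is an assumption for illustration, not the site's actual method.

```javascript
// Pick a passage deterministically from a date string ("YYYY-MM-DD"):
// a simple polynomial rolling hash, reduced modulo the passage count.
function dailyPage(passages, dateStr) {
  let hash = 0;
  for (const ch of dateStr) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // keep as unsigned 32-bit
  }
  return passages[hash % passages.length];
}
```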

Lessons Learned

  1. Treat content as data: Books are not just creative works. They are data that can be processed, analyzed, and transformed.
  2. Build for scale: Systems designed for 68 books should work for 680. And 6,800.
  3. Automate editorial: If content can be derived from existing data, it should be generated, not written.
  4. Version everything: Every book, every script, every configuration file lives in Git.

— BogaDoga Engineering

BogaDoga Ltd

Publishing & Digital Innovation, London
