DOCX to HTML and plain text

Extracting readable content from Word files with mammoth — what to expect from the output.

What you can do

Open a Word (.docx) file and download either:

  • HTML — for viewing in a browser or light editing, or
  • Plain text — paragraphs without formatting.

How it works (simplified)

  1. The .docx file is read as a ZIP archive of XML parts (this is how Office Open XML packages a document).
  2. mammoth.js walks the document XML and maps Word structures to HTML or text.
  3. The result is wrapped in a minimal HTML page (for HTML export) or returned as plain-text paragraphs (for text export).
  4. You download the output. The whole conversion runs on your machine.
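
The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the tool's actual code: convertDocx, wrapHtmlPage, and escapeHtml are hypothetical names, and the mammothLike parameter is assumed to be the mammoth.js browser build, whose convertToHtml and extractRawText accept { arrayBuffer } and resolve to an object with value and messages.

```javascript
// Escape text for safe use inside HTML (used here for the page title only).
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Step 3 (HTML export): wrap the converted fragment in a minimal page.
function wrapHtmlPage(fragment, title) {
  return [
    "<!DOCTYPE html>",
    "<html>",
    "<head>",
    '<meta charset="utf-8">',
    "<title>" + escapeHtml(title) + "</title>",
    "</head>",
    "<body>",
    fragment,
    "</body>",
    "</html>",
  ].join("\n");
}

// Steps 1-3: mammothLike is assumed to be the mammoth.js object.
// format is "html" or "text"; anything other than "html" is treated as text.
async function convertDocx(mammothLike, arrayBuffer, format, title) {
  if (format === "html") {
    const result = await mammothLike.convertToHtml({ arrayBuffer });
    return { output: wrapHtmlPage(result.value, title), messages: result.messages };
  }
  const result = await mammothLike.extractRawText({ arrayBuffer });
  return { output: result.value, messages: result.messages };
}
```

In a browser, the arrayBuffer would typically come from calling file.arrayBuffer() on a File chosen through a file input, and the returned messages array is where mammoth.js reports conversion warnings.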

What runs in your browser

The conversion is done by mammoth.js, which focuses on semantic content (headings, paragraphs, lists, links) rather than pixel-perfect layout. It is a popular choice for “good enough” Word extraction in the browser.
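
To illustrate what semantic mapping means in practice: mammoth.js accepts a styleMap option that maps Word paragraph styles to HTML elements. The sketch below builds such an option. The helper buildStyleMap and the style names "Section Title" and "Subsection Title" are made up for illustration, and whether this tool passes a custom style map at all is an assumption.

```javascript
// Build mammoth.js style map lines from { "Word style name": "html element" }.
// Hypothetical helper; assumes style names contain no apostrophes.
function buildStyleMap(mapping) {
  return Object.entries(mapping).map(
    ([styleName, element]) => `p[style-name='${styleName}'] => ${element}:fresh`
  );
}

// Options object in the shape mammoth.js expects as its second argument.
const options = {
  styleMap: buildStyleMap({
    "Section Title": "h1",
    "Subsection Title": "h2",
  }),
};
```

An options object like this would be passed as the second argument to convertToHtml, so that paragraphs using those Word styles become headings instead of plain paragraphs.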

Tradeoffs and limits

  • Layout fidelity: Complex templates, text boxes, and exact positioning are not preserved.
  • Images: Embedded images may be omitted or simplified depending on the document.
  • Macros & forms: Not supported; only document body content is targeted.
  • Legacy .doc: Only modern .docx is supported, not the older binary .doc format.
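
One way to tell the two formats apart before conversion is to look at the first bytes of the file rather than the extension. The sketch below is illustrative only: sniffWordFormat is a hypothetical helper, and nothing here says the tool performs this check. It relies on two well-known signatures: ZIP containers (which .docx files are) begin with the bytes "PK" 03 04, and legacy OLE2 compound files (which binary .doc files are) begin with D0 CF 11 E0 A1 B1 1A E1.

```javascript
// Classify a file by its leading bytes (hypothetical helper).
// bytes is a Uint8Array or a plain array of byte values.
function sniffWordFormat(bytes) {
  const zip = [0x50, 0x4b, 0x03, 0x04];
  const ole = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];
  const startsWith = (sig) =>
    bytes.length >= sig.length && sig.every((b, i) => bytes[i] === b);

  if (startsWith(zip)) return "zip"; // possibly .docx; other ZIP-based formats match too
  if (startsWith(ole)) return "ole"; // possibly legacy .doc; other OLE2 files match too
  return "unknown";
}
```

A result of "ole" is a reasonable cue to show a message asking for a .docx file instead of attempting the conversion. A result of "zip" only says the container is right, not that the archive actually holds a Word document.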
