How Micromark Handles Markdown and the AST: The Ultimate Deep Dive


1. What is Micromark?

Micromark is the low-level, highly efficient streaming tokenizer and parser at the core of the modern markdown ecosystem (remark, unified, etc). It is responsible for:
  • Turning raw markdown text into a stream of tokens (not an AST!)
  • Handling every byte, line ending, and markdown edge case according to the CommonMark spec (and, via extensions, the GFM spec)
  • Providing extension points for plugins (like GFM tables, footnotes, etc)
Key fact: Micromark itself does NOT build an AST. It emits a token/event stream. The AST (MDAST, HAST, etc) is built by higher-level utilities (like mdast-util-from-markdown).
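
A minimal example of the high-level entry point (the token stream stays internal here; micromark() compiles it straight to HTML):

```js
import {micromark} from 'micromark'

// One call runs the whole pipeline described below:
// preprocess → tokenize → postprocess → compile to HTML.
console.log(micromark('## Hello, *world*!'))
// → '<h2>Hello, <em>world</em>!</h2>'
```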

2. The Micromark Pipeline: Step by Step

Step 1: Preprocessing

  • File: lib/preprocess.js
  • Handles normalization of line endings and encodings, and prepares the input for streaming parsing.
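
A small sketch of this step in isolation, using the same deep import that mdast-util-from-markdown relies on (micromark/lib/preprocess.js); the sample input and logging are just for illustration:

```js
import {preprocess} from 'micromark/lib/preprocess.js'

// preprocess() returns a function that turns a string (or buffer) into
// "chunks": plain string runs plus special codes for tabs, line endings,
// and (when the last argument is `true`) end of file.
const chunks = preprocess()('a\tb\r\nc', undefined, true)
console.log(chunks)
```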

Step 2: Parsing (Tokenization)

  • File: lib/parse.js
  • The heart of micromark. It:
    • Combines built-in and extension constructs (syntax rules)
    • Sets up the parsing context (lines, columns, buffers)
    • Uses the createTokenizer function (from lib/create-tokenizer.js) to walk the input and emit tokens
  • Constructs:
    • Each markdown feature (heading, list, code block, table, etc) is a "construct" (see lib/constructs.js and micromark-core-commonmark)
    • Constructs are organized by context: document, content, flow, string, text, etc
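
A sketch of the parse step, again via the deep import used by mdast-util-from-markdown (micromark/lib/parse.js); the gfm() extension is optional and only there to show constructs being merged:

```js
import {parse} from 'micromark/lib/parse.js'
import {gfm} from 'micromark-extension-gfm'

// parse() merges the core constructs with any extensions and returns a
// parse context with one tokenizer factory per content type:
// document(), flow(), string(), text(), content().
const parser = parse({extensions: [gfm()]})

// The merged construct map, keyed per context by character code:
console.log(Object.keys(parser.constructs.flow)) // e.g. '35' hooks `#` (ATX headings)
```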

Step 3: Tokenizer State Machine

  • File: lib/create-tokenizer.js
  • This is a streaming state machine:
    • Maintains a Point (line, column, offset) as it walks the input
    • At each character, checks the current construct(s) to see if a match is possible
    • Emits paired enter/exit events for each markdown element
    • Handles nested constructs (e.g., emphasis inside a link inside a table cell)
    • Uses effects (enter, exit, consume, etc) to manage state
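
To watch the state machine directly, you can run the lower-level pipeline yourself and print the events it emits (same deep imports as above; the exact token types depend on the input):

```js
import {parse} from 'micromark/lib/parse.js'
import {preprocess} from 'micromark/lib/preprocess.js'
import {postprocess} from 'micromark/lib/postprocess.js'

// Tokenize one small document and collect the raw events.
const events = postprocess(
  parse({}).document().write(preprocess()('# Hello *world*', undefined, true))
)

// Each event is ['enter' | 'exit', token, context]; tokens carry Points.
for (const [kind, token] of events) {
  console.log(kind, token.type, token.start.offset, '→', token.end.offset)
}
```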

Step 4: Postprocessing

  • File: lib/postprocess.js
  • A final pass over the event stream: runs subtokenization until everything is resolved, so deferred content (e.g., the text inside headings, paragraphs, and definitions) is fully tokenized

Step 5: Compilation (to HTML or other output)

  • File: lib/compile.js
  • By default, micromark can compile the token stream directly to HTML.
  • The compiler walks the token stream, mapping tokens to HTML tags, handling escaping, and applying extensions (e.g., GFM tables, autolinks).
  • You can swap in your own compiler to output a CST, AST, or any other format.
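
That last bullet can be made concrete: anything that walks the event stream is a "compiler". Below is a hypothetical mini-compiler that collects heading text instead of emitting HTML (imports as in the earlier sketch; atxHeadingText is the token type emitted by micromark's ATX heading construct for the heading's text):

```js
import {parse} from 'micromark/lib/parse.js'
import {preprocess} from 'micromark/lib/preprocess.js'
import {postprocess} from 'micromark/lib/postprocess.js'

// A toy "compiler": collect the text of every ATX heading in a document.
function headingOutline(markdown) {
  const events = postprocess(
    parse({}).document().write(preprocess()(markdown, undefined, true))
  )
  const headings = []

  for (const [kind, token, context] of events) {
    if (kind === 'exit' && token.type === 'atxHeadingText') {
      // Tokens do not store their text; slice it from the source via the context.
      headings.push(context.sliceSerialize(token))
    }
  }

  return headings
}

console.log(headingOutline('# One\n\ntext\n\n## Two\n')) // → ['One', 'Two']
```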

3. How Extensions (like GFM) Plug In

  • Extensions are objects that define additional constructs (syntax rules) and/or HTML handlers.
  • When you call micromark(markdown, { extensions: [gfm()] }), the GFM constructs (tables, strikethrough, etc) are merged with the core constructs.
  • Each extension can add, override, or modify constructs for any context (document, flow, text, etc).
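
For example, with the real GFM extension the syntax constructs and the HTML handlers are passed separately:

```js
import {micromark} from 'micromark'
import {gfm, gfmHtml} from 'micromark-extension-gfm'

const html = micromark('Strike ~~this~~ and visit www.example.com.', {
  extensions: [gfm()],        // syntax constructs (tables, strikethrough, autolinks, …)
  htmlExtensions: [gfmHtml()] // the matching HTML handlers
})

console.log(html)
// → '<p>Strike <del>this</del> and visit
//    <a href="http://www.example.com">www.example.com</a>.</p>'
```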

4. Token/Event Stream Format

  • Each event is a tuple: an 'enter' or 'exit' label, a token, and the tokenize context.
  • Each token is an object with:
    • type (e.g., 'atxHeading', 'paragraph', 'codeFenced', etc)
    • start and end points (line, column, offset)
    • no stored text: you slice a token's text from the source via the context (context.sliceSerialize(token))
  • The event stream is a flat list of enter/exit pairs, not a tree.
  • Example (simplified, for the input '# Hello', with some tokens omitted):

```js
[
  ['enter', {type: 'atxHeading', start: {line: 1, column: 1, offset: 0}, end: {line: 1, column: 8, offset: 7}}, context],
  ['enter', {type: 'atxHeadingSequence', /* … */}, context],
  ['exit', {type: 'atxHeadingSequence', /* … */}, context],
  ['enter', {type: 'atxHeadingText', /* … */}, context],
  ['exit', {type: 'atxHeadingText', /* … */}, context],
  ['exit', {type: 'atxHeading', /* … */}, context]
]
```
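
Because nesting is encoded purely by enter/exit order, you can recover the tree shape by tracking depth (a small self-contained demo with the same deep imports as above):

```js
import {parse} from 'micromark/lib/parse.js'
import {preprocess} from 'micromark/lib/preprocess.js'
import {postprocess} from 'micromark/lib/postprocess.js'

const events = postprocess(
  parse({}).document().write(preprocess()('# Hello *world*', undefined, true))
)

// Indent token types by their enter/exit nesting to visualise the structure
// hidden in the flat list.
let depth = 0
for (const [kind, token] of events) {
  if (kind === 'exit') depth--
  console.log('  '.repeat(depth) + kind + ':' + token.type)
  if (kind === 'enter') depth++
}
```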

5. How the AST is Actually Built

  • Micromark does NOT build the AST.
  • Instead, higher-level utilities (like mdast-util-from-markdown) walk the token stream and build the MDAST (Markdown AST) or HAST (HTML AST).
  • These utilities use the start/end info and nesting of tokens to build the correct tree structure.
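
Concretely, mdast-util-from-markdown runs micromark's parser internally and folds the events into MDAST nodes:

```js
import {fromMarkdown} from 'mdast-util-from-markdown'

const tree = fromMarkdown('# Hello *world*')

console.log(tree.type)             // 'root'
console.log(tree.children[0].type) // 'heading'
console.log(tree.children[0].depth) // 1
// Every node gets a `position` built from the token start/end points.
console.log(tree.children[0].position.start) // {line: 1, column: 1, offset: 0}
```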

6. Anatomy of a Construct (Syntax Rule)

  • Each construct is an object with:
    • tokenize: the state machine that matches the syntax
    • resolve (plus resolveTo / resolveAll): optional post-processing of the emitted events
  • Example: the heading construct checks for # at the start of a line, then consumes the rest of the line as heading text.
  • Constructs can be as simple as a character match or as complex as a full table parser (see GFM extension).
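
In code, a construct is roughly shaped like this (a sketch with made-up names and a made-up trigger character; the real constructs live in micromark-core-commonmark):

```js
// `effects` drives the tokenizer; `ok`/`nok` are the states to return to on
// success or failure. Every state receives one character code and returns
// the next state.
const myConstruct = {
  name: 'myConstruct',
  tokenize(effects, ok, nok) {
    return start

    function start(code) {
      if (code !== 33 /* `!`, an arbitrary trigger for this sketch */) {
        return nok(code)
      }
      effects.enter('myConstruct')
      effects.consume(code)
      effects.exit('myConstruct')
      return ok
    }
  },
  // Optional: rewrite the emitted events after a match (this identity
  // resolver changes nothing).
  resolve(events, context) {
    return events
  }
}
// Hooking a construct into a context is shown in the next section.
```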

7. How to Build Your Own Markdown Feature

  • Define a construct (with tokenize and optionally resolve)
  • Add it to the relevant context (document, flow, text, etc)
  • Pass your extension as { extensions: [myExtension] } to micromark
  • Optionally, add HTML handlers for direct HTML output
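
Here is an end-to-end sketch: a hypothetical "mention" feature that turns @name in text into a token and renders it as a span. The construct, the 'mention' token type, and the export names are invented for illustration; the hook shape ({text: {64: …}}), the effects/ok/nok tokenizer API, and the compile-context methods (tag, raw, encode, sliceSerialize) follow micromark's extension interfaces.

```js
import {micromark} from 'micromark'
import {asciiAlphanumeric} from 'micromark-util-character'

// Syntax construct: match `@` followed by one or more ASCII letters/digits.
const mentionConstruct = {
  name: 'mention',
  tokenize(effects, ok, nok) {
    return start

    function start(code) {
      // Only called when the text tokenizer sees `@` (code 64).
      effects.enter('mention')
      effects.consume(code)
      return first
    }

    function first(code) {
      if (asciiAlphanumeric(code)) {
        effects.consume(code)
        return rest
      }
      // A bare `@`: give up; micromark rolls back and treats it as data.
      return nok(code)
    }

    function rest(code) {
      if (asciiAlphanumeric(code)) {
        effects.consume(code)
        return rest
      }
      effects.exit('mention')
      return ok(code)
    }
  }
}

// Syntax extension: attempt the construct when `@` (U+0040) appears in text.
export const mentionSyntax = {text: {64: mentionConstruct}}

// HTML extension: render the token when the compiler exits it.
export const mentionHtml = {
  exit: {
    mention(token) {
      this.tag('<span class="mention">')
      this.raw(this.encode(this.sliceSerialize(token)))
      this.tag('</span>')
    }
  }
}

console.log(
  micromark('Hi @alice!', {
    extensions: [mentionSyntax],
    htmlExtensions: [mentionHtml]
  })
)
// → '<p>Hi <span class="mention">@alice</span>!</p>'
```

Registering the construct under character code 64 in the text map is what makes the tokenizer attempt it whenever it reaches an `@`; everything else is ordinary state-machine code.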

8. Key Files for Reference

  • index.js: Main entry point, wires up all phases
  • lib/parse.js: Parsing/tokenization logic
  • lib/constructs.js: List of all built-in constructs
  • lib/create-tokenizer.js: The streaming state machine
  • lib/compile.js: HTML compiler
  • lib/preprocess.js, lib/postprocess.js: Input/output normalization

9. Official Docs and Source

  • micromark: https://github.com/micromark/micromark
  • micromark-extension-gfm: https://github.com/micromark/micromark-extension-gfm
  • mdast-util-from-markdown: https://github.com/syntax-tree/mdast-util-from-markdown
  • CommonMark spec: https://spec.commonmark.org/

10. The Bottom Line

  • Micromark is the streaming, spec-accurate tokenizer for markdown.
  • It emits a flat token/event stream, not an AST.
  • Extensions add new syntax rules (constructs) and output handlers.
  • The AST is built by utilities like mdast-util-from-markdown.
  • You can build your own extensions, compilers, or AST builders on top of micromark.

If you need a walk-through of a specific construct, or a deeper guide to writing and testing your own extension, just ask.