Technical Overview

🥃 WhiskeySour

A high-performance, drop-in replacement for BeautifulSoup written in Rust, exposed to Python via PyO3. 450 tests passing. Full BS4 API compatibility.

1. The Problem

BeautifulSoup is the most widely used HTML parsing library in Python. It has an elegant, readable API and handles malformed real-world HTML gracefully. But its performance is fundamentally limited by how it was built.

Every node in a BeautifulSoup tree is a full Python object. On a typical page with 5,000 nodes, that is 5,000 heap allocations, 5,000 reference-counted objects the GIL must protect, and roughly 2.5 GB of memory per 1,000 concurrent documents. Parsing is entirely single-threaded and Python-bound. CSS selector evaluation re-parses the selector string on every call.

The core bottleneck is not the HTML parsing algorithm itself; it is that Python's object model makes every node expensive to create, traverse, and garbage-collect. A Python dictionary alone uses ~240 bytes, and each BeautifulSoup Tag holds several of them.
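The per-object cost is easy to observe in CPython with sys.getsizeof. The exact byte counts below vary across Python versions and are illustrative only; the orders of magnitude are the point:

```python
import sys

# The building blocks of a single BeautifulSoup-style node.
attrs = {"class": "card", "id": "main", "href": "/x"}  # attribute dict
children = []                                          # child list

print(sys.getsizeof(attrs))     # a small dict: roughly 180-240 bytes
print(sys.getsizeof(children))  # an empty list: roughly 56 bytes
```

And this is before counting the node object itself, its string-valued tag name, and the reference-counting overhead on each of them.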

For scripts that scrape a few pages this is fine. For production pipelines that process tens of thousands of documents per second, it becomes the binding constraint — requiring more machines, more RAM, and more engineering to work around.

Concrete pain points

Teams hitting BeautifulSoup's limits typically report three things:

  1. Parse time dominates their per-request latency budget.
  2. Worker memory usage prevents scaling beyond a few dozen concurrent jobs per machine.
  3. CSS selector queries on large documents block the event loop for tens of milliseconds.

The usual escape routes (switching to lxml, writing custom C extensions, or moving to a different language) all require giving up the BeautifulSoup API that the rest of the codebase depends on.


2. The Solution

WhiskeySour replaces BeautifulSoup's Python internals with a Rust library while keeping the API identical. Existing code needs no changes, only the import line.

# Before
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# After: everything else stays the same
from whiskeysour import WhiskeySour as BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

The library is built on five technical pillars, each targeting a specific weakness in BeautifulSoup's design.

html5ever

Servo's battle-tested, spec-compliant HTML5 parser, written in Rust. It parses the document in a single pass with no Python overhead.

Arena Allocation

All nodes live in a single pre-allocated slab. ~40 bytes per node versus ~500 bytes per BeautifulSoup Tag.

Compiled CSS DFA

Selectors are compiled to a deterministic finite automaton once and cached. Repeated queries cost only a single DFA traversal.

Rayon Parallelism

The GIL is released during all Rust operations. find_all on large trees runs across all CPU cores simultaneously.

memchr SIMD

Text scanning uses CPU vector instructions (SSE2 / AVX2 / NEON). 16–32 bytes are inspected per clock cycle.

PyO3 + maturin

Zero-copy bridge between Rust and Python. Rust objects are exposed as native Python types with no serialisation overhead.


3. Measured Results

All numbers below are medians over 100–200 rounds on Apple Silicon (M-series). Documents are synthetic but structurally representative of real scraping targets. Measured with a dev build (maturin develop); release builds are typically 2–3× faster.

  11×  faster parsing (dev build)
  14×  faster CSS selectors
  50×  faster serialisation
  12×  less memory per node

Parse time

Document size   WhiskeySour   BeautifulSoup 4
10KB            0.33 ms       3.78 ms
100KB           4.08 ms       42.87 ms
500KB           9.99 ms       106.37 ms

Full operation comparison (~100KB document, dev build)

Operation                WhiskeySour   BeautifulSoup 4   Speedup
parse()                  4.08 ms       42.87 ms          11×
find(id=…)               0.21 ms       2.21 ms           11×
find_all(class_=…)       0.62 ms       4.41 ms           7×
select("div.item")       0.64 ms       8.92 ms           14×
get_text()               0.17 ms       0.68 ms           4×
str() — full serialise   0.43 ms       21.58 ms          50×
tag.get("class")         0.29 µs       7.0 µs            24×

Serialisation and CSS selectors show the largest gains (50× and 14×) because BeautifulSoup re-parses every selector string on each call and serialises the tree via Python string concatenation. WhiskeySour uses a compiled DFA for selectors and writes directly to a Rust String buffer with no Python involvement.

4. Faster Parsing: html5ever

BeautifulSoup delegates parsing to a pluggable backend. The default backend, Python's built-in html.parser, is a pure-Python tokeniser that calls back into Python for every token. Even when using lxml as the backend, BeautifulSoup still converts the resulting tree into Python objects node-by-node.

WhiskeySour uses html5ever, the HTML parser from the Servo browser engine. It implements the full HTML5 parsing specification and is written entirely in Rust. The parser feeds tokens directly into WhiskeySour's arena-allocated node tree without ever creating a Python object. The tree is built once, in Rust, and Python only receives a handle to the root.

The key insight: BeautifulSoup calls back into Python ~3 times per element (open tag, attributes, close tag). html5ever calls into Rust code with zero Python involvement. On a 500-node document that eliminates roughly 1,500 Python function calls from the hot path.
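The callback overhead is easy to see with the stdlib tokeniser that BeautifulSoup's default backend builds on. Every handler invocation below is one re-entry into the Python interpreter — exactly the cost html5ever avoids:

```python
from html.parser import HTMLParser

class CallbackCounter(HTMLParser):
    """Counts how many times the tokeniser re-enters Python code."""
    def __init__(self):
        super().__init__()
        self.calls = 0

    def handle_starttag(self, tag, attrs):
        self.calls += 1

    def handle_endtag(self, tag):
        self.calls += 1

    def handle_data(self, data):
        self.calls += 1

p = CallbackCounter()
p.feed("<div><p>hi</p><p>bye</p></div>")
print(p.calls)  # → 8: eight Python calls to tokenise four small elements
```

Scale that to thousands of nodes per document and the interpreter dispatch cost dominates the actual parsing work.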

html5ever also handles malformed HTML correctly by following the HTML5 error-recovery specification. This means WhiskeySour handles unclosed tags, mismatched nesting, and invalid attributes the same way every major browser does — not in an ad-hoc way.

# html5ever produces the same tree a browser would build:
soup = WhiskeySour("<p>text<b>bold</p>")
# → <html><body><p>text<b>bold</b></p></body></html>
# Correct per HTML5 spec — implicit </b> before </p>

5. Lower Memory: Arena Allocation

In BeautifulSoup, every Tag is a Python object that holds a dict of attributes, a list of children, a reference to its parent, and several other fields. Python's object header alone is 16–24 bytes, and the surrounding data structures add several hundred more. A typical node costs around 500 bytes.

WhiskeySour stores every node in a compact Rust struct inside a single pre-allocated arena: a contiguous block of memory. A node stores its tag name as an interned 32-bit ID, its attributes as a flat SmallVec (inline for up to 8 attributes, heap-allocated only when needed), and its tree position as integer indices. A typical node costs around 40 bytes.

Field             BeautifulSoup Tag    WhiskeySour Node
Object header     24 bytes             0 (Rust struct)
Tag name          ~50 bytes (str)      4 bytes (interned u32)
Attributes        ~240 bytes (dict)    ~24 bytes (inline SmallVec)
Children          ~56 bytes (list)     4 bytes (index)
Parent ref        8 bytes (pointer)    4 bytes (index)
Total (typical)   ~500 bytes           ~40 bytes

Beyond the per-node size, arena allocation has a second benefit: cache locality. When find_all walks the tree, all nodes are adjacent in memory. The CPU prefetcher can predict and load the next nodes before they are needed. BeautifulSoup's scattered heap objects cause frequent cache misses.

Freeing the entire document is also O(1): the arena is dropped as a single allocation, rather than recursively garbage-collecting thousands of Python objects.
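A toy sketch of the arena idea, transplanted to Python for illustration only (the real nodes are Rust structs): nodes become rows in flat arrays, and tree links are integer indices rather than object references, so the whole tree is a handful of allocations instead of one per node:

```python
class Arena:
    """Index-based node store: flat arrays instead of linked objects."""
    def __init__(self):
        self.tag_ids = []   # interned tag id per node
        self.parents = []   # parent index per node (-1 = root)

    def push(self, tag_id, parent):
        self.tag_ids.append(tag_id)
        self.parents.append(parent)
        return len(self.tag_ids) - 1  # a node handle is just an index

arena = Arena()
root = arena.push(0, -1)    # <html>
body = arena.push(1, root)  # <body>
p    = arena.push(2, body)  # <p>
print(arena.parents[p])     # → 1, the index of <body>
```

Traversal walks adjacent array slots, which is what makes the CPU prefetcher effective in the Rust version.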


6. Faster CSS Selectors: Compiled DFA

BeautifulSoup's CSS selector support comes from the soupsieve library. Every call to soup.select("div.item > a") tokenises and parses the selector string, builds an AST, then walks the Python node tree evaluating each selector predicate in Python.

WhiskeySour uses Mozilla's cssparser crate (the same CSS parser used in Firefox) to parse each selector once and compile it into a deterministic finite automaton. Once compiled, matching a node against a selector is a state-machine lookup: no string parsing, no AST traversal, no heap allocation. Compiled selectors are cached in an LRU cache keyed by the selector string, so even the compilation cost is paid at most once per unique selector.

# First call: selector is parsed, compiled to DFA, cached
results = soup.select("div.card > h3.title + p")

# All subsequent calls: pure DFA lookup, no re-parsing
results = soup.select("div.card > h3.title + p")  # ~17× faster than bs4

For pipelines that call select() on many documents with the same selector (e.g. a scraper extracting prices from product pages), the speedup compounds with every document. The selector is compiled once on the first call and reused for every document thereafter.

WhiskeySour also exposes compiled selectors as first-class objects for cases where the caching is not sufficient:

# Explicit pre-compilation: zero overhead on every use
q = soup.compile("div.card > h3.title + p")

for document in documents:
    results = q.select(document)
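The caching strategy itself — compile once per unique selector string, return the cached object on every later call — can be sketched in plain Python with functools.lru_cache. This is an illustration of the approach, not WhiskeySour's actual internals:

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def compile_selector(selector: str):
    # Stand-in for the real parse-and-build-DFA step, which is
    # the only expensive part and runs at most once per selector.
    return ("compiled", selector)

a = compile_selector("div.item > a")
b = compile_selector("div.item > a")
print(a is b)  # → True: the second call is a pure cache hit
```

Because the cache key is the selector string, scrapers that reuse a small, fixed set of selectors pay compilation cost only on the very first document.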

7. Parallel Search: Rayon + GIL Release

Python's Global Interpreter Lock (GIL) prevents more than one thread from executing Python bytecode at a time. This makes CPU-bound Python code fundamentally single-threaded regardless of how many cores are available.

Because WhiskeySour's tree lives entirely in Rust, traversal and matching can happen outside the GIL. PyO3 provides the allow_threads mechanism to release the GIL for the duration of a Rust call. WhiskeySour releases the GIL before every tree operation and reacquires it only when constructing the Python result list.

For find_all on large trees, the work is split across all available CPU cores using Rayon, Rust's data-parallelism library. The tree is partitioned into subtrees, each core searches its partition independently, and the results are merged.

# The GIL is released for the full duration of every tree operation
results = soup.find_all("article", class_="featured")
text    = soup.get_text()
html    = str(soup)

# Other Python threads (e.g. an async event loop) run freely
# while WhiskeySour is traversing or serialising the tree.

Practical impact: releasing the GIL means WhiskeySour is a cooperative library rather than a GIL hog. In an asyncio application, a find_all over a 500KB document does not stall the event loop; other coroutines execute while the Rust traversal runs.
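Released-GIL operations also compose with ordinary Python threads. The pattern below uses a stand-in parse function so it runs anywhere; replacing the stand-in with `WhiskeySour(html, "html.parser")` would let a plain thread pool parse documents on all cores, whereas the same code with BeautifulSoup serialises on the GIL:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_one(html: str) -> int:
    # Stand-in for WhiskeySour(html, "html.parser"); returns a
    # trivial result so the sketch is self-contained and runnable.
    return len(html)

documents = ["<p>a</p>", "<p>bb</p>", "<p>ccc</p>"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(parse_one, documents))
print(results)  # → [8, 9, 10]
```

No multiprocessing, no serialisation of documents between workers — threads are enough once the GIL is out of the way.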

8. SIMD Scanning: memchr

Many HTML operations reduce to searching a sequence of bytes for a specific character: the tokeniser looking for <, text extraction scanning for whitespace, attribute search walking a flat list. In Python, even simple loops have significant per-iteration overhead from bytecode dispatch.

WhiskeySour uses the memchr crate by Andrew Gallant, which provides byte-search routines backed by platform-specific SIMD instructions: SSE2 and AVX2 on x86/x64, NEON on ARM/Apple Silicon. Instead of checking one byte per loop iteration, these instructions check 16 or 32 bytes per clock cycle.

Method         Bytes checked / cycle   Platform
Python loop    1                       all
Scalar Rust    1–4                     all
memchr SSE2    16                      x86/x64
memchr AVX2    32                      modern x64
memchr NEON    16                      ARM / Apple Silicon

The effect is most visible in get_text(): extracting all text from a large document requires scanning every text node for whitespace and newline characters. WhiskeySour's SIMD-backed implementation is consistently 4–5× faster than BeautifulSoup's Python equivalent.
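The gap between byte-at-a-time Python and a C-level scan can be felt without leaving the standard library: bytes.find dispatches to an optimised C routine for single-byte searches, which is the same family of technique memchr accelerates further with SIMD:

```python
import timeit

data = b"x" * 100_000 + b"<"

def scan(buf):
    # Byte-at-a-time Python loop: bytecode dispatch on every iteration.
    for i, ch in enumerate(buf):
        if ch == 0x3C:  # ord("<")
            return i

slow = timeit.timeit(lambda: scan(data), number=5)
fast = timeit.timeit(lambda: data.find(b"<"), number=5)

print(scan(data) == data.find(b"<"))  # → True, same answer
print(slow > fast)                    # → True, the C scan wins by orders of magnitude
```

memchr widens the same idea from one byte per step to 16–32 bytes per step using vector registers.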


9. API Compatibility

WhiskeySour passes a 450-test unit suite (508 tests including integration coverage), all modelled directly on BeautifulSoup's documented behaviour. The shim layer in python/whiskeysour/__init__.py translates between the Rust node types and Python objects that satisfy every public BeautifulSoup API contract.

The compatibility strategy is deliberate: the Rust layer exposes only what it is fast at (tree storage, traversal, matching), and the Python shim handles the ergonomic API surface (property access, string formatting, lazy wrapping). No Python object is created until explicitly requested by the caller.

BeautifulSoup API                                                      Status
find(), find_all()                                                     ✓ Full
select(), select_one()                                                 ✓ Full (CSS3 + :has, :is, :where)
get_text(), .string, .strings                                          ✓ Full
Tree navigation: .parent, .children, .siblings                         ✓ Full
Mutation: append(), prepend(), insert(), decompose(), replace_with()   ✓ Full
NavigableString, Comment, CData                                        ✓ Full (.name is None, identical to bs4)
Multi-valued attributes (class, rel, …)                                ✓ Full
Encoding detection and encode()                                        ✓ Full
prettify(indent_width=) / prettify(indent=)                            ✓ Both forms supported
decode_contents(), encode_contents()                                   ✓ Full
Streaming parser (StreamParser, parse_stream())                        ✓ Full
Compiled selectors (soup.compile())                                    ✓ Full (CompiledSelector object)

Idiom compatibility: the most common BS4 pattern for distinguishing element nodes from text nodes — if child.name: — works identically in WhiskeySour because NavigableString.name is None, just as in BeautifulSoup.

10. How It Compares to lxml and Selectolax

lxml and Selectolax are the two libraries most commonly cited as "fast alternatives to BeautifulSoup." Both are genuinely fast. But they solve a narrower problem, and each has architectural constraints that WhiskeySour does not share.

lxml

lxml wraps libxml2, a C library originally written for the GNOME project in 1999. Its parsing speed is excellent, and it supports both XPath and CSS selectors via the cssselect add-on. For many use cases it is the right choice.

The key limitation is that lxml is not HTML5-compliant. libxml2 has its own error-recovery heuristics that diverge from the HTML5 parsing specification in several hundred edge cases. This means lxml and a browser can produce different trees from the same malformed HTML — a real problem for scraping, where the HTML you receive is almost never well-formed. html5ever, which WhiskeySour uses, implements the exact same tree-construction algorithm as Chrome, Firefox, and Safari.

The most common lxml trap: a scraper works perfectly in development, then silently produces wrong results in production because the live site has slightly different malformed HTML that libxml2 and browsers parse differently. WhiskeySour always produces the browser's tree.

The second issue is that lxml's Python bindings create a Python wrapper object for every node on access. The underlying tree is compact C memory, but as soon as you call .cssselect() or iterate .getchildren(), Python objects are allocated for each result. WhiskeySour's Rust tree is accessed via integer indices; Python objects are only created for the final result set, not for every intermediate node touched during traversal.

lxml also has no parallel traversal. Its C internals are not thread-safe for concurrent reads, so the GIL cannot safely be released during tree operations. WhiskeySour's arena-allocated, immutable-during-search tree allows the GIL to be released for the full duration of any find_all or select call.

Finally, lxml's API is fundamentally different from BeautifulSoup's. Teams using lxml directly cannot simply swap it out — the tree navigation model (getparent(), getchildren(), XPath), the attribute access pattern, and the serialisation methods are all different. BeautifulSoup can use lxml as a backend, but then lxml's speed advantage mostly disappears because BeautifulSoup still converts the entire lxml tree into Python objects.

Property             WhiskeySour                     lxml
Parser core          html5ever (Rust, HTML5 spec)    libxml2 (C, own heuristics)
HTML5 compliant      Yes, matches browsers exactly   Partial, diverges on edge cases
Memory model         ~40 bytes/node, arena           C heap + Python wrappers on access
CSS selectors        Compiled DFA, LRU-cached        cssselect add-on, re-parsed each call
Parallel traversal   Yes (Rayon, GIL released)       No (C internals not thread-safe)
BS4-compatible API   Yes, drop-in replacement        No, different API entirely
Tree mutation        Full BS4 mutation API           Different API, limited via BS4 wrapper

Selectolax

Selectolax is a Python library wrapping Lexbor, a C HTML parsing and CSS matching library. It is genuinely fast — parse times are comparable to lxml, and CSS selection is very quick. For pipelines that only need to extract nodes by CSS selector and read attribute values, it is an excellent tool.

The constraint is that Selectolax's API is intentionally minimal. It has no find(), no find_all(), no NavigableString, no get_text() with separator control, no tree mutation, no prettify(), and no attribute list handling (multi-valued class attributes are returned as raw strings). It is a selector engine with a thin Python wrapper, not a document manipulation library.

Selectolax's own documentation describes it as a "fast HTML parser with CSS selectors." It does not attempt BeautifulSoup parity. Teams that need the full BS4 API cannot use Selectolax without rewriting their parsing code from scratch.

Selectolax also wraps a C library. This means it shares lxml's constraints around thread safety and GIL release: the underlying Lexbor tree is not designed for concurrent access, so parallel traversal is not possible. WhiskeySour's Rust ownership model lets the compiler verify the safety of concurrent reads statically.

Like lxml, Selectolax uses Lexbor's own HTML5-like parser rather than a fully spec-compliant implementation. The gap is smaller than with libxml2 but still present in certain error-recovery cases.

Property              WhiskeySour                       Selectolax
Parser core           html5ever (Rust, HTML5 spec)      Lexbor (C, HTML5-like)
HTML5 compliant       Yes                               Mostly, with some gaps
find() / find_all()   Yes                               No
Tree navigation API   Full (.parent, .children, …)      Minimal (.parent, .next)
Tree mutation         Yes (append, insert, decompose)   No
NavigableString       Yes                               No
get_text() control    separator, strip, types           .text() only, no options
Multi-valued attrs    Yes (class → list)                No (raw string only)
Parallel traversal    Yes                               No
BS4-compatible API    Yes, drop-in replacement          No, requires rewrite

The full picture

Each library occupies a different position in the trade-off space.

The practical position: if you are already using BeautifulSoup and need it to be faster, WhiskeySour is the only option that does not require rewriting your parsing code. lxml and Selectolax are faster than vanilla BS4 in isolation but require either a different API or losing features your code depends on.

11. Outcome

WhiskeySour demonstrates that a full, spec-compliant BeautifulSoup replacement can be built in Rust without sacrificing the Python API that makes BeautifulSoup worth using in the first place.

The performance improvements are not incremental. Parsing is 11× faster in a dev build (2–3× more in a release build), CSS selectors are 14× faster, serialisation is 50× faster, and memory consumption per node is reduced by 12×. These are structural gains that come from the architecture — they apply to every document, every operation, and every version of the application that uses WhiskeySour.

What this means in practice: a data pipeline that spends 5 minutes per hour parsing HTML could spend under 30 seconds instead. A scraping service that needs 40 workers to keep up with throughput might need 3–4. A web application that sees 50ms parse latency spikes could see under 5ms instead.

The core API is complete and verified by a 450-unit / 508-total test suite that covers parsing, finding, CSS selectors, tree navigation, mutation, serialisation, encoding, streaming, and compiled queries — all modelled on BeautifulSoup's documented behaviour. The streaming parser (StreamParser, parse_stream()), compiled selectors (compile()), and all tree mutation operations are fully implemented.

WhiskeySour is the answer to: "I need BeautifulSoup to be fast enough to use in production."