RecipeStripper

Recipe Extractor

Pull clean recipes from any website. Four extraction methods, 120+ supported sites, AI fallback for everything else.

How Recipe Data Gets onto Webpages

Every major recipe site embeds machine-readable recipe data into its pages. This isn't for your benefit; it's for Google's. Structured data tells search engines "this is a recipe page, here are the ingredients, here's the cook time, here are the steps." Google rewards this with rich results: the star ratings, cook times, and ingredient previews shown alongside recipe links in search.

The two dominant formats are JSON-LD (a block of JSON embedded in a script tag) and Microdata (HTML attributes on elements). Both use the Schema.org Recipe schema to describe the same information with different syntax. Either format contains everything you need: a list of ingredient lines, each carrying its quantity as text, and a sequential list of instruction steps.

Recipe extractors read this data directly. The prose, ads, and pop-ups are just HTML around the data — they get ignored. RecipeStripper reads the data and rebuilds a clean display from it.

RecipeStripper's Four-Tier Extraction System

Not every site implements structured data correctly, and some don't implement it at all. RecipeStripper handles this with a cascade of four extraction methods:

Tier 1

JSON-LD Parser

~70% of recipe sites

Reads the application/ld+json script blocks embedded in the page. Handles arrays, @graph wrappers, HowToSection groupings, and malformed JSON gracefully. This covers the majority of WordPress-based food blogs using plugins like WP Recipe Maker (WPRM) and Tasty Recipes.
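
As a rough illustration of what a JSON-LD tier involves, here is a minimal Python sketch. It is not RecipeStripper's actual code: the function names (extract_json_ld_recipe, iter_nodes, flatten_steps) are hypothetical, the script tag is located with a simple regular expression, and malformed blocks are simply skipped rather than repaired.

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script> blocks.
LD_JSON_RE = re.compile(
    r"<script[^>]*type=[\"']application/ld\+json[\"'][^>]*>(.*?)</script>",
    re.IGNORECASE | re.DOTALL,
)


def iter_nodes(data):
    """Yield every dict in a payload, descending into arrays and @graph."""
    if isinstance(data, list):
        for item in data:
            yield from iter_nodes(item)
    elif isinstance(data, dict):
        yield data
        yield from iter_nodes(data.get("@graph", []))


def is_recipe(node):
    """@type may be a single string or a list of strings."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return "Recipe" in types


def flatten_steps(instructions):
    """Flatten strings, HowToStep objects, and HowToSection groups."""
    if isinstance(instructions, (str, dict)):
        instructions = [instructions]
    steps = []
    for item in instructions or []:
        if isinstance(item, str):
            if item.strip():
                steps.append(item.strip())
        elif isinstance(item, dict):
            if "itemListElement" in item:  # a HowToSection wrapping steps
                steps.extend(flatten_steps(item["itemListElement"]))
            elif item.get("text"):
                steps.append(item["text"].strip())
    return steps


def extract_json_ld_recipe(html):
    """Return the first Recipe found in the page's JSON-LD, or None."""
    for match in LD_JSON_RE.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # assumption: skip malformed blocks instead of failing
        for node in iter_nodes(data):
            if is_recipe(node):
                ingredients = node.get("recipeIngredient", [])
                if isinstance(ingredients, str):
                    ingredients = [ingredients]
                return {
                    "name": node.get("name", ""),
                    "ingredients": list(ingredients),
                    "steps": flatten_steps(node.get("recipeInstructions")),
                }
    return None
```

A production parser would likely also attempt to repair common JSON mistakes (trailing commas, unescaped newlines) before giving up on a block.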

Tier 2

Microdata Parser

~15% of recipe sites

Reads Schema.org Recipe attributes embedded directly on HTML elements. Older recipe implementations, some large publishers, and custom CMSes use this format instead of JSON-LD.
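
A Microdata tier can be sketched with Python's standard html.parser module. The class and function names below are hypothetical, and the sketch makes simplifying assumptions: it expects well-formed HTML with closing tags, reads only three properties, and takes the first name it sees as the recipe title.

```python
from html.parser import HTMLParser

WANTED = {"name", "recipeIngredient", "recipeInstructions"}
VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}


class RecipeMicrodataParser(HTMLParser):
    """Collects the text of selected itemprop elements inside a Recipe."""

    def __init__(self):
        super().__init__()
        self.in_recipe = False
        self.props = {prop: [] for prop in WANTED}
        self._prop = None    # property currently being captured
        self._depth = 0      # nesting depth inside the captured element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "schema.org/Recipe" in (attrs.get("itemtype") or ""):
            self.in_recipe = True
        if self._prop:
            # Already capturing: only track nesting so we know when to stop.
            if tag not in VOID_TAGS:
                self._depth += 1
            return
        prop = attrs.get("itemprop")
        if not self.in_recipe or prop not in WANTED:
            return
        if attrs.get("content"):
            # e.g. <meta itemprop="name" content="...">
            self.props[prop].append(attrs["content"].strip())
        elif tag not in VOID_TAGS:
            self._prop, self._depth, self._chunks = prop, 1, []

    def handle_data(self, data):
        if self._prop:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if not self._prop or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            text = " ".join("".join(self._chunks).split())
            if text:
                self.props[self._prop].append(text)
            self._prop = None


def extract_microdata_recipe(html):
    """Return a recipe dict, or None if the page has no Recipe item."""
    parser = RecipeMicrodataParser()
    parser.feed(html)
    if not parser.in_recipe:
        return None
    names = parser.props["name"]
    return {
        "name": names[0] if names else "",
        "ingredients": parser.props["recipeIngredient"],
        "steps": parser.props["recipeInstructions"],
    }
```

Real pages often omit closing tags or nest other items (such as an author with its own name property) inside the recipe, so a robust implementation would need proper scope tracking.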

Tier 3

Heuristic HTML Parser

~10% of recipe sites

For pages with no structured data, the heuristic parser uses CSS selectors, heading patterns, and content structure to identify and extract ingredient lists and instruction steps. Works on many personal recipe sites and older implementations.
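
The heading-pattern part of this idea can be sketched as follows. This is an illustrative simplification with hypothetical names: it only looks for list items that follow an "Ingredients" or "Instructions"-style heading, and it does not cover the CSS-selector or content-structure signals mentioned above.

```python
import re
from html.parser import HTMLParser

# Assumed heading vocabulary; a real parser would need a longer list.
SECTION_PATTERNS = {
    "ingredients": re.compile(r"^\s*ingredients\b", re.IGNORECASE),
    "steps": re.compile(
        r"^\s*(instructions|directions|method|preparation)\b", re.IGNORECASE
    ),
}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingListParser(HTMLParser):
    """Assigns <li> items to the section named by the preceding heading."""

    def __init__(self):
        super().__init__()
        self.sections = {"ingredients": [], "steps": []}
        self._section = None  # section named by the most recent heading
        self._capture = None  # "heading" or "li" while inside one
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._capture, self._chunks = "heading", []
        elif tag == "li" and self._section:
            self._capture, self._chunks = "li", []

    def handle_data(self, data):
        if self._capture:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        text = " ".join("".join(self._chunks).split())
        if tag in HEADINGS and self._capture == "heading":
            # An unrelated heading (e.g. "Comments") resets the section.
            self._section = next(
                (name for name, pattern in SECTION_PATTERNS.items()
                 if pattern.search(text)),
                None,
            )
            self._capture = None
        elif tag == "li" and self._capture == "li":
            if text:
                self.sections[self._section].append(text)
            self._capture = None


def extract_heuristic_recipe(html):
    """Return ingredient and step lists guessed from headings and lists."""
    parser = HeadingListParser()
    parser.feed(html)
    return parser.sections
```

Either list may come back empty; deciding whether the result counts as a complete recipe is left to the caller.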

Tier 4

AI Fallback (GPT-4o-mini)

~5% of recipe sites

For completely unstructured pages where all other methods fail, the page content is sent to GPT-4o-mini with instructions to identify and structure the recipe. This is rate-limited per IP address to control API costs.
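
Per-IP rate limiting of this kind can be sketched with a sliding window. The class name, the limits, and the in-memory storage below are all assumptions for illustration; the actual limits and mechanism RecipeStripper uses are not stated here.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Allows at most max_calls per IP within a sliding time window."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._calls = defaultdict(deque)  # ip -> timestamps of recent calls

    def allow(self, ip):
        now = self.clock()
        calls = self._calls[ip]
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True


# Hypothetical limit: 5 AI extractions per IP per hour.
limiter = PerIpRateLimiter(max_calls=5, window_seconds=3600)
if limiter.allow("203.0.113.7"):
    pass  # only here would the page content be sent to the AI model
```

A deployment with more than one server process would need shared storage for the counters instead of a local dictionary.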

These tiers run in sequence. If a tier produces a complete recipe (with both ingredients and instructions), extraction stops there. If the result is incomplete or empty, the next tier runs. Because each tier covers cases the previous one misses, RecipeStripper can extract recipes from sites that single-method tools can't handle.
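
The cascade logic described above can be sketched in a few lines of Python. The names run_cascade and is_complete are hypothetical, and the tiers in the usage example are stand-in functions rather than real parsers.

```python
def is_complete(recipe):
    """A recipe counts as complete when it has ingredients and steps."""
    return bool(recipe) and bool(recipe.get("ingredients")) and bool(recipe.get("steps"))


def run_cascade(html, tiers):
    """Try each (name, extractor) pair in order; stop at the first complete result."""
    for name, extractor in tiers:
        recipe = extractor(html)
        if is_complete(recipe):
            return name, recipe
    return None, None


# Usage with stand-in tiers: the first finds nothing, the second returns
# an incomplete result, and the third succeeds.
tiers = [
    ("json-ld", lambda html: None),
    ("microdata", lambda html: {"ingredients": ["2 eggs"], "steps": []}),
    ("heuristic", lambda html: {"ingredients": ["2 eggs"], "steps": ["Fry the eggs."]}),
]
tier_name, recipe = run_cascade("<html></html>", tiers)
print(tier_name)  # heuristic
```

Ordering the tiers from cheapest and most reliable to most expensive means the AI fallback only runs when nothing else has produced a usable recipe.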

Handling Bot-Protected Sites

Some recipe sites use bot-detection systems that block server-side requests, even those mimicking normal browser behavior. RecipeStripper addresses this with two additional strategies:

  • Headless browser fallback: A full Chromium instance with stealth settings runs the page as if a real user opened it, executing JavaScript and handling dynamic content.
  • Wayback Machine fallback: For completely blocked sites, RecipeStripper checks archive.org for a recent cached version of the page that can be extracted without bot detection interference.

A handful of sites (notably some Dotdash Meredith properties) use PerimeterX bot detection that defeats even stealth browser approaches. RecipeStripper provides a clear error message for these rather than showing a partial or empty result.
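
For the archive lookup, one plausible approach is the Internet Archive's public availability endpoint, which returns the closest snapshot for a URL as JSON. Whether RecipeStripper uses this particular endpoint is an assumption, and the function names below are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(page_url):
    """Build the availability query for a page URL."""
    return f"{WAYBACK_API}?{urlencode({'url': page_url})}"


def closest_snapshot(payload):
    """Return the snapshot URL from an availability response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest") or {}
    if closest.get("available") and closest.get("status") == "200":
        return closest.get("url")
    return None


def find_archived_copy(page_url, timeout=10):
    """Query archive.org (network call) and return a snapshot URL or None."""
    with urlopen(availability_url(page_url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

The response also carries a timestamp field, which a caller could check to reject snapshots that are too old to count as recent.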

After Extraction: The Matching Pass

Raw extraction gives you an ingredient list and a list of instruction steps — the same split layout every recipe site uses. RecipeStripper then runs a second pass: the ingredient-matching algorithm.

This pass reads each step and identifies where it references an ingredient from the extracted list. When it finds a match, it embeds the quantity directly into the step text. The result is a recipe where every instruction contains the exact amount of each ingredient it uses — no scrolling required.
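
A much-simplified version of such a matching pass might look like this. It is a sketch under stated assumptions, not RecipeStripper's algorithm: the quantity pattern knows only a handful of units, each ingredient is matched by the last word of its name, and the quantity is appended in parentheses after the first mention in each step.

```python
import re

# Leading digits/fractions, optionally followed by a known unit.
QUANTITY_RE = re.compile(
    r"^(?P<qty>[\d\s./½¼¾]+(?:(?:cups?|tbsp|tsp|g|kg|ml|oz|lbs?|cloves?)\b)?)"
    r"\s*(?P<name>.+)$",
    re.IGNORECASE,
)


def parse_ingredient(line):
    """Split '2 cups all-purpose flour' into ('2 cups', 'all-purpose flour')."""
    match = QUANTITY_RE.match(line.strip())
    if not match:
        return None  # no leading quantity, e.g. "salt"
    name = match.group("name").split(",")[0].strip()  # drop ", melted" etc.
    if not name:
        return None
    return match.group("qty").strip(), name


def inline_quantities(ingredients, steps):
    """Append each ingredient's quantity after its first mention in a step."""
    parsed = [p for p in map(parse_ingredient, ingredients) if p]
    result = []
    for step in steps:
        for qty, name in parsed:
            keyword = name.split()[-1]  # assumption: last word identifies it
            pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
            step = pattern.sub(
                lambda m, qty=qty: f"{m.group(0)} ({qty})", step, count=1
            )
        result.append(step)
    return result


ingredients = ["2 cups all-purpose flour", "3 eggs", "1 tbsp butter, melted", "salt"]
steps = ["Whisk the flour and eggs together.", "Stir in the butter and a pinch of salt."]
for line in inline_quantities(ingredients, steps):
    print(line)
# Whisk the flour (2 cups) and eggs (3) together.
# Stir in the butter (1 tbsp) and a pinch of salt.
```

Keyword matching this naive will miss plurals and synonyms ("egg" vs "eggs", "scallions" vs "green onions"), and it cannot split a quantity across steps that each use part of an ingredient.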

See the "recipe without scrolling" page for a detailed explanation of how inline quantities work in practice.

Frequently Asked Questions

How does recipe extraction work technically?

Recipe extraction reads a webpage's structured data, specifically the Schema.org Recipe markup embedded in the HTML. This markup exists because Google requires it for recipe rich results in search. RecipeStripper's extractor reads it from the page server-side, then runs additional matching and formatting passes to produce the clean inline-quantity display. For sites with missing or malformed markup, it falls back to heuristic HTML parsing or an AI model.

Why can't some recipe sites be extracted?

A small number of sites use bot-detection systems (like PerimeterX or Cloudflare) that block automated requests, even those that look like normal browser traffic. RecipeStripper attempts multiple fetch strategies including a headless browser and archived versions, but some heavily protected sites remain inaccessible. These are the exception — the vast majority of recipe sites extract cleanly.

Does RecipeStripper work with sites that have missing structured data?

Yes. RecipeStripper uses a four-tier parser: JSON-LD structured data, Microdata schema, heuristic HTML parsing (looking for recipe-shaped content), and finally a GPT-4o-mini AI fallback for completely unstructured pages. Even sites with no Schema.org markup can often be successfully extracted.

How accurate is the extraction?

For sites with standard JSON-LD markup (the majority of WordPress-based food blogs), extraction accuracy is very high — ingredients and steps are read directly from machine-readable data with no interpretation required. For heuristic and AI extractions, accuracy varies by site, but RecipeStripper always shows what it extracted so you can verify before cooking.