If you open the source of almost any recipe page on a major food blog and search for "schema.org," you'll find a block of structured data that looks nothing like the visual page you loaded. It's machine-readable metadata — a JSON object describing the recipe in a standardized vocabulary — and it's the primary thing RecipeStripper reads when it extracts a recipe.

Understanding how this works explains both why recipe extraction succeeds most of the time and why it occasionally fails.

What Is Schema.org?

Schema.org is a collaborative vocabulary founded by Google, Microsoft, Yahoo, and Yandex and developed through an open community process. It defines a common set of types and properties for describing things on the web: events, products, organizations, people, places, and recipes.

When you add Schema.org markup to a webpage, you're telling search engines: "this page is a recipe, and here's the machine-readable representation of it." Search engines use this data to power rich results — the recipe carousels, the recipe cards with photos and ratings that appear directly in search results, the cooking time and calorie information displayed below the page title.

Critically, this markup is standardized. Every site that uses Schema.org's Recipe type follows the same property names: recipeIngredient, recipeInstructions, cookTime, recipeYield, name. The values differ, but the structure is the same.
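One detail worth knowing about those shared properties: Schema.org expects time values such as cookTime to be ISO 8601 durations ("PT45M", "PT1H30M") rather than free text. Here is a minimal Python sketch of turning such a value into minutes. The function name duration_to_minutes is made up for this example and is not taken from RecipeStripper's code, and the pattern only covers the day, hour, minute, and second forms.

```python
import re
from typing import Optional

# ISO 8601 duration pattern limited to days, hours, minutes, and seconds
# (e.g. "PT45M", "PT1H30M", "P1DT2H"). Weeks, months, and years are not handled.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_minutes(value: object) -> Optional[int]:
    """Convert an ISO 8601 duration string to whole minutes, or None if unparseable."""
    if not isinstance(value, str):
        return None
    match = _DURATION.match(value.strip().upper())
    # Reject strings like "P" or "PT" that match the pattern but carry no numbers.
    if not match or not any(match.groupdict().values()):
        return None
    parts = {key: int(num) if num else 0 for key, num in match.groupdict().items()}
    total_seconds = (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
    return total_seconds // 60
```

Returning None instead of raising keeps a free-text value such as "45 minutes" from breaking the rest of the extraction; the caller can fall back to showing the raw string.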

The Two Formats: JSON-LD and Microdata

Schema.org markup can be added to a page in two main ways.

JSON-LD (JavaScript Object Notation for Linked Data) is a separate <script> block in the page's HTML. It looks like this:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "name": "Classic Chocolate Chip Cookies",
  "recipeYield": "48 cookies",
  "recipeIngredient": [
    "2 cups all-purpose flour",
    "1 tsp baking soda",
    "1 cup butter, softened"
  ],
  "recipeInstructions": [
    {
      "@type": "HowToStep",
      "text": "Preheat oven to 375°F."
    },
    {
      "@type": "HowToStep",
      "text": "Whisk together flour and baking soda in a medium bowl."
    }
  ]
}
</script>

JSON-LD is completely separate from the visible HTML. You can have a page where the visual design is entirely one thing and the JSON-LD block is another — they don't need to be in sync (though they should be).
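Because the JSON-LD lives in its own script block, pulling it out of a page takes very little code. The sketch below uses only Python's standard library. The names JsonLdCollector and extract_jsonld are illustrative assumptions, not RecipeStripper's actual implementation.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_jsonld(page_html):
    """Return the parsed JSON value of each JSON-LD block, skipping invalid ones."""
    collector = JsonLdCollector()
    collector.feed(page_html)
    collector.close()
    parsed = []
    for raw in collector.blocks:
        try:
            parsed.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # a malformed block should not hide a valid one elsewhere on the page
    return parsed
```

Feeding it the cookie example above would return a one-element list whose first item has "name" set to "Classic Chocolate Chip Cookies".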

Microdata is older and more verbose. Instead of a separate block, the structured data is woven into the HTML elements themselves using itemprop attributes:

<div itemscope itemtype="https://schema.org/Recipe">
  <h1 itemprop="name">Classic Chocolate Chip Cookies</h1>
  <ul>
    <li itemprop="recipeIngredient">2 cups all-purpose flour</li>
    <li itemprop="recipeIngredient">1 tsp baking soda</li>
  </ul>
</div>

Microdata is less common today — JSON-LD is the format Google recommends and WordPress recipe plugins generate by default.

Why Recipe Sites Use Structured Data

The short answer: Google rich results. A recipe page with valid Schema.org markup can appear in Google's recipe carousel — the prominent visual blocks that appear at the top of recipe search results with photos, ratings, and cooking time displayed inline.

The difference in click-through rates between appearing in the recipe carousel and appearing as a standard text result is significant. Recipe bloggers are extremely motivated to implement structured data correctly because it directly affects their traffic. This is why Schema.org implementation in the recipe space is high-quality and consistent in a way it isn't in other content categories.

A WordPress blogger who installs a recipe plugin such as WP Recipe Maker (WPRM) or Tasty Recipes gets Schema.org JSON-LD generated automatically. They don't need to write any JSON; the plugin handles it. This means even small blogs with no technical sophistication tend to have valid structured data.

Roughly 70% of recipe pages carry usable JSON-LD structured data. This isn't a guess: in RecipeStripper's production data, JSON-LD is the successful extraction method for about 70% of recipes processed.

RecipeStripper's 4-Tier Parser Chain

RecipeStripper uses structured data as its first and primary extraction method, with three fallbacks for when it's absent or malformed.

Tier 1: JSON-LD

The parser looks for <script type="application/ld+json"> blocks and finds any object with "@type": "Recipe". It handles the variations that appear in the wild: arrays of schema objects (where only one is the Recipe), @graph wrappers (a common pattern from Yoast SEO), and nested HowToSection objects inside the instructions array.

When JSON-LD is present and valid, extraction takes milliseconds. The data is already structured — there's no HTML parsing involved, no ambiguity about which text is an ingredient versus prose. It's just reading a well-defined JSON object.
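The variations described for this tier can be handled with two small recursive helpers. This is a sketch under stated assumptions rather than RecipeStripper's actual code: find_recipe and flatten_instructions are hypothetical names, and accepting a list-valued "@type" is an extra tolerance added here, not something described above.

```python
def _is_recipe(node):
    """True if a JSON-LD node is typed Recipe ("@type" may be a string or a list)."""
    node_type = node.get("@type")
    types = node_type if isinstance(node_type, list) else [node_type]
    return "Recipe" in types


def find_recipe(data):
    """Return the first Recipe object in parsed JSON-LD, or None.

    Handles a bare object, an array of objects, and an "@graph" wrapper.
    """
    if isinstance(data, list):
        for item in data:
            found = find_recipe(item)
            if found is not None:
                return found
        return None
    if isinstance(data, dict):
        if _is_recipe(data):
            return data
        return find_recipe(data.get("@graph", []))
    return None


def flatten_instructions(instructions):
    """Flatten recipeInstructions into a plain list of step strings.

    HowToSection objects hold their steps under "itemListElement",
    so those are expanded recursively.
    """
    if isinstance(instructions, (str, dict)):
        instructions = [instructions]
    steps = []
    for item in instructions or []:
        if isinstance(item, str):
            if item.strip():
                steps.append(item.strip())
        elif isinstance(item, dict):
            if "itemListElement" in item:
                steps.extend(flatten_instructions(item["itemListElement"]))
            elif item.get("text"):
                steps.append(str(item["text"]).strip())
    return steps
```

Section names are dropped in this version. A parser that wants to keep headings such as "For the dough" would return (section, steps) pairs instead of a flat list.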

Tier 2: Microdata

If no JSON-LD is found (or the JSON-LD is malformed), the parser looks for Microdata using CSS selectors targeting itemtype="https://schema.org/Recipe" and extracting itemprop values. This covers older sites that implemented structured data before JSON-LD became dominant.
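A Microdata pass can be sketched as follows. To keep the example dependency-free it swaps CSS selectors for Python's built-in html.parser, so it illustrates the idea and does not reproduce the approach described above. RecipeMicrodataParser and extract_microdata_recipe are hypothetical names, and the depth tracking assumes every non-void tag is explicitly closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class RecipeMicrodataParser(HTMLParser):
    """Collect itemprop values found inside a schema.org/Recipe itemscope."""

    def __init__(self):
        super().__init__()
        self.props = {}   # itemprop name -> list of values
        self._depth = 0   # nesting depth inside the Recipe scope (0 = outside)
        self._open = []   # stack of (prop, depth, text_parts) for open itemprop elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth == 0:
            itemtype = (attrs.get("itemtype") or "").rstrip("/")
            # endswith() accepts both the http:// and https:// forms of the type URL.
            if ("itemscope" in attrs and tag not in VOID_TAGS
                    and itemtype.endswith("schema.org/Recipe")):
                self._depth = 1
            return
        prop = attrs.get("itemprop")
        if tag in VOID_TAGS:
            value = attrs.get("content") or attrs.get("src") or attrs.get("href")
            if prop and value:
                self.props.setdefault(prop, []).append(value)
            return
        self._depth += 1
        if prop:
            if attrs.get("content"):
                self.props.setdefault(prop, []).append(attrs["content"])
            else:
                self._open.append((prop, self._depth, []))

    def handle_data(self, data):
        for _, _, parts in self._open:
            parts.append(data)

    def handle_endtag(self, tag):
        if self._depth == 0 or tag in VOID_TAGS:
            return
        if self._open and self._open[-1][1] == self._depth:
            prop, _, parts = self._open.pop()
            text = " ".join("".join(parts).split())
            if text:
                self.props.setdefault(prop, []).append(text)
        self._depth -= 1


def extract_microdata_recipe(page_html):
    parser = RecipeMicrodataParser()
    parser.feed(page_html)
    parser.close()
    return parser.props
```

Run against the Microdata snippet shown earlier, this would yield one "name" value and two "recipeIngredient" values. Real-world pages with unclosed tags or nested itemscopes would need a sturdier HTML library.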

Tier 3: Heuristic

Some sites have no structured data at all, especially older sites, foreign-language sites, and sites that predate the recipe plugin era. The heuristic parser attempts to identify recipe content from structural patterns in the HTML: headings that look like section labels ("Ingredients," "Instructions," "Directions"), lists that look like ingredient lists, and numbered lists that look like instructions.

This parser is less reliable than the first two because it's guessing based on structure rather than reading explicit semantic markup. It works well on well-formatted pages and poorly on sites with unusual layouts.
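To make the guessing concrete, here is a deliberately simple sketch of the heading-then-list pattern: remember the most recent heading that looks like a section label, and assign the list items that follow to that section. HeuristicRecipeParser and heuristic_extract are hypothetical names, and a production heuristic would need many more signals than this.

```python
import re
from html.parser import HTMLParser

# Section labels taken from the examples in the text; real pages use many more.
SECTION_LABELS = {
    "ingredients": re.compile(r"^\s*ingredients\b", re.IGNORECASE),
    "instructions": re.compile(r"^\s*(instructions|directions)\b", re.IGNORECASE),
}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeuristicRecipeParser(HTMLParser):
    """Assign <li> items to the section named by the nearest preceding heading."""

    def __init__(self):
        super().__init__()
        self.sections = {"ingredients": [], "instructions": []}
        self._current = None   # section chosen by the most recent heading, if any
        self._capture = None   # "heading" or "item" while text is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._capture, self._text = "heading", []
        elif tag == "li" and self._current:
            self._capture, self._text = "item", []

    def handle_data(self, data):
        if self._capture:
            self._text.append(data)

    def handle_endtag(self, tag):
        text = " ".join("".join(self._text).split())
        if tag in HEADINGS and self._capture == "heading":
            # A heading that matches no label (e.g. "Notes") ends the current section.
            self._current = next(
                (name for name, pattern in SECTION_LABELS.items() if pattern.match(text)),
                None,
            )
            self._capture = None
        elif tag == "li" and self._capture == "item":
            if text and self._current:
                self.sections[self._current].append(text)
            self._capture = None


def heuristic_extract(page_html):
    parser = HeuristicRecipeParser()
    parser.feed(page_html)
    parser.close()
    return parser.sections
```

The weakness described above is easy to see here: a page that puts ingredients in paragraphs or table cells instead of list items would produce empty sections.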

Tier 4: GPT-4o-mini Fallback

When all three parsers fail — or when they succeed but the result is missing ingredients or instructions — the page content is sent to GPT-4o-mini with a prompt asking it to extract the recipe in a structured format.

This catches the long tail: hand-coded pages, unusual templates, recipes buried in narrative prose without any structural signals. The tradeoff is latency (1-3 seconds additional) and cost (the AI call), so there's a rate limiter to prevent abuse.
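Put together, the chain is a loop with a completeness check. The sketch below is an assumption about its shape, not RecipeStripper's real code: the parsers and the model call are passed in as plain callables (no specific AI API is implied), and the "ingredients" and "instructions" keys are invented for the example.

```python
from typing import Callable, Optional, Sequence, Tuple

Recipe = dict
Parser = Callable[[str], Optional[Recipe]]


def is_complete(recipe: Optional[Recipe]) -> bool:
    """A result only counts if it has both ingredients and instructions."""
    return bool(recipe) and bool(recipe.get("ingredients")) and bool(recipe.get("instructions"))


def extract_with_fallbacks(
    page_html: str,
    parsers: Sequence[Tuple[str, Parser]],
    llm_fallback: Parser,
) -> Tuple[Optional[str], Optional[Recipe]]:
    """Try each (name, parser) in order and return the first complete result.

    llm_fallback is only called when every cheaper parser fails or returns
    an incomplete recipe, since it is the slow and costly tier.
    """
    for name, parser in parsers:
        try:
            recipe = parser(page_html)
        except Exception:
            recipe = None  # a crashing tier is treated the same as a miss
        if is_complete(recipe):
            return name, recipe
    recipe = llm_fallback(page_html)
    if is_complete(recipe):
        return "llm", recipe
    return None, None
```

Returning the tier name alongside the recipe makes it cheap to record which method succeeded, which is the kind of data behind a figure like the 70% JSON-LD share mentioned earlier. A rate limiter would wrap the llm_fallback call.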

What This Means for Extraction Reliability

The four-tier chain produces high extraction reliability across the recipe web, but not perfect reliability. The failure modes are:

Bot protection: Some sites (notably Serious Eats, a Dotdash Meredith property, and The Kitchn) use PerimeterX, a bot protection service that can detect and block automated fetching. RecipeStripper can't read the HTML if it can't load the page. Structured data doesn't help if the server never sends it.

Paywalled content: If a recipe requires login to view, the structured data in the HTML may only show a partial recipe. What gets extracted is what's publicly visible.

Malformed structured data: Some sites have JSON-LD blocks with schema.org markup that violates the spec — missing required fields, wrong data types, encoding issues. The parser is tolerant of common variations but some malformed data defeats extraction even when the JSON-LD is present.

Dynamic rendering: Some recipe sites render their content entirely via JavaScript — the initial HTML is a shell, and the recipe content only appears after JavaScript runs. RecipeStripper handles this with a headless browser fallback, but it adds latency and resource cost.
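For the malformed structured data case above, much of the tolerance comes down to coercing fields into the expected shape before using them. The following sketch shows the idea for a few common deviations (a single string where a list is expected, a list or number for the yield, HTML entities left in text). The function names and the output keys are assumptions made for this example.

```python
import html


def as_list(value):
    """Wrap a scalar in a list; pass lists through; treat None as empty."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def clean_text(value):
    """Unescape HTML entities and collapse runs of whitespace."""
    return " ".join(html.unescape(str(value)).split())


def normalize_recipe(node):
    """Coerce loosely typed Recipe fields into a predictable shape."""
    ingredients = [
        clean_text(item)
        for item in as_list(node.get("recipeIngredient"))
        if str(item).strip()
    ]
    yields = as_list(node.get("recipeYield"))
    return {
        "name": clean_text(node.get("name", "")),
        "ingredients": ingredients,
        # Some sites publish several yield values; the first one is used here.
        "yield": clean_text(yields[0]) if yields else None,
    }
```

Coercion like this cannot rescue a block that fails to parse as JSON at all, or one that omits the ingredients entirely. Those cases fall through to the later tiers.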

The Irony of Recipe Structured Data

Recipe sites add Schema.org markup to get better placement in Google search results — to drive more traffic to their ad-supported pages. The markup is generated for Google's benefit, not for tools like RecipeStripper.

But the same structured data that makes a recipe show up in Google's carousel is the data RecipeStripper reads to extract the recipe cleanly. The infrastructure food bloggers built for SEO purposes turns out to be equally useful for ad-free recipe extraction. The more carefully a site implements its Schema.org markup (because they care about Google rich results), the more reliably RecipeStripper can extract from it.

It's a serendipitous alignment. The sites that benefit most from structured data — high-traffic food blogs with carefully maintained WordPress installations — are also the sites where recipe extraction works most reliably. The structured data that exists to serve Google ends up serving the cook trying to get clean access to the recipe.

How Recipe Structured Data Works (And Why Most Sites Have It)
