Opensolr Web Crawler Standards

The same web standards that Google expects from your site apply here. If your pages are well-built for Googlebot, they are well-built for the Opensolr Web Crawler.

The Opensolr Web Crawler follows the same rules and conventions as major search engines. It respects robots.txt, honors meta robots directives, requires valid titles, and rewards well-structured metadata. The standards below are not Opensolr-specific — they are universal web standards. A site that follows them will perform well in Google, Bing, and Opensolr alike.

Crawl Pipeline: How pages are evaluated
Performance: Response time limits
URL Standards: Clean URLs & dynamic pages
Robots Directives: robots.txt & meta tags
Content Quality: Titles, dates & metadata
Deduplication: Canonical URLs

How the Crawler Evaluates a Page

Every URL goes through a multi-step evaluation pipeline before it enters your search index. This is the same general process Googlebot follows, and a page that fails any step is not indexed.

1. Fetch: Request the URL. Must respond within 5 seconds.
2. robots.txt: Check if the URL is allowed by robots.txt rules.
3. Meta Tags: Check for noindex, nofollow, or none directives.
4. URL Check: Skip dynamic URLs with query strings by default.
5. Content: Extract title, text, metadata. Title is required.
6. Index: AI enrichment, embeddings, and Solr indexing.
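The steps above can be sketched as a single decision function. This is only an illustration of the order of checks as described on this page, not the crawler's actual implementation; the `FetchedPage` record, its field names, and the returned reason strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class FetchedPage:
    """Hypothetical record of what a crawler knows about one URL."""
    url: str
    ttfb_seconds: Optional[float]  # None means the request timed out
    allowed_by_robots_txt: bool
    meta_robots: str               # content of the robots meta tag, "" if absent
    title: str                     # "" if neither <title> nor og:title was found


def evaluate(page: FetchedPage) -> str:
    """Return "index", or a short reason why the page is not indexed."""
    # 1. Fetch: the server must start responding within 5 seconds.
    if page.ttfb_seconds is None or page.ttfb_seconds > 5.0:
        return "skip: no response within 5 seconds"
    # 2. robots.txt
    if not page.allowed_by_robots_txt:
        return "skip: disallowed by robots.txt"
    # 3. Meta tags: noindex and none both keep the page out of the index.
    directives = {d.strip().lower() for d in page.meta_robots.split(",")}
    if "noindex" in directives or "none" in directives:
        return "skip: noindex directive"
    # 4. URL check: query-string URLs are not indexed by default.
    if urlsplit(page.url).query:
        return "skip: URL has a query string"
    # 5. Content: a title is required.
    if not page.title.strip():
        return "skip: no title"
    # 6. Everything passed, so the page would go on to indexing.
    return "index"
```

Note that `nofollow` alone does not change the result here: as described later on this page, it affects link discovery, not whether the page itself is indexed.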

Performance & Accessibility

5-Second Response Timeout

The page must begin responding within 5 seconds. This is the server response time (Time to First Byte), not the full page download. Pages that time out are skipped entirely.

Why it matters: Google recommends a TTFB under 200 ms and will deprioritize slow pages. If your site takes more than 5 seconds to even start responding, it has a serious infrastructure problem that affects all crawlers, not just ours.
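To check your own pages against this limit, a rough measurement can be made with Python's standard library. This is a minimal sketch: `measure_ttfb` and `within_limit` are hypothetical helper names, and the timing only approximates Time to First Byte because it also includes DNS lookup, connection setup, and any redirects that `urlopen` follows.

```python
import time
import urllib.error
import urllib.request
from typing import Optional

LIMIT_SECONDS = 5.0  # the timeout stated above


def measure_ttfb(url: str, timeout: float = LIMIT_SECONDS) -> Optional[float]:
    """Return seconds until response headers arrive, or None if nothing came back."""
    start = time.monotonic()
    try:
        # urlopen returns once the status line and headers have been received,
        # before the body is downloaded. The timeout applies to each blocking
        # socket operation, not to the request as a whole.
        with urllib.request.urlopen(url, timeout=timeout):
            return time.monotonic() - start
    except urllib.error.HTTPError:
        # A 4xx/5xx status still means the server responded in time.
        return time.monotonic() - start
    except OSError:
        # Covers timeouts, DNS failures, and refused connections.
        return None


def within_limit(ttfb: Optional[float], limit: float = LIMIT_SECONDS) -> bool:
    """True if the server started responding within the limit."""
    return ttfb is not None and ttfb <= limit
```

For example, `within_limit(measure_ttfb("https://example.com/about"))` gives a quick pass/fail answer for one URL.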

HTTP Status Codes

The crawler only indexes pages that return 200 OK. Pages returning 403, 404, 500, or any other error status are recorded but not indexed. Redirects (301/302) are followed, but if a redirect leaves the crawl domain, the page is skipped.
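The status and redirect rules can be expressed as a small function. This is a sketch under one loud assumption: "leaving the crawl domain" is modeled here as the final hostname differing from the starting hostname, which may not match how the crawler actually compares domains (for example, it may treat subdomains differently). The function name and return strings are illustrative.

```python
from urllib.parse import urlsplit


def status_decision(start_url: str, final_url: str, status: int) -> str:
    """Decide what happens to a page given its final URL and HTTP status.

    final_url is the URL reached after following any 301/302 redirects.
    """
    # Assumption: the crawl domain is the exact hostname of the starting URL.
    if urlsplit(final_url).hostname != urlsplit(start_url).hostname:
        return "skip: redirect left the crawl domain"
    if status != 200:
        return f"record only: HTTP {status}"
    return "index"
```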

URL Standards

Dynamic URLs Are Not Indexed

Pages with a ? query string in the URL are followed (the crawler will visit them to discover more links) but are not indexed. This prevents duplicate content from URL parameters like session IDs, tracking codes, sort orders, and pagination tokens — the same reason Google recommends using rel="canonical" on parameterized URLs.

Indexed

https://example.com/products/shoes

https://example.com/blog/my-article

https://example.com/about

Clean, path-based URLs with no query strings.

Not Indexed

https://example.com/search?q=shoes

https://example.com/products?sort=price

https://example.com/page?id=42&ref=home

URLs with ? query parameters are skipped.
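A quick way to audit a list of your own URLs against this rule is to test for a query component. The helper name below is hypothetical, and it assumes that a bare trailing `?` with nothing after it does not count as a query string; the crawler may treat that edge case differently.

```python
from urllib.parse import urlsplit


def has_query_string(url: str) -> bool:
    """True if the URL carries a non-empty ?query component."""
    return bool(urlsplit(url).query)


urls = [
    "https://example.com/products/shoes",
    "https://example.com/search?q=shoes",
    "https://example.com/page?id=42&ref=home",
]

# Keep only the clean, path-based URLs that are eligible for indexing.
indexable = [u for u in urls if not has_query_string(u)]
```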

Robots Directives

The Opensolr Web Crawler respects the same robots directives as Google. If you block a page from Googlebot, it will also be blocked from the Opensolr crawler.

robots.txt

Pages that are disallowed in your site's /robots.txt file will not be crawled or indexed. The crawler checks robots.txt before requesting any URL, just like Googlebot does.

If you want certain sections of your site excluded from search results, robots.txt is the standard way to do it — and it works identically here.
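You can preview how a set of robots.txt rules applies to specific URLs with Python's built-in `urllib.robotparser`. This is a sketch for testing your own rules, not a description of the crawler's internals. The user agent string `ExampleCrawler` is a placeholder, since this page does not state the crawler's user agent token, and the `/private/` rule is an invented example.

```python
from urllib.robotparser import RobotFileParser

# Example rules; substitute the contents of your own /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url: str, user_agent: str = "ExampleCrawler") -> bool:
    """True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

With these rules, URLs under `/private/` are disallowed for every user agent, and all other paths remain crawlable.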

Meta Robots Tags

`noindex`

Pages with <meta name="robots" content="noindex"> are crawled (to discover links) but not added to the search index. This is identical to how Google handles noindex — the page exists on your server but will never appear in search results.

`nofollow`

Pages with <meta name="robots" content="nofollow"> are indexed normally, but the crawler will not follow any outgoing links on that page. Use this on pages where you don't want the crawler discovering linked content (e.g., user-generated content with untrusted links).

`none`

Equivalent to noindex, nofollow combined. The page is neither indexed nor followed. This is the strictest directive — the page is effectively invisible to search.

Page Will Be Indexed

No robots meta tag at all (default: index, follow)

<meta name="robots" content="index, follow">

<meta name="robots" content="index">

Page Will NOT Be Indexed

<meta name="robots" content="noindex">

<meta name="robots" content="noindex, follow">

<meta name="robots" content="none">
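The directive rules above reduce to two yes/no answers per page: is it indexed, and are its links followed. A minimal sketch using the standard library's `html.parser` follows; the class and function names are hypothetical, and it handles only the `name="robots"` meta tag described on this page.

```python
from html.parser import HTMLParser
from typing import Tuple


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        self.directives |= {p.strip().lower() for p in content.split(",") if p.strip()}


def robots_policy(html: str) -> Tuple[bool, bool]:
    """Return (index, follow) for a page; the default is (True, True)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    index = not ({"noindex", "none"} & parser.directives)
    follow = not ({"nofollow", "none"} & parser.directives)
    return index, follow
```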

Content Quality Requirements

Page Title (Required)

Every page must have a title — either via <title> or <meta property="og:title">. Pages without any title are skipped entirely. This is a hard requirement, not a suggestion.

Titles should also be unique across the site whenever possible. Duplicate titles make it harder for users to distinguish between results. Google flags duplicate titles as an issue in Search Console — it is equally problematic for site search.
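To check whether a page satisfies the title requirement, you can extract both candidates from its HTML. The sketch below is illustrative only: the names are hypothetical, and preferring `<title>` over `og:title` when both exist is an assumption, since this page does not say which one takes precedence.

```python
from html.parser import HTMLParser
from typing import Optional


class TitleParser(HTMLParser):
    """Captures the <title> text and the og:title meta value, if present."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title_text = ""
        self.og_title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            if (attributes.get("property") or "").lower() == "og:title":
                self.og_title = (attributes.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text += data


def extract_title(html: str) -> Optional[str]:
    """Return the page title, or None if the page has no usable title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    # Assumption: <title> wins over og:title when both are present.
    return parser.title_text.strip() or parser.og_title or None
```

A page for which `extract_title` returns `None` would be skipped under the rule above.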

Creation / Publication Date (Recommended)

Pages should include a publication or modification date via one of these meta tags:

<meta property="article:published_time" content="2026-03-15T10:00:00Z">
<meta property="og:updated_time" content="2026-03-15T10:00:00Z">

Why it matters: Without a date, the crawler cannot determine content freshness. Opensolr's Search Tuning includes a freshness boost that prioritizes newer content — but it only works when pages have dates. This applies to articles, blog posts, product pages, and any content where recency matters.
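Before publishing, it can help to confirm that your date values actually parse as ISO 8601 timestamps like the ones shown above. The helper below is a hypothetical validation sketch, not the crawler's parser. Treating a timestamp without a UTC offset as UTC is an assumption made here for illustration.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_meta_date(value: str) -> Optional[datetime]:
    """Parse an ISO 8601 meta tag value; return None if it is not valid."""
    text = value.strip()
    # Older Python versions do not accept a trailing "Z" in fromisoformat,
    # so rewrite it as an explicit UTC offset first.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```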

Author (Best Practice)

Include an author tag via <meta name="author">, <meta property="og:author">, or <meta property="article:creator">. Even a generic value like "Admin" or "Editorial Team" is better than nothing — it provides structured data that can be used for filtering and faceting in the future.

Category (Best Practice)

A <meta property="og:category"> or <meta property="article:section"> tag helps with faceted search — allowing users to filter results by topic. Google uses similar signals to understand content classification.

Duplicate Content & Canonical URLs

When two or more URLs serve the same content, they should declare a canonical URL:

<link rel="canonical" href="https://example.com/the-real-page">

Without a canonical tag, the crawler indexes every URL it finds. If the same content exists at /products/shoes and /catalog/shoes, both will appear in search results — creating duplicates that confuse users and dilute relevance.

This is identical to how Google handles canonicalization. If you have already set canonical tags for Google, they will work for the Opensolr Web Crawler automatically.

Correct

Both /products/shoes and /catalog/shoes include:

<link rel="canonical" href="/products/shoes">

Result: Only /products/shoes appears in search results.

Problem

Neither page has a canonical tag.

Result: Both appear in search — duplicate results for the same content.
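The effect of canonical tags on the two scenarios above can be illustrated by grouping crawled URLs under the URL they declare as canonical. This is a sketch of the idea rather than the crawler's actual deduplication logic. The function names are hypothetical, and relative `href` values are resolved against the page URL with `urljoin`.

```python
from typing import Dict, List, Optional, Tuple
from urllib.parse import urljoin


def index_key(page_url: str, canonical_href: Optional[str] = None) -> str:
    """Return the URL a page would be indexed under."""
    if not canonical_href:
        # No canonical tag: the page is indexed under its own URL.
        return page_url
    # A relative href such as "/products/shoes" is resolved against the page URL.
    return urljoin(page_url, canonical_href)


def group_by_canonical(pages: List[Tuple[str, Optional[str]]]) -> Dict[str, List[str]]:
    """Map each index key to the crawled URLs that resolve to it."""
    groups: Dict[str, List[str]] = {}
    for url, canonical in pages:
        groups.setdefault(index_key(url, canonical), []).append(url)
    return groups
```

When both `/products/shoes` and `/catalog/shoes` declare `/products/shoes` as canonical, the grouping has a single key. When neither declares one, each URL becomes its own key, which corresponds to the duplicate results described above.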

Quick Reference

Standard	Requirement Level	What Happens
Response time > 5 seconds	Required	Page is skipped — not crawled at all
URL has `?` query string	Required	Page is visited for link discovery but not indexed
Blocked in robots.txt	Required	Page is not crawled or indexed
Meta `noindex`	Required	Page is crawled for links but not indexed
Meta `nofollow`	Required	Page is indexed but outgoing links are not followed
Meta `none`	Required	Page is neither indexed nor followed
No page title	Required	Page is skipped entirely
Unique titles across site	Recommended	Prevents confusing duplicate results
Publication date meta tag	Recommended	Enables freshness boost in search ranking
Author meta tag	Best Practice	Better data structure for future faceting
Category meta tag	Best Practice	Enables topic-based filtering in results
Canonical URL tag	Recommended	Prevents duplicate content in search index

Ready to Add Search to Your Site?

The Opensolr Web Crawler indexes your entire site — HTML, PDF, DOCX — with AI enrichment, vector embeddings, and hybrid search built in.
