Opensolr Web Crawler Standards
The same web standards that Google expects from your site apply here. If your pages are well-built for Googlebot, they are well-built for the Opensolr Web Crawler.
The Opensolr Web Crawler follows the same rules and conventions as major search engines. It respects robots.txt, honors meta robots directives, requires valid titles, and rewards well-structured metadata. The standards below are not Opensolr-specific — they are universal web standards. A site that follows them will perform well in Google, Bing, and Opensolr alike.
How the Crawler Evaluates a Page
Every URL goes through a multi-step evaluation pipeline before it enters your search index. This is the same general process Googlebot follows, and a URL that fails a required check never reaches the index.
1. Fetch: request the URL. The server must respond within 5 seconds.
2. robots.txt: check whether the URL is allowed by robots.txt rules.
3. Meta tags: check for noindex, nofollow, or none directives.
4. URL check: skip dynamic URLs with query strings by default.
5. Content: extract the title, text, and metadata. A title is required.
6. Index: AI enrichment, embeddings, and Solr indexing.
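The pipeline above can be sketched as a single decision function. This is a simplified model, not the crawler's actual implementation: the check results are passed in as plain arguments, and the three outcomes ("indexed", "followed-only", "skipped") are labels chosen here for illustration.

```python
from urllib.parse import urlparse

# Hypothetical model of the evaluation pipeline described above.
# Each argument stands in for the result of one pipeline step.
def evaluate(url, status, robots_allowed, meta_robots, title):
    if status != 200:                      # Fetch: only 200 OK proceeds
        return "skipped"
    if not robots_allowed:                 # robots.txt: disallowed URLs are skipped
        return "skipped"
    directives = {d.strip().lower()
                  for d in (meta_robots or "").split(",") if d.strip()}
    if directives & {"noindex", "none"}:   # Meta tags: crawled for links, not indexed
        return "followed-only"
    if urlparse(url).query:                # URL check: dynamic URLs are not indexed
        return "followed-only"
    if not title:                          # Content: a title is required
        return "skipped"
    return "indexed"                       # Index: passes every check
```

For example, `evaluate("https://example.com/about", 200, True, None, "About")` passes every check, while the same URL with a `?ref=home` query string would be followed but not indexed.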
Performance & Accessibility
5-Second Response Timeout
The page must begin responding within 5 seconds. This is the server response time (Time to First Byte), not the full page download. Pages that time out are skipped entirely.
Why it matters: Google recommends a TTFB under 200ms and will deprioritize slow pages. If your site takes more than 5 seconds to even start responding, it has a serious infrastructure problem that affects all crawlers — not just ours.
HTTP Status Codes
The crawler only indexes pages that return 200 OK. Pages returning 403, 404, 500, or any other error status are recorded but not indexed. Redirects (301/302) are followed, but if a redirect leaves the crawl domain, the page is skipped.
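The "redirect leaves the crawl domain" rule can be expressed as a hostname comparison. This sketch assumes "same domain" means an exact hostname match, which is a guess; the crawler may treat subdomains or scheme changes differently.

```python
from urllib.parse import urlparse

# Sketch: should a 301/302 Location target still be followed?
# Assumes the crawl domain is defined by exact hostname equality.
def follow_redirect(original_url, location):
    return urlparse(original_url).hostname == urlparse(location).hostname
```

Under this assumption, a redirect from `https://example.com/a` to `https://example.com/b` is followed, while one to `https://other.com/b` is skipped.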
URL Standards
Dynamic URLs Are Not Indexed
Pages with a ? query string in the URL are followed (the crawler will visit them to discover more links) but are not indexed. This prevents duplicate content from URL parameters like session IDs, tracking codes, sort orders, and pagination tokens — the same reason Google recommends using rel="canonical" on parameterized URLs.
Indexed
https://example.com/products/shoes
https://example.com/blog/my-article
https://example.com/about
Clean, path-based URLs with no query strings.
Not Indexed
https://example.com/search?q=shoes
https://example.com/products?sort=price
https://example.com/page?id=42&ref=home
URLs with ? query parameters are skipped.
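The indexed/not-indexed split above comes down to one check: does the URL carry a query string? A minimal version using Python's standard library:

```python
from urllib.parse import urlparse

# The rule above in code: any URL with a ? query string is followed
# for link discovery but excluded from the index.
def is_indexable_url(url):
    return not urlparse(url).query

print(is_indexable_url("https://example.com/products/shoes"))  # True
print(is_indexable_url("https://example.com/search?q=shoes"))  # False
```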
Robots Directives
The Opensolr Web Crawler respects the same robots directives as Google. If you block a page from Googlebot, it will also be blocked from the Opensolr crawler.
robots.txt
Pages that are disallowed in your site's /robots.txt file will not be crawled or indexed. The crawler checks robots.txt before requesting any URL, just like Googlebot does.
If you want certain sections of your site excluded from search results, robots.txt is the standard way to do it — and it works identically here.
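You can verify how a standards-compliant crawler will interpret your robots.txt using Python's built-in `urllib.robotparser`. The Disallow rules below are illustrative, not from any real site:

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt, evaluated the same way any compliant crawler would.
rules = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/private/report"))   # False
print(rp.can_fetch("*", "https://example.com/blog/my-article"))  # True
```

Any URL for which `can_fetch` returns False will be neither crawled nor indexed.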
Meta Robots Tags
noindex
Pages with <meta name="robots" content="noindex"> are crawled (to discover links) but not added to the search index. This is identical to how Google handles noindex — the page exists on your server but will never appear in search results.
nofollow
Pages with <meta name="robots" content="nofollow"> are indexed normally, but the crawler will not follow any outgoing links on that page. Use this on pages where you don't want the crawler discovering linked content (e.g., user-generated content with untrusted links).
none
Equivalent to noindex, nofollow combined. The page is neither indexed nor followed. This is the strictest directive — the page is effectively invisible to search.
Page Will Be Indexed
No robots meta tag at all (default: index, follow)
<meta name="robots" content="index, follow">
<meta name="robots" content="index">
Page Will NOT Be Indexed
<meta name="robots" content="noindex">
<meta name="robots" content="noindex, follow">
<meta name="robots" content="none">
Content Quality Requirements
Page Title Required
Every page must have a title — either via <title> or <meta property="og:title">. Pages without any title are skipped entirely. This is a hard requirement, not a suggestion.
Titles should also be unique across the site whenever possible. Duplicate titles make it harder for users to distinguish between results. Google flags duplicate titles as an issue in Search Console — it is equally problematic for site search.
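The title lookup described above (use `<title>`, fall back to `og:title`, skip the page if neither exists) can be sketched with Python's standard-library HTML parser. The fallback order shown is this sketch's assumption; the crawler's exact precedence is not documented here.

```python
from html.parser import HTMLParser

# Extract a page title: <title> first, og:title as fallback.
# A page yielding neither would be skipped by the crawler.
class TitleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = None
        self.og_title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and a.get("property") == "og:title":
            self.og_title = a.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

def extract_title(html):
    parser = TitleExtractor()
    parser.feed(html)
    return parser.title or parser.og_title  # None -> page is skipped
```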
Creation / Publication Date Recommended
Pages should include a publication or modification date via one of these meta tags:
<meta property="article:published_time" content="2026-03-15T10:00:00Z">
<meta property="og:updated_time" content="2026-03-15T10:00:00Z">
Why it matters: Without a date, the crawler cannot determine content freshness. Opensolr's Search Tuning includes a freshness boost that prioritizes newer content — but it only works when pages have dates. This applies to articles, blog posts, product pages, and any content where recency matters.
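Date values like the ones above are ISO 8601 timestamps, and a freshness signal ultimately reduces to a page's age. A minimal sketch of parsing such a timestamp and computing age in days (the actual boost formula used by Search Tuning is not public):

```python
from datetime import datetime, timezone

# Parse an article:published_time value and compute the page's age in days.
# The 'Z' suffix is normalized because datetime.fromisoformat only
# accepts it directly on Python 3.11+.
def page_age_days(published_time, now=None):
    ts = datetime.fromisoformat(published_time.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (now - ts).days
```

A page with no parseable date simply gets no age, and therefore no freshness boost.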
Author Best Practice
Include an author tag via <meta name="author">, <meta property="og:author">, or <meta property="article:creator">. Even a generic value like "Admin" or "Editorial Team" is better than nothing — it provides structured data that can be used for filtering and faceting in the future.
Category Best Practice
A <meta property="og:category"> or <meta property="article:section"> tag helps with faceted search — allowing users to filter results by topic. Google uses similar signals to understand content classification.
Duplicate Content & Canonical URLs
When two or more URLs serve the same content, they should declare a canonical URL:
<link rel="canonical" href="https://example.com/the-real-page">
Without a canonical tag, the crawler indexes every URL it finds. If the same content exists at /products/shoes and /catalog/shoes, both will appear in search results — creating duplicates that confuse users and dilute relevance.
This is identical to how Google handles canonicalization. If you have already set canonical tags for Google, they will work for the Opensolr Web Crawler automatically.
Correct
Both /products/shoes and /catalog/shoes include:
<link rel="canonical" href="/products/shoes">
Result: Only /products/shoes appears in search results.
Problem
Neither page has a canonical tag.
Result: Both appear in search — duplicate results for the same content.
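The deduplication in the Correct example can be modeled as grouping crawled URLs by their resolved canonical target. This is a sketch of the behavior described above, with relative canonical hrefs resolved against the page URL:

```python
from urllib.parse import urljoin

# Collapse crawled URLs into canonical targets: pages sharing a
# canonical tag yield a single index entry. A page with no canonical
# tag is its own target.
def canonical_index(pages):
    # pages: list of (page_url, canonical_href_or_None)
    seen = []
    for url, canonical in pages:
        target = urljoin(url, canonical) if canonical else url
        if target not in seen:
            seen.append(target)
    return seen

pages = [
    ("https://example.com/products/shoes", "/products/shoes"),
    ("https://example.com/catalog/shoes", "/products/shoes"),
]
print(canonical_index(pages))  # ['https://example.com/products/shoes']
```

With the canonical tags removed, the same two pages would produce two index entries, which is exactly the duplicate-results problem described above.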
Quick Reference
| Standard | Status | What Happens |
|---|---|---|
| Response time > 5 seconds | Required | Page is skipped — not crawled at all |
| URL has ? query string | Required | Page is visited for link discovery but not indexed |
| Blocked in robots.txt | Required | Page is not crawled or indexed |
| Meta noindex | Required | Page is crawled for links but not indexed |
| Meta nofollow | Required | Page is indexed but outgoing links are not followed |
| Meta none | Required | Page is neither indexed nor followed |
| No page title | Required | Page is skipped entirely |
| Unique titles across site | Recommended | Prevents confusing duplicate results |
| Publication date meta tag | Recommended | Enables freshness boost in search ranking |
| Author meta tag | Best Practice | Better data structure for future faceting |
| Category meta tag | Best Practice | Enables topic-based filtering in results |
| Canonical URL tag | Recommended | Prevents duplicate content in search index |
Ready to Add Search to Your Site?
The Opensolr Web Crawler indexes your entire site — HTML, PDF, DOCX — with AI enrichment, vector embeddings, and hybrid search built in.