# Import XML Data into Your Opensolr Index
Solr's DataImportHandler (DIH) was removed in Solr 9. The modern approach is simple: POST your XML data directly to Solr's `/update` endpoint. No plugins, no `data-config.xml`, no special configuration — just send XML over HTTP.
This guide shows you how to parse XML files and index them into your Opensolr Index using curl, PHP, and Python.
## The Solr XML Format
Solr accepts documents in a specific XML format. Every XML file you send must follow this structure:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="id">product-001</field>
    <field name="title">Wireless Headphones</field>
    <field name="description">Premium noise-cancelling wireless headphones</field>
    <field name="price">149.99</field>
    <field name="category">Electronics</field>
    <field name="in_stock">true</field>
  </doc>
  <doc>
    <field name="id">product-002</field>
    <field name="title">Bluetooth Speaker</field>
    <field name="description">Portable waterproof Bluetooth speaker</field>
    <field name="price">79.99</field>
    <field name="category">Electronics</field>
    <field name="in_stock">true</field>
  </doc>
</add>
```
Key rules:
- The root element is `<add>`
- Each document is wrapped in `<doc>`
- Each field uses `<field name="fieldname">value</field>`
- Field names must match fields defined in your `schema.xml`
- The `id` field is required (the unique key for each document)
- You can include as many `<doc>` elements as you want in a single `<add>` block
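If you generate these files programmatically, letting an XML library build the markup avoids escaping mistakes entirely. A minimal Python sketch (the helper name `build_add_xml` and the sample fields are just illustrative):

```python
import xml.etree.ElementTree as ET

def build_add_xml(docs):
    """Build a Solr <add> block from a list of field dicts.
    ElementTree escapes &, <, and > in field values automatically."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

xml_body = build_add_xml([
    {"id": "product-003", "title": "Cables & Adapters", "price": 9.99},
])
print(xml_body)
```

Note that the ampersand in "Cables & Adapters" comes out as `&amp;` without any manual escaping.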
## Example 1: curl — Quick Import from Command Line
This is the simplest way to import: POST the XML file directly to Solr.
```shell
#!/bin/bash
# =====================================================================
# CONFIGURATION — Replace these with YOUR Opensolr index details
# =====================================================================
SOLR_HOST="YOUR_HOST"
INDEX_NAME="YOUR_INDEX"
USERNAME="opensolr"
PASSWORD="YOUR_API_KEY"
XML_FILE="products.xml"

# Import the XML file into Solr
echo "Importing $XML_FILE into $INDEX_NAME..."

curl -u "$USERNAME:$PASSWORD" \
  -H "Content-Type: text/xml" \
  --data-binary @"$XML_FILE" \
  "https://$SOLR_HOST/solr/$INDEX_NAME/update?commit=true&wt=json"

echo ""
echo "Done!"
```
For large files, you can split and send in chunks — see the PHP and Python examples below.
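One way to chunk a large file without loading it all at once is to stream `<doc>` elements with `xml.etree.ElementTree.iterparse`. A sketch (the function name `iter_doc_batches` is illustrative; wrapping each batch in `<add>` and POSTing it works exactly as in the examples):

```python
import io
import xml.etree.ElementTree as ET

def iter_doc_batches(source, batch_size=500):
    """Yield lists of serialized <doc> elements from a Solr XML file,
    without building the full document tree in memory."""
    batch = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "doc":
            batch.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # release memory held by the processed element
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

# Demo with an in-memory file; pass a filename for real data
sample = io.StringIO(
    '<add><doc><field name="id">1</field></doc>'
    '<doc><field name="id">2</field></doc></add>'
)
for batch in iter_doc_batches(sample, batch_size=1):
    print(len(batch), "doc(s)")
```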
## Example 2: PHP — Parse & Index XML with Chunking
This script reads an XML file, parses each `<doc>` element, and sends the documents to Solr in configurable batches. Batching keeps each HTTP request small; note that `simplexml_load_file` still parses the entire file into memory, so for very large files consider a streaming parser such as PHP's `XMLReader`.
```php
<?php
// =====================================================================
// CONFIGURATION — Replace these with YOUR Opensolr index details
// =====================================================================
$solr_host  = "YOUR_HOST";
$index_name = "YOUR_INDEX";
$username   = "opensolr";
$password   = "YOUR_API_KEY";
$xml_file   = "products.xml";
$batch_size = 500; // Documents per batch (adjust based on doc size)

$solr_url = "https://$solr_host/solr/$index_name/update?commit=true&wt=json";

// Load and parse the XML file
$xml = simplexml_load_file($xml_file);
if ($xml === false) {
    die("Error: Could not parse $xml_file\n");
}

$docs  = $xml->doc;
$total = count($docs);
echo "Found $total documents in $xml_file\n";

// Send documents in batches
$batch = [];
$sent  = 0;

foreach ($docs as $doc) {
    $batch[] = $doc->asXML();
    if (count($batch) >= $batch_size) {
        $sent += send_batch($batch, $solr_url, $username, $password);
        $batch = [];
        echo "  Indexed $sent / $total\n";
    }
}

// Send remaining documents
if (!empty($batch)) {
    $sent += send_batch($batch, $solr_url, $username, $password);
    echo "  Indexed $sent / $total\n";
}

echo "Done! $sent documents indexed.\n";

function send_batch($docs, $url, $user, $pass) {
    $xml_body = "<add>\n" . implode("\n", $docs) . "\n</add>";

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_USERPWD        => "$user:$pass",
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $xml_body,
        CURLOPT_HTTPHEADER     => ["Content-Type: text/xml"],
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 120,
    ]);

    $response  = curl_exec($ch);
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($http_code !== 200) {
        echo "  ERROR (HTTP $http_code): $response\n";
        return 0;
    }
    return count($docs);
}
```
## Example 3: Python — Parse & Index XML with Chunking
The same approach in Python, using `xml.etree.ElementTree` for parsing and `requests` for HTTP.
```python
# import_xml.py — Parse and index XML documents into Opensolr
import xml.etree.ElementTree as ET
import requests
import sys

# =====================================================================
# CONFIGURATION — Replace these with YOUR Opensolr index details
# =====================================================================
SOLR_HOST  = "YOUR_HOST"
INDEX_NAME = "YOUR_INDEX"
USERNAME   = "opensolr"
PASSWORD   = "YOUR_API_KEY"
XML_FILE   = "products.xml"
BATCH_SIZE = 500  # Documents per batch

SOLR_URL = f"https://{SOLR_HOST}/solr/{INDEX_NAME}/update?commit=true&wt=json"


def send_batch(docs, url, auth):
    """Send a batch of <doc> elements to Solr."""
    xml_body = "<add>\n" + "\n".join(docs) + "\n</add>"
    resp = requests.post(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        auth=auth,
        timeout=120,
    )
    if resp.status_code != 200:
        print(f"  ERROR (HTTP {resp.status_code}): {resp.text}")
        return 0
    return len(docs)


def main():
    xml_file = sys.argv[1] if len(sys.argv) > 1 else XML_FILE
    auth = (USERNAME, PASSWORD)

    # Parse the XML file
    tree = ET.parse(xml_file)
    root = tree.getroot()

    # Collect all <doc> elements
    docs = []
    for doc in root.findall("doc"):
        docs.append(ET.tostring(doc, encoding="unicode"))

    total = len(docs)
    print(f"Found {total} documents in {xml_file}")

    # Send in batches
    sent = 0
    for i in range(0, total, BATCH_SIZE):
        batch = docs[i : i + BATCH_SIZE]
        sent += send_batch(batch, SOLR_URL, auth)
        print(f"  Indexed {sent} / {total}")

    print(f"Done! {sent} documents indexed.")


if __name__ == "__main__":
    main()
```
## Example 4: Python — Convert Non-Solr XML to Solr Format
If your XML is not in Solr's `<add><doc>` format (e.g., it's a product feed, RSS, or custom XML), you need to convert it first:
```python
# convert_and_index.py — Convert arbitrary XML to Solr format and index
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape
import requests

SOLR_HOST  = "YOUR_HOST"
INDEX_NAME = "YOUR_INDEX"
USERNAME   = "opensolr"
PASSWORD   = "YOUR_API_KEY"

SOLR_URL = f"https://{SOLR_HOST}/solr/{INDEX_NAME}/update?commit=true&wt=json"

# Example: convert a product catalog XML like:
# <catalog>
#   <product sku="ABC123">
#     <name>Widget</name>
#     <price>19.99</price>
#   </product>
# </catalog>
tree = ET.parse("catalog.xml")
root = tree.getroot()

solr_docs = []
for product in root.findall("product"):
    # escape() guards against &, <, > in the source data breaking the XML
    sku      = escape(product.get("sku", ""))
    name     = escape(product.findtext("name", ""))
    price    = escape(product.findtext("price", "0"))
    category = escape(product.findtext("category", ""))
    desc     = escape(product.findtext("description", ""))

    doc = f"""<doc>
  <field name="id">{sku}</field>
  <field name="title">{name}</field>
  <field name="price">{price}</field>
  <field name="category">{category}</field>
  <field name="description">{desc}</field>
</doc>"""
    solr_docs.append(doc)

print(f"Converted {len(solr_docs)} products to Solr format")

# Send to Solr in one batch (or chunk for large datasets)
xml_body = "<add>\n" + "\n".join(solr_docs) + "\n</add>"
resp = requests.post(
    SOLR_URL,
    data=xml_body.encode("utf-8"),
    headers={"Content-Type": "text/xml"},
    auth=(USERNAME, PASSWORD),
    timeout=120,
)
if resp.status_code == 200:
    print("Indexed successfully!")
else:
    print(f"Error: {resp.status_code} — {resp.text}")
```
## Handling Large XML Files
For files with thousands or millions of documents, keep these in mind:
| Concern | Solution |
|---|---|
| Memory | Use batch processing (500-1000 docs per batch) — shown in examples above |
| Timeouts | Set `commit=false` during import, then send a final commit: `curl -u user:pass "https://HOST/solr/INDEX/update?commit=true"` |
| Speed | Use `commitWithin=10000` instead of `commit=true` to let Solr batch commits every 10 seconds |
| Errors | Send small batches so you can identify which batch has bad data |
| Encoding | Always use UTF-8. Escape XML special characters (`&`, `<`, `>`, `"`) |
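On the encoding point: if you build Solr XML by string concatenation, escape field values first. A small helper using Python's standard library (the function name `solr_field` is just illustrative):

```python
from xml.sax.saxutils import escape

def solr_field(name, value):
    """Render one <field> element with &, <, > escaped in the value
    (and double quotes additionally escaped inside the attribute)."""
    safe_name = escape(str(name), {'"': "&quot;"})
    return f'<field name="{safe_name}">{escape(str(value))}</field>'

print(solr_field("title", 'Cables & Adapters <2-pack>'))
```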
## Quick Reference
| Placeholder | Where to Find It |
|---|---|
| `YOUR_HOST` | Your Index Control Panel → "Hostname" |
| `YOUR_INDEX` | Your Index name |
| `YOUR_API_KEY` | Control Panel → Dashboard → "Secret API Key" |
| Endpoint | Method | Content-Type |
|---|---|---|
| `/solr/INDEX/update` | POST | `text/xml` |
| `/solr/INDEX/update?commit=true` | POST | `text/xml` (auto-commit after import) |
| `/solr/INDEX/update?commitWithin=10000` | POST | `text/xml` (commit within 10 seconds) |
You can also use the Opensolr Data Ingestion API to push documents in JSON format without needing to deal with XML at all.
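For reference, stock Solr also accepts JSON on the same `/update` endpoint when sent with `Content-Type: application/json` (the Opensolr Data Ingestion API itself is documented separately). A sketch of building such a payload; the POST step is shown as a comment because it needs your real host and credentials:

```python
import json

docs = [
    {"id": "product-001", "title": "Wireless Headphones", "price": 149.99},
    {"id": "product-002", "title": "Bluetooth Speaker", "price": 79.99},
]
payload = json.dumps(docs)

# POST to:  https://YOUR_HOST/solr/YOUR_INDEX/update?commit=true
# headers:  Content-Type: application/json
# auth:     the same Basic Auth credentials as the XML examples
print(payload)
```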