# Import XML Data into Your Opensolr Index
Solr's DataImportHandler (DIH) was removed in Solr 9. The modern approach is simple: POST your XML data directly to Solr's `/update` endpoint. No plugins, no `data-config.xml`, no special configuration — just send XML over HTTP.
This guide shows you how to parse XML files and index them into your Opensolr Index using curl, PHP, and Python.
## The Solr XML Format
Solr accepts documents in a specific XML format. Every XML file you send must follow this structure:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="id">product-001</field>
    <field name="title">Wireless Headphones</field>
    <field name="description">Premium noise-cancelling wireless headphones</field>
    <field name="price">149.99</field>
    <field name="category">Electronics</field>
    <field name="in_stock">true</field>
  </doc>
  <doc>
    <field name="id">product-002</field>
    <field name="title">Bluetooth Speaker</field>
    <field name="description">Portable waterproof Bluetooth speaker</field>
    <field name="price">79.99</field>
    <field name="category">Electronics</field>
    <field name="in_stock">true</field>
  </doc>
</add>
```
Key rules:
- The root element is `<add>`
- Each document is wrapped in `<doc>`
- Each field uses `<field name="fieldname">value</field>`
- Field names must match fields defined in your `schema.xml`
- The `id` field is required (the unique key for each document)
- You can include as many `<doc>` elements as you want in a single `<add>` block
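If you generate these files programmatically, letting an XML library build the markup avoids escaping mistakes entirely. A minimal Python sketch (the helper name `build_add_xml` and the sample fields are just illustrative):

```python
import xml.etree.ElementTree as ET

def build_add_xml(docs):
    """Build a Solr <add> block from a list of field dicts.
    ElementTree escapes &, <, and > in field values automatically."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

xml_body = build_add_xml([
    {"id": "product-003", "title": "Cables & Adapters", "price": 9.99},
])
print(xml_body)
```

Note that the ampersand in "Cables & Adapters" comes out as `&amp;` without any manual escaping.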
## Example 1: curl — Quick Import from Command Line
This is the simplest way to import: POST the XML file directly to Solr.
```shell
#!/bin/bash
# =====================================================================
# CONFIGURATION — Replace these with YOUR Opensolr index details
# =====================================================================
SOLR_HOST="YOUR_HOST"
INDEX_NAME="YOUR_INDEX"
USERNAME="opensolr"
PASSWORD="YOUR_API_KEY"
XML_FILE="products.xml"

# Import the XML file into Solr
echo "Importing $XML_FILE into $INDEX_NAME..."

curl -u "$USERNAME:$PASSWORD" \
  -H "Content-Type: text/xml" \
  --data-binary @"$XML_FILE" \
  "https://$SOLR_HOST/solr/$INDEX_NAME/update?commit=true&wt=json"

echo ""
echo "Done!"
```
For large files, you can split and send in chunks — see the PHP and Python examples below.
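One way to chunk a large file without loading it all at once is to stream `<doc>` elements with `xml.etree.ElementTree.iterparse`. A sketch (the function name `iter_doc_batches` is illustrative; wrapping each batch in `<add>` and POSTing it works exactly as in the examples):

```python
import io
import xml.etree.ElementTree as ET

def iter_doc_batches(source, batch_size=500):
    """Yield lists of serialized <doc> elements from a Solr XML file,
    without building the full document tree in memory."""
    batch = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "doc":
            batch.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # release memory held by the processed element
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

# Demo with an in-memory file; pass a filename for real data
sample = io.StringIO(
    '<add><doc><field name="id">1</field></doc>'
    '<doc><field name="id">2</field></doc></add>'
)
for batch in iter_doc_batches(sample, batch_size=1):
    print(len(batch), "doc(s)")
```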
## Example 2: PHP — Parse & Index XML with Chunking
This script reads an XML file, parses each `<doc>` element, and sends the documents to Solr in configurable batches. Batching keeps each HTTP request small; note that `simplexml_load_file` still parses the entire file into memory, so for very large files consider a streaming parser such as PHP's `XMLReader`.
```php
<?php
// =====================================================================
// CONFIGURATION — Replace these with YOUR Opensolr index details
// =====================================================================
$solr_host  = "YOUR_HOST";
$index_name = "YOUR_INDEX";
$username   = "opensolr";
$password   = "YOUR_API_KEY";
$xml_file   = "products.xml";
$batch_size = 500; // Documents per batch (adjust based on doc size)

$solr_url = "https://$solr_host/solr/$index_name/update?commit=true&wt=json";

// Load and parse the XML file
$xml = simplexml_load_file($xml_file);
if ($xml === false) {
    die("Error: Could not parse $xml_file\n");
}

$docs  = $xml->doc;
$total = count($docs);
echo "Found $total documents in $xml_file\n";

// Send documents in batches
$batch = [];
$sent  = 0;

foreach ($docs as $doc) {
    $batch[] = $doc->asXML();
    if (count($batch) >= $batch_size) {
        $sent += send_batch($batch, $solr_url, $username, $password);
        $batch = [];
        echo "  Indexed $sent / $total\n";
    }
}

// Send remaining documents
if (!empty($batch)) {
    $sent += send_batch($batch, $solr_url, $username, $password);
    echo "  Indexed $sent / $total\n";
}

echo "Done! $sent documents indexed.\n";

function send_batch($docs, $url, $user, $pass) {
    $xml_body = "<add>\n" . implode("\n", $docs) . "\n</add>";

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_USERPWD        => "$user:$pass",
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $xml_body,
        CURLOPT_HTTPHEADER     => ["Content-Type: text/xml"],
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 120,
    ]);

    $response  = curl_exec($ch);
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($http_code !== 200) {
        echo "  ERROR (HTTP $http_code): $response\n";
        return 0;
    }
    return count($docs);
}
```
## Example 3: Python — Parse & Index XML with Chunking
The same approach in Python, using `xml.etree.ElementTree` for parsing and `requests` for HTTP.
```python
# import_xml.py — Parse and index XML documents into Opensolr
import xml.etree.ElementTree as ET
import requests
import sys

# =====================================================================
# CONFIGURATION — Replace these with YOUR Opensolr index details
# =====================================================================
SOLR_HOST  = "YOUR_HOST"
INDEX_NAME = "YOUR_INDEX"
USERNAME   = "opensolr"
PASSWORD   = "YOUR_API_KEY"
XML_FILE   = "products.xml"
BATCH_SIZE = 500  # Documents per batch

SOLR_URL = f"https://{SOLR_HOST}/solr/{INDEX_NAME}/update?commit=true&wt=json"


def send_batch(docs, url, auth):
    """Send a batch of <doc> elements to Solr."""
    xml_body = "<add>\n" + "\n".join(docs) + "\n</add>"
    resp = requests.post(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        auth=auth,
        timeout=120,
    )
    if resp.status_code != 200:
        print(f"  ERROR (HTTP {resp.status_code}): {resp.text}")
        return 0
    return len(docs)


def main():
    xml_file = sys.argv[1] if len(sys.argv) > 1 else XML_FILE
    auth = (USERNAME, PASSWORD)

    # Parse the XML file
    tree = ET.parse(xml_file)
    root = tree.getroot()

    # Collect all <doc> elements
    docs = []
    for doc in root.findall("doc"):
        docs.append(ET.tostring(doc, encoding="unicode"))

    total = len(docs)
    print(f"Found {total} documents in {xml_file}")

    # Send in batches
    sent = 0
    for i in range(0, total, BATCH_SIZE):
        batch = docs[i : i + BATCH_SIZE]
        sent += send_batch(batch, SOLR_URL, auth)
        print(f"  Indexed {sent} / {total}")

    print(f"Done! {sent} documents indexed.")


if __name__ == "__main__":
    main()
```
## Example 4: Python — Convert Non-Solr XML to Solr Format
If your XML is not in Solr's `<add><doc>` format (e.g., it's a product feed, RSS, or custom XML), you need to convert it first:
```python
# convert_and_index.py — Convert arbitrary XML to Solr format and index
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape
import requests

SOLR_HOST  = "YOUR_HOST"
INDEX_NAME = "YOUR_INDEX"
USERNAME   = "opensolr"
PASSWORD   = "YOUR_API_KEY"

SOLR_URL = f"https://{SOLR_HOST}/solr/{INDEX_NAME}/update?commit=true&wt=json"

# Example: convert a product catalog XML like:
# <catalog>
#   <product sku="ABC123">
#     <name>Widget</name>
#     <price>19.99</price>
#   </product>
# </catalog>
tree = ET.parse("catalog.xml")
root = tree.getroot()

solr_docs = []
for product in root.findall("product"):
    # escape() guards against &, <, > in the source data breaking the XML
    sku      = escape(product.get("sku", ""))
    name     = escape(product.findtext("name", ""))
    price    = escape(product.findtext("price", "0"))
    category = escape(product.findtext("category", ""))
    desc     = escape(product.findtext("description", ""))

    doc = f"""<doc>
  <field name="id">{sku}</field>
  <field name="title">{name}</field>
  <field name="price">{price}</field>
  <field name="category">{category}</field>
  <field name="description">{desc}</field>
</doc>"""
    solr_docs.append(doc)

print(f"Converted {len(solr_docs)} products to Solr format")

# Send to Solr in one batch (or chunk for large datasets)
xml_body = "<add>\n" + "\n".join(solr_docs) + "\n</add>"
resp = requests.post(
    SOLR_URL,
    data=xml_body.encode("utf-8"),
    headers={"Content-Type": "text/xml"},
    auth=(USERNAME, PASSWORD),
    timeout=120,
)
if resp.status_code == 200:
    print("Indexed successfully!")
else:
    print(f"Error: {resp.status_code} — {resp.text}")
```
## Handling Large XML Files
For files with thousands or millions of documents, keep these in mind:
| Concern | Solution |
|---|---|
| Memory | Use batch processing (500-1000 docs per batch) — shown in examples above |
| Timeouts | Set `commit=false` during import, then send a final commit: `curl -u user:pass "https://HOST/solr/INDEX/update?commit=true"` |
| Speed | Use `commitWithin=10000` instead of `commit=true` to let Solr batch commits every 10 seconds |
| Errors | Send small batches so you can identify which batch has bad data |
| Encoding | Always use UTF-8. Escape XML special characters (`&`, `<`, `>`, `"`) |
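On the encoding point: if you build Solr XML by string concatenation, escape field values first. A small helper using Python's standard library (the function name `solr_field` is just illustrative):

```python
from xml.sax.saxutils import escape

def solr_field(name, value):
    """Render one <field> element with &, <, > escaped in the value
    (and double quotes additionally escaped inside the attribute)."""
    safe_name = escape(str(name), {'"': "&quot;"})
    return f'<field name="{safe_name}">{escape(str(value))}</field>'

print(solr_field("title", 'Cables & Adapters <2-pack>'))
```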
## Quick Reference
| Placeholder | Where to Find It |
|---|---|
| `YOUR_HOST` | Your Index Control Panel → "Hostname" |
| `YOUR_INDEX` | Your Index name |
| `YOUR_API_KEY` | Control Panel → Dashboard → "Secret API Key" |
| Endpoint | Method | Content-Type |
|---|---|---|
| `/solr/INDEX/update` | POST | `text/xml` |
| `/solr/INDEX/update?commit=true` | POST | `text/xml` (auto-commit after import) |
| `/solr/INDEX/update?commitWithin=10000` | POST | `text/xml` (commit within 10 seconds) |
You can also use the Opensolr Data Ingestion API to push documents in JSON format without needing to deal with XML at all.
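For reference, stock Solr also accepts JSON on the same `/update` endpoint when sent with `Content-Type: application/json` (the Opensolr Data Ingestion API itself is documented separately). A sketch of building such a payload; the POST step is shown as a comment because it needs your real host and credentials:

```python
import json

docs = [
    {"id": "product-001", "title": "Wireless Headphones", "price": 149.99},
    {"id": "product-002", "title": "Bluetooth Speaker", "price": 79.99},
]
payload = json.dumps(docs)

# POST to:  https://YOUR_HOST/solr/YOUR_INDEX/update?commit=true
# headers:  Content-Type: application/json
# auth:     the same Basic Auth credentials as the XML examples
print(payload)
```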