How to Import XML Data into Your Opensolr Index


[Diagram: XML data import flow. Your XML file (in Solr's <add><doc> format) is parsed and validated with PHP, Python, or Bash (chunked if large), then POSTed to /update?commit=true with Content-Type: text/xml, and the documents go live in your index.]

Solr's DataImportHandler (DIH) was removed in Solr 9. The modern approach is simple: POST your XML data directly to Solr's /update endpoint via HTTP. No plugins needed, no data-config.xml, no special configuration — just send XML over HTTP.

This guide shows you how to parse XML files and index them into your Opensolr Index using curl, PHP, and Python.


The Solr XML Format

Solr accepts documents in a specific XML format. Every XML file you send must follow this structure:

[Diagram: Solr XML document format. Each <doc> inside <add> becomes one document in your index; field names must match your schema.xml.]

<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="id">product-001</field>
    <field name="title">Wireless Headphones</field>
    <field name="description">Premium noise-cancelling wireless headphones</field>
    <field name="price">149.99</field>
    <field name="category">Electronics</field>
    <field name="in_stock">true</field>
  </doc>
  <doc>
    <field name="id">product-002</field>
    <field name="title">Bluetooth Speaker</field>
    <field name="description">Portable waterproof Bluetooth speaker</field>
    <field name="price">79.99</field>
    <field name="category">Electronics</field>
    <field name="in_stock">true</field>
  </doc>
</add>

Key rules:

  • The root element is <add>
  • Each document is wrapped in <doc>
  • Each field uses <field name="fieldname">value</field>
  • Field names must match fields defined in your schema.xml
  • The id field is required (unique key for each document)
  • You can include as many <doc> elements as you want in a single <add> block
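The rules above can also be expressed in code. Here is a minimal sketch (Python 3, standard library only; build_add_xml is a hypothetical helper, not part of Solr or Opensolr) that builds a valid <add> block with ElementTree, which escapes special characters on output:

```python
# Sketch: build Solr <add> XML programmatically with ElementTree.
# build_add_xml is a hypothetical helper name used for illustration.
import xml.etree.ElementTree as ET

def build_add_xml(records):
    """Turn a list of field dicts into Solr's <add><doc> XML format."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = str(value)  # ElementTree escapes &, <, > when serializing
    return ET.tostring(add, encoding="unicode")

xml_body = build_add_xml([
    {"id": "product-003", "title": "USB-C Cable & Charger", "price": 12.99},
])
print(xml_body)
```

Because ElementTree serializes the field values, a title like "USB-C Cable & Charger" comes out with the ampersand correctly escaped as &amp;.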

Example 1: curl — Quick Import from the Command Line

The simplest way to import. Post the XML file directly to Solr:

#!/bin/bash
# =====================================================================
# CONFIGURATION — Replace these with YOUR Opensolr index details
# =====================================================================
SOLR_HOST="YOUR_HOST"
INDEX_NAME="YOUR_INDEX"
USERNAME="opensolr"
PASSWORD="YOUR_API_KEY"

XML_FILE="products.xml"

# Import the XML file into Solr
echo "Importing $XML_FILE into $INDEX_NAME..."

curl -u "$USERNAME:$PASSWORD" \
  -H "Content-Type: text/xml" \
  --data-binary @"$XML_FILE" \
  "https://$SOLR_HOST/solr/$INDEX_NAME/update?commit=true&wt=json"

echo ""
echo "Done!"

For large files, you can split and send in chunks — see the PHP and Python examples below.


Example 2: PHP — Parse & Index XML with Chunking

This script reads an XML file, parses each <doc> element, and sends the documents to Solr in configurable batches, which keeps each HTTP request small and fast. Note that simplexml_load_file still loads the entire file into memory; for files too large to fit, switch to a streaming parser such as XMLReader.

<?php
// =====================================================================
// CONFIGURATION — Replace these with YOUR Opensolr index details
// =====================================================================
$solr_host  = "YOUR_HOST";
$index_name = "YOUR_INDEX";
$username   = "opensolr";
$password   = "YOUR_API_KEY";
$xml_file   = "products.xml";
$batch_size = 500;  // Documents per batch (adjust based on doc size)

$solr_url = "https://$solr_host/solr/$index_name/update?commit=true&wt=json";

// Load and parse the XML file
$xml = simplexml_load_file($xml_file);
if ($xml === false) {
    die("Error: Could not parse $xml_file\n");
}

$docs = $xml->doc;
$total = count($docs);
echo "Found $total documents in $xml_file\n";

// Send documents in batches
$batch = [];
$sent = 0;

foreach ($docs as $doc) {
    $batch[] = $doc->asXML();

    if (count($batch) >= $batch_size) {
        $sent += send_batch($batch, $solr_url, $username, $password);
        $batch = [];
        echo "  Indexed $sent / $total\n";
    }
}

// Send remaining documents
if (!empty($batch)) {
    $sent += send_batch($batch, $solr_url, $username, $password);
    echo "  Indexed $sent / $total\n";
}

echo "Done! $sent documents indexed.\n";

function send_batch($docs, $url, $user, $pass) {
    $xml_body = "<add>\n" . implode("\n", $docs) . "\n</add>";

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_USERPWD        => "$user:$pass",
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $xml_body,
        CURLOPT_HTTPHEADER     => ["Content-Type: text/xml"],
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 120,
    ]);

    $response = curl_exec($ch);
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($http_code !== 200) {
        echo "  ERROR (HTTP $http_code): $response\n";
        return 0;
    }

    return count($docs);
}

Example 3: Python — Parse & Index XML with Chunking

Same approach in Python. Uses xml.etree.ElementTree for parsing and requests for HTTP.

# import_xml.py — Parse and index XML documents into Opensolr
import xml.etree.ElementTree as ET
import requests
import sys

# =====================================================================
# CONFIGURATION — Replace these with YOUR Opensolr index details
# =====================================================================
SOLR_HOST  = "YOUR_HOST"
INDEX_NAME = "YOUR_INDEX"
USERNAME   = "opensolr"
PASSWORD   = "YOUR_API_KEY"
XML_FILE   = "products.xml"
BATCH_SIZE = 500  # Documents per batch

SOLR_URL = f"https://{SOLR_HOST}/solr/{INDEX_NAME}/update?commit=true&wt=json"

def send_batch(docs, url, auth):
    """Send a batch of <doc> elements to Solr."""
    xml_body = "<add>\n" + "\n".join(docs) + "\n</add>"
    resp = requests.post(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        auth=auth,
        timeout=120,
    )
    if resp.status_code != 200:
        print(f"  ERROR (HTTP {resp.status_code}): {resp.text}")
        return 0
    return len(docs)

def main():
    xml_file = sys.argv[1] if len(sys.argv) > 1 else XML_FILE
    auth = (USERNAME, PASSWORD)

    # Parse the XML file
    tree = ET.parse(xml_file)
    root = tree.getroot()

    # Collect all <doc> elements
    docs = []
    for doc in root.findall("doc"):
        docs.append(ET.tostring(doc, encoding="unicode"))

    total = len(docs)
    print(f"Found {total} documents in {xml_file}")

    # Send in batches
    sent = 0
    for i in range(0, total, BATCH_SIZE):
        batch = docs[i : i + BATCH_SIZE]
        sent += send_batch(batch, SOLR_URL, auth)
        print(f"  Indexed {sent} / {total}")

    print(f"Done! {sent} documents indexed.")

if __name__ == "__main__":
    main()

Example 4: Python — Convert Non-Solr XML to Solr Format

If your XML is not in Solr's <add><doc> format (e.g., it's a product feed, RSS, or custom XML), you need to convert it first:

# convert_and_index.py — Convert arbitrary XML to Solr format and index
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape  # escapes &, <, > in field values
import requests

SOLR_HOST  = "YOUR_HOST"
INDEX_NAME = "YOUR_INDEX"
USERNAME   = "opensolr"
PASSWORD   = "YOUR_API_KEY"
SOLR_URL   = f"https://{SOLR_HOST}/solr/{INDEX_NAME}/update?commit=true&wt=json"

# Example: convert a product catalog XML like:
# <catalog>
#   <product sku="ABC123">
#     <name>Widget</name>
#     <price>19.99</price>
#   </product>
# </catalog>

tree = ET.parse("catalog.xml")
root = tree.getroot()

solr_docs = []
for product in root.findall("product"):
    # Escape special characters so values like "Black & Decker" produce valid XML
    sku = escape(product.get("sku", ""))
    name = escape(product.findtext("name", ""))
    price = escape(product.findtext("price", "0"))
    category = escape(product.findtext("category", ""))
    desc = escape(product.findtext("description", ""))

    doc = f"""<doc>
  <field name="id">{sku}</field>
  <field name="title">{name}</field>
  <field name="price">{price}</field>
  <field name="category">{category}</field>
  <field name="description">{desc}</field>
</doc>"""
    solr_docs.append(doc)

print(f"Converted {len(solr_docs)} products to Solr format")

# Send to Solr in one batch (or chunk for large datasets)
xml_body = "<add>\n" + "\n".join(solr_docs) + "\n</add>"
resp = requests.post(
    SOLR_URL,
    data=xml_body.encode("utf-8"),
    headers={"Content-Type": "text/xml"},
    auth=(USERNAME, PASSWORD),
    timeout=120,
)

if resp.status_code == 200:
    print("Indexed successfully!")
else:
    print(f"Error (HTTP {resp.status_code}): {resp.text}")

Handling Large XML Files

For files with thousands or millions of documents, keep these in mind:

  • Memory: send documents in batches (500-1000 per batch), as shown in the examples above
  • Timeouts: set commit=false during import, then send one final commit: curl -u user:pass "https://HOST/solr/INDEX/update?commit=true"
  • Speed: use commitWithin=10000 instead of commit=true so Solr batches commits every 10 seconds
  • Errors: keep batches small so you can identify which batch contains bad data
  • Encoding: always use UTF-8, and escape XML special characters (& as &amp;, < as &lt;, > as &gt;, " as &quot;)
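The deferred-commit pattern can be sketched as follows (a minimal sketch in Python, assuming the same Opensolr credentials and /update endpoint used in the examples above; post_batch and final_commit are illustrative helper names):

```python
# Sketch: buffer documents with commit=false, then commit once at the end.
import requests

SOLR_URL = "https://YOUR_HOST/solr/YOUR_INDEX/update"
AUTH = ("opensolr", "YOUR_API_KEY")

def post_batch(xml_body):
    """POST one <add> batch without forcing a commit (Solr buffers it)."""
    return requests.post(
        SOLR_URL,
        params={"commit": "false"},
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        auth=AUTH,
        timeout=120,
    )

def final_commit():
    """One explicit commit at the end makes all buffered documents searchable."""
    return requests.post(SOLR_URL, params={"commit": "true"}, auth=AUTH, timeout=120)
```

Skipping the per-batch commit avoids repeatedly reopening the index searcher, which is often the slowest part of a large import.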

Quick Reference

Placeholders:

  • YOUR_HOST: your Index Control Panel → "Hostname"
  • YOUR_INDEX: your Index name
  • YOUR_API_KEY: Control Panel → Dashboard → "Secret API Key"

Endpoints (all use POST with Content-Type: text/xml):

  • /solr/INDEX/update: no automatic commit (send a commit separately)
  • /solr/INDEX/update?commit=true: auto-commit after import
  • /solr/INDEX/update?commitWithin=10000: commit within 10 seconds

You can also use the Opensolr Data Ingestion API to push documents in JSON format without needing to deal with XML at all.
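As a sketch of the JSON route, the standard Solr JSON update format looks like this (the dedicated Opensolr Data Ingestion API endpoint may differ, so check the Opensolr docs; build_json_payload and index_json are illustrative helper names):

```python
# Sketch: index documents as JSON instead of XML, using Solr's standard
# JSON update format on the /update endpoint.
import json
import requests

SOLR_URL = "https://YOUR_HOST/solr/YOUR_INDEX/update"

def build_json_payload(docs):
    """Serialize a list of document dicts for Solr's JSON /update endpoint."""
    return json.dumps(docs)

def index_json(docs, auth):
    """POST the documents and commit, mirroring the XML examples above."""
    return requests.post(
        SOLR_URL,
        params={"commit": "true"},
        data=build_json_payload(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        auth=auth,
        timeout=120,
    )
```

With JSON there is no escaping to worry about: json.dumps handles characters like & and < for you.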