Parsing HTML contents between two DIVs

Sometimes you have to work with HTML that isn’t structured in a semantically meaningful way. Here’s a quick code snippet for extracting the HTML between two DIVs using PHP’s DOMDocument and DOMXPath.

Suppose you have the following HTML structure:

<html>
<div id="some-div">...</div>
<div>...</div>
<div>...</div>
<div>...</div>
<div id="some-other-div">...</div>
</html>

Let’s say you want to grab the three DIVs that sit between the DIV with the ID “some-div” and the DIV with the ID “some-other-div”. Let’s also assume that your HTML is available in a variable named $html. Here’s how you’d get the HTML between those two DIVs.

$dom = new DOMDocument();
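// Collect parser warnings internally instead of silencing everything with "@".
libxml_use_internal_errors(true);
$dom->loadHTML($html);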
$xpath = new DOMXPath($dom);

$snippet = '';

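// Find the DIV with ID "some-div". item(0) returns null if nothing matches.
$node = $xpath->query('//div[@id="some-div"]')->item(0);

// Loop through each following sibling node (skipped entirely if the DIV was not found).
while ($node && ($node = $node->nextSibling)) {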

  // Skip non-element nodes such as "#text", which have no getAttribute() method.
  if (!($node instanceof DOMElement)) {
    continue;
  }

  // If we get to the last DIV, stop.
  if ($node->getAttribute('id') == 'some-other-div') {
    break;
  }

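  // Grab the HTML of this element. saveHTML() keeps HTML syntax, whereas
  // saveXML() can emit self-closing tags such as <div/> for empty elements.
  $snippet .= $dom->saveHTML($node);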
}
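
If you need this in more than one place, the steps above can be wrapped in a small function. This is only a sketch: the name htmlBetweenIds() and its parameters are made up for illustration (they are not part of PHP’s DOM extension), the sample markup is filler, and it assumes the two IDs contain no double quotes because they are pasted straight into the XPath expression.

```php
<?php
/**
 * Hypothetical helper: returns the markup of the element siblings that sit
 * between the element with ID $startId and the element with ID $endId.
 * Returns an empty string if the start element is not found.
 */
function htmlBetweenIds(string $html, string $startId, string $endId): string
{
    // Collect parser warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Assumption: $startId contains no double quotes.
    $node = $xpath->query('//*[@id="' . $startId . '"]')->item(0);

    $snippet = '';
    while ($node && ($node = $node->nextSibling)) {
        // Text nodes and comments are skipped; only elements are collected.
        if (!($node instanceof DOMElement)) {
            continue;
        }
        // Stop at the closing marker without including it.
        if ($node->getAttribute('id') === $endId) {
            break;
        }
        $snippet .= $dom->saveHTML($node);
    }

    return $snippet;
}

// Sample markup with the same shape as the structure shown earlier.
$html = '<html><body>'
      . '<div id="some-div">start</div>'
      . '<div>one</div><div>two</div><div>three</div>'
      . '<div id="some-other-div">end</div>'
      . '</body></html>';

echo htmlBetweenIds($html, 'some-div', 'some-other-div');
```

The function only walks the siblings that follow the first matching start element, so both marker elements need to share the same parent. If the end ID never shows up, everything after the start element is returned.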

That’s it!