How do you parse and process HTML/XML in PHP?

What are some good options for parsing HTML/XML in PHP to extract information from it?

Try Simple HTML DOM Parser.

  • An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!
  • Requires PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.

Note: as the name suggests, it can be useful for simple tasks. It uses regular expressions instead of an HTML parser, so it will be considerably slower for more complex tasks. The bulk of its codebase was written in 2008, with only small improvements made since then. It does not follow modern PHP coding standards and would be challenging to incorporate into a modern PSR-compliant project.

Examples:

How to get HTML elements:

// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all images
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';

How to modify HTML elements:

// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html;

Extract content from HTML:

// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;

Scraping Slashdot:

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks
$articles = array();
foreach ($html->find('div.article') as $article) {
    $item = array();
    $item['title'] = $article->find('div.title', 0)->plaintext;
    $item['intro'] = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);


Just use DOMDocument->loadHTML() and be done with it. libxml's HTML parsing algorithm is quite good and fast, and contrary to popular belief, does not choke on malformed HTML.
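
As a rough sketch of that approach: the sample markup and variable names below are made up for illustration, and the markup is deliberately malformed (an unclosed paragraph, no closing body or html tag). libxml reports such problems as warnings, so the sketch collects them internally rather than letting them surface as PHP warnings.

```php
<?php
// Hypothetical sample markup, deliberately malformed for illustration.
$markup = '<html><body><p>Hello<a href="/one">One</a><img src="a.png"><a href="/two">Two</a>';

// Collect libxml parse warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($markup);

// Discard the collected warnings and restore the previous error setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Gather the href of every link by tag name.
$links = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $links[] = $anchor->getAttribute('href');
}

// Gather the src of every image with an XPath query.
$xpath = new DOMXPath($doc);
$images = [];
foreach ($xpath->query('//img[@src]') as $img) {
    $images[] = $img->getAttribute('src');
}

print_r($links);
print_r($images);
```

Both getElementsByTagName() and DOMXPath work on the same parsed document, so you can mix them depending on how specific the query needs to be.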

Simple HTML DOM is a great open-source parser:

simplehtmldom.sourceforge

It treats DOM elements in an object-oriented way, and the new iteration has a lot of coverage for non-compliant code. There are also some great functions like you'd see in JavaScript, such as the "find" function, which will return all instances of elements of that tag name.

I've used this in a number of tools, testing it on many different types of web pages, and I think it works great.

This is commonly referred to as screen scraping, by the way. The library I have used for this is Simple HTML Dom Parser.

I recommend PHP Simple HTML DOM Parser.

It really has nice features, like:

foreach($html->find('img') as $element)
    echo $element->src . '<br>';

I created a library named PHPPowertools/DOM-Query, which allows you to crawl HTML5 and XML documents just like you do with jQuery.

Under the hood, it uses symfony/DomCrawler for conversion of CSS selectors to XPath selectors. It always uses the same DOMDocument, even when passing one object to another, to ensure decent performance.
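
To illustrate the general idea of translating a CSS selector to XPath, here is a minimal sketch using only PHP's built-in DOM extension. The XPath expression is hand-written and only roughly equivalent to the CSS selector div.foo; it is not the output of the library or of Symfony, and the sample markup is hypothetical.

```php
<?php
// Hypothetical sample markup for illustration.
$markup = '<div class="foo bar">first</div>'
        . '<div class="foobar">second</div>'
        . '<div class="foo">third</div>';

$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Hand-written XPath, roughly equivalent to the CSS selector "div.foo".
// Padding the class attribute with spaces makes "foo" match as a whole
// class name, so "foobar" is not matched.
$expression = "//div[contains(concat(' ', normalize-space(@class), ' '), ' foo ')]";

$matches = [];
foreach ($xpath->query($expression) as $node) {
    $matches[] = $node->textContent;
}

print_r($matches);
```

The verbosity of the XPath version is the main reason CSS-selector front ends are popular for this kind of work.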


Example use:

namespace PowerTools;

// Get file content
$htmlcode = file_get_contents('https://github.com');

// Define your DOMCrawler based on file string
$H = new DOM_Query($htmlcode);

// Define your DOMCrawler based on an existing DOM_Query instance
$H = new DOM_Query($H->select('body'));

// Passing a string (CSS selector)
$s = $H->select('div.foo');

// Passing an element object (DOM Element)
$s = $H->select($documentBody);

// Passing a DOM Query object
$s = $H->select($H->select('p + p'));

// Select the body tag
$body = $H->select('body');

// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');

// Nest your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the text of all site blocks
$siteblocks->text(function ($i, $val) {
    return $i . " - " . $val->attr('class');
});

// Append the following HTML to all site blocks
$siteblocks->append('<div class="site-center"></div>');

// Use a descendant selector to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');

// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));

// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function ($i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();

// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');

// Wrap the site's footer within two new selectors
$sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');

[…]


Supported methods:


  1. Renamed 'select', for obvious reasons
  2. Renamed 'void', since 'empty' is a reserved word in PHP

NOTE:

The library also includes its own zero-configuration autoloader for PSR-0 compatible libraries. The included example should work out of the box without any additional configuration. Alternatively, you can use it with Composer.