Simple HTML DOM: a simple PHP DOM parser

Published 2020-05-09

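Simple HTML DOM is a simple PHP DOM parser. It lets you manipulate HTML elements the way jQuery does, find and replace content in web pages, and work with non-standard (invalid) HTML markup. It requires PHP 5 or later.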

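Features

  • An HTML DOM parser written in PHP 5+ that lets you manipulate HTML documents in a very simple way.
  • Requires PHP 5+.
  • Supports invalid HTML.
  • Finds tags on an HTML page with selectors, just like jQuery.
  • Extracts content from HTML in a single line.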

Usage

Getting HTML elements

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
  echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
  echo $element->href . '<br>';

Modifying HTML content

// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html; // Output: <div id="hello">foo</div><div id="world" class="bar">World</div>

Extracting text

// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;

Combined example

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks
foreach($html->find('div.article') as $article) {
  $item['title']   = $article->find('div.title', 0)->plaintext;
  $item['intro']  = $article->find('div.intro', 0)->plaintext;
  $item['details'] = $article->find('div.details', 0)->plaintext;
  $articles[] = $item;
}

print_r($articles);
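
Because find() with an index returns null when nothing matches, chaining ->plaintext directly onto the result can fail on pages that lack one of the expected blocks. The sketch below is one defensive variant of the loop above. It assumes simple_html_dom.php is available in the include path; the inline markup and the field list are made up for illustration.

```php
<?php
// Assumption: the library file is reachable under this name.
require_once 'simple_html_dom.php';

// Hypothetical markup: the "details" block is deliberately missing.
$html = str_get_html(
    '<div class="article">'
    . '<div class="title">T1</div>'
    . '<div class="intro">I1</div>'
    . '</div>'
);

$articles = array();
foreach ($html->find('div.article') as $article) {
    $item = array();
    foreach (array('title', 'intro', 'details') as $field) {
        // find(..., 0) yields an element object, or null if nothing matched
        $node = $article->find('div.' . $field, 0);
        $item[$field] = $node ? trim($node->plaintext) : null;
    }
    $articles[] = $item;
}

print_r($articles);
```

Storing null for a missing block keeps every item the same shape, so later code can test each field without checking whether the key exists.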

How to create an HTML DOM object?

Quick way

// Create a DOM object from a string
$html = str_get_html('<html><body>Hello!</body></html>');

// Create a DOM object from a URL
$html = file_get_html('http://www.google.com/');

// Create a DOM object from a HTML file
$html = file_get_html('test.htm');

Object-oriented way

// Create a DOM object
$html = new simple_html_dom();

// Load HTML from a string
$html->load('<html><body>Hello!</body></html>');

// Load HTML from a URL 
$html->load_file('http://www.google.com/');

// Load HTML from a HTML file 
$html->load_file('test.htm');

How to find HTML elements?

Basics

// Find all anchors, returns an array of element objects
$ret = $html->find('a');

// Find the (N)th anchor, returns an element object or null if not found (zero based)
$ret = $html->find('a', 0);

// Find the last anchor, returns an element object or null if not found
$ret = $html->find('a', -1);

// Find all <div> elements that have an id attribute
$ret = $html->find('div[id]');

// Find all <div> elements whose id attribute equals foo
$ret = $html->find('div[id=foo]');

Advanced

// Find all elements with id=foo
$ret = $html->find('#foo');

// Find all elements with class=foo
$ret = $html->find('.foo');

// Find all elements that have an id attribute
$ret = $html->find('*[id]');

// Find all anchors and images
$ret = $html->find('a, img');

// Find all anchors and images that have a "title" attribute
$ret = $html->find('a[title], img[title]');

Descendant selectors

// Find all <li> in <ul> with a single descendant selector
$es = $html->find('ul li');

// Find the first <li> that is inside a <ul>
$e = $html->find('ul li', 0);

Nested selectors

// Find all <li> in <ul> 
foreach($html->find('ul') as $ul) 
{
  foreach($ul->find('li') as $li) 
  {
    // do something...
  }
}

// Find first <li> in first <ul> 
$e = $html->find('ul', 0)->find('li', 0);

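Attribute filters

These operators are supported in attribute selectors:

Filter              Description
[attribute]         Matches elements that have the specified attribute.
[!attribute]        Matches elements that don't have the specified attribute.
[attribute=value]   Matches elements whose specified attribute has exactly the given value.
[attribute!=value]  Matches elements whose specified attribute does not have the given value.
[attribute^=value]  Matches elements whose specified attribute starts with the given value.
[attribute$=value]  Matches elements whose specified attribute ends with the given value.
[attribute*=value]  Matches elements whose specified attribute contains the given value.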
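
To make the operator semantics concrete, here is a minimal sketch in plain PHP that evaluates each filter against a single attribute value. The function attribute_matches() is hypothetical and is not part of Simple HTML DOM; it only mirrors the descriptions in the table, and it assumes that a missing attribute never matches a value comparison (including !=), which may differ from what the library does.

```php
<?php
// Hypothetical helper, NOT the library's implementation.
// $actual is the attribute value, or null when the attribute is absent.
// $expected is assumed to be a non-empty string for the value operators.
function attribute_matches($actual, $operator, $expected = '')
{
    switch ($operator) {
        case '':   // [attribute]
            return $actual !== null;
        case '!':  // [!attribute]
            return $actual === null;
        case '=':  // [attribute=value]
            return $actual === $expected;
        case '!=': // [attribute!=value] (assumption: attribute must exist)
            return $actual !== null && $actual !== $expected;
        case '^=': // [attribute^=value]
            return $actual !== null && strpos($actual, $expected) === 0;
        case '$=': // [attribute$=value]
            return $actual !== null
                && substr($actual, -strlen($expected)) === $expected;
        case '*=': // [attribute*=value]
            return $actual !== null && strpos($actual, $expected) !== false;
    }
    return false; // unknown operator
}

var_dump(attribute_matches('http://example.com/a.png', '^=', 'http'));
var_dump(attribute_matches('photo.png', '$=', '.png'));
```

With the library itself, the equivalent filters are written inside the selector string, for example $html->find('a[href^=http]') or $html->find('img[src$=.png]').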

Text and comments

// Find all text blocks 
$es = $html->find('text');

// Find all comment (<!--...-->) blocks 
$es = $html->find('comment');

How to access an HTML element's attributes?

Get, set, and remove attributes

// Get an attribute (for a non-value attribute such as checked or selected, this returns true or false)
$value = $e->href;

// Set an attribute (for a non-value attribute such as checked or selected, set its value to true or false)
$e->href = 'my link';

// Remove an attribute by setting its value to null
$e->href = null;

// Determine whether an attribute exists
if(isset($e->href))
   echo 'href exists!';

Magic attributes

// Example
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);

echo $e->tag; // Returns: "div"
echo $e->outertext; // Returns: "<div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: "foo <b>bar</b>"
echo $e->plaintext; // Returns: "foo bar"

Attribute name   Usage
$e->tag          Reads or writes the tag name of the element.
$e->outertext    Reads or writes the outer HTML text of the element.
$e->innertext    Reads or writes the inner HTML text of the element.
$e->plaintext    Reads or writes the plain text of the element.

Tips

// Extract contents from HTML
echo $html->plaintext;

// Wrap an element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';

// Remove an element by setting its outertext to an empty string
$e->outertext = '';

// Append an element
$e->outertext = $e->outertext . '<div>foo</div>';

// Insert an element
$e->outertext = '<div>foo</div>' . $e->outertext;

How to traverse the DOM tree?

Background Knowledge

// If you are not so familiar with HTML DOM, check this link to learn more...

// Example
echo $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;
// or 
echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');

Traversing the DOM tree

You can also call these methods using camel-case naming conventions.

Method                                 Description
mixed $e->children ( [int $index] )    Returns the Nth child object if index is set, otherwise returns an array of children.
element $e->parent ()                  Returns the parent of the element.
element $e->first_child ()             Returns the first child of the element, or null if not found.
element $e->last_child ()              Returns the last child of the element, or null if not found.
element $e->next_sibling ()            Returns the next sibling of the element, or null if not found.
element $e->prev_sibling ()            Returns the previous sibling of the element, or null if not found.
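
The traversal methods can be tried on a small fragment. The following is a minimal sketch, assuming simple_html_dom.php is available in the include path; the markup and the id values are made up for illustration.

```php
<?php
// Assumption: the library file is reachable under this name.
require_once 'simple_html_dom.php';

$html = str_get_html(
    '<ul id="menu"><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul>'
);

$ul     = $html->find('ul', 0);
$first  = $ul->first_child();      // the <li> with id "a"
$last   = $ul->last_child();       // the <li> with id "c"
$middle = $first->next_sibling();  // the <li> with id "b"

// children() without an index returns all child elements
$ids = array();
foreach ($ul->children() as $li) {
    $ids[] = $li->id;
}

echo implode(',', $ids), "\n";           // ids of all children, in order
echo $middle->parent()->id, "\n";        // id of the enclosing <ul>
echo $last->prev_sibling()->id, "\n";    // id of the element before the last
```

Each of these calls returns null when there is no such node, so check the result before chaining further calls on real-world pages.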

How to dump the contents of a DOM object?

Quick way

// Dumps the internal DOM tree back into string 
$str = $html;

// Print it!
echo $html;

Object-oriented way

// Dumps the internal DOM tree back into string 
$str = $html->save();

// Dumps the internal DOM tree back into a file 
$html->save('result.htm');

How to customize the parsing behavior?

Callback function

// Write a function with parameter "$element"
function my_callback($element) {
   // Hide all <b> tags 
   if ($element->tag=='b')
    $element->outertext = '';
}

// Register the callback function by its function name
$html->set_callback('my_callback');

// Callback function will be invoked while dumping
echo $html;

API Reference

Helper functions

Name                                        Description
object str_get_html ( string $content )     Creates a DOM object from a string.
object file_get_html ( string $filename )   Creates a DOM object from a file or a URL.

DOM methods & properties

Name                                             Description
void __construct ( [string $filename] )          Constructor. If the filename parameter is set, the contents are loaded automatically, either from text or from a file/URL.
string plaintext                                 Returns the contents extracted from HTML.
void clear ()                                    Cleans up memory.
void load ( string $content )                    Loads contents from a string.
string save ( [string $filename] )               Dumps the internal DOM tree back into a string. If $filename is set, the resulting string is also saved to that file.
void load_file ( string $filename )              Loads contents from a file or a URL.
void set_callback ( string $function_name )      Sets a callback function.
mixed find ( string $selector [, int $index] )   Finds elements by CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.
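
When many documents are parsed in one script, calling clear() after each one is a common way to keep memory use down. The following is a minimal sketch, assuming simple_html_dom.php is available in the include path; the $pages array is hypothetical input.

```php
<?php
// Assumption: the library file is reachable under this name.
require_once 'simple_html_dom.php';

// Hypothetical input: HTML fragments collected elsewhere.
$pages = array('<p>one</p>', '<p>two</p>');

$texts = array();
foreach ($pages as $page) {
    $html = str_get_html($page);
    $texts[] = trim($html->find('p', 0)->plaintext);

    // Release the parser's internal node references before the next page
    $html->clear();
    unset($html);
}

print_r($texts);
```

Copy the values you need (here, into $texts) before calling clear(), since the element objects belong to the DOM that is being released.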

Element methods & properties

Name                                             Description
string [attribute]                               Reads or writes the element's attribute value.
string tag                                       Reads or writes the tag name of the element.
string outertext                                 Reads or writes the outer HTML text of the element.
string innertext                                 Reads or writes the inner HTML text of the element.
string plaintext                                 Reads or writes the plain text of the element.
mixed find ( string $selector [, int $index] )   Finds children by CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.

DOM traversing

Name                                   Description
mixed $e->children ( [int $index] )    Returns the Nth child object if index is set, otherwise returns an array of children.
element $e->parent ()                  Returns the parent of the element.
element $e->first_child ()             Returns the first child of the element, or null if not found.
element $e->last_child ()              Returns the last child of the element, or null if not found.
element $e->next_sibling ()            Returns the next sibling of the element, or null if not found.
element $e->prev_sibling ()            Returns the previous sibling of the element, or null if not found.

Camel naming conventions

You can also call methods using W3C standard camel-case naming conventions.

Method                                                Mapping
array $e->getAllAttributes ()                         array $e->attr
string $e->getAttribute ( $name )                     string $e->attribute
void $e->setAttribute ( $name, $value )               void $e->attribute = $value
bool $e->hasAttribute ( $name )                       bool isset($e->attribute)
void $e->removeAttribute ( $name )                    void $e->attribute = null
element $e->getElementById ( $id )                    mixed $e->find ( "#$id", 0 )
mixed $e->getElementsById ( $id [, $index] )          mixed $e->find ( "#$id" [, int $index] )
element $e->getElementByTagName ( $name )             mixed $e->find ( $name, 0 )
mixed $e->getElementsByTagName ( $name [, $index] )   mixed $e->find ( $name [, int $index] )
element $e->parentNode ()                             element $e->parent ()
mixed $e->childNodes ( [$index] )                     mixed $e->children ( [int $index] )
element $e->firstChild ()                             element $e->first_child ()
element $e->lastChild ()                              element $e->last_child ()
element $e->nextSibling ()                            element $e->next_sibling ()
element $e->previousSibling ()                        element $e->prev_sibling ()
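
The two naming styles can be mixed freely. The following is a minimal sketch of a few of the mappings above, assuming simple_html_dom.php is available in the include path; the markup, ids, and attribute values are made up for illustration.

```php
<?php
// Assumption: the library file is reachable under this name.
require_once 'simple_html_dom.php';

$html = str_get_html(
    '<div id="box"><a href="/home" title="Home">Home</a></div>'
);

// Camel-case style
$a = $html->getElementById('box')->getElementByTagName('a');
$href_camel = $a->getAttribute('href');
$has_title  = $a->hasAttribute('title');

// Equivalent find() / magic-attribute style
$same = $html->find('#box', 0)->find('a', 0);
$href_magic = $same->href;
$isset_title = isset($same->title);

// Writing an attribute through the camel-case setter
$a->setAttribute('class', 'nav');

echo $href_camel, ' ', $href_magic, "\n";
echo $a->class, "\n";
```

Which style to use is a matter of taste: the camel-case names read like the browser DOM API, while find() and magic attributes are shorter.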
