Commit d383264

Merge branch 'dev' into master
2 parents 4004e5b + 23df5a4 commit d383264

25 files changed: 806 additions, 385 deletions

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
1+
composer.phar
2+
composer.lock
3+
/vendor/
4+
.idea/

.travis.yml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ language: php
33
php:
44
- 5.6
55
- 7.0
6-
- hhvm
6+
- 7.1
77

88
install:
99
- composer self-update

README.md

Lines changed: 19 additions & 9 deletions
@@ -7,12 +7,12 @@ Version 1.7.0
77
[![Coverage Status](https://coveralls.io/repos/paquettg/php-html-parser/badge.png)](https://coveralls.io/r/paquettg/php-html-parser)
88
[![Scrutinizer Code Quality](https://scrutinizer-ci.com/g/paquettg/php-html-parser/badges/quality-score.png?b=master)](https://scrutinizer-ci.com/g/paquettg/php-html-parser/?branch=master)
99

10-
PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assiste in the development of tools which require a quick, easy way to scrap html, whether it's valid or not! This project was original supported by [sunra/php-simple-html-dom-parser](https://github.com/sunra/php-simple-html-dom-parser) but the support seems to have stopped so this project is my adaptation of his previous work.
10+
PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assist in the development of tools which require a quick, easy way to scrape html, whether it's valid or not! This project was originally supported by [sunra/php-simple-html-dom-parser](https://github.com/sunra/php-simple-html-dom-parser) but the support seems to have stopped so this project is my adaptation of his previous work.
1111

1212
Install
1313
-------
1414

15-
This package can be found on [packagist](https://packagist.org/packages/paquettg/php-html-parser) and is best loaded using [composer](http://getcomposer.org/). We support php 5.6, 7.0, and hhvm 2.3.
15+
This package can be found on [packagist](https://packagist.org/packages/paquettg/php-html-parser) and is best loaded using [composer](http://getcomposer.org/). We support php 5.6, 7.0, 7.1.
1616

1717
Usage
1818
-----
@@ -35,7 +35,7 @@ The above will output "click here". Simple no? There are many ways to get the sa
3535
Loading Files
3636
------------------
3737

38-
You may also seamlessly load a file into the dom instead of a string, which is much more convinient and is how I except most developers will be loading the html. The following example is taken from our test and uses the "big.html" file found there.
38+
You may also seamlessly load a file into the dom instead of a string, which is much more convenient and is how I expect most developers will be loading the html. The following example is taken from our test and uses the "big.html" file found there.
3939

4040
```php
4141
// Assuming you installed from Composer:
@@ -61,9 +61,9 @@ foreach ($contents as $content)
6161
}
6262
```
6363

64-
This example loads the html from big.html, a real page found online, and gets all the content-border classes to process. It also shows a few things you can do with a node but it is not an exhaustive list of methods that a node has avaiable.
64+
This example loads the html from big.html, a real page found online, and gets all the content-border classes to process. It also shows a few things you can do with a node but it is not an exhaustive list of methods that a node has available.
6565

66-
Alternativly, you can always use the `load()` method to load the file. It will attempt to find the file using `file_exists` and, if succesfull, will call `loadFromFile()` for you. The same applies to a URL and `loadFromUrl()` method.
66+
Alternatively, you can always use the `load()` method to load the file. It will attempt to find the file using `file_exists` and, if successful, will call `loadFromFile()` for you. The same applies to a URL and the `loadFromUrl()` method.
6767

6868
Loading Url
6969
----------------
@@ -102,7 +102,7 @@ As long as the Connector object implements the `PHPHtmlParser\CurlInterface` int
102102
Loading Strings
103103
---------------
104104

105-
Loading a string directly, with out the checks in `load()` is also easely done.
105+
Loading a string directly, without the checks in `load()`, is also easily done.
106106

107107
```php
108108
// Assuming you installed from Composer:
@@ -142,19 +142,19 @@ At the moment we support 7 options.
142142

143143
**Strict**
144144

145-
Strict, by default false, will throw a `StrickException` if it find that the html is not strict complient (all tags must have a clossing tag, no attribute with out a value, etc.).
145+
Strict, by default false, will throw a `StrictException` if it finds that the html is not strictly compliant (all tags must have a closing tag, no attribute without a value, etc.).
146146

147147
**whitespaceTextNode**
148148

149149
The whitespaceTextNode, by default true, option tells the parser to save textnodes even if the content of the node is empty (only whitespace). Setting it to false will ignore all whitespace only text node found in the document.
150150

151151
**enforceEncoding**
152152

153-
The enforceEncoding, by default null, option will enforce an charater set to be used for reading the content and returning the content in that encoding. Setting it to null will trigger an attempt to figure out the encoding from within the content of the string given instead.
153+
The enforceEncoding, by default null, option will enforce a character set to be used for reading the content and returning the content in that encoding. Setting it to null will trigger an attempt to figure out the encoding from within the content of the string given instead.
154154

155155
**cleanupInput**
156156

157-
Set this to `true` to skip the entire clean up phase of the parser. If this is set to true the next 3 options will be ignored. Defaults to `false`.
157+
Set this to `false` to skip the entire clean up phase of the parser. If this is set to `false` the next 3 options will be ignored. Defaults to `true`.
158158

159159
**removeScripts**
160160

@@ -217,3 +217,13 @@ $a->delete();
217217
unset($a);
218218
echo $dom; // '<div class="all"><p>Hey bro, <br /> :)</p></div>');
219219
```
220+
221+
You can modify the text of `TextNode` objects easily. Please note that, if you set an encoding, the new text will be encoded using the existing encoding.
222+
223+
```php
224+
$dom = new Dom;
225+
$dom->load('<div class="all"><p>Hey bro, <a href="google.com">click here</a><br /> :)</p></div>');
226+
$a = $dom->find('a')[0];
227+
$a->firstChild()->setText('biz baz');
228+
echo $dom; // '<div class="all"><p>Hey bro, <a href="google.com">biz baz</a><br /> :)</p></div>'
229+
```

composer.json

Lines changed: 2 additions & 1 deletion
@@ -15,10 +15,11 @@
1515
],
1616
"require": {
1717
"php": ">=5.6",
18+
"ext-mbstring": "*",
1819
"paquettg/string-encode": "~0.1.0"
1920
},
2021
"require-dev": {
21-
"phpunit/phpunit": "~5.3.0",
22+
"phpunit/phpunit": "~5.7.0",
2223
"satooshi/php-coveralls": "~1.0.0",
2324
"mockery/mockery": "~0.9.0"
2425
},
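
Note on the new `ext-mbstring` requirement: Composer checks it at install time. For code deployed without Composer, a runtime guard along the following lines could give a clearer failure. This is only a sketch; `requireExtension` is a hypothetical helper and is not part of this commit.

```php
<?php
/**
 * Hypothetical helper (not in the commit): fail early when a PHP
 * extension the code depends on is not loaded.
 *
 * @param string $name Extension name, e.g. 'mbstring'
 * @throws RuntimeException when the extension is missing
 */
function requireExtension($name)
{
    if (!extension_loaded($name)) {
        throw new RuntimeException(
            "Required PHP extension '" . $name . "' is not loaded."
        );
    }
}
```

Calling `requireExtension('mbstring')` before the first `mb_*` call would surface the missing dependency as an exception rather than as an undefined-function error.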

src/PHPHtmlParser/Curl.php

Lines changed: 5 additions & 0 deletions
@@ -28,6 +28,11 @@ public function get($url)
2828

2929
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
3030
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
31+
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
32+
curl_setopt($ch, CURLOPT_VERBOSE, true);
33+
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
34+
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36');
35+
curl_setopt($ch, CURLOPT_URL, $url);
3136

3237
$content = curl_exec($ch);
3338
if ($content === false) {

src/PHPHtmlParser/Dom.php

Lines changed: 130 additions & 8 deletions
@@ -79,17 +79,32 @@ class Dom
7979
* @var array
8080
*/
8181
protected $selfClosing = [
82-
'img',
82+
'area',
83+
'base',
84+
'basefont',
8385
'br',
86+
'col',
87+
'embed',
88+
'hr',
89+
'img',
8490
'input',
85-
'meta',
91+
'keygen',
8692
'link',
87-
'hr',
88-
'base',
89-
'embed',
93+
'meta',
94+
'param',
95+
'source',
9096
'spacer',
97+
'track',
98+
'wbr'
9199
];
92100

101+
/**
102+
* A list of tags where there should be no /> at the end (html5 style)
103+
*
104+
* @var array
105+
*/
106+
protected $noSlash = [];
107+
93108
/**
94109
* Returns the inner html of the root node.
95110
*
@@ -120,6 +135,7 @@ public function __get($name)
120135
*/
121136
public function load($str, $options = [])
122137
{
138+
AbstractNode::resetCount();
123139
// check if it's a file
124140
if (strpos($str, "\n") === false && is_file($str)) {
125141
return $this->loadFromFile($str, $options);
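
Note on the dispatch in `load()`: the file check shown in this hunk only runs when the input contains no newline. The stand-alone sketch below mirrors that check and adds a URL test. The function name, the returned labels, and the exact URL pattern are assumptions made for illustration; only the `strpos`/`is_file` condition is taken from the diff.

```php
<?php
/**
 * Classify the argument given to load(). Illustrative only.
 *
 * @param string $str Path, URL, or raw markup
 * @return string One of 'file', 'url', 'string'
 */
function classifyLoadInput($str)
{
    // Same condition as the diff: markup that contains a line break
    // is never treated as a path.
    if (strpos($str, "\n") === false && is_file($str)) {
        return 'file';
    }

    // Assumed URL test: an http:// or https:// prefix.
    if (preg_match('#^https?://#i', $str)) {
        return 'url';
    }

    // Anything else is handled as raw markup.
    return 'string';
}
```

Under these assumptions, multi-line markup and anything that is neither an existing file nor an http(s) URL fall through to the string branch.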
@@ -219,6 +235,20 @@ public function find($selector, $nth = null)
219235
return $this->root->find($selector, $nth);
220236
}
221237

238+
/**
239+
* Find element by Id on the root node
240+
*
241+
* @param int $id Element Id
242+
* @return mixed
243+
*
244+
*/
245+
public function findById($id)
246+
{
247+
$this->isLoaded();
248+
249+
return $this->root->findById($id);
250+
}
251+
222252
/**
223253
* Adds the tag (or tags in an array) to the list of tags that will always
224254
* be self closing.
@@ -267,6 +297,53 @@ public function clearSelfClosingTags()
267297
return $this;
268298
}
269299

300+
301+
/**
302+
* Adds a tag to the list of self closing tags that should not have a trailing slash
303+
*
304+
* @param $tag
305+
* @return $this
306+
*/
307+
public function addNoSlashTag($tag)
308+
{
309+
if ( ! is_array($tag)) {
310+
$tag = [$tag];
311+
}
312+
foreach ($tag as $value) {
313+
$this->noSlash[] = $value;
314+
}
315+
316+
return $this;
317+
}
318+
319+
/**
320+
* Removes a tag from the list of no-slash tags.
321+
*
322+
* @param $tag
323+
* @return $this
324+
*/
325+
public function removeNoSlashTag($tag)
326+
{
327+
if ( ! is_array($tag)) {
328+
$tag = [$tag];
329+
}
330+
$this->noSlash = array_diff($this->noSlash, $tag);
331+
332+
return $this;
333+
}
334+
335+
/**
336+
* Empties the list of no-slash tags.
337+
*
338+
* @return $this
339+
*/
340+
public function clearNoSlashTags()
341+
{
342+
$this->noSlash = [];
343+
344+
return $this;
345+
}
346+
270347
/**
271348
* Simple wrapper function that returns the first child.
272349
*
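
Note on the three no-slash methods: a stand-alone mirror of their behaviour is sketched below. The class name `NoSlashList` and the `has()`/`all()` accessors are hypothetical and exist only to make the list observable. One detail worth knowing is that `array_diff()` preserves keys, so the list can have gaps after a removal; `in_array()` lookups are unaffected.

```php
<?php
/**
 * Stand-alone mirror of addNoSlashTag(), removeNoSlashTag() and
 * clearNoSlashTags(). Illustrative; not the library class.
 */
class NoSlashList
{
    /** @var array */
    protected $noSlash = [];

    public function addNoSlashTag($tag)
    {
        // Accept a single tag name or an array of names.
        if (!is_array($tag)) {
            $tag = [$tag];
        }
        foreach ($tag as $value) {
            $this->noSlash[] = $value;
        }

        return $this;
    }

    public function removeNoSlashTag($tag)
    {
        if (!is_array($tag)) {
            $tag = [$tag];
        }
        // array_diff() keeps the original keys of the remaining entries.
        $this->noSlash = array_diff($this->noSlash, $tag);

        return $this;
    }

    public function clearNoSlashTags()
    {
        $this->noSlash = [];

        return $this;
    }

    /** Hypothetical accessor: is the tag currently in the list? */
    public function has($tag)
    {
        return in_array($tag, $this->noSlash);
    }

    /** Hypothetical accessor: the raw list, keys included. */
    public function all()
    {
        return $this->noSlash;
    }
}
```

Because each method returns `$this`, calls can be chained, for example `$list->addNoSlashTag(['br', 'hr'])->removeNoSlashTag('hr')`.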
@@ -291,6 +368,42 @@ public function lastChild()
291368
return $this->root->lastChild();
292369
}
293370

371+
/**
372+
* Simple wrapper function that returns count of child elements
373+
*
374+
* @return int
375+
*/
376+
public function countChildren()
377+
{
378+
$this->isLoaded();
379+
380+
return $this->root->countChildren();
381+
}
382+
383+
/**
384+
* Get array of children
385+
*
386+
* @return array
387+
*/
388+
public function getChildren()
389+
{
390+
$this->isLoaded();
391+
392+
return $this->root->getChildren();
393+
}
394+
395+
/**
396+
* Check if node have children nodes
397+
*
398+
* @return bool
399+
*/
400+
public function hasChildren()
401+
{
402+
$this->isLoaded();
403+
404+
return $this->root->hasChildren();
405+
}
406+
294407
/**
295408
* Simple wrapper function that returns an element by the
296409
* id.
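
Note on the new wrapper methods (`findById()`, `countChildren()`, `getChildren()`, `hasChildren()`): each one calls `isLoaded()` and then delegates to the root node. The sketch below shows that pattern in isolation. `SketchDom`, `SketchNode`, `setRoot()` and the use of `LogicException` are assumptions for illustration; the diff does not show what `isLoaded()` does when nothing has been loaded.

```php
<?php
/** Minimal stand-in for a root node. Hypothetical. */
class SketchNode
{
    protected $children;

    public function __construct(array $children = [])
    {
        $this->children = $children;
    }

    public function countChildren()
    {
        return count($this->children);
    }

    public function getChildren()
    {
        return $this->children;
    }

    public function hasChildren()
    {
        return !empty($this->children);
    }
}

/** Guard-then-delegate wrappers, as in the diff. Hypothetical class. */
class SketchDom
{
    /** @var SketchNode|null */
    protected $root = null;

    public function setRoot(SketchNode $root)
    {
        $this->root = $root;
    }

    // Assumed behaviour: throw when no document has been loaded.
    protected function isLoaded()
    {
        if ($this->root === null) {
            throw new LogicException('Content is not loaded.');
        }
    }

    public function countChildren()
    {
        $this->isLoaded();

        return $this->root->countChildren();
    }

    public function getChildren()
    {
        $this->isLoaded();

        return $this->root->getChildren();
    }

    public function hasChildren()
    {
        $this->isLoaded();

        return $this->root->hasChildren();
    }
}
```

The guard keeps every wrapper from dereferencing a missing root, at the cost of repeating one line per method.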
@@ -391,7 +504,9 @@ protected function clean($str)
391504
}
392505

393506
// strip out server side scripts
394-
$str = mb_eregi_replace("(<\?)(.*?)(\?>)", '', $str);
507+
if ($this->options->get('serverSideScriptis') == true){
508+
$str = mb_eregi_replace("(<\?)(.*?)(\?>)", '', $str);
509+
}
395510

396511
// strip smarty scripts
397512
$str = mb_eregi_replace("(\{\w)(.*?)(\})", '', $str);
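
Note on the option-gated stripping of server-side scripts: the change makes the removal conditional on an option. The sketch below shows the same gate on a plain function. It swaps in `preg_replace()` with the `s` modifier instead of `mb_eregi_replace()` so it runs without mbstring; the function name and the boolean parameter are assumptions, not library API.

```php
<?php
/**
 * Remove server-side script blocks when enabled. Illustrative only.
 *
 * @param string $str     Input markup
 * @param bool   $enabled Stand-in for the option checked in the diff
 * @return string
 */
function stripServerSideScripts($str, $enabled)
{
    if (!$enabled) {
        return $str;
    }

    /* Non-greedy match from an opening script marker to the nearest
       closing one; the "s" modifier lets "." span line breaks. */
    return preg_replace('/<\?.*?\?>/s', '', $str);
}
```

With the gate off the input comes back unchanged, which is the case the commit introduces.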
@@ -516,8 +631,8 @@ protected function parseTag()
516631
}
517632

518633
if (empty($name)) {
519-
$this->content->fastForward(1);
520-
continue;
634+
$this->content->skipByToken('blank');
635+
continue;
521636
}
522637

523638
$this->content->skipByToken('blank');
@@ -588,6 +703,13 @@ protected function parseTag()
588703

589704
// We force self closing on this tag.
590705
$node->getTag()->selfClosing();
706+
707+
// Should this tag use a trailing slash?
708+
if(in_array($tag, $this->noSlash))
709+
{
710+
$node->getTag()->noTrailingSlash();
711+
}
712+
591713
}
592714

593715
$this->content->fastForward(1);
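
Note on the trailing-slash decision in `parseTag()`: a self-closing tag that also appears in the no-slash list is flagged with `noTrailingSlash()`. How the tag is then rendered is not shown in this diff, so the sketch below is an assumption about the intended output (`<br>` rather than `<br />`). `renderOpeningTag` is a hypothetical function and attributes are left out.

```php
<?php
/**
 * Render an opening tag under the two lists. Illustrative only.
 *
 * @param string $name        Tag name
 * @param array  $selfClosing Tags that never have a closing tag
 * @param array  $noSlash     Self-closing tags rendered without " /"
 * @return string
 */
function renderOpeningTag($name, array $selfClosing, array $noSlash)
{
    // Ordinary tags are unaffected by either list.
    if (!in_array($name, $selfClosing)) {
        return '<' . $name . '>';
    }

    // Self-closing: html5 style when listed in $noSlash, XHTML style otherwise.
    if (in_array($name, $noSlash)) {
        return '<' . $name . '>';
    }

    return '<' . $name . ' />';
}
```

Under this assumption, a tag listed only in the no-slash list is rendered like any ordinary tag, since the check sits inside the self-closing branch.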
