Here is a function I wrote that builds on the previous remarks about charset problems (UTF-8...) when using loadHTML and then the DOM functions.
It adds the charset meta tag just after <head> to improve automatic encoding detection and converts every non-ASCII character to an HTML entity, so that the PHP DOM functions and attributes return correct values.
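<?php
// Candidate encodings, tried in this order by mb_detect_encoding().
mb_detect_order("ASCII,UTF-8,ISO-8859-1,windows-1252,iso-8859-15");

/**
 * Fetch $url and return it as a DOMDocument, or FALSE on failure.
 * $encod is the source charset; it is auto-detected when left empty.
 */
function loadNprepare($url, $encod = '') {
    $content = file_get_contents($url);
    // loadHTML() complains about an empty string, so bail out early.
    if (empty($content)) return FALSE;
    if (empty($encod))
        $encod = mb_detect_encoding($content);
    // Find the opening <head> tag (lowercase first, then uppercase).
    $headpos = mb_strpos($content, '<head>');
    if (FALSE === $headpos)
        $headpos = mb_strpos($content, '<HEAD>');
    if (FALSE !== $headpos) {
        // Insert the charset meta tag right after the 6 characters of "<head>".
        $headpos += 6;
        $content = mb_substr($content, 0, $headpos)
            . '<meta http-equiv="Content-Type" content="text/html; charset=' . $encod . '">'
            . mb_substr($content, $headpos);
    }
    // Turn every non-ASCII character into an HTML entity.
    // Note: the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2.
    $content = mb_convert_encoding($content, 'HTML-ENTITIES', $encod);
    $dom = new DomDocument;
    $res = $dom->loadHTML($content);
    if (!$res) return FALSE;
    return $dom;
}
?>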
NB: it uses mb_strpos/mb_substr instead of mb_ereg_replace because that seemed more efficient with huge html pages.
DOMDocument::loadHTML
(PHP 5)
DOMDocument::loadHTML - Load HTML from a string
Description
This function parses the HTML document contained in the string source. Unlike XML, HTML does not need to be well-formed to be loaded. This function may also be called statically to load and create a DOMDocument object. The static call may be used when no DOMDocument properties need to be set before loading.
Parameters
- source: The HTML string.
Return Values
Returns TRUE on success or FALSE on failure. If called statically, returns a DOMDocument, but an E_STRICT error is raised.
Errors/Exceptions
If an empty string is passed as source, a warning is generated. This warning is not generated by libxml and cannot be handled using libxml's error handling functions.
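Because the empty-string warning does not come from libxml, one way to avoid it is to check the input before calling loadHTML(). The following is a minimal sketch, not part of the DOM extension: loadHtmlOrNull() is a hypothetical helper name, and treating whitespace-only input as empty is an assumption made here for convenience.

```php
<?php
// Hypothetical helper (not part of the DOM extension): returns a DOMDocument,
// or null when the input is empty, whitespace-only, or cannot be parsed.
function loadHtmlOrNull(string $source): ?DOMDocument
{
    // Guard first: the empty-string warning is not a libxml error,
    // so libxml_use_internal_errors() cannot intercept it.
    if (trim($source) === '') {
        return null;
    }

    // Collect libxml parse errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadHTML($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $doc : null;
}

$doc = loadHtmlOrNull("<html><body>Test<br></body></html>");
echo $doc === null ? "nothing to parse\n" : $doc->saveHTML();
```

The previous libxml error mode is restored before returning, so the helper does not change error handling for the rest of the script.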
Examples
Example #1 Creating a document
<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br></body></html>");
echo $doc->saveHTML();
?>
See Also
- DOMDocument::loadHTMLFile - Load HTML from a file
- DOMDocument::saveHTML - Dumps the internal document into a string using HTML formatting
- DOMDocument::saveHTMLFile - Dumps the internal document into a file using HTML formatting
DOMDocument::loadHTML
14-Jun-2009 03:29
11-Feb-2009 04:05
It should be noted that when any text is provided within the body tag
outside of a containing element, the DOMDocument will encapsulate that
text into a paragraph tag (<p>).
For example:
<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br><div>Text</div></body></html>");
echo $doc->saveHTML();
?>
will yield:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>Test<br></p>
<div>Text</div>
</body></html>
while:
<?php
$doc = new DOMDocument();
$doc->loadHTML(
"<html><body><i>Test</i><br><div>Text</div></body></html>");
echo $doc->saveHTML();
?>
will yield:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<i>Test</i><br><div>Text</div>
</body></html>
20-Oct-2008 06:37
Using loadHTML() automagically sets the doctype property of your DOMDocument instance (to the doctype in the HTML, or to HTML 4.0 Transitional by default). If you set the doctype with DOMImplementation beforehand, it will be overridden.
I assumed I could set the doctype and then load HTML so that the document kept the doctype I had defined (in order to choose the doctype at runtime), and I had a huge headache trying to find out where my doctype was going. Hopefully this helps someone else.
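The override can be observed with a short script. This is only a sketch: the XHTML 1.0 Strict doctype and the sample markup are arbitrary choices for illustration, and the doctype reported after loading may vary with the libxml version.

```php
<?php
// Build a document with an explicit XHTML 1.0 Strict doctype.
$impl = new DOMImplementation();
$doctype = $impl->createDocumentType(
    'html',
    '-//W3C//DTD XHTML 1.0 Strict//EN',
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'
);
$doc = $impl->createDocument(null, 'html', $doctype);
$before = $doc->doctype->publicId;

// loadHTML() replaces the whole document, including the doctype node.
$doc->loadHTML('<!DOCTYPE html><html><body><p>Test</p></body></html>');
$after = $doc->doctype !== null ? $doc->doctype->publicId : null;

var_dump($before);            // the public ID set through DOMImplementation
var_dump($after === $before); // whether that doctype survived loadHTML()
```

If a specific doctype is needed in the output, it has to be dealt with after loading rather than before.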
19-Nov-2007 02:51
For more info on how loadHTML/loadHTMLFile handle encodings, please visit http://www.onphp5.com/article/57
04-Oct-2007 08:38
If you use loadHTML() to process a UTF-8 HTML string (e.g. in Vietnamese), you may get garbage text for some files while others are fine, even when your HTML already has a meta charset tag like
<meta http-equiv="content-type" content="text/html; charset=utf-8">
I have discovered that, for loadHTML() to process a UTF-8 file correctly, the meta tag must come first, before any non-ASCII text appears. For example, this HTML file
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title> Vietnamese - Tiếng Việt</title>
</head>
<body></body>
</html>
will be OK with loadHTML() because the <meta> tag appears before the <title> tag.
But the file below will not be recognized correctly by loadHTML(), because the <title> tag, which contains UTF-8 text, appears before the <meta> tag.
<html>
<head>
<title> Vietnamese - Tiếng Việt</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body></body>
</html>
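One possible workaround for markup you do not control is to insert a charset meta tag directly after the opening <head> tag before calling loadHTML(). The sketch below assumes the input is UTF-8; putCharsetMetaFirst() is a hypothetical helper name, and prepending the tag when no <head> exists is an assumption, not documented behaviour.

```php
<?php
// Hypothetical helper: put a charset <meta> tag immediately after the opening
// <head> tag so that it precedes any non-ASCII text in the head.
function putCharsetMetaFirst(string $html, string $charset = 'utf-8'): string
{
    $meta = '<meta http-equiv="content-type" content="text/html; charset=' . $charset . '">';
    $count = 0;
    // Match the first <head> or <head ...> tag, case-insensitively.
    $result = preg_replace_callback(
        '/<head\b[^>]*>/i',
        function (array $m) use ($meta) {
            return $m[0] . $meta;
        },
        $html,
        1,
        $count
    );
    // Assumption: with no <head> tag at all, prepending the tag is good enough.
    return $count > 0 ? $result : $meta . $html;
}

$html = '<html><head><title> Vietnamese - Tiếng Việt</title>'
      . '<meta http-equiv="content-type" content="text/html; charset=utf-8">'
      . '</head><body></body></html>';
$fixed = putCharsetMetaFirst($html);

$doc = new DOMDocument();
$doc->loadHTML($fixed);
echo $doc->getElementsByTagName('title')->item(0)->textContent, "\n";
```

The original meta tag is left in place; the inserted one simply comes earlier in the byte stream.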
18-Jun-2007 10:55
The comment from bigtree at DONTSPAM dot 29a dot nl
26-Apr-2005 11:15 was helpful.
In addition I noted that if your doctype declaration is not valid, DomDocument::loadHtml won't respect your charset=utf-8. It made me crazy. Beware!
27-Apr-2007 03:50
When using loadHTML() to process UTF-8 pages, you may find that the output of the DOM functions does not match the input. For example, if you want to get "Cạnh tranh", you will receive "Cáº¡nh tranh". I suggest using mb_convert_encoding before loading a UTF-8 page:
<?php
$pageDom = new DomDocument();
// Convert non-ASCII characters to HTML entities, then load the converted string.
$searchPage = mb_convert_encoding($htmlUTF8Page, 'HTML-ENTITIES', "UTF-8");
@$pageDom->loadHTML($searchPage);
?>
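The 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2. A similar effect can be sketched with mb_encode_numericentity(), which turns every non-ASCII character into a decimal numeric entity. The helper name utf8ToNumericEntities() is hypothetical, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: encode every non-ASCII character of a UTF-8 string as a
// decimal numeric entity, leaving the ASCII markup untouched.
function utf8ToNumericEntities(string $html): string
{
    // Map format: [first code point, last code point, offset, mask].
    return mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

$htmlUTF8Page = '<html><body><p>Cạnh tranh</p></body></html>';
$encoded = utf8ToNumericEntities($htmlUTF8Page);

$pageDom = new DOMDocument();
$pageDom->loadHTML($encoded);
echo $pageDom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Since the string handed to loadHTML() is then pure ASCII, the parser no longer has to guess the encoding.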
15-Feb-2007 04:31
Note that the elements of a document loaded this way will have no namespace, even with <html xmlns="http://www.w3.org/1999/xhtml">
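A quick way to check this is to inspect namespaceURI on the root element. The sketch below uses arbitrary sample markup; one practical consequence it illustrates is that XPath queries against such a document need no namespace prefix.

```php
<?php
// Keep any parser notices out of the output.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Test</p></body></html>');

// The xmlns declaration does not put the elements in the XHTML namespace.
$namespace = $doc->documentElement->namespaceURI;
var_dump($namespace);

// So an unprefixed XPath expression matches the elements directly.
$xpath = new DOMXPath($doc);
$paragraphs = $xpath->query('//p');
var_dump($paragraphs->length);
```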
26-Apr-2005 09:15
Pay attention when loading html that has a different charset than iso-8859-1. Since this method does not actively try to figure out what the html you are trying to load is encoded in (like most browsers do), you have to specify it in the html head. If, for instance, your html is in utf-8, make sure you have a meta tag in the html's head section:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
If you do not specify the charset like this, all high-ascii bytes will be html-encoded. It is not enough to set the dom document you are loading the html in to UTF-8.
