scratch/output/Scratch/fr/blog/2010-05-19-How-to-cut-HTML-and-repair-it/index.html
Yann Esposito (Yogsototh) 57d77cd030 Regen complete
2012-04-02 23:43:39 +02:00


<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr" xml:lang="fr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="keywords" content="tree, HTML, script, ruby" />
<link rel="shortcut icon" type="image/x-icon" href="/Scratch/img/favicon.ico" />
<link rel="stylesheet" type="text/css" href="/Scratch/assets/css/main.css" />
<link rel="stylesheet" type="text/css" href="/Scratch/css/solarized.css" />
<link rel="stylesheet" type="text/css" href="/Scratch/css/idc.css" />
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://feeds.feedburner.com/yannespositocomfr"/>
<link rel="alternate" lang="fr" xml:lang="fr" title="Comment réparer un XML coupé ?" type="text/html" hreflang="fr" href="/Scratch/fr/blog/2010-05-19-How-to-cut-HTML-and-repair-it/" />
<link rel="alternate" lang="en" xml:lang="en" title="How to repair a cutted XML?" type="text/html" hreflang="en" href="/Scratch/en/blog/2010-05-19-How-to-cut-HTML-and-repair-it/" />
<script type="text/javascript" src="/Scratch/js/jquery-1.3.1.min.js"></script>
<script type="text/javascript" src="/Scratch/js/jquery.cookie.js"></script>
<script type="text/javascript" src="/Scratch/js/index.js"></script>
<!--[if lt IE 9]>
<script src="http://ie7-js.googlecode.com/svn/version/2.1(beta4)/IE9.js"></script>
<![endif]-->
<title>How to repair truncated XML?</title>
</head>
<body lang="fr" class="article">
<script type="text/javascript">// <![CDATA[
document.write('<div id="blackpage"><img src="/Scratch/img/loading.gif" alt="Loading..."/></div>');
// ]]>
</script>
<div id="content">
<div id="choix">
<div class="return"><a href="#entete">&darr; Menu &darr;</a></div>
<div id="choixlang">
<a href="/Scratch/en/blog/2010-05-19-How-to-cut-HTML-and-repair-it/" onclick="setLanguage('en')">in English</a>
</div>
<div class="flush"></div>
</div>
<div id="titre">
<h1>
How to repair truncated XML?
</h1>
<h2>
and how to manage without a parser?
</h2>
</div>
<div class="flush"></div>
<div class="flush"></div>
<div id="afterheader">
<div class="corps">
<p>On my home page you can see the list of my latest articles, each with the beginning of its text. To do that, I need to cut the XHTML code of my pages right in the middle, so I had to find a way to repair it.</p>
<p>Let's take an example:</p>
<pre class="twilight">
<span class="MetaTagAll"><span class="MetaTagAll">&lt;</span><span class="MetaTagAll">div</span> <span class="MetaTagAll">class</span>=<span class="String"><span class="String">&quot;</span>corps<span class="String">&quot;</span></span><span class="MetaTagAll">&gt;</span></span>
<span class="MetaTagAll"><span class="MetaTagAll">&lt;</span><span class="MetaTagAll">div</span> <span class="MetaTagAll">class</span>=<span class="String"><span class="String">&quot;</span>intro<span class="String">&quot;</span></span><span class="MetaTagAll">&gt;</span></span>
<span class="MetaTagAll"><span class="MetaTagAll">&lt;</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>Introduction<span class="MetaTagAll"><span class="MetaTagAll">&lt;/</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>
<span class="MetaTagAll"><span class="MetaTagAll">&lt;/</span><span class="MetaTagAll">div</span><span class="MetaTagAll">&gt;</span></span>
<span class="MetaTagAll"><span class="MetaTagAll">&lt;</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>The first paragraph<span class="MetaTagAll"><span class="MetaTagAll">&lt;/</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>
<span class="MetaTagInline"><span class="MetaTagInline">&lt;</span><span class="MetaTagInline">img</span> <span class="MetaTagInline">src</span>=<span class="String"><span class="String">&quot;</span>/img/img.png<span class="String">&quot;</span></span> <span class="MetaTagInline">alt</span>=<span class="String"><span class="String">&quot;</span>an image<span class="String">&quot;</span></span>/<span class="MetaTagInline">&gt;</span></span>
<span class="MetaTagAll"><span class="MetaTagAll">&lt;</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>Another long paragraph<span class="MetaTagAll"><span class="MetaTagAll">&lt;/</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>
<span class="MetaTagAll"><span class="MetaTagAll">&lt;/</span><span class="MetaTagAll">div</span><span class="MetaTagAll">&gt;</span></span>
</pre>
<p>After cutting, I get:</p>
<pre class="twilight">
<span class="MetaTagAll"><span class="MetaTagAll">&lt;</span><span class="MetaTagAll">div</span> <span class="MetaTagAll">class</span>=<span class="String"><span class="String">&quot;</span>corps<span class="String">&quot;</span></span><span class="MetaTagAll">&gt;</span></span>
<span class="MetaTagAll"><span class="MetaTagAll">&lt;</span><span class="MetaTagAll">div</span> <span class="MetaTagAll">class</span>=<span class="String"><span class="String">&quot;</span>intro<span class="String">&quot;</span></span><span class="MetaTagAll">&gt;</span></span>
<span class="MetaTagAll"><span class="MetaTagAll">&lt;</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>Introduction<span class="MetaTagAll"><span class="MetaTagAll">&lt;/</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>
<span class="MetaTagAll"><span class="MetaTagAll">&lt;/</span><span class="MetaTagAll">div</span><span class="MetaTagAll">&gt;</span></span>
<span class="MetaTagAll"><span class="MetaTagAll">&lt;</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>The first paragraph<span class="MetaTagAll"><span class="MetaTagAll">&lt;/</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>
<span class="MetaTagInline"><span class="MetaTagInline">&lt;</span><span class="MetaTagInline">img</span> <span class="MetaTagInline">src</span>=<span class="String"><span class="String">&quot;</span>/img/im</span></span>
</pre>
<p>Right in the middle of an <code>&lt;img&gt;</code> tag!</p>
<p>In fact, this is not as difficult as it may seem at first. The key is to realize that you don't need to keep the complete structure of the tree to repair it, only the list of unclosed parents.</p>
<p>In our example, right after the paragraph <code>first paragraph</code>, we only need to close the <code>div</code> of class <code>corps</code> and the XML is repaired. Of course, when the cut falls in the middle of a tag, we just go back to the point right before that corrupted tag starts.</p>
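<p>As a minimal sketch of that last step, here is one way to go back to just before a corrupted tag in Ruby. The helper name <code>drop_partial_tag</code> is hypothetical and only serves the illustration.</p>

```ruby
# Hypothetical helper: drop an unfinished tag left at the very end of a string.
def drop_partial_tag(xml)
  # "<" followed by anything except ">" up to the end of the string (\z).
  # A complete tag never matches, because its ">" stops the [^>]* run.
  xml.sub(/<[^>]*\z/, '')
end

cut = %{<p>The first paragraph</p>\n<img src="/img/im}
puts drop_partial_tag(cut).inspect
# prints "<p>The first paragraph</p>\n"
```

<p>The anchor <code>\z</code> is used rather than <code>$</code> because in Ruby <code>$</code> also matches at the end of every line, which could remove the start of a complete tag written over several lines.</p>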
<p>So all we have to do is record the list of parents in a stack. Suppose we process the first example completely. The stack will go through the following states:</p>
<pre class="twilight">
[]
[div] <span class="MetaTagAll"><span class="MetaTagAll">&lt;</span><span class="MetaTagAll">div</span> <span class="MetaTagAll">class</span>=<span class="String"><span class="String">&quot;</span>corps<span class="String">&quot;</span></span><span class="MetaTagAll">&gt;</span></span>
[div, div] <span class="MetaTagAll"><span class="MetaTagAll">&lt;</span><span class="MetaTagAll">div</span> <span class="MetaTagAll">class</span>=<span class="String"><span class="String">&quot;</span>intro<span class="String">&quot;</span></span><span class="MetaTagAll">&gt;</span></span>
[div, div, p] <span class="MetaTagAll"><span class="MetaTagAll">&lt;</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>
Introduction
[div, div] <span class="MetaTagAll"><span class="MetaTagAll">&lt;/</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>
[div] <span class="MetaTagAll"><span class="MetaTagAll">&lt;/</span><span class="MetaTagAll">div</span><span class="MetaTagAll">&gt;</span></span>
[div, p] <span class="MetaTagAll"><span class="MetaTagAll">&lt;</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>
The first paragraph
[div] <span class="MetaTagAll"><span class="MetaTagAll">&lt;/</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>
[div] <span class="MetaTagInline"><span class="MetaTagInline">&lt;</span><span class="MetaTagInline">img</span> <span class="MetaTagInline">src</span>=<span class="String"><span class="String">&quot;</span>/img/img.png<span class="String">&quot;</span></span> <span class="MetaTagInline">alt</span>=<span class="String"><span class="String">&quot;</span>an image<span class="String">&quot;</span></span>/<span class="MetaTagInline">&gt;</span></span>
[div, p] <span class="MetaTagAll"><span class="MetaTagAll">&lt;</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>
Another long paragraph
[div] <span class="MetaTagAll"><span class="MetaTagAll">&lt;/</span><span class="MetaTagAll">p</span><span class="MetaTagAll">&gt;</span></span>
[] <span class="MetaTagAll"><span class="MetaTagAll">&lt;/</span><span class="MetaTagAll">div</span><span class="MetaTagAll">&gt;</span></span>
</pre>
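<p>The states above can be reproduced with a short Ruby sketch. The function name <code>stack_states</code> and the regular expression are assumptions made for this illustration; the regular expression only handles simple tags, not comments or processing instructions.</p>

```ruby
# Hypothetical helper: return the successive states of the stack of open tags.
def stack_states(xml)
  stack = []
  states = [stack.dup]
  # Groups: "/" for a closing tag, the tag name, "/" for a self-closing tag.
  xml.scan(%r{<(/?)(\w+)[^>]*?(/?)>}) do |closing, name, self_closing|
    if self_closing == "/"
      # <img .../> opens and closes at once: the stack is unchanged
    elsif closing == "/"
      stack.pop
    else
      stack.push(name)
    end
    states << stack.dup
  end
  states
end

xml = <<~XML
  <div class="corps">
    <div class="intro">
      <p>Introduction</p>
    </div>
    <p>The first paragraph</p>
    <img src="/img/img.png" alt="an image"/>
  </div>
XML

stack_states(xml).each { |s| puts s.inspect }
```

<p>If the input is cut at any point, the last state is exactly the list of parents that still have to be closed, innermost last.</p>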
<p>The algorithm is then very simple:</p>
<pre class="twilight">
let res be the XML as a string;
read res and each time you encounter a tag:
    if it is a self-closing one:
        ignore it
    else if it is an opening one:
        push it to the stack
    else if it is a closing one:
        pop the stack.

remove any malformed/truncated tag at the end of res
for each tag in the stack, pop it, and write:
    res = res + closing tag

return res
</pre>
<p>And <code>res</code> contains the repaired XML.</p>
<p>Finally, here is the Ruby code I use. The variable <code>xml</code> contains the truncated XML.</p>
<div class="code"><div class="file"><a href="/Scratch/fr/blog/2010-05-19-How-to-cut-HTML-and-repair-it/code/repair_xml.rb"> &#x27A5; repair_xml.rb </a></div><div class="withfile">
<pre class="twilight">
<span class="Comment"><span class="Comment">#</span> repair truncated XML code by closing the open tags</span>
<span class="Comment"><span class="Comment">#</span> works even if the XML is cut in the middle of a tag.</span>
<span class="Comment"><span class="Comment">#</span> example:</span>
<span class="Comment"><span class="Comment">#</span> transform '&lt;div&gt; &lt;span&gt; toto &lt;/span&gt; &lt;p&gt; hello &lt;a href=&quot;http://tur'</span>
<span class="Comment"><span class="Comment">#</span> into '&lt;div&gt; &lt;span&gt; toto &lt;/span&gt; &lt;p&gt; hello &lt;/p&gt;&lt;/div&gt;'</span>
<span class="Keyword">def</span> <span class="Entity">repair_xml</span>(<span class="Variable"> xml </span>)
    parents<span class="Keyword">=</span>[]
    depth<span class="Keyword">=</span><span class="Constant">0</span>
    <span class="Comment"><span class="Comment">#</span> m[0] is &quot;/&quot; for a closing tag, m[1] is the tag name,</span>
    <span class="Comment"><span class="Comment">#</span> m[2] is &quot;/&quot; for a self-closing tag (the lazy *? lets it capture the slash)</span>
    xml.<span class="Entity">scan</span>( <span class="StringRegexp"><span class="StringRegexp">%r{</span>&lt;<span class="StringRegexp"><span class="StringRegexp">(</span>/?<span class="StringRegexp">)</span></span><span class="StringRegexp"><span class="StringRegexp">(</span><span class="StringRegexpSpecial">\w</span>*<span class="StringRegexp">)</span></span><span class="StringRegexp"><span class="StringRegexp">[</span>^&gt;<span class="StringRegexp">]</span></span>*?<span class="StringRegexp"><span class="StringRegexp">(</span>/?<span class="StringRegexp">)</span></span>&gt;<span class="StringRegexp">}</span></span> ).<span class="Entity">each</span> <span class="Keyword">do </span>|<span class="Variable">m</span>|
        <span class="Keyword">if</span> m[<span class="Constant">2</span>] <span class="Keyword">==</span> <span class="String"><span class="String">&quot;</span>/<span class="String">&quot;</span></span>
            <span class="Keyword">next</span>
        <span class="Keyword">end</span>
        <span class="Keyword">if</span> m[<span class="Constant">0</span>] <span class="Keyword">==</span> <span class="String"><span class="String">&quot;</span><span class="String">&quot;</span></span>
            parents[depth]<span class="Keyword">=</span>m[<span class="Constant">1</span>]
            depth<span class="Keyword">+=</span><span class="Constant">1</span>
        <span class="Keyword">else</span>
            depth<span class="Keyword">-=</span><span class="Constant">1</span>
        <span class="Keyword">end</span>
    <span class="Keyword">end</span>
    <span class="Comment"><span class="Comment">#</span> remove a tag left unfinished at the very end of the string</span>
    res<span class="Keyword">=</span>xml.<span class="Entity">sub</span>(<span class="StringRegexp"><span class="StringRegexp">/</span></span><span class="StringRegexp">&lt;<span class="StringRegexp"><span class="StringRegexp">[</span>^&gt;<span class="StringRegexp">]</span></span>*<span class="StringRegexpSpecial">\z</span></span><span class="StringRegexp"><span class="StringRegexp">/</span></span>,<span class="String"><span class="String">'</span><span class="String">'</span></span>)
    <span class="Comment"><span class="Comment">#</span> close the remaining parents, innermost first</span>
    depth<span class="Keyword">-=</span><span class="Constant">1</span>
    depth.<span class="Entity">downto</span>(<span class="Constant">0</span>).<span class="Entity">each</span> { |<span class="Variable">x</span>| res<span class="Keyword">&lt;&lt;=</span> <span class="String"><span class="String">%{</span>&lt;/<span class="StringEmbeddedSource"><span class="StringEmbeddedSource">#{</span>parents<span class="StringEmbeddedSource">[</span>x<span class="StringEmbeddedSource">]</span><span class="StringEmbeddedSource">}</span></span>&gt;<span class="String">}</span></span> }
    res
<span class="Keyword">end</span>
</pre>
</div></div>
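<p>To show the whole chain, cut then repair, here is a self-contained sketch of how an excerpt could be produced. The names <code>excerpt</code> and <code>limit</code> are hypothetical and are not part of <code>repair_xml.rb</code>; this variant keeps the parents in an array used as a real stack instead of a depth counter.</p>

```ruby
# Sketch only: `excerpt` and `limit` are hypothetical names for this illustration.
def excerpt(xml, limit)
  # Keep the first `limit` characters, then drop a tag left unfinished at the end.
  cut = xml[0, limit].sub(/<[^>]*\z/, '')
  open_tags = []
  cut.scan(%r{<(/?)(\w+)[^>]*?(/?)>}) do |closing, name, self_closing|
    next if self_closing == "/"            # <img .../> never stays open
    closing == "/" ? open_tags.pop : open_tags.push(name)
  end
  # Close the remaining parents, innermost first.
  cut + open_tags.reverse.map { |name| "</#{name}>" }.join
end

page = '<div class="corps"><p>The first paragraph</p>' \
       '<img src="/img/img.png" alt="an image"/>' \
       '<p>Another long paragraph</p></div>'
puts excerpt(page, 60)
# prints <div class="corps"><p>The first paragraph</p></div>
```

<p>This sketch assumes well-formed input before the cut. It does not handle comments, CDATA sections, or a cut that falls in the middle of an entity such as <code>&amp;nbsp;</code>.</p>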
<p>I don't know whether this code will be useful to you. The reasoning that leads to it, however, is worth knowing.</p>
</div>
<div id="choixrss">
<a id="rss" href="http://feeds.feedburner.com/yannespositocomfr">
subscribe
</a>
</div>
<script type="text/javascript">
$(document).ready(function(){
$('#comment').hide();
$('#clickcomment').click(showComments);
});
function showComments() {
$('#comment').show();
$('#clickcomment').fadeOut();
}
document.write('<div id="clickcomment">Comments</div>');
</script>
<div class="flush"></div>
<div class="corps" id="comment">
<h2 class="first">comments</h2>
<noscript>
You must enable JavaScript to comment.
</noscript>
<script type="text/javascript">
var idcomments_acct = 'a307f0044511ff1b5cfca573fc0a52e7';
var idcomments_post_id = '/Scratch/fr/blog/2010-05-19-How-to-cut-HTML-and-repair-it/';
var idcomments_post_url = 'http://yannesposito.com/Scratch/fr/blog/2010-05-19-How-to-cut-HTML-and-repair-it/';
</script>
<span id="IDCommentsPostTitle" style="display:none"></span>
<script type='text/javascript' src='/Scratch/js/genericCommentWrapperV2.js'></script>
</div>
<div id="entete" class="corps_spaced">
<div id="liens">
<ul><li><a href="/Scratch/fr/">Welcome</a></li>
<li><a href="/Scratch/fr/blog/">Blog</a></li>
<li><a href="/Scratch/fr/softwares/">Softwares</a></li>
<li><a href="/Scratch/fr/about/">About</a></li></ul>
</div>
<div class="flush"></div>
<hr/>
<div id="next_before_articles">
<div id="previous_articles">
previous articles
<div class="previous_article">
<a href="/Scratch/fr/blog/2010-05-17-at-least-this-blog-revive/"><span class="nicer">«</span>&nbsp;I'm coming back to life!</a>
</div>
<div class="previous_article">
<a href="/Scratch/fr/blog/2010-03-23-Encapsulate-git/"><span class="nicer">«</span>&nbsp;Encapsulate git</a>
</div>
<div class="previous_article">
<a href="/Scratch/fr/blog/2010-03-22-Git-Tips/"><span class="nicer">«</span>&nbsp;Git Tips</a>
</div>
</div>
<div id="next_articles">
next articles
<div class="next_article">
<a href="/Scratch/fr/blog/2010-05-24-Trees--Pragmatism-and-Formalism/">Trees; Pragmatism and Formalism&nbsp;<span class="nicer">»</span></a>
</div>
<div class="next_article">
<a href="/Scratch/fr/blog/2010-06-14-multi-language-choices/">choices related to writing in several languages&nbsp;<span class="nicer">»</span></a>
</div>
<div class="next_article">
<a href="/Scratch/fr/blog/2010-06-15-Get-my-blog-engine/">Get my blog engine&nbsp;<span class="nicer">»</span></a>
</div>
</div>
<div class="flush"></div>
</div>
</div>
<div id="bottom">
<div>
<a href="http://twitter.com/yogsototh">Follow me</a>
</div>
<div>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/deed.fr">Copyright ©, Yann Esposito</a>
</div>
<div id="lastmod">
Written on: 19/05/2010
modified on: 04/10/2010
</div>
<div>
Site entirely made with
<a href="http://www.vim.org">Vim</a>
and
<a href="http://nanoc.stoneship.org">nanoc</a>
</div>
</div>
<div class="clear"></div>
</div>
</body>
</html>