<?xml version="1.0" encoding="utf-8"?>
< !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
< html xmlns = "http://www.w3.org/1999/xhtml" lang = "fr" xml:lang = "fr" >
< head >
< meta http-equiv = "Content-Type" content = "text/html; charset=UTF-8" / >
< meta name = "keywords" content = "tree, HTML, script, ruby" / >
< link rel = "shortcut icon" type = "image/x-icon" href = "/Scratch/img/favicon.ico" / >
< link rel = "stylesheet" type = "text/css" href = "/Scratch/assets/css/main.css" / >
< link rel = "stylesheet" type = "text/css" href = "/Scratch/css/twilight.css" / >
< link rel = "stylesheet" type = "text/css" href = "/Scratch/css/idc.css" / >
< link rel = "alternate" type = "application/rss+xml" title = "RSS" href = "http://feeds.feedburner.com/yannespositocomfr" / >
< link rel = "alternate" lang = "fr" xml:lang = "fr" title = "Comment réparer un XML coupé ?" type = "text/html" hreflang = "fr" href = "/Scratch/fr/blog/2010-05-19-How-to-cut-HTML-and-repair-it/" / >
< link rel = "alternate" lang = "en" xml:lang = "en" title = "How to repair a cutted XML?" type = "text/html" hreflang = "en" href = "/Scratch/en/blog/2010-05-19-How-to-cut-HTML-and-repair-it/" / >
< script type = "text/javascript" src = "/Scratch/js/jquery-1.3.1.min.js" > < / script >
< script type = "text/javascript" src = "/Scratch/js/jquery.cookie.js" > < / script >
< script type = "text/javascript" src = "/Scratch/js/index.js" > < / script >
<!-- [if lt IE 9]>
< script src = "http://ie7-js.googlecode.com/svn/version/2.1(beta4)/IE9.js" > < / script >
<![endif]-->
<!-- < % if containMaths %>
< script type = "text/javascript" src = "/Scratch/js/MathJax/MathJax.js" > < / script >
< % end %>
-->
< title > How to repair a cut XML?< / title >
< / head >
< body lang = "fr" class = "article" >
< script type = "text/javascript" > // <![CDATA[
document.write('< div id = "blackpage" > < img src = "/Scratch/img/loading.gif" alt = "Loading..." / > < / div > ');
// ]]>
< / script >
< div id = "content" >
< div id = "choix" >
< div class = "return" > < a href = "#entete" > ↓ Menu ↓ < / a > < / div >
< div id = "choixlang" >
< a href = "/Scratch/en/blog/2010-05-19-How-to-cut-HTML-and-repair-it/" onclick = "setLanguage('en')" > in English< / a >
< / div >
< div class = "flush" > < / div >
< / div >
< div id = "titre" >
< h1 >
How to repair a cut XML?
< / h1 >
< h2 >
and how to manage without a parser?
< / h2 >
< / div >
< div class = "flush" > < / div >
< div class = "flush" > < / div >
< div id = "afterheader" >
< div class = "corps" >
< p > On my home page you can see the list of my latest articles, each shown with its first lines. To do that, I need to cut the XHTML code of my pages right in the middle, so I had to find a way to repair the result.< / p >
< p > Let us take an example:< / p >
< pre class = "twilight" >
< span class = "MetaTagAll" > < span class = "MetaTagAll" > < < / span > < span class = "MetaTagAll" > div< / span > < span class = "MetaTagAll" > class< / span > =< span class = "String" > < span class = "String" > " < / span > corps< span class = "String" > " < / span > < / span > < span class = "MetaTagAll" > > < / span > < / span >
< span class = "MetaTagAll" > < span class = "MetaTagAll" > < < / span > < span class = "MetaTagAll" > div< / span > < span class = "MetaTagAll" > class< / span > =< span class = "String" > < span class = "String" > " < / span > intro< span class = "String" > " < / span > < / span > < span class = "MetaTagAll" > > < / span > < / span >
< span class = "MetaTagAll" > < span class = "MetaTagAll" > < < / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span > Introduction< span class = "MetaTagAll" > < span class = "MetaTagAll" > < /< / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span >
< span class = "MetaTagAll" > < span class = "MetaTagAll" > < /< / span > < span class = "MetaTagAll" > div< / span > < span class = "MetaTagAll" > > < / span > < / span >
< span class = "MetaTagAll" > < span class = "MetaTagAll" > < < / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span > The first paragraph< span class = "MetaTagAll" > < span class = "MetaTagAll" > < /< / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span >
< span class = "MetaTagInline" > < span class = "MetaTagInline" > < < / span > < span class = "MetaTagInline" > img< / span > < span class = "MetaTagInline" > src< / span > =< span class = "String" > < span class = "String" > " < / span > /img/img.png< span class = "String" > " < / span > < / span > < span class = "MetaTagInline" > alt< / span > =< span class = "String" > < span class = "String" > " < / span > an image< span class = "String" > " < / span > < / span > /< span class = "MetaTagInline" > > < / span > < / span >
< span class = "MetaTagAll" > < span class = "MetaTagAll" > < < / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span > Another long paragraph< span class = "MetaTagAll" > < span class = "MetaTagAll" > < /< / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span >
< span class = "MetaTagAll" > < span class = "MetaTagAll" > < /< / span > < span class = "MetaTagAll" > div< / span > < span class = "MetaTagAll" > > < / span > < / span >
< / pre >
< p > After the cut, I get:< / p >
< pre class = "twilight" >
< span class = "MetaTagAll" > < span class = "MetaTagAll" > < < / span > < span class = "MetaTagAll" > div< / span > < span class = "MetaTagAll" > class< / span > =< span class = "String" > < span class = "String" > " < / span > corps< span class = "String" > " < / span > < / span > < span class = "MetaTagAll" > > < / span > < / span >
< span class = "MetaTagAll" > < span class = "MetaTagAll" > < < / span > < span class = "MetaTagAll" > div< / span > < span class = "MetaTagAll" > class< / span > =< span class = "String" > < span class = "String" > " < / span > intro< span class = "String" > " < / span > < / span > < span class = "MetaTagAll" > > < / span > < / span >
< span class = "MetaTagAll" > < span class = "MetaTagAll" > < < / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span > Introduction< span class = "MetaTagAll" > < span class = "MetaTagAll" > < /< / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span >
< span class = "MetaTagAll" > < span class = "MetaTagAll" > < /< / span > < span class = "MetaTagAll" > div< / span > < span class = "MetaTagAll" > > < / span > < / span >
< span class = "MetaTagAll" > < span class = "MetaTagAll" > < < / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span > The first paragraph< span class = "MetaTagAll" > < span class = "MetaTagAll" > < /< / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span >
< span class = "MetaTagInline" > < span class = "MetaTagInline" > < < / span > < span class = "MetaTagInline" > img< / span > < span class = "MetaTagInline" > src< / span > =< span class = "String" > < span class = "String" > " < / span > /img/im< / span > < / span >
< / pre >
< p > Right in the middle of an < code > < img> < / code > tag!< / p >
< p > In fact, this is not as difficult as it may seem at first. The key is to realise that you do not need to keep the complete structure of the tree to repair it, only the list of parents that are not yet closed.< / p >
< p > In our example, just after the paragraph < code > first paragraph< / code > we only have to close one < code > div< / code > , the one with class < code > corps< / code > , and the XML is repaired. Of course, when a tag is cut in the middle, we first have to go back to just before the start of that corrupted tag.< / p >
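< p > As an illustration of that last step, here is a minimal Ruby sketch. It assumes the cut text is held in a string; the name < code > drop_truncated_tag< / code > is hypothetical and only used for this example.< / p >

```ruby
# Hypothetical helper: remove a tag that was cut before its closing ">".
def drop_truncated_tag(xml)
  # Match "<" followed by anything except ">" up to the very end of the string.
  xml.sub(/<[^>]*\z/, "")
end

cut = '<p>The first paragraph</p><img src="/img/im'
puts drop_truncated_tag(cut)
```

< p > A string that ends on a complete tag is returned unchanged, because the pattern only matches an unterminated tag at the very end.< / p >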
< p > So all we have to do is record the list of parents in a stack. Suppose we process the first example completely. The stack will go through the following states:< / p >
< pre class = "twilight" >
[]
[div] < span class = "MetaTagAll" > < span class = "MetaTagAll" > < < / span > < span class = "MetaTagAll" > div< / span > < span class = "MetaTagAll" > class< / span > =< span class = "String" > < span class = "String" > " < / span > corps< span class = "String" > " < / span > < / span > < span class = "MetaTagAll" > > < / span > < / span >
[div, div] < span class = "MetaTagAll" > < span class = "MetaTagAll" > < < / span > < span class = "MetaTagAll" > div< / span > < span class = "MetaTagAll" > class< / span > =< span class = "String" > < span class = "String" > " < / span > intro< span class = "String" > " < / span > < / span > < span class = "MetaTagAll" > > < / span > < / span >
[div, div, p] < span class = "MetaTagAll" > < span class = "MetaTagAll" > < < / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span >
Introduction
[div, div] < span class = "MetaTagAll" > < span class = "MetaTagAll" > < /< / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span >
[div] < span class = "MetaTagAll" > < span class = "MetaTagAll" > < /< / span > < span class = "MetaTagAll" > div< / span > < span class = "MetaTagAll" > > < / span > < / span >
[div, p] < span class = "MetaTagAll" > < span class = "MetaTagAll" > < < / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span >
The first paragraph
[div] < span class = "MetaTagAll" > < span class = "MetaTagAll" > < /< / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span >
[div] < span class = "MetaTagInline" > < span class = "MetaTagInline" > < < / span > < span class = "MetaTagInline" > img< / span > < span class = "MetaTagInline" > src< / span > =< span class = "String" > < span class = "String" > " < / span > /img/img.png< span class = "String" > " < / span > < / span > < span class = "MetaTagInline" > alt< / span > =< span class = "String" > < span class = "String" > " < / span > an image< span class = "String" > " < / span > < / span > /< span class = "MetaTagInline" > > < / span > < / span >
[div, p] < span class = "MetaTagAll" > < span class = "MetaTagAll" > < < / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span >
Another long paragraph
[div] < span class = "MetaTagAll" > < span class = "MetaTagAll" > < /< / span > < span class = "MetaTagAll" > p< / span > < span class = "MetaTagAll" > > < / span > < / span >
[] < span class = "MetaTagAll" > < span class = "MetaTagAll" > < /< / span > < span class = "MetaTagAll" > div< / span > < span class = "MetaTagAll" > > < / span > < / span >
< / pre >
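< p > The trace above can be reproduced with a short Ruby sketch. This is only an illustration under simple assumptions (no < code > > < / code > inside attribute values, no comments to handle); < code > stack_states< / code > is a hypothetical helper, not part of the script shown later.< / p >

```ruby
# Hypothetical helper: returns the stack of open parents after each tag.
def stack_states(xml)
  stack = []
  states = []
  # Groups: optional leading "/", tag name, optional trailing "/".
  # The lazy [^>]*? lets the last group capture the "/" of self-closing tags.
  xml.scan(%r{<(/?)(\w+)[^>]*?(/?)>}) do |closing, name, self_closing|
    if self_closing == "/"
      # Self-closing tag such as <img ... />: the stack does not change.
    elsif closing == "/"
      stack.pop
    else
      stack.push(name)
    end
    states << stack.dup
  end
  states
end

example = '<div class="corps"><div class="intro"><p>Introduction</p></div>' \
          '<p>The first paragraph</p><img src="/img/img.png" alt="an image" /></div>'
stack_states(example).each { |s| puts "[#{s.join(', ')}]" }
```

< p > Whatever is left in the stack at the point where the text is cut is exactly the list of tags that still have to be closed.< / p >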
< p > The algorithm is then very simple:< / p >
< pre class = "twilight" >
let res be the XML as a string;
read res and each time you encounter a tag:
    if it is an opening one:
        push it to the stack
    else if it is a closing one:
        pop the stack.

remove any malformed/cut tag at the end of res
for each tag in the stack, pop it, and write:
    res = res + closing tag

return res
< / pre >
< p > And < code > res< / code > contains the repaired XML.< / p >
< p > Finally, here is the Ruby code I use. The variable < code > xml< / code > contains the cut XML.< / p >
< div class = "code" > < div class = "file" > < a href = "/Scratch/fr/blog/2010-05-19-How-to-cut-HTML-and-repair-it/code/repair_xml.rb" > ➥ repair_xml.rb < / a > < / div > < div class = "withfile" >
< pre class = "twilight" >
< span class = "Comment" > < span class = "Comment" > #< / span > repair cut XML code by closing the open tags< / span >
< span class = "Comment" > < span class = "Comment" > #< / span > works even if the XML is cut in the middle of a tag.< / span >
< span class = "Comment" > < span class = "Comment" > #< / span > example:< / span >
< span class = "Comment" > < span class = "Comment" > #< / span > transform '< div> < span> toto < /span> < p> hello < a href=" http://tur'< / span >
< span class = "Comment" > < span class = "Comment" > #< / span > into '< div> < span> toto < /span> < p> hello < /p> < /div> '< / span >
< span class = "Keyword" > def< / span > < span class = "Entity" > repair_xml< / span > (< span class = "Variable" > xml < / span > )
parents< span class = "Keyword" > =< / span > []
depth< span class = "Keyword" > =< / span > < span class = "Constant" > 0< / span >
xml.< span class = "Entity" > scan< / span > ( < span class = "StringRegexp" > < span class = "StringRegexp" > %r{< / span > < < span class = "StringRegexp" > < span class = "StringRegexp" > (< / span > /?< span class = "StringRegexp" > )< / span > < / span > < span class = "StringRegexp" > < span class = "StringRegexp" > (< / span > < span class = "StringRegexpSpecial" > \w< / span > *< span class = "StringRegexp" > )< / span > < / span > < span class = "StringRegexp" > < span class = "StringRegexp" > [< / span > ^> < span class = "StringRegexp" > ]< / span > < / span > *?< span class = "StringRegexp" > < span class = "StringRegexp" > (< / span > /?< span class = "StringRegexp" > )< / span > < / span > > < span class = "StringRegexp" > }< / span > < / span > ).< span class = "Entity" > each< / span > < span class = "Keyword" > do < / span > |< span class = "Variable" > m< / span > |
< span class = "Keyword" > if< / span > m[< span class = "Constant" > 2< / span > ] < span class = "Keyword" > ==< / span > < span class = "String" > < span class = "String" > " < / span > /< span class = "String" > " < / span > < / span >
< span class = "Keyword" > next< / span >
< span class = "Keyword" > end< / span >
< span class = "Keyword" > if< / span > m[< span class = "Constant" > 0< / span > ] < span class = "Keyword" > ==< / span > < span class = "String" > < span class = "String" > " < / span > < span class = "String" > " < / span > < / span >
parents[depth]< span class = "Keyword" > =< / span > m[< span class = "Constant" > 1< / span > ]
depth< span class = "Keyword" > +=< / span > < span class = "Constant" > 1< / span >
< span class = "Keyword" > else< / span >
depth< span class = "Keyword" > -=< / span > < span class = "Constant" > 1< / span >
< span class = "Keyword" > end< / span >
< span class = "Keyword" > end< / span >
res< span class = "Keyword" > =< / span > xml.< span class = "Entity" > sub< / span > (< span class = "StringRegexp" > < span class = "StringRegexp" > /< / span > < / span > < span class = "StringRegexp" > < < span class = "StringRegexp" > < span class = "StringRegexp" > [< / span > ^> < span class = "StringRegexp" > ]< / span > < / span > *$< / span > < span class = "StringRegexp" > < span class = "StringRegexp" > /m< / span > < / span > ,< span class = "String" > < span class = "String" > '< / span > < span class = "String" > '< / span > < / span > )
depth< span class = "Keyword" > -=< / span > < span class = "Constant" > 1< / span >
depth.< span class = "Entity" > downto< / span > (< span class = "Constant" > 0< / span > ).< span class = "Entity" > each< / span > { |< span class = "Variable" > x< / span > | res< span class = "Keyword" > < < =< / span > < span class = "String" > < span class = "String" > %{< / span > < /< span class = "StringEmbeddedSource" > < span class = "StringEmbeddedSource" > #{< / span > parents< span class = "StringEmbeddedSource" > [< / span > x< span class = "StringEmbeddedSource" > ]< / span > < span class = "StringEmbeddedSource" > }< / span > < / span > > < span class = "String" > }< / span > < / span > }
res
< span class = "Keyword" > end< / span >
< / pre >
< / div > < / div >
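< p > For comparison, the same idea can be sketched with an explicit stack using < code > Array#push< / code > and < code > Array#pop< / code > . This is an illustrative variant, not the script above: the name < code > repair_xml_with_stack< / code > is hypothetical, and it assumes that attribute values never contain < code > > < / code > . It uses a lazy < code > [^> ]*?< / code > so that the trailing slash of a self-closing tag is captured.< / p >

```ruby
# Hypothetical variant of the repair function, written with push/pop.
def repair_xml_with_stack(xml)
  # 1. Drop a tag that was cut before its closing ">".
  res = xml.sub(/<[^>]*\z/, "")
  # 2. Rebuild the stack of parents that are still open.
  stack = []
  res.scan(%r{<(/?)(\w+)[^>]*?(/?)>}) do |closing, name, self_closing|
    next if self_closing == "/" # <img ... /> never becomes a parent
    if closing == "/"
      stack.pop
    else
      stack.push(name)
    end
  end
  # 3. Close the remaining parents, innermost first.
  res + stack.reverse.map { |name| "</#{name}>" }.join
end

cut = '<div><span>toto</span><p>hello <a href="http://tur'
puts repair_xml_with_stack(cut)
```

< p > On the input taken from the comment at the top of the script, this closes the < code > p< / code > and then the < code > div< / code > , after discarding the truncated link.< / p >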
< p > I do not know whether this code will be useful to you. The reasoning that leads to it, however, is worth knowing.< / p >
< / div >
< div id = "choixrss" >
< a id = "rss" href = "http://feeds.feedburner.com/yannespositocomfr" >
subscribe
< / a >
< / div >
< script type = "text/javascript" >
$(document).ready(function(){
$('#comment').hide();
$('#clickcomment').click(showComments);
});
function showComments() {
$('#comment').show();
$('#clickcomment').fadeOut();
}
document.write('< div id = "clickcomment" > Comments< / div > ');
< / script >
< div class = "flush" > < / div >
< div class = "corps" id = "comment" >
< h2 class = "first" > comments< / h2 >
< noscript >
You must enable JavaScript to comment.
< / noscript >
< script type = "text/javascript" >
var idcomments_acct = 'a307f0044511ff1b5cfca573fc0a52e7';
var idcomments_post_id = '/Scratch/fr/blog/2010-05-19-How-to-cut-HTML-and-repair-it/';
var idcomments_post_url = 'http://yannesposito.com/Scratch/fr/blog/2010-05-19-How-to-cut-HTML-and-repair-it/';
< / script >
< span id = "IDCommentsPostTitle" style = "display:none" > < / span >
< script type = 'text/javascript' src = '/Scratch/js/genericCommentWrapperV2.js' > < / script >
< / div >
< div id = "entete" class = "corps_spaced" >
< div id = "liens" >
< ul > < li > < a href = "/Scratch/fr/" > Welcome< / a > < / li >
< li > < a href = "/Scratch/fr/blog/" > Blog< / a > < / li >
< li > < a href = "/Scratch/fr/softwares/" > Software< / a > < / li >
< li > < a href = "/Scratch/fr/about/" > About< / a > < / li > < / ul >
< / div >
< div class = "flush" > < / div >
< hr / >
< div id = "next_before_articles" >
< div id = "previous_articles" >
previous articles
< div class = "previous_article" >
< a href = "/Scratch/fr/blog/2010-05-17-at-least-this-blog-revive/" > < span class = "nicer" > «< / span > I am back to life!< / a >
< / div >
< div class = "previous_article" >
< a href = "/Scratch/fr/blog/2010-03-23-Encapsulate-git/" > < span class = "nicer" > «< / span > Encapsulate git< / a >
< / div >
< div class = "previous_article" >
< a href = "/Scratch/fr/blog/2010-03-22-Git-Tips/" > < span class = "nicer" > «< / span > Git tips< / a >
< / div >
< / div >
< div id = "next_articles" >
next articles
< div class = "next_article" >
< a href = "/Scratch/fr/blog/2010-05-24-Trees--Pragmatism-and-Formalism/" > Trees; Pragmatism and Formalism < span class = "nicer" > »< / span > < / a >
< / div >
< div class = "next_article" >
< a href = "/Scratch/fr/blog/2010-06-14-multi-language-choices/" > choices related to writing in several languages < span class = "nicer" > »< / span > < / a >
< / div >
< div class = "next_article" >
< a href = "/Scratch/fr/blog/2010-06-15-Get-my-blog-engine/" > Get my blog engine < span class = "nicer" > »< / span > < / a >
< / div >
< / div >
< div class = "flush" > < / div >
< / div >
< / div >
< div id = "bottom" >
< div >
< a rel = "license" href = "http://creativecommons.org/licenses/by-sa/3.0/deed.fr" > Copyright ©, Yann Esposito< / a >
< / div >
< div id = "lastmod" >
Written on: 19/05/2010
modified on: 04/10/2010
< / div >
< div >
Site entirely made with
< a href = "http://www.vim.org" > Vim< / a >
and
< a href = "http://nanoc.stoneship.org" > nanoc< / a >
< / div >
< div >
< a href = "/Scratch/fr/validation/" > Validation< / a >
< a href = "http://validator.w3.org/check?uri=referer" > [xhtml] < / a >
.
< a href = "http://jigsaw.w3.org/css-validator/check/referer?profile=css3" > [css] < / a >
.
< a href = "http://validator.w3.org/feed/check.cgi?url=http%3A//yannesposito.com/Scratch/fr/blog/feed/feed.xml" > [rss]< / a >
< / div >
< / div >
< div class = "clear" > < / div >
< / div >
< script type = "text/javascript" >
var clicky = { log: function(){ return; }, goal: function(){ return; }};
var clicky_site_id = 66374971;
(function() {
var s = document.createElement('script');
s.type = 'text/javascript';
s.async = true;
s.src = ( document.location.protocol == 'https:' ? 'https://static.getclicky.com/js' : 'http://static.getclicky.com/js' );
( document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0] ).appendChild( s );
})();
< / script >
< noscript > < p > < img alt = "Clicky" width = "1" height = "1" src = "http://in.getclicky.com/66374971ns.gif" / > < / p > < / noscript >
< / body >
< / html >