scratch/output/Scratch/en/blog/2010-05-19-How-to-cut-HTML-and-repair-it/index.html

290 lines
12 KiB
HTML
Raw Normal View History

2011-04-20 12:29:01 +00:00
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr" xml:lang="fr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="keywords" content="tree, HTML, script, ruby">
<link rel="shortcut icon" type="image/x-icon" href="/Scratch/img/favicon.ico" />
<link rel="stylesheet" type="text/css" href="/Scratch/assets/css/main.css" />
2012-04-02 21:43:39 +00:00
<link rel="stylesheet" type="text/css" href="/Scratch/css/solarized.css" />
<link rel="stylesheet" type="text/css" href="/Scratch/css/idc.css" />
2012-05-02 15:43:56 +00:00
<link href='http://fonts.googleapis.com/css?family=Inconsolata' rel='stylesheet' type='text/css'>
2011-04-20 12:29:01 +00:00
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://feeds.feedburner.com/yannespositocomen"/>
<link rel="alternate" lang="fr" xml:lang="fr" title="Comment réparer un XML coupé ?" type="text/html" hreflang="fr" href="/Scratch/fr/blog/2010-05-19-How-to-cut-HTML-and-repair-it/" />
<link rel="alternate" lang="en" xml:lang="en" title="How to repair a cutted XML?" type="text/html" hreflang="en" href="/Scratch/en/blog/2010-05-19-How-to-cut-HTML-and-repair-it/" />
<script type="text/javascript" src="/Scratch/js/jquery-1.3.1.min.js"></script>
<script type="text/javascript" src="/Scratch/js/jquery.cookie.js"></script>
<script type="text/javascript" src="/Scratch/js/index.js"></script>
2012-05-02 15:43:56 +00:00
<script type="text/javascript" src="/Scratch/js/highlight/highlight.pack.js"></script>
<script type="text/javascript" src="/Scratch/js/article.js"></script>
2011-04-20 12:29:01 +00:00
<!--[if lt IE 9]>
<script src="http://ie7-js.googlecode.com/svn/version/2.1(beta4)/IE9.js"></script>
<![endif]-->
<title>How to repair a cutted XML?</title>
</head>
2011-10-18 22:30:00 +00:00
<body lang="en" class="article">
2011-04-20 12:29:01 +00:00
<script type="text/javascript">// <![CDATA[
document.write('<div id="blackpage"><img src="/Scratch/img/loading.gif" alt="loading..."/></div>');
2011-04-20 12:29:01 +00:00
// ]]>
</script>
<div id="content">
<div id="choix">
<div class="return"><a href="#entete">&darr; Menu &darr;</a></div>
<div id="choixlang">
<a href="/Scratch/fr/blog/2010-05-19-How-to-cut-HTML-and-repair-it/" onclick="setLanguage('fr')">en Français</a>
2011-04-20 12:29:01 +00:00
</div>
2011-09-28 16:05:55 +00:00
<div class="flush"></div>
2011-04-20 12:29:01 +00:00
</div>
<div id="titre">
<h1>
How to repair a cutted XML?
</h1>
<h2>
and how to do it without any parsor?
</h2>
</div>
<div class="flush"></div>
<div class="flush"></div>
<div id="afterheader">
<div class="corps">
<p>For my main page, you can see, a list of my latest blog entry. And you have the first part of each article. To accomplish that, I needed to include the begining of the entry and to cut it somewhere. But now, I had to repair this cutted HTML.</p>
<p>Here is an example:</p>
2012-05-02 15:43:56 +00:00
<pre><code class="html">&lt;div class="corps"&gt;
&lt;div class="intro"&gt;
&lt;p&gt;Introduction&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The first paragraph&lt;/p&gt;
&lt;img src="/img/img.png" alt="an image"/&gt;
&lt;p&gt;Another long paragraph&lt;/p&gt;
&lt;/div&gt;
</code></pre>
2011-04-20 12:29:01 +00:00
<p>After the cut, I obtain:</p>
2012-05-02 15:43:56 +00:00
<pre><code class="html">&lt;div class="corps"&gt;
&lt;div class="intro"&gt;
&lt;p&gt;Introduction&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The first paragraph&lt;/p&gt;
&lt;img src="/img/im
</code></pre>
2011-04-20 12:29:01 +00:00
<p>Argh! In the middle of an <code>&lt;img&gt;</code> tag.</p>
2012-05-03 09:21:34 +00:00
<p>In fact, it is not as difficult as it should sound first. The secret is, you don&rsquo;t need to keep the complete tree structure to repair it, but only the list of not closed parents.</p>
2011-04-20 12:29:01 +00:00
<p>Given with our example, when we are after the first paragraph. we only have to close the <code>div</code> for class <code>corps</code> and the XML is repaired. Of course, when you cut inside a tag, you sould go back, as if you where just before it. Delete this tag and all is ok.</p>
<p>Then, all you have to do, is not remember all the XML tree, but only the heap containing your parents. Suppose we treat the complete first example, the stack will pass through the following state, in order:</p>
2012-05-02 15:43:56 +00:00
<pre><code class="html">[]
[div] &lt;div class="corps"&gt;
[div, div] &lt;div class="intro"&gt;
[div, div, p] &lt;p&gt;
2011-04-20 12:29:01 +00:00
Introduction
2012-05-02 15:43:56 +00:00
[div, div] &lt;/p&gt;
[div] &lt;/div&gt;
[div, p] &lt;p&gt;
2011-04-20 12:29:01 +00:00
The first paragraph
2012-05-02 15:43:56 +00:00
[div] &lt;/p&gt;
[div] &lt;img src="/img/img.png" alt="an image"/&gt;
[div, p] &lt;p&gt;
2011-04-20 12:29:01 +00:00
Another long paragraph
2012-05-02 15:43:56 +00:00
[div] &lt;/p&gt;
[] &lt;/div&gt;
</code></pre>
2011-04-20 12:29:01 +00:00
2012-05-02 15:43:56 +00:00
<p>The algorihm, is then really simple: </p>
<pre><code class="html">let res be the XML as a string ;
2011-04-20 12:29:01 +00:00
read res and each time you encouter a tag:
if it is an opening one:
push it to the stack
else if it is a closing one:
2012-05-02 15:43:56 +00:00
pop the stack.
2011-04-20 12:29:01 +00:00
2012-05-02 15:43:56 +00:00
remove any malformed/cutted tag in the end of res
2011-04-20 12:29:01 +00:00
for each tag in the stack, pop it, and write:
2012-05-02 15:43:56 +00:00
res = res + closed tag
2011-04-20 12:29:01 +00:00
2012-05-02 15:43:56 +00:00
return res
</code></pre>
2011-04-20 12:29:01 +00:00
<p>And <code>res</code> contain the repaired XML.</p>
<p>Finally, this is the code in ruby I use. The <code>xml</code> variable contain the cutted XML.</p>
2012-05-02 15:43:56 +00:00
<div class="codefile"><a href="/Scratch/en/blog/2010-05-19-How-to-cut-HTML-and-repair-it/code/repair_xml.rb">&#x27A5; repair_xml.rb</a></div>
<pre><code class="ruby"># repair cutted XML code by closing the tags
# work even if the XML is cut into a tag.
# example:
# transform '&lt;div&gt; &lt;span&gt; toto &lt;/span&gt; &lt;p&gt; hello &lt;a href="http://tur'
# into '&lt;div&gt; &lt;span&gt; toto &lt;/span&gt; &lt;p&gt; hello &lt;/p&gt;&lt;/div&gt;'
def repair_xml( xml )
parents=[]
depth=0
xml.scan( %r{&lt;(/?)(\w*)[^&gt;]*(/?)&gt;} ).each do |m|
if m[2] == "/"
next
end
if m[0] == ""
parents[depth]=m[1]
depth+=1
else
depth-=1
end
end
res=xml.sub(/&lt;[^&gt;]*$/m,'')
depth-=1
depth.downto(0).each { |x| res&lt;&lt;= %{&lt;/#{parents[x]}&gt;} }
2011-04-20 12:29:01 +00:00
res
2012-05-02 15:43:56 +00:00
end
</code></pre>
2011-04-20 12:29:01 +00:00
2012-05-03 09:21:34 +00:00
<p>I don&rsquo;t know if the code can help you, but the raisonning should definitively be known.</p>
2011-04-20 12:29:01 +00:00
</div>
2012-04-10 13:56:34 +00:00
<div id="social">
<div class="left"> <a href="https://twitter.com/share" class="twitter-share-button" data-via="yogsototh">Tweet</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0];if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src="//platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script>
</div>
<div class="left"> <div class="g-plusone" data-size="medium" data-annotation="inline" data-width="106"></div>
<script type="text/javascript">
(function() {
var po = document.createElement('script'); po.type = 'text/javascript'; po.async = true;
po.src = 'https://apis.google.com/js/plusone.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(po, s);
})();
</script>
</div>
<div class="flush"></div>
</div>
2011-04-20 12:29:01 +00:00
<div id="choixrss">
<a id="rss" href="http://feeds.feedburner.com/yannespositocomen">
Subscribe
</a>
</div>
<script type="text/javascript">
$(document).ready(function(){
$('#comment').hide();
$('#clickcomment').click(showComments);
});
function showComments() {
$('#comment').show();
$('#clickcomment').fadeOut();
}
2012-04-10 13:56:34 +00:00
document.write('<div id="clickcomment">Comments &amp; Share</div>');
2011-04-20 12:29:01 +00:00
</script>
<div class="flush"></div>
2012-04-10 13:56:34 +00:00
2011-04-20 12:29:01 +00:00
<div class="corps" id="comment">
<h2 class="first">comments</h2>
<noscript>
You must enable javascript to comment.
</noscript>
<script type="text/javascript">
var idcomments_acct = 'a307f0044511ff1b5cfca573fc0a52e7';
var idcomments_post_id = '/Scratch/en/blog/2010-05-19-How-to-cut-HTML-and-repair-it/';
var idcomments_post_url = 'http://yannesposito.com/Scratch/en/blog/2010-05-19-How-to-cut-HTML-and-repair-it/';
2011-04-20 12:29:01 +00:00
</script>
<span id="IDCommentsPostTitle" style="display:none"></span>
<script type='text/javascript' src='/Scratch/js/genericCommentWrapperV2.js'></script>
2011-04-20 12:29:01 +00:00
</div>
<div id="entete" class="corps_spaced">
<div id="liens">
<ul><li><a href="/Scratch/en/">Home</a></li>
<li><a href="/Scratch/en/blog/">Blog</a></li>
<li><a href="/Scratch/en/softwares/">Softwares</a></li>
<li><a href="/Scratch/en/about/">About</a></li></ul>
2011-04-20 12:29:01 +00:00
</div>
<div class="flush"></div>
<hr/>
<div id="next_before_articles">
<div id="previous_articles">
previous entries
<div class="previous_article">
<a href="/Scratch/en/blog/2010-05-17-at-least-this-blog-revive/"><span class="nicer">«</span>&nbsp;I live again!</a>
2011-04-20 12:29:01 +00:00
</div>
2012-04-02 21:43:39 +00:00
<div class="previous_article">
<a href="/Scratch/en/blog/2010-03-23-Encapsulate-git/"><span class="nicer">«</span>&nbsp;Encapsulate git</a>
</div>
<div class="previous_article">
<a href="/Scratch/en/blog/2010-03-22-Git-Tips/"><span class="nicer">«</span>&nbsp;Git Tips</a>
</div>
2011-04-20 12:29:01 +00:00
</div>
<div id="next_articles">
next entries
<div class="next_article">
2012-04-02 21:43:39 +00:00
<a href="/Scratch/en/blog/2010-05-24-Trees--Pragmatism-and-Formalism/">Trees; Pragmatism and Formalism&nbsp;<span class="nicer">»</span></a>
2011-04-20 12:29:01 +00:00
</div>
<div class="next_article">
2012-04-02 21:43:39 +00:00
<a href="/Scratch/en/blog/2010-06-14-multi-language-choices/">multi language choices&nbsp;<span class="nicer">»</span></a>
2011-04-20 12:29:01 +00:00
</div>
2012-04-02 21:43:39 +00:00
<div class="next_article">
<a href="/Scratch/en/blog/2010-06-15-Get-my-blog-engine/">Get my blog engine&nbsp;<span class="nicer">»</span></a>
</div>
2011-04-20 12:29:01 +00:00
</div>
<div class="flush"></div>
</div>
</div>
<div id="bottom">
2012-04-02 21:43:39 +00:00
<div>
2012-04-10 13:56:34 +00:00
<a href="https://twitter.com/yogsototh">Follow @yogsototh</a>
2012-04-02 21:43:39 +00:00
</div>
2011-04-20 12:29:01 +00:00
<div>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">Copyright ©, Yann Esposito</a>
</div>
<div id="lastmod">
Created: 05/19/2010
Modified: 10/04/2010
</div>
<div>
Entirely done with
<a href="http://www.vim.org">Vim</a>
and
<a href="http://nanoc.stoneship.org">nanoc</a>
</div>
</div>
<div class="clear"></div>
</div>
</body>
</html>