<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="keywords" content="regexp, regular expression" />
<link rel="shortcut icon" type="image/x-icon" href="/Scratch/img/favicon.ico" />
<link rel="stylesheet" type="text/css" href="/Scratch/assets/css/main.css" />
<link rel="stylesheet" type="text/css" href="/Scratch/css/twilight.css" />
<link rel="stylesheet" type="text/css" href="/Scratch/css/idc.css" />
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://feeds.feedburner.com/yannespositocomen"/>
<link rel="alternate" lang="fr" xml:lang="fr" title="Tout sauf quelquechose en expression régulière." type="text/html" hreflang="fr" href="/Scratch/fr/blog/2010-02-16-All-but-something-regexp--2-/" />
<link rel="alternate" lang="en" xml:lang="en" title="Pragmatic Regular Expression Exclude (2)" type="text/html" hreflang="en" href="/Scratch/en/blog/2010-02-16-All-but-something-regexp--2-/" />
<script type="text/javascript" src="/Scratch/js/jquery-1.3.1.min.js"></script>
<script type="text/javascript" src="/Scratch/js/jquery.cookie.js"></script>
<script type="text/javascript" src="/Scratch/js/index.js"></script>
<!--[if lt IE 9]>
<script src="http://ie7-js.googlecode.com/svn/version/2.1(beta4)/IE9.js"></script>
<![endif]-->
<title>Pragmatic Regular Expression Exclude (2)</title>
</head>
<body lang="en" class="article">
<script type="text/javascript">// <![CDATA[
document.write('<div id="blackpage"><img src="/Scratch/img/loading.gif" alt="loading..."/></div>');
// ]]>
</script>
<div id="content">
<div id="choix">
<div class="return"><a href="#entete">&darr; Menu &darr;</a></div>
<div id="choixlang">
<a href="/Scratch/fr/blog/2010-02-16-All-but-something-regexp--2-/" onclick="setLanguage('fr')">en Français</a>
</div>
<div class="flush"></div>
</div>
<div id="titre">
<h1>
Pragmatic Regular Expression Exclude (2)
</h1>
</div>
<div class="flush"></div>
<div class="flush"></div>
<div id="afterheader">
<div class="corps">
<p>In my <a href="/Scratch/en/blog/2010-02-15-All-but-something-regexp/">previous post</a> I gave some tricks to match everything except something. In the same spirit, here is a trick to match the smallest possible string. Say you want to match the shortest strings between <code>a</code> and <code>b</code>. For example, you want to match:</p>
<pre class="twilight">
a.....<span class="Constant"><strong>a......b</strong></span>..b..a....<span class="Constant"><strong>a....b</strong></span>...
</pre>
<p>Here are two common errors and a solution:</p>
<pre class="twilight">
/a.*b/
<span class="Constant"><strong>a.....a......b..b..a....a....b</strong></span>...
</pre>
<p>The first error is to use the <em>evil</em> <code>.*</code>. Because it is greedy, you will match from the first <code>a</code> to the last <code>b</code>.</p>
<pre class="twilight">
/a.*?b/
<span class="Constant"><strong>a.....a......b</strong></span>..b..<span class="Constant"><strong>a....a....b</strong></span>...
</pre>
<p>The next natural step is to change the <em>greediness</em>. But it is not enough, as you will match from the first <code>a</code> to the first <code>b</code> that follows it, even if there is another <code>a</code> in between.
A simple observation is that the string we want to match shouldn't contain any <code>a</code> or <code>b</code>, which leads to the last, elegant solution.</p>
<pre class="twilight">
/a[^ab]*b/
a.....<span class="Constant"><strong>a......b</strong></span>..b..a....<span class="Constant"><strong>a....b</strong></span>...
</pre>
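<p>As a quick check, the three patterns above can be compared on the sample string. This is a minimal sketch in Python; the language and the variable names are my own choice for illustration and are not part of the original example.</p>

```python
import re

# Sample text from the post: only "a......b" and "a....b" are wanted.
text = "a.....a......b..b..a....a....b..."

# Greedy: runs from the first "a" to the last "b".
greedy = re.findall(r"a.*b", text)

# Lazy: stops at the first "b", but still starts at the leftmost "a".
lazy = re.findall(r"a.*?b", text)

# Excluding both delimiters inside the match gives the shortest spans.
tight = re.findall(r"a[^ab]*b", text)

print(greedy)  # ['a.....a......b..b..a....a....b']
print(lazy)    # ['a.....a......b', 'a....a....b']
print(tight)   # ['a......b', 'a....b']
```

<p>Only the last pattern returns the two short spans highlighted in the first example.</p>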
<p>Until now, that was easy.
Now consider the case where you need to match not between <code>a</code> and <code>b</code>, but between strings.
For example:</p>
<div><pre class="twilight">
&lt;li&gt;...&lt;/li&gt;
</pre></div>
<p>This is a bit more difficult. You need to match:</p>
<div><pre class="twilight">
&lt;li&gt;[anything not containing &lt;li&gt;]&lt;/li&gt;
</pre></div>
<p>The first method would be to use the same reasoning as in my <a href="/Scratch/en/blog/2010-02-15-All-but-something-regexp/">previous post</a>. Here is a first try:</p>
<div><pre class="twilight">
&lt;li&gt;([^&lt;]|&lt;[^l]|&lt;l[^i]|&lt;li[^&gt;])*&lt;/li&gt;
</pre></div>
<p>But what about the following string: </p>
<div><pre class="twilight">
&lt;li&gt;...&lt;li&lt;/li&gt;
</pre></div>
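<p>That string should match, since nothing between the opening <code>&lt;li&gt;</code> and the closing <code>&lt;/li&gt;</code> contains <code>&lt;li&gt;</code>. But the first try rejects it: the alternative <code>&lt;li[^&gt;]</code> can only consume the trailing <code>&lt;li</code> together with the <code>&lt;</code> of the closing tag. This is why, if we really want to match it correctly<sup><a href="#note1">1</a></sup>, we need to add:</p>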
<div><pre class="twilight">
&lt;li&gt;([^&lt;]|&lt;[^l]|&lt;l[^i]|&lt;li[^&gt;])*(|&lt;|&lt;l|&lt;li)&lt;/li&gt;
</pre></div>
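<p>To see the difference between the two expressions, here is a small Python sketch. The names <code>FIRST_TRY</code>, <code>FIXED</code>, <code>tricky</code> and <code>nested</code> are hypothetical, chosen only for this illustration, and the second test string is my own addition.</p>

```python
import re

# First try: each alternative consumes a piece that cannot complete "<li>".
FIRST_TRY = re.compile(r"<li>([^<]|<[^l]|<l[^i]|<li[^>])*</li>")

# Fixed version: also accept a partial "<", "<l" or "<li" just before "</li>".
FIXED = re.compile(r"<li>([^<]|<[^l]|<l[^i]|<li[^>])*(|<|<l|<li)</li>")

# The content "...<li" contains no "<li>", so this string should match.
tricky = "<li>...<li</li>"

print(FIRST_TRY.search(tricky))               # None
print(FIXED.fullmatch(tricky) is not None)    # True

# With a nested opening tag, only the innermost element is matched.
nested = "<li>a<li>b</li>"
print(FIXED.search(nested).group(0))          # <li>b</li>
```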
<p>Yes, it is a bit complicated. And what if the string I wanted to match was even longer?</p>
<p>Here is an algorithmic way to handle this easily. You reduce the problem to the first one, matching between single characters:</p>
<div><pre class="twilight">
<span class="Comment"><span class="Comment">#</span> transform a randomly chosen single character</span>
<span class="Comment"><span class="Comment">#</span> into a unique ID </span>
<span class="Comment"><span class="Comment">#</span> (you should verify the identifier is REALLY unique)</span>
<span class="Comment"><span class="Comment">#</span> beware: the unique ID must not contain the </span>
<span class="Comment"><span class="Comment">#</span> chosen character</span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>X</span><span class="StringRegexp"><span class="StringRegexp">/</span>_was_x_<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>Y</span><span class="StringRegexp"><span class="StringRegexp">/</span>_was_y_<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>
<span class="Comment"><span class="Comment">#</span> transform each long string into its single character</span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>&lt;li&gt;</span><span class="StringRegexp"><span class="StringRegexp">/</span>X<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>&lt;<span class="StringRegexpSpecial">\/</span>li&gt;</span><span class="StringRegexp"><span class="StringRegexp">/</span>Y<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>
<span class="Comment"><span class="Comment">#</span> use the first method</span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>X([^XY]*)Y</span><span class="StringRegexp"><span class="StringRegexp">/</span><span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>
<span class="Comment"><span class="Comment">#</span> transform the chosen letters back into the strings</span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>X</span><span class="StringRegexp"><span class="StringRegexp">/</span>&lt;li&gt;<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>Y</span><span class="StringRegexp"><span class="StringRegexp">/</span>&lt;<span class="StringRegexpSpecial">\/</span>li&gt;<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>
<span class="Comment"><span class="Comment">#</span> restore the original chosen characters</span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>_was_x_</span><span class="StringRegexp"><span class="StringRegexp">/</span>X<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>_was_y_</span><span class="StringRegexp"><span class="StringRegexp">/</span>Y<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>
</pre></div>
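<p>The same sequence of substitutions can be wrapped in a function. The following is only a sketch in Python under a few assumptions: the function name <code>remove_between</code> is hypothetical, the delimiters themselves are assumed not to contain <code>X</code> or <code>Y</code>, and the character class excludes both placeholders, as in the one-character solution <code>a[^ab]*b</code>. Like the substitutions above, it deletes every innermost span.</p>

```python
import re

def remove_between(text, begin, end):
    """Delete every innermost begin...end span (hypothetical helper)."""
    # The escape IDs must be REALLY unique, so refuse text containing them.
    for marker in ("_was_x_", "_was_y_"):
        if marker in text:
            raise ValueError("escape ID already present: " + marker)
    # 1. Free the two placeholder characters.
    text = text.replace("X", "_was_x_").replace("Y", "_was_y_")
    # 2. Collapse each long delimiter into its single character.
    text = text.replace(begin, "X").replace(end, "Y")
    # 3. Apply the one-character method.
    text = re.sub(r"X[^XY]*Y", "", text)
    # 4. Turn the remaining placeholders back into the delimiters.
    text = text.replace("X", begin).replace("Y", end)
    # 5. Restore the characters escaped in step 1.
    return text.replace("_was_x_", "X").replace("_was_y_", "Y")

print(remove_between("<ul><li>one</li><li>two</li></ul>", "<li>", "</li>"))
# <ul></ul>
print(remove_between("X <li>a<li>bY</li> c</li> Y", "<li>", "</li>"))
# X <li>a c</li> Y
```

<p>The second call shows that literal <code>X</code> and <code>Y</code> characters in the input survive the round trip, and that only the innermost span is removed in a single pass.</p>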
<p>And it works in only 9 lines for any beginning and ending strings. This solution may look less <em>I AM THE GREAT REGEXP M45T3R, URAN00B</em>, but it is more convenient in my humble opinion. Furthermore, using this last solution proves you master regexps, because you know how difficult it is to handle such problems with a single regexp.</p>
<hr />
<p><small><a name="note1"><sup>1</sup></a> I know I used an HTML syntax example, but in my real-life usage I needed to match between <code>en:</code> and <code>::</code>, and sometimes the string could end with <code>e::</code>.</small></p>
</div>
<div id="choixrss">
<a id="rss" href="http://feeds.feedburner.com/yannespositocomen">
Subscribe
</a>
</div>
<script type="text/javascript">
$(document).ready(function(){
$('#comment').hide();
$('#clickcomment').click(showComments);
});
function showComments() {
$('#comment').show();
$('#clickcomment').fadeOut();
}
document.write('<div id="clickcomment">Comments</div>');
</script>
<div class="flush"></div>
<div class="corps" id="comment">
<h2 class="first">comments</h2>
<noscript>
You must enable javascript to comment.
</noscript>
<script type="text/javascript">
var idcomments_acct = 'a307f0044511ff1b5cfca573fc0a52e7';
var idcomments_post_id = '/Scratch/en/blog/2010-02-16-All-but-something-regexp--2-/';
var idcomments_post_url = 'http://yannesposito.com/Scratch/en/blog/2010-02-16-All-but-something-regexp--2-/';
</script>
<span id="IDCommentsPostTitle" style="display:none"></span>
<script type='text/javascript' src='/Scratch/js/genericCommentWrapperV2.js'></script>
</div>
<div id="entete" class="corps_spaced">
<div id="liens">
<ul><li><a href="/Scratch/en/">Home</a></li>
<li><a href="/Scratch/en/blog/">Blog</a></li>
<li><a href="/Scratch/en/softwares/">Softwares</a></li>
<li><a href="/Scratch/en/about/">About</a></li></ul>
</div>
<div class="flush"></div>
<hr/>
<div id="next_before_articles">
<div id="previous_articles">
previous entries
<div class="previous_article">
<a href="/Scratch/en/blog/2010-02-15-All-but-something-regexp/"><span class="nicer">«</span>&nbsp;Pragmatic Regular Expression Exclude</a>
</div>
<div class="previous_article">
<a href="/Scratch/en/blog/2010-01-12-antialias-font-in-Firefox-under-Ubuntu/"><span class="nicer">«</span>&nbsp;antialias font in Firefox under Ubuntu</a>
</div>
<div class="previous_article">
<a href="/Scratch/en/blog/2010-01-04-Change-default-shell-on-Mac-OS-X/"><span class="nicer">«</span>&nbsp;Change default shell on Mac OS X</a>
</div>
</div>
<div id="next_articles">
next entries
<div class="next_article">
<a href="/Scratch/en/blog/2010-02-18-split-a-file-by-keyword/">split a file by keyword&nbsp;<span class="nicer">»</span></a>
</div>
<div class="next_article">
<a href="/Scratch/en/blog/2010-02-23-When-regexp-is-not-the-best-solution/">When regexp is not the best solution&nbsp;<span class="nicer">»</span></a>
</div>
<div class="next_article">
<a href="/Scratch/en/blog/2010-03-22-Git-Tips/">Git Tips&nbsp;<span class="nicer">»</span></a>
</div>
</div>
<div class="flush"></div>
</div>
</div>
<div id="bottom">
<div>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">Copyright ©, Yann Esposito</a>
</div>
<div id="lastmod">
Created: 02/16/2010
Modified: 04/20/2011
</div>
<div>
Entirely done with
<a href="http://www.vim.org">Vim</a>
and
<a href="http://nanoc.stoneship.org">nanoc</a>
</div>
<div>
<a href="/Scratch/en/validation/">Validation</a>
<a href="http://validator.w3.org/check?uri=referer"> [xhtml] </a>
.
<a href="http://jigsaw.w3.org/css-validator/check/referer?profile=css3"> [css] </a>
.
<a href="http://validator.w3.org/feed/check.cgi?url=http%3A//yannesposito.com/Scratch/en/blog/feed/feed.xml">[rss]</a>
</div>
</div>
<div class="clear"></div>
</div>
</body>
</html>