<?xml version="1.0" encoding="utf-8"?>
< !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
< html xmlns = "http://www.w3.org/1999/xhtml" lang = "fr" xml:lang = "fr" >
< head >
< meta http-equiv = "Content-Type" content = "text/html; charset=UTF-8" / >
< meta name = "keywords" content = "regexp, regular expression" / >
< link rel = "shortcut icon" type = "image/x-icon" href = "/Scratch/img/favicon.ico" / >
< link rel = "stylesheet" type = "text/css" href = "/Scratch/assets/css/main.css" / >
< link rel = "stylesheet" type = "text/css" href = "/Scratch/css/twilight.css" / >
< link rel = "stylesheet" type = "text/css" href = "/Scratch/css/idc.css" / >
< link rel = "alternate" type = "application/rss+xml" title = "RSS" href = "http://feeds.feedburner.com/yannespositocomen" / >
< link rel = "alternate" lang = "fr" xml:lang = "fr" title = "Tout sauf quelquechose en expression régulière." type = "text/html" hreflang = "fr" href = "/Scratch/fr/blog/2010-02-16-All-but-something-regexp--2-/" / >
< link rel = "alternate" lang = "en" xml:lang = "en" title = "Pragmatic Regular Expression Exclude (2)" type = "text/html" hreflang = "en" href = "/Scratch/en/blog/2010-02-16-All-but-something-regexp--2-/" / >
< script type = "text/javascript" src = "/Scratch/js/jquery-1.3.1.min.js" > < / script >
< script type = "text/javascript" src = "/Scratch/js/jquery.cookie.js" > < / script >
< script type = "text/javascript" src = "/Scratch/js/index.js" > < / script >
< title > Pragmatic Regular Expression Exclude (2)< / title >
< / head >
< body lang = "en" >
< script type = "text/javascript" > // <![CDATA[
document.write('< div id = "blackpage" > < img src = "/Scratch/img/loading.gif" alt = "loading..." / > < / div > ');
// ]]>
< / script >
< div id = "content" >
< div id = "choix" >
< div class = "return" > < a href = "#entete" > ↓ Menu ↓ < / a > < / div >
< div id = "choixlang" >
< a href = "/Scratch/fr/blog/2010-02-16-All-but-something-regexp--2-/" onclick = "setLanguage('fr')" > en Français< / a >
< / div >
< / div >
< img src = "/Scratch/img/presentation.png" alt = "Presentation drawing" / >
< div id = "titre" >
< h1 >
Pragmatic Regular Expression Exclude (2)
< / h1 >
< / div >
< div class = "flush" > < / div >
< div class = "flush" > < / div >
< div id = "afterheader" >
< div class = "corps" >
< p > In my < a href = "previouspost" > previous post< / a > I gave some tricks to match everything except something. In the same spirit, here is a trick to match the smallest possible string. Say you want to match the shortest string between ‘a’ and ‘b’; for example, you want to match:< / p >
< pre class = "twilight" >
a.....< span class = "Constant" > < strong > a......b< / strong > < / span > ..b..a....< span class = "Constant" > < strong > a....b< / strong > < / span > ...
< / pre >
< p > Here are two common errors and a solution:< / p >
< pre class = "twilight" >
/a.*b/
< span class = "Constant" > < strong > a.....a......b..b..a....a....b< / strong > < / span > ...
< / pre >
< p > The first error is to use the < em > evil< / em > < code > .*< / code > : because it is greedy, you will match from the first < code > a< / code > to the last < code > b< / code > .< / p >
< pre class = "twilight" >
/a.*?b/
< span class = "Constant" > < strong > a.....a......b< / strong > < / span > ..b..< span class = "Constant" > < strong > a....a....b< / strong > < / span > ...
< / pre >
< p > The next natural step is to change the < em > greediness< / em > . But that is not enough, as you will match from the first < code > a< / code > to the first < code > b< / code > .
A simple observation is that the string we want to match should contain neither < code > a< / code > nor < code > b< / code > , which leads to the last, elegant solution.< / p >
< pre class = "twilight" >
/a[^ab]*b/
a.....< span class = "Constant" > < strong > a......b< / strong > < / span > ..b..a....< span class = "Constant" > < strong > a....b< / strong > < / span > ...
< / pre >
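< p > The three patterns above can be compared with a short script. This is only a sketch in Python (the post itself does not name a language); the variable names are illustrative, and < code > re.findall< / code > is used to list every match on the sample string:< / p >

```python
import re

# The sample string used in the examples above.
text = "a.....a......b..b..a....a....b..."

# Greedy: runs from the first 'a' to the last 'b'.
greedy = re.findall(r"a.*b", text)

# Lazy: starts at the first 'a' it finds and stops at the first 'b' after it.
lazy = re.findall(r"a.*?b", text)

# Negated class: the body may contain neither 'a' nor 'b'.
negated = re.findall(r"a[^ab]*b", text)

for name, matches in (("greedy", greedy), ("lazy", lazy), ("negated", negated)):
    print(name, matches)
```

< p > Only the last pattern returns the two short spans that were wanted.< / p >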
< p > Until now, that was easy.
Now consider the case where you need to match not between < code > a< / code > and < code > b< / code > , but between strings.
For example:< / p >
< div > < pre class = "twilight" >
&lt;li&gt;...&lt;/li&gt;
< / pre > < / div >
< p > This is a bit more difficult. You need to match:< / p >
< div > < pre class = "twilight" >
&lt;li&gt;[anything not containing &lt;li&gt;]&lt;/li&gt;
< / pre > < / div >
< p > The first method would be to use the same reasoning as in my < a href = "previouspost" > previous post< / a > . Here is a first try:< / p >
< div > < pre class = "twilight" >
&lt;li&gt;([^&lt;]|&lt;[^l]|&lt;l[^i]|&lt;li[^&gt;])*&lt;/li&gt;
< / pre > < / div >
< p > But what about the following string: < / p >
< div > < pre class = "twilight" >
&lt;li&gt;...&lt;li&lt;/li&gt;
< / pre > < / div >
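< p > That string will not match, although it should: it contains no inner < code > &lt;li&gt;< / code > , only the prefix < code > &lt;li< / code > right before the closing tag. This is why, if we really want to match it correctly< sup > < a href = "#note1" > † < / a > < / sup > , we need to add an optional tail:< / p >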
< div > < pre class = "twilight" >
&lt;li&gt;([^&lt;]|&lt;[^l]|&lt;l[^i]|&lt;li[^&gt;])*(|&lt;|&lt;l|&lt;li)&lt;/li&gt;
< / pre > < / div >
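< p > To see the difference between the two patterns, here is a small check, again sketched in Python with illustrative names (< code > FIRST_TRY< / code > and < code > WITH_TAIL< / code > are not from the original post). It runs both regexps on the problematic string and on an ordinary list:< / p >

```python
import re

# First try: the body is built from pieces that can never spell "<li>".
FIRST_TRY = re.compile(r"<li>([^<]|<[^l]|<l[^i]|<li[^>])*</li>")

# Second version: the body may also end with a prefix of "<li>".
WITH_TAIL = re.compile(r"<li>([^<]|<[^l]|<l[^i]|<li[^>])*(|<|<l|<li)</li>")

sample = "<li>...<li</li>"
first_try_match = FIRST_TRY.search(sample)  # the "<li<" piece swallows the closing tag
with_tail_match = WITH_TAIL.search(sample)  # the optional tail absorbs "<li"

# finditer + group(0) gives whole matches even though the patterns have groups.
items = [m.group(0) for m in WITH_TAIL.finditer("<li>one</li><li>two</li>")]

print(first_try_match, with_tail_match.group(0), items)
```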
< p > Yes, it is a bit complicated. And what if the string I wanted to match were even longer?< / p >
< p > Here is an algorithmic way to handle this easily: reduce the problem to the first one, where the delimiters are single letters:< / p >
< div > < pre class = "twilight" >
< span class = "Comment" > < span class = "Comment" > #< / span > replace a randomly chosen single character< / span >
< span class = "Comment" > < span class = "Comment" > #< / span > with a unique ID < / span >
< span class = "Comment" > < span class = "Comment" > #< / span > (you should verify the identifier is REALLY unique)< / span >
< span class = "Comment" > < span class = "Comment" > #< / span > beware: the unique ID must not contain the < / span >
< span class = "Comment" > < span class = "Comment" > #< / span > chosen character< / span >
< span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "SupportFunction" > s< / span > /< / span > X< / span > < span class = "StringRegexp" > < span class = "StringRegexp" > /< / span > _was_x_< span class = "StringRegexp" > /< / span > < / span > < span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "Keyword" > g< / span > < / span > < / span >
< span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "SupportFunction" > s< / span > /< / span > Y< / span > < span class = "StringRegexp" > < span class = "StringRegexp" > /< / span > _was_y_< span class = "StringRegexp" > /< / span > < / span > < span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "Keyword" > g< / span > < / span > < / span >
< span class = "Comment" > < span class = "Comment" > #< / span > turn the long strings into these single characters< / span >
< span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "SupportFunction" > s< / span > /< / span > &lt;li&gt;< / span > < span class = "StringRegexp" > < span class = "StringRegexp" > /< / span > X< span class = "StringRegexp" > /< / span > < / span > < span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "Keyword" > g< / span > < / span > < / span >
< span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "SupportFunction" > s< / span > /< / span > &lt;< span class = "StringRegexpSpecial" > \/< / span > li&gt;< / span > < span class = "StringRegexp" > < span class = "StringRegexp" > /< / span > Y< span class = "StringRegexp" > /< / span > < / span > < span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "Keyword" > g< / span > < / span > < / span >
< span class = "Comment" > < span class = "Comment" > #< / span > use the first method< / span >
< span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "SupportFunction" > s< / span > /< / span > X([^X]*)Y< / span > < span class = "StringRegexp" > < span class = "StringRegexp" > /< / span > < span class = "StringRegexp" > /< / span > < / span > < span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "Keyword" > g< / span > < / span > < / span >
< span class = "Comment" > < span class = "Comment" > #< / span > turn the chosen characters back into the strings< / span >
< span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "SupportFunction" > s< / span > /< / span > X< / span > < span class = "StringRegexp" > < span class = "StringRegexp" > /< / span > &lt;li&gt;< span class = "StringRegexp" > /< / span > < / span > < span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "Keyword" > g< / span > < / span > < / span >
< span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "SupportFunction" > s< / span > /< / span > Y< / span > < span class = "StringRegexp" > < span class = "StringRegexp" > /< / span > &lt;< span class = "StringRegexpSpecial" > \/< / span > li&gt;< span class = "StringRegexp" > /< / span > < / span > < span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "Keyword" > g< / span > < / span > < / span >
< span class = "Comment" > < span class = "Comment" > #< / span > restore the original chosen characters< / span >
< span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "SupportFunction" > s< / span > /< / span > _was_x_< / span > < span class = "StringRegexp" > < span class = "StringRegexp" > /< / span > X< span class = "StringRegexp" > /< / span > < / span > < span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "Keyword" > g< / span > < / span > < / span >
< span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "SupportFunction" > s< / span > /< / span > _was_y_< / span > < span class = "StringRegexp" > < span class = "StringRegexp" > /< / span > Y< span class = "StringRegexp" > /< / span > < / span > < span class = "StringRegexp" > < span class = "StringRegexp" > < span class = "Keyword" > g< / span > < / span > < / span >
< / pre > < / div >
< p > And it works in only 9 lines for any beginning and ending strings. This solution may look less < em > I AM THE GREAT REGEXP M45T3R, URAN00B< / em > , but it is more convenient in my humble opinion. Furthermore, using this last solution proves you have mastered regexps, because you know how difficult it is to handle such problems with a single regexp.< / p >
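< p > For readers who prefer a function to a list of substitutions, the nine steps can be sketched as follows. This is a Python translation made under assumptions, not the original script: the name < code > remove_between< / code > and its parameters are hypothetical, the placeholders are assumed to be single characters that do not occur in the delimiters, and the middle step keeps the same pattern as above, < code > X[^X]*Y< / code > .< / p >

```python
import re


def remove_between(text, start, end, x="X", y="Y", x_id="_was_x_", y_id="_was_y_"):
    """Delete every span that runs from `start` to `end` without containing `start`."""
    # Assumption: x and y are single characters that appear in neither delimiter.
    if any(p in start or p in end for p in (x, y)):
        raise ValueError("placeholder characters must not appear in the delimiters")
    # The identifiers must be really unique and must not contain the placeholders.
    for ident in (x_id, y_id):
        if ident in text or x in ident or y in ident:
            raise ValueError(f"identifier {ident!r} is not safe for this text")

    # Protect literal occurrences of the placeholder characters.
    text = text.replace(x, x_id).replace(y, y_id)
    # Turn the long delimiters into the single characters.
    text = text.replace(start, x).replace(end, y)
    # Use the first method, with the same pattern as the substitutions above.
    text = re.sub(f"{re.escape(x)}[^{re.escape(x)}]*{re.escape(y)}", "", text)
    # Turn any remaining placeholders back into the delimiters.
    text = text.replace(x, start).replace(y, end)
    # Restore the original placeholder characters.
    return text.replace(x_id, x).replace(y_id, y)


cleaned = remove_between("keep <li>drop</li> XY <li>drop <li</li> end", "<li>", "</li>")
print(cleaned)
```

< p > Because the body pattern only excludes < code > X< / code > , a span extends to the last end delimiter found before the next start delimiter; using < code > [^XY]*< / code > instead would stop at the first end delimiter.< / p >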
< hr / >
< p > < small > < a name = "note1" > < sup > † < / sup > < / a > I know I used an HTML syntax example, but in my real-life usage I needed to match between < code > en:< / code > and < code > ::< / code > , and sometimes the string could end with < code > e::< / code > .< / small > < / p >
< / div >
< div id = "choixrss" >
< a id = "rss" href = "http://feeds.feedburner.com/yannespositocomen" >
Subscribe
< / a >
< / div >
< script type = "text/javascript" >
$(document).ready(function(){
$('#comment').hide();
$('#clickcomment').click(showComments);
});
function showComments() {
$('#comment').show();
$('#clickcomment').fadeOut();
}
document.write('< div id = "clickcomment" > Comments< / div > ');
< / script >
< div class = "flush" > < / div >
< div class = "corps" id = "comment" >
< h2 class = "first" > comments< / h2 >
< noscript >
You must enable JavaScript to comment.
< / noscript >
< script type = "text/javascript" >
var idcomments_acct = 'a307f0044511ff1b5cfca573fc0a52e7';
var idcomments_post_id = '/Scratch/en/blog/2010-02-16-All-but-something-regexp--2-/';
var idcomments_post_url = 'http://yannesposito.com/Scratch/en/blog/2010-02-16-All-but-something-regexp--2-/';
< / script >
< span id = "IDCommentsPostTitle" style = "display:none" > < / span >
< script type = 'text/javascript' src = '/Scratch/js/genericCommentWrapperV2.js' > < / script >
< / div >
< div id = "entete" class = "corps_spaced" >
< div id = "liens" >
< ul > < li > < a href = "/Scratch/en/" > Homepage< / a > < / li >
< li > < a href = "/Scratch/en/blog/" > Blog< / a > < / li >
< li > < a href = "/Scratch/en/softwares/" > Softwares< / a > < / li >
< li > < a href = "/Scratch/en/about/" > About< / a > < / li > < / ul >
< / div >
< div class = "flush" > < / div >
< hr / >
< div id = "next_before_articles" >
< div id = "previous_articles" >
previous entries
< div class = "previous_article" >
< a href = "/Scratch/en/blog/2010-02-15-All-but-something-regexp/" > < span class = "nicer" > «< / span > Pragmatic Regular Expression Exclude< / a >
< / div >
< div class = "previous_article" >
< a href = "/Scratch/en/blog/2010-01-12-antialias-font-in-Firefox-under-Ubuntu/" > < span class = "nicer" > «< / span > antialias font in Firefox under Ubuntu< / a >
< / div >
< div class = "previous_article" >
< a href = "/Scratch/en/blog/2010-01-04-Change-default-shell-on-Mac-OS-X/" > < span class = "nicer" > «< / span > Change default shell on Mac OS X< / a >
< / div >
< / div >
< div id = "next_articles" >
next entries
< div class = "next_article" >
< a href = "/Scratch/en/blog/2010-02-18-split-a-file-by-keyword/" > split a file by keyword < span class = "nicer" > »< / span > < / a >
< / div >
< div class = "next_article" >
< a href = "/Scratch/en/blog/2010-02-23-When-regexp-is-not-the-best-solution/" > When regexp is not the best solution < span class = "nicer" > »< / span > < / a >
< / div >
< div class = "next_article" >
< a href = "/Scratch/en/blog/2010-03-22-Git-Tips/" > Git Tips < span class = "nicer" > »< / span > < / a >
< / div >
< / div >
< div class = "flush" > < / div >
< / div >
< / div >
< div id = "bottom" >
< div >
< a rel = "license" href = "http://creativecommons.org/licenses/by-sa/3.0/" > Copyright ©, Yann Esposito< / a >
< / div >
< div id = "lastmod" >
Created: 02/16/2010
Modified: 05/09/2010
< / div >
< div >
Entirely done with
< a href = "http://www.vim.org" > Vim< / a >
and
< a href = "http://nanoc.stoneship.org" > nanoc< / a >
< / div >
< div >
< a href = "/Scratch/en/validation/" > Validation< / a >
< a href = "http://validator.w3.org/check?uri=referer" > [xhtml] < / a >
.
< a href = "http://jigsaw.w3.org/css-validator/check/referer?profile=css3" > [css] < / a >
.
< a href = "http://validator.w3.org/feed/check.cgi?url=http%3A//yannesposito.com/Scratch/en/blog/feed/feed.xml" > [rss]< / a >
< / div >
< / div >
< div class = "clear" > < / div >
< / div >
< / body >
< / html >