http://yannesposito.com/ Yogsototh's last blogs entries 2010-02-18T13:29:14Z Yann Esposito http://yannesposito.com tag:yannesposito.com,2010-02-18:/Scratch/en/blog/2010-02-18-split-a-file-by-keyword/ split a file by keyword 2010-02-18T13:29:14Z 2010-02-18T13:29:14Z
<p>Strangely enough, I didn&rsquo;t find any built-in tool to split a file by keyword, so I made one myself in <code>awk</code>. I put it here mostly for myself, but it could also help someone else. The following code starts a new output file at each line containing the word <code>UTC</code>.</p>
<pre class="twilight">
<span class="Comment"><span class="Comment">#</span>!/usr/bin/awk -f</span>
<span class="Comment"><span class="Comment">#</span> lines before the first match go to fic.000</span>
<span class="Entity">BEGIN</span>{i=0; FIC=<span class="String"><span class="String">&quot;</span>fic.000<span class="String">&quot;</span></span>;}
<span class="StringRegexp"><span class="StringRegexp">/</span>UTC<span class="StringRegexp">/</span></span> {
    i+=1;
    FIC=<span class="SupportFunction">sprintf</span>(<span class="String"><span class="String">&quot;</span>fic.%03d<span class="String">&quot;</span></span>,i);
}
{<span class="SupportFunction">print</span> <span class="Variable"><span class="Variable">$</span>0</span>&gt;&gt;FIC}
</pre>
<p>In my real-world case, I wanted one file per day, and each line containing <code>UTC</code> had the following format:</p>
<pre class="twilight">
Mon Dec 7 10:32:30 UTC 2009
</pre>
<p>I ended up with the following code:</p>
<pre class="twilight">
<span class="Comment"><span class="Comment">#</span>!/usr/bin/awk -f</span>
<span class="Comment"><span class="Comment">#</span> lines before the first match go to fic.000</span>
<span class="Entity">BEGIN</span>{i=0; FIC=<span class="String"><span class="String">&quot;</span>fic.000<span class="String">&quot;</span></span>;}
<span class="StringRegexp"><span class="StringRegexp">/</span>UTC<span class="StringRegexp">/</span></span> {
    date=<span class="Variable"><span class="Variable">$</span>1</span><span class="Variable"><span class="Variable">$</span>2</span><span class="Variable"><span class="Variable">$</span>3</span>;
    <span class="Keyword">if</span> ( date != olddate ) {
        olddate=date;
        i+=1;
        FIC=<span class="SupportFunction">sprintf</span>(<span class="String"><span class="String">&quot;</span>fic.%03d<span class="String">&quot;</span></span>,i);
    }
}
{<span class="SupportFunction">print</span> <span class="Variable"><span class="Variable">$</span>0</span>&gt;&gt;FIC}
</pre>
tag:yannesposito.com,2010-02-16:/Scratch/en/blog/2010-02-16-All-but-something-regexp--2-/ All but something regexp (2) 2010-02-16T08:33:21Z 2010-02-16T08:33:21Z
<p>In my <a href="previouspost">previous post</a> I gave some tricks to match everything except something. In the same spirit, here is a trick to match the smallest possible string. Say you want to match the string between &lsquo;a&rsquo; and &lsquo;b&rsquo;; for example, you want to match:</p>
<pre class="twilight">
a.....<span class="Constant"><strong>a......b</strong></span>..b..a....<span class="Constant"><strong>a....b</strong></span>...
</pre>
<p>Here are two common errors and a solution:</p>
<pre class="twilight">
/a.*b/
<span class="Constant"><strong>a.....a......b..b..a....a....b</strong></span>...
</pre>
<pre class="twilight">
/a.*?b/
<span class="Constant"><strong>a.....a......b</strong></span>..b..<span class="Constant"><strong>a....a....b</strong></span>...
</pre>
<pre class="twilight">
/a[^ab]*b/
a.....<span class="Constant"><strong>a......b</strong></span>..b..a....<span class="Constant"><strong>a....b</strong></span>...
</pre>
<p>The first error is to use the <em>evil</em> <code>.*</code>, because you will match from the first <code>a</code> to the last <code>b</code>. The next natural idea is to change the <em>greediness</em>. But that is not enough, as you will match from the first <code>a</code> to the first <code>b</code>. A simple observation is that the string between them shouldn&rsquo;t contain any <code>a</code> or <code>b</code>, which leads to the last, elegant solution.</p>
<p>So far, that was easy. Now, how do you manage when, instead of <code>a</code>, you have a string?</p>
<p>Say you want to match:</p>
<pre class="twilight">
&lt;li&gt;...&lt;/li&gt;
</pre>
<p>This is a bit more difficult.
You need to match</p>
<pre class="twilight">
&lt;li&gt;[anything not containing &lt;li&gt;]&lt;/li&gt;
</pre>
<p>The first method would be to use the same reasoning as in my <a href="previouspost">previous post</a>. Here is a first try:</p>
<pre class="twilight">
&lt;li&gt;([^&lt;]|&lt;[^l]|&lt;l[^i]|&lt;li[^&gt;])*&lt;/li&gt;
</pre>
<p>But what about the following string:</p>
<pre class="twilight">
&lt;li&gt;...&lt;li&lt;/li&gt;
</pre>
<p>That string is not matched, although it should be. This is why, if we really want to match correctly<sup><a href="#note1">&dagger;</a></sup>, we need to add:</p>
<pre class="twilight">
&lt;li&gt;([^&lt;]|&lt;[^l]|&lt;l[^i]|&lt;li[^&gt;])*(|&lt;|&lt;l|&lt;li)&lt;/li&gt;
</pre>
<p>Yes, it is a bit complicated. But what if the string I wanted to match was even longer?</p>
<p>Here is an algorithmic way to handle this easily. You reduce the problem to the first case, matching between single characters:</p>
<pre class="twilight">
<span class="Comment"><span class="Comment">#</span> transform a simple, randomly chosen character</span>
<span class="Comment"><span class="Comment">#</span> into a unique ID</span>
<span class="Comment"><span class="Comment">#</span> (you should verify the identifier is REALLY unique)</span>
<span class="Comment"><span class="Comment">#</span> beware: the unique ID must not contain the</span>
<span class="Comment"><span class="Comment">#</span> chosen character</span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>X</span><span class="StringRegexp"><span class="StringRegexp">/</span>_was_x_<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>Y</span><span class="StringRegexp"><span class="StringRegexp">/</span>_was_y_<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>

<span class="Comment"><span class="Comment">#</span> transform the long string into this simple character</span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>&lt;li&gt;</span><span class="StringRegexp"><span class="StringRegexp">/</span>X<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>&lt;<span class="StringRegexpSpecial">\/</span>li&gt;</span><span class="StringRegexp"><span class="StringRegexp">/</span>Y<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>

<span class="Comment"><span class="Comment">#</span> use the first method</span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>X([^XY]*)Y</span><span class="StringRegexp"><span class="StringRegexp">/</span><span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>

<span class="Comment"><span class="Comment">#</span> transform the chosen letters back into the strings</span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>X</span><span class="StringRegexp"><span class="StringRegexp">/</span>&lt;li&gt;<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>Y</span><span class="StringRegexp"><span class="StringRegexp">/</span>&lt;<span class="StringRegexpSpecial">\/</span>li&gt;<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>

<span class="Comment"><span class="Comment">#</span> transform the unique IDs back into the chosen characters</span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>_was_x_</span><span class="StringRegexp"><span class="StringRegexp">/</span>X<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>
<span class="StringRegexp"><span class="StringRegexp"><span class="SupportFunction">s</span>/</span>_was_y_</span><span class="StringRegexp"><span class="StringRegexp">/</span>Y<span class="StringRegexp">/</span></span><span class="StringRegexp"><span class="StringRegexp"><span class="Keyword">g</span></span></span>
</pre>
<p>And it works in only 9 lines for any beginning and ending string. This solution may look less <em>I AM THE GREAT REGEXP M45T3R, URAN00B</em>, but it is more convenient in my humble opinion. Furthermore, using this last solution proves you have mastered regexps, because you know how difficult it is to handle such problems with a single regexp.</p>
<hr />
<p><small><a name="note1"><sup>&dagger;</sup></a> I know I used an HTML syntax example, but in my real-life usage, I needed to match between <code>and</code>. And sometimes the string could finish with <code>e::</code>.</small></p>
tag:yannesposito.com,2010-02-15:/Scratch/en/blog/2010-02-15-All-but-something-regexp/ All but something regexp 2010-02-15T09:16:12Z 2010-02-15T09:16:12Z
<p>Sometimes you cannot simply write:</p>
<pre class="twilight">
<span class="Keyword">if</span> str.<span class="Entity">match</span>(regexp) <span class="Keyword">and</span>
    <span class="Keyword">not</span> str.<span class="Entity">match</span>(other_regexp)
    do_something
<span class="Keyword">end</span>
</pre>
<p>and you have to get this behaviour with only one regular expression. In theory this is always possible, because the complement of a regular language is itself regular; in practice the resulting expression can be very hard to write.</p>
<p>But for some simple regular expressions it is manageable<sup><a href="#note1">&dagger;</a></sup>. Say you want to match everything containing some word, say <code>bull</code>, but don&rsquo;t want to match <code>bullshit</code>.
Here is a nice way to do that:</p>
<pre class="twilight">
<span class="Comment"><span class="Comment">#</span> match every string containing 'bull' ('bullshit' included)</span>
<span class="StringRegexp"><span class="StringRegexp">/</span></span><span class="StringRegexp">bull</span><span class="StringRegexp"><span class="StringRegexp">/</span></span>

<span class="Comment"><span class="Comment">#</span> match every string containing 'bull' except 'bullshit'</span>
<span class="StringRegexp"><span class="StringRegexp">/</span></span><span class="StringRegexp">bull<span class="StringRegexp"><span class="StringRegexp">(</span><span class="StringRegexp"><span class="StringRegexp">[</span>^s<span class="StringRegexp">]</span></span>|$<span class="StringRegexp">)</span></span>|</span>
<span class="StringRegexp">bulls<span class="StringRegexp"><span class="StringRegexp">(</span><span class="StringRegexp"><span class="StringRegexp">[</span>^h<span class="StringRegexp">]</span></span>|$<span class="StringRegexp">)</span></span>|</span>
<span class="StringRegexp">bullsh<span class="StringRegexp"><span class="StringRegexp">(</span><span class="StringRegexp"><span class="StringRegexp">[</span>^i<span class="StringRegexp">]</span></span>|$<span class="StringRegexp">)</span></span>|</span>
<span class="StringRegexp">bullshi<span class="StringRegexp"><span class="StringRegexp">(</span><span class="StringRegexp"><span class="StringRegexp">[</span>^t<span class="StringRegexp">]</span></span>|$<span class="StringRegexp">)</span></span></span><span class="StringRegexp"><span class="StringRegexp">/</span></span>

<span class="Comment"><span class="Comment">#</span> another way to write it would be</span>
<span class="StringRegexp"><span class="StringRegexp">/</span></span><span class="StringRegexp">bull<span class="StringRegexp"><span class="StringRegexp">(</span><span class="StringRegexp"><span class="StringRegexp">[</span>^s<span class="StringRegexp">]</span></span>|$|s<span class="StringRegexp"><span class="StringRegexp">(</span><span class="StringRegexp"><span class="StringRegexp">[</span>^h<span class="StringRegexp">]</span></span>|$<span class="StringRegexp">)</span></span>|sh<span class="StringRegexp"><span class="StringRegexp">(</span><span class="StringRegexp"><span class="StringRegexp">[</span>^i<span class="StringRegexp">]</span></span>|$<span class="StringRegexp">)</span></span>|shi<span class="StringRegexp"><span class="StringRegexp">(</span><span class="StringRegexp"><span class="StringRegexp">[</span>^t<span class="StringRegexp">]</span></span>|$<span class="StringRegexp">)</span></span><span class="StringRegexp">)</span></span></span><span class="StringRegexp"><span class="StringRegexp">/</span></span>
</pre>
<p>Let&rsquo;s look closer. In the first line the expression is <code>bull([^s]|$)</code>. Why is the <code>$</code> needed? Because without it, a string ending with <code>bull</code> would no longer be matched. This expression means:</p>
<blockquote>
<p>The string ends with <code>bull</code> <br /> or <br /> contains <code>bull</code> followed by a character other than <code>s</code>.</p>
</blockquote>
<p>And that is it. I hope it helps.</p>
<p>Note that this method is not always the best.
For example, try to write a regular expression equivalent to the following conditional expression:</p>
<pre class="twilight">
<span class="Comment"><span class="Comment">#</span> Begin with 'a': ^a</span>
<span class="Comment"><span class="Comment">#</span> End with 'c': c$</span>
<span class="Comment"><span class="Comment">#</span> Contain 'b': .*b.*</span>
<span class="Comment"><span class="Comment">#</span> But isn't 'axbxc'</span>
<span class="Keyword">if</span> str.<span class="Entity">match</span>(<span class="StringRegexp"><span class="StringRegexp">/</span></span><span class="StringRegexp">^a.*b.*c$</span><span class="StringRegexp"><span class="StringRegexp">/</span></span>) <span class="Keyword">and</span>
    <span class="Keyword">not</span> str.<span class="Entity">match</span>(<span class="StringRegexp"><span class="StringRegexp">/</span></span><span class="StringRegexp">^axbxc$</span><span class="StringRegexp"><span class="StringRegexp">/</span></span>)
    do_something
<span class="Keyword">end</span>
</pre>
<p>A nice solution is:</p>
<pre class="twilight">
<span class="StringRegexp"><span class="StringRegexp">/</span></span><span class="StringRegexp">^(abc| <span class="Comment"><span class="Comment">#</span> length 3</span></span>
<span class="StringRegexp">a.bc| <span class="Comment"><span class="Comment">#</span> length 4</span></span>
<span class="StringRegexp">ab.c|</span>
<span class="StringRegexp">a<span class="StringRegexp"><span class="StringRegexp">[</span>^x<span class="StringRegexp">]</span></span>b.c| <span class="Comment"><span class="Comment">#</span> length 5</span></span>
<span class="StringRegexp">a.b<span class="StringRegexp"><span class="StringRegexp">[</span>^x<span class="StringRegexp">]</span></span>c|</span>
<span class="StringRegexp">a...*b.*c| <span class="Comment"><span class="Comment">#</span> length &gt;=5</span></span>
<span class="StringRegexp">a.*b...*c)$</span><span class="StringRegexp"><span class="StringRegexp">/</span></span>x
</pre>
<p>This solution relies on the length of the string that must not be matched.
There certainly exist many other methods. But the important lesson is that it is not straightforward to exclude something from a regular expression.</p>
<hr />
<p><small><a name="note1">&dagger;</a> It can be proved that any regular set minus a finite set is also regular.</small></p>
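<p>As a rough check of the <code>bull</code> example from this post, here is a minimal sketch in Python. Python itself and the names <code>BULL_NOT_BULLSHIT</code> and <code>has_plain_bull</code> are choices made for this illustration, not part of the original post. It compiles the compact form of the expression and tries it on a few strings:</p>

```python
import re

# Compact form of the expression from the post.
# The constant and function names are hypothetical, chosen for this sketch.
BULL_NOT_BULLSHIT = re.compile(r"bull([^s]|$|s([^h]|$)|sh([^i]|$)|shi([^t]|$))")


def has_plain_bull(text):
    """Return True if text holds an occurrence of 'bull' that does not start 'bullshit'."""
    return BULL_NOT_BULLSHIT.search(text) is not None


if __name__ == "__main__":
    samples = ["bull", "a bull ran", "bulls", "bullshi", "bullshit", "bullshit bull"]
    for sample in samples:
        print(repr(sample), has_plain_bull(sample))
```

<p>Note that a string such as <code>bullshit bull</code> is still accepted, because its second <code>bull</code> qualifies on its own. The single expression therefore means &ldquo;contains a <code>bull</code> that is not the start of <code>bullshit</code>&rdquo;, which is slightly different from testing two expressions on the whole string.</p>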