thedarkness

Anyone here use awk?

I had always considered it just a more complex sed/grep, but I recently had a situation where I HAD to use it, so I started to grok it and man, it is pretty cool. I still know bugger all, but I just used it on an AMD Athlon(tm) XP 1800+ with 256MB RAM to do the following:

Use a regex to find matching lines and pull out the seventh field from said lines.

It processed 6969140 lines and extracted 3291994 results in a minute and a half!

Yes, you read it right, nearly seven million lines and three point three million results!

Imagine what it would do on a real computer!

Man, I'm IMPRESSED!

Well worth the time invested to learn how it works, in my humble opinion.

Cheers,
td

[edit]Forgot to conjugate the verb "to go"[/edit]

perkiset

For those who don't know the tool, AWK is a text processing mini-language that was originally built by three guys whose last names started with A, W and K. It's been around for a long time.

TD - tell us how you used it and give us a primer... I have completely forgotten the tool and since a vast amount of my processing with clients is text based, a refresher would be good. I'm thinking also that AWK may be an EXCELLENT adjunct tool to the scraper and parser... in conjunction with, say, PHP's shell_exec(), which returns the stdout of a process as a string you can assign to a variable in PHP, this could be a match made in heaven. Could be quicker'n shit through a goose if the parsing is heavy...

/p

thedarkness

Exactly perk,

OK, I first got interested when I had this as a problem.....

I had to parse a file which had the following structure;

data1
data2 (possibly missing)
empty line

This looks considerably easier to parse than it actually is. The problems are the possible absence of the second piece of data and the empty lines in between. Either one taken in isolation is not that bad, but together they gave me a headache.

Now awk can either be run like this: "awk -f cmd.awk target_to_parse", or cmd.awk (an arbitrary name) can be made standalone with a shebang line, after which you call it like any other executable.

Anyway, the solution to the above example came out looking like this;


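#!/bin/awk -f
BEGIN {
        FS="\n"         # each line of a record is one field
        RS=""           # records are separated by blank lines
}
{
#       if( $2 != "" )
        if( NF > 1 )    # skip records where data2 is missing
          print $1 "\n" $2
}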


The first line is the shebang and I'm not going into that here.
Next is the BEGIN section, which is like an init() function where you set up variables etc. for the run. FS is a built-in variable which stands for field separator (in this case a line feed), and RS is the record separator (an empty string, which makes awk treat blank lines as the separator). The next set of braces is the main guts of the script; this is the section that gets applied to each "record". In this case I test that there is more than one field and, if there is, print the two fields separated by a newline. This satisfied the requirements for that particular task.
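
For anyone who wants to try the record logic on sample data, here is a minimal sketch that wraps the same idea in a shell function. The function name pairs_only and the sample input are made up for illustration; it assumes FS is a line feed and RS is the empty string, as described above.

```shell
# pairs_only is a hypothetical name; it applies the same test as the script above.
pairs_only() {
    awk 'BEGIN { FS = "\n"; RS = "" }   # fields are lines, records are blank-line separated blocks
         NF > 1 { print $1 "\n" $2 }'   # keep only records that have a second line
}

# Invented sample input: the middle record has no data2 line, so it is dropped.
printf 'a1\na2\n\nb1\n\nc1\nc2\n' | pairs_only
```

The pattern-action form (NF > 1 { ... }) does the same job as the if() inside the braces; which one to use is mostly a matter of taste.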


Next I had a task where the numbers involved were very large, as mentioned in the OP. Once again awk came to the rescue. This time I've changed the regex slightly to protect the guilty ;-)


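#!/bin/awk -f
BEGIN {
        FS="\""         # split each line on double quotes
        RS="\n"         # one line per record (the default)
}
{
#       if( NF == 7 )
        if( /^ +<img name/ )    # one or more leading spaces, then <img name
          print $4
}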



This time we're testing whether a line matches the regex (in this case looking for a line that starts with at least one space and is a named <img>) and printing the fourth field, based on a " as the field separator. The lines that begin with # are commented out BTW, except the shebang of course.
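
A quick way to see the match-and-extract step in action is the sketch below. The function name img_field and the markup fed to it are invented for illustration, and it assumes the anchored pattern ^ +<img name (one or more leading spaces) rather than whatever the real regex was.

```shell
# img_field is a hypothetical name; the markup below is invented sample data.
img_field() {
    # -F sets the field separator to a double quote; $4 is then the
    # second quoted value on the line.
    awk -F'"' '/^ +<img name/ { print $4 }'
}

printf '%s\n' \
    '  <img name="logo" src="/img/logo.png">' \
    '<img name="skip" src="/img/skip.png">' \
    '  <p class="x">no image here</p>' | img_field
```

Only the first sample line both starts with spaces and contains <img name, so only its fourth field is printed.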

This barely scratches the surface of what can be done with awk, but I hope it is enough to whet your collective appetites.

Heaps more info here http://www.gnu.org/software/gawk/manual/html_node/index.html

Cheers,
Brad

perkiset

That's just simply great stuff TD. I don't have an immediate application, but with the amounts of text I process I think there are definitely some things to consider. I wonder about some efficiency things like converting XML into a serialized array for PHP... or even JSON... how strong is the output phase of the language?

Interestingly, awk is used in several examples at the PHP.net site in relation to exec() functions:

http://us.php.net/exec

'twould seem we're not the first to think of this, eh...?

thedarkness

[quote author=perkiset]
I don't have an immediate application, but with the amounts of text I process I think there are definitely some things to consider. I wonder about some efficiency things like converting XML into a serialized array for PHP... or even JSON... how strong is the output phase of the language?
[/quote]


You mean like this? (A great learning experience BTW, thanks perk.)


#!/bin/awk -f
BEGIN {
        RS="\n"         # one input line per record
        ORS=""          # no newline after each print
}
{
        line_array[NR] = $0
}
END {
        # "a" marks an array in PHP's serialize() format; NR is the element count
        print "a:" NR ":{"
        for( i = 0; i < NR; i++ )
        {
                print "i:" i ";s:" length( line_array[i+1] ) ":\"" line_array[i+1] "\";"
        }
        print "}"
}
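
To make the output format concrete, here is a small sketch of the same idea as a shell function. The name php_serialize_lines and the input are invented; it assumes the target is PHP's serialize() format, in which an array starts with a: followed by the element count. It also assumes plain ASCII input, since awk's length() may count characters while PHP expects a byte count.

```shell
# php_serialize_lines is a hypothetical name; it turns each input line into
# one string element of an array in PHP serialize() format.
php_serialize_lines() {
    awk 'BEGIN { ORS = "" }             # suppress the newline after each print
         { line[NR] = $0 }              # collect every line, indexed from 1
         END {
             print "a:" NR ":{"
             for (i = 1; i <= NR; i++)
                 print "i:" (i - 1) ";s:" length(line[i]) ":\"" line[i] "\";"
             print "}\n"
         }'
}

printf 'foo\nhello\n' | php_serialize_lines
# prints: a:2:{i:0;s:3:"foo";i:1;s:5:"hello";}
```

On the PHP side the resulting string would be handed to unserialize(); how it gets there (exec, shell_exec, a file) is left open here.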



[quote author=perkiset]
Interestingly, awk is used in several examples at the PHP.net site in relation to exec() functions:

http://us.php.net/exec

'twould seem we're not the first to think of this, eh...?
[/quote]


Not by a long shot, plenty of people would have known awk long before they learned PHP. The old ARPANET brigade etc. By my recent experience I can see many reasons to use it in that fashion and in many others. Like you, I do a lot of text processing and I wish I'd taken the time to learn awk about...... oh, let's say..... 10-15 years ago....

Looking forward to giving some of this a run in the benchtests coming up.

Cheers,
td

JasonD

AWK is amazing and always has been, but Perk, do you REALLY use exec calls like that?

esrun

Off-topic, did you spend much time at London SEO last night, Jason? I caught you on the way out, but I would have been there earlier if it wasn't for finding parking and then trying to find the damn pub.

perkiset

[quote author=JasonD]
AWK is amazing and always has been but Perk, do you REALLY use exec calls like that ?
[/quote]


I use exec calls when I need to - not that often really, but I do. What do you mean by "Like That" - is there something nasty at the PHP.net site? I never use an exec that a surfer could somehow get to... more often it's a poor man's threading technique. Sheesh JD now you've got me all nervous and shit.

JasonD

Nothing wrong with doing an exec call, but it's bloody ugly code.

I code shittily all the time too !

