The Cache: Technology Expert's Forum
 
Author Topic: AWK just blew me away  (Read 1835 times)
thedarkness
« on: May 28, 2007, 09:50:31 PM »

Anyone here use awk?

I had always considered it just a more complex sed/grep, but I recently had a situation where I HAD to use it, so I started to grok it and man, it is pretty cool. I still know bugger all, but I just used it on an AMD Athlon(tm) XP 1800+ with 256MB RAM to do the following:

Use a regex to find matching lines and pull out the seventh field from said lines.

It processed 6969140 lines and extracted 3291994 results in a minute and a half!

Yes, you read it right, nearly seven million lines and three point three million results!

Imagine what it would do on a real computer!

Man, I'm IMPRESSED!

Well worth the time invested to learn how it works in my humble opinion.

Cheers,
td

[edit]Forgot to conjugate the verb "to go"[/edit]

"I want to be the guy my dog thinks I am."
 - Unknown
perkiset
« Reply #1 on: May 29, 2007, 03:35:59 PM »

For those who don't know the tool, AWK is a text processing mini-language that was originally built by three guys whose last names started with A, W and K (Aho, Weinberger and Kernighan). It's been around for a long time.

TD - tell us how you used it and give us a primer... I have completely forgotten the tool, and since a vast amount of my processing with clients is text based, a refresher would be good. I'm also thinking that AWK may be an EXCELLENT adjunct tool to the scraper and parser... in conjunction with, say, PHP's shell_exec(), which returns the stdout of a process as a string, this could be a match made in heaven. Could be quicker'n shit through a goose if the parsing is heavy...

/p

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
thedarkness
« Reply #2 on: May 29, 2007, 06:29:50 PM »

Exactly perk,

OK, I first got interested when I had this as a problem...

I had to parse a file which had the following structure:

data1
data2 (possibly missing)
empty line

This looks considerably easier to parse than it actually is. The problems are the possible absence of the second piece of data and the empty lines in between. Either one taken in isolation is not that bad, but together they gave me a headache.

Now awk can either be run like this, "awk -f cmd.awk target_to_parse", or cmd.awk (an arbitrary name) can be made standalone by giving it a shebang line and marking it executable, after which you just call it like any other executable.

Anyway, the solution to the above example came out looking like this:

Code:
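#!/bin/awk -f
BEGIN {
        FS="\n"    # each line of a record is one field
        RS=""      # empty RS: records are separated by blank lines
}
{
#       if( $2 != "" )
        # only print records that contain both data1 and data2
        if( NF > 1 )
          print $1 "\n" $2
}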

The first line is the shebang and I'm not going into that here.
Next is the BEGIN section, which is like an init() function where you set up variables etc. for the run. FS is a built-in variable which stands for field separator (in this case a line feed); RS is the record separator (an empty string, which tells awk to split records on blank lines). The next set of braces holds the main guts of the script; this is the section that gets applied to each "record". In this case I test that there is more than one field and, if there is, print the two fields separated by a newline. This satisfied the requirements for that particular task.
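To see the blank-line record splitting in action, here is a minimal sketch that pipes some made-up input through the same logic. The sample data (alpha, beta and so on) is purely hypothetical, and the program is passed inline rather than through a cmd.awk file just to keep the example in one piece.

```shell
# Hypothetical sample input: three records separated by blank lines.
# The second record is missing its data2 line.
input='alpha
beta

gamma

delta
epsilon'

# Same logic as the script: newline-separated fields, blank-line-separated
# records, and only records with more than one field are printed.
result=$(printf '%s\n' "$input" | awk 'BEGIN { FS="\n"; RS="" } NF > 1 { print $1 "\n" $2 }')

printf '%s\n' "$result"
# alpha
# beta
# delta
# epsilon
```

The lone "gamma" record is dropped because it only has one field, which is exactly the case that made the file awkward to parse by hand.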


Next I had a task where the numbers involved were very large, as mentioned in the OP. Once again awk came to the rescue. This time I've changed the regex slightly to protect the guilty ;-)

Code:
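#!/bin/awk -f
BEGIN {
        FS="\""    # split each line on double quotes
        RS="\n"    # one record per line (the default)
}
{
#       if( NF == 7 )
        # line starts with at least one space, followed by <img name
        if( /^ +<img name/ )
          print $4
}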


This time we're testing whether a line matches the regex (in this case looking for a line that starts with at least one space followed by a named <img> tag) and printing the fourth field, using " as the field separator. The lines that begin with # are commented out BTW, except the shebang of course.

This barely scratches the surface of what can be done with awk, but I hope it is enough to whet your collective appetites.
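A rough sketch of how the quote-separated fields line up, using made-up HTML (the file names and tag names are hypothetical). The pattern here is written as /^ +<img name/, which is an assumption chosen to match the description "starts with at least one space", not necessarily the exact regex from the real job.

```shell
# Hypothetical sample HTML: only the two indented <img name=...> lines
# should match; the first line has no leading space.
input='<img name="skipped" src="not-indented.gif">
  <img name="logo" src="/images/logo.gif">
  <a href="/index.html">home</a>
    <img name="banner" src="/images/banner.gif">'

# With " as the field separator, $2 is the name value and $4 is the src value.
result=$(printf '%s\n' "$input" | awk -F'"' '/^ +<img name/ { print $4 }')

printf '%s\n' "$result"
# /images/logo.gif
# /images/banner.gif
```

Splitting on the quote character means the attribute values land in the even-numbered fields, so no further regex work is needed to pull them out.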

Heaps more info here http://www.gnu.org/software/gawk/manual/html_node/index.html

Cheers,
Brad
perkiset
« Reply #3 on: May 29, 2007, 08:36:26 PM »

That's just simply great stuff TD. I don't have an immediate application, but with the amounts of text I process I think there are definitely some things to consider. I wonder about some efficiency things like converting XML into a serialized array for php... or even JSON... how strong is the output phase of the language?

Interestingly, awk is used in several examples at the PHP.net site in relation to exec() functions:

http://us.php.net/exec

'twould seem we're not the first to think of this, eh...?
thedarkness
« Reply #4 on: May 30, 2007, 04:25:40 AM »

[quote author=perkiset]I don't have an immediate application, but with the amounts of text I process I think there are definitely some things to consider. I wonder about some efficiency things like converting XML into a serialized array for php... or even JSON... how strong is the output phase of the language?[/quote]

You mean like this? (a great learning experience BTW, thanks perk)

Code:
#!/bin/awk -f
BEGIN {
        RS="\n"
        ORS=""
}
{
        line_array[NR] = $0
}
END {
        # PHP's serialized format opens an array with a literal "a:" and the element count
        print "a:" NR ":{"
        for( i = 0; i < NR; i++ )
        {
                print "i:" i ";s:" length( line_array[i+1] ) ":\"" line_array[i+1] "\";"
        }
        print "}"
}
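As a quick sanity check, here is a sketch of the same idea run against two made-up input lines. It assumes the target is the format PHP's unserialize() reads, where an array opens with a literal a: followed by the element count; the sample strings are hypothetical.

```shell
# Hypothetical two-line input.
input='foo
hello world'

result=$(printf '%s\n' "$input" | awk '
BEGIN { ORS = "" }                 # no newline after each print
{ line_array[NR] = $0 }            # remember every input line
END {
        # "a:<count>:{" opens a serialized PHP array
        print "a:" NR ":{"
        for (i = 0; i < NR; i++)
                print "i:" i ";s:" length(line_array[i+1]) ":\"" line_array[i+1] "\";"
        print "}"
}')

printf '%s\n' "$result"
# a:2:{i:0;s:3:"foo";i:1;s:11:"hello world";}
```

One caveat worth checking on your own setup: PHP expects the s: length in bytes, while some awk implementations may count characters in a multibyte locale, so non-ASCII input could need extra care.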


[quote author=perkiset]Interestingly, awk is used in several examples at the PHP.net site in relation to exec() functions:

http://us.php.net/exec

'twould seem we're not the first to think of this, eh...?[/quote]

Not by a long shot; plenty of people would have known awk long before they learned PHP. The old ARPANET brigade etc. From my recent experience I can see many reasons to use it in that fashion and in many others. Like you, I do a lot of text processing, and I wish I'd taken the time to learn awk about... oh, let's say... 10-15 years ago.

Looking forward to giving some of this a run in the benchtests coming up.

Cheers,
td
JasonD
« Reply #5 on: May 30, 2007, 05:26:46 AM »

AWK is amazing and always has been but Perk, do you REALLY use exec calls like that ?
esrun
« Reply #6 on: May 30, 2007, 08:51:47 AM »

Off-topic, did you spend much time at londonseo last night jason? I caught you on the way out but I would have been there earlier if it wasn't for finding parking and then trying to find the damn pub.
perkiset
« Reply #7 on: May 30, 2007, 09:27:38 AM »

[quote author=JasonD]AWK is amazing and always has been but Perk, do you REALLY use exec calls like that ?[/quote]

I use exec calls when I need to - not that often really, but I do. What do you mean by "Like That" - is there something nasty at the php.net site? I never use an exec that a surfer could somehow get to... more often it's a poor man's threading technique. Sheesh JD, now you've got me all nervous and shit.
JasonD
« Reply #8 on: June 01, 2007, 07:53:06 AM »

Nothing wrong with doing an exec call, but it's bloody ugly code :)

I code shittily all the time too!