
thedarkness
Anyone here use awk?
I had always considered it just a more complex sed/grep, but I recently had a situation where I HAD to use it, so I started to grok it and, man, it is pretty cool. I still know bugger all, but I just used it on an AMD Athlon(tm) XP 1800+ with 256MB RAM to do the following: use a regex to find matching lines and pull out the seventh field from said lines.
It processed 6969140 lines and extracted 3291994 results in a minute and a half! Yes, you read it right, nearly seven million lines and three point three million results! Imagine what it would do on a real computer! Man, I'm IMPRESSED! Well worth the time invested to learn how it works, in my humble opinion.
Cheers,
td
[edit]Forgot to conjugate the verb "to go"[/edit]
perkiset
For those who don't know the tool, AWK is a text processing mini-language that was originally built by three guys whose last names started with A, W and K. It's been around for a long time.
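For anyone who hasn't seen it in action, the job described in the opening post (match lines with a regex, print the seventh field) fits on one line. This is only a minimal sketch: the sample data, the /^GET/ pattern and the function name seventh_of_matches are all made up for illustration, since the real file and regex aren't shown in the thread.

```shell
# Hypothetical sample input: whitespace-separated fields, one record per line.
sample='GET /a 200 x y z alpha
POST /b 500 x y z beta
GET /c 200 x y z gamma'

# Print field 7 of every line matching the (made-up) regex /^GET/.
# With the default FS, awk splits each line on runs of spaces and tabs.
seventh_of_matches() {
    awk '/^GET/ { print $7 }'
}

printf '%s\n' "$sample" | seventh_of_matches
# prints:
# alpha
# gamma
```

The pattern-then-action shape (`/regex/ { action }`) is the whole program; awk reads the input line by line and runs the action only on lines that match.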
TD - tell us how you used it and give us a primer... I have completely forgotten the tool, and since a vast amount of my processing with clients is text based, a refresher would be good.
I'm thinking also that AWK may be an EXCELLENT adjunct tool to the scraper and parser... in conjunction with, say, PHP's shell_exec(), which returns the stdout of a process as a string in PHP, this could be a match made in heaven. Could be quicker'n shit through a goose if the parsing is heavy...
/p
thedarkness
Exactly perk,
OK, I first got interested when I had this as a problem... I had to parse a file which had the following structure:

data1
data2 (possibly missing)
empty line

This looks considerably easier to parse than it actually is; the problems are the possible absence of the second piece of data and the empty lines in between. Either one taken in isolation is not that bad, but together they presented me with a headache.
Now awk can either be run like this, "awk -f cmd.awk target_to_parse", or the cmd.awk (arbitrary name) can be made stand-alone by the use of a shebang line, and you just call it like any other executable.
Anyway, the solution to the above example came out looking like this:

#!/bin/awk -f
BEGIN {
    FS="\n"
    RS=""
}
{
    # if( $2 != "" )
    if( NF > 1 )
        print $1 "\n" $2
}

The first line is the shebang and I'm not going into that here. Next is the BEGIN section, which is like an init() function where you set up variables etc. for the run. FS is an internal variable which stands for field separator (in this case a line feed); RS is the record separator (in this case a blank line). The next set of braces holds the main guts of the script; this is the section that gets applied to each "record". In this case I test that there is more than 1 field and, if there is, print the two fields separated by a new line. This satisfied the requirements for that particular task.
Next I had a task where the numbers involved were very large, as mentioned in the OP. Once again awk came to the rescue; this time I've changed the regex slightly to protect the guilty ;-)

#!/bin/awk -f
BEGIN {
    FS="\""
    RS="\n"
}
{
    # if( NF == 7 )
    if( / *?<img name/ )
        print $4
}

This time we're testing whether a line matches the regex (in this case looking for a line that starts with at least one space and is a named <img>) and, if it does, printing the fourth field.
This barely scratches the surface of what can be done with awk, but I hope it is enough to whet your collective appetites.
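To try the two ideas above without creating script files, here is a minimal sketch run straight from the shell. The sample records, the simplified <img name regex and the function names two_line_records and fourth_quoted_field are all invented for illustration; only the FS/RS settings are taken from the scripts in this post.

```shell
# Hypothetical input in the shape described above:
# data1, an optional data2, then a blank line.
records='alpha
one

beta

gamma
three
'

# RS="" switches awk to paragraph mode (records are separated by blank lines);
# FS="\n" makes each line of a record its own field.
two_line_records() {
    awk 'BEGIN { FS = "\n"; RS = "" } NF > 1 { print $1 "\n" $2 }'
}

# FS="\"" splits each line on double quotes, so for
#    <img name="logo" src="a.png">
# $2 is logo and $4 is a.png.
fourth_quoted_field() {
    awk 'BEGIN { FS = "\"" } /<img name/ { print $4 }'
}

printf '%s' "$records" | two_line_records
# prints alpha, one, gamma, three (one per line); the lone "beta" record is skipped

printf '%s\n' ' <img name="logo" src="a.png">' '<p>text</p>' | fourth_quoted_field
# prints a.png
```

Writing the test as a pattern in front of the braces (`NF > 1 { ... }`) is equivalent to the `if` inside the action block; it is just the more compact awk idiom.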
Heaps more info here: http://www.gnu.org/software/gawk/manual/html_node/index.html
Cheers,
Brad
perkiset
That's just simply great stuff, TD. I don't have an immediate application, but with the amounts of text I process I think there are definitely some things to consider. I wonder about some efficiency things like converting XML into a serialized array for php... or even JSON... how strong is the output phase of the language?
Interestingly, awk is used in several examples at the PHP.net site in relation to exec() functions: http://us.php.net/exec
'Twould seem we're not the first to think of this, eh...?
thedarkness
Quote from: perkiset
"I don't have an immediate application, but with the amounts of text I process I think there are definitely some things to consider. I wonder about some efficiency things like converting XML into a serialized array for php... or even JSON... how strong is the output phase of the language?"
You mean like this? ... experience BTW, thanks perk)
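On the output question: awk has a C-style printf, so emitting structured text such as JSON is mostly a matter of format strings. The following is a minimal sketch and not something posted in this thread; the "key value" input format and the function name to_json are assumptions, and it does no JSON escaping, so it only holds for data without quotes or backslashes.

```shell
# Hypothetical input: one "key value" pair per line, with no quotes or
# backslashes in the data (this sketch does no JSON escaping).
pairs='name perkiset
lang awk'

# Build a single JSON object: open the brace in BEGIN, print one
# "key": "value" pair per input line (comma-separated), close it in END.
to_json() {
    awk '
        BEGIN { printf "{" }
        { printf "%s\"%s\": \"%s\"", (NR > 1 ? ", " : ""), $1, $2 }
        END { print "}" }
    '
}

printf '%s\n' "$pairs" | to_json
# prints: {"name": "perkiset", "lang": "awk"}
```

For real XML input or untrusted data a proper parser is the safer route; awk is at its best when the input is already line- and field-oriented.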
