The Cache: Technology Expert's Forum
 
Author Topic: Footprint catcher  (Read 3134 times)
kurdt (Lifer, 1153 posts)
« on: August 31, 2009, 11:02:14 PM »

I'm writing a footprint catcher to make it a little bit easier to find good, "unique" footprints for locating certain types of sites, instead of relying on "powered by" style footprints.

What's the best approach? I was thinking of writing a simple phrase extractor/exploder, then feeding it verified example sites and seeing which phrases are mentioned on every site. The problem is that I don't know a good, quick way to do this. I would explode the text into words and then loop over them, checking which words from page 1 also appear on page 2. That's a lot of loops, and at least now, before my morning coffee, I can't see how to make it accept more than a fixed X number of sites, because of the loops. Is there some nice way to do this?

*update* I realized that I'm going to get a "google php.net for the array_diff function" reply, but that's not the problem :) The problem is the whole structure of the script. Footprints can run from 2 words to 15 or more words. That's a lot of arrays, you know ;)
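
(A minimal sketch of one loop-free shape for this, in PHP since array_diff came up: key each page's word n-grams by the phrase itself and let array_intersect_key do the matching. extract_ngrams and the $pages input are illustrative names, not kurdt's actual code.)

<?php
// Sketch: collect every 2..15 word phrase from each page, then keep
// only the phrases that occur on all sample pages.
function extract_ngrams($html, $min = 2, $max = 15) {
    $words  = preg_split('/\s+/', strtolower(strip_tags($html)), -1, PREG_SPLIT_NO_EMPTY);
    $ngrams = array();
    $total  = count($words);
    for ($i = 0; $i < $total; $i++) {
        for ($n = $min; $n <= $max && $i + $n <= $total; $n++) {
            // The phrase is the key, so matching is a hash lookup, not a loop.
            $ngrams[implode(' ', array_slice($words, $i, $n))] = true;
        }
    }
    return $ngrams;
}

$pages  = array(/* raw HTML of each verified example site */);
$common = null;
foreach ($pages as $html) {
    $grams = extract_ngrams($html);
    // Works for any number of sites: just keep intersecting.
    $common = ($common === null) ? $grams : array_intersect_key($common, $grams);
}
print_r(array_keys($common)); // candidate footprints

Memory is the real cost (each word position spawns up to 14 phrases), so in practice you would cap the n-gram range or hash the keys.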
« Last Edit: August 31, 2009, 11:09:12 PM by kurdt »

perkiset (Olde World Hacker, Administrator, 10096 posts)
« Reply #1 on: August 31, 2009, 11:16:54 PM »

This is actually pretty fascinating to me, although I've spent absolutely no time on it. I'll be looking forward to anyone weighing in.

Side note: I am completely bothered by the fact that I can't get my arms around how Shazam works on the iPhone. It listens to any 15-second segment of a song, then comes back with the name, album, etc. really, really fast (given the magnitude of the query). I keep grinding away at creating anchor points, building hash values for chunks of songs, then trying to find an anchor point in whatever it listens to, creating a hash, and doing a lookup ... it's got me twisted up.

And the answer is probably very similar to your challenge here, kurdt. It's essentially the same problem: find similarities between unknown numbers of things, with unknown sizes and shapes and no hard anchor points.

*popcorn*
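
(Since the anchor-and-hash idea is in the air: a toy illustration in PHP of chunk-hash fingerprinting with a voting lookup. This is purely a guess at the shape of the approach, not Shazam's actual algorithm; $songs and $sample stand in for real audio features.)

<?php
// Hash fixed-size windows of each known signal into a lookup table,
// then let the unknown sample vote for whichever song it overlaps most.
function chunk_hashes(array $features, $size = 8) {
    $hashes = array();
    for ($i = 0; $i + $size <= count($features); $i++) {
        $hashes[] = md5(implode(',', array_slice($features, $i, $size)));
    }
    return $hashes;
}

$songs  = array(/* 'name' => array of numeric features */);
$sample = array(/* features of the 15-second clip */);

$index = array();
foreach ($songs as $name => $features) {
    foreach (chunk_hashes($features) as $h) {
        $index[$h][] = $name;
    }
}

$votes = array();
foreach (chunk_hashes($sample) as $h) {
    if (isset($index[$h])) {
        foreach ($index[$h] as $name) {
            $votes[$name] = isset($votes[$name]) ? $votes[$name] + 1 : 1;
        }
    }
}
arsort($votes);
echo key($votes); // best guess: the song with the most shared chunk hashes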
kurdt (Lifer, 1153 posts)
« Reply #2 on: August 31, 2009, 11:24:45 PM »

I always thought Shazam might be using some sort of frequency checking. I mean analyzing the beat and the peak frequencies of the sample, then just running it through a database where it has analyzed a LOT of different songs. I have gotten Shazam to return a wrong song that actually resembles the part I fed it, so it's probably based on some sort of probability algorithm as well.
« Last Edit: August 31, 2009, 11:27:00 PM by kurdt »
kurdt (Lifer, 1153 posts)
« Reply #3 on: August 31, 2009, 11:30:18 PM »

> And the answer is probably very similar to your challenge here, kurdt. It's essentially the same problem: find similarities between unknown numbers of things, with unknown sizes and shapes and no hard anchor points.
Actually, this is not the problem; I have solved that already. The problem for me is really how to gather these datasets in the fastest manner possible, with a good script structure :)

Funny thing is, I'm now sipping my second cup of morning coffee and I think I have the solution. I need to test it and I'll come back with results ;)
oldenstylehats (Rookie, 19 posts)
« Reply #4 on: September 01, 2009, 03:04:34 AM »

Neat bit of synchronicity here. My team has been looking at this problem for the last few weeks. We've approached it with a similarity-based probability model, but it isn't effective without a huge amount of test data, and it seems to me (though I am not the stats guy) like it might be overkill. I'm also interested to see if anyone has any insight on this.
rcjordan (Lifer, 882 posts)
« Reply #5 on: September 01, 2009, 08:00:19 AM »

>Footprint catcher

Odd, but for almost my entire online 'career' I've been suggesting/warning that footprints are bad. Though there are times I've targeted a few phrases ("submit your site", etc., back in the directory heyday), it's hard to flip over to the other side. Like Perk, I'll watch and contribute if I can.

I assume you already have in mind going after the big CMS systems' footprints? Or is that off-target for your purpose?
kurdt (Lifer, 1153 posts)
« Reply #6 on: September 01, 2009, 08:07:08 AM »

I resolved this like I theorized above; I ended up using a kind of similarity measuring as well, but mainly just brute force. The code isn't written yet, but I found a way to see where the hotspot is. Now it takes some coding and optimizing. It remains to be seen how heavy the final script is, and whether I need to get it coded in C to make it run fast enough. When you process a few hundred million pages, a little lag starts to be felt ;)
lamontagne (Journeyman, 89 posts)
« Reply #7 on: September 04, 2009, 11:59:25 PM »

Find a list of sample sites that share the same URL structure... It is best if the sample input sites contain very little actual content, but content won't change the results much; it will only increase the time it takes to find footprints...

Your crawler for this sample input should go n levels deep into the sample input sites; the value of n depends on how advanced you want to get with it.

The process of crawling a single page on a single site goes like this:
grab the HTML from the page, strip all tags, then strip all stop words and punctuation. You are left with the core keywords of the page. Split them on " " and store each word into a database table with four columns: one for an identification, one for the word, one for the URI (just the URI relative to where you began the crawl, not the domain name, so "/viewtopic.php?id=1" even if the URL was http://mysite.com/forum/viewtopic.php?id=1 and you began the crawl at http://mysite.com/forum), and one for the count.

If the keyword and URI are already in the table, increment the count for that row by 1. Do this for every page crawled on the site, up to an n-level depth. Basically you're creating a reverse index (like a very simple search engine would use).

Repeat this process using the same table for all sample input sites...
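
(A minimal sketch of that per-page step in PHP/MySQL. The PDO handle, the stopword list, and the unique key on (word, uri) are assumptions; only the footfinder table name comes from the query later in the post.)

<?php
// Strip a page down to core keywords and upsert them into the
// id / word / uri / count table described above.
function index_page(PDO $pdo, $html, $uri, array $stopwords) {
    $text  = strtolower(strip_tags($html));
    $text  = preg_replace('/[^a-z0-9\s]/', ' ', $text);        // drop punctuation
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $words = array_diff($words, $stopwords);                   // drop stop words

    // Assumes UNIQUE(word, uri), so a repeat sighting bumps the count.
    $stmt = $pdo->prepare(
        'INSERT INTO footfinder (word, uri, `count`) VALUES (?, ?, 1)
         ON DUPLICATE KEY UPDATE `count` = `count` + 1'
    );
    foreach ($words as $w) {
        $stmt->execute(array($w, $uri));
    }
}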


Now in the database simply do a "SELECT * FROM footfinder ORDER BY uri". This should give you, for each URI, the list of keywords common amongst the examples.

You could also just set up the CMS locally, crawl the basic install grabbing all text between any tags that is longer than, say, 15 characters ... >(.[^<>]{15,})< ... and store it in a text file. That would be a faster, simpler, and more automated approach. (There are still some issues with the second method, which I will leave up to you to resolve... I can't give you everything.)
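
(Both read-out routes, sketched in PHP; the DSN and the local test URL are placeholders.)

<?php
// 1) The query route: dump the index ordered by uri and look for the
//    keywords that recur across every sample site.
$pdo = new PDO('mysql:host=localhost;dbname=foot', 'user', 'pass');
foreach ($pdo->query('SELECT * FROM footfinder ORDER BY uri') as $row) {
    echo $row['uri'] . "\t" . $row['word'] . "\t" . $row['count'] . "\n";
}

// 2) The local-install route: run the post's >(.[^<>]{15,})< pattern over
//    a clean install and keep each 15+ character run of text between tags.
$html = file_get_contents('http://localhost/cms/index.php');
preg_match_all('/>(.[^<>]{15,})</', $html, $m);
$candidates = array_unique(array_map('trim', $m[1]));
file_put_contents('footprints.txt', implode("\n", $candidates));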
kurdt (Lifer, 1153 posts)
« Reply #8 on: September 05, 2009, 12:17:54 AM »

Heh, that's a pretty good way to do it with "brute force". I might actually try that, because the solution I came up with is way more resource-hungry than yours.
deregular (Expert, 172 posts)
« Reply #9 on: September 07, 2009, 01:30:57 AM »

You could also flip through robots.txt, any available CSS files, and HTML comments for footprints a lot of people forget about.
kurdt (Lifer, 1153 posts)
« Reply #10 on: September 07, 2009, 01:37:16 AM »

> You could also flip through robots.txt, any available CSS files, and HTML comments for footprints a lot of people forget about.
This might sound stupid, but how do you search HTML comments in Google? Or CSS files... just try inurl:"common.css"?

I think a footprint catcher is meant to find footprints you can use in search engines. It's very easy to find patterns in the source code, but it gets a little more difficult when there's a middle man (the search engine) involved and you are playing by their rules as to what you can search for.
deregular (Expert, 172 posts)
« Reply #11 on: September 07, 2009, 11:56:52 PM »

@kurdt
Sorry, I thought we were talking about crawling sites on our own.

You could easily store an array of common .css names, e.g. style.css or css.css, whatever CSS filenames the CMS you are targeting uses by default.
To do it through Google, you could always try the filetype:css command; off the top of my head I'm not sure that works, but there is bound to be a workaround for it.
Edit: inurl:style filetype:css brings up some interesting results.

Same thing with HTML comments:
crawl the page, then preg_match and compare any HTML comments against a set you have put together; if there's a match, store the URL, if not, dump it from the database.
Not sure if Google can bring back a search using <!-- --> HTML comments in it. But you never know.
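
(A quick sketch of that comment check in PHP; the $known set and the URL are placeholders.)

<?php
// Pull every HTML comment from a fetched page and keep the URL only
// if one of the comments matches the known set.
$known = array('<!-- Powered by SomeCMS -->');   // hypothetical footprint set
$html  = file_get_contents('http://example.com/');
preg_match_all('/<!--.*?-->/s', $html, $m);

$hits = array_intersect(array_map('trim', $m[0]), array_map('trim', $known));
echo $hits ? "match: store the url\n" : "no match: dump it from the database\n";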

Just something to spin the gears.