determining the topic of a page?

nutballs

any thoughts on how to do this?
I mean, you can count the words, but that just gives you the most used word, which is not necessarily the topic. For example, coffee may be the most used word, but coffee colored paint might be the topic.

thoughts?

perkiset

Your own stuff?

Or robotic interpretation of ... other?

This is the Holy Grail really... How much do you think G has spent trying to ascertain just that thing?

::perk waits patiently for someone WAY smarter than him to come by and fill in the blanks::

piratescurvy

I may not be sure what you are getting at, but do you mean using synonyms and such to make an overall topic?

Like instead of coffee, coffee, blah blah blah, coffee.
Where the topic is obviously coffee.

You have... arabica beans, energy drinks, coffee, caffeinated beverages, blah blah blah, crappy coffee.
The topic might be "stimulant drinks".

Wikipedia seems to do a good job of redirecting you to the right page if you type in a synonym that is different from the page title. Or it will give you a list of disambiguations.

nutballs

I know perk, but the grail must exist in a simpler form

as for what i mean. not so much synonyms, since that is probably way out there, but more like "relational occurrence". so not arabica or green mountain being id'd as coffee, but more like finding the most popular phrase in a block of text. Like

I bought some flavored coffee the other day. The flavored coffee was great. Chocolate coffee was the name.

Now, the way I can do it is count the words, and decide coffee is it. But that would be wrong. "flavored coffee" is the real phrase. I am not looking to guess at the theoretical subject, just determine the most popular phrase like "flavored coffee".

make sense?

jammaster82

me too.

perkiset

If I needed to do this, I'd ping Fantomaster - his expertise in grammatics and linguistics might prove to be good gear turners. There's the really cerebral effort of divining intention - which is highly subjective, or the notion of mechanical sentence division for something slightly smarter than word count. Sounds like you're moving towards advanced mechanics rather than the somewhat quixotic divination of "meaning" - which is good.

Frankly my friend, this puzzle creeps me out and I don't know what else to suggest to get you started.

DangerMouse

Would this be where the lsi concept creeps in? Group pages with a similar pattern together and maybe they're on the same topic?

vsloathe

Ah, I should have gotten to this sooner.

For a while now, I've had a contextual ad-serving system on the back burner. It works *really* well - I can describe the methodology I use.

First of all, you need to approach this very systematically; not much AI is involved per se. For instance, you need a ready-made database of common word pairings. Some people have mentioned LSI, which is a good place to start, as are the traditional keyword suggesters (wordtracker, wordze, et al). Do the word count thing, but create an object for each base word that holds the words surrounding it. The more times a word appears next to a popular word, the more likely it is that those terms are related. An example:

plant
-<before>
--potted
-<after>
--food
--bunch of random words we don't care about, no recognizable pattern

In this example, we see that the article is most likely about potted plant food. The word potted is *always* before the word plant, so the page has to be about potted plants. Then we see that the word food is often (but not always) after the word plant. From this we could gather that the article is about potted plants and potted plant food.
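
A minimal sketch of the before/after bookkeeping described above, in Python. It only covers the counting step: the ready-made database of word pairings is left out, and the function name `neighbor_profile`, the tokenizing regex, and the sample text are all assumptions made for illustration, not the actual system.

```python
import re
from collections import Counter, defaultdict

def neighbor_profile(text):
    """Count each word, plus the words seen directly before and after it."""
    # Crude tokenizer (assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    before = defaultdict(Counter)  # before[w][x] = times x came right before w
    after = defaultdict(Counter)   # after[w][x]  = times x came right after w
    for prev, cur in zip(words, words[1:]):
        before[cur][prev] += 1
        after[prev][cur] += 1
    return counts, before, after

# Made-up sample text, chosen to mirror the "potted plant food" example.
text = ("Potted plant food keeps a potted plant healthy. "
        "Buy potted plant food for every potted plant.")
counts, before, after = neighbor_profile(text)

print(counts["plant"])                  # how often the base word occurs
print(before["plant"].most_common(1))   # most frequent word before "plant"
print(after["plant"].most_common(1))    # most frequent word after "plant"
```

Here "potted" precedes every "plant", while "food" follows it only some of the time, which is the pattern the tree above is meant to show. Note that this sketch ignores sentence boundaries, so the last word of one sentence counts as a neighbor of the first word of the next.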

We do this dance for every word on the page, and we should come out with some intelligible results. Don't forget the dead giveaways, like what's in the meta tags or the <h> tags, but remember that those can be, and often are, spoofed, so weight them accordingly.
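
One way the "weight it accordingly" part could look, as a rough Python sketch. The field names, the weights in `FIELD_WEIGHTS`, and the sample page are all hypothetical; it assumes the title, heading, meta, and body text have already been pulled out of the HTML by some other step.

```python
import re
from collections import Counter

# Hypothetical weights: headings and titles count extra, meta tags count
# less than body text because they are the easiest thing to spoof.
FIELD_WEIGHTS = {"body": 1.0, "heading": 2.0, "title": 2.0, "meta": 0.5}

def weighted_term_scores(fields):
    """Sum per-word scores across page fields, scaled by each field's weight."""
    scores = Counter()
    for field, text in fields.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for word in re.findall(r"[a-z']+", text.lower()):
            scores[word] += weight
    return scores

# Made-up page whose meta keywords have nothing to do with the content.
page = {
    "title": "Potted plant food",
    "heading": "Feeding potted plants",
    "meta": "cheap pills casino",
    "body": "Potted plant food keeps a potted plant healthy.",
}
scores = weighted_term_scores(page)
print(scores.most_common(3))
```

With these weights the spoofed meta words barely register, while "potted" is reinforced by the title, the heading, and the body. The exact numbers would need tuning against real pages.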

nutballs

This is for my own contextual-lazy-bastard ad engine as well, which was probably pretty apparent.

ok. i think i get it, and it is along the line of what I was thinking i would have to do.
my thought was to not have any pre-existing pairings of phrases.

So I would find any words that are not in my stop word list.
count each occurrence (root word).
grab the word before and after each root word.
and count those.

that sounds like a lot of iterations to me, and slow.

so the other option is to just count 1-word, 2-word, and 3-word phrases on the page, which again is a lot of iterations, but is simpler. I think i could just explode the text into an array, then count into another array, checking whether each phrase already exists and incrementing its counter. Then I can decide on count thresholds to determine if a word-group is a topic or not.
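
A minimal sketch of that phrase-counting idea in Python, run on the flavored coffee example from earlier in the thread. The stop word list, the `min_count` threshold, and the rule of ranking longer phrases ahead of shorter ones are all assumptions made for illustration; the function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (assumption); a real one would be longer.
STOP_WORDS = {"i", "the", "a", "some", "was", "other", "of", "and", "to", "is"}

def phrase_counts(text, max_n=3):
    """Count 1- to max_n-word phrases, never crossing a sentence boundary."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # Skip phrases that start or end with a stop word.
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                counts[" ".join(gram)] += 1
    return counts

def topic_candidates(counts, min_count=2):
    """Keep phrases seen at least min_count times, longest phrases first."""
    kept = [(p, c) for p, c in counts.items() if c >= min_count]
    return sorted(kept, key=lambda pc: (len(pc[0].split()), pc[1]), reverse=True)

text = ("I bought some flavored coffee the other day. "
        "The flavored coffee was great. Chocolate coffee was the name.")
counts = phrase_counts(text)
candidates = topic_candidates(counts)
print(candidates)
```

On this text "coffee" is still the most frequent single word, but "flavored coffee" is the only multi-word phrase that clears the threshold, so it comes out on top. It is one pass over the words per phrase length, so the cost grows linearly with page size rather than exploding.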

both methods just sound ridiculously intensive though. maybe not. I will have to try a small experiment on a controlled text chunk and see what the possibilities are.

ironically this is not so much for BH. When i spit out a site, i know the topic, since after all, i made the damn thing. So i can limit to a few topics, and route accordingly. But there is another thing I want to do, a little more random and less controlled, for which an automatic ad-server would work great.
But, i think I only have brute force methods at my disposal since AI-concepts just make me want to go play video games and make fun of the AI...

perkiset

NICE kickoff VS - that's a great way to tackle the puzzle. There are some details there, but I think you're right and a good hashing algo and DB would make this not entirely inefficient.

vsloathe

I wouldn't say it's inefficient or resource-intensive. Keep in mind you only need to do this once per page. If you do it more, you're doing it wrong.
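
A small sketch of the "once per page" point, combined with the hashing idea mentioned above. The in-memory dict stands in for a database table, and `cached_topics` and `slow_analysis` are hypothetical names; the placeholder analysis is not a real topic extractor.

```python
import hashlib

# Stand-in for a database table keyed by content hash (assumption).
_topic_cache = {}

def cached_topics(page_text, analyze):
    """Run analyze(page_text) only the first time this exact text is seen."""
    key = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if key not in _topic_cache:
        _topic_cache[key] = analyze(page_text)
    return _topic_cache[key]

# Placeholder analysis that records each call so the caching is visible.
calls = []
def slow_analysis(text):
    calls.append(text)
    return text.lower().split()[:3]

first = cached_topics("Flavored coffee review", slow_analysis)
second = cached_topics("Flavored coffee review", slow_analysis)
print(first == second, len(calls))
```

The second lookup returns the stored result without running the analysis again. Keying on a hash of the content rather than the URL means a changed page gets re-analyzed automatically.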
