The Cache: Technology Expert's Forum
 
Author Topic: The difference percentage to aim for in my article rewriter?  (Read 8220 times)
leaferz
« on: September 11, 2009, 06:24:14 PM »

Sorry if some are reading this for the second time but I should have posted this question here in the first place.

I'm about to begin coding my content rewriter and am wondering what percentages people are aiming for (before and after rewrite) in their tools?

Also, does anyone have any suggestions for a command-line grammar tool I could use post-rewrite to catch the odd grammatical error?
Is AbiWord with the link-grammar plugin the only way to go?

Thx

lamontagne
« Reply #1 on: September 11, 2009, 07:53:44 PM »

you're wasting your time. there's your answer.
isthisthingon
« Reply #2 on: September 11, 2009, 08:01:14 PM »

Quote from: lamontagne
you're wasting your time. there's your answer.

Time notwithstanding, I emphatically disagree with lamontagne's answer.  Hold up - who took my gin??   

perkiset
« Reply #3 on: September 11, 2009, 08:48:28 PM »

Welcome to The Cache, leaferz.

The notion of straight percentages is pretty antiquated as is "weighting", and really doesn't apply much anymore. The new notion is "naturalness." If I could stuff 80% keywords into my content and it still looked and read naturally, I'd do it. But that rarely happens.  Roll Eyes

I don't do rewrite and gin-up anymore, using natural and original text everywhere I can, and recombining it as much as I can to create originality. I use semantic markup to create significance, which works a LOT better for me than stuffing by percentage.
isthisthingon
« Reply #4 on: September 11, 2009, 09:44:18 PM »

Quote from: perkiset
stuffing by percentage

 Ditto  ROFLMAO
leaferz
« Reply #5 on: September 11, 2009, 09:48:44 PM »

Quote from: perkiset on September 11, 2009, 08:48:28 PM
The notion of straight percentages is pretty antiquated as is "weighting", and really doesn't apply much anymore. The new notion is "naturalness." [...] I use semantic markup to create significance, which works a LOT better for me than stuffing by percentage.


Thx Perk.  Grin

Keyword stuffing isn't my end goal either. I want to stay as far away as possible from those "Markoved" disasters I used to use a year or two ago and focus on readability before anything else. I'm fine with having the word and/or related words mentioned 2-3 times within the article. There are always other ways of achieving that goal throughout the rest of the page.

My main worry is getting pages (or domains) dropped because of duplicate content.

Here's random text from today's news (the full article totaled 745 words):

Quote
But a Liberal strategist says the ads are not aimed at the 18 per cent, mostly politicos, who know him. Rather, they are trying to hit the 82 per cent who haven't been introduced to him. “They may not be sexy,” the strategist says. He dismisses the partisans who want the ads to be more edgy. “He [Mr. Ignatieff] is talking for the first time to Canadians who are, hopefully, seeing him in their living rooms.

If I were to change it to:

Quote
But a Liberal planner says the ads are not aimed at the 18 per cent, mostly politicos, who know him. Instead, they are trying to hit the 82 per cent who haven't been informed to him. “They may not be sexy,” the strategist says. He disregards the partisans who want the ads to be more edgy. “He [Mr. Ignatieff] is talking for the first time to Canadians who are, hopefully, seeing him in their living rooms.

News articles will be a bit more difficult, but by doubling the size with two articles on the same topic (from different sources) I believe it could be achieved.

Would those subtle changes be enough to avoid the penalty?

The original comes up in a Google search while the rewritten version doesn't. Would that be a proper way to gauge it?
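
For a more repeatable gauge than a search query, you could compute a rough word-level similarity percentage between the two versions. Here is a minimal Python sketch using only the standard library; the two strings are stand-ins for your full before/after texts, and there's no claim that any engine uses this particular measure:

Code:
# Rough gauge of how much a rewrite differs from its source text.
# Standard library only; the strings below stand in for the full articles.
from difflib import SequenceMatcher

original = ("But a Liberal strategist says the ads are not aimed at the "
            "18 per cent, mostly politicos, who know him.")
rewritten = ("But a Liberal planner says the ads are not aimed at the "
             "18 per cent, mostly politicos, who know him.")

def similarity(a: str, b: str) -> float:
    """Return a 0..1 ratio of shared word sequences between the two texts."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

pct_same = similarity(original, rewritten) * 100
print(f"{pct_same:.1f}% similar / {100 - pct_same:.1f}% different")

Whatever threshold the engines actually use is anyone's guess, so the number is only useful for comparing one rewrite pass against another.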
lamontagne
« Reply #6 on: September 11, 2009, 09:54:03 PM »

Quote from: leaferz
My main worry is getting pages (or domains) dropped because of duplicate content.

Like I said, waste of time. Build sites in stages, not in content.
leaferz
« Reply #7 on: September 11, 2009, 09:57:27 PM »

Quote from: lamontagne on September 11, 2009, 09:54:03 PM
Like I said, waste of time. Build sites in stages, not in content.

What I want to achieve is a small 10-20 page website that grows on a preset timer. I still need actual content for those 10-20 pages, though.
perkiset
« Reply #8 on: September 11, 2009, 09:59:33 PM »

Quote from: leaferz
Would those subtle changes be enough to avoid the penalty?

Oooh, I'd be hesitant to make such a hard-line declaration as yes or no here. People like VSloathe work with maths that calculate the percentage of difference, just to defeat such a trick. But what that ratio is, I have no idea. I do know that the best hashing algos toss out first and last changes and a certain percentage of difference, perhaps even notional difference in the strongest cases (i.e., changes in wording that do not change the intention or meaning of the sentence). It is reasonably easy to look for adverb and adjective differences - nouns, subjects and such are harder to change and still sound readable, yet if I were a betting man, this is where I'd strive to inject difference.

A buddy just told me the other day of a story that I'm intrigued by. His site is not crawled often by G, as it doesn't change that fast. As his story goes, scraper 'bots are hitting him regularly and, when there's new content, they are putting it up before him and HE is the one getting the dupe content penalty. I am not sure how he is proving this, but it makes for an interesting story.

But there is the other camp that claims there is no dupe penalty.

Let me simply offer this thought. If what you are presenting is a couple of paragraphs of stuff, and you use the same thing all over the place, you'll have troubles. But what if, for each site, you had a couple of site-themed paragraphs that are used all over the site in conjunction with your content? These paragraphs would be specific to each site. So the exact same piece of content from site A would not hash out to be the same as on site B, because the site-specific content would change the hash value of the scraped content enough. Just a thought. Wink
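
To make the hashing idea concrete, here is a minimal word-shingle sketch in Python (standard library only). The shingle size, example texts and hash truncation are arbitrary assumptions for illustration, not anything an engine is known to use; the point is just that the same scraped paragraph wrapped in different site-themed text shares far fewer shingles:

Code:
# Toy word-shingle fingerprinting: the same scraped paragraph wrapped in
# different site-themed text produces largely different shingle sets, so a
# hash-overlap comparison sees the two pages as less similar.
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """Hash every run of k consecutive words into a short fingerprint."""
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()[:12]
        for i in range(max(len(words) - k + 1, 1))
    }

def overlap(a: str, b: str) -> float:
    """Jaccard overlap of the two shingle sets, 0..1."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

scraped = "the quick brown fox jumps over the lazy dog near the river bank"
site_a = "Site A fishing guide intro paragraph goes here. " + scraped
site_b = "Site B hiking newsletter preamble, totally different text. " + scraped

print(f"identical copies: {overlap(scraped, scraped):.2f}")
print(f"site A vs site B: {overlap(site_a, site_b):.2f}")

The more site-specific text there is relative to the scraped block, the lower the overlap score falls.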
perkiset
« Reply #9 on: September 11, 2009, 10:00:49 PM »

Quote from: lamontagne
Like I said, waste of time. Build sites in stages, not in content.

Why don't you expand on that, Lamont?
leaferz
« Reply #10 on: September 11, 2009, 10:18:41 PM »

Quote from: perkiset on September 11, 2009, 09:59:33 PM
[...] It is reasonably easy to look for adverb and adjective differences - nouns, subjects and such are harder to change and still sound readable, yet if I were a betting man, this is where I'd strive to inject difference.

I've set up the database over the last few weeks for each type of word, and over the next few days I'll finish off the ruleset on which words can be changed and which should be avoided (people's names, numbers, locations, common words, etc.). Once I've defined which words can be changed, another set of guidelines decides what type of word it actually is (noun, verb, etc.) and makes the call to swap it with the next most probable word, carrying over any suffixes that were tacked onto the original (ing, ed, etc.).
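
A stripped-down sketch of that kind of ruleset in Python, with a tiny hand-made synonym table standing in for the word database; all the word lists and the suffix handling here are placeholder assumptions, and real POS tagging and punctuation handling are left out:

Code:
# Toy rule-driven synonym swapper: skip protected tokens (likely names,
# numbers, common words), then swap the base word and re-attach its suffix.
# The tables below are stand-ins for a real database of word candidates.
SYNONYMS = {"strategist": "planner", "aim": "direct", "dismiss": "disregard"}
COMMON = {"the", "a", "an", "and", "or", "of", "to", "at", "who", "not", "are"}
SUFFIXES = ("ing", "ed", "es", "s")

def protected(token: str) -> bool:
    """Leave capitalised words (likely names), numbers and common words alone."""
    return token[0].isupper() or any(c.isdigit() for c in token) or token.lower() in COMMON

def swap(token: str) -> str:
    if protected(token):
        return token
    for suffix in SUFFIXES:
        if token.endswith(suffix) and token[:-len(suffix)] in SYNONYMS:
            return SYNONYMS[token[:-len(suffix)]] + suffix
    return SYNONYMS.get(token, token)

def rewrite(sentence: str) -> str:
    return " ".join(swap(t) for t in sentence.split())

print(rewrite("the Liberal ads are aimed at a strategist"))
# -> "the Liberal ads are directed at a planner"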

Quote from: perkiset
But what if, for each site, you had a couple of site-themed paragraphs that are used all over the site in conjunction with your content? [...] The site-specific content would change the hash value of the scraped content enough.

That's exactly how I wanted to surround the article. Grin Once the article is in place, have text around it that changes every couple of days to keep the website fresh and to shift its overall value. Possibly a related posts section, latest forum posts on the other side, RSS news from my own feed on my own feeder news site, etc.

It seems I'm on the right path, though I could have explained myself a bit better.
lamontagne
« Reply #11 on: September 11, 2009, 10:37:18 PM »

Quote from: perkiset on September 11, 2009, 10:00:49 PM
Why don't you expand on that, Lamont?

Gladly,

There are ways to ensure a website will not be under penalty for duplicate content, even though it may contain duplicate content. Let's consider the concept of "duplicate content" for a moment. The reason I have never believed in the possibility of Google being able to establish a duplicate content penalty on a website is simple at its foundation. How is it done?

The brains at Google understand one simple fact: you cannot crawl in real time. This leads to the inability to establish an "owner" of content. For there to be a duplicate content penalty you must establish an original "owner", correct? And how is this original owner found? What tells Google that I am the owner of "Hey I love the cache and it is the best forum ever" on www.mydomain.com as opposed to it being posted on www.digg.com? Digg is crawled much more frequently, so the googlebots would find that exact line of content on Digg before they would find it on www.mydomain.com. So is Digg the original content owner and am I the one penalized for duplicate content, or is Digg? And how would the googlebots know the difference? This creates both a problem and a gateway for you.

If you can manage to erect a site that is crawled more frequently than the original owners of the content you scrape, you have solved this problem regardless of whether it matters or not. But that is not the point of my post, just a side note to think about. The point of my post is to touch on building in "stages" and not "content".

Most blackhat websites and spam blogs are easily identified. Why? The overall layout and structure is static. Whether you set up a blackhat site to grow every day or all at once, the structure and layout remain the same for the site. A real website, one worth ranking, is not built like this. A natural website is built with basic functions, and over time new additions are added as well as an improved layout. This is the difference between a website that has 10,000 pages indexed and a website that has 500,000 pages indexed.

Think of a website like about.com. When it began it was simple; over time features and abilities increased. Digg.com, Reddit.com - they are all the same, and layouts/features increased over time. Nothing seemed out of the ordinary, no cloaking, nothing special; they all contain duplicate content and all rank very high. It could be argued that the very fact that they rank high is proof of the duplicate content penalty - that these sites are spidered more often, and therefore the content is considered originally theirs, and that is why they rank high for search terms (these sites are often used as "parasite hosts"). But if that is the case, and the duplicate content penalty is real, how did they come to be such large players on Google in the first place?

It is possible to automate the process of building in "stages". These automated sites will in no way compete with the likes of Digg or Reddit or any social site that has real employees improving it (as opposed to time-delayed feature releases that have been prebuilt into the CMS  Wink )... but I can guarantee they will last in the engines, and if only 10-20 of these are made into a moderate success they will far outweigh any "update this site with new content every day" type of site you can build. If done correctly, the only thing required to launch a new one is a quick template and configuration variable change (most of which can be outsourced for a few hundred dollars).

If you have ever worked at a web 2.0 startup this is obvious information, because web 2.0 websites gather all content from other websites. Yet they stay in power and you will fight hard to out-rank them with your pitiful "make 20 new posts a day" automated blackhat blog, whether it has unique content or not.

Hope this helps. But if you are a newb, please build a few simple "updated content" websites before tackling a goal such as this. As for duplicate content generation, I would recommend a simple synonym replacement approach or a content translation approach (translate to German, translate back to English); it has proved to be the most legible for me.
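
For the translation route, the round trip itself is only a couple of lines once you have some translation backend to call. A minimal Python sketch; translate() here is a deliberately unimplemented placeholder, not a real API, so swap in whatever service or library you actually have access to:

Code:
# Round-trip ("pivot") translation rewrite: English -> German -> English.
# translate() is a placeholder stub, not a real API; wire it to whatever
# translation service or library you actually use.
def translate(text: str, source: str, target: str) -> str:
    """Placeholder: plug a real translation call in here."""
    raise NotImplementedError("hook up your translation backend")

def pivot_rewrite(text: str, pivot: str = "de") -> str:
    """Translate out to the pivot language and back to rough up the wording."""
    intermediate = translate(text, source="en", target=pivot)
    return translate(intermediate, source=pivot, target="en")

It's worth running the result through a similarity check like the one earlier in the thread, since round trips sometimes come back nearly identical and sometimes come back unreadable.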
leaferz
« Reply #12 on: September 11, 2009, 10:55:12 PM »

Quote from: lamontagne on September 11, 2009, 10:37:18 PM
There are ways to ensure a website will not be under penalty for duplicate content, even though it may contain duplicate content. [...] The point of my post is to touch on building in "stages" and not "content". [...]

Good points, lamontagne. I have started a few other long-term, statistical-type websites that use that ideology, whose sole purpose is to eventually help my other tier of websites rank. I have the sections all completed but plan on adding each section every 6 months or so, possibly with a template change.

My overall goal, since I'm still relatively a noob, is to experiment with each type of automation design (cloaked throwaway domains, long-term projects, etc.) until I find a few methods that produce the results I'm looking for.
lamontagne
« Reply #13 on: September 11, 2009, 11:12:12 PM »

Starting out, I would recommend grabbing a copy of WordPress MU and a bunch of the plugins for WordPress and WordPress MU. For WordPress MU there are a few plugins you will need (I can't remember the exact names, but a quick glance through the WPMU plugin repository will bring them up):

The multiple-database plugin - it lets you spread the install across multiple databases so you don't run into a data problem after setting up 1000 blogs (it spreads the blogs out over a number of MySQL databases)
The multiple-host plugin - it lets you use multiple domain names for a WPMU install, which will let you set up thousands of sites controlled through one admin panel

I would also do the following:
Gather up tons and tons of themes, but be careful: some of them have backdoors built in unless they're from a reputable site.
The All in One SEO plugin - it's simple, it works, and it's easy to configure.
Akismet and other anti-spam plugins with captcha. I even went as far as to modify a captcha plugin. Do not be mistaken, there are people who actually do make comments that can provide quality content.

Build a small tool to set up these blogs and make automated posts every day. If you set this up right you can make a small piece of JavaScript that will show affiliate/PPC ads anywhere on the page you want (like AdSense), and it will look like the JavaScript is hosted on that particular site... (because the actual files are all located in the same directory for the hosting, all requests will resolve correctly - use the Host: header, it is your friend here...)

This shouldn't take you more than a month to build and get working right. The XML-RPC posting script that automates the posts to the blogs (you can insert them directly into the database if you're hardcore enough) should rewrite the content using synonyms and translation.

I would spend the majority of your time on a link-dropping script that will bypass anti-spam measures (most people will set their websites to automatically post a comment/entry/topic so long as a captcha test is passed... bad practice... also, there are thousands of neglected blogs and forums that are great for links, and new ones come along every month). In the new stages, focus your energy on a link-dropping script; XRumer will only get you so far.
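
For the posting side, a bare-bones XML-RPC call against a WordPress/WPMU blog can be done with Python's standard library alone. A minimal sketch; the endpoint URL, credentials and post fields are placeholders, and the rewritten article body would come from your rewriter before this call:

Code:
# Bare-bones XML-RPC post to a WordPress / WPMU blog (standard library only).
# URL and credentials are placeholders; the body comes from the rewriter.
import xmlrpc.client

BLOG_URL = "https://example-blog.invalid/xmlrpc.php"  # placeholder endpoint
USER, PASSWORD = "poster", "secret"                   # placeholder credentials

def post_article(title: str, body: str, categories: list) -> str:
    """Publish one post via metaWeblog.newPost and return the new post id."""
    server = xmlrpc.client.ServerProxy(BLOG_URL)
    content = {
        "title": title,
        "description": body,   # metaWeblog calls the post body "description"
        "categories": categories,
    }
    # blog id, user, password, content struct, publish-immediately flag
    return server.metaWeblog.newPost(1, USER, PASSWORD, content, True)

if __name__ == "__main__":
    new_id = post_article("Hello from the posting script",
                          "Rewritten article body goes here.",
                          ["news"])
    print("created post", new_id)

The same ServerProxy handle can loop over a list of blogs, which is where the multiple-host setup above starts to pay off.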
leaferz
« Reply #14 on: September 11, 2009, 11:53:40 PM »

Quote from: lamontagne on September 11, 2009, 11:12:12 PM
Starting out, I would recommend grabbing a copy of WordPress MU and a bunch of the plugins for WordPress and WordPress MU. [...] In the new stages, focus your energy on a link-dropping script; XRumer will only get you so far.

I set up a few WP blogs about a year ago and ended up taking a step back to learn MySQL/PHP/jQuery, because I quickly realized that without a language or two under your belt it's almost impossible to succeed. At least that's been my experience.

Quite often I found myself chasing the latest tool to be released, built on methods well past their due date that never work the way you want them to. At least now I can attempt to stay ahead of the game and code tools based on the results of my own experiments.

At the beginning I wrote down a list of tools I would need to accomplish my immediate goals, then chipped away at them as I became more comfortable with the language(s) and libraries. I left a few tools until the very end (the article rewriter, etc.) that would no doubt be a bit more difficult, but at this stage I'm more than comfortable tackling them.

During my previous attempts, though, it always came back to content. I'm not comfortable writing (as you may have noticed), as my strengths and especially my interests are in coding and figuring out new methods to exploit. I wouldn't mind adding a new language to my toolbox somewhere down the line; Perl and Python have piqued my interest, but that may change when the time comes. For now my next phase of learning would be OO PHP, especially after sifting through Perk's classes and realizing what I'm missing out on.

Sorry if I've gone off topic here, but I never really introduced myself to the forum when I signed up a year ago (almost to the day Wink), and after Perk's welcome it seemed fitting. Anyway, you've both given me a lot to think about before I begin tomorrow and given me exactly the answers I was hoping for.

Thx again.