Chris Clark: Creating original content through web mining

Tuesday, June 28, 2005

Creating original content through web mining

There are quite a few products out there that artificially inflate your web site content by stealing stories from others who publish RSS feeds.

The basic idea is to find several RSS feeds whose keywords match the content you are trying to create, download those feeds, and steal their stories to put on your web site.

It's easy to read RSS and post it to a blog. For example, Blogger gives you the option of sending stories to your blog through email. So it would be very easy to write a program that reads several RSS feeds, filters the stories, and emails the contents to your blog.

But that's for lame asses who can't create any original content. They are counting on getting high search engine rankings because they have a lot of on-topic content that changes frequently. It's a desperate attempt to get lots of pages into the search engines.

A lot of these programs sell you on the concept of timed postings, which appear to freshen your blog content even when you are on vacation. You can schedule stories weeks in advance and have them posted to your blog at timed intervals.

Any way you cut it, it's still stealing. You could call it stealing on a schedule if you want. But it's still stealing.

What I think is a better idea is creating content by building a knowledge base around a topic. If you have a program that searches the Internet for stories, you can parse the stories into individual sentences. Parsing the sentences further into parts of speech (noun, verb, adjective, etc.) and mapping the relationships between them will allow you to build a fact table.
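To make the idea concrete, here is a minimal sketch in Python of the sentence-to-fact-table step. Everything in it is an assumption for illustration: the tiny `LEXICON` stands in for a real part-of-speech tagger, and a "fact" is reduced to the first noun, verb, noun sequence in a sentence. A real system would need far more than this.

```python
import re

# Toy lexicon (hypothetical): a real system would use a trained
# part-of-speech tagger instead of a hand-written word list.
LEXICON = {
    "senate": "NOUN", "bill": "NOUN", "governor": "NOUN", "law": "NOUN",
    "passed": "VERB", "signed": "VERB", "vetoed": "VERB",
    "new": "ADJ", "the": "DET", "a": "DET",
}

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (naive)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tag(sentence):
    """Return (word, tag) pairs; words not in the lexicon are tagged 'UNK'."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    return [(w, LEXICON.get(w, "UNK")) for w in words]

def extract_fact(tagged):
    """Return the first (noun, verb, noun) triple found in order, or None."""
    subject = verb = None
    for word, pos in tagged:
        if pos == "NOUN" and subject is None:
            subject = word
        elif pos == "VERB" and subject is not None and verb is None:
            verb = word
        elif pos == "NOUN" and verb is not None:
            return (subject, verb, word)
    return None

def build_fact_table(text):
    """Collect one fact triple per sentence, skipping sentences with none."""
    facts = []
    for sentence in split_sentences(text):
        fact = extract_fact(tag(sentence))
        if fact:
            facts.append(fact)
    return facts

if __name__ == "__main__":
    story = "The senate passed the new bill. The governor signed the bill!"
    for fact in build_fact_table(story):
        print(fact)
```

The triple format is deliberately crude. It only shows how parsed sentences could feed a table of who did what to what.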

For example, if I were to build a program that searched all the major news outlets for stories on the headline of the week, I bet it would find plenty of similarities in the fact tables it built. The who, what, when, where, how and why would probably all be reflected in the stories. Using the fact tables and the relevance of the web site source (i.e., CNN would outrank Bob's news), you could build a structure of words and phrases that establish facts.

When a fact shows up in 3 unique sources, all credible I would hope, you could determine that it outweighs other facts that have maybe only 1 unique source.
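A rough sketch of that weighting in Python might look like the following. The source names and the 0 to 10 credibility numbers are made up for illustration, and summing the weights of unique sources is just one assumed way to combine them.

```python
from collections import defaultdict

# Hypothetical credibility weights on a 0-10 scale. In practice these
# might be derived from something like inbound link counts.
CREDIBILITY = {
    "cnn.com": 9,
    "paper-a.example": 8,
    "paper-b.example": 8,
    "bobsnews.example": 2,
}
DEFAULT_CREDIBILITY = 1  # assumed weight for sources we know nothing about

def score_facts(observations):
    """observations: iterable of (fact, source) pairs.

    Returns {fact: (unique_source_count, summed_credibility)}.
    A source repeating the same fact is only counted once.
    """
    sources_by_fact = defaultdict(set)
    for fact, source in observations:
        sources_by_fact[fact].add(source)
    scores = {}
    for fact, sources in sources_by_fact.items():
        weight = sum(CREDIBILITY.get(s, DEFAULT_CREDIBILITY) for s in sources)
        scores[fact] = (len(sources), weight)
    return scores

def rank_facts(observations, min_sources=1):
    """Return (fact, source_count, weight) rows, heaviest first."""
    scores = score_facts(observations)
    kept = [(fact, n, w) for fact, (n, w) in scores.items() if n >= min_sources]
    return sorted(kept, key=lambda row: row[2], reverse=True)

if __name__ == "__main__":
    seen = [
        ("senate passed bill", "cnn.com"),
        ("senate passed bill", "paper-a.example"),
        ("senate passed bill", "paper-b.example"),
        ("senate passed bill", "cnn.com"),           # repeat, ignored
        ("governor vetoed bill", "bobsnews.example"),
    ]
    for row in rank_facts(seen):
        print(row)
```

Passing `min_sources=3` to `rank_facts` would drop anything that fewer than 3 unique sources agree on.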

Using the knowledge base of facts related to a topic of interest, you could create a program that summarizes the facts and writes unique paragraphs from them, which would read as original content.

For some time I have thought about creating a program that would read RSS feeds and filter them based on keyword relevance in the posts. Once I had the text, I would parse it using some type of English-language part-of-speech tagger to determine nouns, verbs, adjectives, etc. All of that would be built into hyperbolic trees of keyword associations, weighted by the credibility of the source (easily determined by inbound links and such). After that, I would add in intelligence like word synonyms using a thesaurus, and part-of-speech learning using a dictionary and sentence structure.
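The first step, reading a feed and filtering by keyword relevance, could be sketched like this in Python using only the standard library. The assumptions here are mine: the feed is plain RSS 2.0 already downloaded into a string, "relevance" is simply the fraction of words in a post that are keywords, and the small `SYNONYMS` table is a hypothetical stand-in for a real thesaurus.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical synonym table standing in for a real thesaurus.
SYNONYMS = {"election": {"vote", "ballot"}}

def expand_keywords(keywords):
    """Lowercase the keywords and add any known synonyms."""
    expanded = set()
    for kw in keywords:
        kw = kw.lower()
        expanded.add(kw)
        expanded |= SYNONYMS.get(kw, set())
    return expanded

def parse_rss(xml_text):
    """Return a list of {title, description, link} dicts from RSS 2.0 text."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
            "link": item.findtext("link", default=""),
        })
    return items

def relevance(item, terms):
    """Fraction of the words in title + description that are in terms."""
    text = (item["title"] + " " + item["description"]).lower()
    words = re.findall(r"[a-z']+", text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in terms)
    return hits / len(words)

def filter_items(items, keywords, threshold=0.1):
    """Keep items whose relevance to the expanded keywords meets the threshold."""
    terms = expand_keywords(keywords)
    return [it for it in items if relevance(it, terms) >= threshold]
```

Keeping the `link` with each item matters later on, since it is what lets you note your sources.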

Once you have something like this, you would need a program like the nonsense generator to create a unique template for how you want your stories to appear. In no time flat, you could create huge amounts of original content that is not stolen and in fact meets some ethical standard of source notation and value.
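As a very rough illustration of the template step, here is a Python sketch that fills randomly chosen sentence templates from facts. It is not how the nonsense generator itself works. The templates, the (subject, verb, object) fact format, and the function names are all assumptions of mine.

```python
import random

# Hypothetical sentence templates. Each one notes how many sources
# back the fact, as a nod to source notation.
TEMPLATES = [
    "The {subject} {verb} the {object}, according to {n} {noun}.",
    "According to {n} {noun}, the {subject} {verb} the {object}.",
]

def render_fact(fact, n_sources, rng):
    """Fill one randomly chosen template from a (subject, verb, object) fact."""
    subject, verb, obj = fact
    noun = "source" if n_sources == 1 else "sources"
    template = rng.choice(TEMPLATES)
    return template.format(
        subject=subject, verb=verb, object=obj, n=n_sources, noun=noun
    )

def write_paragraph(facts_with_counts, seed=None):
    """facts_with_counts: list of (fact_triple, n_sources) pairs.

    A fixed seed makes the template choices repeatable.
    """
    rng = random.Random(seed)
    return " ".join(render_fact(f, n, rng) for f, n in facts_with_counts)

if __name__ == "__main__":
    facts = [
        (("senate", "passed", "bill"), 3),
        (("governor", "signed", "bill"), 1),
    ]
    print(write_paragraph(facts, seed=1))
```

More templates, plus synonym swaps, would be what keeps the output from reading the same every time.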

Reference: http://nonsense.sourceforge.net/

Chris Clark
