June 12, 2008

SEO 101 - Why Site Maps work to help search engines crawl your site better

Site Maps: A Workaround

Site maps are a great workaround for many of the issues mentioned earlier. They can help a site owner as follows:

• A site map brings more of your deep levels of content to higher levels of your site. Your home page (Level 1) links to your site map page (Level 2). Your site map page links to every important page on your site, thus making those pages Level 3. Using text links that are keyword rich is a good idea for the site map.

• If you have a JavaScript, Flash, Java, or graphical menu system, the site map can ensure your site is indexed. This is not an optimal solution from an SEO point of view. One of the problems is that you end up with just one page linking to the whole site, and minimal other internal links. It does work though.

• If a page is only "linked to" once in your site, a site map can provide a secondary link and ensure that it gets indexed. The search engines look at the links going to a page to see if it’s relevant to index or not. If there is only one link to the page and that’s from a very deep page, they may decide to index something else. If there are two links, you've just doubled the link count to the page, making the page more likely to be indexed. In addition, a link from the site map makes most of the pages of your site "Level 3" deep, thus raising a page that might be six or more clicks from the home page and raising its relevancy.
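The site map page described above is just a page of plain, keyword-rich text links. A minimal sketch of generating one follows; the function name `build_site_map`, the URLs, and the link text are all made up for illustration.

```python
from html import escape

def build_site_map(pages):
    """Return an HTML list with one plain text link per page.

    `pages` is a list of (url, anchor_text) pairs. Both values are
    hypothetical examples, not real pages.
    """
    items = [
        f'  <li><a href="{escape(url, quote=True)}">{escape(text)}</a></li>'
        for url, text in pages
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"

pages = [
    ("/products/blue-widgets.html", "Blue widgets for home workshops"),
    ("/support/returns.html", "Widget returns & refunds"),
]
site_map_html = build_site_map(pages)
print(site_map_html)
```

Because every link is a standard anchor tag, a spider that cannot read a JavaScript or Flash menu can still follow each one.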

So submit your sitemap today at Live Search Webmaster Center.

June 11, 2008

SEO 101 - Search Engine Spam: the dos and don'ts of spamming

Search Engine Spam

So what if your site can be indexed, but the search engines decide not to include it in their index?

There are various techniques that can get your site delisted (banned) from the search engines and they're all labeled "search engine spam."

Search engines do not like spam or site owners who try to spam their results. Search engine users do not like finding spam in their search results.

What is Spam?

Spam is the use of a technique to artificially improve the ranking of a web site. The techniques are generally classified into two groupings:

1. Onsite spamming techniques

2. Offsite spamming techniques

Both of these types of techniques will be covered in further detail later.

Most spam problems come about because:

• The site owner didn't know a specific technique would get their site banned.

• The site owner was trying to do something for their visitors but didn't follow search engine guidelines.

• The site owner attempted to use legitimate optimization but took it too far. Some techniques work fine in moderation but when used in excess these techniques can cause issues.

• The site owner employed an optimization company that doesn't follow the search engine guidelines explained on the following two pages.

• The methods to get a site ranked change constantly. Be aware – what was once a great SEO technique can today get your site banned.

Webmaster Guidelines Include Spam Warnings

Most search engines provide webmaster guidelines to help site owners develop a site suitable for the search engines. You should read, understand, and follow these guidelines:

Google

Yahoo!

MSN

Basically, the guidelines state:

• Write your content for your human visitors, not for spiders.

• Ensure you don't stop the spiders from indexing your site.

• Don’t spam.

If you get caught violating the search engine guidelines, your site is likely to be banned. If your site has been banned, the individual search engines handle how they remove the ban in different ways.

However, they have some things in common and request you take the following actions:

Remove the issue – You're not going to get re-included if the spam is still on the site. No search engine will tell you the reason why you got banned; therefore, make sure your site is cleaner than clean before requesting re-inclusion.

Request Re-inclusion – Get down on your knees and beg forgiveness from the search engines for violating their terms of service. Promise to be a good web site owner in future. Most search engines provide re-inclusion request forms for this purpose.

Wait – Usually for a long time. Generally, re-inclusion requests take three to six months minimum, sometimes longer, to be resolved. This, of course, depends on the degree of violation. Sometimes your site will never get back in the search engine index and you will need to abandon the domain and start again.

Onsite Spamming Techniques

Doorway Pages. This used to be a favorite technique to get sites ranked higher. A doorway page is a page designed to rank well with the search engines. Generally, it looks terrible, usually contains no text of note, and it exists purely to rank well in the search engines. When a visitor lands on the page they get redirected to the correct page within the site.

Cloaking. Similar to doorway pages except that the site detects if a spider is reading the page and displays the doorway page; when a visitor arrives, they view a different page.

Hidden Text. Hidden text is text that is the same color as the background, is font size 1, or has an x,y coordinate that stops it from being displayed by the browser. Hidden text is text that the average visitor will not be able to view but spiders can traverse. As these techniques can be used to hide text from the average user, they have often been used to spam the search engines.

Automated Content Generation. Automatically generating near-duplicate content can be reason for a search engine to ban a site.

Keyword Stuffing. Keyword stuffing is the technique of inserting the keyword you want a page to be found for multiple times on the page, especially in places the average visitor will not see, such as image ALT attributes.

NoFrames and NoScript. These two tags were designed to display content for visitors whose browsers do not support frames or JavaScript. The majority of browsers now support both technologies, so the legitimate use of these tags is slowly decreasing. They are still useful for people with JavaScript turned off, or for providing alternative menus for the search engines.

Offsite Spamming Techniques

Most offsite spamming falls into link spamming and is covered in the Linking section. 

Reciprocal Links. This is where two sites agree to swap links. This is not an issue on a small scale, but make sure you remain within your site’s theme and related themes. If you link outside of your theme, then the search engines may ask questions. Generally, sites won't get banned for this technique but the links are devalued to the point of being worthless.

Link Triangulation. This is where three or more sites agree to swap links. They link in a circle (site A to site B, B to C, and C to A). This is seen as a deliberate attempt to spam the search engines and, when you are caught, can get you immediately banned.

Paid Links. This is where one site pays another to host a link to the site. Generally, this does not result in bans but can result in penalties.

Bad Neighborhoods. Linking to bad neighborhoods, including sites that are already banned, can also be seen as an issue by search engines.

Long Term vs. Short Term?

Most companies expect to keep their domains for a long time and want to build their domain’s reputation and their brand. Therefore, spam techniques should be avoided.

Some people’s method of promotion is to "pump and dump" their web sites – spam their way to the top of the rankings and dump their site when it gets caught. For these people, getting banned and having to start again is the cost of doing business.

Resources

W3C

W3C HTML specification

Google Webmaster Guidelines

Yahoo! Webmaster Guidelines

MSN Webmaster Guidelines

June 10, 2008

SEO 101 - Why Duplicate content is a problem for search engines

Duplicate content can be a major issue for most sites. Search engines do not like duplicate content and may penalize or ban the sites. This can be an issue for many reasons:  

• The "www.sitename.com" and "sitename.com" version of your site could both be indexed. This results in two copies of your site being indexed and listed by the search engines. The problem stems from the fact that some people linking to your site will use the "www.sitename.com" and some just "sitename.com" and, by default, your server accepts both site names and sends them to the same site, though according to standards, these are really two different sites. Unless you detect and correct this issue, you will end up with both sites indexed.

• That blog post you made will appear in the month archive, the category archive, the home page, etc.

• Sites selling goods tend to use the descriptions supplied by the wholesaler. As most goods are sold on multiple sites, each site will essentially duplicate the content found on the other sites.

• Affiliates can unintentionally cause an issue. Affiliates are third parties that resell products or services for a main site. If the affiliates use text from the main site, it can be viewed as duplicate content. Also, if they use spamming techniques, it can reflect poorly on the main site and result in penalties.
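For the "www.sitename.com" versus "sitename.com" case in the first bullet, one common correction is to pick a single host name and permanently redirect the other to it. The sketch below only computes the redirect target. The choice of the www form as canonical and the name `canonical_redirect` are assumptions, and the port number, if any, is ignored.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.sitename.com"              # assumption: the www form is preferred
ALIASES = {"sitename.com", "www.sitename.com"}   # host names served by the same site

def canonical_redirect(url):
    """Return the URL to redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in ALIASES or host == CANONICAL_HOST:
        return None
    return urlunsplit(
        (parts.scheme, CANONICAL_HOST, parts.path, parts.query, parts.fragment)
    )

print(canonical_redirect("http://sitename.com/widgets?page=2"))
print(canonical_redirect("http://www.sitename.com/widgets"))
```

In practice the redirect is usually configured in the web server rather than in application code; the logic is the same either way.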

Duplicate content causes issues for the search engines because:

• It takes additional bandwidth to download the duplicated pages.

• It takes additional storage to store the duplicated pages.

• It takes additional processor cycles to process the duplicated pages.

• It takes additional time to scan the search engine’s index because it contains duplicate pages.

• It does not provide a better experience for the searcher. No searcher wants to see near-identical pages taking up the top 10 results for a search. Search engines want to provide unique and relevant pages, so they try to eliminate duplication.

All of these issues lower the quality of the search engine results. Search engines are particularly keen to rid themselves of this nuisance and whole sites have disappeared because of it. Duplicate content is one of the main reasons pages end up in the supplemental index on Google.

The supplemental index is used by Google when the query entered doesn't return enough results. Therefore, any query satisfied by the supplemental index is, by definition, a low-volume, low-competition search phrase. The supplemental index is where pages go to die and is usually filled with old, removed pages (404) and pages that are not worth indexing properly.

June 09, 2008

SEO 101 - 12-13 of 13 common SEO Roadblocks (Graphics and Pages which are hard to access)

Graphics

Search engines cannot read graphics or graphical text. If the text of your site is a graphic, it is entirely blank to the search engines. The image ALT attribute could be used to tell the search engines what the image text says but, due to widespread abuse of this attribute by those trying to trick the search engines, it doesn't carry much weight in the algorithms.

A better bet would be to convert the graphic to text and use CSS to provide the look and feel you're looking for. This allows the search engines to index the text and also allows the visually impaired using screen readers to access your site information.

CSS is short for “Cascading Style Sheets.” It allows web site developers and users more control over how pages are displayed. With CSS, designers can create style sheets that define how different elements appear, such as headers and links. These style sheets can then be applied to any web page.

Pages Which Are Hard to Access

The more clicks a page is distanced from the home page, the less value is placed on that page by the search engines. Logically this makes sense. If a page takes six clicks to reach, it must be less important to your visitors than one that takes two clicks to reach.

The number of clicks a page takes to be found is referred to as the page’s level:

Level 1 – The home page

Level 2 – All pages linked to from the home page

Level 3 – All pages linked to from the Level 2 pages that have not already been found, and so on.

Some search engines limit the levels of a site they will index. This is much less of an issue today than it was in the past. In general, if possible, limit all pages to be three clicks from the home page. Also put your important content on the higher levels of your site.
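The level definition above amounts to a breadth-first walk of the site's links starting at the home page. Here is a minimal sketch, assuming the internal links are already available as a dictionary; the `links` data and the name `page_levels` are made up for illustration.

```python
from collections import deque

MAX_CLICKS = 3  # the guideline above: keep pages within three clicks of home

def page_levels(links, home):
    """Breadth-first search: a page's level is 1 plus the fewest clicks from home."""
    levels = {home: 1}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in levels:  # first visit is always the shallowest
                levels[target] = levels[page] + 1
                queue.append(target)
    return levels

# Hypothetical site: each key links to the pages in its list
links = {
    "/": ["/products", "/about"],
    "/products": ["/products/widgets"],
    "/products/widgets": ["/products/widgets/blue"],
    "/products/widgets/blue": ["/products/widgets/blue/specs"],
}
levels = page_levels(links, "/")
too_deep = sorted(p for p, lvl in levels.items() if lvl - 1 > MAX_CLICKS)
print(levels)
print(too_deep)
```

Adding a link from a high-level page, such as a site map page, to a flagged page lowers its level the next time the walk is run.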

June 07, 2008

Being around RD's is just special!!!

So what are RD's? Well, you can find out about the program, which Kevin Schuler of Microsoft runs, and guess what: he was an RD too. Read about what an RD is and how to become one. If you want to hear from the RD's themselves, see what they are up to and who they are: the 140, the proud, the RD's.

This is just an amazing group of people, and I love being around them anytime I get a chance. They are not only technology savvy and, in most cases, have been around Microsoft stuff longer than I've been alive, but they also have a lot of fans. Being among them is like being among the developer stars! There are more PhD's, book authors, trainers, and successful business men and women at this one table than you can count, and we only had 20 who could attend last night.

You can see me way in the back, along with Kate Gregory, who is a super guru in C++. She delivered a 400-level session in C++ at Tech Ed 2008 Orlando yesterday. She had 50 people attend, and I told her it's because of her name, how cool she is, and how well known she is. Can you imagine a super duper tech talk in C++ that deep?

So, a bit of history: the sessions at Tech Ed are numbered. The 200's are mostly semi-technical, the 300's show and compile code, and in the 400's the speakers not only code on the fly and compile, but if they run into errors they fix them super fast. So can you imagine doing that with C++? Only Kate.

Anyway, the table is full of super cool people. And the best part is I get to hang with them again today and show them our http://webmaster.live.com product and the http://search.live.com/developer sites.

Awesome RD's at TechEd 2008 Orlando

For those of you who don't know these super cool people, here goes the list: Scott Golightly, Richard Hundhausen, Jaser Elmorsy, me (Ani), Eurico Bras, Scott Stanfield (from Vertigo), Carl Franklin (from .NET Rocks), Fernando Guerrero, Jonathan Zuck (front and center), Dr.Neil, Kate Gregory, Tim Huckaby (from InterKnowlogy), Jonathan Goodyear, Ken & Tricia Spencer, Brian Noyes, David Starr (the main guy for the biggest code camp to date!), Chris Menegay.

If you want to find out more about what Code Camps are, read the manifesto from Thom Robbins, and to learn how to run your own, check this out.

June 06, 2008

SEO 101 - 9-11 of 13 common SEO Roadblocks (Login & Other User Entry Form Pages, JavaScript, Flash Sites)

Login and Other User Entry Form Pages

If your site requires user registration before it will display content, I have news for you – it won't be indexed by the search engine spiders.

Spiders do not submit forms, so any content that requires a login or asks the user to enter a country before showing a page will not be indexed.

JavaScript

JavaScript is a scripting language originally developed by Netscape for use within HTML web pages – JavaScript and Java are not related. Overall, the search engine spiders do not execute JavaScript. If your site has a JavaScript menu system, the spiders will ignore it. There are workarounds to this issue. You can provide an alternative navigation system, hard coded in HTML, and use standard anchor tags to enable the search engine spiders to index your site.
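To see why a hard-coded HTML menu helps, here is a rough sketch of how a spider that does not execute JavaScript might collect links from a page. It uses Python's standard `html.parser`; the class name `AnchorCollector` and the sample markup are made up, and real spiders are far more elaborate.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect href values from standard <a> tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip links that only work by running script
        if href and not href.lower().startswith("javascript:"):
            self.links.append(href)

# Hypothetical page: two script-driven menu entries and one plain anchor
page = """
<div onclick="window.location='/products.html'">Products</div>
<a href="javascript:openMenu()">Menu</a>
<a href="/about.html">About us</a>
"""
collector = AnchorCollector()
collector.feed(page)
print(collector.links)
```

Only the plain anchor is found, so `/products.html` would stay undiscovered unless it is also linked with a standard anchor tag somewhere.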

Flash Sites

Flash is a graphics animation program that uses vector graphics. Flash files occur most commonly in animated advertisements on web pages and rich-media web sites.

Search engine spiders have limited ability to read Flash. They can decipher some of the content but, as a general rule of thumb, you should assume that the search engines ignore Flash.

If your menu system is in Flash, provide a secondary HTML menu system.

If your site’s content is in Flash, provide HTML versions of the page for the spiders to index.

Both of these tactics will help users who have Flash turned off on their browser, including users who can't run Flash and visually impaired visitors who use screen reading software.

June 05, 2008

SEO 101 - 8 of 13 common SEO Roadblocks (Frames)

Frames

Frames are a method of displaying multiple web pages simultaneously on your browser. In general they were used to isolate the headers and footers from the main content.

A typical setup might be:

• A container page – sets up the frame set, including masthead, left-hand menu, main content, and footer frames

• A masthead page

• A left-hand menu page

• Multiple main content pages

• A footer page

The problem for search engine spiders is that they cannot assemble the component pages and index them as one. If they index the main content page, there is no way for the spider to deduce which of the other pages should be indexed to complete the page. In addition, if the visitor lands on the internal main content page, there is no masthead, no footers, and no menu system. There are JavaScript workarounds for this issue but, in general, avoid frames.

What happens when Search Frog meets MSDN guys?

Leap into better Search, submit your sitemap to http://webmaster.live.com!

If you were at TechEd 2008 Orlando today and you walked by the Virtual Earth booth, the Wiley publishing booth, the MVP cabana or the Architecture track booths, you most likely picked up a green slingshot frog, you thought they were cool and you told all your friends to submit their sitemap to Webmaster.live.com!

Then the famous MSDN Source Fource Action Figures came around, and they couldn't refuse the frogs either. In their case, they had to keep one hand free to hand out little cards to people, except they got tackled when people saw the frogs!!! So now you know what happens when a Microsoft geek chick takes the leaping Search Frog to the MSDN Source Fource Action Figures: they totally have a blast! :)

Just for statistics' sake, we've given away 1400 frogs today! There will only be a few more tomorrow at the 1-2:15pm session in S230A, the "Advance Search Engine Optimization: Driving More Traffic from Search" talk with Nathan Buggia. Look for me (Ani) at the front with the box of green frogs! And don't forget to submit your sitemaps to http://webmaster.live.com.

MSDN characters with Live Search Frogs

June 04, 2008

SEO 101 - 7 of 13 common SEO Roadblocks (Splash Pages)

Splash Pages

This was once a very popular way to design a web site. Thankfully, it is slowly being consigned to history. A splash page is usually an image, Java, or Flash page that appears to the visitor before they access your main site. Usually these splash pages are all images with little or no text, and usually offer no way to click into the site except by clicking on the Java or Flash component. By adding this in front of your site, you're committing a number of sins.

If you don't provide a way for the search engine spiders to get past your splash page, they may not be able to access your site to index it. One of the cardinal sins is to always force new visitors to see the splash screen. As the search engines' spiders tend to not preserve state, they are always seen as new visitors and so are continually redirected to the splash page.

The best way to fix this issue is to:

1. Remove the splash page; or

2. If you can't remove the splash page, add an HTML link to the home page of your site and never force internal page accesses to go to the splash page.

June 03, 2008

SEO 101 - 5 and 6 of 13 common SEO Roadblocks (HTML Extensions and Robots.txt)

HTML Extensions

In addition to broken and poorly formed HTML, some browsers have taken it upon themselves to extend the basic W3C HTML specifications by adding their own tags. The search engine spiders may not support such non-standard tags and so pages that display well in Internet Explorer may not be readable by the spiders.

Robots.txt

The robots.txt file is a text file placed in the root folder of your web site that tells the search engines which pages on your site you would prefer didn't get indexed. In general, the major spiders will obey your request and not index your pages. However, this is a voluntary task undertaken by the spiders, and blocking pages through the use of a robots.txt file will not necessarily stop email harvesters from reading your pages.

A major problem can arise when you write your robots.txt file in such a way that you accidentally block all or part of your site from the spiders.

The basic robots.txt file consists of one or more lines of text as follows:

User-agent: *      # the spider
Disallow: /tmp     # what is to be disallowed
Disallow: /logs    # what is to be disallowed

In the previous example, all spiders (the user-agent: *) are requested not to index any pages starting with /tmp and /logs. The problem becomes that the disallow strings assume a wildcard at the end, so blocking /logs will stop the spiders from indexing /logs, /logs/log1.txt, and /logsee.php, but you may not have meant to block /logsee.php.
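The implied trailing wildcard is simply a "starts with" test. Here is a simplified sketch of that check, not any particular spider's code; the function name `is_disallowed` is made up.

```python
def is_disallowed(path, disallow_rules):
    """Original-style check: a rule blocks every path that starts with it.

    An empty rule blocks nothing, matching an empty "Disallow:" line.
    """
    return any(rule and path.startswith(rule) for rule in disallow_rules)

rules = ["/tmp", "/logs"]
for path in ["/logs", "/logs/log1.txt", "/logsee.php", "/blog/post.html"]:
    print(path, is_disallowed(path, rules))

# A trailing slash narrows the rule to paths inside the directory
print(is_disallowed("/logsee.php", ["/tmp/", "/logs/"]))
```

Writing the rule as "/logs/" avoids catching /logsee.php by accident, at the cost of no longer matching the bare "/logs" URL.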

In addition, some search engines have extended the robots.txt specification so that it allows pattern matching. Pattern matching is where, instead of a plain match against the start of the URL, wildcard characters are introduced to allow partial URL matching. This includes "*" for any sequence of characters, and "$" to mean the end of the URL. So, for instance, the line "Disallow: /abc*?$" for Google means disallow all URLs that start with "/abc" and end with a "?". For search engines that do not support pattern matching, it means disallow all URLs that start with the literal string "/abc*?$".
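The two readings of "Disallow: /abc*?$" can be compared side by side. The sketch below translates the extended syntax into a regular expression; it is an illustration of the rules as described here, not a search engine's actual matcher, and the name `pattern_to_regex` is made up.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt pattern using '*' and '$' into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything, then turn each escaped '*' into "any characters"
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

rule = "/abc*?$"
extended = pattern_to_regex(rule)

for url in ["/abcdef?", "/abc?", "/abcdef?x=1", "/xabc?"]:
    with_patterns = extended.match(url) is not None
    literal_prefix = url.startswith(rule)  # spider without pattern support
    print(url, with_patterns, literal_prefix)
```

A spider without pattern support blocks none of these URLs, because no real URL on the site is likely to start with the literal characters "/abc*?$".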

In addition, some, but not all, spiders support the "Allow:" command. The Allow command has the same syntax as Disallow but explicitly tells the search engine spiders they can index the page referenced.

Google gives an example robots.txt file that will block all robots except Googlebot from indexing your site:

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

Because these additions are non-standard, they often confuse people.
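One way to reason about such a file is to read it the way a spider might: find the group that names the spider, fall back to the "*" group, then apply the rules. The sketch below makes several simplifying assumptions that real spiders do not all share: agent names must match exactly (ignoring case), the first matching rule decides, and matching is by plain prefix. The names `parse_groups` and `can_fetch` are made up.

```python
def parse_groups(text):
    """Parse robots.txt text into {agent_lowercase: [(directive, path), ...]}."""
    groups, agents, seen_rule = {}, [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:                     # a rule line closed the last group
                agents, seen_rule = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow") and agents:
            seen_rule = True
            for agent in agents:
                groups[agent].append((field, value))
    return groups

def can_fetch(groups, agent, path):
    """Assumption: the first rule whose prefix matches decides the outcome."""
    rules = groups.get(agent.lower(), groups.get("*", []))
    for directive, rule in rules:
        if rule and path.startswith(rule):
            return directive == "allow"
    return True                               # no matching rule: allowed

robots_txt = """
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""
groups = parse_groups(robots_txt)
print(can_fetch(groups, "Googlebot", "/page.html"))
print(can_fetch(groups, "OtherBot", "/page.html"))
```

A spider that does not understand "Allow:" would see an empty rule set in its own group, which in this sketch also means everything is allowed.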