Mark E. Buckley

Creating a Google Site Map

If you are a webmaster, you may need to create an XML sitemap to help Google index your site more completely. A sitemap is worth creating if your site is large, say over 100 pages, and you do in fact want it indexed on Google.

I have four sites that fit into this category. I have three other 'large' sites, but I do not need them indexed in Google. I also have five other sites where indexing is important, but they have fewer than 50 pages.

There seem to be quite a few firms out there offering to create a site map for you. This might be your most efficient strategy, as the prices seem pretty reasonable.

Personally, I am reluctant to install someone else's code on my site. It is a trust issue, a control issue, and a competitive issue. Other people's code often does not work in your specific circumstances with your specific configuration. It is also difficult to modify code that you did not write yourself. Finally, I just like the challenge of figuring it out myself.

XML

First you should find a sample site map so you know what you are trying to emulate. You can find a sample at Google, and you can also do a search for 'sitemap.xml'.

Once you have a sample in front of you, you can see how the basic structure works.

The first line is a very basic XML prolog. You'll notice that the file is well formed but carries no DTD, so it cannot be validated against one.

Next you will see that the document root is 'urlset', with a Google-specified namespace.

The child elements of the root are 'url'. Each 'url' contains a loc, lastmod, changefreq, and priority element. On the Google site you can read the default values and allowed options for the location, last modified, change frequency and priority elements.
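To make that structure concrete, here is a small sketch that holds a one-entry sample in a PHP string and reads it back with SimpleXML. The example.com address and the values are placeholders. The namespace is the one from the sitemaps.org 0.9 protocol, which may differ from the one in the sample you find. It also assumes the SimpleXML extension is enabled.

```php
<?php
// A made-up, one-entry sample; example.com and the values are placeholders.
$sample = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
XML;

// Parse the sample, then walk the 'url' children of the 'urlset' root.
$urlset = simplexml_load_string($sample);
foreach ($urlset->url as $url) {
    echo $url->loc, ' ', $url->lastmod, ' ', $url->changefreq, ' ', $url->priority, "\n";
}
```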

Now that you know what output you are trying to produce, you can determine how to produce it.

PHP

Of course you will need some type of server-side scripting to produce this XML document. You could use PHP, ASP, Perl, CFM, Python or any language that you are comfortable with.

I always end up using PHP for this type of task. Make sure that whatever language you are using, you have the correct permissions and configuration to read files, read directories and write a file.

Now start a document in Notepad and save it as createmysitemap.php.

Start a PHP block for your variables and keep it at the top of the page. The variables you will want to declare include:

  • A string for the XML Prolog
  • A string for the urlset opening tag
  • A string for the urlset closing tag
  • A string for the full http path of your site
  • A default value for priority
  • A default value for changefreq
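A variable block along these lines might look like the sketch below. All of the names and values are my own placeholders, and the namespace is the sitemaps.org 0.9 one, so swap in whatever your sample shows.

```php
<?php
// Assumed names and values for illustration; replace them with your own.
$xmlProlog   = '<?xml version="1.0" encoding="UTF-8"?>';
// Namespace from the sitemaps.org 0.9 protocol; your sample may use another.
$urlsetOpen  = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';
$urlsetClose = '</urlset>';
// Full http path of the site, with a trailing slash.
$siteUrl     = 'http://www.example.com/';
// Defaults applied to every page unless you decide otherwise.
$defaultPriority   = '0.5';
$defaultChangefreq = 'monthly';
```

Single quotes keep PHP from interpreting anything inside the XML strings.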

Next you will open your directory, read each item, and then close the directory. If the item is a file, you will add it to your file array and also read its last modified date. If the item is a directory, you will add it to your directory array.
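A minimal sketch of that first pass follows. The function name and the shape of the returned arrays are assumptions of mine, not the original script.

```php
<?php
// Hypothetical helper: lists one directory, splitting files from subdirectories.
// Returns array($files, $dirs); $files maps file name => last modified timestamp.
function read_one_directory($path)
{
    $files = array();
    $dirs  = array();

    $handle = opendir($path);
    if ($handle === false) {
        return array($files, $dirs);   // unreadable directory: return empty lists
    }

    while (($item = readdir($handle)) !== false) {
        if ($item === '.' || $item === '..') {
            continue;                  // skip the self and parent entries
        }
        $full = $path . '/' . $item;
        if (is_dir($full)) {
            $dirs[] = $item;
        } elseif (is_file($full)) {
            $files[$item] = filemtime($full);
        }
    }
    closedir($handle);

    return array($files, $dirs);
}

// Usage (the path is a placeholder):
// list($files, $dirs) = read_one_directory('/path/to/your/site');
```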

Now you will go through this process again, this time looping through each value of your directory array. For each directory you will read its files and their last modified dates.
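The second pass could be sketched like this. It uses scandir for brevity and only goes one level deep, as described above. The function name and the path-to-timestamp array are my own assumptions.

```php
<?php
// Hypothetical second pass: $dirs holds the subdirectory names from the first pass.
// Returns an array mapping relative path => last modified timestamp.
function read_subdirectories($root, array $dirs)
{
    $found = array();
    foreach ($dirs as $dir) {
        $items = scandir($root . '/' . $dir);
        if ($items === false) {
            continue;                  // unreadable directory: move on
        }
        foreach ($items as $item) {
            $full = $root . '/' . $dir . '/' . $item;
            // is_file() is false for '.', '..' and nested directories.
            if (is_file($full)) {
                $found[$dir . '/' . $item] = filemtime($full);
            }
        }
    }
    return $found;
}
```

If your site nests deeper than one level, you would need to repeat the pass or make the function recursive.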

Now that you have two really huge arrays, you will output them. You could write the information to a PHP file, a text file or even an XML file.

I found that writing to a text file worked best. I then just copied and pasted the output into an XML file, inside the 'urlset' element.
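Writing the entries to a text file might look roughly like this. The function name, the argument order and the array shape (relative path to timestamp) are assumptions for the sketch.

```php
<?php
// Hypothetical writer: $pages maps a relative path to its last modified timestamp.
function write_entries($outFile, $siteUrl, array $pages, $changefreq, $priority)
{
    $out = '';
    foreach ($pages as $path => $mtime) {
        // Escape &, <, > and quotes so the address is safe inside XML.
        $loc = htmlspecialchars($siteUrl . $path, ENT_QUOTES, 'UTF-8');
        $out .= "<url>\n"
              . "  <loc>$loc</loc>\n"
              . '  <lastmod>' . gmdate('Y-m-d', $mtime) . "</lastmod>\n"
              . "  <changefreq>$changefreq</changefreq>\n"
              . "  <priority>$priority</priority>\n"
              . "</url>\n";
    }
    // file_put_contents returns the number of bytes written, or false on failure.
    return file_put_contents($outFile, $out);
}

// Usage (file name and values are placeholders):
// write_entries('sitemap.txt', 'http://www.example.com/', $pages, 'monthly', '0.5');
```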

Challenges

I ran my code on four different sites. Each time I had to make small adjustments.

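Outputting directly to an XML file was difficult. The challenge is that XML and PHP use similar, conflicting punctuation: both rely on angle brackets and question marks, and the XML prolog begins with the same '<?' characters that open a PHP block. In the end it was simpler to output to a text file.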
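If you would rather skip the copy-and-paste step, one workaround is to keep every piece of XML inside single-quoted PHP strings, so PHP never tries to interpret it. This is not what I did. It is only a sketch with a made-up function name, and it uses the sitemaps.org 0.9 namespace as an assumption.

```php
<?php
// Hypothetical helper: wraps a text file of <url> entries in the prolog and root.
function wrap_entries_as_sitemap($entriesFile, $xmlFile)
{
    $entries = file_get_contents($entriesFile);
    if ($entries === false) {
        return false;                  // nothing to wrap
    }

    // Single quotes stop PHP from treating '<?' or '$' inside the XML as code.
    $xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
         . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n"
         . $entries
         . '</urlset>' . "\n";

    // Returns the number of bytes written, or false on failure.
    return file_put_contents($xmlFile, $xml);
}
```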

Skipping directories and files was another adjustment. You probably have certain files and directories that you do not want to appear in your final XML file. These might be scripting directories, private directories, or statistical reporting directories. In the code section that reads the directory or file, just add a conditional so the item is appended to the array only when it is not one you want to skip. You could create a list of bad directories and add those to your variable area. Then you could just add a series of if $item != $baddirectory1 statements.
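Instead of a long chain of comparisons, the skip list can live in one array and be checked with in_array. The names in the list and the function name are examples I made up.

```php
<?php
// Hypothetical skip list; these names are examples only.
$skipNames = array('cgi-bin', 'private', 'stats', '.htaccess');

function should_skip($item, array $skipNames)
{
    // Always skip the self and parent directory entries.
    if ($item === '.' || $item === '..') {
        return true;
    }
    // Strict comparison avoids surprises with numeric-looking names.
    return in_array($item, $skipNames, true);
}

// Usage inside the directory loop:
// if (should_skip($item, $skipNames)) { continue; }
```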

Reading the last modified date was very inconsistent. You should create a default value to fall back on when the date cannot be read or comes back too old. Then create a conditional: if $date < $defaultdate then $date = $defaultdate. This is not the real code, but you get the idea.
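Spelled out a little further, the fallback might look like this. The cutoff date and the function name are arbitrary choices for the sketch.

```php
<?php
// Hypothetical cutoff: any date earlier than this is treated as unreliable.
$defaultDate = gmmktime(0, 0, 0, 1, 1, 2005);   // 2005-01-01 00:00:00 UTC

function safe_lastmod($file, $defaultDate)
{
    // filemtime returns false on failure; @ hides the warning it would raise.
    $mtime = @filemtime($file);
    if ($mtime === false || $mtime < $defaultDate) {
        $mtime = $defaultDate;
    }
    // YYYY-MM-DD is one of the date formats accepted for lastmod.
    return gmdate('Y-m-d', $mtime);
}
```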

You might want to use the same priority and change frequency on 99 percent of your pages. You may wish to override those values for your most important pages, e.g. your home page and contact page.
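One way to handle those exceptions is a small override table keyed by page. The page names, values and function name below are placeholders.

```php
<?php
$defaultPriority   = '0.5';
$defaultChangefreq = 'monthly';

// Hypothetical overrides, keyed by path relative to the site root.
$overrides = array(
    'index.html'   => array('priority' => '1.0', 'changefreq' => 'weekly'),
    'contact.html' => array('priority' => '0.8'),
);

function page_settings($path, array $overrides, $defaultPriority, $defaultChangefreq)
{
    $settings = array(
        'priority'   => $defaultPriority,
        'changefreq' => $defaultChangefreq,
    );
    if (isset($overrides[$path])) {
        // Override values replace the defaults; keys left out keep the defaults.
        $settings = array_merge($settings, $overrides[$path]);
    }
    return $settings;
}
```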

Fun

This was actually a pretty fun exercise. It is rare to get a chance to combine two different technologies, in this case PHP and XML.