blogs.gnome.org, permalinks, and robots=noindex



I hope this is the right list to bring this up.  Basically, I'm
looking for the people who maintain blogs.gnome.org, meaning the blogs
hosted on it (blogs.gnome.org/view/username), not the redirect from
b.g.o itself to planet.gnome.org.

Like others, I have a blogs.gnome.org blog [1].  That link [1] is the
'main page', which shows the standard five most recent posts.  The
main page also points to permalinks, such as [5].  Furthermore, the
date suffix of that URL is hackable to give e.g. "All posts for 2007",
"for May 2007" or "for 1 May 2007" [2,3,4].  Here's a pretty cascade
of links:

[1] http://blogs.gnome.org/view/nigeltao
[2] http://blogs.gnome.org/view/nigeltao/2007
[3] http://blogs.gnome.org/view/nigeltao/2007/05
[4] http://blogs.gnome.org/view/nigeltao/2007/05/01
[5] http://blogs.gnome.org/view/nigeltao/2007/05/01/0

Unfortunately, the permalinked page [5] has this HTML snippet in its <head>:
<meta name="robots" content="noindex,follow" />
which means that search engines should, uh, not index it.
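
To check a page for this without eyeballing the source, here is a minimal sketch using only Python's standard library.  The names RobotsMetaParser, robots_directives and is_indexable are made up for illustration, and it only looks at the meta tag (not robots.txt or HTTP headers).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.extend(
                part.strip().lower()
                for part in content.split(",")
                if part.strip()
            )


def robots_directives(html):
    """Return the robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def is_indexable(html):
    """Assume a page is indexable unless it says noindex (or none)."""
    directives = robots_directives(html)
    return "noindex" not in directives and "none" not in directives
```

Feeding it the snippet above gives the directives "noindex" and "follow", so is_indexable returns False.  A page with no robots meta tag at all is treated as indexable, which is the usual crawler default.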

This particularly affects the "Bloggers of Planet GNOME" custom search
engine [6], which runs off planet.gnome.org's OPML file.  For b.g.o
hosted blogs, that file points (via the ATOM feed [7]) to the
permalinked pages (ones that look like [5]).  Hence sizable chunks
(40ish member blogs) of p.g.o are currently not searchable, even from
vanilla Google (let alone the custom search engine).

[6] http://mail.gnome.org/archives/gnome-announce-list/2006-November/msg00030.html
[7] http://blogs.gnome.org/syndicate/nigeltao

For example, Elijah's entry in that OPML file points to
http://blogs.gnome.org/syndicate/newren
which contains
<guid isPermaLink="true">http://blogs.gnome.org/view/newren/2007/04/18/0</guid>
and that page contains
<meta name="robots" content="noindex,follow" />
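
The feed-to-permalink step of that chain can be sketched like this.  It assumes the feed is RSS-shaped, as the guid element above suggests; SAMPLE_FEED is a hand-written stand-in built around that one guid line, not the real feed, and the function name permalinks is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in; only the <guid> line comes from the real feed.
SAMPLE_FEED = """<rss version="2.0"><channel><item>
<guid isPermaLink="true">http://blogs.gnome.org/view/newren/2007/04/18/0</guid>
</item></channel></rss>"""


def permalinks(feed_xml):
    """Return the text of every <guid> that is marked as a permalink."""
    root = ET.fromstring(feed_xml)
    links = []
    for guid in root.iter("guid"):
        # RSS 2.0 treats a missing isPermaLink attribute as "true".
        is_permalink = guid.get("isPermaLink", "true").lower() == "true"
        if is_permalink and guid.text:
            links.append(guid.text.strip())
    return links
```

Each URL this returns is a page a search engine would be sent to, so those are the pages whose robots meta tag matters.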

What is odd is that some of these pages are marked as indexable and
some aren't.  For example, [5], [4] and [2] are noindex, but [3] and
[1] have
<meta name="robots" content="index,follow" />
Note that this says "index", instead of "noindex".

This means that the monthly summaries actually do show up on Google.
For example, http://www.google.com/search?q=foxybuntu+site%3Ablogs.gnome.org
finds my October 2006 summary page
http://blogs.gnome.org/view/nigeltao/2006/10
but not the actual post
http://blogs.gnome.org/view/nigeltao/2006/10/02/0


Basically, the noindex-ability of the blog pages seems IMHO (1)
arbitrary and (2) wrong.  Normally, in good open source style, I'd
write a patch, but I don't know what software is running
blogs.gnome.org, so instead I'm making noise on this list.

My suggestion would be to scrap the <meta name="robots" ...> tag
entirely - I don't see why we would want to hide from searchers (GNOME
users!) the same content we proudly share on p.g.o.  If indeed there
are good reasons to noindex (e.g. avoiding duplicate search results,
although search engines are good enough these days to skip dupes), I
say make [1] (the blog's homepage) and [5] (the permalink) index, and
[2], [3] and [4] (yearly / monthly / daily summaries) noindex.
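
To make that fallback rule concrete, here is a rough sketch that picks the robots value from the shape of the URL.  It assumes every blog URL looks like /view/username followed by zero to four path components (year, month, day, post number), as in [1] to [5]; the function name suggested_robots is made up, and since I don't know what software runs b.g.o this is not a patch against it.

```python
from urllib.parse import urlparse


def suggested_robots(url):
    """Return the proposed robots content value for a b.g.o blog URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected shape: view/<username>[/<year>[/<month>[/<day>[/<post>]]]]
    if len(parts) < 2 or parts[0] != "view":
        raise ValueError("not a blog URL: %s" % url)
    date_parts = parts[2:]
    if len(date_parts) in (0, 4):
        # Blog homepage or a permalinked post.
        return "index,follow"
    if len(date_parts) <= 3:
        # Yearly, monthly or daily summary.
        return "noindex,follow"
    raise ValueError("unexpected URL depth: %s" % url)
```

With this, [1] and [5] come out as "index,follow" and [2], [3] and [4] as "noindex,follow".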


Cheers,
Nigel (wearing his GNOME hat).


