You can't go far these days without tripping over commentary on Google's strategy. I've not really paid this much attention, but it's been interesting watching the launch of Google Scholar and the reactions from the library communities, because it directly intersects with my day job: managing the team that has built, and is enhancing, IngentaConnect, my employer's new scholarly content aggregation site.
I thought it might be interesting to share some perspectives on working with Google and a couple of notes on Google Scholar itself.
My involvement with Google goes back about nine months, to when they contacted us to see if we wanted to collaborate with them on an initiative to add more scholarly content to the Google indexes. Of course we jumped at the chance: this is undeniably a Good Thing (both for us and the publishers we work with).
So the first step there was to help them ensure that the crawler could get to all the content. Our original site had a fairly crufty link syntax (too much reliance on query strings) so the first issue was tweaking their crawler to work around this. The new site is much cleaner as, like a lot of people, we've learnt a thing or two about REST recently.
The second issue was to ensure that the crawler got the full text, so they could work their magic on the full content rather than just the titles and abstracts. A bit of sleight-of-hand at our end ensured that the crawler got what it needed, while the URLs in the Google index remained a suitable entry point for an end user.
Like any search engine, the Googlebot simply adds the URLs that it GETs to the index, so you have to think a bit about your URL structure and where you route the bot if it's different to where you'd normally route a user. The crawler doesn't seem to have any real notion of a "preferred" URL for content: it investigates every link and uses content-based checksums to de-duplicate the data.
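To make the de-duplication point concrete, here's a minimal sketch of checksum-based de-duplication. I don't know how Google actually implements this; the function name, the choice of SHA-256 and the example.org URLs are all assumptions for illustration only.

```python
import hashlib


def dedupe_by_checksum(pages):
    """Map each distinct page body to the first URL that served it.

    `pages` is an iterable of (url, body_bytes) pairs. The checksum
    algorithm (SHA-256) is an assumption, not Google's documented choice.
    """
    seen = {}  # checksum -> first URL seen with that content
    for url, body in pages:
        digest = hashlib.sha256(body).hexdigest()
        # setdefault keeps the first URL and ignores later duplicates
        seen.setdefault(digest, url)
    return seen


# Hypothetical crawl results: two URLs serve byte-identical content.
pages = [
    ("http://example.org/content?id=42", b"<html>Article 42</html>"),
    ("http://example.org/content/42", b"<html>Article 42</html>"),
    ("http://example.org/content/43", b"<html>Article 43</html>"),
]
unique = dedupe_by_checksum(pages)
```

The practical consequence is that whichever URL gets crawled first for a given piece of content is the one that survives, so it pays to make sure every route into the content is one you'd be happy to see in the index.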
You can also make the bot's life easier by providing it with a "sitemap" so it can quickly harvest all the content. So this is my first tip to site owners: publish an index of your site specifically for the Googlebot (and other crawlers) and you'll be indexed much quicker. If you contact Google you should be able to get the index added to the crawl, which is useful if you don't care to publish the sitemap to end users.
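As a rough illustration of what such a crawler index might look like, the sketch below renders a flat HTML page of links from a list of URLs. This isn't our actual implementation: the function name, page layout and example URLs are assumptions.

```python
from html import escape


def build_crawler_index(urls, title="Content index"):
    """Render a plain HTML page with one link per content URL."""
    links = "\n".join(
        # Escape the URL both as an attribute value and as link text
        f'<li><a href="{escape(u, quote=True)}">{escape(u)}</a></li>'
        for u in urls
    )
    return (
        f"<html><head><title>{escape(title)}</title></head>"
        f"<body><ul>\n{links}\n</ul></body></html>"
    )


# Hypothetical article URLs; a real index would be generated from
# whatever store holds the site's content.
page = build_crawler_index([
    "http://example.org/content/42",
    "http://example.org/search?q=rest&page=2",
])
```

For a large site you would probably split the index across several pages rather than emit one enormous list.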
We turned all this work around very quickly and it was then just a matter of sitting back and watching the Googlebot wash over us. Well, that and play with the Google frisbee that their marketing department sent me. Actually, that's a lie. They must breed them differently in the Googleplex. Go outside? Run, like, around? Surely some mistake?!(*)
Early on in the discussions I'd checked with Anurag whether we could co-ordinate to ensure that the 'bot came in at quiet times to avoid swamping the servers. But that's not how it works: the Googlebot wanders where it might and can't be trained to index particular sites at particular times. It is performance sensitive, though, so it will back off from a site if the response times start to increase.
So, another tip I'd share is to rap the 'bot on the nose by throttling it (e.g. via Apache) so that it becomes a much friendlier beast to work with. You're then in a better position to control when and how quickly the bot hits your site. Even with their built-in rate limiting you can get sudden peaks of load that could swamp a server.
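The throttling idea can be sketched independently of any particular web server. The example below is a simple token bucket applied to requests identified as coming from a crawler. It is not an Apache configuration, and the class, the rate figures and the User-Agent check are all assumptions chosen for illustration.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the bucket can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def is_crawler(user_agent):
    """Crude check: treat any User-Agent mentioning Googlebot as the crawler."""
    return "googlebot" in user_agent.lower()


# Hypothetical policy: at most 2 crawler requests per second, bursts of 5.
crawler_bucket = TokenBucket(rate=2, capacity=5)
```

A request handler would call `crawler_bucket.allow()` only when `is_crawler()` is true, and answer a refused request with something like a 503 plus a Retry-After header, leaving ordinary users unaffected.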
With our content appearing in the Google indexes it's been interesting to watch the referral traffic increase very nicely. Now that Google Scholar has launched, referrals from the new site have similarly jumped into life; they must already have attracted a large user base, which isn't that surprising.
I also had a bit of fun with bookmarklets to help me highlight our content in the Scholar indexes, check what's been indexed, etc. Note these are only certified for a real man's browser at the moment. There are more to come, to better tie in Google Scholar results both with our own site and others.
Surprisingly, Google seem to be being a bit cagey about who is in or out of the Scholar indexes and their criteria for selection. I know we're in, and I also know they were working with the CrossRef folk among others, so that's a fair percentage of scholarly publishers. I've also seen PubMed and other well-known sites cropping up repeatedly in test searches I've done on the site.
This highlights another misconception I've seen in some of the recent commentary: as far as I can tell Scholar is not yet making more of the invisible web visible; it's mainly a subset of Google's existing index. I don't see that they've created a custom crawler, so I'm expecting data to appear in both the main index and Scholar. The latter has just had some limited editorial input (domain selection, from what I can see) and some extra processing, e.g. citation extraction and analysis.
Based on hard-won experience I can predict a number of debates about Google Scholar that are still to come, but one that's worth mentioning now is the old structured metadata versus text indexing debate. In fact Danny is on this tack already.
For what it's worth IngentaConnect has had Dublin Core metadata embedded in article pages since the first beta, with RDF to follow soon. This ought to help anyone interested in writing a scutter. Again more details to follow.
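For anyone wondering what a scutter might do with embedded Dublin Core, here's a minimal sketch that pulls `DC.*` meta elements out of a page using Python's standard `html.parser`. The sample markup is made up for the example and is an assumption about the general shape of such pages, not a copy of what IngentaConnect actually serves.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect <meta name="DC.*" content="..."> values from an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}  # e.g. {"DC.creator": ["Doe, Jane", ...]}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        # Match the "DC." prefix case-insensitively; keep repeated
        # elements (such as multiple creators) as a list
        if name.lower().startswith("dc."):
            self.metadata.setdefault(name, []).append(attrs.get("content") or "")


# Hypothetical article page header.
sample = """<html><head>
<meta name="DC.title" content="An Example Article">
<meta name="DC.creator" content="Doe, Jane">
<meta name="DC.creator" content="Roe, Richard">
<meta name="description" content="not Dublin Core">
</head><body></body></html>"""

parser = DublinCoreParser()
parser.feed(sample)
```

After `feed()`, `parser.metadata` holds the title and both creators, while the plain `description` element is ignored.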
In fact, the embedded metadata and a cleaner site design are already bearing fruit in the form of the rather del.icio.us (pun intended!) CiteULike. Richard Cameron is making a nice job of that site, and hopefully gadgets like mattb's Python API will be appearing for it shortly. You can follow the development of that site in the CiteULike devblog.
----
(*) Actually Wayne Davey did use to work for us, but again, finance people are a different breed entirely :)
Posted by ldodds at November 23, 2004 08:00 PM