November 23, 2004

Google Scholar

You can't go far these days without tripping over commentary on Google's strategy. I've not really paid this much attention, but it's been interesting watching the launch of Google Scholar and the reactions from the library communities, because it directly intersects with my day job: managing the team that has built and is enhancing IngentaConnect, my employer's new scholarly content aggregation service.

I thought it might be interesting to share some perspectives on working with Google and a couple of notes on Google Scholar itself.

My involvement with Google goes back about nine months, to when they contacted us to see if we wanted to collaborate with them on an initiative to add more scholarly content to the Google indexes. Of course we jumped at the chance: this is undeniably a Good Thing (both for us and the publishers we work with).

So the first step there was to help them ensure that the crawler could get to all the content. Our original site had a fairly crufty link syntax (too much reliance on query strings) so the first issue was tweaking their crawler to work around this. The new site is much cleaner as, like a lot of people, we've learnt a thing or two about REST recently.

The second issue was to ensure that the crawler got the full text, so they could work their magic on the full content rather than just the titles and abstracts. A bit of sleight-of-hand at our end ensured that the crawler got what it needed, but with the URLs in the Google index being a suitable entry point for an end user.

Like any search engine, the Googlebot simply adds the URLs that it GETs to the index, so you have to think a bit about your URL structure and where you route the bot if it's different from where you'd normally route a user. The crawler doesn't seem to have any real notion of a "preferred" URL for content: it investigates every link and uses content-based checksums to de-duplicate the data.
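To make the de-duplication idea concrete, here is a minimal sketch of grouping URLs by a checksum of the content they return. I don't know what hashing scheme Google actually uses; the function name, the choice of SHA-1 and the example.org URLs are all assumptions for illustration.

```python
import hashlib


def dedupe_by_checksum(pages):
    """Group URLs by a checksum of their content.

    `pages` is an iterable of (url, body_bytes) pairs. The result maps
    each checksum to the list of URLs that returned identical bytes.
    """
    groups = {}
    for url, body in pages:
        # Hypothetical choice of hash: any stable digest would do here.
        digest = hashlib.sha1(body).hexdigest()
        groups.setdefault(digest, []).append(url)
    return groups


# Two different URLs (query-string style and path style) for the same article.
pages = [
    ("http://example.org/article?id=42", b"<html>Article 42</html>"),
    ("http://example.org/articles/42", b"<html>Article 42</html>"),
    ("http://example.org/articles/43", b"<html>Article 43</html>"),
]

groups = dedupe_by_checksum(pages)
duplicates = [urls for urls in groups.values() if len(urls) > 1]
```

The point for site owners is that the checksum says the two URLs are the same document, but nothing in it says which of them should be shown to users, so it pays to keep the number of alternative URLs small.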

You can also make the bot's life easier by providing it with a "sitemap" so it can quickly harvest all the content. So this is my first tip to site owners: publish an index of your site specifically for the Googlebot (and other crawlers) and you'll be indexed much quicker. If you contact Google you should be able to get the index added to the crawl, which is useful if you don't care to publish the sitemap to end users.
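A crawler-facing index can be as simple as a single page of links. The sketch below builds one as plain HTML; the function name and the example.org URLs are hypothetical, and this is not the format we or Google actually used, just one obvious way to do it.

```python
from html import escape


def build_site_index(urls, title="Site index"):
    """Return a bare HTML page that links to every URL in `urls`."""
    items = "\n".join(
        # Escape the URL both as an attribute value and as link text.
        f'  <li><a href="{escape(u, quote=True)}">{escape(u)}</a></li>'
        for u in urls
    )
    return (
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        "<body>\n<ul>\n"
        f"{items}\n"
        "</ul>\n</body>\n</html>\n"
    )


index_page = build_site_index([
    "http://example.org/articles/42",
    "http://example.org/articles/43",
])
```

For a large site you would split the list across several pages and link them from a top-level index, so that no single page becomes unmanageably long.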

We turned all this work around very quickly and it was then just a matter of sitting back and watching the Googlebot wash over us. Well, that and play with the Google frisbee that their marketing department sent me. Actually, that's a lie. They must breed them differently in the Googleplex. Go outside? Run, like, around? Surely some mistake?!(*)

Early on in the discussions I'd checked with Anurag whether we could co-ordinate to ensure that the 'bot came in at quiet times to avoid swamping the servers. But that's not how it works: the Googlebot wanders where it might and can't be trained to index particular sites at particular times. It is performance sensitive, though, so it will back off from a site if the response times start to increase.

So, another tip I'd share is to rap the 'bot on the nose by throttling it (e.g. via Apache) so that it becomes a much friendlier beast to work with. You're then in a better position to control when and how quickly the bot hits your site. Even with their built-in rate limiting you can get sudden peaks of load that could swamp a server.
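Throttling can be done in many ways, and this is not our Apache setup. As an illustration of the general idea, here is a minimal token-bucket sketch: each request from the bot spends a token, tokens refill at a fixed rate, and requests that find the bucket empty are turned away. The class name and parameters are assumptions of mine.

```python
import time


class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum number of stored tokens
        self.clock = clock        # injectable so the logic can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed, spending one token."""
        now = self.clock()
        # Refill in proportion to the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In front of a web server, a refused request would typically be answered with a 503 and a Retry-After header rather than being served, which also plays to the crawler's habit of backing off from slow or unavailable sites.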

With our content appearing in the Google indexes it's been interesting to watch the referral traffic increase very nicely. Now that Google Scholar has launched, the referrals from the new site have similarly jumped into life; they must already have attracted a large user base, which isn't that surprising.

I also had a bit of fun with bookmarklets to help me highlight our content in the scholar indexes, check what's been indexed, etc. Note these are only certified for a real man's browser at the moment. There are more to come, to better tie in Google Scholar results both with our own site and others.

Surprisingly, Google seem to be being a bit cagey about who is in/out of the scholar indexes and their criteria for selection. I know we're in, and I also know they were working with the CrossRef folk among others, so that's a fair percentage of scholarly publishers. I've also seen PubMed and other well-known sites cropping up repeatedly in test searches I've done on the site.

This highlights another misconception I've seen in some of the recent commentary: as far as I can tell Scholar is not yet making more of the invisible web visible; it's mainly a subset of Google's existing index. I don't see that they've created a custom crawler, so I'm expecting data to appear in both the main index and Scholar. The latter just had some limited editorial input (domain selection from what I can see) and some extra processing, e.g. citation extraction and analysis.

Based on hard-won experience I can predict a number of debates about Google Scholar that are still to come, but one that's worth mentioning now is the old structured metadata versus text indexing debate. In fact Danny is on this tack already.

For what it's worth IngentaConnect has had Dublin Core metadata embedded in article pages since the first beta, with RDF to follow soon. This ought to help anyone interested in writing a scutter. Again more details to follow.
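For anyone curious what harvesting that metadata might look like, here is a minimal sketch that pulls Dublin Core values out of a page. It assumes the common convention of `<meta name="DC.xxx" content="...">` elements in the page head; the sample markup below is made up and is not copied from an IngentaConnect page.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect <meta> elements whose name starts with "DC."."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        # Keep repeated properties (e.g. several creators) as a list.
        if name.lower().startswith("dc.") and content is not None:
            self.metadata.setdefault(name, []).append(content)


def extract_dublin_core(html_text):
    """Return a dict mapping each DC property name to its list of values."""
    parser = DublinCoreParser()
    parser.feed(html_text)
    return parser.metadata


# Hypothetical article page head, for illustration only.
sample = """<html><head>
<meta name="DC.title" content="An Example Article">
<meta name="DC.creator" content="A. Author">
<meta name="DC.creator" content="B. Author">
<meta name="description" content="ignored">
</head><body></body></html>"""

record = extract_dublin_core(sample)
```

A real scutter would go on to fetch the pages, follow links and merge what it finds, but the extraction step itself needs nothing beyond the standard library.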

In fact, the embedded metadata and a cleaner site design are already bearing fruit in the form of the rather del.icio.us (pun intended!) CiteULike. Richard Cameron is making a nice job of that site, and hopefully gadgets like mattb's Python API will be appearing for it shortly. You can follow the development of that site in the CiteULike devblog.

----

(*) Actually Wayne Davey did use to work for us, but again, finance people are a different breed entirely :)

Posted by ldodds at November 23, 2004 08:00 PM
-->