Google Programming Contest

February 13th, 2008

In celebration of more than three years of delivering the best search experience on the Internet, Google is sponsoring the first annual Google Programming Contest.

The Grand Prize

  • $10,000 in cash
  • Potentially run your prize-winning code on Google’s multi-billion document repository (circumstances permitting)
  • VIP visit to Google Inc. in Mountain View, California

Challenge

Google is providing a selection of about 900,000 web pages in pre-parsed and raw format, together with a “ripper” program that provides a framework for processing the pre-parsed data. Your job is to write a program that does something interesting with the data, in such a way that it would scale to a web-sized collection of documents. Part of your job is to convince us why your program is interesting and why it will scale, but you’re free to implement whatever strikes your fancy.

We suggest you fit your entry in one of two different tracks: Applications or Systems.

1. Systems

Entries in the Systems track generally pertain to infrastructure for handling the data, where typical goals are systems related (i.e., speed/space properties). Some examples of possible projects include:

  • Designing and implementing an efficient index structure to quickly find all documents that contain a given word or phrase.
  • Achieving better compression for the repository (starting from either the pre-parsed or raw formats). You might make a case for why your compression scheme saves the most space, or saves space while still allowing quick access to the data.
  • Constructing a link graph for the data and providing fast access to it.
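As a rough illustration of the first idea in the list above, here is a minimal in-memory sketch of a positional inverted index with phrase lookup. It is not part of the contest materials: the function names `build_index` and `find_phrase`, the whitespace tokenization, and the `docs` dictionary (document id mapped to plain text) are all assumptions made for the example, and a real entry would read from the ripper's pre-parsed data and keep its postings on disk.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_id: [positions]} (a positional inverted index).

    `docs` is a hypothetical {doc_id: text} dictionary; tokenization is a
    plain lowercase whitespace split, which is an assumption of this sketch.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def find_phrase(index, phrase):
    """Return sorted ids of documents containing the phrase's words consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Candidate documents must contain every word of the phrase.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = []
    for doc_id in candidates:
        # A start position matches if word i appears at start + i for every i.
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)
```

A single-word query is just a phrase of length one, so `find_phrase(index, "quick")` returns every document containing that word. Storing positions costs extra space compared with a plain word-to-documents map, which is exactly the kind of speed/space trade-off a Systems entry would need to argue for.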

2. Applications

Entries in the Applications track generally deal with the semantics of the data. Some examples include:

  • Clustering pages by topic or type.
  • Detecting pages that are near-duplicates of one another.
  • Classifying links on a page.
  • Detecting common templates in pages, and separating out the common structure from the individual content.
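To make the near-duplicate idea from the list above concrete, the following sketch compares pages by the Jaccard similarity of their word shingles (overlapping runs of `k` words). This is one common technique swapped in for illustration, not a method the contest prescribes. The names `shingles`, `jaccard`, and `near_duplicates`, the default `k=3`, and the `0.8` threshold are all assumptions of the example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) in `text`."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle; empty text has none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8, k=3):
    """Return (id, id) pairs whose shingle sets have similarity >= threshold.

    `docs` is a hypothetical {doc_id: text} dictionary. Every pair is
    compared, so this is quadratic in the number of documents.
    """
    ids = sorted(docs)
    sets = {d: shingles(docs[d], k) for d in ids}
    pairs = []
    for i, x in enumerate(ids):
        for y in ids[i + 1:]:
            if jaccard(sets[x], sets[y]) >= threshold:
                pairs.append((x, y))
    return pairs
```

The all-pairs loop is fine for a handful of pages but would not scale to 900,000 documents, let alone a web-sized collection. An entry would likely need some form of sketching or hashing so that only promising candidate pairs are compared.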

Keep scalability in mind when designing your implementation. You should assume that your code will ultimately run on a collection of networked machines with a reasonable amount of memory (~2-4 gigabytes each), where the data is divided among them. You will probably need to combine partial results from each machine to form a single final result. The supplied repository is several orders of magnitude smaller than the ultimate target repository for the code, because of the limitations of the distribution media and the likely resource constraints of many entrants.
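The pattern of computing a partial result on each machine and then combining them can be sketched with a simple word count. This is only an illustration under stated assumptions: each "machine" is simulated as a list of page texts, and the names `local_counts` and `merge_counts` are hypothetical. How partial results actually travel between machines is left out entirely.

```python
from collections import Counter


def local_counts(pages):
    """Partial result for one machine: word frequencies over its share of pages."""
    counts = Counter()
    for text in pages:
        counts.update(text.lower().split())
    return counts


def merge_counts(partials):
    """Combine per-machine Counters into a single final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return total


# Hypothetical data: two "machines", each holding part of the repository.
shards = [["a b a"], ["b c"]]
merged = merge_counts(local_counts(shard) for shard in shards)
```

The merge works because addition is associative and commutative, so the final result does not depend on how the pages were divided or the order in which partial results arrive. A design whose partial results cannot be merged this way is harder to scale across machines.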

The limited size of the repository being distributed and the selection of documents may preclude interesting kinds of document processing. This repository includes a selection of HTML Web pages from 100 different sites in the “edu” domain.