Historical Archive: This is documentation from a Stanford University computer science project completed in 2000.

DiamondSilk

Applying structure to the web
students
David Weekly <dew@cs.stanford.edu>
Valerie Kucharewski <valerie@cs.stanford.edu>
advisor
Armando Fox <fox@cs.stanford.edu>

Introduction

The Internet contains an immeasurable quantity of information and knowledge. Much of this information is represented in databases on large servers. To be published to the web, these pieces of data often go through a transformation that produces unstructured HTML. A common modern way of doing this is to keep XML (structured data) on the server and apply XML stylesheets that tell the server how to present the data. The goal of our project is to recover the original structured data and to perform queries upon it. In one sense, we are creating an "Inverse XML Style Sheet Engine" and the tools to query the results of such a transform.
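
To make the idea concrete, here is a minimal sketch, in modern Python, of inverting such a transform when the template is known; the template, field names, and functions below are purely illustrative and not part of the actual system:

    import re

    # A hypothetical forward "stylesheet": structured record -> HTML page.
    TEMPLATE = "<h1>{title}</h1><p>Price: ${price}</p>"

    def render(record):
        return TEMPLATE.format(**record)

    def invert(html):
        # Turn each {field} in the escaped template into a named capture
        # group, then recover the original structured record.
        pattern = re.sub(r"\\{(\w+)\\}", r"(?P<\1>.*?)", re.escape(TEMPLATE))
        match = re.fullmatch(pattern, html)
        return match.groupdict() if match else None

    page = render({"title": "Widget", "price": "9.99"})
    print(invert(page))  # {'title': 'Widget', 'price': '9.99'}

In practice the template is not given to us, which is exactly why the decoding stage below tries to learn it from examples.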

Structure

The project will be segmented into several discrete portions. This lets us tackle the problem piecewise and ensure steady progress. Even if we solve only part of the puzzle, we will still have produced a useful result.

Decoding

The first portion will be dedicated to reverse-engineering the content encoding of websites. A web interface will allow users to interact with our server to discover content filters for specific websites. The user will present labeled examples (this is the title of the document, this is the textual body, this is the price of that item, etc.) and the system will learn from them. We intend to leverage work already done in this area, such as that of Nick Kushmerick. This stage will produce filters for structuring data from a given portion of a website. The filter database will be open for other sites to query and make use of.
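
A rough sketch of this kind of learning, loosely modeled on Kushmerick-style LR wrappers rather than on whatever algorithm we finally adopt (the pages, field values, and helper names below are illustrative): given labeled pages, learn the left and right delimiters that surround a field, then reuse them on new pages.

    import os

    def common_suffix(strings):
        # Longest common suffix, via commonprefix on the reversed strings.
        return os.path.commonprefix([s[::-1] for s in strings])[::-1]

    def learn_lr_wrapper(examples):
        # examples: list of (page, labeled value) pairs for one field.
        lefts, rights = [], []
        for page, value in examples:
            start = page.index(value)
            lefts.append(page[:start])
            rights.append(page[start + len(value):])
        # Left delimiter: longest common suffix of the text before the value.
        # Right delimiter: longest common prefix of the text after it.
        return common_suffix(lefts), os.path.commonprefix(rights)

    def apply_wrapper(page, left, right):
        start = page.index(left) + len(left)
        return page[start:page.index(right, start)]

    examples = [
        ("Item: Gizmo<br><b>Price:</b> <i>$4.50</i><br>", "$4.50"),
        ("Item: Widget<br><b>Price:</b> <i>$12.00</i><br>", "$12.00"),
    ]
    left, right = learn_lr_wrapper(examples)
    print(apply_wrapper("Item: Sprocket<br><b>Price:</b> <i>$99.99</i><br>", left, right))

Real pages will of course demand more robust delimiters and multiple fields per filter, but the learn-from-labeled-examples loop is the same.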

If the structure of a website changes and our old filters break on it, we will allow users to update the filters; in this way, the burden of maintaining the service shifts to those who use it. Many websites have successfully adopted this model to ensure scalability as the user base grows.

Acquisition

The next portion will be dedicated to the actual acquisition of structured data from websites. Users will specify the URL from which the system should spider a site, how it should do so, and how frequently. We may also attempt automated approaches to discovering the most appropriate "refresh" interval for a website. We do not anticipate that this portion will be difficult.
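
A minimal sketch, in modern Python, of the sort of same-site spider and refresh loop we have in mind (the function names, link handling, and callback below are simplified assumptions, not the final design):

    import re
    import time
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    def spider(start_url, max_pages=50):
        # Fetch pages breadth-first, staying on the start URL's host.
        host = urlparse(start_url).netloc
        queue, seen, pages = [start_url], set(), {}
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except OSError:
                continue
            pages[url] = html
            # Follow hyperlinks that stay on the same host.
            for href in re.findall(r'href="([^"]+)"', html):
                link = urljoin(url, href)
                if urlparse(link).netloc == host:
                    queue.append(link)
        return pages

    def refresh_loop(start_url, interval_seconds, handle):
        # Re-spider the site on a fixed schedule and hand the pages to the
        # next stage (e.g. the storage layer); the proposal also contemplates
        # learning an appropriate refresh interval automatically.
        while True:
            handle(spider(start_url))
            time.sleep(interval_seconds)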

Storage

The third portion concerns itself with efficiently storing the semi-structured data that comes back from the websites. We already have several designs for implementing this portion. We do not need extremely advanced functionality, and the specific design of the database lets us take some shortcuts. (We know, for instance, that modifications are rather infrequent.) We intend to study relevant papers in the field, such as Stanford's own pioneering work on the LORE system.
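
For illustration only, here is a tiny label-edge store loosely in the spirit of the OEM data model behind LORE; this is one possible shortcut-friendly sketch, not the database design we will necessarily build:

    import itertools

    class SemiStructuredStore:
        # Objects are integer ids; atomic objects hold values, complex
        # objects hold labeled edges to child objects.
        def __init__(self):
            self._ids = itertools.count()
            self.atoms = {}   # object id -> atomic value
            self.edges = {}   # object id -> list of (label, child id)

        def add_atom(self, value):
            oid = next(self._ids)
            self.atoms[oid] = value
            return oid

        def add_complex(self, **children):
            oid = next(self._ids)
            self.edges[oid] = list(children.items())
            return oid

        def children(self, oid, label):
            return [c for l, c in self.edges.get(oid, []) if l == label]

    store = SemiStructuredStore()
    book = store.add_complex(title=store.add_atom("Databases 101"),
                             price=store.add_atom(42.00))
    print(store.atoms[store.children(book, "price")[0]])  # 42.0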

Queries

The last portion focuses on querying the semi-structured data repository. The goal is ease of use and a large degree of flexibility, so that users can discover useful queries to perform. For instance, a price-comparison engine could be constructed from the structured data in our database. It would therefore be useful to let a user quickly build a front end, sitting on top of our database, for others to use.
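
As a toy example of that price-comparison use case (the records and field names below are made up for illustration), a query over extracted records sitting in the repository might look like:

    # Structured records as they might emerge from the extraction filters.
    records = [
        {"site": "storeA.example.com", "item": "USB cable", "price": 4.99},
        {"site": "storeB.example.com", "item": "USB cable", "price": 3.49},
        {"site": "storeA.example.com", "item": "Mouse",     "price": 12.00},
    ]

    def cheapest(records, item):
        # Price-comparison query: the lowest-priced offer for an item.
        offers = [r for r in records if r["item"] == item]
        return min(offers, key=lambda r: r["price"]) if offers else None

    print(cheapest(records, "USB cable"))
    # {'site': 'storeB.example.com', 'item': 'USB cable', 'price': 3.49}

A front end for such a query would then need little more than a form field for the item name.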

Difficulty

We expect this project to be difficult, and we are setting aside two college quarters for its implementation. However, we have already implemented a spidering news system as a proof of concept; it integrates many of our proposed techniques and demonstrates that a useful system can be built in this manner. As such, we believe we can achieve interesting and useful results within this time period, before both of our graduations in June of 2000.

Conclusion

Ultimately, we hope the project will be Open Source, easily extensible, reliable, fast, and straightforward. The overarching goal is to create a system capable of structuring most web data and answering questions about the results. Such a system is, as one example, a complete superset of a price-comparison engine. Many other powerful possibilities would become available if this project were to succeed.

We hope to have a preliminary version available at the end of Winter Quarter 1999-2000, with a polished version by mid-June 2000.

Update: We finished DiamondSilk in early June 2000. It was presented at the Senior Project Faire.

Historical Context

DiamondSilk was a pioneering Stanford CS project that anticipated many concepts later seen in modern web scraping, data extraction, and automated content understanding systems. The "inverse XML stylesheet engine" concept was ahead of its time, aiming to reverse-engineer structure from unstructured HTML - a problem still relevant today in web data extraction and semantic web technologies.

The collaborative approach to maintaining extraction rules, crowdsourcing updates from users, presaged many successful community-driven data curation models that became common in the 2000s and beyond.

Academic Project: This was a Stanford University computer science senior project completed in 2000. The system demonstrated early approaches to automated web data extraction and structure recovery that would become increasingly important as the web scaled.