Moving towards a Web of Meaning

01 Monday Dec 2014

Posted by Dominic in Information Architecture

Tags

Artists Books Online, authority control, BIBFRAME, Darcy DiNucci, DITA, Dublin Core Metadata Initiative, HTML, information architecture, information management, markup languages, metadata, Old Bailey Online, OWL, RDF, Semantic Web, Text Encoding Initiative, Tim Berners-Lee, W3C, Wikipedia, XML

The term “Web 2.0” was coined by Darcy DiNucci in 1999 and has gained currency during the 21st century. “Web 1.0” pages (the term is of course a retronym, as it was not used at the time) were largely static, text-based and passively consumed by those who accessed them: with the notable exception of the hyperlink, they were essentially a continuation of the printed book, journal, newspaper article or similar pre-Internet resources. In contrast, Web 2.0 was (and still is!) all about a dynamic, interactive user experience, involving multimedia messages, “mashups” of different website features using APIs, and user participation through social media, of which this blog is itself an example.

However, as interesting and meaningful as Web 2.0 may be to us, the human consumers, it is still flat and devoid of semantic information to our computers, the intermediaries. This is where the concept of the Semantic Web (also known as Web 3.0) comes in. It is a project spearheaded by the World Wide Web Consortium (W3C), under a name coined by Tim Berners-Lee, and aims to produce a Web of meaningful information (hence “semantic”) that can be understood, not just read, by machines. The technology that underlies this aim is the encoding of all text with metadata, using standard markup languages (such as XML) and controlled grammars and vocabularies (such as those provided by the Dublin Core Metadata Initiative). The meaningful metadata is added using linked data, which on the Internet is primarily achieved by using the standardised Resource Description Framework (RDF). RDF describes entities using a tripartite Subject, Predicate, Object model. For example: Dominic Allington-Smith (subject) studies at (predicate) City University London (object). When properly marked up, this enables a machine to understand the nature of the relationship between the subject and the object, whereas Web 2.0 can only identify that a relationship exists without knowing its nature. Those of you who have read my previous post on library metadata standards in the Web 2.0 era may be interested to know that the emerging BIBFRAME bibliographic metadata standard is expressed in RDF.
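To make the Subject, Predicate, Object idea a little more concrete, here is a minimal sketch in plain Python. It is not real RDF tooling: the `http://example.org/` identifiers, the `locatedIn` predicate and the `match` helper are all my own inventions for illustration, standing in for the URIs and query languages that a genuine RDF system would use.

```python
# Hypothetical namespace: example.org is reserved for illustrations,
# so none of these identifiers belong to a real vocabulary.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) triple.
triples = [
    (EX + "DominicAllingtonSmith", EX + "studiesAt", EX + "CityUniversityLondon"),
    (EX + "CityUniversityLondon", EX + "locatedIn", EX + "London"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Ask *where* Dominic studies, rather than merely whether some link exists.
studies = match(
    triples,
    subject=EX + "DominicAllingtonSmith",
    predicate=EX + "studiesAt",
)
places = [o for (_, _, o) in studies]
print(places)
```

Because the predicate is itself an identifier, the machine can distinguish “studies at” from any other kind of link between the same two entities, which is exactly what a bare hyperlink cannot do.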

An example of a series of RDF relationships displayed in graph form. The blue ovals are subjects, the arrows are predicates, and the beige rectangles are objects (the lower two blue ovals are both objects and subjects). The Uniform Resource Identifiers (URIs) labelling the arrows link to standards which define the nature of each predicate. The hierarchy is controlled by the underlying Taxonomy and any limitations by the Ontology (see below).

Of course, markup languages have existed since the genesis of the Internet. One of the fundamental building blocks of the Web as we know it is the HyperText Markup Language (HTML), which is used to format webpages, but this is extremely limited as it does not distinguish the meaningful content of text, just how it should be presented. This can be demonstrated in the Google Chrome browser by pressing Ctrl+U when viewing this or any other page in this blog to view its source code: although the markup is extensive and pervasive, it only refers to how the page is displayed on the computer screen, not its semantic meaning. The concept of the Semantic Web is different, as all text is marked up with meaningful metadata that tells the machine reading it explicitly how to interpret it, thus opening it up to searching, to analysis and to being placed into relationships with other texts in the corpus. (I should point out that current search engines are inefficient by comparison, as they must make guesses, albeit increasingly well-educated ones, about the content of text that has not been marked up, based on context and collocation.)

One way in which it may be useful to visualise the concept of the Semantic Web is by referring to the ill-fated Allwiki campaign, an argument during the early years of Wikipedia that every single word in an article should be linked to another Wikipedia article, or its definition on sister project Wiktionary, instead of the accepted current model. The Semantic Web, however, is different: whereas linking every single word in a Wikipedia article is immediately obvious to the human reader, the XML or similar markup language that can be used to encode the entire contents of a webpage is not visible in the human-readable interface, and is instead only intended for consumption by the machine that uses the information to display, search (and so forth) the page.

Although the vast majority of the content available through the Internet is currently not semantic, many research projects exist that have marked up their entire corpora of source material, digitised or otherwise, with metadata, using the principles laid out by the Text Encoding Initiative. This gives us an idea of how Web 3.0 could look. Due to the customisations needed for supporting the Semantic Web when dealing with different disciplines, contexts and so forth, it is also necessary to create a Document Type Definition (DTD), a Taxonomy (a hierarchical organisation of concepts derived from the RDF metadata, suitable for use with the hierarchical nature of XML), and an Ontology, expressed in the Web Ontology Language (OWL): a collection of logical rules and inferences that can be applied to the Taxonomy, for example that X cannot have influenced Y if X lived after Y.
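The difference between a Taxonomy and an Ontology can be sketched in a few lines of Python. This is an illustration only, written in plain Python rather than OWL; the concept names, the life dates and both function names are hypothetical, and I have read the chronological rule as “someone born after another person's death cannot have influenced them”.

```python
# Hypothetical taxonomy: each concept points to its broader parent concept.
broader = {"Painter": "Artist", "Sculptor": "Artist", "Artist": "Person"}


def is_a(concept, ancestor):
    """Walk up the hierarchy to test whether `concept` falls under `ancestor`."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = broader.get(concept)
    return False


# Hypothetical life dates (birth year, death year) for two invented people.
lifespans = {"X": (1800, 1850), "Y": (1700, 1750)}


def could_have_influenced(a, b):
    """Ontology-style rule: `a` cannot have influenced `b` if `a` was born after `b` died."""
    a_born, _ = lifespans[a]
    _, b_died = lifespans[b]
    return a_born <= b_died


print(is_a("Painter", "Person"))         # the taxonomy answers "what kind of thing is this?"
print(could_have_influenced("X", "Y"))   # the rule answers "is this relationship possible?"
```

The taxonomy only records where a concept sits in the hierarchy; the rule lets a machine rule out relationships that were never stated explicitly.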

The first of these research projects is the Old Bailey Online archive (used last week!). Its corpus of court cases and supplementary material has been entirely marked up, using both manual and computerised processes, so that it may be searched more effectively. For an example, let us refer to the trial of Thomas Smith for murder in 1760: the link shows the digitised natural language text (with links to scans of the original, handwritten records), and this one shows the same text with its structured, hierarchical metadata included in XML. The extensive use of hierarchical categories of information, drawn from a customised, controlled vocabulary, means that the site can use a complex, faceted search engine to make best research use of the entire corpus. As I demonstrated last week, this XML data can also be exported into Voyant Tools for further analysis.
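As a rough sketch of how hierarchical XML metadata makes faceted searching possible, the following uses Python's standard `xml.etree.ElementTree` module. The element and attribute names (`trial`, `defendant`, `offence`, `category`, `subcategory`) are invented for illustration and are not the Old Bailey Online's actual schema; the second record is entirely made up, and `facet_search` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the tag and attribute names are assumptions,
# not the scheme used by the real archive.
corpus = """
<trials>
  <trial year="1760">
    <defendant>Thomas Smith</defendant>
    <offence category="killing" subcategory="murder"/>
  </trial>
  <trial year="1761">
    <defendant>Example Defendant</defendant>
    <offence category="theft" subcategory="burglary"/>
  </trial>
</trials>
"""

root = ET.fromstring(corpus)


def facet_search(root, **facets):
    """Return (year, defendant) for trials whose offence matches every facet given."""
    hits = []
    for trial in root.iter("trial"):
        offence = trial.find("offence")
        if all(offence.get(key) == value for key, value in facets.items()):
            hits.append((trial.get("year"), trial.findtext("defendant")))
    return hits


murders = facet_search(root, subcategory="murder")
print(murders)
```

The search never has to guess what the word “murder” means in running text: it simply reads the value that the encoder placed in the controlled vocabulary.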

The second such project is Artists’ Books Online, “an online repository of facsimiles, metadata, and criticism”. The “About the Project” page provides full and detailed information about the website’s DTD, metadata, taxonomy, markup scheme, and other technical information. It is unfortunately not possible to view each repository item’s metadata in the XML format, but a sample markup webpage allows us to compare a sample of the machine-readable semantic metadata in XML (and some presentation information in HTML):

Compare with the beginning of the human-readable output on the same item’s normal webpage listing:

As with the Old Bailey Online project, the inclusion of meaningful metadata as an integral part of the webpage’s text allows for much greater efficiency in searching for, and comparing between, different sections of the total research corpus. Applying these principles to the entirety of the Internet may seem like a distant dream, but the rate of technological progress and innovation that resulted in Web 1.0 and 2.0 must surely make the realisation of Version 3.0 achievable.

Library metadata standards in the Web 2.0 era

29 Wednesday Oct 2014

Posted by Dominic in Information Management

Tags

authority control, BIBFRAME, Charles Ammi Cutter, DITA, information architecture, information management, information needs, information organisation, information retrieval, LCSH, Library of Congress, library OPACs, MARC, markup languages, metadata, programming languages, Web 2.0, WorldCat, XML

I shall begin this blog post with a pair of suitably dramatic quotations, purely in order to get the reader’s attention:

Metadata absolutely tells you everything about somebody’s life. If you have enough metadata, you don’t really need content.

—Stewart Baker, former General Counsel of the National Security Agency.

We kill people based on metadata.

—Michael Hayden, former Director of the NSA and the Central Intelligence Agency.

(Thanks to David Cole of the New York Review of Books.)

Of course, not all applications of metadata (from the Greek meta-, meaning “change” or “beyond”; used in English to indicate the abstraction of a concept, in this case “data about data”) are literally a matter of life or death (although a trained cataloguer like me might beg to differ!). In the Information Age, however, the collection and maintenance of, and access to, metadata is one of the most important issues facing any library, or indeed any organisation that requires a constant supply of relevant, good-quality information to function effectively.

The mainstay of the bibliographic metadata framework used in libraries is the Machine-Readable Cataloguing (MARC) record. This is a means of recording bibliographic metadata (in up to 999 separate fields and further sub-fields) in such a way that it can be read by a machine (but also by a trained human), essentially by providing a repeatable framework with clearly defined parameters that an automated process can understand. It was developed by the Library of Congress (LC) and first trialled in the mid-1960s; the LC’s importance and prestige in the LIS sector, both in the United States and abroad, ensured that it soon spread around the world.
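To give a flavour of the “repeatable framework with clearly defined parameters”, here is a much-simplified sketch of a record in Python. The tags 245, 260, 650 and 700 are genuine MARC 21 field tags, but the dictionary layout is my own simplification: a real record also has a leader, indicators and repeatable fields, all omitted here, and the `subfield` helper and the subject heading shown are illustrative assumptions.

```python
# Simplified, hypothetical representation: tag -> {subfield code -> value}.
# Real MARC fields can repeat and carry indicators; this sketch ignores both.
record = {
    "245": {"a": "XML in libraries"},                 # title statement
    "260": {"c": "2002"},                             # publication details (c = date)
    "650": {"a": "XML (Document markup language)"},   # topical subject heading
    "700": {"a": "Tennant, Roy", "e": "editor"},      # added entry: personal name
}


def subfield(record, tag, code):
    """Return one subfield value, or None if the field or subfield is absent."""
    return record.get(tag, {}).get(code)


title = subfield(record, "245", "a")
editor = subfield(record, "700", "a")
missing = subfield(record, "100", "a")  # no main-entry author in this record
print(title, editor, missing)
```

Because every library agrees that tag 245 holds the title, a program can pull the title out of any conforming record without understanding anything else about it.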

A key reason for libraries to adopt the MARC record format for bibliographic metadata was to save time and resources. Provided that all cataloguers used the same standards, each new book or similar item only needed to be catalogued once, with the cataloguing authority then sharing its record with other libraries. (The official MARC standards website links to numerous sets of authority codes for display in the relevant fields, such as country of origin, language, and how named individuals and organisations are related to the item being catalogued; this does not include further cataloguing standards such as the Anglo-American Cataloguing Rules and the comparatively new Resource Description and Access.) Initially this sharing was carried out by the LC offering a subscription service to computer-printed catalogue cards, but the development of the Internet in subsequent decades soon allowed for the much quicker transfer of information through cyberspace. The Online Computer Library Center (OCLC), which administers WorldCat, the world’s largest online public-access catalogue (OPAC), also dates back to the mid-1960s.

The development of online library catalogues sounded the death knell for the traditional card index systems, but the underlying metadata standards remained the same. (Photo credit: Ted Eytan)

Yet even as the methods of communication of, and access to, library resources changed, the metadata standards (albeit with periodic revisions) remained substantially the same. Just as the Dewey Decimal Classification and the Library of Congress Classification schemes date from the second half of the nineteenth century, so the principles for a library catalogue are still underpinned by Charles Ammi Cutter’s original objectives from the same period. Thus a modern library OPAC, whilst much more convenient to use than its cumbersome predecessor, is still used mostly to search by author, title or subject (the latter either directly by using controlled vocabulary such as the Library of Congress Subject Headings (LCSH), or by proxy with a classification system).

The recent development of more general-purpose formats for describing metadata, such as the Extensible Markup Language (XML) and JavaScript Object Notation (JSON), has opened up many new avenues for library applications, as this twelve-year-old prediction indicates:

[XML] has the potential to exceed the impact of MARC on librarianship. While MARC is limited to bibliographic description […] XML provides a highly-effective framework for encoding anything from a bibliographic record for a book to the book itself.

—Roy Tennant, editor of XML in Libraries (2002).

The contributing authors to XML in Libraries identify seven key applications for the language:

Library catalogue records
Interlibrary loans
Cataloguing and indexing
Collection development
Databases
Data migration
Systems interoperability

Whilst most of these developments have taken place “behind the scenes”, one visible indicator of progress is that many academic library OPACs, such as my home institution of City University London, have the functionality to export a bibliographic record in a format which can be understood by various reference management programmes—a similar process to exporting a website (such as the entire contents of this blog, which WordPress allows me to do if I so choose) using XML.
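The export step can be sketched as a simple transformation from one structured form to another. The example below writes a record using RIS-style tags (`TY`, `ED`, `TI`, `PY`, `ER`), a plain-text format that many reference managers read; the input dictionary, its key names and the `to_ris` function are hypothetical, and I do not know which format any particular OPAC actually uses internally.

```python
def to_ris(record):
    """Serialise a simple bibliographic dictionary as RIS-style tagged lines."""
    lines = ["TY  - " + record["type"]]          # reference type comes first
    for editor in record.get("editors", []):
        lines.append("ED  - " + editor)          # one line per editor
    lines.append("TI  - " + record["title"])
    lines.append("PY  - " + record["year"])
    lines.append("ER  - ")                       # end-of-record marker
    return "\n".join(lines)


# Hypothetical input record, using details mentioned in this post.
book = {
    "type": "BOOK",
    "editors": ["Tennant, Roy"],
    "title": "XML in Libraries",
    "year": "2002",
}

ris = to_ris(book)
print(ris)
```

The point is less the particular format than the principle: once the metadata is held in clearly labelled fields, re-expressing it for another program is a mechanical task.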

Yet despite these developments, and other features of modern library OPACs about which I have already posted (such as the use of “professional” colour schemes and the embedding of multimedia material using APIs), the metadata itself remains static, even as Web 2.0 becomes ever more dynamic and engaging for its users. This is why the LC is developing a new model called the Bibliographic Framework Initiative (BIBFRAME), which aims (eventually) to replace the ageing MARC records.

The key advantage that BIBFRAME has over MARC is that it is designed around a Linked Data model. MARC records have certain “linking fields” in which such data can be added—for example, the name of a series within which an individual monograph is published—but my impression from using the records regularly as an information professional is that each one is a discrete entity with the links tacked on as an afterthought—which is only to be expected, given that the format predates the practice of hyperlinking by several decades. In a BIBFRAME record, every aspect of metadata is relational and therefore searchable; this allows the searcher to move beyond Cutter’s simple objectives to achieve far greater precision in terms of information retrieval. But that’s not all. Reading the BIBFRAME overview, for me the key element is:

Information Resources can then be re-assembled into a coherent architecture that allows for cooperative cataloging at a far more granular level than before. Then, as we leverage the Web as an architecture for data, whenever updates to these Resources are performed (e.g. someone adds new information about a Person, new mappings related to a Subject, etc.) notification events can occur to automatically update systems that reference these Resources. Further, these information assets can now be more effectively utilized at a granular level and provide a richer substrate in which local collections, special collections and third party data can easily annotate and contextualize cooperative library content.

This allows for far more user participation in the creation and dissemination of metadata, essentially opening up the process beyond library and related employees to the library users themselves. Instead of a series of static pages, however large or however interlinked they may be, imagine a library OPAC that invites anyone who accesses it to edit the catalogue records in the manner of Wikipedia, or to link to non-traditional external sources of information, such as a LibraryThing profile or a book review posted on a personal blog. This would require a great deal of programme-writing in order to produce an interface that the layperson can use without special training, and a level of professional monitoring to prevent vandalism, but the example of Wikipedia is encouraging: in under fifteen years of existence, the article-count (all freely usable, of course) of the English-language version alone is over 4.5 million, written almost exclusively by millions of volunteers (of which approximately 100,000 are regular users), whilst the Wikimedia Foundation (which also administers many other similar projects using the same collaborative model) boasts a shade over 200 paid employees. Despite its voluntary and open nature, it has effective systems for dealing with vandalism and ensuring quality control, as shown in a number of studies which have compared its accuracy to that of conventional encyclopaedias.

Web 2.0 is defined by the interaction between its content and its users, and its pervasiveness in modern society is testament to how popular it is. In an era when libraries of all types are continually striving to improve user participation—whether due to government funding cuts or otherwise—perhaps it is fair to say that this process should be carried out, as a matter of some urgency, at its heart: the metadata that underlies all library resources.

Clash of the titans

14 Tuesday Oct 2014

Posted by Dominic in Information Architecture

Tags

Andrei Broder, Bing, Boolean operators, databases, DITA, Google, information architecture, information management, information needs, information retrieval, natural language, search engines, structured queries, unstructured queries

For the information professional, there are two main ways of curating (for want of a better term) data. The first is by using a relational database, an extremely efficient way of organising information within an organisation, but this is not suitable for large quantities of heterogeneous data that are owned by many different entities. For such datasets (this post will focus on the most obvious example, the contents of the World Wide Web), the alternative system of Information Retrieval (IR) comes into play.

An IR system consists of input (the content to be searched), indexing, search, and interface (how the user inputs their query into the system). Indexing requires the contents of the input (and possibly any accompanying metadata) to be identified, filtered, listed and stored. The storage medium can vary from the endpages of a book (in a traditional sense of the index function) to the vast array of servers used by a Web search engine.
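The indexing step (identify, filter, list and store) can be sketched with a tiny inverted index, which maps each term to the documents containing it. This is a toy illustration, not how any real search engine is built: the three documents, the stopword list and the `build_index` function are all invented for the purpose.

```python
import re
from collections import defaultdict

# Hypothetical mini-corpus standing in for the "input" of the IR system.
documents = {
    1: "The first blogging platform",
    2: "How to blog for the first time",
    3: "A history of early web pages",
}

# Hypothetical list of common words to filter out before storing.
STOPWORDS = {"the", "a", "of", "to", "for", "how"}


def build_index(documents):
    """Identify terms, filter stopwords, and store term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            if term not in STOPWORDS:
                index[term].add(doc_id)
    return index


index = build_index(documents)
print(sorted(index["first"]))
```

Looking up a term is then a single dictionary access, however large the corpus, which is the same convenience that the index at the back of a book offers its reader.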

The most important part of the IR system is the search component, which retrieves the information required by the user. A major advantage of modern IR systems over relational databases is that users can submit unstructured queries: instead of using set command lines, Boolean operators or similar standardised codes and expressions, the user can simply type natural language, or indeed full questions.
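The contrast between structured and unstructured queries can be sketched against a small hand-built index. The index contents, the document numbers and the `best_match` function are hypothetical, and the scoring (counting how many query terms a document contains) is a deliberately crude stand-in for the far more sophisticated ranking that real search engines use.

```python
# Hypothetical inverted index: term -> set of document ids.
index = {
    "first": {1, 2},
    "blogging": {1},
    "platform": {1},
    "blog": {2},
    "guide": {2},
}

# Structured (Boolean) queries map directly onto set operations.
both = index["first"] & index["platform"]    # first AND platform
either = index["blogging"] | index["blog"]   # blogging OR blog
without = index["first"] - index["guide"]    # first NOT guide


def best_match(query, index):
    """Rank documents by how many of the query's words they contain."""
    scores = {}
    for term in query.lower().split():
        for doc_id in index.get(term, set()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


ranked = best_match("first blogging platform", index)
print(both, either, without, ranked)
```

The Boolean queries return exactly the set the searcher specified, whereas the natural language query returns everything that partially matches and leaves the ranking to decide what appears at the top.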

Nevertheless, despite the constant advances in search algorithms used by search engines, promising greater and greater accuracy in using “best-match” search models to improve the quality of IR for unstructured queries, it should still be possible to achieve a greater degree of precision when inputting structured search terms into the same search engines. In the DITA module’s lab session yesterday, I tested this hypothesis by attempting to answer ten questions using first unstructured and then structured queries. To introduce a degree of competition, I used both Google and Bing—the former with the majority of the British search engine market, and the latter its fastest-growing competitor—truly two titans of the Internet Age!

The ten information needs were as follows:

What was the first blogging platform?
What did the first web pages look like?
Who developed WordPress and when?
How do you disable comments on WordPress.com blogs?
What colour is produced with the #330000 HEX value?
Why are they called “Boolean operators”?
Find a WordPress theme for sale that you like or would consider paying for.
Find a high-resolution photograph of an old library which is licensed for reuse.
Find an image of computers in a library, licensed for reuse.
Find an image to illustrate the concept of information retrieval, licensed for reuse.

Note that this includes a range of information needs, as defined by Andrei Broder’s taxonomy (2002), different media types, and a distinction between specific, well-defined queries, and more general ones which require a degree of browsing.

To generate my results, I used the same natural language search terms for both Google and Bing, before using a range of structured search techniques (mentioned above) and the search engines’ own “advanced search” tools in an attempt to refine the information retrieved. I measured the precision of the searches by taking the top five results (excluding adverts and sponsored results) and using the standard formula:

Precision (P) = Relevant results (Nr) / Total results (N), where N = 5 in every case
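For anyone wishing to repeat the exercise, the calculation can be sketched as a short Python function. The `precision` function name and the list of relevance judgements are hypothetical; the judgements shown are an invented example and not my actual results.

```python
def precision(judgements):
    """Precision = relevant results / total results examined.

    `judgements` holds one True/False relevance decision per result.
    """
    if not judgements:
        raise ValueError("at least one result is needed to compute precision")
    return sum(judgements) / len(judgements)


# Invented example: three of the top five results judged relevant.
top_five = [True, True, False, True, False]
p = precision(top_five)
print(p)
```

With only five results per query, precision can only take the values 0, 0.2, 0.4, 0.6, 0.8 or 1, so small differences between the two search engines should be read with some caution.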

This produced the following results (I have not included the structured queries in the table due to space limitations, as they were not the same for both search engines):

It is clear to see that the structured queries were almost twice as effective as those using natural language only. This was particularly true of the final three queries: these used the image search facility, which is far more reliant on the advanced settings tools to return relevant results. It is also clear that Google has a slight edge over Bing in both unstructured and structured queries (although this may have been skewed by the fact that I habitually use Google and am therefore more accustomed to searching with it). I was surprised by some of the inconsistencies between the two search engines, particularly in regard to the unstructured queries: for example, entering “first blogging platform” into Google in order to establish the identity of the first blogging platform in history produced a reasonable level of precision, whereas Bing returned a series of guides on how to blog for the first time—a situation which was reversed for the second information need concerning the appearance of early web pages. In general, it is also apparent that unstructured queries were most effective when searching for precise information (e.g. the colour produced by the #330000 HEX value), whereas structured queries were far superior for browsing (e.g. a suitable WordPress theme, or freely-licensed images to illustrate a particular point).

Finally, to demonstrate the value of this experiment, I conclude this post with a high-resolution, public-domain photograph of a beautiful old library.

Public domain, taken by Emgonzalez and uploaded to Wikimedia Commons.

The Library of Tomorrow

~ thoughts and reflections on the world of Library and Information Science
