A library in a box

31 Tuesday Mar 2015

Posted by Dominic in Information Society

Tags

censorship, charities, CILIP, digital divide, event reports, IdeasBox, information literacy, information needs, information society, Libraries Without Borders, NGOs

Yesterday evening, I went to the Institut Français with some of my course friends to attend a panel discussion about the IdeasBox, a charitable project launched by the Libraries Without Borders / Bibliothèques Sans Frontières foundation in partnership with the United Nations Refugee Agency and the architect and designer Philippe Starck. The project aims to equip vulnerable populations around the world with books, information technology tools and infrastructure tailored to the needs of each particular community, in order to improve their lives through education and access to information and culture, as its promotional video explains.

There were a few things that struck me in particular about the evening:

The programme is not just intended for areas in developing countries that are affected by humanitarian catastrophes, but also for disadvantaged populations in “developed” countries: it has been used, or its use is currently being evaluated, for the benefit of Aboriginal communities in Australia, deprived urban communities in New York, and migrant settlements in Calais, in addition to the more “obvious” areas of conflict and unrest in Africa and the Middle East. This point, and the positive effects of reading, literacy and libraries in general, were emphasised by Barbara Band, the former head of CILIP, who quoted the results of some alarming recent surveys carried out in Britain: 3 in 10 British children do not own a single book of their own (2011) and a third of the poorest British children do not have internet access at home (2013). Reflecting on these figures, my mind wandered to the example of the Finnish government-provided maternity box that is given to all mothers: could an “information literacy box” be given to each new schoolchild one day in a partnership between the government and a specialist charitable scheme such as the IdeasBox, which already co-operates with governments, publishers, other charities and so forth?
I was also impressed by the cross-fertilisation of ideas that took place during the evening. For example, the author Ian McEwan proposed a tie-up with the Solar Aid initiative that he already supports, as books by themselves are of limited value if there is no light by which to read them. A member of the audience from Penguin Random House UK also offered her company’s support in the question-and-answer session that took place after the discussion. To me, this provided timely hard evidence of the rhetoric that allowing networked access can improve the lives of disadvantaged communities by demonstrating the value of utilising multiple perspectives and areas of expertise at a social event.
Several panellists pointed out that the freedom of information, expression and thought is under pressure in a number of countries throughout the world for political and religious reasons, and that this project represents the ideal riposte to this trend. This links back to the debates that we have had in my LAPIS module on the cost of knowledge and the digital divide that this, and other costs associated with the development of technology and infrastructure, produce between developed and developing countries. On the one hand, it is somewhat depressing to think that these economic pressures are compounded by the anti-information tendencies caused by political authoritarianism and religious extremism; on the other, it is encouraging that a successful project such as this has been set up to counter them.

All in all, it was an extremely informative evening, and I will follow the progress of the IdeasBox programme with much interest.

[N.B. The featured image for this post is “Ideas Box – Bibliothèques sans Frontières” by ActuaLitté on Flickr. It is licensed under an Attribution-ShareAlike 2.0 Generic (CC BY-SA 2.0) licence.]

Library metadata standards in the Web 2.0 era

29 Wednesday Oct 2014

Posted by Dominic in Information Management


Tags

authority control, BIBFRAME, Charles Ammi Cutter, DITA, information architecture, information management, information needs, information organisation, information retrieval, LCSH, Library of Congress, library OPACs, MARC, markup languages, metadata, programming languages, Web 2.0, WorldCat, XML

I shall begin this blog post with a pair of suitably dramatic quotations, purely in order to get the reader’s attention:

Metadata absolutely tells you everything about somebody’s life. If you have enough metadata, you don’t really need content.

—Stewart Baker, former General Counsel of the National Security Agency.

We kill people based on metadata.

—Michael Hayden, former Director of the NSA and the Central Intelligence Agency.

(Thanks to David Cole of the New York Review of Books.)

Of course, not all applications of metadata (from the Greek meta-, meaning “change” or “beyond”; used in English to indicate the abstraction of a concept, in this case “data about data”) are literally a matter of life or death (although a trained cataloguer like me might beg to differ!). In the Information Age, however, the collection and maintenance of metadata, and access to it, are among the most important issues facing any library, or indeed any organisation that requires a constant supply of relevant, good-quality information to function effectively.

The mainstay of the bibliographic metadata framework used in libraries is the Machine-Readable Cataloguing (MARC) record. This is a means of recording bibliographic metadata (in up to 999 separate fields and further sub-fields) in such a way that it can be read by a machine (but also by a trained human), essentially by providing a repeatable framework with clearly defined parameters that an automated process can understand. It was developed by the Library of Congress (LC) and first trialled in the mid-1960s; the LC’s importance and prestige in the LIS sector, both in the United States and abroad, ensured that it soon spread around the world.
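To make the field and sub-field structure a little more concrete, here is a minimal Python sketch of how a MARC-style record might be modelled. The class names and the sample data are my own illustrative assumptions, not the official MARC 21 serialisation or any real library system; only the tag numbers follow the MARC 21 bibliographic format (100 for the main author, 245 for the title statement, 650 for a topical subject heading). Real MARC fields also carry indicators and repeatable sub-fields, which this sketch leaves out.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """One MARC-style field: a three-digit tag plus coded sub-fields."""
    tag: str
    subfields: dict[str, str] = field(default_factory=dict)


@dataclass
class Record:
    """A simplified bibliographic record (hypothetical model, not real MARC 21)."""
    fields: list[Field] = field(default_factory=list)

    def get(self, tag: str, code: str) -> list[str]:
        """Return every value stored under the given tag and sub-field code."""
        return [f.subfields[code] for f in self.fields
                if f.tag == tag and code in f.subfields]


# Illustrative sample data only.
record = Record([
    Field("100", {"a": "Cutter, Charles A."}),
    Field("245", {"a": "Rules for a dictionary catalog"}),
    Field("650", {"a": "Cataloging"}),
    Field("650", {"a": "Library catalogs"}),
])

print(record.get("245", "a"))
print(record.get("650", "a"))
```

Because every field has a predictable tag and every sub-field a predictable code, a program can pull out the title or the subject headings without having to interpret the record as free text, which is the sense in which the format is "machine-readable".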

A key reason for libraries to adopt the MARC record format for bibliographic metadata was to save time and resources. Provided that all cataloguers used the same standards, each new book or similar item only needed to be catalogued once, with the cataloguing authority then sharing its record with other libraries. (The official MARC standards website links to numerous sets of authority codes for display in the relevant fields, such as country of origin, language, and how named individuals and organisations are related to the item being catalogued; it does not include further cataloguing standards such as the Anglo-American Cataloguing Rules and the comparatively new Resource Description and Access.) Initially this was carried out by the LC offering a subscription service to computer-printed catalogue cards, but the development of the Internet in subsequent decades soon allowed for the much quicker transfer of information through cyberspace. The Online Computer Library Center (OCLC), which administers WorldCat, the world’s largest online public-access catalogue (OPAC), also dates back to the mid-1960s.

The development of online library catalogues sounded the death knell for the traditional card index systems, but the underlying metadata standards remained the same. (Photo credit: Ted Eytan)

Yet even as the methods of communication of, and access to, library resources changed, the metadata standards (albeit with periodic revisions) remained substantially the same. Just as the Dewey Decimal Classification and the Library of Congress Classification schemes date from the second half of the nineteenth century, so the principles for a library catalogue are still underpinned by Charles Ammi Cutter’s original objectives from the same period. Thus a modern library OPAC, whilst much more convenient to use than its cumbersome predecessor, is still used mostly to search by author, title or subject (the latter either directly by using controlled vocabulary such as the Library of Congress Subject Headings (LCSH), or by proxy with a classification system).

The recent development of more general formats for describing metadata, such as the markup language Extensible Markup Language (XML) and the data-interchange format JavaScript Object Notation (JSON), has opened up many new avenues for library applications, as this twelve-year-old prediction indicates:

[XML] has the potential to exceed the impact of MARC on librarianship. While MARC is limited to bibliographic description […] XML provides a highly-effective framework for encoding anything from a bibliographic record for a book to the book itself.

—Roy Tennant, editor of XML in Libraries (2002).
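As a rough illustration of how the same bibliographic metadata can be carried in either format, the following Python sketch encodes one record as XML and as JSON using only the standard library, then reads both back. The element and key names (`record`, `title`, `editor`, `year`) are hypothetical choices of mine for the example and do not correspond to any published library schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record; the keys are illustrative, not a published schema.
book = {
    "title": "XML in Libraries",
    "editor": "Roy Tennant",
    "year": "2002",
}

# Encode as XML: one child element per piece of metadata.
root = ET.Element("record")
for name, value in book.items():
    ET.SubElement(root, name).text = value
xml_text = ET.tostring(root, encoding="unicode")

# Encode the same metadata as JSON.
json_text = json.dumps(book)

# Both encodings can be parsed back into the original metadata.
from_xml = {child.tag: child.text for child in ET.fromstring(xml_text)}
from_json = json.loads(json_text)

print(xml_text)
print(json_text)
```

Neither encoding is tied to bibliographic description in the way that MARC is: the same approach would work for a loan request, a collection report, or the text of the book itself.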

The contributing authors to XML in Libraries identify seven key applications for the language:

Library catalogue records
Interlibrary loans
Cataloguing and indexing
Collection development
Databases
Data migration
Systems interoperability

Whilst most of these developments have taken place “behind the scenes”, one visible indicator of progress is that many academic library OPACs, such as that of my home institution, City University London, have the functionality to export a bibliographic record in a format which can be understood by various reference management programs. This is a similar process to exporting a website using XML (such as the entire contents of this blog, which WordPress allows me to do if I so choose).

Yet despite these developments, and other features of modern library OPACs about which I have already posted (such as the use of “professional” colour schemes and the embedding of multimedia material using APIs), the metadata itself remains static, even as Web 2.0 becomes ever more dynamic and engaging for its users. This is why the LC is developing a new model, the Bibliographic Framework Initiative (BIBFRAME), which aims (eventually) to replace the ageing MARC records.

The key advantage that BIBFRAME has over MARC is that it is designed around a Linked Data model. MARC records have certain “linking fields” in which such data can be added—for example, the name of a series within which an individual monograph is published—but my impression from using the records regularly as an information professional is that each one is a discrete entity with the links tacked on as an afterthought—which is only to be expected, given that the format predates the practice of hyperlinking by several decades. In a BIBFRAME record, every aspect of metadata is relational and therefore searchable; this allows the searcher to move beyond Cutter’s simple objectives to achieve far greater precision in terms of information retrieval. But that’s not all. Reading the BIBFRAME overview, for me the key element is:

Information Resources can then be re-assembled into a coherent architecture that allows for cooperative cataloging at a far more granular level than before. Then, as we leverage the Web as an architecture for data, whenever updates to these Resources are performed (e.g. someone adds new information about a Person, new mappings related to a Subject, etc.) notification events can occur to automatically update systems that reference these Resources. Further, these information assets can now be more effectively utilized at a granular level and provide a richer substrate in which local collections, special collections and third party data can easily annotate and contextualize cooperative library content.
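The relational idea can be sketched in a few lines of Python. What follows is only a toy model under my own assumptions: the identifiers (`work:1`, `person:cutter`) and predicate names are invented for the example, and real BIBFRAME data would use URIs, the BIBFRAME vocabulary and proper RDF tooling. It does show, though, how every statement becomes a link that can be followed in either direction.

```python
# Each statement is a (subject, predicate, object) triple. All identifiers
# below are hypothetical; real BIBFRAME data would use URIs and RDF tooling.
triples = {
    ("work:1", "title", "Rules for a dictionary catalog"),
    ("work:1", "creator", "person:cutter"),
    ("work:2", "title", "Expansive classification"),
    ("work:2", "creator", "person:cutter"),
    ("person:cutter", "name", "Charles Ammi Cutter"),
}


def objects(subject: str, predicate: str) -> set[str]:
    """Everything the given subject points to via the given predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(predicate: str, obj: str) -> set[str]:
    """Every subject that points to the given object via the predicate."""
    return {s for s, p, o in triples if p == predicate and o == obj}


# Follow the links: from one work to its creator, then to all of that
# creator's works, without any field having been designed for the purpose.
creator = next(iter(objects("work:1", "creator")))
other_titles = sorted(
    title
    for work in subjects("creator", creator)
    for title in objects(work, "title")
)
print(other_titles)
```

Note that the person's name is stored once, in a single statement. Correcting it there would be reflected wherever the person is referenced, which is a small-scale version of the automatic updating that the overview describes.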

This allows for far more user participation in the creation and dissemination of metadata, essentially opening up the process beyond library and related employees to the library users themselves. Instead of a series of static pages, however large or however interlinked they may be, imagine a library OPAC that invites anyone who accesses it to edit the catalogue records in the manner of Wikipedia, or to link to non-traditional external sources of information, such as a LibraryThing profile or a book review posted on a personal blog. This would require a great deal of programming in order to produce an interface that the layperson can use without special training, and a level of professional monitoring to prevent vandalism, but the example of Wikipedia is encouraging: in under fifteen years of existence, the article count (all freely usable, of course) of the English-language version alone is over 4.5 million, written almost exclusively by millions of volunteers (of whom approximately 100,000 are regular users), whilst the Wikimedia Foundation (which also administers many other similar projects using the same collaborative model) boasts a shade over 200 paid employees. Despite its voluntary and open nature, it has effective systems for dealing with vandalism and ensuring quality control, as shown in a number of studies which have compared its accuracy to that of conventional encyclopaedias.

Web 2.0 is defined by the interaction between its content and its users, and its pervasiveness in modern society is testament to how popular it is. In an era when libraries of all types are continually striving to improve user participation—whether due to government funding cuts or otherwise—perhaps it is fair to say that this process should be carried out, as a matter of some urgency, at its heart: the metadata that underlies all library resources.

Clash of the titans

14 Tuesday Oct 2014

Posted by Dominic in Information Architecture


Tags

Andrei Broder, Bing, Boolean operators, databases, DITA, Google, information architecture, information management, information needs, information retrieval, natural language, search engines, structured queries, unstructured queries

For the information professional, there are two main ways of curating (for want of a better term) data. The first is by using a relational database, an extremely efficient way of organising information within an organisation, but this is not suitable for large quantities of heterogeneous data that are owned by many different entities. For such datasets (this post will focus on the most obvious example, the contents of the World Wide Web), the alternative system of Information Retrieval (IR) comes into play.

An IR system consists of input (the content to be searched), indexing, search, and interface (how the user inputs their query into the system). Indexing requires the contents of the input (and possibly any accompanying metadata) to be identified, filtered, listed and stored. The storage medium can vary from the endpages of a book (in a traditional sense of the index function) to the vast array of servers used by a Web search engine.
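The indexing step can be illustrated with a small inverted index, sketched here in Python. The three "documents", the stopword list and the function names are all assumptions made for the example; a real search engine's index is vastly larger and more sophisticated, but the identify, filter, list and store stages are the same in outline.

```python
import re
from collections import defaultdict

# Illustrative mini-collection standing in for the "input" of an IR system.
documents = {
    1: "The first blogging platform",
    2: "How to start blogging for the first time",
    3: "A history of library catalogues",
}

STOPWORDS = {"a", "the", "of", "to", "for", "how"}


def tokenize(text: str) -> list[str]:
    """Identify and filter terms: lower-case words minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """List and store, for every term, the ids of documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)


index = build_index(documents)
print(sorted(index["blogging"]))
print(sorted(index["library"]))
```

This is essentially the index at the back of a book turned into a data structure: each term points to the places where it occurs, so that a search never has to read the documents themselves.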

The most important part of the IR system is the search component, which retrieves the information required by the user. A major advantage of modern IR systems over relational databases is that users can submit unstructured queries: instead of using set command lines, Boolean operators or similar standardised codes and expressions, the user can type a query in natural language, or even as a full question.
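The difference between the two kinds of query can be sketched in Python as follows. The tiny index and both function names are hypothetical, and the "best-match" scoring here is a deliberately crude count of shared words, not the algorithm used by any actual search engine.

```python
# Hypothetical postings: term -> ids of the documents that contain it.
index = {
    "first": {1, 2},
    "blogging": {1, 2},
    "platform": {1},
    "guide": {2},
    "library": {3},
}


def boolean_and_not(required: list[str], excluded: list[str]) -> set[int]:
    """Structured query: documents with ALL required terms and NO excluded ones."""
    if not required:
        return set()
    results = set.intersection(*(index.get(t, set()) for t in required))
    for term in excluded:
        results -= index.get(term, set())
    return results


def best_match(query: str) -> list[int]:
    """Unstructured query: rank documents by how many query words they contain."""
    scores: dict[int, int] = {}
    for term in query.lower().split():
        for doc_id in index.get(term, set()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores, key=lambda d: (-scores[d], d))


print(best_match("first blogging platform"))
print(boolean_and_not(["first", "blogging"], ["guide"]))
```

The unstructured query returns every document that shares a word with it, ranked by overlap, so partially relevant results still appear lower down the list. The structured query returns only the documents that satisfy the stated conditions exactly, at the cost of the user having to know how to state them.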

Nevertheless, despite the constant advances in search algorithms used by search engines, promising greater and greater accuracy in using “best-match” search models to improve the quality of IR for unstructured queries, it should still be possible to achieve a greater degree of precision when inputting structured search terms into the same search engines. In the DITA module’s lab session yesterday, I tested this hypothesis by attempting to answer ten questions using first unstructured and then structured queries. To introduce a degree of competition, I used both Google and Bing—the former with the majority of the British search engine market, and the latter its fastest-growing competitor—truly two titans of the Internet Age!

The ten information needs were as follows:

What was the first blogging platform?
What did the first web pages look like?
Who developed WordPress and when?
How do you disable comments on WordPress.com blogs?
What colour is produced with the #330000 HEX value?
Why are they called “Boolean operators”?
Find a WordPress theme for sale that you like or would consider paying for.
Find a high-resolution photograph of an old library which is licensed for reuse.
Find an image of computers in a library, licensed for reuse.
Find an image to illustrate the concept of information retrieval, licensed for reuse.

Note that this includes a range of information needs, as defined by Andrei Broder’s taxonomy (2002), different media types, and a distinction between specific, well-defined queries, and more general ones which require a degree of browsing.

To generate my results, I used the same natural language search terms for both Google and Bing, before using a range of structured search techniques (mentioned above) and the search engines’ own “advanced search” tools in an attempt to refine the information retrieved. I measured the precision of the searches by taking the top five results (excluding adverts and sponsored results) and using the standard formula:

Precision (P) = Relevant results (Nr) / Total results (N) (i.e. 5)
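For anyone wishing to repeat the exercise, the calculation is easily expressed in Python. The function name and the sample judgements below are hypothetical and are not my actual results; the relevance of each result still has to be judged by a human.

```python
def precision(relevance: list[bool]) -> float:
    """P = Nr / N for a list of per-result relevance judgements."""
    if not relevance:
        raise ValueError("at least one result is needed")
    return sum(relevance) / len(relevance)


# Hypothetical judgements for the top five results of one query.
judgements = [True, True, False, True, False]
print(precision(judgements))  # 3 relevant out of 5
```

A score of 1.0 would mean that all five of the top results were relevant, and 0.0 that none were.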

This produced the following results (I have not included the structured queries in the table due to space limitations, as they were not the same for both search engines):


It is clear that the structured queries were almost twice as effective as those using natural language only. This was particularly true of the final three queries: these used the image search facility, which is far more reliant on the advanced settings tools to return relevant results. It is also clear that Google has a slight edge over Bing in both unstructured and structured queries (although this may have been skewed by the fact that I habitually use Google and am therefore more accustomed to searching with it). I was surprised by some of the inconsistencies between the two search engines, particularly in regard to the unstructured queries: for example, entering “first blogging platform” into Google in order to establish the identity of the first blogging platform in history produced a reasonable level of precision, whereas Bing returned a series of guides on how to blog for the first time. This situation was reversed for the second information need, concerning the appearance of early web pages. In general, it is also apparent that unstructured queries were most effective when searching for precise information (e.g. the colour produced by the #330000 HEX value), whereas structured queries were far superior for browsing (e.g. a suitable WordPress theme, or freely licensed images to illustrate a particular point).

Finally, to demonstrate the value of this experiment, I conclude this post with a high-resolution, public-domain photograph of a beautiful old library.

Public domain, taken by Emgonzalez and uploaded to Wikimedia Commons.

The Library of Tomorrow

~ thoughts and reflections on the world of Library and Information Science
