* SAA Aug. 2013 – Crowdsourcing

SAA 2013, New Orleans: SESSION 404

Incentivizing Volunteer Workforces
for Crowd-sourced Projects

Session hosts: Richard Marciano & Mark Hedges


Photo: Liana Diesendruck, Richard Marciano, Emily Schultz, Colleen Theisen

John Martinez

1. A framework for Crowd-sourcing in the Humanities

Mark Hedges (KCL)

  • Summary: Crowd-sourcing projects are transforming the ways in which academics collaborate with the broader community, and blurring the boundaries between the spaces inhabited by academic and non-academic communities. The purpose of the Crowd-Sourcing Scoping Study (http://www.humanitiescrowds.org) was to review crowd-sourcing practices in the academic humanities, to assess their impact and development, to consider the motivations and aspects of community among those who participate, and to develop a typology that captures the various approaches that have emerged. Here we focus on the typology, with a view to stimulating discussion of its effectiveness as a conceptual framework for describing, analyzing and planning crowd-sourcing activities.
  • Bio: Mark is Director of the Centre for e-Research, part of the Department of Digital Humanities at King’s College London. His original academic background was in mathematics and philosophy, and he gained a PhD in mathematics at University College London, before starting a 17-year career in the software industry. He now works on a variety of projects related to digital libraries and archives, research infrastructures, and digital preservation, as well as teaching on MA programs at King’s.

2. Community Project Success with the 1940 Census and Beyond

Emily Schultz (FamilySearch)

  • Summary: In 2012, FamilySearch completed a searchable index of the 132 million names in the 1940 US Census, in cooperation with a large community of commercial partners, genealogical societies, and interested individuals. The goal was to complete the project within seven months, but it took just over four. Over 167,000 individual volunteers and over 600 societies participated in the process, which included double-keying and arbitration, with daily online reporting. FindMyPast.com, Archives.com, and ProQuest also participated. FamilySearch International (www.familysearch.org) continues to sponsor hundreds of smaller projects, and produces approximately 1 million indexed records per day. One follow-up project to 1940 is the Immigration/Naturalization Project, which will provide free access to immigrant ancestor records (www.familysearch.org/us-immigration-naturalization/).
  • Bio: Emily Stanford Schultz has worked for FamilySearch for 18 years, most recently as a Senior Project Manager. Her assignments include the Italian Ancestors Project (www.familysearch.org/italian-ancestors), a historic agreement signed with the Italian National Archives in 2011, which will make freely available over 115 million images of civil registration records for the entire country of Italy from 1800 to 1940. Emily also worked on the 1940 Census project in 2012, and has previously managed the Family History Library’s Cataloging Department, cataloged Italian and French records, and served in a variety of other library positions. She has a B.A. in English from Brigham Young University and an M.B.A. from Weber State University. She is married to David W. Schultz, a landscape painter, and has two delightful children, Ethan and Sophie.

3. Incentives for Passive Crowdsourcing in the 1940 US Census

Liana Diesendruck (UI Urbana-Champaign)

  • Summary: The 1940 US Census was composed of millions of spreadsheet-like images, comprising roughly 10 billion units of information (http://www.slideshare.net/NARACAST/ncsa-posterhandout). We will briefly describe our Computer Vision approach to providing initial searchable access to the collection and then focus on the passive crowdsourcing strategies we employ to improve the system’s accuracy over time. The main idea is to get feedback on the quality of the results by keeping track of the users’ interactions with the system. The process starts when the user is presented with a list of possible matches to a query. Information is gathered about the user’s interactions with this list, from clicked results to the amount of time spent on a page. The acquired data provides invaluable feedback on the accuracy of the presented results and is later used to improve the system’s responses.
  • Bio: Liana Diesendruck is part of the Image, Spatial, and Data Analysis (ISDA) group at the National Center for Supercomputing Applications (NCSA) at the University of Illinois. She currently applies her background in computer vision to various projects related to the digital preservation of documents. Recently, she has been working on ways to provide searchable access to the 1940 US Census data without using human transcribers. Liana holds an M.Sc. and a B.Sc. in Computer Science and a B.Sc. in Physics, all from Ben-Gurion University in Israel. Currently her interests lie in applied computer vision, information retrieval, data analysis and visualization, and cognitive systems.

4. NARA crowdsourcing initiatives

John Martinez (Office of Innovation / NARA)

  • Summary: The Citizen Archivist Initiative is an adaptation of the long-standing tradition of crowdsourcing in science. Citizen science projects engage amateurs and nonprofessionals in scientific research, such as reporting bird sightings or categorizing galaxies. The National Archives has used digital technology to engage citizen archivists, who can contribute tags, transcripts, and digital images that increase public access to the records of the Federal Government. The Citizen Archivist Dashboard (http://www.archives.gov/citizen-archivist) is a central location for these crowdsourcing activities. Projects linked to the dashboard include transcribing digitized historic Navy ship logs to improve our knowledge of past environmental conditions, transcribing National Archives documents at various levels of difficulty, uploading images to a Citizen Archivist Flickr group, tagging National Archives records for better online searching, and contributing and editing articles on the Our Archives Wiki.
  • Bio: John Martinez is Director of the Business Architecture, Standards, and Authorities Division of the Office of Innovation at the National Archives and Records Administration (NARA). He holds an M.L.S. from the University of Maryland and a Certificate of Advanced Study in Library and Information Systems from the University of Pittsburgh. Before joining NARA, he worked at the National Security Archive at George Washington University; at NARA he has also worked on the electronic records staff and the policy and planning staff.

5. Do-It-Yourself History

Colleen Theisen (U. of Iowa)

  • Summary: In 2011 the University of Iowa Libraries launched a low-tech transcription crowdsourcing project for Civil War diaries and letters; as word of mouth spread (with some help from Reddit), volunteers made short work of the collection, transcribing all 16,000 pages in just over a year. An expansion of the project, renamed DIY History (http://diyhistory.lib.uiowa.edu/) and built on open-source tools (the Omeka content management system with the Scripto plugin), features a variety of documents including handwritten cookbooks and pioneer-era letters and diaries.
  • Bio: Colleen Theisen is the Outreach and Instruction Librarian for Special Collections and the University Archives at the University of Iowa. She has a B.A. in Art History from the University of Missouri, was a certified art teacher for grades 6-12, and completed her Master of Science in Information at the University of Michigan School of Information in 2011.

6. Citizen-led Crowdsourcing of Archival Material

Richard Marciano (UNC Chapel Hill)

  • Summary: The Cyber-Infrastructure for Billions of Electronic Records (CI-BER) project is a collaborative big data management project based on the integration of heterogeneous datasets and multi-source historical and digital collections, including a place-based, citizen-led crowdsourcing case study of the Southside neighborhood in Asheville, North Carolina. A first-generation open source collaborative mapping environment prototype is currently being developed to support novel “citizen-led crowdsourcing” possibilities for archival material (http://www.citizen-times.com/article/20130804/LIVING/308040026/Project-seeks-recreate-history-Asheville-s-Southside).
  • Bio: Richard Marciano is a professor in the School of Information and Library Science at the University of North Carolina at Chapel Hill (UNC). He directs a number of “big data” collaboratives, and is a 2012 recipient of the JISC Digging Into Data Challenge Grant. Richard holds a B.S. in Avionics and Electrical Engineering, and an M.S. and Ph.D. in Computer Science, and has worked as a postdoc in Computational Geography. He conducted interdisciplinary research at the San Diego Supercomputer Center at UC San Diego for over a decade, working with teams of scholars in sciences, social sciences, and humanities.