The Development of The Scotsman Digital Archive
Overview
The development of this archive has been a major undertaking for The Scotsman Publications Limited, and we believe the quality of the product will stand the test of time. It is also the first time a UK newspaper publisher has offered its entire archive in a fully searchable format to the general public.
Our aim was to offer search functionality that would, to the highest degree possible, overcome potential limitations in the original material and would meet the expectations of our customers.
In developing the archive we have worked very closely with UK Archiving (formerly The Scottish Newspaper Microfilming Unit) and Olive Software. We have also received considerable assistance from a number of libraries in Scotland, in particular Edinburgh City Library, Edinburgh University and St Andrews University.
The Scotsman Publications Limited designed and installed the infrastructure that delivers the digital archive service. The service is hosted on a fully resilient platform using the latest data storage technologies. The application front end is a highly customised version of the Olive application and was designed and developed by our in-house web development team.
Identifying the right technology partner
After conducting considerable research into companies operating in the field of digital archiving, and following a formal tender process, we chose Olive Software as the preferred supplier.
Olive Software have been tremendously supportive in helping us deliver a world-class service.
Further information on Olive Software is available at www.olivesoftware.com.
Sourcing the Microfilm
In parallel with identifying the right technology partner, we began the process of sourcing high quality microfilm. It was important to source the best quality material, as this would determine the eventual quality of the final product.
At this time, we conducted benchmarking analysis on sample reels of existing microfilm. Although this analysis was conducted on a relatively small sample, we decided to look at alternative strategies for securing high quality microfilm, in order to guarantee the best possible product for our customers.
[Illustration: examples of damaged, faded and degraded editions.]
As there was, to our knowledge, no other source of The Scotsman on microfilm, we began to consider quotes for re-filming the entire back-run to 1817 and also for scanning the newspapers direct to digital format. The preservation element of re-filming, along with the technical difficulties associated with digitally scanning the material, swung the decision in favour of re-filming.
This exercise was going to be expensive, but it was necessary to ensure the best quality and most consistent output using the latest film technology and standards.
After conducting considerable research into companies operating in the field of preservation microfilming, and following a formal tender process, we chose UK Archiving as the preferred supplier to re-film The Scotsman from 1817-1950.
On signing the agreement, work began to find the best possible original editions to microfilm. The Scotsman Publications has its own archive, but some of the original material in the archive has degraded to such an extent that sections of the pages have literally disappeared (see the examples of damaged, faded and degraded editions).
We were extremely fortunate in that a number of other organisations also hold complete, or nearly complete, archives of The Scotsman back to the first edition. Discussions were held with Edinburgh City Library, Edinburgh University and St Andrews University to secure their assistance in identifying the best possible originals available and their agreement to microfilm them. This has meant UK Archiving has quite literally compared every single page from every single edition from every volume, from up to four different archives for the whole period being re-filmed – some 600,000 pages between 1817 and 1950.
As well as obtaining the best possible originals, UK Archiving had to ensure they were delivering the highest possible standard of microfilm. Benchmarking tests were conducted to determine the optimum settings for both preservation and digitisation purposes.
All microfilming conformed to the requirements of the Guidelines for Archival Microfilming published by the National Preservation Office in 2000. All film created by UK Archiving observed all appropriate ISO and BSI standards.
As can be imagined, this was a huge logistical exercise and through their skills, experience and sheer effort, the UK Archiving team have played a central role in ensuring that we maintained a total focus on quality in these crucial areas.
Digitising the Microfilm
The microfilm was despatched to Olive Software in batches through the summer of 2004. The material was taken by Olive and scanned into digital format.
There are many technical problems associated with working with old newspapers, and especially with digitising microfilm, including:
- Skewing
- Dots and spots
- Fading/yellowing
- Scratches
- Continuations
- Dense text/ink problems
- Fonts including initial letter font difference/size/capital letters/italics
In order to arrive at the best possible product, many of these challenges had to be overcome. The Olive process is designed to minimise the impact of many of these issues. The process in more detail is as follows:
- Scanning: the scanning process utilises Sunrise® high-end scanning equipment, with each scanner producing an average of 800 frames per hour.
- Tuning: the tuning process is performed on the TIFF files as scanned, creating an exceptionally efficient workflow. This is the process through which the images are corrected, cropped and de-skewed, etc.
- PDF conversion: upon completion of scanning and tuning, the images are converted to PDF for insertion into Olive's proprietary image processing technology – Pipex.
- Zoning: this is the first step in the process of reducing the scan to XML, and is one of the most critical parts of the process. There are two zoning passes, each with a separate and specific function. The first pass creates associations between groups of letters to identify words. The second zoning pass combines the newly recognised words into complete lines, with columnar separation as appropriate.
- Segmentation: the segmentation pass identifies all page objects, elements and entities and establishes the relationship between these elements to each other and in relation to the page. This step, when digitising from microfilm, is critical to ensure quality Optical Character Recognition (OCR), accuracy and overall usability of the digital archive.
- Optical Character Recognition (OCR): because the segmentation engine has already broken the entire publication into small, easy-to-process segments of data, the OCR pass is dramatically enhanced. This is another technological advantage of the Olive process. Many OCR systems attempt columnar or whole-page OCR, which is inefficient and does not allow article-level returns.
- Publishing to XML: this step populates the XML repository with the digitised data.
- Indexing: this step indexes the data to enable searching of information and delivery of accurate results to users.
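The two zoning passes described above can be illustrated with a minimal sketch. This is not Olive's implementation, which is proprietary; it only shows the general idea of grouping letters into words and then words into lines with a column break. The Glyph and Word types, the gap thresholds and the layout helper are all assumptions made for the illustration, and the sketch assumes every letter on a line shares exactly the same baseline value.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the letter
    y: float      # baseline (assumed identical for letters on one line)
    width: float


@dataclass
class Word:
    text: str
    x: float      # left edge of the first letter
    y: float      # baseline
    right: float  # right edge of the last letter


def layout(text, x, y, width=5.0, advance=6.0):
    """Hypothetical helper: place the letters of `text` on one baseline."""
    glyphs = []
    for i, ch in enumerate(text):
        if ch != " ":  # a space leaves a gap but produces no glyph
            glyphs.append(Glyph(ch, x + i * advance, y, width))
    return glyphs


def group_into_words(glyphs, max_gap=3.0):
    """First pass: associate neighbouring letters to identify words."""
    groups, current = [], []
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if current:
            prev = current[-1]
            gap = g.x - (prev.x + prev.width)
            if g.y != prev.y or gap > max_gap:
                groups.append(current)
                current = []
        current.append(g)
    if current:
        groups.append(current)
    return [
        Word("".join(g.char for g in grp), grp[0].x, grp[0].y,
             grp[-1].x + grp[-1].width)
        for grp in groups
    ]


def group_into_lines(words, column_gap=20.0):
    """Second pass: combine words into lines, splitting at column gaps."""
    lines, current = [], []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if current:
            prev = current[-1]
            if w.y != prev.y or w.x - prev.right > column_gap:
                lines.append(" ".join(p.text for p in current))
                current = []
        current.append(w)
    if current:
        lines.append(" ".join(p.text for p in current))
    return lines


# Two columns on the first baseline, one line on the second.
glyphs = (layout("THE SCOTSMAN", 0, 0)
          + layout("EDINBURGH", 200, 0)
          + layout("DAILY", 0, 10))
words = group_into_words(glyphs)
print(group_into_lines(words))
```

In this sketch the wide horizontal gap before "EDINBURGH" is treated as a column boundary, so it becomes a separate line even though it shares a baseline with "THE SCOTSMAN". Real scans would need tolerance for skewed or uneven baselines, which is one reason the tuning step comes first.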
Limitations of the system
We have placed particular emphasis on quality and the ability of our customers to accurately search, locate and read articles from past editions of The Scotsman. However, due to the age of the source material, it is not possible to guarantee that the system will be able to search for and find every word or term that may have appeared in the original newspaper. For example, if the original newspaper page is damaged or degraded (see examples) and no alternatives are available, then the damaged text is lost.
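To see why damaged source text affects search, consider a minimal sketch of a word index. It is an illustration only and does not describe how the archive's indexing actually works; the article identifiers, the sample text and the misread word are invented for the example.

```python
import re
from collections import defaultdict


def build_index(articles):
    """Map each lower-cased word to the set of articles containing it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token].add(article_id)
    return index


def search(index, term):
    """Return the identifiers of articles containing the exact term."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical recognised text for two articles. In the second, a
# degraded page is simulated by a misread letter in "Edinburgh".
articles = {
    "article-1": "Meeting held in Edinburgh yesterday",
    "article-2": "Meeting held in Edinbnrgh yesterday",
}

index = build_index(articles)
print(search(index, "Edinburgh"))  # ['article-1']
print(search(index, "meeting"))    # ['article-1', 'article-2']
```

The second article is found for "meeting" but not for "Edinburgh", because the index can only contain the words as they were recognised from the page.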
Given these constraints, we will continue to focus on improving the service as we move forward and as new technology becomes available to us.

