Documents as Data
As Dr. Snow identified the variables to be processed, DataDoc showed her the frequencies and, where applicable, the sums and means obtained from her 63 million pages. She was able to clean up the sample by selecting on type of source, presence or absence of numeric data, date of report, etc. As she refined the report, she added graphic displays for species of bird, month of onset, weather conditions, geographic location, etc. The incremental process of working with a sample of documents until DataDoc “understood” how to find the variables was interesting, because she could watch Doc get closer to understanding what she wanted from a page. After 20 or 30 instances, the program found relevant values for her variables in a large percentage of the pages. It kept presenting a random selection of the remaining pages for her guidance until she decided she had accomplished the main objectives.
Today, she merely asks to see the current data for her report and scans for anything new, first worldwide, then in areas near Newtopia. She clicks on this week’s published articles on avian influenza and drags the folder to her in-basket for later analysis with the more specialized DataDoc tools that have evolved for scientific articles.
She has lived through several decades of scientific work, and remembers how in college she once abstracted data onto 3 by 5 cards in a library. Later in her career, she used Epi Info and questionnaires created on the spot to abstract data and then produce frequencies and tables. DataDoc is less well defined than rows and columns, but she finds that the ease of looking at the document database from different perspectives more than makes up for the lack of rigid variable definitions. Of course, documents are not life, or even videos of life, but being able to analyze them statistically and graphically is another step closer to doing epidemiologic analysis in real time.