Documents as Data
Last year, Dr. Snow used the new DataDoc facility on the Internet to develop an analytic program that keeps track of references to avian influenza in web pages and news articles. She first entered the search terms that she thought would pinpoint relevant pages, such as “bird flu,” “avian influenza,” and “H5N1,” pulling up 63 million web pages and 9,000 news references. She saved the search and called on DataDoc to help make sense of it. The Doc showed a random selection of the relevant pages and asked her to click on highlighted items that she found particularly relevant and wanted to categorize and count. In a sense, she was assigning variable names to the items. Instead of being identified by their position in the columns of a dataset, however, the values were characterized by their relationship to other items in the web pages. If they were located in tables, the relevant terms were of course at the margins of the table, but other, less structured data were characterized by DataDoc with the help of cluster terms from similar topics on the Internet.
As Dr. Snow identified the variables to be processed, DataDoc showed her the frequencies, sums, and means obtained from her 63 million pages. She was able to clean up the sample by filtering on type of source, presence or absence of numeric data, date of report, etc. As she refined the report, she added graphic displays for species of bird, month of onset, weather conditions, geographic location, etc. The incremental process of working with a sample of documents until DataDoc “understood” how to find the variables was interesting, as she could see Doc get closer to understanding what she wanted from a page. After 20 or 30 instances, the program found relevant values for her variables in a large percentage of the pages. It kept presenting a random selection of the remaining pages for her guidance until she decided she had accomplished the main objectives.
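For readers who think in code, the filter-then-summarize step might look something like the sketch below. DataDoc is imaginary, so everything here is an assumption made for illustration: the record layout, the field names (source, species, month, cases), and the helper functions select and summarize are all hypothetical, and the four sample records are invented stand-ins for extracted pages.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: one dict per page, standing in for whatever a
# DataDoc-like tool might extract. None means no numeric value was found.
pages = [
    {"source": "news", "species": "duck", "month": "Jan", "cases": 12},
    {"source": "news", "species": "chicken", "month": "Jan", "cases": 30},
    {"source": "blog", "species": "duck", "month": "Feb", "cases": None},
    {"source": "agency", "species": "chicken", "month": "Feb", "cases": 18},
]


def select(records, sources=None, numeric_field=None):
    """Keep records from the given source types that have a numeric value."""
    kept = []
    for record in records:
        if sources is not None and record["source"] not in sources:
            continue
        if numeric_field is not None and record[numeric_field] is None:
            continue
        kept.append(record)
    return kept


def summarize(records, category, value):
    """Frequencies of a categorical variable, plus sum and mean of a numeric one."""
    frequencies = Counter(record[category] for record in records)
    values = [record[value] for record in records if record[value] is not None]
    return {
        "frequencies": dict(frequencies),
        "sum": sum(values),
        "mean": mean(values) if values else None,
    }


cleaned = select(pages, sources={"news", "agency"}, numeric_field="cases")
report = summarize(cleaned, category="species", value="cases")
print(report)
```

Changing the category argument to "month" would give the same summary from a different perspective, which is the kind of regrouping the report relies on.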
Today, she merely asks to see the current data for her report and scans for anything new, first in the world, then in areas near Newtopia. She clicks on this week’s published articles on avian influenza and drags the folder to her in-basket for later analysis with the more specialized DataDoc tools that have evolved for scientific articles.
She has lived through several decades of scientific work, and remembers how in college she once abstracted data onto 3 by 5 cards in a library. Later in her career, she used Epi Info and questionnaires created on the spot to abstract data and then do frequencies and tables. DataDoc is less well defined than rows and columns, but she finds that the ease of looking at the document database from different perspectives more than makes up for the lack of rigid variable definitions. Of course, documents are not life, or even videos of life, but being able to analyze them statistically and graphically is another step closer to doing epidemiologic analysis in real time.
1 Comment:
I hope Dr. Snow becomes a role model for all the new public health workers around the world, like her great-great-grandfather, using all these new tools. As a new member of the health staff, I will try to follow in her steps.
One question: where can I meet her?