Monday, July 10, 2006

The Internet as a Platform for Storing and “Merging” Data


We interrupt this ramble about a day in the life of Dr. Susan Snow, epidemiologist of Newtopia, to muse about some of the differences between current computer processing and a totally Internet-based document and data storage system. First, some assumptions:

1. For use in epidemiology (and many other fields), users would have to be authenticated more surely and more conveniently than with the present password gesture, and authentication would have to continue during computer use. Preferably a picture of an impostor would be stored for purposes of prosecution.
2. An Internet provider such as Google would offer to organizations and individuals secure storage of data and documents and interfaces for manipulating the contents—an extension of the already existing Gmail and Google Spreadsheet concepts.
3. Data, once stored, would never be discarded, although revisions would also be stored and applied as needed through an indexing process.

If these bits of magic are granted, then Dr. Snow will be able to work completely from a browser, whether on a computer or a more mobile, cell-phone-like device, and to access her work from anywhere in the world that is sufficiently well connected to the Internet. Her data (meaning data and documents from now on) reside in something like the Google server farm, the hundred-thousand-plus microcomputers linked in parallel that underlie the Google web pages. A user has no way of knowing where his or her data are stored, but, at the right time, the Google database engine can find any particular item and deliver it for use, with due attention to its ownership, provision for backup, virus exclusion, etc.

Suppose, then, that the counties in Newtopia wish to share their data with the Newtopia Department of Health (NDH). In the old days (2006), this required either a central server at NDH, with connection requirements and considerable inflexibility for the individual counties, or periodic shipments of data from each county to NDH for merging with the central database. There was very little tolerance for variation in data format among the counties, and the weekly merging process often required manual intervention and telephone conversations among the participants.

With each dataset stored on the Internet server farm, however, merging can be merely a matter of setting up pointers (with permission) so that the system treats the NDH database for week 24 as the contributions of the counties in Newtopia for week 24, viewed as a whole, with possible overlays of correction from either NDH or a county, the latest of which prevails. Instead of shipping copies of files to NDH, each county provides only a weekly pointer to its database. Alternatively, the system can be set up so that NDH has permission to create a pointer to the county database; such details are left to future negotiation.
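To make the pointer idea concrete, here is a minimal sketch in Python (all names and records invented) of a merged "view" built from references to county datasets, with timestamped correction overlays in which the latest entry prevails:

```python
# The county "databases" for week 24; in the real scheme these would live
# on the server farm and be reached by reference, never copied.
county_week24 = {
    "Adams": [{"case_id": "A-101", "disease": "measles"}],
    "Brown": [{"case_id": "B-202", "disease": "pertussis"}],
}

# NDH stores pointers (here, just dictionary keys), not copies of the data.
ndh_week24_pointers = ["Adams", "Brown"]

# Corrections from either side, timestamped; the latest prevails.
corrections = [
    {"ts": 1, "case_id": "A-101", "disease": "rubella"},   # county correction
    {"ts": 2, "case_id": "A-101", "disease": "measles"},   # later NDH correction
]

def merged_view(pointers, source, overlays):
    """Resolve the pointers and apply the newest overlay to each case."""
    latest = {}
    for c in sorted(overlays, key=lambda c: c["ts"]):
        latest[c["case_id"]] = c  # later timestamps overwrite earlier ones
    rows = []
    for county in pointers:
        for rec in source[county]:
            fix = latest.get(rec["case_id"])
            rows.append({**rec, **({"disease": fix["disease"]} if fix else {})})
    return rows

print(merged_view(ndh_week24_pointers, county_week24, corrections))
```

Nothing is copied: a county correction shows up in the state view the moment it is stored, and deleting the NDH "database" would delete only pointers.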

If you think this is wild and woolly, let’s follow the process to its ultimate conclusion, in which each individual in society has his or her own database on the Internet, and the database contains the person’s health records. For reasons ranging from voluntary participation to legal requirements, the individual grants NDH access to a portion of the health record. The records are combined by the Internet database engine into what looks like a single database, which is actually only an index to the relevant bits of data scattered throughout the personal databases of the citizens of Newtopia. Obviously, a great deal of social evolution, and whole new legal mechanisms for sharing data, will have to evolve before such a scheme could even be considered. The barriers, however, are more social, educational, legal, and political than they are technical.

Once you have the world’s data stored securely, combining portions of it no longer requires making copies—it is more a matter of manipulating indices (or is it indexes?) that allow the same data to be viewed many ways and to be used for many different purposes.

Sunday, July 09, 2006

Public Health Surveillance

Dr. Snow had done public health surveillance for many years, beginning with the days when disease reports arrived on postcards or by telephone at the Newtopia Health Department, and were further evaluated if they seemed to represent unusual numbers or types of cases. Shoeboxes full of cards were counted by hand at the end of the year for the annual report to the national level.

The reportable disease system functioned in the same way after computers arrived, but counting the reports no longer required hash marks and surveillance secretaries. The list of reportable diseases still comprised 50 to 60 mostly rare conditions that are a tiny fraction of the total disease burden, although their role as preventable and potentially epidemic diseases assures their place in the surveillance system.

Dr. Snow takes her job seriously, and often thinks about the 33,000 deaths per year in Newtopia as her responsibility to prevent. She sees patterns in disease by assigning time windows and applying filters, looking first for patterns of disease and then for risk factors that might be related and could be controlled.

Google Earth has evolved to the point where every building in Newtopia is visible, now that airliners carry cameras that feed the digital world image. Of course, data on disease are confidential, and she treats geographic coordinates, names, and addresses with respect. Data from death certificates are less sensitive in most states.

This morning Dr. Snow reviews deaths and hospitalizations for the past day, week, and month by applying filters of different colors and sizes that she has previously configured. The display is Google Earth showing Newtopia, but floating above the earth are "buildings" of various sizes made in Google SketchUp, but configured programmatically. The volume of a building represents the number of cases or episodes, and its height represents a rate (cases per population specified by the population filter). There is a very tall, thin building over a town called Haight's Corners (pop. 58) where one death occurred, but a more substantial structure hovers over Newtopia's capital, with a population of 1 million and many deaths per day.
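The geometry has a pleasant consequence worth working out: if volume encodes the case count and height encodes the rate (cases per population), the footprint area comes out proportional to the population itself. A sketch, with the capital's daily death count an invented figure:

```python
import math

def building_dimensions(cases, population, scale=1.0):
    """Volume encodes the case count and height encodes the rate
    (cases/population), so the square footprint is proportional
    to the population: area = volume / height = population."""
    rate = cases / population
    height = rate * scale
    volume = cases * scale
    footprint_area = volume / height          # equals the population
    side = math.sqrt(footprint_area)
    return height, side

# Haight's Corners: 1 death among 58 people -> very tall, very thin.
h1, s1 = building_dimensions(cases=1, population=58)
# The capital: say 40 deaths among 1,000,000 -> short but massive.
h2, s2 = building_dimensions(cases=40, population=1_000_000)
print(f"Haight's Corners: height={h1:.4f}, side={s1:.1f}")
print(f"Capital:          height={h2:.6f}, side={s2:.1f}")
```

The same encoding thus makes small-population flukes visually loud (tall and thin) without letting them dominate by bulk.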

The display is clearest when a single set of conditions such as "Head Injury" and "past 3 days" is selected, but the buildings can also assume colors and textures to display more than one condition. On one side of the computer screen is a control "building" for choosing scale and changing the significance of the dimensions and colors. Data sources on hospital admissions, emergency visits, deaths, and insurance claims are shown as choices. In the adjacent piece of "land", one can choose filters for the denominator population to represent particular age and sex groups, smokers, people with credit cards, telephones, etc. For many of the factors, estimates are only available for large areas, but the program does the best it can to estimate. New databases are discovered frequently and added to those available as "risk factor filters."

The top of each building has a statistical "tower" representing the height at the 95 or 99 percent upper confidence limit, with the corresponding lower limit that far below the top. Various other options are available for the statistical displays.
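For a tower like the one over Haight's Corners, the limits for a single observed death can be approximated with Byar's formula for Poisson counts, a standard epidemiologic shortcut; the sketch below (my own code, not the system's) converts the count limits to per-person rates:

```python
import math

def poisson_limits(k, z=1.96):
    """Byar's approximation to the Poisson confidence limits for an
    observed count k (z = 1.96 gives roughly a 95% interval; the
    approximation is rough for very small counts)."""
    lower = 0.0 if k == 0 else k * (1 - 1/(9*k) - z/(3*math.sqrt(k)))**3
    kk = k + 1
    upper = kk * (1 - 1/(9*kk) + z/(3*math.sqrt(kk)))**3
    return lower, upper

# The single death over Haight's Corners (pop. 58): the count's limits,
# divided by the population, set the tower's top and bottom as rates.
lo, hi = poisson_limits(1)
print(f"count limits: {lo:.3f} to {hi:.3f}")
print(f"rate limits per person: {lo/58:.4f} to {hi/58:.4f}")
```

The wide interval around a count of 1 is exactly why the tall thin building needs its tower: the display keeps a lone death from looking like a confident epidemic.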

Dr. Snow sends a couple of views to her assistant for further investigation, but today there is nothing dramatic in the surveillance overview.

Saturday, July 08, 2006

Documents as Data

Last year, Dr. Snow had used the new DataDoc facility on the Internet to develop an analytic program that keeps track of references to avian influenza in web pages and news articles. She first entered the search terms that she thought would pinpoint relevant pages, such as “bird flu”, “avian influenza,” H5N1, etc., pulling up 63 million web pages and 9000 news references. She saved the search and called on DataDoc to help make sense of it. The Doc showed a random selection of the relevant pages, and asked her to click on highlighted items that she found particularly relevant and wanted to categorize and count. In a sense, she was assigning variable names to the items. Instead of being identified by their position in columns of a dataset, however, the values were characterized by their relationship to other items in the web pages. If they were located in tables, of course, the relevant terms were at the margins of the table, but other, less structured data were characterized by DataDoc with the help of cluster terms from similar topics on the Internet.

As Dr. Snow identified the variables to be processed, DataDoc showed her the frequencies and/or sums and means obtained from her 63 million pages. She was able to clean up the sample by selecting for type of source, those with and without numeric data, date of report, etc. As she refined the report, she added graphic displays for species of bird, month of onset, weather conditions, geographic location, etc. The incremental process of working with a sample of documents until DataDoc “understood” how to find the variables was interesting, as she could see Doc get closer to understanding what she wanted from a page. After 20 or 30 instances, the program found relevant values for her variables in a large percentage of the pages. It kept presenting a random selection of the others for her guidance until she decided she had accomplished the main objectives.
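A toy stand-in for what DataDoc "understood" might look like the following, where the learned extraction rule is reduced to a single regular expression over invented news snippets (real pattern learning from 20 or 30 clicked examples would be far richer than this):

```python
import re

# Invented page snippets standing in for the 63 million search results.
pages = [
    "Officials reported 12 cases of H5N1 in poultry near the lake.",
    "Weather delayed testing; no avian influenza update today.",
    "A farm confirmed 3 cases of bird flu among ducks.",
]

# The "learned" rule: a count followed by one of the flagged disease terms.
pattern = re.compile(
    r"(\d+)\s+cases?\s+of\s+(H5N1|bird flu|avian influenza)",
    re.IGNORECASE,
)

counts = []
for page in pages:
    m = pattern.search(page)
    if m:
        counts.append(int(m.group(1)))  # the numeric value of the variable

print("cases found per matching page:", counts, "total:", sum(counts))
```

Pages the rule cannot parse (the middle snippet here) are the ones a DataDoc-like system would keep presenting for the user's guidance.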

Today, she merely asks to see the current data for her report, and scans to see if there is anything new, first in the world, then in areas near Newtopia. She clicks on this week’s published articles on avian influenza and drags the folder to her in-basket for later analysis, using more specific tools of DataDoc evolved for scientific articles.

She has lived through several decades of scientific work, and remembers how in college she once abstracted data onto 3 by 5 cards in a library. Later in her career, she used Epi Info and questionnaires created on the spot to abstract data and then do frequencies and tables. DataDoc is less well defined than rows and columns, but she finds that the ease of looking at the document database from different perspectives more than makes up for the lack of rigid variable definitions. Of course, documents are not life, or even videos of life, but being able to analyze them statistically and graphically is another step closer to doing epidemiologic analysis in real time.

Thursday, July 06, 2006

An Internet Epidemiologist’s Day at Work

Dr. Susan Snow is an epidemiologist in the Newtopia State Health Department, in an imaginary province somewhere with good Internet connections.

Signing In—Security, Authentication, and All That
The health department, like many public buildings, requires identification for entry, but this is done painlessly with face recognition software, so that the security guard functions as receptionist and problem-solver rather than making decisions on identity from easily forged ID cards. In this health department, the recognition software functions like that being installed in Automatic Teller Machines (ATMs), with 3-dimensional images supplemented by infrared facial temperature measurements.
http://www.guardia.com/index.htm

Dr. Snow sits down at her computer desk. A few years ago, she had to sign into her computer with a password, but recently, a USB camera on her computer feeds her facial image to a program that identifies her and logs into her user account. If she leaves the computer, it recognizes her absence and logs off; other authorized persons, such as her assistant, who happen to sit down at the keyboard are logged into their own accounts.
http://www.pcmag.com/article2/0,1895,1945346,00.asp


The result is that she is verified as an authorized member of the Newtopia Health Department and her identity is recorded by each Internet program that she uses.
She is responsible and accountable for her use of the Internet, and those without such validation do not have access to the portion of the Net used for Health Department business. The pleasant part for her is that she does not have to give a password for each program she uses, and she can move without impediment from function to function on the Internet.

When she travels to epidemic investigations or meetings with her laptop, a similar device allows her to use her face rather than a password for identification, and, even if she uses someone else’s computer or one in an Internet café, she only needs to plug in the camera device to have full access to her normal facilities.

The Working Environment
Her work is completely on the Internet, using public programs offered through the browser. A few years ago, she became used to Google searches, Gmail, and Hotmail, and she began to use the Internet as a storage place for documents that she would email to herself. Now, however, she works in an Internet environment that offers word processing, document sharing, and email seamlessly—doing her word processing in the email environment, or email in the document environment. The only difference is what happens to the documents when they are created. All go to her “file cabinet” and some are shared with various email contacts, with appropriate help for locating addresses at the time it is needed. Spreadsheets are merely another kind of document that can be shared with others.

All the work during a given day is indexed and saved on the Internet, much as in Blogger or similar programs. Dr. Snow can choose blocks of material to archive or delete, apply labels, publish to selected audiences, or send to others for review. As she works, relevant searches pop up beside the page, and she can pull references from one area to another. Some of the search results offer contact with expert colleagues in various parts of the world, and she can choose to contact them by clicking for voice or message contact. She gave up her telephone several years ago, and now controls voice contact by clicking, with appropriate clues about time zones, messages, etc. Telephone numbers are no longer needed, since email addresses are easier to use and serve the same purpose. The same camera that provides authentication also serves to add video to voice contact when desired.

She sometimes recalls the days when phone numbers had to be maintained along with Internet addresses, and she had a system in her head for remembering the passwords for 10 or 20 programs, credit card accounts, airline ticket vendors, and others who tried to offer their users security through the use of passwords and conflicting rules for constructing them. In a given day, she entered a password 10 or 20 times, and at least once or twice had it rejected insultingly (“Have you forgotten your password?” … “NO, YOU IDIOT, I just typed it wrong, or I’m in the wrong program, or I just don’t want to think about your stupid thirty-dollar website service! Can’t you just tell me it’s the wrong password instead of making assumptions about my mental state?”). Of course, like everyone else, she had been forced to share passwords with colleagues so that they could access her computer when she was unexpectedly at home with a sick child. She is grateful for the face recognition method that allows access to the Net from anywhere, and incidentally has reduced the amount of erectile-aid and just-for-you spam to the vanishing point. Some major virus purveyors, fraud artists, and company saboteurs have been caught through the use of photos recorded when they accessed the Internet.






Next:

Processing documents…

Monday, July 03, 2006

A Few Thoughts on the Internet

Back somewhere in the mid 90's I tried to run a brainstorming session by proposing that every citizen would have his or her own storage on the Internet, starting with a birth certificate and maybe baby pictures as the first contents. The birth certificate would have official status and could not be altered by the owner, although he would have control over who could see it, other than officially authorized agencies. Later on, school and health records would be managed the same way. Other stuff in the individual data space would be more like what is normally created for work, play, and communication, and completely under the control of its owner. Money could be digitized, managed, stored, and exchanged through each individual's account. With proper hardware, I would not need to carry house or car keys to unlock doors, and, of course, access to my computer and its Googlet would be easy, secure, and geographically unlimited.

Nobody seemed to respond to what I thought was an obvious trend, perhaps because we couldn't do much either to make it happen or to prevent it. But Google is finally making Internet storage happen (in Gmail, Spreadsheet, and Writely), and it will be great to see how much progress can be made as the Net years go by.

Obvious problems stand in the way—like authentication, confidentiality, authorization, etc., confounded by the fact that our current government feels it is outside the reign of law with regard to private communication and storage. The technical problems may require reinventing the Internet. In my uneducated opinion, the current Internet has at least the following problems:
  1. It is designed for anonymity and therefore for crime—a Piccadilly Circus with invisible orators, a nation of automobiles without license plates, houses without addresses, bank visitors wearing masks, unlisted phone numbers, and—except for Google's intervention—speakers without priority, publications without review, and other chaos. In other fields of endeavor, there is a way to trace and evaluate participants, and this has a dampening influence on crime. Imagine invisible pickpockets, burglars who are not limited by walls or windows, etc.
  2. The generally available Internet is not fast enough or reliable enough to support heavy use of remote storage, although software that operated a constant trickle of communication rather than batch-mode transmission would help.
  3. Browsers and browser languages are designed to prevent, rather than encourage, simultaneous storage of information on a local computer and a remote server. Servers are rarely designed to have their tasks performed locally on the user's computer in case they are unavailable.
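The "perform the server's tasks locally when it is unavailable" idea in problem 3 can be sketched as a write-through store with an offline queue; class and method names here are mine, purely illustrative:

```python
class MirroredStore:
    """Every write lands in local storage immediately; writes made while
    offline are queued and replayed to the server on reconnection."""

    def __init__(self):
        self.local = {}      # always available
        self.remote = {}     # stands in for the server
        self.pending = []    # keys written while offline
        self.online = True

    def put(self, key, value):
        self.local[key] = value          # local write always succeeds
        if self.online:
            self.remote[key] = value
        else:
            self.pending.append(key)     # replay later

    def reconnect(self):
        self.online = True
        for key in self.pending:         # flush the offline queue
            self.remote[key] = self.local[key]
        self.pending = []

store = MirroredStore()
store.put("report-week24", "12 cases")
store.online = False                     # the connection drops...
store.put("report-week25", "9 cases")    # ...but work continues locally
store.reconnect()
print(sorted(store.remote))              # both keys now on the "server"
```

Real synchronization would also need conflict handling when two parties edit the same record offline, which is precisely the kind of thing current browsers give no help with.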

So, for those having some influence over the next or some future Internet, I hope it will:

  1. Not be possible to sign on without leaving some definite piece of identification—DNA, for example, combined with a couple of other items, like retinal, fingerprint, or iris images. Using several characteristics is likely to work better than just one, and offers the advantage of changing measurements in the future, or when problems occur. No more anonymous participation.
  2. Be faster than greased lightning and universally available (more so in rural areas, to encourage dispersing populations).
  3. Have human language translation built in.
  4. Have computer languages and operating systems designed to erase the difference between local and remote storage, and to handle times when there is no connection.
  5. As an alternative to 4, perhaps it could just have a browser and no other operating system. I once had a computer with Pascal for an operating system; why not one with just a browser?
  6. Have storage designed around the kind of permissions normally provided by school and health records, criminal records, land deeds, banking, and investment, as well as easy access to one's own diverse stuff.

    Corollary: A hardware ID device would fit in a pocket, or be part of a cell phone. It would assure identification of its owner and broadcast the OK signal to nearby receivers. Unlike current devices, it would be issued to one person only, and would not respond to others. It would be like a key that only works when in the grasp of its certified owner. It would only be available from a totally reliable source such as (ha, ha) a state agency that could customize it to identify its owner. I hope to come back to this subject later.

Sunday, July 02, 2006

Future Needs and Possibilities

Linking of Individual Data from Many Sources
Most data of interest to public health agencies today is managed by and stored in institutions. Birth and death data are stored in databases at the county and/or state level, medical records in a variety of hospitals and clinics, and funding records with insurance companies, Medicare, and Medicaid. Injury data may be in emergency medical services records, police reports, and emergency rooms. Home care, prescriptions, exercise, occupational data, immunizations, school records, and other areas related to risk factors each have their own databases. For purposes of clinical care, each institution usually puts the whole mess together by asking the patient or patient’s family, and perhaps sending for a few relevant records. Only the most compulsive and organized patient possesses a coherent copy of all the records.

Linking even such simple records as birth and death certificates is an elaborate matching exercise, made more difficult in the US by the lack of a useful identifying number, or prohibitions on the use of the Social Security Number, which is the closest to a useful gesture in that direction. Efforts to establish a health identification number such as those in European countries have been blocked by the belief that somehow this will impinge on privacy or other less-well-defined “American” values. Fortunately, the ethical discussions may well be circumvented by bioidentification techniques that will soon be used in banks and ATMs to produce the equivalent of a unique identification number from facial images or other biologic measurements.
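The "elaborate matching exercise" reduces, in the simplest case, to building a key from whatever fields the two files share. A sketch with invented records, matching on a normalized name plus date of birth (real linkage software adds phonetic codes and probabilistic weights; this only shows why the lack of a unique identifier makes the exercise elaborate at all):

```python
def link_key(record):
    """Normalize name spelling and pair it with date of birth."""
    name = record["name"].strip().lower().replace(".", "")
    return (name, record["dob"])

births = [
    {"name": "John A. Smith", "dob": "1931-05-02"},
    {"name": "Mary Jones",    "dob": "1928-11-17"},
]
deaths = [
    # Same person, differently punctuated and capitalized.
    {"name": "john a smith", "dob": "1931-05-02", "cause": "I21"},
]

# Index births by key, then look each death up in the index.
birth_index = {link_key(b): b for b in births}
linked = [(birth_index[link_key(d)], d)
          for d in deaths if link_key(d) in birth_index]
print(f"linked {len(linked)} of {len(deaths)} death records")
```

A misspelled surname or a transposed birth date breaks this deterministic key entirely, which is where the manual intervention (or a unique identifier) comes in.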

Opportunities in epidemiologic computing have so far been limited by the laborious process of abstracting and digitizing data, and by difficulties in linking individual data across time and place. Future public health epidemiology will advance to the extent that more dense and more accurate current data become available in digital format.

At some future time, every citizen may have his or her own private database that would include birth and health records, just as many people now have repositories of family photos or previous emails. If so, epidemiologic work could be done through voluntary or legal access to such records, and retrieval of each person’s records from a variety of institutions should not be necessary. Of course, gaining access to the individual’s database would require radical evolution of present laws and customs, with technical problems being the smallest part of the challenge.

Aggregating and Interpreting Data
Imagine a future epidemiologic computing system operating on the Internet and capable of abstracting data from web pages, xml, or similar materials with the help of metadata describing the "meaning" of each system's data. The data might be from individual health departments or clinical facilities, or eventually from individual health databases. Using the classical triad of Time, Place, and Person (or Time, Place, and Everything Else), the user would access maps to describe a geographic area, and then select a population within that area by age, sex, and/or a personal factor of interest. A timeline would allow following the selected population through time, either with animation or frame by frame. Outcomes could be shown in separate frames for presence or absence, and the frames would be compared statistically for risk ratios or odds ratios as appropriate.
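The frame comparison at the end comes down to a 2×2 table. A sketch with invented counts (the future system would fill these cells from the population and outcome frames the user has selected):

```python
def risk_ratio(a, b, c, d):
    """a/b: outcome present/absent among the exposed;
    c/d: outcome present/absent among the unexposed."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Cross-product ratio of the same 2x2 table."""
    return (a * d) / (b * c)

# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed
# developed the outcome during the selected time window.
a, b, c, d = 30, 70, 10, 90
print(f"risk ratio = {risk_ratio(a, b, c, d):.2f}")
print(f"odds ratio = {odds_ratio(a, b, c, d):.2f}")
```

The risk ratio suits the cohort-style animation (following a population forward through the timeline), while the odds ratio is the natural output when the frames are assembled case-control style.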

Note that such a system would not be simply three-dimensional, as the user has to specify "Time of what?" (onset, exposure, etc.) and that exposure often has duration as well as type and intensity. Similarly, "Place of what?" is a complex question, since residence, work, recreation, shopping, entertainment, school, and even the source of delivered products might be specified. Choosing from a menu of risk factors for Person would seem more natural, since epidemiologists are used to considering a variety of behavioral, environmental, and host factors in this category.

Although medical diagnoses and laboratory values are fairly well defined and recorded in medical records, much progress could be made in the definition and recording of even the better-known risk factors such as smoking, alcohol, and sexual patterns. Even with perfect linking of digital medical data over time and place, there will be enormous gaps in behavioral data as we move the controls of future analytic software over time, place, and person.

An area of rapid development is genetic sequencing. Will future systems be able to work with the presence of particular genes and ultimately with the entire genome of each subject? Imagine future conferences in which audience questions inquire about whether the data are "genome standardized" as well as "age standardized". Considering the large genetic component in chronic diseases and the rapid progress of bioinformatics, genetic epidemiology could become a major focus in the near future.

Provision of Informatics Tools
Public health computing has so far made use of both free and commercial software, with smaller agencies and remote sites preferring free software such as Epi Info, and academic or larger agencies using SAS, SPSS, SQL Server, or Oracle. Both types of agencies use Google, PubMed, and other free search engines. Systems that are delivered via the Internet at low or no cost will have an advantage in international health as soon as reliable Internet service is available in remote areas. The use of email and the Internet has revolutionized public health around the world. Epidemiologists who once called colleagues or national agencies on the telephone to ask if they had seen cases of “xyz” now do a search of the Internet and have nearly the same information within minutes that 20 years ago was possessed only by experts. They may locate a specialist half-way around the world who can examine pictures of the condition and answer their questions by email.

Currently there is a trend for even commercial software developers to offer Open Source software in the tradition of Linux, and to depend on support and ancillary products for income. Although epidemiologists are not generally computer programmers, Open Source allows those interested to verify the statistical algorithms and help to pinpoint bugs even if they do not contribute to programming. It also provides a safety net for users if program support is discontinued by the original developers.

Contrary to popular belief, Open Source does not mean dozens of people randomly altering the programs. The original development team of Open Source programs maintains control of successive releases, and "forks" or wildcat versions are strongly discouraged by Open Source tradition. There seems to be little reason why government programs, by law the property of the citizens, should not be released as open source.

Summary
In the 40 years of the computer era, and 20+ years of the microcomputer revolution, epidemiologic computing has progressed to the point where essentially all large datasets are in digital format. It is unusual to find even the smallest health department without a computer. The Internet has played an enormous role, not only in the distribution of software and support, but in providing access to the biomedical literature through search engines—a radical step toward empowering epidemiologists outside of major academic centers. Much remains to be done in several areas:
• Accessing data securely through the Internet, with software to reconcile differences between diverse databases
• Fair resolution of proprietary barriers to having full text of articles and books online at reasonable prices (preferably free)
• Uniform and accurate recording of lifestyle factors over time in digital medical records
• Linking or uniform storage of medical records of individuals over time and place
• Software and statistics capable of dealing with Time, Place, and Person in multidimensional graphic formats that handle missing data and do not mislead the unwary
• Comparison of medical autopsy results with patterns found in digital clinical records to develop accurate digital diagnostic techniques based on patterns in the clinical record
• Incorporation of bioinformatics and genetic results into routine epidemiology
• Recruitment and training of epidemiologists capable of carrying out these and other informatic tasks, and providing support and appropriate environments for their work
• Encouragement of Open Source software, in the scientific tradition of open access to scientific techniques and results

Topic Outline

Here are a few topics I would like to blog about / hear about:

Current All-Internet Systems for Public Health
Examples, with URLs

Dreams/Possibilities/Goals
Participants should be identified and responsible for their contributions
Documents are data and data are documents--the end of database bondage
Every person and organization has web storage
Sharing is done with URLs, and, of course, permissions; no more sending and merging of data
Tools are free and easy to use
Public Health Surveillance is automatic and visual
Searching is part of clinical and public health routine

Emerging Tools, Platforms, Systems

Security--the Chasm in the way of progress—Solutions
Free speech does not mean anonymous speech

Your face, iris, fingerprints, and DNA can start your car, open doors, own your data, and interact on the web--but how? It has to be more fun than passwords and a pound of keys in your pocket.

Coverage--Should the Internet, water, and electricity be fundamental human rights?
How can constant Internet access be assured everywhere?
Until that time, how can systems be designed for intermittent access?

Economics--Who pays for what and how much would it cost?

Apologia and Starting Point

"The Internet and World Health" might seem like a rather large area to take on in a blog. True, the Internet is a large area (6,240 megahits in Google) and World Health is a large area (64.1 megahits), but the intersection between the two is smaller (12 megahits). Adding OR "Public Health" brings us back to 66 megahits, but the lead article is from 2001.

The URL is www.whi-not.blogspot.com. WHI-NOT stands for "World Health and the Internet--Notes for our Time". Lame, but not too hard to remember.

Blogging is easy, and, after 41 years in public health, I don't have much to lose and do not work for a censoring organization, so here goes...