Linking of Individual Data from Many Sources
Most data of interest to public health agencies today is managed by and stored in institutions. Birth and death data are stored in databases at the county and/or state level, medical records in a variety of hospitals and clinics, and funding records with insurance companies, Medicare, and Medicaid. Injury data may be in emergency medical services records, police reports, and emergency rooms. Home care, prescriptions, exercise, occupational data, immunizations, school records, and other areas related to risk factors each have their own databases. For purposes of clinical care, each institution usually puts the whole mess together by asking the patient or patient’s family, and perhaps sending for a few relevant records. Only the most compulsive and organized patient possesses a coherent copy of all the records.
Linking even such simple records as birth and death certificates is an elaborate matching exercise, made more difficult in the US by the lack of a useful identifying number and by prohibitions on the use of the Social Security Number, which is the closest thing to one. Efforts to establish a health identification number such as those in European countries have been blocked by the belief that somehow this will impinge on privacy or other less-well-defined “American” values. Fortunately, the ethical discussions may well be circumvented by bioidentification techniques that will soon be used in banks and ATMs to produce the equivalent of a unique identification number from facial images or other biologic measurements.
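To make the "matching exercise" concrete, the following is a minimal sketch in Python of linking death records to birth records when no shared identifying number exists. It is an illustration only: the field names (`id`, `name`, `dob`, `sex`), the sample records, and the 0.85 similarity threshold are all assumptions, and the approach shown (exact agreement on date of birth and sex, followed by a fuzzy name comparison) is far simpler than the probabilistic linkage methods used in real registries.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and sort name parts so that
    'GONZALES, Maria' and 'Maria Gonzales' compare as equal."""
    cleaned = name.lower().replace(",", " ").replace(".", " ")
    return " ".join(sorted(cleaned.split()))


def name_similarity(a, b):
    """Similarity between 0 and 1 based on matching character runs."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(births, deaths, threshold=0.85):
    """For each death record, pick the best-scoring birth record that
    agrees exactly on date of birth and sex (a simple blocking step)
    and whose name similarity reaches the threshold."""
    links = []
    for death in deaths:
        best, best_score = None, threshold
        for birth in births:
            if (birth["dob"], birth["sex"]) != (death["dob"], death["sex"]):
                continue
            score = name_similarity(birth["name"], death["name"])
            if score >= best_score:
                best, best_score = birth, score
        if best is not None:
            links.append((best["id"], death["id"], round(best_score, 2)))
    return links


# Hypothetical records, invented for illustration.
births = [
    {"id": "B1", "name": "Maria Gonzalez", "dob": "1950-03-14", "sex": "F"},
    {"id": "B2", "name": "John Smith", "dob": "1948-11-02", "sex": "M"},
]
deaths = [
    {"id": "D1", "name": "GONZALES, Maria", "dob": "1950-03-14", "sex": "F"},
    {"id": "D2", "name": "John Smyth", "dob": "1948-11-02", "sex": "M"},
    {"id": "D3", "name": "Robert Lee", "dob": "1950-03-14", "sex": "M"},
]

links = link_records(births, deaths)
print(links)
```

Even this toy version shows why the task is laborious: spelling variants, reordered names, and clerical errors in dates all have to be anticipated, and every threshold trades missed links against false ones.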
Opportunities in epidemiologic computing have so far been limited by the laborious process of abstracting and digitizing data, and by difficulties in linking individual data across time and place. Future public health epidemiology will advance to the extent that more dense and more accurate current data become available in digital format.
At some future time, every citizen may have his or her own private database that would include birth and health records, just as many people now have repositories of family photos or previous emails. If so, epidemiologic work could be done through voluntary or legal access to such records, and retrieval of each person’s records from a variety of institutions should not be necessary. Of course, gaining access to the individual’s database would require radical evolution of present laws and customs, with technical problems being the smallest part of the challenge.
Aggregating and Interpreting Data
Imagine a future epidemiologic computing system operating on the Internet and capable of abstracting data from web pages, XML, or similar materials with the help of metadata describing the "meaning" of each system's data. The data might be from individual health departments or clinical facilities, or eventually from individual health databases. Using the classical triad of Time, Place, and Person (or Time, Place, and Everything Else), the user would access maps to describe a geographic area, and then select a population within that area by age, sex, and/or a personal factor of interest. A timeline would allow following the selected population through time, either with animation or frame by frame. Outcomes could be shown in separate frames for presence or absence, and the frames would be compared statistically for risk ratios or odds ratios as appropriate.
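The final step described above, selecting a population and comparing outcome frames, reduces to a 2x2 table. The Python sketch below shows one way this might look; the record fields (`county`, `sex`, `exposed`, `ill`) and the counts are invented for illustration, and a real system would also need confidence intervals and handling for empty cells, which are omitted here.

```python
def two_by_two(records, exposure, outcome):
    """Count the four cells of a 2x2 table.
    a: exposed and ill      b: exposed and well
    c: unexposed and ill    d: unexposed and well"""
    a = b = c = d = 0
    for r in records:
        if r[exposure] and r[outcome]:
            a += 1
        elif r[exposure]:
            b += 1
        elif r[outcome]:
            c += 1
        else:
            d += 1
    return a, b, c, d


def risk_ratio(a, b, c, d):
    """Risk among the exposed divided by risk among the unexposed.
    Appropriate for cohort-style data; fails if a row is empty or c is 0."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Cross-product ratio, the usual measure for case-control data.
    Fails if b or c is 0."""
    return (a * d) / (b * c)


# Made-up population; every value here is an assumption for the example.
population = (
    [{"county": "A", "sex": "F", "exposed": True, "ill": True}] * 30
    + [{"county": "A", "sex": "F", "exposed": True, "ill": False}] * 70
    + [{"county": "A", "sex": "F", "exposed": False, "ill": True}] * 10
    + [{"county": "A", "sex": "F", "exposed": False, "ill": False}] * 90
    + [{"county": "B", "sex": "F", "exposed": True, "ill": True}] * 5
)

# "Place" and "Person" selection: women living in county A.
selected = [p for p in population if p["county"] == "A" and p["sex"] == "F"]

table = two_by_two(selected, "exposed", "ill")
rr = risk_ratio(*table)
oddsr = odds_ratio(*table)
print(table, round(rr, 2), round(oddsr, 2))
```

Which of the two measures is "appropriate" depends on how the data were gathered, which is exactly the kind of metadata the imagined system would have to carry along with the records.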
Note that such a system would not be simply three-dimensional, as the user has to specify "Time of what?" (onset, exposure, etc.) and that exposure often has duration as well as type and intensity. Similarly, "Place of what?" is a complex question, since residence, work, recreation, shopping, entertainment, school, and even the source of delivered products might be specified. Choosing from a menu of risk factors for Person would seem more natural, since epidemiologists are used to considering a variety of behavioral, environmental, and host factors in this category.
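One way to picture why the system is more than three-dimensional is to sketch a data model in which each person carries several kinds of times and several kinds of places. The Python sketch below is hypothetical throughout: the class names, the `kind` labels, and the region codes are assumptions made for illustration, not a proposed standard.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class TimeSpan:
    kind: str    # "Time of what?", e.g. "exposure" or "onset"
    start: date
    end: date    # same as start for a point event


@dataclass
class Place:
    kind: str    # "Place of what?", e.g. "residence", "work", "school"
    region: str  # hypothetical region code


@dataclass
class Person:
    person_id: str
    factors: dict = field(default_factory=dict)  # behavioral, host factors
    times: list = field(default_factory=list)
    places: list = field(default_factory=list)


def select(people, place_kind, region, time_kind, on_day):
    """IDs of people who have the given kind of place in `region` and a
    time span of the given kind that covers `on_day`."""
    return [
        p.person_id
        for p in people
        if any(pl.kind == place_kind and pl.region == region for pl in p.places)
        and any(t.kind == time_kind and t.start <= on_day <= t.end for t in p.times)
    ]


people = [
    Person(
        "P1",
        factors={"smoker": True},
        times=[
            TimeSpan("exposure", date(2020, 1, 1), date(2020, 12, 31)),
            TimeSpan("onset", date(2021, 2, 1), date(2021, 2, 1)),
        ],
        places=[Place("residence", "R2"), Place("work", "R1")],
    ),
    Person(
        "P2",
        factors={"smoker": False},
        times=[TimeSpan("exposure", date(2020, 5, 1), date(2020, 7, 1))],
        places=[Place("residence", "R1"), Place("work", "R3")],
    ),
]

by_work = select(people, "work", "R1", "exposure", date(2020, 6, 1))
by_home = select(people, "residence", "R1", "exposure", date(2020, 6, 1))
print(by_work, by_home)
```

The same region and the same day select different people depending on whether "place" means workplace or residence, which is the point of the questions "Time of what?" and "Place of what?".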
Although medical diagnoses and laboratory values are fairly well defined and recorded in medical records, much progress could be made in the definition and recording of even the better-known risk factors such as smoking, alcohol, and sexual patterns. Even with perfect linking of digital medical data over time and place, there will be enormous gaps in behavioral data as we move the controls of future analytic software over time, place, and person.
An area of rapid development is genetic sequencing. Will future systems be able to work with the presence of particular genes and ultimately with the entire genome of each subject? Imagine future conferences in which audience questions inquire about whether the data are "genome standardized" as well as "age standardized". Considering the large genetic component in chronic diseases and the rapid progress of bioinformatics, genetic epidemiology could become a major focus in the near future.
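For readers less familiar with the term, "age standardized" usually refers to a calculation like the direct standardization sketched below in Python: stratum-specific rates are weighted by a standard population so that groups with different age structures can be compared. The strata, counts, and standard population here are invented for illustration; what a "genome standardized" analogue would weight by remains an open question.

```python
def crude_rate(strata):
    """Total cases divided by total population, ignoring age structure."""
    return sum(s["cases"] for s in strata) / sum(s["pop"] for s in strata)


def direct_standardized_rate(strata):
    """Weight each stratum's observed rate by the standard population's
    share of that stratum, then sum."""
    total_std = sum(s["std_pop"] for s in strata)
    weighted = sum((s["cases"] / s["pop"]) * s["std_pop"] for s in strata)
    return weighted / total_std


# Hypothetical study population that is older than the standard population.
strata = [
    {"age": "0-39", "cases": 2, "pop": 2000, "std_pop": 60000},
    {"age": "40-64", "cases": 15, "pop": 3000, "std_pop": 30000},
    {"age": "65+", "cases": 100, "pop": 5000, "std_pop": 10000},
]

crude = crude_rate(strata)
adjusted = direct_standardized_rate(strata)
print(f"crude: {crude * 1000:.1f} per 1,000")
print(f"age standardized: {adjusted * 1000:.1f} per 1,000")
```

In this made-up example the crude rate is much higher than the standardized rate simply because the study population is weighted toward the oldest stratum.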
Provision of Informatics Tools
Public health computing has so far made use of both free and commercial software, with smaller agencies and remote sites preferring free software such as Epi Info, and academic or larger agencies using SAS or SPSS, SQL Server, or Oracle. Both types of agencies use Google, PubMed, and other free search engines. Systems that are delivered via the Internet at low or no cost will have an advantage in international health as soon as reliable Internet service is available in remote areas. The use of email and the Internet has revolutionized public health around the world. Epidemiologists who once called colleagues or national agencies on the telephone to ask if they had seen cases of “xyz” now do a search of the Internet and have nearly the same information within minutes that 20 years ago was possessed only by experts. They may locate a specialist half-way around the world who can examine pictures of the condition and answer their questions by email.
Currently there is a trend for even commercial software developers to offer Open Source software in the tradition of Linux, and to depend on support and ancillary products for income. Although epidemiologists are not generally computer programmers, Open Source allows those interested to verify the statistical algorithms and help to pinpoint bugs even if they do not contribute to programming. It also provides a safety net for users if program support is discontinued by the original developers.
Contrary to popular belief, Open Source does not mean dozens of people randomly altering the programs. The original development team of an Open Source program maintains control of successive releases, and "forks" or wildcat versions are strongly discouraged by Open Source tradition. There seems to be little reason why government programs, by law the property of the citizens, should not be released as Open Source.
Summary
In the 40 years of the computer era, and 20+ years of the microcomputer revolution, epidemiologic computing has progressed to the point where essentially all large datasets are in digital format. It is unusual to find even the smallest health department without a computer. The Internet has played an enormous role, not only in the distribution of software and support, but in providing access to the biomedical literature through search engines, a radical step toward empowering epidemiologists outside of major academic centers. Much remains to be done in several areas:
• Accessing data securely through the Internet, with software to reconcile differences between diverse databases
• Fair resolution of proprietary barriers to having full text of articles and books online at reasonable prices (preferably free)
• Uniform and accurate recording of lifestyle factors over time in digital medical records
• Linking or uniform storage of medical records of individuals over time and place
• Software and statistics capable of dealing with Time, Place, and Person in multidimensional graphic formats that handle missing data and do not mislead the unwary
• Comparison of medical autopsy results with patterns found in digital clinical records to develop accurate digital diagnostic techniques based on patterns in the clinical record
• Incorporation of bioinformatics and genetic results into routine epidemiology
• Recruitment and training of epidemiologists capable of carrying out these and other informatic tasks, and providing support and appropriate environments for their work
• Encouragement of Open Source software, in the scientific tradition of open access to scientific techniques and results