Tuesday, April 20, 2010

HSS5301 Clinical Data De-identification

Clinical Data De-identification

Risk of patient privacy disclosure:

- A by-product of the technological innovations that facilitate the sharing of patient data:

  - WWW

  - EMRS (electronic medical record systems)

  - Increased connectivity between disparate medical institutions, which improves medicine but also raises disclosure risk

HIPAA (Health Insurance Portability and Accountability Act)

- An act passed by the US Congress in 1996

- Standards to protect individuals’ health information

HIPAA: 2 hurdles to de-identify data

- Data must clear one of the two hurdles to be considered de-identified

- Hurdle 1

  - An expert must determine and document ‘that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information.’

- Hurdle 2

  - The data must be purged of a specified list of 19 categories of possible identifiers relating to the patient or to relatives, household members and employers, and of any other information that may make it possible to identify the individual.

Protected health information (PHI)

According to HIPAA regulation, PHI is individually identifiable health information

- Transmitted or maintained in electronic media, or transmitted or maintained in any other form or medium.

- Non-PHI data: when patient data are de-identified by removing all HIPAA-specified patient identifiers, the de-identified data set is no longer considered PHI.

- There are 19 patient identifiers (2008 version)

  - Name

  - All geographic subdivisions smaller than state

  - All elements of dates (except year) for dates directly related to the individual

  - All ages over 89 or dates indicating such age

  - Phone #

  - Fax #

  - Email address

  - Social security #

  - Medical record #

  - Health plan/insurance #

  - Account #

  - Certificate/license #

  - Vehicle #

  - Device ID

  - URL

  - IP address

  - Biometric identifier

  - Full-face photo and comparable images

  - Any other unique identifying number, characteristic or code

Automatic de-identification

- HIPAA requires that de-identified data be verified as de-identified, using either statistical methods or manual review, before they are released.

- HIPAA allows the release of de-identified data:

  - Without obtaining an authorization and

  - Without further restriction on use or disclosure.

- Need for automated de-identification:

  - EMRs are becoming more common and widespread

  - Patient data are becoming increasingly accessible to researchers.

Automated systems:

- current systems: pathology reports, medical DB

- commercial systems: De-id, de-identify

- methods

  - natural language processing (NLP)

    - statistical learning: HMM, SVM, CRF, decision tree

  - Complex rule sets

  - Specialized dictionaries or name lists

The algorithm:

Goal: to find and remove PHI from medical records while protecting the integrity of the data as much as possible.

- problems:

  - Ambiguities: PHI and non-PHI can lexically overlap

    - Huntington can be the name of a disease (non-PHI) or the name of a person (PHI)

  - Out-of-vocabulary PHI

    - PHI can include misspelled and/or foreign words that cannot be found in dictionaries;

    - Ungrammatical text with possible misspellings and arbitrary abbreviations
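
Both problems can be seen in a small sketch. This is not any system's real code: the name list and the `naive_scrub` function are hypothetical, and a real dictionary would be far larger. The scrubber ignores context, so it removes the disease name along with the person's name, and it misses a misspelled name entirely.

```python
import re

# Hypothetical toy name list; assumed for illustration only.
NAME_LIST = {"huntington", "smith"}


def naive_scrub(text):
    """Replace every word found in NAME_LIST, ignoring its context."""
    def replace(match):
        word = match.group(0)
        return "[NAME]" if word.lower() in NAME_LIST else word

    return re.sub(r"[A-Za-z]+", replace, text)


# Ambiguity: the disease name is over-scrubbed together with the surname.
print(naive_scrub("Mr. Huntington has a family history of Huntington disease."))
# Mr. [NAME] has a family history of [NAME] disease.

# Out of vocabulary: the misspelled surname is not in the list and survives.
print(naive_scrub("Seen by Dr. Smtih today."))
# Seen by Dr. Smtih today.
```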

Techniques and methods:

- use of clinical data in standardized format:

  - HL7 messages already provide well-labelled patient identifiers in report headers and lead segments

- use of regular expressions to remove numeric identifiers

  - remove matching strings

  - detect the 3-digit, 2-digit, 4-digit pattern of a social security number

  - detect state names or abbreviations

  - remove all place names, address patterns and references to location

- refer to standard lists to remove proper names

  - list of proper names

  - list of clinical and common-usage words

  - all proper names from an open-source spell-checking dictionary

  - health care provider first and last names in the medical record system

  - All names from the death registry

  - Unified Medical Language System (UMLS)

  - Medical Subject Headings (MeSH) vocabulary

- Search for predictive markers that likely represent proper names.

  - Proper name prefixes such as Mr., Mrs. and Dr.

  - Proper name suffixes such as MD, Jr. and PhD

  - For example, when a common word such as ‘white’ is found, check the words surrounding it. If ‘Mr.’ precedes it, or ‘MD’ follows it, the word ‘White’ is scrubbed

  - Check lowercase/uppercase ratio

    - Typical names either consist of all uppercase letters, e.g. Dr RIGHT, or are capitalized, e.g. Dr Kissinger

  - Check the number of times a term occurs in a document, i.e. term frequency

    - low term frequency tends to indicate some categories of PHI
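
A minimal sketch of three of the techniques above: the social security number pattern, the prefix/suffix markers, and term frequency. The patterns, the `[SSN]` and `[NAME]` placeholders, and the function names are assumptions made for illustration, not the rules of any particular de-identification system. A real rule set would be much larger and would be combined with the name lists.

```python
import re
from collections import Counter

# Social security number: 3 digits, 2 digits, 4 digits, split by "-" or a space.
SSN_RE = re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b")

# A capitalized word that follows a title such as Mr., Mrs. or Dr.
PREFIX_RE = re.compile(r"\b((?:Mr|Mrs|Ms|Dr)\.?)\s+([A-Z][A-Za-z]+)")

# A capitalized word that precedes a suffix such as MD, Jr. or PhD.
SUFFIX_RE = re.compile(r"\b([A-Z][A-Za-z]+)(,?\s+(?:MD|PhD|Jr)\b)")


def scrub(text):
    """Replace SSN-like numbers and marker-flagged names with placeholders."""
    text = SSN_RE.sub("[SSN]", text)
    text = PREFIX_RE.sub(r"\1 [NAME]", text)
    text = SUFFIX_RE.sub(r"[NAME]\2", text)
    return text


def low_frequency_terms(text, max_count=1):
    """Return the words that occur at most max_count times in the text."""
    counts = Counter(word.lower() for word in re.findall(r"[A-Za-z]+", text))
    return {word for word, n in counts.items() if n <= max_count}


note = "Mr. White, SSN 123-45-6789, has an elevated white blood cell count."
print(scrub(note))
# Mr. [NAME], SSN [SSN], has an elevated white blood cell count.
```

The lowercase ‘white’ in ‘white blood cell’ is kept because no marker surrounds it, while ‘White’ after ‘Mr.’ is scrubbed. `low_frequency_terms` only lists candidate words; deciding which rare words are PHI still needs the other checks.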

Evaluation:

- Gold standard: Human review

  - Two phases:

    - Hand-review a portion of the messages before they were scrubbed

    - Hand-review the remaining messages after they were scrubbed

  - To assess whether the results of the evaluation process differed depending on when the review occurred (before scrubbing vs after scrubbing).

Evaluation: totals

- The number of identifiers that the system identified for removal (success)

- The number of instances in which the system failed to remove identifiers (under-scrubbing) (fail)

- The number of instances in which a word (or words) that were not identifiers were removed (over-scrubbing) (fail)

Evaluation: Confusion Matrix

- Standard evaluation in NLP

- Precision

- Recall

- F-measure is the weighted harmonic mean of precision and recall
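
A short sketch of how the three evaluation totals map onto these measures, assuming successes count as true positives, under-scrubbing as false negatives and over-scrubbing as false positives. The function name and the counts in the example are hypothetical.

```python
def scrubbing_metrics(true_pos, false_neg, false_pos, beta=1.0):
    """Return (precision, recall, F-measure) from scrubbing counts.

    true_pos:  identifiers correctly removed (success)
    false_neg: identifiers the system missed (under-scrubbing)
    false_pos: non-identifiers removed by mistake (over-scrubbing)
    beta:      weight of recall relative to precision (1.0 gives F1)
    """
    removed = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / removed if removed else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = beta**2 * precision + recall
    f_measure = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical counts, for illustration only.
p, r, f = scrubbing_metrics(true_pos=90, false_neg=10, false_pos=30)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
# precision=0.75 recall=0.90 F1=0.82
```

Setting `beta` above 1 weights recall more heavily, which may suit de-identification, where a missed identifier is usually the costlier error.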

Dilemma

- The ultimate goal of de-identification software is to scrub true patient identifiers while minimizing over-scrubbing

- A medical report completely scrubbed of not only all patient identifiers but all important medical data as well is of no use to researchers.

Tricky cases:

- some documents typically contain detailed patient historical information (e.g. admission notes)

  - If a history is unique, the identity of the patient could be compromised, especially when coupled with other information

- despite the absence of any HIPAA identifiers, the identities of the patients below are probably readily apparent

  - a ‘former president of the US with Alzheimer’s disease’

  - ‘An HIV-positive, 6’9” black male, former professional basketball player.’
