Heritability is an important statistic for evaluating the genetic contribution to phenotypes. Estimating heritability, however, requires the laborious recruitment of a large number of relatives. Electronic health records (EHR) contain extensive information on relatives in emergency contact forms. Recently, we presented RIFTEHR, an algorithm for extracting relationships from EHR. Here, we present an updated version and reconstruct 4.2 million familial relationships from the latest New York-Presbyterian/Columbia University Irving Medical Center (CUIMC) EHR system. The number of updated relationships is 30 percent greater than in the previous version. We present a new implementation of RIFTEHR that runs in linear time and thus substantially improves the speed of the algorithm. We also present a data encryption method, to protect patient

2018). This study involved 1.9 million subjects in total, far more than any previous single family-based study. This holds the promise that EHR can be used for large-scale recruitment of relatives for genetic studies.

The relationship inference from the electronic health record (RIFTEHR) algorithm extracts familial relationships from EHR (Polubriaginof et al., 2018). It maps emergency contact persons to patients in the hospital database, and infers relationship

Mapping emergency contact to patients

The original Python implementation was described previously (Polubriaginof et al., 2018).
The new implementation preprocesses the data in Python (for example, converting names to lower case) and finds exact matches on first name, last name, phone number, zip code, or their combinations using MySQL. The output of the new implementation is the same as that of the original, including the patient identifier, relationship, emergency contact identifier and the unique match combinations.

Execution time of mapping

Random samples of 0.5, 1, 2 or 4 percent of the 1.6 million emergency contact and 5.8 million patient demographics data from the CUIMC EHR system were selected as input datasets for the algorithm. Each dataset was fed to both implementations. The processor execution time of the Python scripts was measured using the time package, and the processor execution time of each SQL query was reported automatically by MySQL. The times for all scripts and queries were summed to give the total execution time.

Data encryption for patient privacy

The algorithm takes in the identifiable data and outputs the corresponding encrypted patient identifier, first name, last name, phone number and zip code. A table mapping each patient identifier to its encrypted value is also generated, for analyses that need linkage to phenotypes in the EHR. This encryption should be done before running RIFTEHR.
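
The encryption step described above can be illustrated with a minimal Python sketch. The text does not specify the encryption scheme, so the sketch assumes a deterministic keyed hash (HMAC-SHA256) as a stand-in. The names `SECRET_KEY`, `FIELDS`, `encrypt_value` and `encrypt_records`, as well as the record layout, are hypothetical and are not taken from RIFTEHR. The property it is meant to show is that equal plaintext values yield equal encrypted values, so exact matching on names, phone numbers and zip codes can still be carried out on the encrypted data.

```python
import hashlib
import hmac

# Hypothetical secret key (an assumption); a real key would be random and stored securely.
SECRET_KEY = b"replace-with-a-random-secret"

# Assumed field names for one demographics record.
FIELDS = ("patient_id", "first_name", "last_name", "phone", "zip_code")


def encrypt_value(value, key=SECRET_KEY):
    """Return a deterministic keyed hash (HMAC-SHA256) of a normalized string."""
    normalized = value.strip().lower()  # mirrors the lower-case preprocessing
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def encrypt_records(records, key=SECRET_KEY):
    """Encrypt every field of every record.

    Returns the encrypted records and a table mapping each original
    patient identifier to its encrypted value.
    """
    encrypted = []
    id_map = {}
    for record in records:
        enc = {field: encrypt_value(record[field], key) for field in FIELDS}
        id_map[record["patient_id"]] = enc["patient_id"]
        encrypted.append(enc)
    return encrypted, id_map
```

Because the hash is keyed and deterministic, two records sharing a last name and phone number still share those encrypted values, while the mapping table, kept separately, allows encrypted identifiers to be linked back to phenotypes in the EHR.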

Results:

Updated familial relationships

In the following, we refer to the results of Polubriaginof et al. (2018) as the original or version 1, and to the results of this study as the updated or version 2.

We retr...