vignettes/validation.Rmd
Lines changed: 95 additions & 46 deletions
@@ -1,5 +1,5 @@
1
1
---
2
-
title: "Validation tools for identifying and repairing errors in pedigrees"
2
+
title: "Validating and Repairing Pedigree Data with BGmisc"
3
3
output: rmarkdown::html_vignette
4
4
vignette: >
5
5
%\VignetteIndexEntry{Validation}
@@ -18,76 +18,98 @@ library(tidyverse)
18
18
19
19
# Introduction
20
20
21
-
The `BGmisc` R package offers a comprehensive suite of functions tailored for extended behavior genetics analysis, including model identification, calculating relatedness, pedigree conversion, and pedigree simulation. This vignette provides an overview of the validation tools available in the package, designed to identify and repair errors in pedigrees.
22
21
23
-
In an ideal world, you would have flawless error-free pedigrees. However, in the real world, pedigrees are often incomplete, contain errors, or are missing data. The `BGmisc` package provides tools to identify these errors, which is particularly useful for large pedigrees where manual inspection is not feasible.
24
-
While some errors in the package can be automatically repaired, the vast majority require manual inspection. It is often not possible to automatically repair errors in pedigrees, as the correct solution may not be obvious, or may depend on additional information that is not universally available.
22
+
Working with pedigree data often involves dealing with inconsistencies, missing information, and errors. The `BGmisc` package provides tools to identify and, where possible, repair these issues automatically. This vignette demonstrates how to validate and clean pedigree data using `BGmisc`'s validation functions.
25
23
26
-
# Identifying and Repairing Errors in Pedigrees
27
24
28
-
## ID Validation
25
+
# Identifying and Repairing ID Issues
29
26
30
-
One common issue in pedigree data is the presence of duplicate IDs. There are two main types of ID duplication: within-row duplication and across-row duplication. Within-row duplication occurs when an individual's parents' IDs are incorrectly listed as their own ID. Across-row duplication occurs when two or more individuals share the same ID.
27
+
The `checkIDs()` function detects two types of ID duplication:
28
+
29
+
- Between-row duplication: When two or more individuals share the same ID
30
+
- Within-row duplication: When an individual's parents' IDs are incorrectly listed as their own ID
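The two kinds of duplication listed above can be illustrated with a minimal base R sketch. This is not how `checkIDs()` is implemented; the toy data frame, its ID values, and the `dadID` column name are assumptions made for illustration.

```r
# Toy pedigree; the IDs and the dadID column are illustrative assumptions.
ped <- data.frame(
  personID = c(1, 2, 3, 3, 5),
  momID    = c(NA, NA, 2, 2, 5),
  dadID    = c(NA, NA, 1, 1, 1)
)

# Between-row duplication: the same personID appears on more than one row.
non_unique_ids <- unique(ped$personID[duplicated(ped$personID)])

# Within-row duplication: a parent ID equals the person's own ID.
# which() silently drops the NA comparisons produced by missing parents.
own_mother <- which(ped$momID == ped$personID)
own_father <- which(ped$dadID == ped$personID)

print(non_unique_ids)
print(own_mother)
print(own_father)
```

Here the shared ID and the row whose `momID` equals its own `personID` are found with two independent checks, which mirrors why the two error types are reported separately.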
31
+
32
+
33
+
To illustrate `checkIDs()` in action, we will examine a clean example using the Potter family dataset.
31
34
32
-
The `checkIDs` function in BGmisc helps identify both kinds of duplicates. Here's how to use it:
In this example, the `checkIDs` function returns a list with several elements. The `all_unique_ids` element indicates whether all IDs in the dataset are unique. The `total_non_unique_ids` element indicates the total number of non-unique IDs. The `total_own_father` and `total_own_mother` elements indicate the total number of individuals whose father's and mother's IDs match their own ID, respectively. The `total_duplicated_parents` element indicates the total number of individuals with duplicated parent IDs. The `total_within_row_duplicates` element indicates the total number of within-row duplicates. The `within_row_duplicates` element indicates whether there are any within-row duplicates in the dataset. As the output shows, there are no duplicates in the sample dataset.
47
+
The `checkIDs()` function checks for:
45
48
49
+
- Whether all IDs are unique (reported by `all_unique_ids`, which tells you if all IDs in the dataset are unique, and `total_non_unique_ids`, which gives you the count of non-unique IDs found)
50
+
- Cases where someone's ID matches their parent's ID (shown in `total_own_father` and `total_own_mother`, which count individuals whose father's or mother's ID matches their own ID)
51
+
- Total duplicated parent IDs (tracked by `total_duplicated_parents`, which counts individuals with duplicated parent IDs)
52
+
- Within-row duplicates (measured by `total_within_row_duplicates` showing the count and `within_row_duplicates` indicating their presence)
46
53
47
-
### Between-Person Duplicates
54
+
As the output shows, there are no duplicates in our sample dataset.
48
55
49
-
Let us now consider a scenario where there are between-person duplicates in the dataset. The `checkIDs` function can identify these duplicates and, if the `repair` argument is set to `TRUE`, attempt to repair them. In the example below, we have created two between-person duplicates. First, we have overwritten the `personID` of one person with their sibling's ID. Second, we have added a copy of Dudley Dursley to the dataset.
50
56
57
+
## A Tale of Two Duplicates
51
58
59
+
To understand how these tools work in practice, let's create a dataset with two common real-world problems. First, we'll accidentally give Vernon Dursley the same ID as his sister Marjorie (a common issue when merging family records). Then, we'll add a complete duplicate of Dudley Dursley (as might happen during data entry).
If we didn't know to look for duplicates, we might not notice the issue. Indeed, only one of the duplicates was selected as a founder member. However, the `checkIDs` function can help us identify and repair these errors:
88
+
This is where `checkIDs` becomes invaluable:
73
89
74
90
```{r}
75
-
# Call the checkIDs
76
-
result <- checkIDs(df)
77
-
91
+
# Identify duplicates
92
+
result <- checkIDs(df_duplicates)
78
93
print(result)
79
94
```
80
95
81
96
As we can see from this output, there are `r result$total_non_unique_ids` non-unique IDs in the dataset, specifically `r result$non_unique_ids`. Let's take a peek at the duplicates:
82
97
83
98
```{r}
84
-
df %>%
99
+
# Let's examine the problematic entries
100
+
df_duplicates %>%
85
101
filter(personID %in% result$non_unique_ids) %>%
86
102
arrange(personID)
87
103
```
88
104
105
+
89
106
Yep, these are definitely the duplicates.
90
107
108
+
### Repairing Between-Row Duplicates
109
+
110
+
Some ID issues can be fixed automatically. Let's try the repair option:
111
+
112
+
91
113
```{r}
92
114
df_repair <- checkIDs(df, repair = TRUE)
93
115
@@ -100,43 +122,55 @@ result <- checkIDs(df_repair)
100
122
print(result)
101
123
```
102
124
103
-
Great! The function was able to repair the full duplicate, without any manual intervention. That still leaves us with the sibling overwrite, but that's a more complex issue that would require manual intervention. We'll leave that for now.
125
+
Great! Notice what happened here: the function was able to repair the full duplicate, without any manual intervention. That still leaves us with the sibling ID conflict, but that's a more complex issue that would require manual intervention. We'll leave that for now.
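To see why a full duplicate is easy to repair while an ID conflict is not, here is a minimal base R sketch. It is an illustration under stated assumptions, not the logic `checkIDs()` uses internally; the data frame, the ID values, and the `name` column are hypothetical.

```r
# Hypothetical rows: Vernon carries Marjorie's ID, and Dudley appears twice.
ped <- data.frame(
  personID = c(2, 2, 6, 6),
  name     = c("Marjorie", "Vernon", "Dudley", "Dudley"),
  momID    = c(101, 101, 3, 3),
  stringsAsFactors = FALSE
)

# Rows that are identical in every column can be collapsed safely.
ped_dedup <- ped[!duplicated(ped), , drop = FALSE]

# IDs that are still shared after collapsing belong to different people,
# so choosing the correct ID needs information the data does not contain.
remaining <- unique(ped_dedup$personID[duplicated(ped_dedup$personID)])

print(ped_dedup)
print(remaining)
```

The exact copy of Dudley disappears, while the ID shared by two distinct people is still flagged and left for manual review.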
126
+
127
+
128
+
## Oedipus ID
104
129
105
130
106
-
### Handling Within-Row Duplicates
131
+
Just as Oedipus discovered his true relationship was not what records suggested, our data can reveal its own confused parentage when an ID is incorrectly listed as its own parent. Let's examine this error:
132
+
107
133
108
134
Sometimes, an individual's parents' IDs may be incorrectly listed as their own ID, leading to within-row duplicates. The `checkIDs` function can also identify these errors:
109
135
110
136
```{r within}
111
137
# Create a sample dataset with within-person duplicate parent IDs
In this example, we have created a within-row duplicate by setting the momID of Vernon Dursley to his own ID. The `checkIDs` function correctly identifies this error.
148
+
In this example, we have created a within-row duplicate by setting the momID of Vernon Dursley to his own ID. The `checkIDs` function correctly identifies that this error is present.
123
149
124
-
## Verifying Sex Coding
150
+
Automatic repair of within-row duplicates through the repair argument is still under development and will be available in a future version of the package. In the meantime, you can manually inspect and then correct these errors in your dataset.
125
151
126
-
Inconsistent coding of biological sex is a common issue in pedigree data. The `checkSex` function in `BGmisc` is designed to identify and address these errors, particularly inconsistencies where an individual's sex is incorrectly recorded. For instance, it can detect cases where a parent listed as biologically male is erroneously recorded as a mother.
152
+
```{r}
153
+
# Find the problematic entry
127
154
128
-
In genetic studies, we distinguish between biological sex (genotype) and gender identity (phenotype):
- Biological sex (genotype) refers to an individual's chromosomal configuration, typically XX for female and XY for male in humans, though variations exist.
131
-
- Gender identity (phenotype) encompasses a broader, richer, personal, deeply-held sense of being male, female, a blend of both, neither, or another gender entirely.
158
+
There are several ways to correct this issue, depending on the specifics of your dataset. In this case, you could correct the momID for Vernon Dursley to the correct value, resolving the within-row duplicate, likely by assuming that his sister Marjorie shares the same mother.
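One possible manual correction is sketched below in base R. The data frame, the ID values, and the `name` column are hypothetical, and the fix rests on the assumption stated above that Vernon and Marjorie share a mother.

```r
# Hypothetical rows: Vernon's momID was set to his own personID by mistake.
ped <- data.frame(
  personID = c(1, 2),
  name     = c("Vernon", "Marjorie"),
  momID    = c(1, 102),
  stringsAsFactors = FALSE
)

# Keep an untouched copy so the change can be audited or reverted.
ped_backup <- ped

# Borrow the mother's ID from the sibling's record (an assumption).
sister_mom <- ped$momID[ped$name == "Marjorie"]
needs_fix  <- ped$name == "Vernon" & ped$momID == ped$personID
ped$momID[needs_fix] <- sister_mom

print(ped)
```

Restricting the assignment to rows where `momID` still equals `personID` keeps the correction from overwriting a record that was already fixed.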
132
159
133
-
The function can also identify cases where an individual is listed as both a parent and a child in the same pedigree.
160
+
# Identifying and Repairing Sex Coding Issues
134
161
162
+
Another critical aspect of pedigree validation is ensuring the consistency of sex coding. This brings us to an important distinction in genetic studies between biological sex (genotype) and gender identity (phenotype):
135
163
164
+
- Biological sex (genotype) refers to an individual's chromosomal configuration, typically XX for female and XY for male in humans, though variations exist.
165
+
- Gender identity (phenotype) encompasses a broader, richer, personal, deeply-held sense of being male, female, a blend of both, neither, or another gender entirely.
136
166
137
-
### Using the checkSex Function
167
+
The `checkSex` function focuses on biological sex coding consistency, particularly looking for:
168
+
- Mismatches between parent roles and recorded sex
169
+
- Individuals listed as both parent and child
170
+
- Inconsistent sex coding across the dataset
171
+
172
+
Let's examine how it works:
138
173
139
-
The `checkSex` function identifies inconsistencies in sex coding within a pedigree and can optionally repair them based on predefined logic. Below is an example of how to use this function to validate and, if necessary, repair sex coding in a pedigree:
140
174
141
175
```{r}
142
176
# Validate sex coding
@@ -149,9 +183,8 @@ results <- checkSex(potter,
149
183
print(results)
150
184
```
151
185
152
-
In this example, the `checkSex` function checks the unique values in the sex column and identifies any inconsistencies in the sex coding of parents. The function returns a list containing validation results, such as the unique values found in the sex column and any inconsistencies in the sex coding of parents.
153
186
154
-
If the function identifies inconsistent sex codes, you can attempt to repair them automatically:
187
+
When inconsistencies are found, you can attempt automatic repair:
155
188
156
189
```{r}
157
190
# Repair sex coding
@@ -163,7 +196,14 @@ df_fix <- checkSex(potter,
163
196
print(df_fix)
164
197
```
165
198
166
-
When the repair argument is set to `TRUE`, the function attempts to repair the sex coding based on the most frequent sex values found among parents. This approach helps ensure that the sex coding in your dataset is consistent and won't cause recursion.
199
+
200
+
When the repair argument is set to `TRUE`, the repair process follows several rules:
201
+
- Parents listed as mothers must be female
202
+
- Parents listed as fathers must be male
203
+
- Sex codes are standardized to the specified `code_male` and `code_female` values
204
+
- If no sex code is provided, the function will attempt to infer how male and female are coded. The most frequently assigned sex for mothers and fathers will be used as the standard.
205
+
206
+
Note that automatic repairs should be carefully reviewed, as they may not always reflect the correct biological relationships. In cases where the sex coding is ambiguous or conflicts with known relationships, manual inspection and domain knowledge may be required.
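The inference rule described above (take the most frequent sex among mothers and among fathers as the standard) can be sketched in base R. This is an illustration of the idea under stated assumptions rather than the code `checkSex` runs; the toy pedigree, the `dadID` column, and the helper `mode_value()` are hypothetical, and `code_male` and `code_female` are plain local variables here.

```r
# Toy pedigree with three families; person 5 is a mother miscoded as "M".
ped <- data.frame(
  personID = 1:9,
  momID    = c(NA, NA, 2, NA, NA, 5, NA, NA, 8),
  dadID    = c(NA, NA, 1, NA, NA, 4, NA, NA, 7),
  sex      = c("M", "F", "M", "M", "M", "F", "M", "F", "F"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the most frequent value in a vector.
mode_value <- function(x) names(which.max(table(x)))

mom_ids <- unique(ped$momID[!is.na(ped$momID)])
dad_ids <- unique(ped$dadID[!is.na(ped$dadID)])

# Infer the codes from how mothers and fathers are most often recorded.
code_female <- mode_value(ped$sex[ped$personID %in% mom_ids])
code_male   <- mode_value(ped$sex[ped$personID %in% dad_ids])

# Flag parents whose recorded sex disagrees with their role.
mismatch_mom <- ped$personID[ped$personID %in% mom_ids & ped$sex != code_female]
mismatch_dad <- ped$personID[ped$personID %in% dad_ids & ped$sex != code_male]

# Apply the role-based rules to a copy so the original stays reviewable.
repaired <- ped
repaired$sex[repaired$personID %in% mom_ids] <- code_female
repaired$sex[repaired$personID %in% dad_ids] <- code_male

print(mismatch_mom)
print(repaired)
```

Because the standard is inferred by majority vote, a dataset in which most parents are miscoded would be "repaired" in the wrong direction, which is one reason the results deserve a manual check.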
167
207
168
208
<!--
169
209
## Practical Example: Cleaning a Pedigree
@@ -230,6 +270,15 @@ print(final_check)
230
270
```
231
271
-->
232
272
233
-
# Conclusion
273
+
# Best Practices for Pedigree Validation
274
+
275
+
Through extensive work with pedigree data, we've learned several key principles:
276
+
277
+
- Always inspect your data before applying automatic repairs
278
+
- Use `summarizeFamilies()` to get an overview of family structures
279
+
- Keep detailed records of changes made during cleaning
280
+
- Validate after each repair step
281
+
- Create backups before applying repairs
282
+
- Trust your domain knowledge - automatic repairs are helpful but not infallible
234
283
235
-
This vignette demonstrates how to use the BGmisc package to identify and repair errors in pedigrees. By leveraging functions like `checkIDs`, `checkSex`, and `recodeSex`, you can ensure the integrity of your pedigree data, facilitating accurate analysis and research.
284
+
By following these best practices, and leveraging functions like `checkIDs`, `checkSex`, and `recodeSex`, you can ensure the integrity of your pedigree data, facilitating accurate analysis and research.