Commit 025a01e

tidying up the vignette
1 parent e0092ef commit 025a01e

3 files changed

Lines changed: 431 additions & 367 deletions

cran-comments.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -8,7 +8,7 @@ This update adds functionality as well as improves documentation. The package no
88

99
1. Local OS: Windows 11 x64 (build 22635), R version 4.4.1 (2024-06-14 ucrt)
1010
2. **GitHub Actions**:
11-
- [Link](https://github.com/R-Computing-Lab/BGmisc/actions/runs/9555923086)
11+
- [Link](https://github.com/R-Computing-Lab/BGmisc/actions/runs/13376514760)
1212
- macOS (latest version) with the latest R release.
1313
- Windows (latest version) with the latest R release.
1414
- Ubuntu (latest version) with:

vignettes/validation.Rmd

Lines changed: 95 additions & 46 deletions
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,5 @@
11
---
2-
title: "Validation tools for identifying and repairing errors in pedigrees"
2+
title: "Validating and Repairing Pedigree Data with BGmisc"
33
output: rmarkdown::html_vignette
44
vignette: >
55
%\VignetteIndexEntry{Validation}
@@ -18,76 +18,98 @@ library(tidyverse)
1818

1919
# Introduction
2020

21-
The `BGmisc` R package offers a comprehensive suite of functions tailored for extended behavior genetics analysis, including model identification, calculating relatedness, pedigree conversion, and pedigree simulation. This vignette provides an overview of the validation tools available in the package, designed to identify and repair errors in pedigrees.
2221

23-
In an ideal world, you would have flawless error-free pedigrees. However, in the real world, pedigrees are often incomplete, contain errors, or are missing data. The `BGmisc` package provides tools to identify these errors, which is particularly useful for large pedigrees where manual inspection is not feasible.
24-
While some errors in the package can be automatically repaired, the vast majority require manual inspection. It is often not possible to automatically repair errors in pedigrees, as the correct solution may not be obvious, or may depend on additional information that is not universally available.
22+
Working with pedigree data often involves dealing with inconsistencies, missing information, and errors. The `BGmisc` package provides tools to identify and, where possible, repair these issues automatically. This vignette demonstrates how to validate and clean pedigree data using `BGmisc`'s validation functions.
2523

26-
# Identifying and Repairing Errors in Pedigrees
2724

28-
## ID Validation
25+
# Identifying and Repairing ID Issues
2926

30-
One common issue in pedigree data is the presence of duplicate IDs. There are two main types of ID duplication: within-row duplication and across-row duplication. Within-row duplication occurs when an individual's parents' IDs are incorrectly listed as their own ID. Across-row duplication occurs when two or more individuals share the same ID.
27+
The `checkIDs()` function detects two types of ID duplication:
28+
29+
- Between-row duplication: When two or more individuals share the same ID
30+
- Within-row duplication: When an individual's parents' IDs are incorrectly listed as their own ID
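
The two duplication types can be sketched in base R. This is only an illustration on a hypothetical toy data frame (`toy`, with assumed `personID`, `momID`, and `dadID` columns); it is not a description of how `checkIDs()` is implemented.

```r
# Hypothetical toy pedigree; column names are assumed for illustration
toy <- data.frame(
  personID = c(1, 2, 3, 3, 5),
  momID    = c(NA, NA, 2, 2, 5),
  dadID    = c(NA, NA, 1, 1, 1)
)

# Between-row duplication: the same personID appears on more than one row
between_ids <- unique(toy$personID[duplicated(toy$personID)])

# Within-row duplication: a parent ID equals the individual's own ID
# (which() drops the NA comparisons produced by missing parent IDs)
own_mother <- toy$personID[which(toy$momID == toy$personID)]
own_father <- toy$personID[which(toy$dadID == toy$personID)]

print(between_ids)
print(own_mother)
print(own_father)
```

Here ID 3 is shared by two rows, and individual 5 is listed as their own mother; no one is listed as their own father.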
31+
32+
33+
To illustrate `checkIDs()` in action, we will examine a clean example using the Potter family dataset.
3134

32-
The `checkIDs` function in BGmisc helps identify by kinds of duplicates. Here's how to use it:
3335

3436
```{r,checkIDs}
3537
library(BGmisc)
36-
# Create a sample dataset
38+
39+
# Load our example dataset
3740
df <- ped2fam(potter, famID = "newFamID", personID = "personID")
3841
39-
# Call the checkIDs function
42+
# Check for ID issues
4043
result <- checkIDs(df, repair = FALSE)
4144
print(result)
4245
```
4346

44-
In this example, the `checkIDs` function returns a list with several elements. The `all_unique_ids` element indicates whether all IDs in the dataset are unique. The `total_non_unique_ids` element indicates the total number of non-unique IDs. The `total_own_father` and `total_own_mother` elements indicate the total number of individuals whose father's and mother's IDs match their own ID, respectively. The `total_duplicated_parents` element indicates the total number of individuals with duplicated parent IDs. The `total_within_row_duplicates` element indicates the total number of within-row duplicates. The `within_row_duplicates` element indicates whether there are any within-row duplicates in the dataset. As the output shows, there are no duplicates in the sample dataset.
47+
The `checkIDs()` function checks for:
4548

49+
- Whether all IDs are unique (reported by `all_unique_ids`, which tells you if all IDs in the dataset are unique, and `total_non_unique_ids`, which gives you the count of non-unique IDs found)
50+
- Cases where someone's ID matches their parent's ID (shown in `total_own_father` and `total_own_mother`, which count individuals whose father's or mother's ID matches their own ID)
51+
- Total duplicated parent IDs (tracked by `total_duplicated_parents`, which counts individuals with duplicated parent IDs)
52+
- Within-row duplicates (measured by `total_within_row_duplicates` showing the count and `within_row_duplicates` indicating their presence)
4653

47-
### Between-Person Duplicates
54+
As the output shows, there are no duplicates in our sample dataset.
4855

49-
Let us now consider a scenario where there are between-person duplicates in the dataset. The `checkIDs` function can identify these duplicates and, if the `repair` argument is set to `TRUE`, attempt to repair them. In the example below, we have created two between-person duplicates. First, we have overwritten the `personID` of one person with their sibling's ID. Second, we have added a copy of Dudley Dursley to the dataset.
5056

57+
## A Tale of Two Duplicates
5158

59+
To understand how these tools work in practice, let's create a dataset with two common real-world problems. First, we'll accidentally give Vernon Dursley the same ID as his sister Marjorie (a common issue when merging family records). Then, we'll add a complete duplicate of Dudley Dursley (as might happen during data entry).
5260

53-
```{r, repair}
54-
# Create a sample dataset with duplicates
55-
df <- ped2fam(potter, famID = "newFamID", personID = "personID")
5661

57-
# Sibling overwrite
58-
df$personID[df$name == "Vernon Dursley"] <- df$personID[df$name == "Marjorie Dursley"]
62+
```{r datamade}
63+
64+
# Create our problematic dataset
65+
df_duplicates <- df
66+
# Sibling ID conflict
67+
df_duplicates$personID[df_duplicates$name == "Vernon Dursley"] <-
68+
df_duplicates$personID[df_duplicates$name == "Marjorie Dursley"]
69+
# Duplicate entry
70+
df_duplicates <- rbind(df_duplicates,
71+
df_duplicates[df_duplicates$name == "Dudley Dursley", ])
5972
60-
# Add a copy of Dudley Dursley
61-
df <- rbind(df, df[df$name == "Dudley Dursley", ])
6273
```
6374

64-
Now, let's call the `sumarizeFamilies` function to see what the dataset looks like.
75+
76+
If we look at the data using standard tools, the problems aren't immediately obvious:
6577

6678
```{r}
6779
library(tidyverse)
6880
69-
summarizeFamilies(df, famID = "newFamID", personID = "personID")$family_summary %>% glimpse()
81+
summarizeFamilies(df_duplicates,
82+
famID = "newFamID",
83+
personID = "personID")$family_summary %>%
84+
glimpse()
85+
7086
```
7187

72-
If we didn't know to look for duplicates, we might not notice the issue. Indeed, only of the duplicates was selected as are founder member. However, the `checkIDs` function can help us identify and repair these errors:
88+
This is where `checkIDs` becomes invaluable:
7389

7490
```{r}
75-
# Call the checkIDs
76-
result <- checkIDs(df)
77-
91+
# Identify duplicates
92+
result <- checkIDs(df_duplicates)
7893
print(result)
7994
```
8095

8196
As we can see from this output, there are `r result$total_non_unique_ids` non-unique IDs in the dataset, specifically `r result$non_unique_ids`. Let's take a peek at the duplicates:
8297

8398
```{r}
84-
df %>%
99+
# Let's examine the problematic entries
100+
df_duplicates %>%
85101
filter(personID %in% result$non_unique_ids) %>%
86102
arrange(personID)
87103
```
88104

105+
89106
Yep, these are definitely the duplicates.
90107

108+
### Repairing Between-Row Duplicates
109+
110+
Some ID issues can be fixed automatically. Let's try the repair option:
111+
112+
91113
```{r}
92114
df_repair <- checkIDs(df_duplicates, repair = TRUE)
93115
@@ -100,43 +122,55 @@ result <- checkIDs(df_repair)
100122
print(result)
101123
```
102124

103-
Great! The function was able to repair the full duplicate, without any manual intervention. That still leaves us with the sibling overwrite, but that's a more complex issue that would require manual intervention. We'll leave that for now.
125+
Great! Notice what happened here: the function was able to repair the full duplicate, without any manual intervention. That still leaves us with the sibling ID conflict, but that's a more complex issue that would require manual intervention. We'll leave that for now.
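
Why one duplicate can be repaired automatically and the other cannot is easier to see in a small base R sketch. The data frame `toy` is hypothetical, and dropping exact duplicate rows with `duplicated()` is an assumed stand-in for the repair step, not necessarily what `checkIDs(repair = TRUE)` does internally.

```r
# Hypothetical toy data: one ID conflict and one exact duplicate row
toy <- data.frame(
  personID = c(1, 1, 2, 2),
  name     = c("Vernon", "Marjorie", "Dudley", "Dudley"),
  stringsAsFactors = FALSE
)

# Rows that are identical in every column carry no extra information,
# so all but the first copy can be dropped safely
deduped <- toy[!duplicated(toy), ]

# IDs still shared by different people need a manual decision
remaining <- unique(deduped$personID[duplicated(deduped$personID)])

print(deduped)
print(remaining)
```

The exact copy of Dudley disappears, while ID 1 remains shared by two different people. Nothing in the data says which of them should keep it.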
126+
127+
128+
## Oedipus ID
104129

105130

106-
### Handling Within-Row Duplicates
131+
Just as Oedipus discovered his true relationship was not what records suggested, our data can reveal its own confused parentage when an ID is incorrectly listed as its own parent. Let's examine this error:
132+
107133

108134
Sometimes, an individual's parents' IDs may be incorrectly listed as their own ID, leading to within-row duplicates. The checkIDs function can also identify these errors:
109135

110136
```{r within}
111137
# Create a sample dataset with within-person duplicate parent IDs
112138
113-
df <- ped2fam(potter, famID = "newFamID", personID = "personID")
139+
df_within <- ped2fam(potter, famID = "newFamID", personID = "personID")
114140
115-
df$momID[df$name == "Vernon Dursley"] <- df$personID[df$name == "Vernon Dursley"]
141+
df_within$momID[df_within$name == "Vernon Dursley"] <- df_within$personID[df_within$name == "Vernon Dursley"]
116142
117143
# Check for within-row duplicates
118-
result <- checkIDs(df, repair = FALSE)
144+
result <- checkIDs(df_within, repair = FALSE)
119145
print(result)
120146
```
121147

122-
In this example, we have created a within-row duplicate by setting the momID of Vernon Dursley to his own ID. The `checkIDs` function correctly identifies this error.
148+
In this example, we have created a within-row duplicate by setting the momID of Vernon Dursley to his own ID. The `checkIDs` function correctly identifies that this error is present.
123149

124-
## Verifying Sex Coding
150+
Automatic repair of within-row duplicates with `repair = TRUE` is still under development and is planned for a future version of the package. In the meantime, inspect and correct these errors manually in your dataset.
125151

126-
Inconsistent coding of biological sex is a common issue in pedigree data. The `checkSex` function in `BGmisc` is designed to identify and address these errors, particularly inconsistencies where an individual's sex is incorrectly recorded. For instance, it can detect cases where a parent listed as biologically male is erroneously recorded as a mother.
152+
```{r}
153+
# Find the problematic entry
127154
128-
In genetic studies, we distinguish between biological sex (genotype) and gender identity (phenotype):
155+
df_within[df_within$momID %in% result$is_own_mother_ids, ]
156+
```
129157

130-
- Biological sex (genotype) refers to an individual's chromosomal configuration, typically XX for female and XY for male in humans, though variations exist.
131-
- Gender identity (phenotype) encompasses a broader, richer, personal, deeply-held sense of being male, female, a blend of both, neither, or another gender entirely.
158+
There are several ways to correct this issue, depending on the specifics of your dataset. In this case, you could replace Vernon Dursley's `momID` with the correct value, most likely by assuming that he shares a mother with his sister Marjorie. Doing so resolves the within-row duplicate.
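
A manual correction along those lines might look like the sketch below. The toy data frame and its ID values are hypothetical, and the assumption that the siblings share a mother has to come from your own knowledge of the family.

```r
# Hypothetical toy data; IDs are made up for illustration
toy <- data.frame(
  personID = c(1, 2, 3),
  name     = c("Vernon Dursley", "Marjorie Dursley", "Their mother"),
  momID    = c(1, 3, NA),   # Vernon is wrongly listed as his own mother
  stringsAsFactors = FALSE
)

is_vernon   <- toy$name == "Vernon Dursley"
is_marjorie <- toy$name == "Marjorie Dursley"

# Copy the sister's momID, assuming the siblings share a mother
toy$momID[is_vernon] <- toy$momID[is_marjorie]

# Confirm that no individual is listed as their own mother any more
still_own_mother <- which(toy$momID == toy$personID)
print(toy)
print(length(still_own_mother))
```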
132159

133-
The function can also identify cases where an individual is listed as both a parent and a child in the same pedigree.
160+
# Identifying and Repairing Sex Coding Issues
134161

162+
Another critical aspect of pedigree validation is ensuring the consistency of sex coding. This brings us to an important distinction in genetic studies between biological sex (genotype) and gender identity (phenotype):
135163

164+
- Biological sex (genotype) refers to an individual's chromosomal configuration, typically XX for female and XY for male in humans, though variations exist.
165+
- Gender identity (phenotype) encompasses a broader, richer, personal, deeply-held sense of being male, female, a blend of both, neither, or another gender entirely.
136166

137-
### Using the checkSex Function
167+
The `checkSex` function focuses on biological sex coding consistency, particularly looking for:
168+
- Mismatches between parent roles and recorded sex
169+
- Individuals listed as both parent and child
170+
- Inconsistent sex coding across the dataset
171+
172+
Let's examine how it works:
138173

139-
The `checkSex` function identifies inconsistencies in sex coding within a pedigree and can optionally repair them based on predefined logic. Below is an example of how to use this function to validate and, if necessary, repair sex coding in a pedigree:
140174

141175
```{r}
142176
# Validate sex coding
@@ -149,9 +183,8 @@ results <- checkSex(potter,
149183
print(results)
150184
```
151185

152-
In this example, the `checkSex` function checks the unique values in the sex column and identifies any inconsistencies in the sex coding of parents. The function returns a list containing validation results, such as the unique values found in the sex column and any inconsistencies in the sex coding of parents.
153186

154-
If the function identifies inconsistent sex codes, you can attempt to repair them automatically:
187+
When inconsistencies are found, you can attempt automatic repair:
155188

156189
```{r}
157190
# Repair sex coding
@@ -163,7 +196,14 @@ df_fix <- checkSex(potter,
163196
print(df_fix)
164197
```
165198

166-
When the repair argument is set to `TRUE`, the function attempts to repair the sex coding based on repair sex coding based on the most frequent sex values found among parents. This approach helps ensure that the sex coding in your dataset is consistent and won't cause recursion.
199+
200+
When the `repair` argument is set to `TRUE`, the repair process follows several rules:
201+
- Parents listed as mothers must be female
202+
- Parents listed as fathers must be male
203+
- Sex codes are standardized to the specified `code_male` and `code_female` values
204+
- If no sex codes are provided, the function attempts to infer how male and female are coded: the most frequently assigned sex among mothers and among fathers is used as the standard.
205+
206+
Note that automatic repairs should be carefully reviewed, as they may not always reflect the correct biological relationships. In cases where the sex coding is ambiguous or conflicts with known relationships, manual inspection and domain knowledge may be required.
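
The inference rule described above (taking the most frequent sex among mothers and among fathers) can be sketched in base R. The helper `most_frequent()` and the toy data are hypothetical, and the actual `checkSex()` logic may differ in its details.

```r
# Hypothetical toy pedigree with consistent sex coding
toy <- data.frame(
  personID = 1:6,
  momID    = c(NA, NA, 2, 2, 4, 4),
  dadID    = c(NA, NA, 1, 1, 3, 3),
  sex      = c("M", "F", "M", "F", "M", "F"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the most common value in a vector
most_frequent <- function(x) names(which.max(table(x)))

# Sex codes recorded for everyone who appears as a mother or a father
mother_sex <- toy$sex[toy$personID %in% toy$momID]
father_sex <- toy$sex[toy$personID %in% toy$dadID]

inferred_female <- most_frequent(mother_sex)
inferred_male   <- most_frequent(father_sex)

print(c(female = inferred_female, male = inferred_male))
```

A rule like this depends on most parents being coded correctly, which is one reason automatic repairs should be reviewed.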
167207

168208
<!--
169209
## Practical Example: Cleaning a Pedigree
@@ -230,6 +270,15 @@ print(final_check)
230270
```
231271
-->
232272

233-
# Conclusion
273+
# Best Practices for Pedigree Validation
274+
275+
Through extensive work with pedigree data, we've learned several key principles:
276+
277+
- Always inspect your data before applying automatic repairs
278+
- Use `summarizeFamilies()` to get an overview of family structures
279+
- Keep detailed records of changes made during cleaning
280+
- Validate after each repair step
281+
- Create backups before applying repairs
282+
- Trust your domain knowledge - automatic repairs are helpful but not infallible
234283

235-
This vignette demonstrates how to use the BGmisc package to identify and repair errors in pedigrees. By leveraging functions like `checkIDs`, `checkSex`, and `recodeSex`, you can ensure the integrity of your pedigree data, facilitating accurate analysis and research.
284+
By following these best practices, and leveraging functions like `checkIDs`, `checkSex`, and `recodeSex`, you can ensure the integrity of your pedigree data, facilitating accurate analysis and research.
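
The backup, record-keeping, and re-validation practices can be combined in a short base R sketch. The data frame, the manual fix, and the `change_log` layout are all hypothetical. In a real workflow, the re-validation step would call `checkIDs()` or `checkSex()` rather than the hand-written check used here.

```r
# Hypothetical toy pedigree with one individual listed as their own mother
original <- data.frame(personID = c(1, 2, 3), momID = c(1, NA, 2))

# Keep the original untouched and work on a copy
cleaned <- original
cleaned$momID[cleaned$personID == 1] <- NA

# Record which rows changed, with the old and new values
changed <- which(!mapply(identical, original$momID, cleaned$momID))
change_log <- data.frame(
  personID  = original$personID[changed],
  old_momID = original$momID[changed],
  new_momID = cleaned$momID[changed]
)

# Re-validate: nobody should be their own mother after the fix
stopifnot(!any(cleaned$momID == cleaned$personID, na.rm = TRUE))
print(change_log)
```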
