TransWikia.com

Inheritance simulator not working properly

Biology Asked by user1235598399 on July 15, 2021

I created a python program to simulate inheritance (a simplified version of it) to better understand how it works. A person is represented by a class that contains a chromosome property, a method for reproducing with another person, a method for generating gametes (through recombination), and a method for calculating how similar a person is to another person (by comparing the chromosomes).

The problem with my program is that cousins are not as similar to each other as they are supposed to be. My program says that 2 cousins about 75% similar when they should be 12.5% similar. I don’t understand why this is. I think it could be down to 2 things

  1. My similarity function is wrong.
  2. The way I am representing a chromosome is wrong.

Here is the code:

    import random
    
    CHROMOSONE_LENGTH = 1000
    NCHROMOSONES = 22
    
    class Person:
        def __init__(self, chromosones = None):
            self.chromosones = chromosones
            if  chromosones is None:
                self.chromosones = []
                for i in range(NCHROMOSONES):
                    self.chromosones.append ((self.get_random_dna(),self.get_random_dna()))           
    
        def get_random_dna(self):
            #return [random.random() for x in range(CHROMOSONE_LENGTH)]
            return [random.choice(["0","1"]) for x in range(CHROMOSONE_LENGTH)]
            
        def get_gamete_meiosis(self):
            gamete = []
            for i in range(len(self.chromosones)):
                gamete.append([])
                for a,b in zip(self.chromosones[i][0],self.chromosones[i][1]):
                    gamete[i].append(random.choice([a,b]))
            return gamete
    
        def reproduce(self,mate):
            chromosones = []
            my_gamete = self.get_gamete_meiosis()
            mate_gamete = mate.get_gamete_meiosis()
            for i in range(NCHROMOSONES):
                chromosones.append((my_gamete[i],mate_gamete[i]))
            return Person(chromosones)
    
        def get_similarity(self,other):
            sim = 0
            for s,o in zip(self.chromosones,other.chromosones):
                for a,b,c,d in zip(s[0],s[1],o[0],o[1]):
                    if a == c or a == d:
                        sim += 0.5
                    if b == c or b == d:
                       sim += 0.5
           
            return sim/(CHROMOSONE_LENGTH*NCHROMOSONES)
    
    grandmother = Person()
    grandfather = Person()
    child1 = grandfather.reproduce(grandmother)
    child2 = grandfather.reproduce(grandmother)
    grandchild1 = child1.reproduce(Person())
    grandchild2 = child2.reproduce(Person())
    # Should be 12.5%
    print(grandchild1.get_similarity(grandchild2))

A chromosome is represented by an array of 2-tuples. Each tuple represents a chromosome pair. The number of pairs is set by the NCHROMOSONES variable. There are 22 pairs in the above code.
The actual genetic data is represented by an array of 0s and 1s. The number of 0s and 1s in a chromosome is set by the CHROMOSONE_LENGTH variable.

The similarity function takes two people and compares them at each position of each chromosome. If the two people have the same number at a given position, then it adds 0.5 to the sim variable. It then averages this across all positions. The reason I add 0.5 is because two people can either be 0%, 50% or 100% identical at a given position on a chromosome. 0% occurs when neither pair is identical, 50% occurs when 1 pair identical, and 100% occurs when both pairs are identical.

I noticed that as I use more symbols than just 0 and 1 to represent genetic data, the similarity actually approaches 12.5%. But I don’t understand why this should be.

So my basic questions are:

  1. Is my way of calculating the similarity correct.
  2. Why can’t I use just 0 and 1 to represent genetic data?

One Answer

There are a couple of things going on here, I think. Let's start with simple conditional probability:

import random

sim1 = [random.choice([0, 1]) for x in range(1000)]
sim2 = [random.choice([0, 1]) for x in range(1000)]
for locus in range(1000):
   if sim1[locus] ==sim2[locus]:
     match+=1
print(match)
# prints 492

In other words, even two completely independent samples from [0,1] with uniform probability are expected to match about half of the time (492 / 1000 ~ 50%).

So you are correct, the problem lies in your fashion of encoding the data. When you are talking about "similarity", what I think you mean is "identical by descent". This is distinct from "identical by state", meaning what actual alleles you have.

Two individuals can be identical by state at a large number of loci without any significantly close family relationship, if they are members of a population without a lot of common alleles. This is close to what we mean when we say that humans are all 99.9% identical to each other- this is because most of the positions in the genome have very little variation. So, if you choose two people who have no common ancestors in the last 1000 generations, they will still be identical by state for >99% of their loci. This is the real lesson that your simulation is showing you, I think!!

If you want to show identity by descent, you will have to choose a different encoding. Using e.g. nucleotides will still run into the problem that very frequently the nucleotides will be identical by state in unrelated individuals, whereas what you actually care about is "did this allele originate from the exact same ancestor". So you want some way of guaranteeing (or nearly so) that every ancestor contributes a distinctive, unique allele that you can always distinguish. For example, the following code will produce a "unique" identifier that could be used for each founder individual (e.g. every time you call Person():

import uuid 
unique_id = uuid.uuid4()

So really, what your Person.get_random_dna() method should do is something like this (minimal modification to your code):

        def get_random_dna(self):
            unique_id = uuid.uuid4()
            return [unique_id for x in range(CHROMOSONE_LENGTH)]

You may need to fix something else as well, I have not actually run this.

Correct answer by Maximilian Press on July 15, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP