How can I appropriately handle cleaning of gender data?

Question

I’m a data science student and I’ve begun working with an open mental health dataset. As part of this, I need to clean the data so that I can perform an analysis of it.
In this dataset, the gender field is a string that could have had anything entered into it. While cleaning most entries is fairly straightforward (“f”, “F”, “female”, “cis female” and “woman” can all be coded to “F”), what I was wondering about was how to properly handle trans or queer identities (e.g. an entry that says something like “trans female” or “queer/she/they”).
Should I create a new code for trans entries for each gender, or should I just code them as though they were members of the gender they identify as?
Should I just drop them from the dataset entirely, because they might distort it? I remember reading that trans individuals suffer from much higher rates of mental illness than cis individuals.
Are there any best practices that I should follow in this regard?

Carlos Mougan · Answer

It is quite an interesting question. I guess that you can call it "dealing with non-binary gender roles in a binary language" or something like this.

I once did something similar. I created three features:

sex at birth [male,female]
sex identification [male,female]
Attracted sexually to [male,female].

All these features are binary, so you can encode each one as 0 or 1. You can represent most cases with a combination of them; for example, sex at birth = male with sex identification = female will give you a trans person, and sex at birth = male with attracted to = male will give you a gay male.

A decision tree should be able to distinguish information and classify it correctly with this kind of encoding.

You could also do the cartesian product of all features and you will then encode it this way:

for a male [0], born male [0], attracted to male [0]: [000],
for a male [0], born male [0], attracted to female [1]: [001].

Applying one-hot encoding to this will give you 8 features that will include a high percentage of the cases. This encoding will allow trees to distinguish gender with a split and for linear regression to correctly assign weights.
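
As a rough sketch of that cartesian-product encoding (standard library only; the feature names and the bit order identification, birth, attracted-to are my own assumptions, with 0 = male and 1 = female as in the examples above):

```python
from itertools import product

# Hypothetical feature names; 0 = male, 1 = female.
FEATURES = ("identification", "birth", "attracted_to")

# All 2**3 = 8 combinations in a fixed order: (0,0,0), (0,0,1), ..., (1,1,1).
COMBINATIONS = list(product((0, 1), repeat=len(FEATURES)))


def one_hot(identification: int, birth: int, attracted_to: int) -> list[int]:
    """Return an 8-element one-hot vector for one combination of the three bits."""
    index = COMBINATIONS.index((identification, birth, attracted_to))
    return [1 if i == index else 0 for i in range(len(COMBINATIONS))]


print(one_hot(0, 0, 0))  # [1, 0, 0, 0, 0, 0, 0, 0]
print(one_hot(0, 0, 1))  # [0, 1, 0, 0, 0, 0, 0, 0]
```

Each of the 8 positions corresponds to exactly one combination, so a tree can isolate a group with a single split and a linear model can give it its own weight.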

It is true that this is not exact, and you can object to a lot of things. But in the end, when modeling we are making approximations and we are always missing something.

All models are wrong, but some are useful

Let me know if you find something better.

Sammy · Answer

There are at least two general considerations to make:

Domain-related

If an attribute potentially has predictive power in your domain, and more specifically for your task, your models might benefit from a direct encoding. For example: if being trans is correlated with different psychological disorders then I'd include a direct feature for this. This way it is easier for your model to make a prediction, since it does not need to combine two features in the first place (e.g. no need to combine "sex at birth" and "gender identification" to identify a transsexual person, which would not even be accurate, since "trans" is a much broader term than just sex at birth != gender identity).

Moreover, I'd apply the same thinking to other feature engineering questions. Sex has predictive power for many tasks related to mental disorders, e.g. because mood disorders are more common among women and anti-social personality disorders are more common among men. However, whether these relate to the sex at birth or to the gender a person identifies with is another question. So if your hypothesis is that in your task the gender a person identifies with is important then, again, it makes sense to include this in addition to the sex at birth.

Model-related

Different models are able to handle predictors differently. For example, tree-based models can more easily work with two separate attributes sex == female and trans == True to implicitly derive trans female == True. However, linear models such as logistic regression might benefit from having a combined binary feature female trans.
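
To illustrate the combined feature: a purely additive model gives each input one weight, so it cannot express an effect specific to the combination unless that combination is a feature of its own. A minimal sketch with made-up rows (the column names are hypothetical, not from the dataset):

```python
# Hypothetical rows; the column names are illustrative only.
rows = [
    {"female": 1, "trans": 1},
    {"female": 1, "trans": 0},
    {"female": 0, "trans": 1},
    {"female": 0, "trans": 0},
]


def add_interaction(row: dict) -> dict:
    """Return a copy of the row with an explicit female-and-trans indicator."""
    out = dict(row)
    # The product of two 0/1 flags is 1 only when both are set.
    out["female_trans"] = row["female"] * row["trans"]
    return out


augmented = [add_interaction(r) for r in rows]
print([r["female_trans"] for r in augmented])  # [1, 0, 0, 0]
```

A tree can reach the same group by splitting on female and then on trans, which is why the extra column matters less there.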

Joe · Answer

No need to drop from the analysis, for sure.  Your analyses should be capable of classifying by domain, even if you just assign them to a third (or fourth or ...) category.  You'll be basically comparing Female:Not Female, Male:Not Male, etc.; keeping them in the dataset means you have a better result when you're comparing those domains.

The decision you make depends to some extent on what question you are answering in your analysis.  Are you asking questions relating to gender identity?  Are you focused on a specific gender or sex?  Or are you exploring your data and looking to see what factors are important?

If you are focused on one gender identity, say, Female, then you could simply put everyone who is neither cis female nor cis male into a third ("Other") category, for example.  This doesn't give you any information about the trans or otherwise non-cis gendered individuals, but if that's not actually important to your question, then this is the easiest way to handle them.

However, if you are exploring, and as you note in your question you're aware this is a possibly significant factor, then you should classify it - likely as a separate variable.  However, consider how you will perform the analysis when you assign these; you may still want to assign "trans female" as a separate gender, depending on what makes your analysis easier (while still having a trans 1/0 flag variable, or cis 1/0 flag variable, or similar). If you don't have any plans to analyse based on all females (regardless of trans/cis/etc.), then it may be easier to have a separate gender code there to make it easier to analyse rather than having to include the trans/cis flag variable in those analyses.
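
A minimal sketch of the "gender code plus trans flag" idea, keeping the raw string alongside the codes. The word lists and the function name are my own illustrative assumptions; a real dataset needs its own reviewed vocabulary.

```python
import re

# Illustrative vocabularies only; extend them after inspecting the real entries.
FEMALE_TERMS = {"f", "female", "woman"}
MALE_TERMS = {"m", "male", "man"}
TRANS_TERMS = {"trans", "transgender"}


def code_gender(raw: str) -> dict:
    """Map a free-text entry to a gender code plus a trans flag, keeping the raw text."""
    # Split into whole lowercase words so that "female" never matches "male".
    tokens = set(re.findall(r"[a-z]+", raw.lower()))
    is_female = bool(tokens & FEMALE_TERMS)
    is_male = bool(tokens & MALE_TERMS)
    if is_female and not is_male:
        gender = "F"
    elif is_male and not is_female:
        gender = "M"
    else:
        gender = "Other"  # unmatched or ambiguous entries are kept, not dropped
    # trans == 0 only means "not stated in the entry", not "cis".
    return {"raw": raw, "gender": gender, "trans": int(bool(tokens & TRANS_TERMS))}


for entry in ["cis female", "trans female", "queer/she/they", "M"]:
    print(code_gender(entry))
```

Because the raw text is preserved, the coding can be revised later without going back to the source file.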

egg egg · Answer

Gender Analysis is a pretty common trend in data science, especially when it comes to mental health. But breaking it down into categories can be difficult.

I would break it down in to two columns, minimum.

One that is designated as either 'Assigned Male at Birth (AMAB)' or 'Assigned Female at Birth (AFAB)'. This is necessary from a medical standpoint, as some drugs and their side effects act differently depending on the hormones already present in the body. There's also the male study bias, where drugs have often been tested only on men.

Note that the column above may change later into a broader category, depending on how culture adapts to handle intersex individuals.

The second column would have more ambiguous categories, as with the current culture shift people are exploring gender more. It would need to be open to new updates as our culture shifts. Some options for this would be M for man, W for woman, U for unknown, Q for queer, A for agender, F for fluid, etc. One-hot encoding later will help make this easier to 'study'.

It would be handy for a person to know the pronouns of the person they're going to interact with, as well as to study the trends within our culture. So having a field for pronouns would be helpful for data analysis later on, as well.

Jennifer · Answer

I've been sitting on this idea for decades, never sharing it with anyone.  I don't expect it to be accepted by anybody.  But here we go anyhow:

When looking at the question of gender, I came to the conclusion that 8 bits were needed to define gender properly, including groups (which of course can be of both genders) and uncertainties.  The bits are:

NBM (80h) - natural born male

NBF (40h) - natural born female

MTS (20h) - masculinized transsexual

FTS (10h) - feminized transsexual

PNU (08h) - parts now uncertain

PHU (04h) - parts history uncertain

PHC (02h) - parts history certain

PIT (01h) - parts in transition

There's more to this but I won't bother you with further details.

For example: a group of cisgendered men and women would be 0C0h.
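
For anyone who wants to play with the bit layout, here is a small sketch using Python's `enum.IntFlag`. The bit values are copied from the list above; the class name is hypothetical and nothing beyond those eight bits is modelled.

```python
from enum import IntFlag


class GenderBits(IntFlag):
    """Bit values copied from the list above; the class name is hypothetical."""
    NBM = 0x80
    NBF = 0x40
    MTS = 0x20
    FTS = 0x10
    PNU = 0x08
    PHU = 0x04
    PHC = 0x02
    PIT = 0x01


# A group containing both NBM and NBF members sets both bits.
group = GenderBits.NBM | GenderBits.NBF
print(hex(int(group)))          # 0xc0
print(GenderBits.NBF in group)  # True
print(GenderBits.MTS in group)  # False
```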

I came up with this before the era of nonbinaries, which would probably need additional bits.

Naturally, it doesn't include anything about heteros vs. LGB's -- that's a separate discussion.

The reason I'm presenting this is to point out how complex the issue is.

You are free, of course, to dismiss this as the rantings of an old woman.

Geoffrey Brent · Answer

Some considerations here:

How has the data been collected?

If it's self-reporting, it's quite likely that most trans people will simply have replied with "male", "female", or other equivalent terms that give no indication of trans status. If it's reported by others, it's quite likely that the reporter will often not know that the person is trans.

If most of the trans men in your data are indistinguishable from cis men, and similarly for women, then - ignoring non-binary cases for the moment - your categorisation options are:

"Cis men and trans men" vs. "cis women and trans women" (if you map "trans man" to "man", etc.)
"Cis men, most trans men, and some trans women" vs. "cis women, most trans women, and some trans men" (if you map "trans man" to "woman", etc.)

The first of those two seems clearly preferable, IMHO. It might not be the best delineation for every application, but at least it's fairly well defined. The alternative is just vague.

Are your decisions actually going to matter to the results?

It's quite likely that there won't be enough (identifiable) trans and non-binary people for you to get any useful data about "trans men", "trans women", or "non-binary people" as categories. It's also quite likely that these groups will be rare enough that they don't make a big difference to the overall stats for larger categories like "men" and "women", however defined.

If you weren't talking about open-source data, I'd also raise privacy issues with reporting for small sub-populations, but presumably that has already been considered.

What is the point of the analysis?

If you get past the above considerations... how does gender and trans status relate to whatever it is you're trying to understand? This is likely to be relevant to your decisions.

Should I just drop them from the dataset entirely, because they might distort it?

Cis people are likely to have much more influence on your results than trans people. Should we therefore drop cis people from the analysis for fear of distortion?

Trans people are people. If your aim is to produce statistics about "people" overall, then trans people should be included in those statistics. If some trans people are unusual (in whatever way) and this affects the statistics, then the statistics are simply reflecting the fact that some people are unusual.

Youseflapod · Answer

If it is an open mental health data set, then those using it would benefit from having it split into as many potentially relevant categories as possible, since the end user may need to distinguish between those subsets.

In the end, it is easy to merge a data set's categories later, or to keep them as they are.

If the end user wants to combine those data categories, then they can factor them both into the "female" or "male" category, otherwise, don't dilute the data.

Mikesplace · Answer

First I would determine the number of people who fall into male and female. If the number left over is too small to be statistically significant, then those entries would be best discarded. After that, if the "other" group is big enough, maybe split it up, but once again consider whether the split groups are big enough for statistical significance. Otherwise I think you are wasting your time; just keep it as male, female and other.
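
A minimal sketch of the counting step. It folds small groups into "Other" rather than dropping the rows; the threshold of 30 and the example codes are arbitrary assumptions, and the right minimum depends on the analysis planned.

```python
from collections import Counter

# Hypothetical threshold; pick it to suit the tests you intend to run.
MIN_GROUP_SIZE = 30


def collapse_rare(codes: list[str], min_size: int = MIN_GROUP_SIZE) -> list[str]:
    """Replace every category with fewer than min_size rows by "Other"."""
    counts = Counter(codes)
    return [c if counts[c] >= min_size else "Other" for c in codes]


# Made-up codes, only to show the behaviour.
codes = ["F"] * 40 + ["M"] * 35 + ["trans F"] * 3 + ["non-binary"] * 2
print(Counter(collapse_rare(codes)))
```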

Answered by Mikesplace on May 8, 2021

Simon Richter · Answer

That generally depends on what you are trying to achieve.

What people report as their gender is basically the output of a black box function with a lot of input variables. As any endocrinologist can tell you, it's not as simple as "high testosterone" vs "high estrogen"; something on the order of a hundred different hormones are involved, most of which have interesting medical consequences. People with "all male" or "all female" hormone configurations are rare; usually it is a mix, with a bimodal distribution.

As such, correlating gender with any other data will only give you a somewhat noisy view on the variables that went into the black box. You can derive some probabilities from that, and that is usually all you want anyway: slightly better prediction for the majority of cases. Spending effort to optimize for capturing a small group perfectly is going to give diminishing returns here.

If you give users a neutral option "prefer not to say", you will lose a few rows from privacy minded people, but this also gives an easy out to people who don't believe they neatly fit into these categories. A separate "other" option is generally considered rude.

For applications where the neutral option doesn't work (e.g. because you're investigating side effects of medication), a simple "gender" column is likely oversimplified, and you might get better results from correlating directly to measurements.

Richard Careaga · Answer

Some domain experts are also statistical experts, and some statistical experts are also domain experts. Most data scientists, I'd guess, are not yet expert in either and are more likely to work in different domains routinely. It makes little sense to attempt to become expert in both for each new project.

Rather, the contribution of the data scientist at this stage is exploratory, rather than confirmatory. As Tukey put it, the goal is to find out what questions the data can answer, not to confirm the questions that the data does answer.

Discarding data at this stage makes little sense because many ways of looking at the data can tolerate NAs. Forcing data into binary categories may, or may not, be useful for some tests. For other tests, it makes better sense to create "dummy" variables to tease out distinctions, if any, among the variation shown within categorical variables. There are tools for categorical response variables and tools for categorical predictor variables. A sufficiently large number of categories may make a covariate amenable to treatment as continuous.

To be reductionist, maths and physics are chock full of binary variables. For biological and social systems, binaries are created on the basis of relevance. For most of history, what was most relevant was reproductive role among humans. Much human variability, however, overlaps reproductive role, and that distinction may tell us nothing about whatever is the focus of an inquiry.

For those reasons alone, preserve the data (which can, untransformed, always be set aside in any model) and create dummy variables that illustrate possibly relevant distinctions. Then see what those distinctions add to the understanding of candidate response variables.
