Data Science Asked on June 1, 2021
I am working on a project where I have Twitter user profiles and their tweets. The users are divided into two groups (g1 and g2) based on their number of followers. Each user in g1 was then matched with one user from g2 based on their profile and activity, using nearest neighbor matching (not propensity score). Now I want to do some statistical tests, for example, how differently the sentiment of the tweets changes for these two groups before and after some event. So I have, let's say, tweets posted within 7 days before and after some date, and I estimated the mean sentiment score of all tweets posted by each user in each group. The sample sizes of the two groups are different (even though they were matched), since not everyone posted tweets within the date range. Now I want to do a t-test to see if people in g1 have a larger positive change in sentiment than people in g2 after the reference date. I have the following questions:
Thanks in advance. Cheers!
Regarding your second question:
The users who did not have any tweet for the time range, is this okay to assign difference in mean zero, or I should exclude them from the samples?
You essentially have missing data in this case. How you deal with it depends on the model you are using and whether it is robust to missing data. If the model can ignore $\mu = 0, \sigma = 0$ values, then try it out. Otherwise, you might want to leave them out as you suggest, or perhaps impute them with their previous known values. If you are using something like an ARIMA model, for example, it keeps track of a moving average. In that case, using zero values will have an undesired impact (assuming zeros are not common in general).
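To make the difference between the two options concrete, here is a minimal sketch that compares excluding users with no tweets against assigning them a change of zero. Everything in it is an assumption for illustration: the per-user numbers are made up, the helper names (`sentiment_change`, `changes`, `welch_t`) are hypothetical, and Welch's unequal-variance t statistic is just one reasonable choice given the unequal sample sizes, not necessarily the test you have to use.

```python
from math import sqrt
from statistics import mean, variance


def sentiment_change(before, after):
    """Return after - before, or None if either window has no tweets."""
    if before is None or after is None:
        return None
    return after - before


def changes(group, fill=None):
    """Per-user changes; users with missing data are dropped unless `fill` is given."""
    out = []
    for before, after in group:
        d = sentiment_change(before, after)
        if d is None:
            if fill is None:
                continue  # option 1: exclude the user
            d = fill      # option 2: assign a fixed value such as 0.0
        out.append(d)
    return out


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up data: (mean sentiment before, mean sentiment after) per user.
# None means the user posted nothing in that 7-day window.
g1 = [(0.10, 0.40), (0.00, 0.25), (None, 0.30), (0.20, 0.35), (-0.10, 0.15)]
g2 = [(0.05, 0.10), (0.10, 0.05), (0.00, 0.10), (None, None), (0.15, 0.20)]

g1_excl, g2_excl = changes(g1), changes(g2)
g1_zero, g2_zero = changes(g1, fill=0.0), changes(g2, fill=0.0)

t_excl, df_excl = welch_t(g1_excl, g2_excl)
t_zero, df_zero = welch_t(g1_zero, g2_zero)

print(f"excluded:    t = {t_excl:.3f}, df = {df_excl:.1f}")
print(f"zero-filled: t = {t_zero:.3f}, df = {df_zero:.1f}")
```

In this toy data, zero-filling pulls the g1 mean change toward zero and inflates the variance, so the t statistic shrinks. That is the kind of undesired impact to watch for when zeros are not a natural value. If SciPy is available, `scipy.stats.ttest_ind(g1_excl, g2_excl, equal_var=False, alternative="greater")` should give the corresponding one-sided p-value.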
I'm not sure I understand what you are asking in your first question. What have you tried already? Have you got some results?
Answered by n1k31t4 on June 1, 2021