Data Science Asked by joffdd on June 4, 2021
I am trying to do simple calculations in R when no raw data but grouped data with frequencies is available only. This is the case when I have a large amount of records in a database, say a large SQL table, and then for given reasons GROUP BY and COUNT to aggregate instead of downloading the original table for analysis in R. As I understand, one could say in R that I’m talking about data in a table format.
To give a simple example:
> original=c(1,2,2,3)
> aggregate=table(original)
original
1 2 3
1 2 1
In the case I describe, only data structured like aggregate
would be available as the result of my database query, usually in a data.frame similar to this:
|value|frequency|
|1 |1 |
|2 |2 |
|3 |1 |
What are standard approaches in R to work with this kind of data, particulary when it’s the aim to run analysis on value
? E.g., I may want to
boxplot(original)
I feel I’m just missing a basic concept of aggregated data processing in R, cannot imagine it’s not something being build in a native way in R.
The only idea I could find googling this was basically an approach to let R create original
based on the frequency information.
I think working with "collapsed" or "summarized" data as opposed to the data itself directly inside the SQL database is counter-intuitive. Also, trying to download the original table for analysis in R is a very bad route if the data is too big to fit in the memory, and given that you already have a tool like SQL, which is designed for that specific purpose. You can calculate the mean, median, quantiles, etc. in SQL; and then pull that into R (even into Excel) and draw box plots.
Even though overkill, here is what you can do:
You can recover the original series from summary statistics, but memory is again of concern. Let's say your original series was this:
mySeries <- c(5,9,4,4,2,5,8,9,9,5,5)
But obviously it sits in the database, and this is what you get from SQL as you described:
> fromSQL <- data.frame(table(mySeries))
> fromSQL
mySeries Freq
1 2 1
2 4 2
3 5 4
4 8 1
5 9 3
You can just use the rep function to recover the series like so:
> recoveredSeries <- as.integer(as.character(rep(fromSQL$mySeries, fromSQL$Freq)))
> recoveredSeries
[1] 2 4 4 5 5 5 5 8 9 9 9
Now define a function that returns the summary of this series and gives you what you want:
> multiSummary <- function(x) {
+ c(mean = mean(x), median = median(x))
+ }
> multiSummary(recoveredSeries)
mean median
5.909091 5.000000
As I said, in my opinion, all of this is overkill. Different tools are tailored for different needs, and trying to do something with a tool that is not really the best for your need will give you less-than-optimal results and cost you a lot of time. My recommendation would be performing SQL operations, or going for MapReduce framework for this kind of purposes, depending on the size of your data.
Answered by FatihAkici on June 4, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP