R: Calculations based on frequencies / grouped / aggregate data

Data Science · Asked by joffdd on June 4, 2021

I am trying to do simple calculations in R when no raw data is available, only grouped data with frequencies. This is the case when I have a large number of records in a database, say a large SQL table, and for given reasons use GROUP BY and COUNT to aggregate instead of downloading the original table for analysis in R. In R terms, I believe I'm talking about data in table format.

To give a simple example:

> original=c(1,2,2,3)
> aggregate=table(original)
original
1 2 3 
1 2 1

In the case I describe, only data structured like aggregate would be available as the result of my database query, usually in a data.frame similar to this:

|value|frequency|
|-----|---------|
|1    |1        |
|2    |2        |
|3    |1        |
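
In R, a stand-in for such a query result could be built directly like this (df is just an illustrative name):

> df <- data.frame(value = c(1, 2, 3), frequency = c(1, 2, 1))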

What are standard approaches in R to working with this kind of data, particularly when the aim is to run analyses on value? E.g., I may want to

  • Calculate the mean, median, and quantiles
  • Plot a box plot that gives the same result as running boxplot(original) on the raw data

I feel I'm just missing a basic concept of aggregated data processing in R; I cannot imagine that something like this is not built into R natively.

The only idea I could find by googling was basically an approach that lets R recreate original based on the frequency information.

One Answer

I think working with "collapsed" or "summarized" data, as opposed to working with the data itself directly inside the SQL database, is counter-intuitive. Also, trying to download the original table for analysis in R is a very bad route if the data is too big to fit in memory, given that you already have a tool like SQL, which is designed for exactly this purpose. You can calculate the mean, median, quantiles, etc. in SQL, then pull those results into R (or even Excel) and draw box plots.
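
For example, here is a minimal sketch of that route, assuming a DBI connection and hypothetical table/column names (measurements, value):

library(DBI)

# Hypothetical connection; substitute your own driver, DSN and credentials
con <- dbConnect(RSQLite::SQLite(), "mydb.sqlite")

# Let the database compute the aggregates and pull back only the tiny result
stats <- dbGetQuery(con, "
  SELECT AVG(value)   AS mean_value,
         COUNT(value) AS n
  FROM   measurements
")
# Medians and quantiles need dialect-specific SQL (e.g. PERCENTILE_CONT),
# which not every engine provides

dbDisconnect(con)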

Even though it is overkill, here is what you can do:

  1. Calculate the mean, median, and quantiles from the frequency table by coding the basic math in R. For example, the mean of a series is sum(frequency(i) * value(i)) / sum(frequency(i)), where i = 1, ..., n and n is the number of distinct values in the frequency table. The median is the value sitting at the midpoint of positions 1, ..., sum(frequency(i)). (A runnable sketch of this appears after the example below.)
  2. You can recover the original series from the frequency table, but memory is again a concern. Let's say your original series was this:

    mySeries <- c(5,9,4,4,2,5,8,9,9,5,5)

But obviously it sits in the database, and this is what you get from SQL as you described:

> fromSQL <- data.frame(table(mySeries))
> fromSQL
  mySeries Freq
1        2    1
2        4    2
3        5    4
4        8    1
5        9    3

You can just use the rep function to recover the series like so:

> recoveredSeries <- as.integer(as.character(rep(fromSQL$mySeries, fromSQL$Freq)))
> recoveredSeries
 [1] 2 4 4 5 5 5 5 8 9 9 9
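
Since the question also asks about box plots: once the series is recovered, the base boxplot function applies as usual, and it necessarily matches boxplot(mySeries), because the recovered values are identical to the original ones:

> boxplot(recoveredSeries)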

Now define a function that returns the summary of this series and gives you what you want:

> multiSummary <- function(x) {
+   c(mean = mean(x), median = median(x))
+ }
> multiSummary(recoveredSeries)
    mean   median 
5.909091 5.000000 
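
If expanding the series is not an option memory-wise, the arithmetic from point 1 above can be coded directly against the frequency table. Here is a minimal sketch (freqStats is a hypothetical helper, applied to the fromSQL data frame from above; it assumes the values are sorted ascending, as table() returns them):

freqStats <- function(values, freqs) {
  values <- as.numeric(as.character(values))  # table() output comes as factors
  n   <- sum(freqs)
  cum <- cumsum(freqs)
  # Value at the k-th position of the sorted, expanded series
  valueAt <- function(k) values[which(cum >= k)[1]]
  # Median: average of the two middle order statistics (identical for odd n);
  # the same cumulative-frequency lookup extends to arbitrary quantiles
  c(mean   = sum(values * freqs) / n,
    median = (valueAt(floor((n + 1) / 2)) + valueAt(ceiling((n + 1) / 2))) / 2)
}

> freqStats(fromSQL$mySeries, fromSQL$Freq)
    mean   median 
5.909091 5.000000 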

As I said, in my opinion, all of this is overkill. Different tools are tailored to different needs, and trying to do something with a tool that is not really the best fit will give you less-than-optimal results and cost you a lot of time. My recommendation would be to perform these operations in SQL, or to go for a MapReduce framework, depending on the size of your data.

Answered by FatihAkici on June 4, 2021
