TransWikia.com

How to convert a Dataset into an indexed dataset / association-of-associations given a column header?

Mathematica Asked by IntroductionToProbability on January 24, 2021

Given a dataset as such

Input

If "letter" is the header that is chosen, how do I convert it into an indexed dataset / association-of-associations?

i.e. How do I define f such that f[dataset_,columnHeader_] produces the following?

enter image description here

Please note GroupBy is close but fails as you are unable to use Part to work with the result to extract column data. eg:

data = {<|"letter" -> "a", "foo" -> 1, "bar" -> 2|>, <|"letter" -> "b", "foo" -> 3, "bar" -> 4|>, <|"letter" -> "c", "foo" -> 5, "bar" -> 6|>};
dataDS = Dataset[data];
dataDSg= GroupBy[dataDS, Key["letter"]];
dataDSg[All, "foo"] (* <- produces an error *)

Where as data in the format of an association-of-association works fine

data2 = <|"a" -> <|"foo" -> 1, "bar" -> 2|>, "b" -> <|"foo" -> 3, "bar" -> 4|>, "c" -> <|"foo" -> 5, "bar" -> 6|>|>;
data2DS = data2 // Dataset;
data2DS [All, "foo"] (* <- returns a dataset with 1,3,5 *)

Update

Some timing comparisons

(* make dataset to test *)
colHeader = CharacterRange["a", "z"];
colHeader[[1]] = "letter";
data = RandomReal[{-1, 1}, {100000, 26}];
table = Insert[data, colHeader, 1];
dataDS = Dataset[AssociationThread[table[[1]], #] & /@ table[[2 ;;]]];

Anton Antonov answer

f[ds_Dataset, ch_] := Dataset@Association@Normal@ds[All, #[ch] -> KeyDrop[#, ch] &]
fAns = f[dataDS, "letter"]; // RepeatedTiming (* 0.934 *)

kglr answer

f0 = GroupBy[##, Association@*KeyDrop[#2]] &;
f0ans = f0[dataDS, "letter"]; // RepeatedTiming (* 1.85 *)

f1 = #[GroupBy[#2] /* Map[Association@*KeyDrop[#2]]] &;
f1ans = f1[dataDS, "letter"]; // RepeatedTiming (* 1.714 *)

Sjoerd Smit answer

groupByKey[ds_, key_String] := GroupBy[ds, Function[Slot[key]] -> KeyDrop[key], First];
groupByKeyAns = groupByKey[dataDS, "letter"]; // RepeatedTiming (* 1.2 *)

some other timings that don’t produce an answer but help to put the times above into context

GroupBy[dataDS, "letter"]; // RepeatedTiming (* 0.25 *)
Dataset[Normal[dataDS]]; // RepeatedTiming (* 0.38 *)

3 Answers

data = {<|"letter" -> "a", "foo" -> 1, "bar" -> 2|>, <|
    "letter" -> "b", "foo" -> 3, "bar" -> 4|>, <|"letter" -> "c", 
    "foo" -> 5, "bar" -> 6|>};
dataDS = Dataset[data];

ClearAll[f];
f[ds_Dataset, ch_] := ds[Apply[Association], #[ch] -> KeyDrop[#, ch] &];

f[dataDS, "letter"]

enter image description here

(Using the definition suggested by @WReach in the comments.)

First answer

data = {<|"letter" -> "a", "foo" -> 1, "bar" -> 2|>, <|
    "letter" -> "b", "foo" -> 3, "bar" -> 4|>, <|"letter" -> "c", 
    "foo" -> 5, "bar" -> 6|>};
dataDS = Dataset[data];

ClearAll[f];
f[ds_Dataset, ch_] := Dataset@Association@Normal@ds[All, #[ch] -> KeyDrop[#, ch] &];

f[dataDS, "letter"]

enter image description here

Correct answer by Anton Antonov on January 24, 2021

ClearAll[f0]
f0 = GroupBy[##, Association @* KeyDrop[#2]] &;

Examples:

ds = Dataset @ {<|"letter" -> "a", "foo" -> 1, "bar" -> 2|>, 
        <|"letter" -> "b", "foo" -> 3, "bar" -> 4|>, 
        <|"letter" -> "c", "foo" -> 5, "bar" -> 6|>};

Row[{ds, f0[ds, "letter"], f0[ds, "foo"], f0[ds, "bar"]}, Spacer[10]]

enter image description here

You can also do:

ClearAll[f1]
f1 = #[GroupBy[#2] /* Map[Association @* KeyDrop[#2]]] &;

Row[{ds, f1[ds, "letter"], f1[ds, "foo"], f1[ds, "bar"]}, Spacer[10]]

enter image description here

and

ClearAll[f2]
f2 = #[GroupBy @ #2, All, First @ Normal @ Keys @ KeyDrop @ ##] &;

Row[{ds, f2[ds, "letter"], f2[ds, "foo"], f2[ds, "bar"]}, Spacer[10]]

enter image description here

Answered by kglr on January 24, 2021

Related to kglr's answer, here's a slight variation:

ds = Dataset @ {
    <|"letter" -> "a", "foo" -> 1, "bar" -> 2|>, 
    <|"letter" -> "b",  "foo" -> 3, "bar" -> 4|>,
    <|"letter" -> "c", "foo" -> 5, "bar" -> 6|>
};
groupByKey[ds_, key_String] := GroupBy[ds, Function[Slot[key]] -> KeyDrop[key], First];
groupByKey[ds, "letter"]

Of course, you have to be confident that the key values you're grouping by is actually unique, otherwise you'll be dropping rows.

Answered by Sjoerd Smit on January 24, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP