**Notes on GroupBy, OrderBy and DistinctBy in Dataweave**

Posted on: November 22, 2018

Dataweave allows programmers to use the operations *groupBy*, *orderBy* and *distinctBy* when defining their scripts.

- groupBy is often used with some function to group objects by certain keys.
- orderBy is often used with some function to order objects by certain keys.
- distinctBy is often used with the anonymous parameter $ to eliminate duplicate objects.

In this blog post we shall focus on how these functions behave when they are given an arbitrary function as a parameter, giving examples in each case. In the final part of the blog post we shall also list some interesting properties of these operations to further reinforce our understanding of them.

Consider the following payload, consisting of a list of objects containing the name, surname and age of a given person.

```
[
{
"name": "Allan",
"surname": "Osbourne",
"age": 55
},
{
"name": "Corine",
"surname": "Small",
"age": 15
},
{
"name": "Quentin",
"surname": "Chase",
"age": 66
},
{
"name": "Zak",
"surname": "Wilkins",
"age": 17
},
{
"name": "Rebecca",
"surname": "Dabney",
"age": 23
},
{
"name": "Merry",
"surname": "Moore",
"age": 42
}
]
```

Consider also the function *classify*, which classifies objects according to the age band into which the corresponding person falls.

In particular the function classify will return 0 when the age of the person is greater than or equal to 65, 1 when the age of the person is strictly between 18 and 65 and 2 otherwise.

`%function classify(val,idx) (0 when val.age >= 65 otherwise 1 when val.age > 18 and val.age < 65 otherwise 2)`

It is easy to see that every object is assigned exactly one of the codes 0, 1 or 2.
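To make the age bands concrete, here is a minimal Python sketch of the same rule. It is only a model of the DataWeave function above, not DataWeave code; the name `classify` and the unused `idx` parameter mirror the original, while the sample ages are taken from the payload.

```python
def classify(val, idx=None):
    """Return the age-band code for one person record.

    idx mirrors the index parameter of the DataWeave function and is unused.
    """
    if val["age"] >= 65:
        return 0
    if 18 < val["age"] < 65:
        return 1
    return 2  # everyone else, including a person aged exactly 18


# Ages in payload order: Allan, Corine, Quentin, Zak, Rebecca, Merry.
ages = [55, 15, 66, 17, 23, 42]
codes = [classify({"age": a}) for a in ages]
print(codes)  # [1, 2, 0, 2, 1, 1]
```

Note that the middle band is strict at its lower end, so an age of exactly 18 receives code 2.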

We shall now consider the groupBy operation.

The groupBy function takes a list of objects [x1,…,xn] and some function f.

It will then map f on the list, giving the list of objects [f(x1),…,f(xn)], and use these values to perform the groupBy operation.

In particular, if two objects xi and xj are mapped to the same value (that is f(xi) = f(xj)), these objects will be grouped together in a list under the key f(xi).

For instance, consider applying the following groupBy operation using function classify on the payload:

`payload groupBy classify($,$$)`

This will assign one of the codes 0, 1 or 2 to each object in the payload. Since the function classify can only output one of these three codes, and each of them is assigned to at least one object in this payload, the groupBy operation will return an object containing the keys “0”, “1” and “2”. Each key is in turn associated with an array containing those objects from the payload which have been assigned the corresponding code by the function classify:

```
{
"2": [
{
"name": "Corine",
"surname": "Small",
"age": 15
},
{
"name": "Zak",
"surname": "Wilkins",
"age": 17
}
],
"1": [
{
"name": "Allan",
"surname": "Osbourne",
"age": 55
},
{
"name": "Rebecca",
"surname": "Dabney",
"age": 23
},
{
"name": "Merry",
"surname": "Moore",
"age": 42
}
],
"0": [
{
"name": "Quentin",
"surname": "Chase",
"age": 66
}
]
}
```
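The grouping step can be modelled in a few lines of Python. This is a sketch of the behaviour described above, not of DataWeave's implementation: `group_by` is a hypothetical helper, keys are converted to strings to mirror the “0”, “1” and “2” keys in the result, and the order in which DataWeave emits the keys is not modelled.

```python
PAYLOAD = [
    {"name": "Allan", "surname": "Osbourne", "age": 55},
    {"name": "Corine", "surname": "Small", "age": 15},
    {"name": "Quentin", "surname": "Chase", "age": 66},
    {"name": "Zak", "surname": "Wilkins", "age": 17},
    {"name": "Rebecca", "surname": "Dabney", "age": 23},
    {"name": "Merry", "surname": "Moore", "age": 42},
]


def classify(val, idx=None):
    """Python model of the DataWeave classify function."""
    if val["age"] >= 65:
        return 0
    if 18 < val["age"] < 65:
        return 1
    return 2


def group_by(items, f):
    """Hypothetical helper: collect items under the (stringified) value of f."""
    groups = {}
    for idx, item in enumerate(items):
        groups.setdefault(str(f(item, idx)), []).append(item)
    return groups


grouped = group_by(PAYLOAD, classify)
# Show only the names to keep the printout short.
names_by_code = {code: [p["name"] for p in people] for code, people in grouped.items()}
print(names_by_code)
```

Within each group the objects keep the order in which they appear in the payload, which matches the arrays shown in the result above.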

Next we shall consider the orderBy operation.

The orderBy function takes a list of objects [x1,…,xn] and some function f.

It will then map f on the list, giving the list of objects [f(x1),…,f(xn)], and use these values to perform the orderBy operation.

In particular, if for any two objects xi and xj we have that f(xi) < f(xj), it follows that xi will be ordered before xj in the result. On the other hand if f(xi) = f(xj), we have that xi is ordered before xj if it appears prior to it in the payload.

For instance, consider the following orderBy operation using function classify:

`payload orderBy classify($,$$)`

Once again this will assign one of the codes 0, 1 or 2 to each object in the payload. Objects which have been assigned code 0 will appear before those which have been assigned code 1, and the latter will appear before those which have been assigned code 2. On the other hand, objects sharing the same code are ordered in the same way in which they appear in the payload. The result is shown below:

```
[
{
"name": "Quentin",
"surname": "Chase",
"age": 66
},
{
"name": "Allan",
"surname": "Osbourne",
"age": 55
},
{
"name": "Rebecca",
"surname": "Dabney",
"age": 23
},
{
"name": "Merry",
"surname": "Moore",
"age": 42
},
{
"name": "Corine",
"surname": "Small",
"age": 15
},
{
"name": "Zak",
"surname": "Wilkins",
"age": 17
}
]
```
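The ordering rule, including the tie-breaking by payload position, is what is usually called a stable sort. A minimal Python sketch of it follows, relying on the fact that Python's `list.sort` is stable; `order_by` is a hypothetical helper and the code models the behaviour described above rather than DataWeave's internals.

```python
PAYLOAD = [
    {"name": "Allan", "surname": "Osbourne", "age": 55},
    {"name": "Corine", "surname": "Small", "age": 15},
    {"name": "Quentin", "surname": "Chase", "age": 66},
    {"name": "Zak", "surname": "Wilkins", "age": 17},
    {"name": "Rebecca", "surname": "Dabney", "age": 23},
    {"name": "Merry", "surname": "Moore", "age": 42},
]


def classify(val, idx=None):
    """Python model of the DataWeave classify function."""
    if val["age"] >= 65:
        return 0
    if 18 < val["age"] < 65:
        return 1
    return 2


def order_by(items, f):
    """Hypothetical helper: sort by f(item, idx), keeping ties in input order."""
    keyed = [(f(item, idx), item) for idx, item in enumerate(items)]
    keyed.sort(key=lambda pair: pair[0])  # stable: equal codes keep their order
    return [item for _, item in keyed]


ordered_names = [p["name"] for p in order_by(PAYLOAD, classify)]
print(ordered_names)
# ['Quentin', 'Allan', 'Rebecca', 'Merry', 'Corine', 'Zak']
```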

We now consider the distinctBy operation.

The distinctBy function takes a list of objects [x1,…,xn] and some function f.

It will then map f on the list, giving the list of objects [f(x1),…,f(xn)], and use these values to perform the distinctBy operation.

In particular, if two objects xi and xj are mapped to the same value (that is f(xi) = f(xj)), then xi and xj are not regarded as distinct objects by the distinctBy operation.

The distinctBy operation will then return a list containing one object for each value that f actually produces on the payload. The object which is chosen in each case is the first object which appears in the payload and which is mapped to the corresponding value by the function f.

For instance, consider the following distinctBy operation using function classify:

`payload distinctBy classify($,$$)`

Once again this will assign one of the codes 0, 1 or 2 to each object in the payload. Only one object with code 0, one object with code 1 and one object with code 2 will appear in the result. The object which is chosen in each case is the first object with the corresponding code which appears in the payload. The result is shown below:

```
[
{
"name": "Quentin",
"surname": "Chase",
"age": 66
},
{
"name": "Allan",
"surname": "Osbourne",
"age": 55
},
{
"name": "Corine",
"surname": "Small",
"age": 15
}
]
```

The operation thus corresponds to picking one item, namely the first to appear in the payload, from each pool of items sharing the same code.
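The "first object per code" rule can be sketched in Python as follows. `first_per_code` is a hypothetical helper, and it returns a mapping from code to chosen object rather than a list, so it deliberately says nothing about the order in which DataWeave emits the chosen objects.

```python
PAYLOAD = [
    {"name": "Allan", "surname": "Osbourne", "age": 55},
    {"name": "Corine", "surname": "Small", "age": 15},
    {"name": "Quentin", "surname": "Chase", "age": 66},
    {"name": "Zak", "surname": "Wilkins", "age": 17},
    {"name": "Rebecca", "surname": "Dabney", "age": 23},
    {"name": "Merry", "surname": "Moore", "age": 42},
]


def classify(val, idx=None):
    """Python model of the DataWeave classify function."""
    if val["age"] >= 65:
        return 0
    if 18 < val["age"] < 65:
        return 1
    return 2


def first_per_code(items, f):
    """Hypothetical helper: keep the first item seen for each value of f."""
    chosen = {}
    for idx, item in enumerate(items):
        code = f(item, idx)
        if code not in chosen:  # later items with the same code are dropped
            chosen[code] = item
    return chosen


representatives = {code: p["name"] for code, p in first_per_code(PAYLOAD, classify).items()}
print(representatives)
```

Zak, Rebecca and Merry are dropped because an earlier object in the payload already carries their code.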

We conclude by mentioning a few interesting properties of the groupBy, orderBy and distinctBy operations.

Suppose that we are given a finite list as a payload and a function f such as classify, which assigns some code to each object in the payload.

Then we have that f implicitly defines a relation R(x,y) between any two objects x and y which states that x is related to y if the function gives the same code to both objects (i.e. R(x,y) if and only if f(x) == f(y)).

In fact R is an *equivalence relation*, which partitions the list into disjoint sublists called *equivalence classes* containing all objects which were assigned the same code.

The actual list of equivalence classes (called the *quotient* under R, payload/R) can be obtained by using a groupBy operation followed by a pluck operation to remove the keys:

`payload groupBy classify($,$$) pluck $`

giving the following result:

```
[
[
{
"name": "Corine",
"surname": "Small",
"age": 15
},
{
"name": "Zak",
"surname": "Wilkins",
"age": 17
}
],
[
{
"name": "Allan",
"surname": "Osbourne",
"age": 55
},
{
"name": "Rebecca",
"surname": "Dabney",
"age": 23
},
{
"name": "Merry",
"surname": "Moore",
"age": 42
}
],
[
{
"name": "Quentin",
"surname": "Chase",
"age": 66
}
]
]
```
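As a sanity check on the partition claim, the following Python sketch builds the classes and verifies that they are disjoint, cover the whole payload, and that all members of a class share one code. `quotient` is a hypothetical helper modelling groupBy followed by pluck, and the records are trimmed to name and age for brevity.

```python
# Payload trimmed to the fields the check needs (an illustrative simplification).
PAYLOAD = [
    {"name": "Allan", "age": 55},
    {"name": "Corine", "age": 15},
    {"name": "Quentin", "age": 66},
    {"name": "Zak", "age": 17},
    {"name": "Rebecca", "age": 23},
    {"name": "Merry", "age": 42},
]


def classify(val, idx=None):
    """Python model of the DataWeave classify function."""
    if val["age"] >= 65:
        return 0
    if 18 < val["age"] < 65:
        return 1
    return 2


def quotient(items, f):
    """Hypothetical helper: group by f, then discard the keys."""
    classes = {}
    for idx, item in enumerate(items):
        classes.setdefault(f(item, idx), []).append(item)
    return list(classes.values())


classes = quotient(PAYLOAD, classify)
names = [p["name"] for cls in classes for p in cls]

# Every object lands in exactly one class: nothing lost, nothing duplicated.
covers_payload = sorted(names) == sorted(p["name"] for p in PAYLOAD)
# All members of a class carry the same code.
uniform = all(len({classify(p) for p in cls}) == 1 for cls in classes)
# No two classes share a code.
class_codes = [classify(cls[0]) for cls in classes]
disjoint = len(set(class_codes)) == len(classes)

print(covers_payload, uniform, disjoint)  # True True True
```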

If Dataweave knows how to order the codes which f assigns to the objects within the payload, the orderBy operation imposes a *total order* on the payload, which satisfies the important property that the members of any one equivalence class are either all ordered before or all ordered after the members of any other equivalence class.
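One way to see this property is to look at the sequence of codes after sorting: it never decreases, so each equivalence class occupies one contiguous block. The Python sketch below checks this on the example payload; it models orderBy with Python's stable `sorted` and trims the records to name and age, both of which are assumptions made for illustration.

```python
# Payload trimmed to the fields the check needs (an illustrative simplification).
PAYLOAD = [
    {"name": "Allan", "age": 55},
    {"name": "Corine", "age": 15},
    {"name": "Quentin", "age": 66},
    {"name": "Zak", "age": 17},
    {"name": "Rebecca", "age": 23},
    {"name": "Merry", "age": 42},
]


def classify(val, idx=None):
    """Python model of the DataWeave classify function."""
    if val["age"] >= 65:
        return 0
    if 18 < val["age"] < 65:
        return 1
    return 2


# Model of `payload orderBy classify($,$$)` using a stable sort.
ordered = sorted(PAYLOAD, key=classify)
ordered_codes = [classify(p) for p in ordered]
print(ordered_codes)  # [0, 1, 1, 1, 2, 2]

# Codes never decrease, so no class is interleaved with another.
non_decreasing = all(a <= b for a, b in zip(ordered_codes, ordered_codes[1:]))
print(non_decreasing)  # True
```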

Finally, when the distinctBy operation is given a function f, it will identify a *representative* object from each equivalence class and return a list of these objects.

Ricston Ltd.

Triq G.F. Agius De Soldanis,

Birkirkara, BKR 4850,

Malta
