Notes on GroupBy, OrderBy and DistinctBy in DataWeave

Posted on: November 22, 2018
Author: Ingram

DataWeave allows programmers to use the operations groupBy, orderBy and distinctBy when defining their scripts.

  • groupBy is often used with some function to group objects by certain keys.
  • orderBy is often used with some function to order objects by certain keys.
  • distinctBy is often used with the anonymous parameter $ to eliminate duplicate objects.

In this blog post we shall focus on how these operations behave when they are given an arbitrary function as a parameter, giving an example in each case. In the final part of the blog post we shall also list some interesting properties of these operations to further reinforce our understanding of them.

Setting Up

Consider the following payload, consisting of a list of objects containing the name, surname and age of a given person.

[
    {
        "name": "Allan",
        "surname": "Osbourne",
        "age": 55
    },
    {
        "name": "Corine",
        "surname": "Small",
        "age": 15
    },
    {
        "name": "Quentin",
        "surname": "Chase",
        "age": 66
    },
    {
        "name": "Zak",
        "surname": "Wilkins",
        "age": 17
    },
    {
        "name": "Rebecca",
        "surname": "Dabney",
        "age": 23
    },
    {
        "name": "Merry",
        "surname": "Moore",
        "age": 42
    }
]

Consider also the function classify, which classifies objects according to the age band into which the corresponding person falls.

In particular, the function classify will return 0 when the age of the person is greater than or equal to 65, 1 when the age of the person is strictly between 18 and 65, and 2 otherwise. Note that a person aged exactly 18 is therefore assigned code 2.

%function classify(val,idx)
(0 when val.age >= 65 otherwise
1 when val.age > 18 and val.age < 65 otherwise
2)

It is easy to see that every object is assigned exactly one of the codes 0, 1 or 2.
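
As a cross-check of the three bands, here is a minimal sketch which models classify in Python. Python is used only as a neutral way of executing the logic and is not DataWeave syntax; the dictionary records and the names codes and boundary are assumptions made for this sketch.

```python
# Python model of the DataWeave function classify(val, idx).
# idx is accepted to mirror the original signature, but it is not used.
def classify(val, idx=None):
    if val["age"] >= 65:
        return 0
    if 18 < val["age"] < 65:
        return 1
    return 2

# Ages taken from the payload above, in payload order.
ages = [55, 15, 66, 17, 23, 42]
codes = [classify({"age": age}, idx) for idx, age in enumerate(ages)]
print(codes)  # [1, 2, 0, 2, 1, 1]

# Boundary cases: 65 satisfies ">= 65", while 18 is not strictly above 18.
boundary = [classify({"age": 65}), classify({"age": 18})]
print(boundary)  # [0, 2]
```

Every age falls through to exactly one return statement, which is why each object receives exactly one code.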

We shall now consider the groupBy operation.

groupBy

The groupBy function takes a list of objects [x1,…,xn] and some function f.

It will then map f on the list, giving the list of values [f(x1),…,f(xn)], and use these values to perform the groupBy operation.

In particular, if two objects xi and xj are mapped to the same value (that is, f(xi) = f(xj)), these objects will be grouped together in a list under the key f(xi).

For instance, consider applying the following groupBy operation using function classify on the payload:

payload groupBy classify($,$$)

This will assign one of the codes 0, 1 or 2 to each object in the payload. Since the function classify can only output one of these three codes, and each of them occurs in this payload, the groupBy operation will return an object containing the keys “0”, “1” and “2”. Each key will in turn be associated with an array containing those objects from the payload which have been assigned the corresponding code by the function classify:

{
  "2": [
    {
      "name": "Corine",
      "surname": "Small",
      "age": 15
    },
    {
      "name": "Zak",
      "surname": "Wilkins",
      "age": 17
    }
  ],
  "1": [
    {
      "name": "Allan",
      "surname": "Osbourne",
      "age": 55
    },
    {
      "name": "Rebecca",
      "surname": "Dabney",
      "age": 23
    },
    {
      "name": "Merry",
      "surname": "Moore",
      "age": 42
    }
  ],
  "0": [
    {
      "name": "Quentin",
      "surname": "Chase",
      "age": 66
    }
  ]
}

Tip: By splitting this collection by the corresponding keys, it becomes possible to route objects with different codes to different message processors.
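
The grouping rule described in this section can be modelled with a short Python sketch. This is an illustration of the semantics only, not of how DataWeave implements groupBy: the helper group_by, the reduced payload (surnames omitted for brevity) and the name names_by_code are assumptions, and the sketch makes no claim about the order in which DataWeave emits the keys.

```python
def classify(val, idx=None):
    # Same age bands as the DataWeave classify function.
    if val["age"] >= 65:
        return 0
    if 18 < val["age"] < 65:
        return 1
    return 2

def group_by(items, f):
    """Collect items under the string form of f(item, index)."""
    groups = {}
    for idx, item in enumerate(items):
        key = str(f(item, idx))  # object keys are strings, e.g. "0"
        groups.setdefault(key, []).append(item)
    return groups

# Payload reduced to name and age for brevity.
payload = [
    {"name": "Allan", "age": 55},
    {"name": "Corine", "age": 15},
    {"name": "Quentin", "age": 66},
    {"name": "Zak", "age": 17},
    {"name": "Rebecca", "age": 23},
    {"name": "Merry", "age": 42},
]

grouped = group_by(payload, classify)
names_by_code = {code: [p["name"] for p in people]
                 for code, people in grouped.items()}
print(names_by_code)
# {'1': ['Allan', 'Rebecca', 'Merry'], '2': ['Corine', 'Zak'], '0': ['Quentin']}
```

Within each group the objects keep their payload order, which agrees with the result shown above.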

Next we shall consider the orderBy operation.

orderBy

The orderBy function takes a list of objects [x1,…,xn] and some function f.

It will then map f on the list, giving the list of values [f(x1),…,f(xn)], and use these values to perform the orderBy operation.

In particular, for any two objects xi and xj, if f(xi) < f(xj) then xi will be ordered before xj in the result. On the other hand, if f(xi) = f(xj), then xi is ordered before xj if it appears prior to it in the payload.

For instance, consider the following orderBy operation using function classify:

payload orderBy classify($,$$)

Once again this will assign one of the codes 0, 1 or 2 to each object in the payload. Objects which have been assigned code 0 will appear before those which have been assigned code 1, and the latter will appear before those which have been assigned code 2. On the other hand, objects sharing the same code are ordered in the same way in which they appear in the payload. The result is shown below:

[
  {
    "name": "Quentin",
    "surname": "Chase",
    "age": 66
  },
  {
    "name": "Allan",
    "surname": "Osbourne",
    "age": 55
  },
  {
    "name": "Rebecca",
    "surname": "Dabney",
    "age": 23
  },
  {
    "name": "Merry",
    "surname": "Moore",
    "age": 42
  },
  {
    "name": "Corine",
    "surname": "Small",
    "age": 15
  },
  {
    "name": "Zak",
    "surname": "Wilkins",
    "age": 17
  }
]

Tip: Ordering objects in this manner and queuing them results in a queue of objects prioritised by their corresponding code.
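
The two ordering rules above (smaller code first, ties broken by payload position) are exactly what a stable sort provides. The following Python sketch models this; the helper order_by, the reduced payload and the name ordered_names are assumptions made for illustration, and nothing here is taken from the DataWeave implementation.

```python
def classify(val, idx=None):
    # Same age bands as the DataWeave classify function.
    if val["age"] >= 65:
        return 0
    if 18 < val["age"] < 65:
        return 1
    return 2

def order_by(items, f):
    """Sort by f(item, index). Python's sort is stable, so items with
    equal keys keep the order in which they appear in the input."""
    indexed = list(enumerate(items))
    indexed.sort(key=lambda pair: f(pair[1], pair[0]))
    return [item for _, item in indexed]

# Payload reduced to name and age for brevity.
payload = [
    {"name": "Allan", "age": 55},
    {"name": "Corine", "age": 15},
    {"name": "Quentin", "age": 66},
    {"name": "Zak", "age": 17},
    {"name": "Rebecca", "age": 23},
    {"name": "Merry", "age": 42},
]

ordered_names = [p["name"] for p in order_by(payload, classify)]
print(ordered_names)
# ['Quentin', 'Allan', 'Rebecca', 'Merry', 'Corine', 'Zak']
```

The names come out in the same order as in the result shown above.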

We now consider the distinctBy operation.

distinctBy

The distinctBy function takes a list of objects [x1,…,xn] and some function f.

It will then map f on the list, giving the list of values [f(x1),…,f(xn)], and use these values to perform the distinctBy operation.

In particular, if two objects xi and xj are mapped to the same value (that is, f(xi) = f(xj)), then xi and xj are not regarded as distinct objects by the distinctBy operation.

The distinctBy operation will then return a list containing one object for each distinct value which f takes on the payload. The object which is chosen in each case is the first object which appears in the payload and which is mapped to the corresponding value by the function f.

For instance, consider the following distinctBy operation using function classify:

payload distinctBy classify($,$$)

Once again this will assign one of the codes 0, 1 or 2 to each object in the payload. Only one object with code 0, one object with code 1 and one object with code 2 will appear in the result. The object which is chosen in each case is the first object with the corresponding code which appears in the payload.

[
  {
    "name": "Quentin",
    "surname": "Chase",
    "age": 66
  },
  {
    "name": "Allan",
    "surname": "Osbourne",
    "age": 55
  },
  {
    "name": "Corine",
    "surname": "Small",
    "age": 15
  }
]

The operation thus corresponds to sampling one item from each pool of items sharing the same code.

Tip: If the number of classes is large, this can be an easy way of taking a sample of the data as it moves through the script.
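
The rule that the first object with each code is the one kept can be modelled with the Python sketch below. The helper distinct_by, the reduced payload and the name chosen are assumptions made for illustration. The sketch returns a mapping from code to representative rather than a list, so it makes no claim about the order in which DataWeave emits the chosen objects.

```python
def classify(val, idx=None):
    # Same age bands as the DataWeave classify function.
    if val["age"] >= 65:
        return 0
    if 18 < val["age"] < 65:
        return 1
    return 2

def distinct_by(items, f):
    """For each value of f(item, index), keep the first item producing it."""
    representatives = {}
    for idx, item in enumerate(items):
        # setdefault only stores the item if the code has not been seen yet.
        representatives.setdefault(f(item, idx), item)
    return representatives

# Payload reduced to name and age for brevity.
payload = [
    {"name": "Allan", "age": 55},
    {"name": "Corine", "age": 15},
    {"name": "Quentin", "age": 66},
    {"name": "Zak", "age": 17},
    {"name": "Rebecca", "age": 23},
    {"name": "Merry", "age": 42},
]

chosen = {code: p["name"] for code, p in distinct_by(payload, classify).items()}
print(chosen)  # {1: 'Allan', 2: 'Corine', 0: 'Quentin'}
```

Zak, Rebecca and Merry are dropped because an earlier object had already claimed their codes.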

Some properties of groupBy, orderBy and distinctBy

We conclude by mentioning a few interesting properties of the groupBy, orderBy and distinctBy operations.

Suppose that we are given a finite list as a payload and a function f such as classify, which assigns some code to each object in the payload.

Then we have that f implicitly defines a relation R(x,y) between any two objects x and y which states that x is related to y if the function gives the same code to both objects (i.e. R(x,y) if and only if f(x) == f(y)).

In fact R is an equivalence relation: it is reflexive, symmetric and transitive because equality of codes is. It therefore partitions the list into disjoint sublists, called equivalence classes, each containing all the objects which were assigned the same code.

The actual list of equivalence classes (called the quotient under R, payload/R) can be obtained by using a groupBy operation followed by a pluck operation to remove the keys:

payload groupBy classify($,$$) pluck $

giving the following result:

[
  [
    {
      "name": "Corine",
      "surname": "Small",
      "age": 15
    },
    {
      "name": "Zak",
      "surname": "Wilkins",
      "age": 17
    }
  ],
  [
    {
      "name": "Allan",
      "surname": "Osbourne",
      "age": 55
    },
    {
      "name": "Rebecca",
      "surname": "Dabney",
      "age": 23
    },
    {
      "name": "Merry",
      "surname": "Moore",
      "age": 42
    }
  ],
  [
    {
      "name": "Quentin",
      "surname": "Chase",
      "age": 66
    }
  ]
]

If DataWeave knows how to order the codes which f assigns to the objects within the payload, the orderBy operation imposes a total order on the payload. This order satisfies the important property that the members of any one equivalence class are either all smaller or all greater than the members of any other equivalence class.

Finally, when the distinctBy operation is given a function f, it will identify a representative object from each equivalence class and return a list of these objects.
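
The three properties can be checked on the example payload with a short Python sketch. It is only an illustrative check under the assumptions of this post (the names quotient, is_partition, sorted_codes and rep_names are invented for the sketch), not a proof and not DataWeave code.

```python
def classify(val, idx=None):
    # Same age bands as the DataWeave classify function.
    if val["age"] >= 65:
        return 0
    if 18 < val["age"] < 65:
        return 1
    return 2

# Payload reduced to name and age for brevity.
payload = [
    {"name": "Allan", "age": 55},
    {"name": "Corine", "age": 15},
    {"name": "Quentin", "age": 66},
    {"name": "Zak", "age": 17},
    {"name": "Rebecca", "age": 23},
    {"name": "Merry", "age": 42},
]
codes = [classify(p, i) for i, p in enumerate(payload)]

# Quotient: payload positions collected per code (one class per code).
quotient = {}
for idx, code in enumerate(codes):
    quotient.setdefault(code, []).append(idx)

# Partition: every position occurs in exactly one class.
all_positions = sorted(i for members in quotient.values() for i in members)
is_partition = all_positions == list(range(len(payload)))

# Order: sorted codes never decrease, so each class forms one contiguous run.
sorted_codes = sorted(codes)

# Representatives: the first member of each class, as distinctBy would pick.
rep_names = {code: payload[members[0]]["name"]
             for code, members in quotient.items()}

print(is_partition)   # True
print(sorted_codes)   # [0, 1, 1, 1, 2, 2]
print(rep_names)      # {1: 'Allan', 2: 'Corine', 0: 'Quentin'}
```

Since each class occupies one contiguous run after sorting, all of its members lie on the same side of every other class.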
