DVF: Aggregations - Mincong Huang

Introduction

Open data “Demande de valeurs foncières (DVF)” is an open dataset provided by the French government which collects all the real-estate transactions since January 2014, in mainland France and the overseas departments and territories. In the previous DVF articles, we talked about the write path: how to index new documents, how to optimize storage, and how to perform snapshots and restores. Starting from this article, we are going to focus on the read path: how to perform different search actions on this dataset.

This article will focus on aggregations. Aggregations are important for your application because it provides an overview to your users without showing any documents in detail. It also provides information about the selection range, such as the min/max value of a given field. This topic is also part of the Elastic Certified Engineer Exam. After reading this article, you will understand:

How to write and execute metric aggregation?
How to write and execute bucket aggregation?
How to write and execute aggregations that contain sub-aggregations?

To better demonstrate the importance of aggregations in real-world scenarios, I am going to use different examples from the DVF dataset. This article is written in Elasticsearch 7.12 and Java 11, but most of the concepts should be appliable to any Elasticsearch 7.x cluster. Most of the examples are written in two formats: HTTP requests with JSON content and Java. The goal is to let you better understand how it works even if you are not familiar with Java.

Prerequisite

Before writing any aggregation, we need to index the dataset into Elasticsearch. This has been done in the previous articles so I am not going to go into detail about it in this article. If you were interested in how to do it, you can find the previous articles under the category “Elasticsearch” of my blog, they are prefixed by “DVF”. Once the index is ready, you can find it in the Elasticsearch cluster via the _cat indices API as “transactions”:

$ curl localhost:9200/_cat/indices
yellow open transactions xMLeTfvwTYW1mdz5P85JsA 1 1 827105 0 207.3mb 207.3mb

Metric Aggregation

According to Elasticsearch documentation Metrics Aggregation (7.x), the aggregations in this family compute metrics based on values extracted in one way or another from the documents that are being aggregated. The values are typically extracted from the fields of the document (using the field data), but can also be generated using scripts.

Here I am going to take a simple one: the metric value count. As the name indicated, metric value_count shows how many documents are extracted from the aggregated documents. To do that via REST API, we can use the _search endpoint as follows:

GET /transactions/_search

{
  "query": {
    "match_all": {}  // 1
  },
  "size": 0,  // 2
  "aggs": {
    "mutation_id/value_count": {  // 3
      "value_count": {  // 4
        "field": "mutation_id"  // 5
      }
    }
  }
}

If we take a quick look into the HTTP request, you will see that:

We use a match_all query, which matches all the documents of the index “transactions” without filtering.
The number of search hits to return is set to 0. The default value is 10. Since we don’t care about those documents, setting it to 0 simplifies the HTTP response.
We use one single-value metric aggregation and name it as mutation_id/value_count. I name it using the naming convention:
```
${field_name}/${metric_type}
```
so that I can understand the target field name to be aggregated and the type of metric. But this is just a personal preference. You are free to choose the name you want.
The type of metric is value_count.
The metric applies to field mutation_id. I use this field because it is the key of the mutation (transaction), so it is always non-null.

Sending the request above will return:

{
  "took": 100,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "mutation_id/count": {
      "value": 827105
    }
  }
}

It means that there were 827105 transactions in 2020 according to the dataset. It matches the number of lines in the CSV file. There is a difference of 1 line because the CSV file includes the header.

➜  dvf git:(master u=) wc -l downloads/full.2020.csv
  827106 downloads/full.2020.csv

Now, let’s see how to do the same thing in Java:

var sourceBuilder =
    new SearchSourceBuilder()
        .size(0)  // 1
        .aggregation(AggregationBuilders.count("mutation_id/value_count").field("mutation_id"))  // 2
        .query(QueryBuilders.matchAllQuery());  // 3

var request = new SearchRequest()  // 4
    .indices("transactions")
    .source(sourceBuilder);

var response = restClient.search(request, RequestOptions.DEFAULT);  // 5
var valueCount = (ValueCount) response.getAggregations().get("mutation_id/value_count");  // 6
...

It is almost the same thing as the HTTP request. Here we:

Create an HTTP request with a search source. The hit size is set to 0 to avoid returning hits.
The aggregation used is the value count (value_count), named as mutation_id/value_count, targeting field mutation_id.
On the query side, it matches all documents without filtering.
Combining the index name and the search source, we create a search request.
We use the Java REST High Level Client to send the search request and get the search response.
We can retrieve the aggregation from the response using the name of the aggregation, i.e. using “mutation_id/value_count”.

In this section, we discussed how metric aggregation works: we need to provide the index names to be searched, the query to filter the document, and the metrics aggregations to be performed. Now, let’s go to the next part: bucket aggregations.

Bucket Aggregation

Bucket aggregations that group documents into buckets, also called bins, based on field values, ranges, or other criteria. In this section, we are going to use postal code as an example: let’s see which postal code in France contains the highest number of transactions?

To answer this question, we need to prepare an HTTP request for bucket aggregation:

{
  "query": {
    "match_all": {}  // 1
  },
  "size": 0,  // 2
  "aggs": {
    "postal_code/terms": {
      "terms": {  // 3
        "field": "postal_code",
        "size": 3  // 4
      }
    }
  }
}

If we take a quick look into the HTTP request, you will see the same concept as above for the metric aggregation. This time, the bucket aggregation term does the following things:

It queries all the documents in the target index.
It sets the size to 0 to avoid returning documents (hits) because we don’t need them.
This is the key of the request. It specifies the type of aggregation to terms on field postal_code. Therefore, we can obtain a result grouped by postal code.
It only takes the top 3 results. More precisely, there are two notions: size and order. Here we specified the size, which means the aggregation will return 3 results. As for the order, terms aggregation returns results in descending order by default. So the terms having the most occurrences will be returned (defaults to 10). Therefore, combined together (size and order), this setting returns the top 3 results.

Sending the request above will return:

{
  "took": 97,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "postal_code/terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 812333,
      "buckets": [
        {
          "key": "",
          "doc_count": 9392
        },
        {
          "key": "51100",
          "doc_count": 2859
        },
        {
          "key": "75016",
          "doc_count": 2521
        }
      ]
    }
  }
}

So the top 1 result is… missing value. Oh my god 😅 But this is the forever pain for data scientists or whoever doing data analytics, isn’t it? If we filter out that result, then the top 1 goes to Reims (51100) and the top 2 goes to Paris 16e district (75016).

And here is the implementation in Java. I am not going to explain this code block because it’s pretty straightforward:

var sourceBuilder =
    new SearchSourceBuilder()
        .size(0)
        .aggregation(AggregationBuilders.terms("postal_code/terms").field("postal_code").size(3))
        .query(QueryBuilders.matchAllQuery());

var request = new SearchRequest().indices("transactions").source(sourceBuilder);

var response = restClient.search(request, RequestOptions.DEFAULT);
var terms = (ParsedStringTerms) response.getAggregations().get("postal_code/terms");
var countPerPostalCode = terms.getBuckets().stream()
    .map(b -> (ParsedBucket) b)
    .collect(
        Collectors.toMap(
            ParsedBucket::getKeyAsString,
            ParsedBucket::getDocCount));

In this section, we saw how to create a bucket aggregation using terms aggregations and field postal_code. But all we saw are very primitive examples and they didn’t provide much added value for data analytics. In the following sections, I want to share something more interesting with you, such as: what is the average price of a second-hand apartment in Paris? Before answering this question, we will need to compute the price per square meter (€/m2). I will show you how to do that in the next section. And then, we will do the analysis for Paris.

Scripted Metric Aggregation

How to compute the price per square meter (€/m2)?

Our current goal is to compute the price per square meter for each apartment sold. We can do that by doing simple math:

price_m2 = total_price / built-up area

There are mainly 3 choices to compute this field:

Do it at index-time: when we create the new document in Elasticsearch, we can compute a field in our value class in Java. This is useful when we know exactly what we need in advance.
Do it at runtime: update the index mapping to include a new scripted field. This will apply to all the documents. This is useful when we don’t know what additional fields we need when indexing documents. It provides flexibility to modify documents at runtime. Especially useful for end-users.
Do it at query-time: create the field when running the query. This is probably the most expensive one but it fits the on-demand requirement. Sometimes we don’t want to keep one additional field forever because it’s only useful for some queries.

For now, I am going to use choice 3 because it fits the current article which is about search. To prepare the scripted metric aggregation, we need to provide a runtime mapping price_m2, which is computed by two existing fields: property_value and the real_built_un_area. It looks like this:

GET /transactions/_search

{
  "runtime_mappings": {
    "price_m2": {
      "type": "double",
      "script": "emit(doc['property_value'].value / doc['real_built_up_area'].value)"
    }
  },
  ...
}

The script is written in “Painless Script”. Be careful about the logic that you are going to add to this script because it is easy to change from painless to painful 🙂. If you want to know more about Painless, visit Elasticsearch documentation: Painless Language Specification.

Now, going back to our scripted metric, we will need to handle some corner cases because the property value or real built-up area may be missing or equal to 0. I filtered them out in the “query” section of the aggregation. Also, we need to filter the nature of the transaction (mutation) to select only the sales. Other types like expropriation, exchange, judgement are not what we want. To simplify a bit, I also excluded the category “sales under construction”. The final HTTP request looks like this:

{
  "query": {  // 1
    "bool": {
      "filter": [
        { "match": { "mutation_nature": { "query": "Vente" } } },
        { "match": { "local_type": { "query": "Appartement" } } },
        { "range": { "property_value": { "gt": 0 } } },
        { "range": { "real_built_up_area": { "gt": 0 } } }
      ]
    }
  },
  "runtime_mappings": {  // 2
    "price_m2": {
      "type": "double",
      "script": "emit(doc['property_value'].value / doc['real_built_up_area'].value)"
    }
  },
  "size": 0,  // 3
  "aggs": {   // 4
    "price_m2/stats": {
      "stats": {
        "field": "price_m2"
      }
    }
  }
}

If we take a quick look into the HTTP request, you will see the same concept again, as above for the metric aggregation and bucket aggregation. This time, the metric aggregation term does the following things:

It does not query all the documents anymore. It contains multiple filters, encapsulated in a boolean query.
It defines a runtime mapping for field price_m2, computed from property value and real built-up area.
It sets the size to 0 to avoid returning documents (hits) because we don’t need them.
This is the key of the request. It specifies the type of aggregation. We use multi-valued metric aggregation stats, which returns the min, max, average, sum, and count of the field price_m2 in the selected documents.

Sending the request above will return:

{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "price_m2/stats": {
      "count": 147763,
      "min": 0.003750000149011612,
      "max": 8166666.666666667,
      "avg": 26941.800127837614,
      "sum": 3981001212.2896695
    }
  }
}

Transactions In Paris

Now we have all the elements that we need, it’s time to do something fun! We know how to execute a multi-valued metric aggregation, e.g. stats. We know how to execute a bucket aggregation, e.g. per postal code. We know how to compute a scripted metric for the price per meter square (m2). Now, let’s use them to do a quick case study for Paris. This section aims to answer two questions:

What is the price for second-hand apartments in Paris per district (arrondissement)?
What is the price for second-hand apartments in Paris per type of apartment (T1, T2, T3, …)?

To answer the first question, we need to use sub-aggregations with 2 levels. The first level is a terms aggregation per postal code and the second level is a list of multi-valued metric aggregations: percentiles and stats. percentiles for price per square meter and stats for the total property value.

{
  "query": {
    "bool": {
      "must": [
        { "wildcard": { "postal_code": { "value": "75*" } } }
      ],
      "filter": [
        { "match": { "mutation_nature": { "query": "Vente" } } },
        { "match": { "local_type": { "query": "Appartement" } } },
        { "range": { "property_value": { "gt": 0 } } },
        { "range": { "real_built_up_area": { "gt": 0 } } }
      ]
    }
  },
  "runtime_mappings": {
    "price_m2": {
      "type": "double",
      "script": "emit(doc['property_value'].value / doc['real_built_up_area'].value)"
    }
  },
  "size": 0,
  "aggs": {
    "postal-code-aggregation": {
      "terms": {
        "field": "postal_code",
        "size": 20
      },
      "aggs": {
        "price_m2/percentiles": {
          "percentiles": {
            "field": "price_m2"
          }
        },
        "property_value/percentiles": {
          "stats": {
            "field": "property_value"
          }
        }
      }
    }
  }
}

Sending the request above will return:

{
  ...
  "aggregations": {
    "postal-code-aggregation": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "75018",
          "doc_count": 1740,
          "property_value/stats": {
            "count": 1740,
            "min": 0.15000000596046448,
            "max": 9278000,
            "avg": 685657.4655678796,
            "sum": 1193043990.0881104
          },
          "price_m2/percentiles": {
            "values": {
              "1.0": 38.31908831908832,
              "5.0": 4263.305322128852,
              "25.0": 8581.576948155804,
              "50.0": 10221.628370766353,
              "75.0": 12191.63493555511,
              "95.0": 62619.36339522546,
              "99.0": 205629.62962962964
            }
          }
        },
        {
          "key": "75017",
          "doc_count": 1411,
          "property_value/stats": { ... },
          "price_m2/percentiles": { ... }
        },
        ...
      ]
    }
  }
}

If we transform the response a bit, we can obtain the following tables.

Total Price Per District

Here is the total price per district in Paris in percentiles: p5, p25, p50, p75, p95. You can see that the 8th district (75008) is the most expensive for most of the percentiles. 50% of the apartments are more expensive than 1.3M€ 🤯.

Postal Code	p5 (€)	p25 (€)	p50 (€)	p75 (€)	p95 (€)
75001	41,000	404,794	691,666	1,740,000	7,005,000
75002	45,000	278,350	496,075	970,392	18,500,000
75003	610	319,583	541,163	1,095,193	12,550,000
75004	16,100	350,000	595,000	1,030,000	2,171,100
75005	86,722	303,805	500,000	852,500	1,746,600
75006	103,500	389,867	729,260	1,504,999	3,408,400
75007	146,600	452,985	845,000	1,854,400	6,000,000
75008	19,310	433,250	1,299,950	3,323,542	33,100,000
75009	80,000	301,325	551,894	1,148,000	6,300,000
75010	106,499	295,000	486,668	837,383	3,716,667
75011	122,505	272,264	436,082	719,349	9,320,000
75012	124,172	290,000	432,585	680,481	3,150,000
75013	148,800	280,867	424,521	609,400	1,378,005
75014	149,250	310,500	494,981	770,160	12,104,743
75015	154,400	316,764	471,024	699,333	1,280,571
75016	118,915	395,990	781,012	1,400,268	3,500,000
75017	97,013	312,448	562,425	1,199,034	13,230,000
75018	87,000	237,636	371,317	586,359	2,800,000
75019	117,750	262,500	353,613	540,250	1,018,594
75020	134,060	248,990	407,585	636,100	8,500,000

Price Per M2 Per District

But using the total price of the apartment is not objective for judging whether the appartment is expensive because they don’t have the same real built-in area (m2). So we should normalize it. We can normalize it by calculating the price per meter square (€/m2). It gives another table:

Postal Code	p5 (€/m2)	p25 (€/m2)	p50 (€/m2)	p75 (€/m2)	p95 (€/m2)
75001	729	11,591	14,208	26,602	231,818
75002	3,492	10,853	12,505	15,586	391,045
75003	14	11,144	13,308	17,131	167,899
75004	343	11,405	13,077	15,448	32,767
75005	2,616	11,004	12,679	15,000	30,905
75006	3,348	12,679	15,492	19,666	49,035
75007	7,114	12,857	15,236	20,179	130,105
75008	260	10,715	13,301	39,951	919,444
75009	2,925	10,208	12,207	14,734	342,463
75010	5,331	9,570	11,174	13,747	128,161
75011	6,140	9,787	11,167	13,078	198,674
75012	5,222	9,015	10,264	11,669	84,290
75013	5,010	8,345	9,769	11,386	56,522
75014	6,389	9,378	10,805	12,948	318,546
75015	6,994	9,546	10,762	12,066	17,901
75016	4,000	9,910	11,504	13,810	28,489
75017	4,152	9,967	11,700	14,165	181,150
75018	4,263	8,582	10,222	12,192	62,619
75019	5,204	7,294	9,060	10,579	30,095
75020	5,390	8,347	9,664	11,538	234,160

You can see that not only the 8th district (75008), but all the districts from 1th to 8th are very expensive. In particular, the median (percentile 50) of 6th district (75006) and 7th district (75007) are higher than 15k€/m2.

Prices Per Lot Type

In the previous tables, we use bucket aggregation on postal code. But we can also analyze from another angle: the lot type (T1, T2, T3, T4, …). T1 means there is only 1 piece in the apartment, T2 means there are 2 pieces, etc. This type of analysis is useful because different people have different needs in their lives. Young people probably want to save some money and buy a small apartment, but a family probably wants a better one (T3+) because they need more room for the babies, etc. Using the runtime mappings (painless script) we saw before, we can compute the lot type as part of the search aggregation request and obtain the following tables.

{
  "runtime_mappings": {
    "price_m2": {
      "type": "double",
      "script": "emit(doc['property_value'].value / doc['real_built_up_area'].value)"
    },
    "lot_type": {
      "type": "keyword",
      "script": "if (0 < doc['lots_count'].value && doc['lots_count'].value < 6) { emit('T' + doc['lots_count'].value) } else { emit('Others') }"
    }
  },
  ...
}

The prices per lot type in Paris in percentiles:

Lot Type	p5 (€)	p25 (€)	p50 (€)	p75 (€)	p95 (€)
T1	31,264	210,041	358,500	600,600	1,565,087
T2	179,993	338,870	517,788	792,894	1,702,597
T3	162,000	372,188	618,806	1,232,125	2,740,950
T4	147,470	414,250	699,413	1,222,750	3,173,845
T5	331,108	530,791	809,000	1,275,048	3,220,000
Others	262,500	2,880,000	5,712,857	12,572,222	33,100,000

From the table above, we can see that it will be very hard to find an apartment if your budget is below 400,000€. It means that either you have a wonderful job or you will have to the house with your partner. Or maybe with some luck you won the loto 😉. Anyway it’s very expensive.

We can also normalize the price as we did before. Here is the price per meter square (€/m2) in Paris per lot type in percentiles:

Lot Type	p5 (€/m2)	p25 (€/m2)	p50 (€/m2)	p75 (€/m2)	p95 (€/m2)
T1	840	8,936	10,857	13,118	31,321
T2	6,235	9,246	10,742	12,545	19,277
T3	5,763	9,679	11,488	13,572	23,407
T4	3,047	9,251	11,021	13,586	22,644
T5	8,169	10,498	12,649	15,878	37,168
Others	6,250	63,372	147,957	300,013	916,866

From this table, we can see that regardless the number of pieces you want (the lot type), the price per meter square (m2) does not change much. The lot type is not an important factor for the price. The district is probably more important as we saw in the previous sections.

Alright, we go far enough into the Paris real-estate market. It’s crazy and it’s not for us right now. Let’s come back to the aggregations of Elasticsearch and we are reaching the end of this article.

Recapitulation

We saw several metric and bucket aggregations in this article, but we didn’t see all of them. There are so many metrics in Elasticsearch that we cannot remember everything. For me, the takeover is the syntax of the aggregation API. Once you remember this, it’s easy to search on the internet and complete the rest:

GET /{index_name_expression}/_search

{
  "query": {
    "{query_type}": { ... }
  },
  "size": 0,
  "runtime_mappings": {
    "{mapping_name}": { ... }
  },
  "aggs": {
    "{aggregation_name}": {
      "{aggregation_type}": {
        "field": "{field_name}"
        ...
      }
    }
  }
}

The API path contains the name of the index to be searched. It can also be multiple indices separated by comma (,) or _all indices. As for the API request body, it consists of multiple parts: the query part where you can specify the criteria of the selection; the size part where you can specify the number of search hits returned (actual documents found in Elasticsearch); the runtime mappings part to define one or multiple computed fields at query time; and finally the aggregations part where you can define your metric or bucket aggregation. You can also provide sub-aggregations under a given aggregation.

As for the response:

{
  "took": 14,
  "timed_out": false,
  "_shards": { ... },
  "hits": { ... },
  "aggregations": {
    "price_m2/stats": {
      "count": 147763,
      "min": 0.003750000149011612,
      "max": 8166666.666666667,
      "avg": 26941.800127837614,
      "sum": 3981001212.2896695
    }
  }
}

It contains some metadata about the search settings and performance, such as the execution time, the timeout, the number of shards reached. Then, it contains the number of hits and the actual documents. And finally the aggregations, each aggregation is a key-value pair in the JSON response, where the key is the name of the aggregation we specified and the value is the actual result of the aggregation. As for the structure of the aggregation result:

For metric aggregation, the actual result is the metric. There is one metric if this is a single-valued metric; there are multiple metrics if this is a multi-valued metric.
For bucket aggregation, the actual results are shown under entry buckets. Each bucket contains the key, the number of documents found, and the actual metric(s) requested.

We can also see it from another angle by comparing the syntax of Elasticsearch Aggregations API to SQL. They are not exactly equivalent, but I believe this comparison is helpful for understanding:

Item	SQL	Elasticsearch	Comment
Source	`FROM {source_table}`	Index name	You can select multiple indices in Elasticsearch but you cannot do that in SQL without JOIN.
Metric	`SELECT my_func(my_field)`	Name of the aggregation
Aggregation	`GROUP BY my_field`	`"field": {my_field}`
Query	`WHERE {clause}`	`"query": {clause}`
Size	-	`"size": {size}`	There are no equivalent. In SQL, you cannot GROUP BY and select all the fields of some documents at the same time.

Going Further

How to go further from here?

To learn more about aggregations for Elasticsearch 7, visit official documentation “Aggregations (7.x)”
To learn more about boolean query for Elasticsearch 7, visit official documentation “Boolean query (7.x)”
To learn more about dataset “Demandes de valeurs foncières géolocalisées”, visit the website of the French government etalab.
To learn how to earn 1M€ and buy an apartment in Paris? Well, I want to know as well. Please let me know if you have an answer by leaving a comment below. 😬

If you are interested to see the source code, you can also find it on my GitHub under project mincong-h/learning-elasticsearch.

Conclusion

In this article, we saw how to use metric aggregation using an example of value-count aggregation, how to use bucket aggregation for postal code via terms aggregation, and how to use scripted aggregation by computing the metric of price per square meter (m2). We also saw how to perform an aggregation that contains sub-aggregations (stats and percentiles) using Paris real-estate market as a demo. Finally, we saw how to remember the syntax efficiently and how to go further from here. Interested to know more? You can subscribe to the feed of my blog, follow me on Twitter or GitHub. Hope you enjoy this article, see you the next time!

References

Elasticsearch, “Value count aggregation”, elastic.co, 2021. https://www.elastic.co/guide/en/elasticsearch/reference/7.x/search-aggregations-metrics-valuecount-aggregation.html
Elasticsearch, “Elastic Certified Engineer Exam”, elastic.co, 2021. https://www.elastic.co/training/elastic-certified-engineer-exam
Jingwen Zheng, “Second-hand apartments transactions in Paris (01/2014 - 06/2020)”, jingwen.github.io, 2021. https://jingwen-z.github.io/second-hand-apartments-transactions-in-paris-1420/

PREVIOUSElasticsearch: Generate Configuration With Python Jinja 2

NEXTDVF: Real Estate Analysis For Île-de-France in 2020