Elasticsearch Aggregation (Part 3)
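Bucket aggregations
date histogram aggregation
date range aggregation
- Missing values
- keyed response
filter aggregation
- Using a top-level `query` to limit all aggregations
- Using `filters` for multiple filters
filters aggregation
Anonymous filters
Other bucket

date histogram aggregation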

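The date histogram aggregation is a multi-bucket aggregation that is very similar to the histogram aggregation, but it can only be used with date or date range values. Because Elasticsearch represents dates internally as long values, a normal histogram can also be used on dates, though less accurately. The main difference between the two APIs is that the interval of a date histogram can be specified with date/time expressions.
As with the histogram, values are rounded down into the closest bucket. For example, with an interval of one day, 2020-01-03T07:00:01Z is rounded down to 2020-01-03T00:00:00Z. Values are rounded as follows: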
bucket_key = Math.floor(value / interval) * interval
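The rounding formula can be tried out in Python. This is only a sketch: it assumes a fixed one-day interval expressed in milliseconds, and the helper names `to_millis` and `bucket_key` are made up for illustration. Calendar-aware intervals need calendar arithmetic instead of plain integer division.

```python
from datetime import datetime, timezone

DAY_MS = 24 * 60 * 60 * 1000  # length of a fixed one-day interval in milliseconds


def to_millis(iso_timestamp: str) -> int:
    """Convert an ISO-8601 timestamp with an explicit offset to epoch milliseconds."""
    return int(datetime.fromisoformat(iso_timestamp).timestamp() * 1000)


def bucket_key(value_ms: int, interval_ms: int) -> int:
    """Round value_ms down to the start of its bucket (floor, not nearest)."""
    return (value_ms // interval_ms) * interval_ms


value = to_millis("2020-01-03T07:00:01+00:00")
key = bucket_key(value, DAY_MS)
print(datetime.fromtimestamp(key / 1000, tz=timezone.utc).isoformat())
# prints 2020-01-03T00:00:00+00:00
```

Because the division is floored, a value late in the day (for example 23:59:59) still lands in the bucket that starts at midnight of the same day.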
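When configuring a date histogram aggregation, the interval can be specified in two ways: calendar-aware intervals and fixed intervals.
Calendar-aware intervals understand that daylight saving time changes the length of specific days, that months have different numbers of days, and that leap seconds can be tacked onto a particular year.
Fixed intervals, by contrast, are always multiples of SI units and do not change based on calendar context.
Calendar-aware intervals are configured with the calendar_interval parameter. You can specify the interval using the unit name, such as month, or as a single unit quantity, such as 1M. For example, day and 1d are equivalent. Multiple quantities, such as 2d, are not supported.
Calendar intervals accept the following values:
minute, 1m
hour, 1h
day, 1d
week, 1w
month, 1M
quarter, 1q
year, 1y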
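As a small illustration, a client could check an interval string against this list before sending a request. The helper below is hypothetical and not part of any Elasticsearch client; it only encodes the accepted values. Note that the abbreviations are case-sensitive: 1m is one minute, 1M is one month.

```python
# Calendar units and their single-quantity abbreviations.
CALENDAR_INTERVALS = {
    "minute": "1m",
    "hour": "1h",
    "day": "1d",
    "week": "1w",
    "month": "1M",
    "quarter": "1q",
    "year": "1y",
}


def is_calendar_interval(interval: str) -> bool:
    """Return True if the string is a unit name or its single-unit abbreviation."""
    return interval in CALENDAR_INTERVALS or interval in CALENDAR_INTERVALS.values()


for candidate in ("month", "1M", "1d", "2d"):
    print(candidate, is_calendar_interval(candidate))
```

Only the last candidate, 2d, is rejected, because multiples of a calendar unit are not supported.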
Calendar interval examples
 
The following request buckets documents into intervals of one calendar month:
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      }
    }
  }
}
'
If you attempt to use multiples of a calendar unit, the aggregation fails, because only singular calendar units are supported:
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "2d"
      }
    }
  }
}
'
Response:
{
  "error" : {
    "root_cause" : [...],
    "type" : "x_content_parse_exception",
    "reason" : "[1:82] [date_histogram] failed to parse field [calendar_interval]",
    "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "The supplied interval [2d] could not be parsed as a calendar interval.",
      "stack_trace" : "java.lang.IllegalArgumentException: The supplied interval [2d] could not be parsed as a calendar interval."
    }
  }
}
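Fixed intervals are configured with the fixed_interval parameter.
In contrast to calendar-aware intervals, a fixed interval is a fixed number of SI units and never deviates, regardless of where it falls on the calendar. Fixed intervals can be specified in any multiple of the supported units.
However, fixed intervals cannot express other units such as months, because the duration of a month is not a fixed quantity. Attempting to specify a calendar unit such as month or quarter as a fixed interval throws an exception.
Fixed intervals accept the following units:
milliseconds (ms)
seconds (s)
minutes (m)
hours (h)
days (d)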
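Since every fixed unit has a constant length, a fixed interval string maps directly to a number of milliseconds. The following sketch shows that mapping; `fixed_interval_to_millis` is a made-up helper, and the parsing rules are a simplification, not Elasticsearch's own parser.

```python
import re

# Length of each supported fixed unit in milliseconds.
FIXED_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def fixed_interval_to_millis(interval: str) -> int:
    """Convert strings such as '30d' or '90m' to milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", interval)
    if match is None:
        # Calendar units such as '1M' (month) have no fixed length.
        raise ValueError(f"not a fixed interval: {interval!r}")
    quantity, unit = match.groups()
    return int(quantity) * FIXED_UNIT_MS[unit]


print(fixed_interval_to_millis("30d"))  # 2592000000
print(fixed_interval_to_millis("90m"))  # 5400000
```

Passing a calendar unit such as 1M raises ValueError, which mirrors the rule that months cannot be expressed as a fixed interval.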
Fixed interval examples
 
A calendar month can be approximated with 30 fixed days:
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "fixed_interval": "30d"
      }
    }
  }
}
'
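Internally, Elasticsearch represents a date as a 64-bit long holding a timestamp in milliseconds since the epoch. These timestamps are returned as the key of each bucket. key_as_string is the same timestamp converted to a formatted date string, using the format given in the format parameter.
Tip: if you don't specify format, the first date format specified in the field mapping is used.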
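The relationship between key and key_as_string can be checked with a few lines of Python. This is a sketch under two assumptions: the helper name `key_as_string` is only borrowed from the response field, and Python's `%Y-%m-%d` stands in for the Java-style pattern yyyy-MM-dd. The three keys are the bucket keys from this section's example response.

```python
from datetime import datetime, timezone


def key_as_string(key_ms: int, fmt: str = "%Y-%m-%d") -> str:
    """Format an epoch-millisecond bucket key as a UTC date string."""
    return datetime.fromtimestamp(key_ms / 1000, tz=timezone.utc).strftime(fmt)


for key in (1420070400000, 1422748800000, 1425168000000):
    print(key, key_as_string(key))
# 1420070400000 2015-01-01
# 1422748800000 2015-02-01
# 1425168000000 2015-03-01
```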
The following request formats the bucket keys as yyyy-MM-dd:
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "1M",
        "format": "yyyy-MM-dd"
      }
    }
  }
}
'
Response:
{
  "aggregations": {
    "sales_over_time": {
      "buckets": [
        {
          "key_as_string": "2015-01-01",
          "key": 1420070400000,
          "doc_count": 3
        },
        {
          "key_as_string": "2015-02-01",
          "key": 1422748800000,
          "doc_count": 2
        },
        {
          "key_as_string": "2015-03-01",
          "key": 1425168000000,
          "doc_count": 2
        }
      ]
    }
  }
}
keyed response
 
Setting the keyed flag to true associates a unique string key with each bucket and returns the buckets as a hash rather than an array:
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "1M",
        "format": "yyyy-MM-dd",
        "keyed": true
      }
    }
  }
}
'
Response:
{
  "aggregations": {
    "sales_over_time": {
      "buckets": {
        "2015-01-01": {
          "key_as_string": "2015-01-01",
          "key": 1420070400000,
          "doc_count": 3
        },
        "2015-02-01": {
          "key_as_string": "2015-02-01",
          "key": 1422748800000,
          "doc_count": 2
        },
        "2015-03-01": {
          "key_as_string": "2015-03-01",
          "key": 1425168000000,
          "doc_count": 2
        }
      }
    }
  }
}
If the data in your documents doesn't exactly match what you want to aggregate, use a runtime field. For example, revenue from promoted sales should be recognized one day after the sale date. The date value is in epoch milliseconds, so the script adds 86,400,000 ms (one day):
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "runtime_mappings": {
    "date.promoted_is_tomorrow": {
      "type": "date",
      "script": "long date = doc[\u0027date\u0027].value.toInstant().toEpochMilli();\nif (doc[\u0027promoted\u0027].value) {\n  date += 86400000;\n}\nemit(date);"
    }
  },
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date.promoted_is_tomorrow",
        "calendar_interval": "1M"
      }
    }
  }
}
'
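The missing parameter defines how documents that lack a value should be treated. By default they are ignored, but they can also be treated as if they had a value. In the request below, documents without a value in the date field fall into the same bucket as documents dated 2000-01-01:
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sale_date": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "year",
        "missing": "2000/01/01"
      }
    }
  }
}
'
By default, the returned buckets are sorted by their key in ascending order, but you can control the order with the order setting.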
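The effect of missing and of the default ascending key order can be emulated in plain Python. This is a toy model, not Elasticsearch code: the sample documents are made up, dates are kept as yyyy/MM/dd strings, and `year_histogram` is a hypothetical helper that buckets by year.

```python
from collections import Counter


def year_histogram(docs, missing=None):
    """Count documents per year; documents without a date use `missing` if given."""
    counts = Counter()
    for doc in docs:
        date = doc.get("date", missing)
        if date is None:
            continue  # default behaviour: documents without a value are ignored
        counts[date[:4]] += 1
    return sorted(counts.items())  # ascending by key, like the default order


# Made-up sample data: the third document has no date field.
docs = [{"date": "2015/01/01"}, {"date": "2015/03/01"}, {"price": 10}]
print(year_histogram(docs))                        # [('2015', 2)]
print(year_histogram(docs, missing="2000/01/01"))  # [('2000', 1), ('2015', 2)]
```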
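date range aggregation
 
A range aggregation dedicated to date values. The main difference from the normal range aggregation is that the from and to values can be expressed as Date Math expressions, and that you can specify a date format in which the from and to response fields are returned. Note that this aggregation includes the from value and excludes the to value of each range.
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyyy",
        "ranges": [
          { "to": "now-10M/M" },
          { "from": "now-10M/M" }
        ]
      }
    }
  }
}
'
The example above creates two buckets:
documents dated before now minus 10 months, rounded down to the start of the month
documents dated on or after now minus 10 months, rounded down to the start of the month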
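To make the boundary concrete, the sketch below evaluates an expression of the form now-10M/M for an assumed value of now and applies the inclusive-from, exclusive-to rule. The function names and the chosen now are illustrative assumptions; Elasticsearch's date math supports far more than this one pattern.

```python
from datetime import datetime, timezone


def months_ago_month_start(now: datetime, months: int) -> datetime:
    """Subtract whole months from `now`, then round down to the start of that month."""
    index = now.year * 12 + (now.month - 1) - months
    return datetime(index // 12, index % 12 + 1, 1, tzinfo=timezone.utc)


def in_range(value, start=None, end=None) -> bool:
    """from is inclusive, to is exclusive; None means unbounded."""
    return (start is None or value >= start) and (end is None or value < end)


now = datetime(2016, 8, 15, 12, 0, tzinfo=timezone.utc)  # assumed "now" for illustration
boundary = months_ago_month_start(now, 10)
print(boundary.isoformat())              # 2015-10-01T00:00:00+00:00
print(in_range(boundary, end=boundary))    # False: "to" excludes the boundary
print(in_range(boundary, start=boundary))  # True: "from" includes the boundary
```

A document dated exactly at the boundary therefore belongs to the second bucket only.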
Explicit dates can be used as well, here on a birthday field:
GET my-index-000001/_search
{
  "size": 0,
  "aggs": {
    "range": {
      "date_range": {
        "field": "birthday",
        "ranges": [
          {
            "from": "2015-01",
            "to": "2015-12"
          }
        ]
      }
    }
  }
}
Response:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "range" : {
      "buckets" : [
        {
          "key" : "2015-01-01T00:00:00.000Z-2015-12-01T00:00:00.000Z",
          "from" : 1.4200704E12,
          "from_as_string" : "2015-01-01T00:00:00.000Z",
          "to" : 1.448928E12,
          "to_as_string" : "2015-12-01T00:00:00.000Z",
          "doc_count" : 3
        }
      ]
    }
  }
}
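The missing parameter defines how documents that lack a value should be treated. By default they are ignored, but they can also be treated as if they had a value. This is done by adding a set of fieldname : value mappings that specify a default value per field.
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "missing": "1976/11/30",
        "ranges": [
          {
            "key": "Older",
            "to": "2016/02/01"
          },
          {
            "key": "Newer",
            "from": "2016/02/01",
            "to": "now/d"
          }
        ]
      }
    }
  }
}
'
In the request above, "missing": "1976/11/30" means that documents without a date field are treated as if their date were 1976/11/30, so they are added to the Older bucket.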
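The same behaviour can be reproduced with a toy model in Python. Everything here is illustrative: `bucket_for` is a hypothetical helper, and `today` is an assumed stand-in for the now/d expression.

```python
from datetime import date


def bucket_for(doc_date, ranges, missing=None):
    """Return the keys of all ranges containing doc_date (from inclusive, to exclusive)."""
    value = doc_date if doc_date is not None else missing
    if value is None:
        return []  # default: documents without a value are ignored
    return [
        r["key"]
        for r in ranges
        if (r.get("from") is None or value >= r["from"])
        and (r.get("to") is None or value < r["to"])
    ]


today = date(2016, 6, 1)  # assumed stand-in for now/d
ranges = [
    {"key": "Older", "to": date(2016, 2, 1)},
    {"key": "Newer", "from": date(2016, 2, 1), "to": today},
]
print(bucket_for(date(2016, 3, 15), ranges))                 # ['Newer']
print(bucket_for(None, ranges))                              # []
print(bucket_for(None, ranges, missing=date(1976, 11, 30)))  # ['Older']
```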
keyed response
 
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array:
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyy",
        "ranges": [
          { "to": "now-10M/M" },
          { "from": "now-10M/M" }
        ],
        "keyed": true
      }
    }
  }
}
'
Response:
{
  "aggregations": {
    "range": {
      "buckets": {
        "*-10-2015": {
          "to": 1.4436576E12,
          "to_as_string": "10-2015",
          "doc_count": 7
        },
        "10-2015-*": {
          "from": 1.4436576E12,
          "from_as_string": "10-2015",
          "doc_count": 0
        }
      }
    }
  }
}
You can also customize the key of each range:
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyy",
        "ranges": [
          { "from": "01-2015", "to": "03-2015", "key": "quarter_01" },
          { "from": "03-2015", "to": "06-2015", "key": "quarter_02" }
        ],
        "keyed": true
      }
    }
  }
}
'
Response:
{
  "aggregations": {
    "range": {
      "buckets": {
        "quarter_01": {
          "from": 1.4200704E12,
          "from_as_string": "01-2015",
          "to": 1.425168E12,
          "to_as_string": "03-2015",
          "doc_count": 5
        },
        "quarter_02": {
          "from": 1.425168E12,
          "from_as_string": "03-2015",
          "to": 1.4331168E12,
          "to_as_string": "06-2015",
          "doc_count": 2
        }
      }
    }
  }
}
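filter aggregation
 
The filter aggregation is a single-bucket aggregation: it narrows the set of documents to those matching a query, and its sub-aggregations then run only on that filtered set. In the example below, the top-level avg_age covers all documents, while the avg_age nested in filter_agg covers only the documents with the given email:
curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "avg_age": {
      "avg": {
        "field": "age"
      }
    },
    "filter_agg": {
      "filter": {
        "term": {
          "email": "123456@qq.com"
        }
      },
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}
'
Response:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_age" : {
      "value" : 39.75
    },
    "filter_agg" : {
      "doc_count" : 2,
      "avg_age" : {
        "value" : 60.0
      }
    }
  }
}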
Using a top-level query to limit all aggregations
 
To limit the documents on which all aggregations in a search run, use a top-level query. This is faster than a single filter aggregation with sub-aggregations. For example:
curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "term": {
      "email": {
        "value": "123456@qq.com"
      }
    }
  },
  "aggs": {
    "age_avg": {
      "avg": {
        "field": "age"
      }
    }
  }
}
'
Response:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_avg" : {
      "value" : 60.0
    }
  }
}
Using filters for multiple filters
 
To group documents using multiple filters, use the filters aggregation. This is faster than multiple separate filter aggregations. For example:
curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "myfilters": {
      "filters": {
        "filters": {
          "a": {
            "term": {
              "email": "123456@qq.com"
            }
          },
          "b": {
            "term": {
              "email": "110@qq.com"
            }
          }
        }
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}
'
The following request, which uses two separate filter aggregations, is equivalent to the filters request above, but the filters version performs better:
curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "a": {
      "filter": {
        "term": {
          "email": "123456@qq.com"
        }
      },
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    },
    "b": {
      "filter": {
        "term": {
          "email": "110@qq.com"
        }
      },
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}
'
The filters request (the first of the two) returns:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "myfilters" : {
      "buckets" : {
        "a" : {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 60.0
          }
        },
        "b" : {
          "doc_count" : 1,
          "age_avg" : {
            "value" : 13.0
          }
        }
      }
    }
  }
}
filters aggregation
 
A multi-bucket aggregation in which each bucket contains the documents that match a query. For example:
curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "my_filters": {
      "filters": {
        "filters": {
          "li": {
            "match": {
              "name": "li"
            }
          },
          "wang": {
            "match": {
              "name": "wang"
            }
          }
        }
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}
'
Response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_filters" : {
      "buckets" : {
        "li" : {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 27.5
          }
        },
        "wang" : {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 31.5
          }
        }
      }
    }
  }
}
Anonymous filters
 
The filters field can also be provided as an array of filters, as in the following request. The buckets are returned in the same order as the filters in the request.
GET my-index-000001/_search
{
  "size": 0,
  "aggs": {
    "my_filters": {
      "filters": {
        "filters": [
          {"match":{"name":"li"}},
          {"match":{"name":"wang"}}
        ]
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}
Response:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_filters" : {
      "buckets" : [
        {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 27.5
          }
        },
        {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 31.5
          }
        }
      ]
    }
  }
}
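Other bucket
 
The other_bucket parameter can be set to add a bucket to the response that contains all documents that do not match any of the given filters. Its value can be:
 false: does not compute the other bucket
 true: returns the other bucket. With anonymous filters, the other bucket is the last bucket. With named filters, the other bucket is named by the other_bucket_key parameter (_other_ by default).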
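The difference between named and anonymous filters, and where the other bucket ends up, can be sketched in Python. This is a toy emulation with made-up documents: `matches` is a rough stand-in for a match query on name, and `filters_agg` is a hypothetical helper that only computes doc_count.

```python
def matches(doc, term):
    """Rough stand-in for a match query: the term is one of the name's tokens."""
    return term in doc["name"].lower().split()


def filters_agg(docs, filters, other_bucket_key=None):
    """`filters` is a dict (named) or a list (anonymous) of terms to match on name."""
    named = isinstance(filters, dict)
    items = list(filters.items()) if named else list(enumerate(filters))
    counts = [(key, sum(matches(d, term) for d in docs)) for key, term in items]
    if other_bucket_key is not None:
        # The other bucket holds documents that match none of the filters.
        terms = [term for _, term in items]
        other = sum(not any(matches(d, t) for t in terms) for d in docs)
        counts.append((other_bucket_key, other))
    if named:
        return {key: {"doc_count": n} for key, n in counts}
    return [{"doc_count": n} for _, n in counts]  # other bucket is the last entry


# Made-up sample documents.
docs = [{"name": "li lei"}, {"name": "li hua"}, {"name": "wang fang"}, {"name": "zhao yun"}]
print(filters_agg(docs, {"li": "li", "wang": "wang"}, other_bucket_key="other_messages"))
print(filters_agg(docs, ["li", "wang"], other_bucket_key="other_messages"))
```

With named filters the result is keyed by filter name plus other_messages; with anonymous filters it is a list whose last element is the other bucket.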
  
In the following example, which uses named filters, the other bucket is named other_messages. Setting other_bucket_key implicitly sets other_bucket to true.
GET my-index-000001/_search
{
  "size": 0,
  "aggs": {
    "my_filters": {
      "filters": {
        "other_bucket_key": "other_messages",
        "filters": {
          "li": {
            "match": {
              "name": "li"
            }
          },
          "wang": {
            "match": {
              "name": "wang"
            }
          }
        }
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}
Response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_filters" : {
      "buckets" : {
        "li" : {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 27.5
          }
        },
        "wang" : {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 31.5
          }
        },
        "other_messages" : {
          "doc_count" : 0,
          "age_avg" : {
            "value" : null
          }
        }
      }
    }
  }
}
The following example uses anonymous filters, so the other bucket is returned as the last bucket:
curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "my_filters": {
      "filters": {
        "other_bucket_key": "other_messages",
        "filters": [
          {
            "match": {
              "name": "li"
            }
          },
          {
            "match": {
              "name": "wang"
            }
          }
        ]
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}
'
Response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_filters" : {
      "buckets" : [
        {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 27.5
          }
        },
        {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 31.5
          }
        },
        {
          "doc_count" : 0,
          "age_avg" : {
            "value" : null
          }
        }
      ]
    }
  }
}