---Start Configuration---
The elasticsearch config folder contains two configuration files: elasticsearch.yml and logging.yml. The first is the basic ES configuration file; the second configures logging. ES uses log4j for logging, so logging.yml can be set up like any ordinary log4j configuration file. The following mainly explains what can be configured in elasticsearch.yml.
Sets the cluster name; the default is elasticsearch. ES automatically discovers other ES nodes on the same network segment, so if several clusters share one segment, this property tells them apart.
Node name; by default a random name is picked from a list shipped in the ES jar (config folder, name.txt), which contains many amusing names added by the authors.
Specifies whether the node is eligible to be elected master; the default is true. By default the first machine in the cluster becomes master, and if it goes down a new master is elected.
Specifies whether the node stores index data; the default is true.
Sets the default number of shards per index; the default is 5.
Sets the default number of replicas per index; the default is 1.
Sets the path of the configuration files; the default is the config folder under the ES root directory.
Sets the path where index data is stored; the default is the data folder under the ES root directory. Multiple paths can be given, separated by commas, e.g.:
path.data: /path/to/data1,/path/to/data2
Sets the path for temporary files; the default is the work folder under the ES root directory.
Sets the path for log files; the default is the logs folder under the ES root directory.
Sets the path where plugins are stored; the default is the plugins folder under the ES root directory.
Set to true to lock the process memory. ES slows down badly once the JVM starts swapping, so make sure it never swaps: set the ES_MIN_MEM and ES_MAX_MEM environment variables to the same value and make sure the machine has enough memory for ES. The elasticsearch process must also be allowed to lock memory; on Linux this can be granted with `ulimit -l unlimited`.
Sets the bind address; it can be IPv4 or IPv6 and defaults to 0.0.0.0.
Sets the IP address other nodes use to reach this node. If not set it is determined automatically; the value must be a real IP address.
This parameter sets both bind_host and publish_host above at once.
Sets the TCP port used for communication between nodes; the default is 9300.
Sets whether data is compressed for TCP transport; the default is false (no compression).
Sets the HTTP port for external access; the default is 9200.
Sets the maximum HTTP content length; the default is 100mb.
Whether to expose the HTTP service; the default is true (enabled).
The gateway type; the default is local, i.e. the local file system. It can also be set to a distributed file system, Hadoop HDFS, or Amazon S3; configuring the other gateway types will be covered in detail another time.
Data recovery starts once N nodes of the cluster have started; the default is 1.
Sets the timeout for starting the initial recovery process; the default is 5 minutes.
Sets the number of nodes expected in the cluster; the default is 2. As soon as these N nodes are up, recovery starts immediately.
Number of concurrent recovery threads during initial recovery; the default is 4.
Number of concurrent recovery threads when nodes are added or removed or during rebalancing; the default is 4.
Limits the bandwidth used during recovery, e.g. 100mb; the default is 0, i.e. unlimited.
Limits the maximum number of concurrent streams opened when recovering data from other shards; the default is 5.
Set this so that nodes in the cluster know about N other master-eligible nodes. The default is 1; for large clusters a higher value (2-4) is reasonable.
Sets the ping timeout used when automatically discovering other nodes; the default is 3 seconds. On poor networks a higher value helps avoid discovery errors.
Sets whether multicast discovery is enabled; the default is true.
Sets the initial list of master-eligible nodes in the cluster, through which newly joined nodes are discovered:
discovery.zen.ping.unicast.hosts: ["host1", "host2:port", "host3[portX-portY]"]
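For reference, a minimal elasticsearch.yml sketch that puts the settings described above together with example values. The keys below are assumed from the standard 0.19-era default elasticsearch.yml; verify them against the default file shipped with your version:
cluster.name: mycluster
node.name: "node-1"
node.master: true
node.data: true
index.number_of_shards: 5
index.number_of_replicas: 1
path.data: /path/to/data1,/path/to/data2
path.logs: /path/to/logs
bootstrap.mlockall: true
network.host: 192.168.0.1
transport.tcp.port: 9300
transport.tcp.compress: true
http.port: 9200
http.max_content_length: 100mb
gateway.type: local
gateway.recover_after_nodes: 1
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 4
indices.recovery.max_size_per_sec: 100mb
indices.recovery.concurrent_streams: 5
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.timeout: 3s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["host1", "host2:9300"]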
index.search.slowlog.level: TRACE
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms
Linux kernel parameters for excessive TIME_WAIT connections (check first whether the TIME_WAIT growth is periodic: jstat):
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 30
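A minimal sketch of applying the kernel parameters above (assumes root; tcp_tw_recycle is known to break clients behind NAT, so treat these as tuning hints rather than blanket defaults):
# append the four net.ipv4 lines above to /etc/sysctl.conf, then reload
sysctl -p
# check how many connections are currently stuck in TIME_WAIT
netstat -an | grep -c TIME_WAIT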
[The following was collected from the official site and is untested]
action.get.realtime: true
Controls whether GET is realtime. By default, the get API is realtime. In order to disable realtime GET, one can either set the realtime parameter to false, or globally default it by setting action.get.realtime to false in the node configuration.
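For example, realtime GET can also be disabled per request (index, type and id here are placeholders):
curl -XGET 'http://localhost:9200/myindex/mytype/1?realtime=false'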
[The Elasticsearch team offers a few optimization suggestions for running large clusters:]
1. Set the ES_HEAP_SIZE environment variable so the JVM uses the same minimum and maximum memory. If min and max differ, the JVM blocks the Java process whenever it needs additional memory (up to the maximum) in order to allocate it. Combined with an old Java version, this explains why nodes in the cluster paused, showed high load and kept allocating memory. The Elasticsearch team recommends giving ES 50% of the system memory.
2. Shorten the recover_after_time timeout so that recovery starts right away instead of waiting.
3. Configure minimum_master_nodes so that when several nodes pause for a long time, a subset of nodes cannot try to form a cluster on their own and destabilize the whole cluster.
4. During initial recovery some nodes ran out of disk space. How this happened is unclear, since the whole cluster used only 67% of the total space; it is believed to be a consequence of the earlier high load and the old Java version, and the Elasticsearch team is following up on it.
[Standalone and server environment setup]
First download the latest elasticsearch release from http://www.elasticsearch.org/download/. At the time of writing the latest is 0.19.1; the author is very diligent, ES is updated frequently and bugs are fixed quickly. After unpacking there are three folders: bin holds the startup scripts, config the configuration files and lib the dependency jars. If you want to install plugins you need to create an extra plugins folder and put them there.
1. Standalone environment:
Running a standalone elasticsearch is simple: on Linux run bin/elasticsearch, on Windows run bin/elasticsearch.bat. Running an elasticsearch cluster on a LAN is also simple: as long as cluster.name is identical and the machines are on the same network segment, the started ES nodes discover each other automatically and form a cluster.
2. Server environment:
On a server you can use the elasticsearch-servicewrapper plugin. It takes a parameter that says whether to run ES in the foreground or background, and supports starting, stopping and restarting the ES service (the default ES script can only be stopped with Ctrl+C). To use it, download the service folder from https://github.com/elasticsearch/elasticsearch-servicewrapper and put it into the bin directory of ES. The commands are:
bin/service/elasticsearch +
console   run ES in the foreground
start     run ES in the background
stop      stop ES
install   install ES as a service started automatically at boot
remove    remove the automatic start at boot
The service directory contains an elasticsearch.conf configuration file that mainly sets Java runtime parameters; the most important ones are the following:
# ES home path; the default value is fine
set.default.ES_HOME=<Path to ElasticSearch Home>
# minimum memory allocated to ES
set.default.ES_MIN_MEM=256
# maximum memory allocated to ES
set.default.ES_MAX_MEM=1024
# startup wait timeout (in seconds)
wrapper.startup.timeout=300
# shutdown wait timeout (in seconds)
wrapper.shutdown.timeout=300
# ping timeout (in seconds)
wrapper.ping.timeout=300
[Java Virtual Machine configuration]
Elasticsearch ships with a preconfigured Java Virtual Machine. Normally you do not need to care much about it, because the defaults were chosen carefully, and you can use ElasticSearch right away.
However, while monitoring the memory of your ElasticSearch nodes you may be tempted to change some of the settings. Will those changes improve your situation?
This post tries to demystify the Elasticsearch JVM configuration, discusses the most common adjustments, and ends with some recommended changes.
Overview of the Elasticsearch JVM configuration:
These are the defaults of Elasticsearch version 0.19.11.
JVM parameter | Elasticsearch default | Environment variable |
---|---|---|
-Xms | 256m | ES_MIN_MEM |
-Xmx | 1g | ES_MAX_MEM |
-Xms and -Xmx | | ES_HEAP_SIZE |
-Xmn | | ES_HEAP_NEWSIZE |
-XX:MaxDirectMemorySize | | ES_DIRECT_SIZE |
-Xss | 256k | |
-XX:+UseParNewGC | + | |
-XX:+UseConcMarkSweepGC | + | |
-XX:CMSInitiatingOccupancyFraction | 75 | |
-XX:+UseCMSInitiatingOccupancyOnly | + | |
-XX:+UseCondCardMark | (commented out) | |
The first thing you notice is that Elasticsearch reserves between 256 MB and 1 GB of heap.
This setup is meant for development and demo environments. A developer simply unpacks the distribution, runs ./bin/elasticsearch -f, and Elasticsearch is installed. That is great for development and works in many scenarios, but not once you need more memory to lower the Elasticsearch load and need more than 2 GB of usable RAM.
ES_MIN_MEM/ES_MAX_MEM control the heap size. The newer ES_HEAP_SIZE variable is more convenient because it sets the initial and maximum heap to the same value. It is also recommended to allocate the heap as one unfragmented block; memory fragmentation is very bad for performance.
ES_HEAP_NEWSIZE is an optional parameter that controls the size of a subset of the heap, the young generation.
ES_DIRECT_SIZE controls the native direct memory, i.e. the data areas the JVM uses for the NIO framework. Native direct memory can be mapped into the virtual address space, which is more efficient on 64-bit machines because it can bypass the file system buffers. Elasticsearch puts no limit on native direct memory (which can lead to OOM).
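A minimal sketch of fixing the heap before starting a node; the 2g value is only an example, and ES_HEAP_SIZE sets -Xms and -Xmx to the same value:
export ES_HEAP_SIZE=2g
./bin/elasticsearch -f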
For historical reasons the Java Virtual Machine has several garbage collectors. They can be enabled with the following combinations of JVM parameters:
JVM parameter | Garbage collector |
---|---|
-XX:+UseSerialGC | serial collector |
-XX:+UseParallelGC | parallel collector |
-XX:+UseParallelOldGC | Parallel compacting collector |
-XX:+UseConcMarkSweepGC | Concurrent-Mark-Sweep (CMS) collector |
-XX:+UseG1GC | Garbage-First collector (G1) |
The combination of UseParNewGC and UseConcMarkSweepGC enables the concurrent, multi-threaded mode of the garbage collector. UseConcMarkSweepGC automatically selects UseParNewGC and disables the serial collector; this is the default behaviour in Java 6.
CMSInitiatingOccupancyFraction refines the CMS (Concurrent-Mark-Sweep) collection setting: it sets the old-generation occupancy threshold that triggers collection to 75. The size of the old generation is the heap size minus the new generation size. This tells the JVM to start garbage collection when the heap is 75% full. It is an estimate, because smaller heaps may need to start GC earlier.
UseCondCardMark adds an extra check before marking in the garbage collector's card table, avoiding redundant store operations. UseCondCardMark does not affect the Garbage-First collector. It is strongly recommended under highly concurrent load (it works around the throughput penalty of card table marking under high concurrency). In ElasticSearch this parameter is commented out.
Some settings can be borrowed from projects such as Apache Cassandra, which have similar JVM requirements.
All in all, the ElasticSearch configuration recommends:
1. Not to use automatic heap sizing, and to keep the default maximum heap at 1 GB.
2. To tune the threshold that triggers garbage collection, e.g. start GC when the heap is 75% full, so performance is not hurt.
3. To disable the G1 collector that Java 7 defaults to, provided your ElasticSearch runs on Java 7u4 or later.
Memory layout of the JVM process
The JVM memory consists of several parts:
Java code itself: including internal code, data, interfaces, debugging and monitoring agents and bytecode instructions
Non-heap memory: used for loading classes
Stack memory: stores local variables and operands for each thread
Heap memory: stores object references and the objects themselves
Direct buffers: used to buffer I/O data
The heap size setting matters most, because Java depends on a reasonable heap and the JVM has to obtain a bounded amount of heap from the operating system that supports the whole JVM lifetime.
If the heap is too small, garbage collection happens frequently and OOM becomes very likely.
If the heap is too large, garbage collection is delayed, but each collection then has to handle a large amount of live heap data. The operating system also comes under pressure, because the larger heap of the JVM process makes paging more likely.
Note that with the CMS garbage collector Java does not return memory to the operating system, so it is important to configure sensible initial and maximum heap values.
Non-heap memory is allocated automatically by the Java application. There is no parameter to control its size; it is determined by the application code itself.
Stack memory is allocated per thread. In Elasticsearch the per-thread stack size had to be raised from 128K to 256K, because Java 7 needs more stack than Java 6 due to new language features that exploit the stack. For example it introduces continuations, a well known programming-language concept. Continuations are useful for coroutines, green threads and fibers. When implementing non-blocking I/O, the big advantage is that code can be written as if threads were used, while the runtime still performs non-blocking I/O in the background. Elasticsearch uses several thread pools, and since the Netty I/O framework and Guava are fundamental building blocks of Elasticsearch, it is worth exploring these thread characteristics further when running on Java 7.
Exploiting larger stacks has its own challenges, because stack consumption is hard to compare across operating systems, CPU architectures and even JVM versions. Depending on CPU architecture and operating system, the JVM has a built-in default stack size. Does it fit all situations? For example, the 64-bit JVM on Solaris Sparc defaults Xss to 512K because of the larger address pointers, Solaris x86 uses 320K, Linux drops to 256K, 32-bit Windows Java 6 defaults to 320K and 64-bit Windows to 1024K.
The challenge of large heaps
Several gigabytes of memory are common today, but not long ago system administrators were in tears over a request for a few more gigabytes.
Java garbage collection improved significantly with Java 6 in 2006. Since then, collections can run concurrently with multiple tasks and the probability of GC pauses, the stop-the-world phases, has been reduced. The CMS algorithm was revolutionary: multi-task, concurrent, non-moving GC. Unfortunately it does not scale with the amount of live data on the heap. Prateek Khanna and Aaron Morton have given numbers for the heap sizes the CMS collector can handle.
Avoiding stop-the-world phases
We have seen how Elasticsearch configures the CMS garbage collector. This does not prevent long GC pauses, it only lowers their probability. CMS is a low-pause-probability collector, but there are still edge cases. When there are MB-sized large arrays on the heap, or in some other special situations, CMS can take much longer than expected.
Creating MB-sized arrays is common during Lucene segment-based index merges. If you want to reduce the extra CMS load, tune the number of segments in the Lucene merge phase with the parameter index.merge.policy.segments_per_tier.
Reducing paging
The risk of a large heap is memory pressure. Note that while the Java JVM is using a large heap, that memory is unavailable to the rest of the system. When memory gets tight the operating system starts paging and, in emergencies, when every other way of reclaiming memory has failed, it force-kills processes. Once paging starts, the performance of the whole system drops, and GC performance drops with it. So do not give the heap too much memory.
Choosing a garbage collector
Since Java JDK 7u4 the Garbage-First (G1) collector is the default garbage collector of Java 7. It targets multi-core machines and large memory. It reduces pause times while increasing the number of pauses. Whole-heap operations such as global marking are performed concurrently with the application threads. This prevents interruption times from growing proportionally with the heap or live-data size.
The G1 collector aims for higher throughput rather than speed. It works well when:
1. Live data occupies more than 50% of the Java heap
2. The object allocation rate or promotion varies significantly
3. Long GC or compaction pauses (more than 0.5 to 1 s) are undesirable
Note that with the G1 garbage collector, heap memory that is no longer used may be returned to the operating system.
The downside of G1 is that the higher the CPU usage, the worse the application performance. Therefore, with enough memory and ordinary CPU power, CMS may be the better choice.
For Elasticsearch, G1 means no long stop-the-world phases and more flexible memory management, because buffer memory and the system I/O cache can make fuller use of the machine memory. The price is slightly less than maximum performance, since G1 uses more CPU.
Performance tuning strategy
You are reading this post because you want some hints on performance tuning:
1. Know your performance goals precisely. Do you want to maximize speed or maximize throughput?
2. Log everything, collect statistics, read logs and analyse events to diagnose your configuration
3. Pick the target of your tuning (maximum performance or maximum throughput)
4. Plan your changes
5. Apply the new configuration
6. Monitor the system with the new configuration
7. If the new configuration does not improve your situation, repeat the steps above, over and over
Elasticsearch garbage collection log format
Elasticsearch logs a long GC at warn level like this:
[2012-11-26 18:13:53,166][WARN ][monitor.jvm ] [Ectokid] [gc][ParNew][1135087][11248] duration [2.6m], collections [1]/[2.7m], total [2.6m]/[6.8m], memory [2.4gb]->[2.3gb]/[3.8gb], all_pools {[Code Cache] [13.7mb]->[13.7mb]/[48mb]}{[Par Eden Space] [109.6mb]->[15.4mb]/[1gb]}{[Par Survivor Space] [136.5mb]->[0b]/[136.5mb]}{[CMS Old Gen] [2.1gb]->[2.3gb]/[2.6gb]}{[CMS Perm Gen] [35.1mb]->[34.9mb]/[82mb]}
The relevant usage can be found in the JvmMonitorService class:
Logfile | Explanation |
---|---|
gc | the running GC |
ParNew | new parallel garbage collector |
duration 2.6m | the GC took 2.6 minutes |
collections [1]/[2.7m] | one collection is running, taking 2.7 minutes in total |
memory [2.4gb]->[2.3gb]/[3.8gb] | memory usage, 2.4gb before, 2.3gb now, out of 3.8gb total |
Code Cache [13.7mb]->[13.7mb]/[48mb] | memory used by the code cache |
Par Eden Space [109.6mb]->[15.4mb]/[1gb] | memory used by Par Eden Space |
Par Survivor Space [136.5mb]->[0b]/[136.5mb] | memory used by Par Survivor Space |
CMS Old Gen [2.1gb]->[2.3gb]/[2.6gb] | memory used by CMS Old Gen |
CMS Perm Gen [35.1mb]->[34.9mb]/[82mb] | memory used by CMS Perm Gen |
Some recommendations
1. Do not run Elasticsearch on releases older than Java 6u22; there are memory-related bugs. Bugs and defects that are two or three years old get in the way of a smooth Elasticsearch run. The Sun/Oracle builds are preferred over the old OpenJDK 6, because many bugs are fixed there.
2. Leave Java 6 and move to Java 7. Oracle announced that Java 6 updates end in February 2013. Given that Elasticsearch is still relatively new software, it should use newer technology to improve performance and squeeze as much as possible out of the JVM. Also check your operating system version; running the latest OS release helps your Java runtime reach its best performance.
3. Update the Java runtime regularly, on average once a quarter. Tell your sysadmins that timely Java updates are needed to pick up Java performance improvements.
4. Start small, grow big. Develop on a single Elasticsearch node first, but do not forget that Elasticsearch's strength is being distributed. A single node cannot reproduce the characteristics of a production environment; use at least 3 nodes for development testing.
5. Benchmark before tuning the JVM. Establish a performance baseline for your system and vary the number of nodes while testing. If the indexing load is high, you may need to lower the heap used for indexing and tune segment merging with the index.merge.policy.segments_per_tier parameter.
6. Know your performance goal before tuning, then decide whether you are tuning for speed or for throughput.
7. Enable logging so you can diagnose better; evaluate carefully before optimizing the system.
8. If you use the CMS garbage collector you may want to add a reasonable -XX:CMSWaitDuration parameter.
9. If your heap exceeds 6-8 GB, beyond what the CMS collector was designed for, you will run into long stop-the-world phases. You have several options: tune CMSInitiatingOccupancyFraction to lower the probability of long GCs, reduce the maximum heap size, or enable the G1 garbage collector.
10. Learn the art of garbage collection tuning. If you want to master it, list the available JVM options by adding -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version to the java command, and tune from there.
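As a starting point for item 10, the flag listing mentioned above can be filtered down to the GC-related options, for example:
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version | grep -i 'gc'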
---End Configuration ---
---Start API---
This section describes the REST APIs elasticsearch provides (mainly) using JSON. The API is exposed using HTTP, thrift, memcached.
The most basic ES APIs are the CRUD operations, which are standard REST.
Beyond those, three APIs are particularly important and frequently used: _bulk, _count and _search.
Bulk: as the name implies, it merges many single records into one large array submitted at once. This avoids per-record header parsing and frequent index refreshes, so indexing speed improves greatly.
Count: given a POSTed JSON query it returns the total number of matching documents; with no POST body it simply returns the total document count of the index.
Search: given a POSTed JSON body or GET arguments it returns the matching documents. This is the most important part. The commonly used search APIs follow:
Query:
Once you use search you must supply at least the query parameter, and everything else builds on top of this query. The query parameter falls into three groups (a curl sketch follows below):
"match_all": { } requests everything;
"term"/"text"/"prefix"/"wildcard": {"key": "value"} searches by string (exact equality / fragment / prefix / wildcard);
"range": {"@timestamp": {"from": "now-1d", "to": "now"}} searches by range; if the field type is a date format you can use the built-in now for the current time and go back with -1d/h/m/s.
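For example, a range query over the last day might look like this (the index and field names are placeholders):
curl -XPOST 'http://localhost:9200/logstash-2013.01.01/_search?pretty=true' -d '{
  "query": {
    "range": { "@timestamp": { "from": "now-1d", "to": "now" } }
  }
}'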
Filter:
The query parameters above also exist as filters. In addition, the important parameters are the connectives (a combined sketch follows below):
"or"/"and": [{"range": {}}, {"prefix": ""}] combines several filters as a union or an intersection;
"not"/"limit": {} negation and limiting the number processed. Note that this limit differs from MySQL's: it limits how many documents are processed on each shard. With 5 shards, the effective limit for the whole index is 5 times the configured value.
Note: filter results are not cached by default; if a filter is used often, set "_cache": true.
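A sketch of combining a query with cached filters; in this version of the DSL that is usually written as a filtered query (index and field names are placeholders):
curl -XPOST 'http://localhost:9200/logstash-2013.01.01/_search' -d '{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "and": [
          { "range": { "@timestamp": { "from": "now-1h", "to": "now" }, "_cache": true } },
          { "prefix": { "path": "/api/" } }
        ]
      }
    }
  }
}'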
Facets:
The facets interface returns statistics computed over the query results; the most basic kinds are terms and statistical. For log analysis, however, the most used are (see the sketch below):
"histogram": {"key_field": "", "value_field": "", "interval": ""} returns bucketed, histogram-style statistics per time interval;
"terms_stats": {"key_field": "", "value_field": ""} returns statistics of the value field grouped by the key field, similar to a group by.
This is why the mapping earlier sets a type for every field: the key_field of a histogram must be in dateOptionalTime format and its value_field must be a string, while the key_field of terms_stats must be a string and its value_field must be numeric.
As for HTTP codes such as 200/304/400/503, they look numeric, but what we need is their count, not their average, so they must not be dynamically detected by ES as long; they have to be declared as string.
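A sketch of the two facet types above (index and field names are placeholders; for a date key the date_histogram variant of the histogram facet is the usual choice):
curl -XPOST 'http://localhost:9200/logstash-2013.01.01/_search' -d '{
  "query": { "match_all": {} },
  "facets": {
    "events_over_time": {
      "date_histogram": { "field": "@timestamp", "interval": "hour" }
    },
    "bytes_per_status": {
      "terms_stats": { "key_field": "status", "value_field": "bytes" }
    }
  }
}'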
#Core
【Multi Search】--The multi search API allows to execute several search requests within the same API. The endpoint for it is _msearch (available from 0.19 onwards).
The format of the request is similar to the bulk API format, and the structure is as follows (the structure is specifically optimized to reduce parsing if a specific search ends up redirected to another node):
header\n
body\n
header\n
body\n
The header part includes which index / indices to search on, optional (mapping) types to search on, the search_type, preference, and routing. The body includes the typical search body request (including the query, facets, from, size, and so on). Here is an example:
$ cat requests
{"index" : "test"}
{"query" : {"match_all" : {}}, "from" : 0, "size" : 10}
{"index" : "test", "search_type" : "count"}
{"query" : {"match_all" : {}}}
{}
{"query" : {"match_all" : {}}}
{"query" : {"match_all" : {}}}
{"search_type" : "count"}
{"query" : {"match_all" : {}}}
$ curl -XGET localhost:9200/_msearch --data-binary @requests; echo
Note, the above includes an example of an empty header (can also be just without any content) which is supported as well.
The response returns a responses array, which includes the search response for each search request matching its order in the original multi search request. If there was a complete failure for that specific search request, an object with error message will be returned in place of the actual search response.
The endpoint allows to also search against an index/indices and type/types in the URI itself, in which case it will be used as the default unless explicitly defined otherwise in the header. For example:
$ cat requests
{}
{"query" : {"match_all" : {}}, "from" : 0, "size" : 10}
{}
{"query" : {"match_all" : {}}}
{"index" : "test2"}
{"query" : {"match_all" : {}}}
$ curl -XGET localhost:9200/test/_msearch --data-binary @requests; echo
The above will execute the search against the test index for all the requests that don’t define an index, and the last one will be executed against the test2 index.
The search_type can be set in a similar manner to globally apply to all search requests.
【Percolate】--The percolator allows to register queries against an index, and then send percolate requests which include a doc, and getting back the queries that match on that doc out of the set of registered queries.
Think of it as the reverse operation of indexing and then searching. Instead of sending docs, indexing them, and then running queries. One sends queries, registers them, and then sends docs and finds out which queries match that doc.
As an example, a user can register an interest (a query) on all tweets that contain the word “elasticsearch”. For every tweet, one can percolate the tweet against all registered user queries, and find out which ones matched.
Here is a quick sample, first, lets create a test index:
curl -XPUT localhost:9200/test
Next, we will register a percolator query with a specific name called kuku against the test index:
curl -XPUT localhost:9200/_percolator/test/kuku -d '{
"query" : {
"term" : {
"field1" : "value1"
}
}
}'
And now, we can percolate a document and see which queries match on it (note, its not really indexed!):
curl -XGET localhost:9200/test/type1/_percolate -d '{
"doc" : {
"field1" : "value1"
}
}'
And the matches are part of the response:
{"ok":true, "matches":["kuku"]}
You can unregister the previous percolator query with the same API you use to delete any document in an index:
curl -XDELETE localhost:9200/_percolator/test/kuku
【Bulk】--The bulk API makes it possible to perform many index/delete operations in a single API call. This can greatly increase the indexing speed. The REST API endpoint is /_bulk.
The endpoints are /_bulk, /{index}/_bulk, and {index}/type/_bulk. When the index or the index/type are provided, they will be used by default on bulk items that don't provide them explicitly.
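A small sketch of the newline-delimited bulk format (index, type and documents are placeholders); use --data-binary so the newlines are preserved, and make sure the file ends with a newline:
$ cat requests
{"index" : {"_index" : "test", "_type" : "type1", "_id" : "1"}}
{"field1" : "value1"}
{"delete" : {"_index" : "test", "_type" : "type1", "_id" : "2"}}
$ curl -XPOST localhost:9200/_bulk --data-binary @requests; echo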
【Bulk Udp】--A Bulk UDP service is a service listening over UDP for bulk format requests. The idea is to provide a low latency UDP service that allows to easily index data that is not of critical nature.
【Count】--The count API allows to easily execute a query and get the number of matches for that query. It can be executed across one or more indices and across one or more types. The query can either be provided using a simple query string as a parameter, or using the Query DSL defined within the request body.
Here is an example:
$ curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=user:kimchy'
$ curl -XGET 'http://localhost:9200/twitter/tweet/_count' -d '{"term" : { "user" : "kimchy" }}'
【Delete By Query】--The delete by query API allows to delete documents from one or more indices and one or more types based on a query. The query can either be provided using a simple query string as a parameter, or using the Query DSL defined within the request body.
Here is an example:
$ curl -XDELETE 'http://localhost:9200/twitter/tweet/_query?q=user:kimchy'
$ curl -XDELETE 'http://localhost:9200/twitter/tweet/_query' -d '{"term" : { "user" : "kimchy" }}'
we can delete all documents across all types within the twitter index:
$ curl -XDELETE 'http://localhost:9200/twitter/_query?q=user:kimchy'
We can also delete within specific types:
$ curl -XDELETE 'http://localhost:9200/twitter/tweet,user/_query?q=user:kimchy'
We can also delete all tweets with a certain tag across several indices (for example, when each user has his own index):
$ curl -XDELETE 'http://localhost:9200/kimchy,elasticsearch/_query?q=tag:wow'
Or even delete across all indices:
$ curl -XDELETE 'http://localhost:9200/_all/_query?q=tag:wow'
Request Parameters
When executing a delete by query using the query parameter q, the query passed is a query string using Lucene query parser. There are additional parameters that can be passed:
Name | Description |
---|---|
df | The default field to use when no field prefix is defined within the query. |
analyzer | The analyzer name to be used when analyzing the query string. |
default_operator | The default operator to be used, can be AND or OR . Defaults to OR . |
Request Body
The delete by query can use the Query DSL within its body in order to express the query that should be executed and delete all documents. The body content can also be passed as a REST parameter named source.
Distributed
The delete by query API is broadcast across all primary shards, and from there, replicated across all shards replicas.
Routing
The routing value (a comma separated list of the routing values) can be specified to control which shards the delete by query request will be executed on.
Replication Type
The replication of the operation can be done in an asynchronous manner to the replicas (the operation will return once it has been executed on the primary shard). The replication parameter can be set to async (defaults to sync) in order to enable it.
Write Consistency
Control if the operation will be allowed to execute based on the number of active shards within that partition (replication group). The values allowed are one, quorum, and all. The parameter to set it is consistency, and it defaults to the node level setting of action.write_consistency which in turn defaults to quorum.
For example, in a N shards with 2 replicas index, there will have to be at least 2 active shards within the relevant partition (quorum) for the operation to succeed. In a N shards with 1 replica scenario, there will need to be a single shard active (in this case, one and quorum is the same).
【More Like This】--The more like this (mlt) API allows to get documents that are “like” a specified document.
Here is an example:
$ curl -XGET 'http://localhost:9200/twitter/tweet/1/_mlt?mlt_fields=tag,content&min_doc_freq=1'
【Validate】--The validate API allows a user to validate a potentially expensive query without executing it.
Here is an example:
When the query is valid, the response contains valid:true:
curl -XGET 'http://localhost:9200/timeindex/_validate/query?q=tune:2013'
{"valid":true,"_shards":{"total":1,"successful":1,"failed":0}}
【explain】--The explain api computes a score explanation for a query and a specific document. This can give useful feedback whether a document matches or didn't match a specific query. This feature is available from version 0.19.9 and up.
Here is an example:
$ curl -XGET 'http://localhost:9200/twitter/tweet/1/_explain?q=message:search'
All parameters:
fields: Allows to control which fields to return as part of the document explained (supports _source for the full document). Note, this feature is available since 0.20.
routing: Controls the routing in the case the routing was used during indexing.
parent: Same effect as setting the routing parameter.
preference: Controls on which shard the explain is executed.
source: Allows the data of the request to be put in the query string of the url.
q: The query string (maps to the query_string query).
df: The default field to use when no field prefix is defined within the query. Defaults to the _all field.
analyzer: The analyzer name to be used when analyzing the query string. Defaults to the analyzer of the _all field.
analyze_wildcard: Should wildcard and prefix queries be analyzed or not. Defaults to false.
lowercase_expanded_terms: Should terms be automatically lowercased or not. Defaults to true.
lenient: If set to true will cause format based failures (like providing text to a numeric field) to be ignored. Defaults to false.
default_operator: The default operator to be used, can be AND or OR. Defaults to OR.
#Indices
【Analyze】--Performs the analysis process on a text and return the tokens breakdown of the text.
Can be used without specifying an index against one of the many built in analyzers:
curl -XGET 'localhost:9200/_analyze?analyzer=standard' -d 'this is a test'
Or by building a custom transient analyzer out of tokenizers and filters:
curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase' -d 'this is a test'
It can also run against a specific index:
curl -XGET 'localhost:9200/test/_analyze?text=this+is+a+test'
The above will run an analysis on the "this is a test" text, using the default index analyzer associated with the test index. An analyzer can also be provided to use a different analyzer:
curl -XGET 'localhost:9200/test/_analyze?analyzer=whitespace' -d 'this is a test'
Also, the analyzer can be derived based on a field mapping, for example:
curl -XGET 'localhost:9200/test/_analyze?field=obj1.field1' -d 'this is a test'
Will cause the analysis to happen based on the analyzer configured in the mapping for obj1.field1 (and if not, the default index analyzer).
Also, the text can be provided as part of the request body, and not as a parameter.
Format
By default, the tokens are returned in the json format, called detailed. The text format value provides the analyzed data in a text stream that is a bit more readable.
【Create Index】--The create index API allows to instantiate an index. ElasticSearch provides support for multiple indices, including executing operations across several indices. Each index created can have specific settings associated with it.
$ curl -XPUT 'http://localhost:9200/twitter/'
$ curl -XPUT 'http://localhost:9200/twitter/' -d '
index :
number_of_shards : 3
number_of_replicas : 2
'
【Delete Index】--The delete index API allows to delete an existing index.
$ curl -XDELETE 'http://localhost:9200/twitter/'
The above example deletes an index called twitter.
The delete index API can also be applied to more than one index, or on _all indices (be careful!). All indices will also be deleted when no specific index is provided. In order to disable allowing to delete all indices, set the action.disable_delete_all_indices setting in the config to true.
【Open/Close Index】--The open and close index APIs allow to close an index, and later on opening it. A closed index has almost no overhead on the cluster (except for maintaining its metadata), and is blocked for read/write operations. A closed index can be opened which will then go through the normal recovery process.
The REST endpoint is /{index}/_close and /{index}/_open. For example:
curl -XPOST 'localhost:9200/my_index/_close'
curl -XPOST 'localhost:9200/my_index/_open'
【Get Settings】--The get settings API allows to retrieve settings of index/indices:
$ curl -XGET 'http://localhost:9200/twitter/_settings'
$ curl -XPUT 'http://localhost:9200/twitter/' -d '{
"settings" : {
"index" : {
"number_of_shards" : 3,
"number_of_replicas" : 2
}
}
}'
【Get Mapping】--The get mapping API allows to retrieve mapping definition of index or index/type.
$ curl -XGET 'http://localhost:9200/twitter/tweet/_mapping'
Multiple Indices and Types
The get mapping API can be used to get more than one index or type mapping with a single call. General usage of the API follows the following syntax: host:port/{index}/{type}/_mapping where both {index} and {type} can stand for comma-separated list of names. To get mappings for all indices you can use _all for {index}. The following are some examples:
$ curl -XGET 'http://localhost:9200/twitter,kimchy/_mapping'
$ curl -XGET 'http://localhost:9200/_all/tweet,book/_mapping'
If you want to get mappings of all indices and types then the following two examples are equivalent:
$ curl -XGET 'http://localhost:9200/_all/_mapping'
$ curl -XGET 'http://localhost:9200/_mapping'
【Put Mapping】--The put mapping API allows to register specific mapping definition for a specific type.
$ curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '
{
"tweet" : {
"properties" : {
"message" : {"type" : "string", "store" : "yes"}
}
}
}'
The above example creates a mapping called tweet within the twitter index. The mapping simply defines that the message field should be stored (by default, fields are not stored, just indexed) so we can retrieve it later on using selective loading.
More information on how to define type mappings can be found in the mapping section.
Merging & Conflicts
When an existing mapping already exists under the given type, the two mapping definitions, the one already defined, and the new ones are merged. The ignore_conflicts parameter can be used to control if conflicts should be ignored or not; by default, it is set to false, which means conflicts are not ignored.
The definition of conflict is really dependent on the type merged, but in general, if a different core type is defined, it is considered as a conflict. New mapping definitions can be added to object types, and core type mapping can be upgraded to multi_field type.
Multi Index
The put mapping API can be applied to more than one index with a single call, or even on _all the indices.
$ curl -XPUT 'http://localhost:9200/kimchy,elasticsearch/tweet/_mapping' -d '
{
"tweet" : {
"properties" : {
"message" : {"type" : "string", "store" : "yes"}
}
}
}'
【Delete Mapping】--Allow to delete a mapping (type) along with its data. The REST endpoint is /{index}/{type} with DELETE method.
Note, most times it makes more sense to reindex the data into a fresh index rather than deleting large chunks of it.
【Refresh】--The refresh API allows to explicitly refresh one or more index, making all operations performed since the last refresh available for search. The (near) real-time capabilities depend on the index engine used. For example, the robin one requires refresh to be called, but by default a refresh is scheduled periodically.
$ curl -XPOST 'http://localhost:9200/twitter/_refresh'
Multi Index
The refresh API can be applied to more than one index with a single call, or even on _all the indices.
$ curl -XPOST 'http://localhost:9200/kimchy,elasticsearch/_refresh'
$ curl -XPOST 'http://localhost:9200/_refresh'
【Optimize】--The optimize API allows to optimize one or more indices through an API. The optimize process basically optimizes the index for faster search operations (and relates to the number of segments a lucene index holds within each shard). The optimize operation allows to reduce the number of segments by merging them.
$ curl -XPOST 'http://localhost:9200/twitter/_optimize'
Request Parameters
The optimize API accepts the following request parameters:
Name | Description |
---|---|
max_num_segments | The number of segments to optimize to. To fully optimize the index, set it to 1 . Defaults to simply checking if a merge needs to execute, and if so, executes it. |
only_expunge_deletes | Should the optimize process only expunge segments with deletes in it. In Lucene, a document is not deleted from a segment, just marked as deleted. During a merge process of segments, a new segment is created that does not have those deletes. This flag allow to only merge segments that have deletes. Defaults to false . |
refresh | Should a refresh be performed after the optimize. Defaults to true . |
flush | Should a flush be performed after the optimize. Defaults to true . |
wait_for_merge | Should the request wait for the merge to end. Defaults to true . Note, a merge can potentially be a very heavy operation, so it might make sense to run it set to false . |
Multi Index
The optimize API can be applied to more than one index with a single call, or even on _all the indices.
$ curl -XPOST 'http://localhost:9200/kimchy,elasticsearch/_optimize'
$ curl -XPOST 'http://localhost:9200/_optimize'
【Flush】(Persisting data: during indexing, data is first written to memory and recorded in the transaction log; a flush then writes it to the index storage on disk, clears the transaction log and frees memory, making it durable. If power is lost before a flush, the data is recovered from the translog on restart.)--The flush API allows to flush one or more indices through an API. The flush process of an index basically frees memory from the index by flushing data to the index storage and clearing the internal transaction log. By default, ElasticSearch uses memory heuristics in order to automatically trigger flush operations as required in order to clear memory.
$ curl -XPOST 'http://localhost:9200/twitter/_flush'
Request Parameters
The flush API accepts the following request parameters:
Name | Description |
---|---|
refresh | Should a refresh be performed after the flush. Defaults to false . |
Multi Index
The flush API can be applied to more than one index with a single call, or even on _all the indices.
$ curl -XPOST 'http://localhost:9200/kimchy,elasticsearch/_flush'
$ curl -XPOST 'http://localhost:9200/_flush'
【Gateway Snapshot】--The gateway snapshot API allows to explicitly perform a snapshot through the gateway of one or more indices (backup them). By default, each index gateway periodically snapshot changes, though it can be disabled and be controlled completely through this API.
Note, this API only applies when using shared storage gateway implementation, and does not apply when using the (default) local gateway.
$ curl -XPOST 'http://localhost:9200/twitter/_gateway/snapshot'
Multi Index
The gateway snapshot API can be applied to more than one index with a single call, or even on _all the indices.
$ curl -XPOST 'http://localhost:9200/kimchy,elasticsearch/_gateway/snapshot'
$ curl -XPOST 'http://localhost:9200/_gateway/snapshot'
【Update Settings】--Change specific index level settings in real time.
The REST endpoint is /_settings (to update all indices) or {index}/_settings to update one (or more) indices settings. The body of the request includes the updated settings, for example:
{
"index" : {
"number_of_replicas" : 4
}
}
The above will change the number of replicas to 4 from the current number of replicas. Here is a curl example:
curl -XPUT 'localhost:9200/my_index/_settings' -d '
{
"index" : {
"number_of_replicas" : 4
}
}'
Below is the list of settings that can be changed using the update settings API:
Setting | Description |
---|---|
index.number_of_replicas | The number of replicas each shard has. |
index.auto_expand_replicas | Set to an actual value (like 0-all ) or false to disable it. |
index.blocks.read_only | Set to true to have the index read only. false to allow writes and metadata changes. |
index.blocks.read | Set to true to disable read operations against the index. |
index.blocks.write | Set to true to disable write operations against the index. |
index.blocks.metadata | Set to true to disable metadata operations against the index. |
index.refresh_interval | The async refresh interval of a shard. |
index.term_index_interval | The Lucene index term interval. Only applies to newly created docs. |
index.term_index_divisor | The Lucene reader term index divisor. |
index.translog.flush_threshold_ops | When to flush based on operations. |
index.translog.flush_threshold_size | When to flush based on translog (bytes) size. |
index.translog.flush_threshold_period | When to flush based on a period of not flushing. |
index.translog.disable_flush | Disables flushing. Note, should be set for a short interval and then enabled. |
index.cache.filter.max_size | The maximum size of filter cache (per segment in shard). Set to -1 to disable. |
index.cache.filter.expire | The expire after access time for filter cache. Set to -1 to disable. |
index.gateway.snapshot_interval | The gateway snapshot interval (only applies to shared gateways). |
merge policy | All the settings for the merge policy currently configured. A different merge policy can’t be set. |
index.routing.allocation.include.* | A node matching any rule will be allowed to host shards from the index. |
index.routing.allocation.exclude.* | A node matching any rule will NOT be allowed to host shards from the index. |
index.routing.allocation.require.* | Only nodes matching all rules will be allowed to host shards from the index. |
index.routing.allocation.total_shards_per_node | Controls the total number of shards allowed to be allocated on a single node. Defaults to unbounded (-1 ). |
index.recovery.initial_shards | When using local gateway a particular shard is recovered only if there can be allocated quorum shards in the cluster. It can be set to quorum (default), quorum-1 (or half ), full and full-1 . Number values are also supported, e.g. 1 . |
index.gc_deletes | |
index.ttl.disable_purge | Disables temporarily the purge of expired docs. |
Bulk Indexing Usage
For example, the update settings API can be used to dynamically change the index from being more performant for bulk indexing, and then move it to more real time indexing state. Before the bulk indexing is started, use:
curl -XPUT localhost:9200/test/_settings -d '{
"index" : {
"refresh_interval" : "-1"
}
}'
(Another optimization option is to start the index without any replicas, and only later adding them, but that really depends on the use case).
Then, once bulk indexing is done, the settings can be updated (back to the defaults for example):
curl -XPUT localhost:9200/test/_settings -d '{
"index" : {
"refresh_interval" : "1s"
}
}'
And, an optimize should be called:
curl -XPOST 'http://localhost:9200/test/_optimize?max_num_segments=5'
Updating Index Analysis
It is also possible to define a new analysis for the index. But it is required to close the index first and open it after the changes are made.
For example if content analyzer hasn’t been defined on myindex yet you can use the following commands to add it:
curl -XPOST 'localhost:9200/myindex/_close'
curl -XPUT 'localhost:9200/myindex/_settings' -d '{
"analysis" : {
"analyzer":{
"content":{
"type":"custom",
"tokenizer":"whitespace"
}
}
}
}'
curl -XPOST 'localhost:9200/myindex/_open'
【Templates】--Index templates allow to define templates that will automatically be applied to new indices created. The templates include both settings and mappings, and a simple pattern template that controls if the template will be applied to the index created. For example:
curl -XPUT localhost:9200/_template/template_1 -d '
{
"template" : "te*",
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"type1" : {
"_source" : { "enabled" : false }
}
}
}'
Defines a template named template_1, with a template pattern of te*. The settings and mappings will be applied to any index name that matches the te* template.
Deleting a Template
Index templates are identified by a name (in the above case template_1) and can be deleted as well:
curl -XDELETE localhost:9200/_template/template_1
GETting a Template
Index templates are identified by a name (in the above case template_1) and can be retrieved using the following:
curl -XGET localhost:9200/_template/template_1
To get list of all index templates you can use Cluster State API and check for the metadata/templates section of the response.
Multiple Template Matching
Multiple index templates can potentially match an index, in this case, both the settings and mappings are merged into the final configuration of the index. The order of the merging can be controlled using the order parameter, with lower order being applied first, and higher orders overriding them. For example:
curl -XPUT localhost:9200/_template/template_1 -d '
{
"template" : "*",
"order" : 0,
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"type1" : {
"_source" : { "enabled" : false }
}
}
}'
curl -XPUT localhost:9200/_template/template_2 -d '
{
"template" : "te*",
"order" : 1,
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"type1" : {
"_source" : { "enabled" : true }
}
}
}'
The above will disable storing the _source on all type1 types, but for indices that start with te*, source will still be enabled. Note, for mappings, the merging is "deep", meaning that specific object/property based mappings can easily be added/overridden on higher order templates, with lower order templates providing the basis.
Config
Index templates can also be placed within the config location (path.conf) under the templates directory (note, make sure to place them on all master eligible nodes). For example, a file called template_1.json can be placed under config/templates and it will be added if it matches an index. Here is a sample of the mentioned file:
{
"template_1" : {
"template" : "*",
"settings" : {
"index.number_of_shards" : 2
},
"mappings" : {
"_default_" : {
"_source" : {
"enabled" : false
}
},
"type1" : {
"_all" : {
"enabled" : false
}
}
}
}
}
---End API ---
---Start _mapping_ Stored Field Types ---
【Date Format】--ES provides the date/time formats below; because storage happens in Lucene shards, ES converts these types into long numbers when writing them to the Lucene shards.
The following table lists all the default ISO formats supported:
Name | Description |
---|---|
basic_date | A basic formatter for a full date as four digit year, two digit month of year, and two digit day of month (yyyyMMdd). |
basic_date_time | A basic formatter that combines a basic date and time, separated by a ‘T’ (yyyyMMdd’T’HHmmss.SSSZ). |
basic_date_time_no_millis | A basic formatter that combines a basic date and time without millis, separated by a ‘T’ (yyyyMMdd’T’HHmmssZ). |
basic_ordinal_date | A formatter for a full ordinal date, using a four digit year and three digit dayOfYear (yyyyDDD). |
basic_ordinal_date_time | A formatter for a full ordinal date and time, using a four digit year and three digit dayOfYear (yyyyDDD’T’HHmmss.SSSZ). |
basic_ordinal_date_time_no_millis | A formatter for a full ordinal date and time without millis, using a four digit year and three digit dayOfYear (yyyyDDD’T’HHmmssZ). |
basic_time | A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit millis, and time zone offset (HHmmss.SSSZ). |
basic_time_no_millis | A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset (HHmmssZ). |
basic_t_time | A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit millis, and time zone off set prefixed by ‘T’ (’T’HHmmss.SSSZ). |
basic_t_time_no_millis | A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset prefixed by ‘T’ (’T’HHmmssZ). |
basic_week_date | A basic formatter for a full date as four digit weekyear, two digit week of weekyear, and one digit day of week (xxxx’W’wwe). |
basic_week_date_time | A basic formatter that combines a basic weekyear date and time, separated by a ‘T’ (xxxx’W’wwe’T’HHmmss.SSSZ). |
basic_week_date_time_no_millis | A basic formatter that combines a basic weekyear date and time without millis, separated by a ‘T’ (xxxx’W’wwe’T’HHmmssZ). |
date | A formatter for a full date as four digit year, two digit month of year, and two digit day of month (yyyy-MM-dd). |
date_hour | A formatter that combines a full date and two digit hour of day. |
date_hour_minute | A formatter that combines a full date, two digit hour of day, and two digit minute of hour. |
date_hour_minute_second | A formatter that combines a full date, two digit hour of day, two digit minute of hour, and two digit second of minute. |
date_hour_minute_second_fraction | A formatter that combines a full date, two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second (yyyy-MM-dd’T’HH:mm:ss.SSS). |
date_hour_minute_second_millis | A formatter that combines a full date, two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second (yyyy-MM-dd’T’HH:mm:ss.SSS). |
date_optional_time | a generic ISO datetime parser where the date is mandatory and the time is optional. |
date_time | A formatter that combines a full date and time, separated by a ‘T’ (yyyy-MM-dd’T’HH:mm:ss.SSSZZ). |
date_time_no_millis | A formatter that combines a full date and time without millis, separated by a ‘T’ (yyyy-MM-dd’T’HH:mm:ssZZ). |
hour | A formatter for a two digit hour of day. |
hour_minute | A formatter for a two digit hour of day and two digit minute of hour. |
hour_minute_second | A formatter for a two digit hour of day, two digit minute of hour, and two digit second of minute. |
hour_minute_second_fraction | A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second (HH:mm:ss.SSS). |
hour_minute_second_millis | A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second (HH:mm:ss.SSS). |
ordinal_date | A formatter for a full ordinal date, using a four digit year and three digit dayOfYear (yyyy-DDD). |
ordinal_date_time | A formatter for a full ordinal date and time, using a four digit year and three digit dayOfYear (yyyy-DDD’T’HH:mm:ss.SSSZZ). |
ordinal_date_time_no_millis | A formatter for a full ordinal date and time without millis, using a four digit year and three digit dayOfYear (yyyy-DDD’T’HH:mm:ssZZ). |
time | A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit fraction of second, and time zone offset (HH:mm:ss.SSSZZ). |
time_no_millis | A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset (HH:mm:ssZZ). |
t_time | A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit fraction of second, and time zone offset prefixed by ‘T’ (’T’HH:mm:ss.SSSZZ). |
t_time_no_millis | A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset prefixed by ‘T’ (’T’HH:mm:ssZZ). |
week_date | A formatter for a full date as four digit weekyear, two digit week of weekyear, and one digit day of week (xxxx-’W’ww-e). |
week_date_time | A formatter that combines a full weekyear date and time, separated by a ‘T’ (xxxx-’W’ww-e’T’HH:mm:ss.SSSZZ). |
weekDateTimeNoMillis | A formatter that combines a full weekyear date and time without millis, separated by a ‘T’ (xxxx-’W’ww-e’T’HH:mm:ssZZ). |
week_year | A formatter for a four digit weekyear. |
weekyearWeek | A formatter for a four digit weekyear and two digit week of weekyear. |
weekyearWeekDay | A formatter for a four digit weekyear, two digit week of weekyear, and one digit day of week. |
year | A formatter for a four digit year. |
year_month | A formatter for a four digit year and two digit month of year. |
year_month_day | A formatter for a four digit year, two digit month of year, and two digit day of month. |
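For example, a date field can be mapped with one of the formats above (the index, type and field names here match the operation examples later and are otherwise placeholders):
curl -XPUT 'http://localhost:9200/timeindex/timetype/_mapping' -d '{
  "timetype" : {
    "properties" : {
      "time" : { "type" : "date", "format" : "dateOptionalTime" }
    }
  }
}'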
---End _mapping_ Stored Field Types ---
---Start Operation Examples ---
【_mapping】(the ik analyzer is used for both indexing and querying by default)
PUT : http://localhost:9200/timeindex/timetype/_mapping
{
"news": {
"_source": {
"compress": true
},
"_all": {
"indexAnalyzer": "ik",
"searchAnalyzer": "ik"
},
"properties": {
"title": {
"type": "string",
"indexAnalyzer": "ik",
"searchAnalyzer": "ik",
"store": "yes"
},
"time": {
"type": "date",
"store": "no"
}
}
}
}
【Insert data】
Insert data into the index timeindex (database) and type timetype (table). With POST, if no id is given the engine generates one automatically; with PUT an id must be specified (see the curl sketch after the sample documents below).
{
"title": "时间测试短时间",
"time": "2013",
"url": "www.baidu.com"
}
{
"title": "时间测试2012短时间",
"time": "2012",
"url": "www.baidu.com/news"
}
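A sketch of both insert styles using the documents above (as noted, POST without an id lets Elasticsearch generate one, while PUT requires an explicit id):
curl -XPOST 'http://localhost:9200/timeindex/timetype/' -d '{"title": "时间测试短时间", "time": "2013", "url": "www.baidu.com"}'
curl -XPUT 'http://localhost:9200/timeindex/timetype/1' -d '{"title": "时间测试2012短时间", "time": "2012", "url": "www.baidu.com/news"}'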
【Update data (PUT)】
localhost:9200/films/md/1
{ ...(data)... }
【_search--querying stored data】
(Notes on GET, PUT, POST and DELETE)
GET: operates on the data addressed by the URL---query;
DELETE: operates on the data addressed by the URL---delete;
PUT: operates on the request body---insert, update;
POST: operates on the request body---query, insert, optimize.
1. Query across all indices and types (POST or GET): http://localhost:9200/_search?pretty=true
2. Query all types of one index (GET): http://localhost:9200/timeindex/_search?pretty=true
3. Query one index and one type (GET): http://localhost:9200/timeindex/timetype/_search?pretty=true
4. Query one index and one type (GET), returning only selected fields: http://localhost:9200/timeindex/timetype/_search?fields=time,url
5. Query by id (GET): http://localhost:9200/timeindex/timetype/1
6. Query by id (GET), returning only selected fields of that document: http://localhost:9200/timeindex/timetype/1?fields=time,url
7. Query with URL parameters (GET): http://localhost:9200/timeindex/timetype/_search?q=time:2012
8. Query with a JSON body (POST): http://localhost:9200/timeindex/timetype/_search
{
"query": {
"term": {
"time": "2013"
}
}
}
【_search--cluster and node status queries (GET)】
1. Cluster health:
server1:9200/_cluster/health?pretty=true
2. Cluster state:
server1:9200/_cluster/state?pretty=true
server1:9200/_cluster/state?filter_nodes=true&pretty=true
server1:9200/_cluster/state?filter_routing_table=true&pretty=true
server1:9200/_cluster/state?filter_metadata=true&pretty=true
server1:9200/_cluster/state?filter_blocks=true&pretty=true
server1:9200/_cluster/state?filter_indices=true&pretty=true
3. Cluster settings:
server1:9200/_cluster/settings?pretty=true
4. Node information:
server1:9200/_cluster/nodes?pretty=true
server1:9200/_cluster/nodes/nodeId1,nodeId2?pretty=true
server1:9200/_cluster/nodes/stats?pretty=true
server1:9200/_cluster/nodes/nodeId1,nodeId2/stats?pretty=true
5. Index mapping information:
server1:9200/timeindex/_mapping?pretty=true
【Optimize an index (POST)】
localhost:9200/timeindex/_optimize
---End Operation Examples ---
---Start nutch2.1 + mysql + elasticsearch integration, standalone Linux deployment ---
This part covers integrating nutch 2.1 with mysql and elasticsearch, running on a single machine rather than as a distributed deployment.
1. Download nutch 2.1
nutch download URL: http://labs.mop.com/apache-mirror/nutch/2.1/apache-nutch-2.1-src.tar.gz
Unpack it after the download finishes.
2. Configure nutch to use mysql as the data store: edit ivy/ivy.xml under the nutch root directory and
uncomment this line: <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
Then edit conf/gora.properties under the nutch root directory and change the default storage configuration to the following:
# MySQL properties #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://host:3306/nutch?useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
gora.sqlstore.jdbc.user=user
gora.sqlstore.jdbc.password=password
3. Edit conf/nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>My Spider</value>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp,zh-cn,en-us,en-gb,en;q=0.7,*;q=0.3</value>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
</configuration>
4. Build the source with ant:
Run ant in the nutch root directory.
5. Set the sites to crawl:
cd <nutch root>/runtime/local
Create the urls directory:
mkdir -p urls
Create the initial seed list:
echo 'http://www.web.com/' > urls/seed.txt
6. Create the database and table:
CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
CREATE TABLE `webpage` (`id` varchar (767) CHARACTER SET latin1 NOT NULL ,
`headers` blob,
`text` mediumtext DEFAULT NULL ,
`status` int (11) DEFAULT NULL ,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint (20) DEFAULT NULL ,
`score` float DEFAULT NULL ,
`typ` varchar (32) CHARACTER SET latin1 DEFAULT NULL ,
`baseUrl` varchar (512) CHARACTER SET latin1 DEFAULT NULL ,
`content` mediumblob,
`title` varchar (2048) DEFAULT NULL ,
`reprUrl` varchar (512) CHARACTER SET latin1 DEFAULT NULL ,
`fetchInterval` int (11) DEFAULT NULL ,
`prevFetchTime` bigint (20) DEFAULT NULL ,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint (20) DEFAULT NULL ,
`retriesSinceFetch` int (11) DEFAULT NULL ,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
7. Run the crawl:
bin/nutch crawl urls -depth 3
When it finishes, the crawled content can be inspected in mysql.
8. Build the index:
bin/nutch elasticindex clustername -all
Note: nutch 2.1 talks to the ES cluster by starting an ES node that joins the cluster, so only the cluster name is needed and it has to be on the same LAN; this does not work for ES clusters with multicast discovery disabled.
Afterwards the created index can be inspected via http://eshost:9200/index/_status.
Appendix: the elasticsearch web-page index mapping (using the ik analyzer):
{
"mappings": {
"properties":{
"anchor":{
"index":"not_analyzed",
"type":"string"
},
"boost":{
"type":"string"
},
"content":{
"analyzer":"ik",
"boost":2.0,
"type":"string"
},
"digest":{
"type":"string"
},
"host":{
"type":"string"
},
"id":{
"type":"string"
},
"site":{
"type":"string"
},
"title":{
"analyzer":"ik",
"boost":4.0,
"type":"string"
},
"tstamp":{
"type":"date",
"format":"dateOptionalTime"
},
"url":{
"type":"string"
}
}
}
}
---End nutch2.1 + mysql + elasticsearch integration, standalone Linux deployment ---
---Start Content-based recommendation with More Like This ---
Content-based recommendation usually means: given a document, recommend similar documents to the user. The Lucene API has an interface for computing document similarity called MoreLikeThis, and Elasticsearch wraps it; through Elasticsearch's More Like This query interface we can implement content-based recommendation very conveniently.
First look at an example query request in JSON:
{
"more_like_this" : {
"fields" : ["title", "content"],
"like_text" : "text like this one",
}
}
Where:
fields are the fields to match against; if omitted the default is the _all field,
like_text is the text to match.
In addition, the following options tune the result:
percent_terms_to_match: the percentage of terms that must match, default 0.3
min_term_freq: the minimum number of times a term must occur in a document; terms below this are ignored, default 2
max_query_terms: the maximum number of terms allowed in one query, default 25
stop_words: the stop words; they are ignored during matching
min_doc_freq: the minimum number of documents a term must occur in; terms below this are ignored, default unlimited
max_doc_freq: the maximum number of documents a term may occur in; terms above this are ignored, default unlimited
min_word_len: the minimum word length, default 0
max_word_len: the maximum word length, default unlimited
boost_terms: the term boost, default 1
boost: the query boost, default 1
analyzer: the analyzer to use, defaulting to the analyzer of the field
The following shows how to call this from the Java API. There are three ways, which are essentially the same thing wrapped to different degrees.
MoreLikeThisRequestBuilder mlt = new MoreLikeThisRequestBuilder(client, "indexName", "indexType", "id");
mlt.setField("title"); // the field to match
SearchResponse response = client.moreLikeThis(mlt.request()).actionGet();
This form searches for documents similar to the document with the given id. It is special in that it is invoked directly on the client. The other two build a Query and run a search:
MoreLikeThisQueryBuilder query = QueryBuilders.moreLikeThisQuery();
query.boost(1.0f).likeText("xxx").minTermFreq(10);
The boost and likeText methods here correspond exactly to the parameters above. The next form additionally takes the field to match as an argument; its parameters are the same as MoreLikeThisQueryBuilder's.
MoreLikeThisFieldQueryBuilder query = QueryBuilders.moreLikeThisFieldQuery("fieldName");
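Either query builder is then executed through the ordinary search API; a minimal sketch (the index name is a placeholder):
SearchResponse response = client.prepareSearch("indexName")
        .setQuery(query) // the MoreLikeThisQueryBuilder / MoreLikeThisFieldQueryBuilder built above
        .execute().actionGet();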
---End Content-based recommendation with More Like This ---
---Start The 5 shard query preference levels ---
elasticsearch can use the preference parameter to control which shards a query prefers; just add the preference parameter to the request URL, e.g.: http://ip:host/index/_search?preference=_primary
With the Java API this becomes: client.prepareSearch("index").setPreference("_primary").
By default ES offers 5 preference levels:
_primary: query only on the primary shards.
_primary_first: query on the primary shards first and fall back to replicas if a primary is not available (is down).
_local: prefer shards held by the local node; shards the local node does not have are queried on other nodes.
_only_node: query only on the node with the given id. If that node holds only part of the index's shards, only that part is searched, so the result may be incomplete. For example _only_node:123 queries only on the node whose id is 123.
Custom (string) value: a user-defined value referring to the cluster.routing.allocation.awareness.attributes setting. For example, if that is set to zone, then preference=zone searches on nodes with awareness.attributes=zone*, such as zone1 and zone2. See the article below for what this value does.
Although ES offers these 5 levels, they still do not cover my need: I want to query only on one or more specific nodes, e.g. if the shards on node1 and node2 together form a complete index, I would like to search only node1 and node2. The only way is to patch the source code, which is very easy.
First find the class org.elasticsearch.cluster.routing.operation.plain.PlainOperationRouting; ES obtains the shard information for a search through it. Its preferenceActiveShardIterator() method picks the shards according to the given conditions. Reading the source shows that the shards are chosen mainly by the preference parameter. If the parameter is not given, shards are picked at random. If the parameter starts with _shards, only the given shards are queried; note that this feature is not documented on the official site.
Then comes the handling of the 5 preference levels described above. To add a multi-node shard query we just imitate the single-node case (_only_node): after the block
if (preference.startsWith("_only_node:")) {
return indexShard.onlyNodeActiveShardsIt(preference.substring("_only_node:".length()));
}
add
if (preference.startsWith("_only_nodes:")) {
return indexShard.onlyNodesActiveShardsIt(preference.substring("_only_nodes:".length()));
}
The onlyNodesActiveShardsIt method does not exist in org.elasticsearch.cluster.routing.IndexShardRoutingTable, so we have to write it ourselves. Add:
/**
* Prefers execution on the provided nodes if applicable.
*/
public ShardIterator onlyNodesActiveShardsIt(String nodeIds) {
String[] ids = nodeIds.split(",");
ArrayList<ShardRouting> ordered = new ArrayList<ShardRouting>(shards.size());
// keep only the shards located on one of the requested nodes
for (int i = 0; i < shards.size(); i++) {
ShardRouting shardRouting = shards.get(i);
for(String nodeId:ids){
if (nodeId.equals(shardRouting.currentNodeId())) {
ordered.add(shardRouting);
}
}
}
return new PlainShardIterator(shardId, ordered);
}
Recompile the source and you are done. Adding ?preference=_only_nodes:node1id,node2id to a query restricts the search to node1 and node2.
---End The 5 shard query preference levels ---
---Start Keywords ---
【Puppetmaster】, 【Puppet】, 【Routing eg: curl -XGET 'http://localhost:9200/twitter/tweet/1?routing=kimchy'】, 【df】, 【facets】, 【min_doc_freq】, 【mongodb】
---End Keywords ---
---Start Operations ---
[Incident 1] Under high load, cache files were corrupted, data exchange between shards and their assigned nodes was lost, and some primary shards and their replicas were both lost.
[Incident 2] Unstable load and rapid re-election problems (the master being repeatedly elected and shut down, with configuration changes being rolled out continuously)
1. The exception tracking and monitoring system detected a burst of exceptions, coming from timeouts of code-search queries and from the background jobs that update the code-search index when new data is added.
2. Misconfiguration; large tuning potential.
---End Operations ---