分布式搜索-ElasticSearch

ElasticSearch的定义

ElasticSearch是一款强大的开源搜索引擎，可以帮助我们从海量数据中快速找到需要的内容。

ElasticSearch结合kibana、logstash、beats即elastic stack(ELK)，被广泛应用在日志数据分析、实时监控等领域。ElasticSearch是elastic stack的核心、负责存储、搜索、分析数据。

Lucene是一个Java语言的搜索引擎类库，具有易扩展、高性能（基于倒排索引）的优势，但是只限于Java语言开发、学习复杂、不支持水平扩展。

2010年、Shay Banon重写了Compass，取名为ElasticSearch，相比于Lucene,ElasticSearch具备下列优势：支持分布式，可水平扩展；提供Restful接口，可被任何语言调用。

总结

ElasticSearch是一个开源的分布式搜索引擎，可以用来实现搜索、日志统计、分析、系统监控等功能。

倒排索引

传统数据库（例如MySQL）采用正向索引，实例tb_goods中id插件索引：

ElasticSearch采用倒排索引：

文档（document）：每条数据就是一个文档
词条（term）：文档按照语义分成的词语

总结

什么是文档和词条

每一条数据是一个文档
对文档中的内容分词，得到词语就是词条

什么是正向索引

基于文档id创建索引。查询词条时必须先找到文档，而后判断是否包含词条

什么是倒排索引

对文档内容分词，对词条创建索引，并记录词条所在文档的信息。查询时根据词条查询文档id，而后获取到文档。

ElasticSearch与MySQL的概念对比

文档

ElasticSearch是面向文档存储的，可以是数据库中的一条商品数据、一个订单数据。

文档数据会被序列化为Json格式后存储在ElasticSearch中

索引

索引（index）:相同类型的文档的集合

映射（mapping）:索引中文档的字段约束信息，类似表的结构约束

概念对比

架构

MySQL：擅长事务类型操作，可以确保数据的安全和一致性

ElasticSearch：擅长海量数据的搜索、分析和计算

总结

索引：同类型文档的集合
文档：一个数据就是一个文档，在es中是JSON格式
字段：JSON文档中的字段
映射：索引中文档的约束，比如字段名称、类型

elasticsearch与数据库的关系

数据库负责事务类型操作
elasticsearch负责海量数据的搜索、分析和计算

部署

安装ElasticSearch

拉取ElasticSearch镜像

1	docker pull elasticsearch:7.17.3

启动ElasticSearch服务

可以使用ES_JAVA_OPTS设置占用内存大小

docker run -p 9200:9200 -p 9300:9300 \
--name elasticsearch \
-e "discovery.type=single-node" \
-e "cluster.name=elasticsearch" \
-e "ES_JAVA_OPTS=-Xms512m -Xmx1024m" \
-v /mydata/elasticsearch/plugins:/usr/share/elasticsearch/plugins \
-v /mydata/elasticsearch/data:/usr/share/elasticsearch/data \
-d elasticsearch:7.17.3

1	chmod 777 /mydata/elasticsearch/data/

安装中文分词器IKAnalyzer

注意下载与ElasticSearch对应的版本

下载地址：https://github.com/medcl/elasticsearch-analysis-ik/releases

下载完成后解压至Elasticsearch的/mydata/elasticsearch/plugins目录下

重新启动服务

1	docker restart elasticsearch

访问返回版本信息
http://ip:9200/

安装Kibana

拉取Kibana镜像

注意版本的一致性

1	docker pull kibana:7.17.3

启动Kibana服务

docker run --name kibana -p 5601:5601 \
--link elasticsearch:es \
-e "elasticsearch.hosts=http://es:9200" \
-d kibana:7.17.3

访问地址测试：http://ip:5601/

通过Dev Tools输入DSL语句操作ES

分词器

ElasticSearch创建倒排索引时需对文档进行分词，在搜索时，需要对用户输入内容分词。

但是默认的分词规则对中文处理并不友好，可以在Kibana中的DevTools中进行测试

IK分词器

处理中文分词，一般使用IK分词器

IK分词器包含两种模式：ik_smart-最少切分 ik_max_word-最细切分

前面已经安装过了IK分词器插件，下面进行测试：

IK分词器的拓展和停用字典

扩展词库

修改ik分词器config目录中的IkAnalyzer.cfg.xml文件，扩展ik分词库的词库

停用词库

修改ik分词器config目录中的IkAnalyzer.cfg.xml文件，禁用敏感词条

实践

添加dic文件

修改xml配置

添加词库内容

重启ElasticSearch服务

查看分词结果

总结

分词器的作用

创建倒排索引时，对文档进行分词
用户搜索时，对输入的内容进行分词

IK分词器的模式

ik_smart：智能切分，粗粒度
ik_max_word：最细切分，细粒度

IK分词器的拓展和停用词条

利用config目录的IkAnalyzer.cfg.xml文件添加拓展词典和停用词典
在词典中添加拓展词条和停用词条

索引库操作

mapping属性

mapping是对索引库中文档的约束，常见的mapping属性包括如下：

type：字段数据类型，常见的简单类型有：

字符串：text-可分词的文本 keyword-精确值（例如：品牌、国家、ip地址）

数值：long、integer、short、byte、double、float

布尔：boolean

日期：date

对象：object

index：是否创建索引，默认为true

analyzer：使用哪些分词器

properties：字段的子字段

总结

创建索引库

ElasticSearch中通过Restful请求操作索引库、文档。

请求内容用DSL语句来表示

创建索引库和mapping的DSL语法

# 创建索引库
PUT /test
{
  "mappings": {
    "properties": {
      "info": {
        "type": "text",
        "analyzer": "ik_smart"
       },
       "email": {
         "type": "keyword",
         "index": false
       },
       "name": {
         "type": "object",
         "properties": {
           "firstName": {
             "type": "keyword"
           },
            "lastName": {
             "type": "keyword"
           }
         }
       }
    }
  }
}

查询索引库

删除索引库

修改索引库

索引库和mapping一旦创建无法修改，但是可以添加新的字段

#  修改索引库 只可以添加mapping
PUT /test/_mapping
{
  "properties": {
    "age": {
      "type": "integer"
    }
  }
}

总结

文档操作

新增文档

# 新增文档
POST /test/_doc/1
{
  "info": "你好，世界",
  "email": "xx@qq.com",
  "name": {
    "firstName": "Yuan",
    "lastName": "JianWei"
  },
  "age": 20
}

查询文档

删除文档

修改文档

方式一：全量修改，删除旧文档、添加新文档

# 全量修改文档
PUT /test/_doc/1
{
  "info": "你好，世界",
  "email": "xx@123.com",
  "name": {
    "firstName": "Yuan",
    "lastName": "JianWei"
  },
  "age": 20
}

方式二：增量修改，修改指定字段值

总结

新增文档

1	POST /索引库名/_doc/文档id

查询文档

1	GET /索引库名/_doc/文档id

删除文档

1	DELETE /索引库名/_doc/文档id

全量修改

1	PUT /索引库名/_doc/文档id

增量修改

1	POST /索引库名/_update/文档id

RestClient操作索引库

ElasticSearch官方提供了各种不同语言的客户端，用来操作ElasticSearch。

这些客户端的本质是组装DSL语句，通过http请求发送给ElasticSearch服务器。

官网地址：https://www.elastic.co/guide/en/elasticsearch/client/index.html

数据结构分析

# hotel mapping
PUT /hotel
{
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "address": {
        "type": "text",
        "index": false
      },
      "price": {
        "type": "integer"
      },
      "score": {
        "type": "integer"
      },
      "brand": {
        "type": "keyword"
      },
      "city": {
        "type": "keyword"
      },
      "starName": {
        "type": "keyword"
      },
      "business": {
        "type": "keyword"
      },
      "location": {
        "type": "geo_point"
      },
      "pic": {
        "type": "keyword",
        "index": false
      }
    }
  }
}

ElasticSearch支持两种地理坐标的数据类型：

geo_point：由纬度（latitude）和经度（longitude）确定一个点

geo_shape：有多个geo_point组成的复杂几何图形

字段拷贝可以使用copy_to属性将当前字段拷贝到指定字段：

# hotel mapping
PUT /hotel
{
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "analyzer": "ik_max_word",
        "copy_to": "all"
      },
      "address": {
        "type": "text",
        "index": false
      },
      "price": {
        "type": "integer"
      },
      "score": {
        "type": "integer"
      },
      "brand": {
        "type": "keyword",
        "copy_to": "all"
      },
      "city": {
        "type": "keyword"
      },
      "starName": {
        "type": "keyword"
      },
      "business": {
        "type": "keyword",
        "copy_to": "all"
      },
      "location": {
        "type": "geo_point"
      },
      "pic": {
        "type": "keyword",
        "index": false
      },
      "all": {
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}

初始化RestClient

引入ElasticSearchRestHighLevelClient依赖

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.17.3</version>
</dependency>

覆盖SpringBoot默认的ElasticSearch版本

<properties>
    <java.version>1.8</java.version>
    <elasticsearch.version>7.17.3</elasticsearch.version>
</properties>

初始化RestHignLevelClient

@Slf4j
public class RestClientTest {
    private RestHighLevelClient restHighLevelClient;

    @BeforeEach
    void setUp() {
        this.restHighLevelClient = new RestHighLevelClient(RestClient.builder(
                HttpHost.create("http://1.117.34.49:5601")
        ));
    }

    @AfterEach
    void tearDown() throws IOException {
        this.restHighLevelClient.close();
    }

    @Test
    void testInit() {
        System.out.println(this.restHighLevelClient);
    }
}

创建索引库

@Test
void createProductIndex() throws IOException {
    // 创建CreateIndexRequest对象
    CreateIndexRequest createIndexRequest = new CreateIndexRequest("product");
    // 设置请求参数：PRODUCT_TEMPLATE-DSL语句 XContentType.JSON
    createIndexRequest.source(PRODUCT_TEMPLATE, XContentType.JSON);
    // 发送请求
    this.restHighLevelClient.indices().create(createIndexRequest, RequestOptions.DEFAULT);
}

删除索引库

@Test
void deleteProductIndex() throws IOException {
    // 创建DeleteIndexRequest对象
    DeleteIndexRequest deleteIndexRequest = new DeleteIndexRequest("product");
    // 发送请求
    this.restHighLevelClient.indices().delete(deleteIndexRequest, RequestOptions.DEFAULT);
}

判断索引库是否存在

@Test
void existsProductIndex() throws IOException {
    // 创建GetIndexRequest对象
    GetIndexRequest getIndexRequest = new GetIndexRequest("product");
    // 发送请求，接收结果
    boolean exists = this.restHighLevelClient.indices().exists(getIndexRequest, RequestOptions.DEFAULT);
    System.out.println(exists ? "索引库存在" : "索引库不存在");
}

总结

索引库操作的基本步骤

初始化RestHighLevelClient
创建IndexRequest
准备DSL语句
发送请求，调用restHighLevelClient.indices()的API

RestClient操作文档

初始化RestClient

@Slf4j
@RunWith(SpringRunner.class)
@SpringBootTest
public class RestDocumentTest {
    @Resource
    private PmsProductService pmsProductService;

    private RestHighLevelClient restHighLevelClient;

    @Before
    public void setUp() {
        // 初始化RestHighLevelClient
        this.restHighLevelClient = new RestHighLevelClient(RestClient.builder(
                HttpHost.create("http://1.117.34.49:9200")
        ));
    }

    @After
    public void tearDown() throws IOException {
        // 关闭客户端资源
        this.restHighLevelClient.close();
    }

    @Test
    public void testInit() {
        System.out.println(this.restHighLevelClient);
    }
}

新增文档

@Test
public void postIndexDocument() throws IOException {
    // 获取Json对象
    PmsProduct pmsProduct = pmsProductService.getPmsProductById(1L);
    EsProduct esProduct = new EsProduct();
    // 拷贝为es商品信息实体
    BeanUtils.copyProperties(pmsProduct, esProduct);
    // 序列化为JSON
    String source = JSONUtil.toJsonStr(esProduct);
    // 创建Request对象
    IndexRequest indexRequest = new IndexRequest("product").id(String.valueOf(esProduct.getId()));
    // 设置请求参数
    indexRequest.source(source, XContentType.JSON);
    // 发送请求
    this.restHighLevelClient.index(indexRequest, RequestOptions.DEFAULT);
}

查询文档

@Test
public void getIndexDocument() throws IOException{
    // 创建GetRequest对象
    GetRequest getRequest = new GetRequest("product", "1");
    // 发送请求
    GetResponse getResponse = this.restHighLevelClient.get(getRequest, RequestOptions.DEFAULT);
    // 获取source
    String sourceAsString = getResponse.getSourceAsString();
    // 将json放序列化为对象
    EsProduct esProduct = JSONUtil.toBean(sourceAsString, EsProduct.class);
    System.out.println(esProduct);
}

修改文档

修改文档数据有两种方式：

方式一：全量更新：写入id和之前一样的文档，就会删除旧文档，添加新文档

方式二：局部更新：只更新部分字段

@Test
public void updateIndexDocument() throws IOException {
    // 创建UpdateRequest对象
    UpdateRequest updateRequest = new UpdateRequest("product", "1");
    // 准备参数
    updateRequest.doc("price","128", "sale", "10");
    // 发送请求
    this.restHighLevelClient.update(updateRequest, RequestOptions.DEFAULT);
}

删除文档

@Test
public void deleteIndexDocument() throws IOException {
    // 创建DeleteRequest对象
    DeleteRequest deleteRequest = new DeleteRequest("product", "1");
    // 发送请求
    this.restHighLevelClient.delete(deleteRequest, RequestOptions.DEFAULT);
}

批量导入文档

@Test
public void bulkIndexDocument() throws IOException {
    BulkRequest bulkRequest = new BulkRequest();
    List<PmsProduct> pmsProducts = pmsProductService.getPmsProducts(new PmsProduct());
    pmsProducts.forEach(pmsProduct -> {
        // 对象赋值
        EsProduct esProduct = new EsProduct(pmsProduct);
        // 遍历添加准备参数
        bulkRequest.add(new IndexRequest("product")
                .id(String.valueOf(esProduct.getId()))
                .source(JSONUtil.toJsonStr(esProduct), XContentType.JSON));
    });
    this.restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
}

分布式搜索-ElasticSearch

ElasticSearch的定义

总结

倒排索引

总结

什么是文档和词条

什么是正向索引

什么是倒排索引

ElasticSearch与MySQL的概念对比

文档

索引

概念对比

架构

总结

部署

安装ElasticSearch

拉取ElasticSearch镜像

启动ElasticSearch服务

修改挂载数据目录的访问权限

安装中文分词器IKAnalyzer

重新启动服务

访问返回版本信息

安装Kibana

拉取Kibana镜像

启动Kibana服务

访问地址测试：http://ip:5601/

通过Dev Tools输入DSL语句操作ES

分词器

IK分词器

IK分词器的拓展和停用字典

扩展词库

停用词库

实践

添加dic文件

修改xml配置

添加词库内容

重启ElasticSearch服务

查看分词结果

总结

分词器的作用

IK分词器的模式

IK分词器的拓展和停用词条

索引库操作

mapping属性

总结

创建索引库

创建索引库和mapping的DSL语法

查询索引库

删除索引库

修改索引库

总结

文档操作

新增文档

查询文档

删除文档

修改文档

方式一：全量修改，删除旧文档、添加新文档

方式二：增量修改，修改指定字段值

总结

RestClient操作索引库

数据结构分析

初始化RestClient

引入ElasticSearchRestHighLevelClient依赖

覆盖SpringBoot默认的ElasticSearch版本

初始化RestHignLevelClient

创建索引库

删除索引库

判断索引库是否存在

总结

索引库操作的基本步骤

RestClient操作文档

初始化RestClient

新增文档

查询文档

修改文档

删除文档

批量导入文档

总结