14_ElasticSearch: using the most_fields strategy for cross-fields search
Published: 2019-05-22


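Overview

  • A cross-fields search looks for a single identifying value that is spread across multiple fields.
  • For example, a person is identified by a name and a building by its address. A name may be split across fields such as first_name and last_name; an address may be split across country, province, and city.
  • Searching for one such identifier across several fields, for example a person's name or an address, is a cross-fields search.
  • As a first attempt, most_fields looks like the better fit: best_fields favors the doc whose single best-matching field scores highest, whereas a cross-fields search is by nature not a single-field problem.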

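Problems:

  • most_fields only finds docs that match in as many fields as possible, not docs in which one field matches the query completely.
  • With most_fields, minimum_should_match cannot be used to cut off the long tail, that is, results that match only a small part of the query. The parameter is applied to each field separately, not to the query as a whole.
  • TF/IDF scoring: take the authors Peter Smith and Smith Williams. When searching for Peter Smith, Smith rarely appears in first_name, so the term has a low document frequency in that field and earns a high score. As a result, Smith Williams may rank above Peter Smith.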

Example

Add the fields author_first_name and author_last_name:

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : { "author_first_name" : "Peter", "author_last_name" : "Smith"} }
{ "update": { "_id": "2"} }
{ "doc" : { "author_first_name" : "Smith", "author_last_name" : "Williams"} }
{ "update": { "_id": "3"} }
{ "doc" : { "author_first_name" : "Jack", "author_last_name" : "Ma"} }
{ "update": { "_id": "4"} }
{ "doc" : { "author_first_name" : "Robbin", "author_last_name" : "Li"} }
{ "update": { "_id": "5"} }
{ "doc" : { "author_first_name" : "Tonny", "author_last_name" : "Peter Smith"} }
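
The bulk body is newline-delimited JSON: each action line and each doc line must sit on its own single line, and the body must end with a newline. As a minimal sketch, the Python helper below builds such a body from a dict of partial docs. The helper name build_bulk_update_body and the two-author sample are illustrative assumptions, not part of the original post, and the sketch only builds the string; it does not send it anywhere.

```python
import json


def build_bulk_update_body(updates):
    """Build a newline-delimited JSON body for bulk partial updates.

    `updates` maps a document _id to the partial doc to merge into it.
    (Hypothetical helper, written for illustration only.)
    """
    lines = []
    for doc_id, partial_doc in updates.items():
        # Action line: which document to update.
        lines.append(json.dumps({"update": {"_id": doc_id}}))
        # Payload line: the fields to merge into that document.
        lines.append(json.dumps({"doc": partial_doc}))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


authors = {
    "1": {"author_first_name": "Peter", "author_last_name": "Smith"},
    "2": {"author_first_name": "Smith", "author_last_name": "Williams"},
}
body = build_bulk_update_body(authors)
print(body, end="")
```

Generating the body this way avoids the most common mistake with this API, pretty-printed JSON that spans several lines.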

Query using most_fields:

GET /forum/article/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "most_fields",
      "fields": [ "author_first_name", "author_last_name" ]
    }
  }
}
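
A most_fields query is field-centric: it behaves roughly like a bool query with one match clause per field, whose scores are added together. The sketch below spells out that approximate equivalent as a Python dict, which also shows where minimum_should_match ends up. The function name expand_most_fields is a hypothetical helper for illustration; the expansion is an approximation of what Elasticsearch does internally, not its exact rewrite.

```python
import json


def expand_most_fields(query, fields, minimum_should_match=None):
    """Return a bool/should query roughly equivalent to a most_fields multi_match.

    (Hypothetical helper: an approximation for illustration.)
    """
    should = []
    for field in fields:
        clause = {"query": query}
        if minimum_should_match is not None:
            # The parameter lands inside each per-field match clause,
            # so it is evaluated field by field, never across fields.
            clause["minimum_should_match"] = minimum_should_match
        should.append({"match": {field: clause}})
    return {"query": {"bool": {"should": should}}}


expanded = expand_most_fields(
    "Peter Smith",
    ["author_first_name", "author_last_name"],
    minimum_should_match="100%",
)
print(json.dumps(expanded, indent=2))
```

With "100%", each clause requires both Peter and Smith in the same field. Doc 1 has Peter in one field and Smith in the other, so it would match neither clause, which is why this parameter cannot be used to trim the long tail for a cross-fields search.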

Query result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.6931472,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.6931472,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-01",
          "tag": [
            "java",
            "hadoop"
          ],
          "tag_cnt": 2,
          "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article",
          "sub_title": "learning more courses",
          "author_first_name": "Peter",
          "author_last_name": "Smith"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.51623213,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith"
        }
      }
    ]
  }
}
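
  • The ordering of the results is not what we want.
  • Doc 2 matches only because Smith, one term of the query Peter Smith, appears in its author_first_name, yet it gets the highest score. Why?
  • Because its IDF score is high. For the IDF to be high, the matched term (Smith) must occur in few docs, and in the author_first_name field Smith occurs only once.
  • For Peter Smith himself (doc 1), Smith is in author_last_name, but Smith occurs twice in author_last_name, so doc 1 gets a lower IDF score.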
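
To make the IDF effect concrete, here is a minimal Python sketch of the IDF term used by Lucene's BM25 similarity, ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of docs that have the field and n the number that contain the term. Several things here are assumptions rather than facts from the post: that the index uses BM25 (the default similarity since Elasticsearch 5.0), that the term-frequency factor is 1 (one occurrence, field length equal to the average), and the shard layouts named in the comments. IDF statistics are computed per shard by default, and the response above reports 5 shards.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """IDF term of Lucene's BM25: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical field statistics: 5 docs, term present in 1 doc vs. 2 docs.
rare_term_idf = bm25_idf(5, 1)    # e.g. Smith in author_first_name
common_term_idf = bm25_idf(5, 2)  # e.g. Smith in author_last_name
print(f"rare: {rare_term_idf:.4f}  common: {common_term_idf:.4f}")

# Assumed shard layout (not stated in the post): doc 2 shares its shard
# with one other doc, and doc 1 is alone on its shard.
doc2_score = bm25_idf(2, 1)      # one matching field
doc1_score = 2 * bm25_idf(1, 1)  # two matching fields, scores summed
print(f"doc 2: {doc2_score:.7f}  doc 1: {doc1_score:.7f}")
```

Under those assumptions the two values agree with the reported _score of doc 2 (0.6931472) and doc 1 (0.5753642) to about six decimal places. This suggests that with only five docs spread over five shards, shard-local statistics contribute to the ordering, on top of the per-field rarity described above.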


Reposted from: http://neonn.baihongyu.com/
