How to store the attributes of real estate properties

Time: 2015-10-24 07:31:06

Tags: elasticsearch

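I am new to Elasticsearch. I have some documents that can have attributes like these:

  1. Number of bathrooms
  2. Bedrooms
  3. Zip code
  4. Address

I want to store these attributes in a single field so that users can search with a phrase such as "3 beds at 97778(zip)".

I tried a single array field, e.g. [3 beds, 2 baths, 97778] and [7 beds, 3 baths, 97778], with a stop analyzer so that words like "at" and "in" are filtered out, but that does not seem to be the right approach, because the second doc scores higher than the first.

I also have a synonym analyzer, because if a user searches for "3 bd" it should return "3 beds".

Now my question: what is the best way to store the attributes? Here are some of my dummy documents.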

    {
        "Beds" : 3,
        "Bath" : 2,
        "Zip" : 97778,
        "Attributes" : ["3 beds","2 baths", "97778"]
    },
    {
        "Beds" : 7,
        "Bath" : 3,
        "Zip" : 97778,
        "Attributes" : ["7 beds", "3 baths", "97778"]
    }
    

Should I change this schema to

    {
        "Beds" : 7,
        "Bath" : 3,
        "Zip" : 97778,
        "Attributes" : {"bed" : "7", "bath" : "3", "zip" : "97778"}
    }
    

If so, how would I apply the synonym analyzer?

1 Answer:

Answer 0: (score: 3)

The first structure looks better to me. I created a simple index with these attributes on my local machine using Marvel:

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type":       "stop",
          "stopwords":  "_english_" 
        },
        "my_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        },
        "my_synonym": {
          "type": "synonym",
          "synonyms": [
            "bd => bed",
            "bt, baths, bth => bath"]
        },
        "my_shingle": {
          "type" : "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": false,
          "output_unigrams_if_no_shingles": true
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer":  "standard",
          "filter": [
            "my_possessive_stemmer",
            "lowercase",
            "my_stop",
            "my_synonym",
            "kstem",
            "my_shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "documents": {
      "properties": {
        "Beds": {
          "type": "integer"
        },
        "Bath": {
          "type": "integer"
        },
        "Zip": {
          "type": "integer"
        },
        "Attributes": {
          "type": "string",
          "analyzer": "my_english"
        }
      }
    }
  }
}

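This is a fairly standard English analyzer (I only removed the default stemmer, which I find too aggressive, and replaced it with kstem), plus your synonyms of course. I also added a shingle filter, which produces combinations of adjacent tokens, and that is exactly what we are looking for!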
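
To see what the analyzer actually emits, one option is the _analyze API. The request below is only a sketch: it assumes the test index defined above and an Elasticsearch version that still accepts analyzer and text as query parameters (the 1.x and 2.x releases did; newer releases expect them in a JSON request body instead). The sample text is an illustration, not something from the question.

```json
GET /test/_analyze?analyzer=my_english&text=3+bd+at+97778
```

The response lists each token with its position, so you can check whether the synonym mapping and the shingles line up with what was indexed.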

I have added your test data. Note that I doubled the keyword zip, in case the user searches for either zip 97778 or 97778 zip.

PUT /test/documents/1
{
  "Beds": 3,
  "Bath": 2,
  "Zip": 97778,
  "Attributes": ["3 beds", "2 baths", "zip 97778 zip"]
}

PUT /test/documents/2
{
  "Beds": 7,
  "Bath": 3,
  "Zip": 97778,
  "Attributes": ["7 beds", "3 baths", "zip 97778 zip"]
}

POST /test/documents/3
{
  "Attributes" : ["8310 prairie rose place", "md", "baltimore", "21208", "us", "3 bd", "3 bth", "1 pbh", "1 hbh", "cooktop", "dishwasher", "dryer", "garbage disposer", "ice maker", "microwave", "oven", "oven - double", "refrigerator", "washer", "appliances", "contemporary architecture", "ceiling fan(s)", "colling system", "brick", "basement", "forced air", "heating system", "3 floors", "2 parkings", "garage", "asphalt roof"]
}

POST /test/documents/4
{
  "Attributes" : ["8 winners circle", "md", "owings mills", "21117", "us", "2 bd", "1 bth", "dishwasher", "dryer", "garbage disposer", "microwave", "range", "refrigerator", "washer", "appliances", "traditional architecture", "new traditional architecture", "central a/c", "colling system", "vinyl siding", "heat pump", "heating system", "1 floors", "assigned", "unassigned", "unknown roof"]
}

Here is a simple match query:

POST /test/documents/_search
{
  "query": {
    "match": {
      "Attributes": {
        "query": "3 beds at 97778(zip)"
      }
    }
  }
}

It returns the data in the order you asked for:

{
  "_index" : "test",
  "_type" : "documents",
  "_id" : "1",
  "_score" : 0.020668881,
  "_source" : {
    "Beds" : 3,
    "Bath" : 2,
    "Zip" : 97778,
    "Attributes" : [
      "3 beds",
      "2 baths",
      "zip 97778 zip"
    ]
  }
},
{
  "_index" : "test",
  "_type" : "documents",
  "_id" : "2",
  "_score" : 0.004767749,
  "_source" : {
    "Beds" : 7,
    "Bath" : 3,
    "Zip" : 97778,
    "Attributes" : [
      "7 beds",
      "3 baths",
      "zip 97778 zip"
    ]
  }
},
{
  "_index" : "test",
  "_type" : "documents",
  "_id" : "3",
  "_score" : 0.0014899216,
  "_source" : {
    "Attributes" : [
      "8310 prairie rose place",
      "md",
      "baltimore",
      "21208",
      "us",
      "3 bd",
      "3 bth",
      "1 pbh",
      "1 hbh",
      "cooktop",
      "dishwasher",
      "dryer",
      "garbage disposer",
      "ice maker",
      "microwave",
      "oven",
      "oven - double",
      "refrigerator",
      "washer",
      "appliances",
      "contemporary architecture",
      "ceiling fan(s)",
      "colling system",
      "brick",
      "basement",
      "forced air",
      "heating system",
      "3 floors",
      "2 parkings",
      "garage",
      "asphalt roof"
    ]
  }
}

Now when I run this query:

POST /test/documents/_search
{
  "query": {
    "match": {
      "Attributes": {
        "query": "2 bd and 1 bth at md"
      }
    }
  }
}

it returns this result, which is correct:

{
  "_index" : "test",
  "_type" : "documents",
  "_id" : "4",
  "_score" : 0.0032357208,
  "_source" : {
    "Attributes" : [
      "8 winners circle",
      "md",
      "owings mills",
      "21117",
      "us",
      "2 bd",
      "1 bth",
      "dishwasher",
      "dryer",
      "garbage disposer",
      "microwave",
      "range",
      "refrigerator",
      "washer",
      "appliances",
      "traditional architecture",
      "new traditional architecture",
      "central a/c",
      "colling system",
      "vinyl siding",
      "heat pump",
      "heating system",
      "1 floors",
      "assigned",
      "unassigned",
      "unknown roof"
    ]
  }
}

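You said your results always get a score of 1. That suggests your query is not running correctly. My guess is that you are running it against the attributes field instead of Attributes; unfortunately, Elasticsearch field names are case sensitive.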
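
One way to check the field name is to look at the mapping Elasticsearch actually holds. A minimal sketch, assuming the test index from above:

```json
GET /test/_mapping
```

If the response lists Attributes but the query targets attributes, the query is addressing a field that does not exist in the index.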

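From the comments, you said you are using a term query. A term query looks for exact term matches, so it is the wrong tool for text data. Always use a match query when you search text data.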
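
For contrast, here is a sketch of the kind of term query that tends to fail on this index. It is not your actual query (that was not shown), only an illustration against the test index above.

```json
POST /test/documents/_search
{
  "query": {
    "term": {
      "Attributes": "3 beds"
    }
  }
}
```

A term query does not analyze its input, so it looks up the literal term 3 beds, while the indexed terms have already passed through my_english (lowercasing, synonyms, kstem, shingles). Swapping term for match, as in the queries above, sends the query text through the same analyzer first.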

Let me know if this helps.