Elasticsearch: a script for reindexing data (also usable for data migration)


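Why reindex

Changing an analyzer and upgrading ES (for example across major versions) both call for reindexing the data, so reindexing is a feature worth understanding well.

Reference: https://www.daimajiaoliu.com/daima/4ed62ea791003fc (How to rebuild an index in Elasticsearch)

Preparing for reindexing

I use the hanlp analyzer; adjust the parameters below to your own needs.
Note: number_of_replicas is set to 0 (and refresh_interval to -1), which speeds up reindexing.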


PUT /new_index
{
  "settings": {
    "number_of_shards": 12,
    "number_of_replicas": 0,
    "refresh_interval": -1,
    "analysis": {
      "analyzer": {
        "caseSensitive": {
          "filter": "lowercase",
          "type": "custom",
          "tokenizer": "keyword",
          "ignore_above": 256
        },
        "my_hanlp_analyzer": {
          "filter": "lowercase",
          "tokenizer": "my_hanlp"
        }
      },
      "tokenizer": {
        "my_hanlp": {
          "type": "hanlp",
          "enable_stop_dictionary": true
        }
      }
    }
  }
}

Define the field mappings to match


PUT /new_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_hanlp_analyzer"
    },
    "summary": {
      "type": "text",
      "analyzer": "my_hanlp_analyzer"
    },
    "key_words": {
      "type": "text",
      "analyzer": "caseSensitive"
    },
    "content": {
      "type": "text",
      "analyzer": "my_hanlp_analyzer"
    },
    "id": {
      "type": "long"
    },
    "con_md5": {
      "type": "keyword"
    },
    "portrait": {
      "type": "keyword"
    },
    "url": {
      "type": "keyword"
    },
    "generate": {
      "type": "integer"
    },
    "contype": {
      "type": "integer"
    },
    "extra": {
      "type": "keyword"
    },
    "images": {
      "type": "keyword"
    },
    "codes": {
      "type": "text"
    }
  }
}

If the service has to keep running throughout, have it access the index through an alias rather than the index name.


POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "old_index", // the existing index
        "alias": "old_index_latest" // the alias the service uses
      }
    }
  ]
}

Reindexing scripts

Same-cluster version


#!/bin/bash
# Reindex OLD_INDEX into NEW_INDEX on the same cluster.

if [ -z "$1" ] || [ -z "$2" ]; then
  echo "Usage: ./reindex2.sh [OLD_INDEX] [NEW_INDEX] [LOCAL_HOST:LOCAL_PORT]"
  exit 1
fi

OLD_INDEX=$1
NEW_INDEX=$2
# Fall back to a local node when no host is given.
LOCAL_HOST=${3:-localhost:9200}

echo "---------------------------- NOTICE ----------------------------------"
echo "Create $NEW_INDEX (settings, mappings and any index template it needs)"
echo "before you start the re-indexing process"
echo "----------------------------------------------------------------------"
sleep 3

TOTAL_DOCS_REMOTE=$(curl --silent "http://$LOCAL_HOST/_cat/indices/$OLD_INDEX?h=docs.count")
echo "Attempting to re-index $OLD_INDEX ($TOTAL_DOCS_REMOTE docs total) into $NEW_INDEX..."
SECONDS=0
# refresh=true makes the copied documents visible to the count check below,
# which matters because new_index was created with refresh_interval -1.
curl -H "Content-Type: application/json" -XPOST "http://$LOCAL_HOST/_reindex?wait_for_completion=true&refresh=true&pretty=true" -d "{
  \"conflicts\": \"proceed\",
  \"source\": {
    \"index\": \"${OLD_INDEX}\"
  },
  \"dest\": {
    \"index\": \"${NEW_INDEX}\"
  }
}"

duration=$SECONDS

# Only ask for a document count if the destination index exists.
LOCAL_INDEX_EXISTS=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "http://$LOCAL_HOST/$NEW_INDEX")
if [ "$LOCAL_INDEX_EXISTS" == "200" ]; then
  TOTAL_DOCS_REINDEXED=$(curl --silent "http://$LOCAL_HOST/_cat/indices/$NEW_INDEX?h=docs.count")
else
  TOTAL_DOCS_REINDEXED=0
fi

echo "    Re-indexing results:"
echo "     -> Time taken: $((duration / 60)) minutes and $((duration % 60)) seconds"
echo "     -> Docs indexed: $TOTAL_DOCS_REINDEXED out of $TOTAL_DOCS_REMOTE"
echo ""

if [ "$TOTAL_DOCS_REMOTE" -ne "$TOTAL_DOCS_REINDEXED" ]; then
  echo " INCOMPLETE: $TOTAL_DOCS_REMOTE not equal to $TOTAL_DOCS_REINDEXED"
fi
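
The final check in the script compares two values that come straight from curl, so an empty or error response would make the numeric test itself fail. A minimal sketch of a stricter comparison follows; the helper names is_count and compare_counts are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helpers (not in the original script): compare two docs.count
# values as returned by _cat/indices without tripping on bad input.

# Succeeds only if the argument is a non-empty string of digits.
is_count() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Prints "complete", "incomplete" or "unknown" for a source/destination pair.
compare_counts() {
  local src dst
  # Strip whitespace that may surround the value in the response.
  src=$(printf '%s' "$1" | tr -d '[:space:]')
  dst=$(printf '%s' "$2" | tr -d '[:space:]')
  if ! is_count "$src" || ! is_count "$dst"; then
    echo "unknown"
  elif [ "$src" -eq "$dst" ]; then
    echo "complete"
  else
    echo "incomplete"
  fi
}

compare_counts "1200" "1200 "
compare_counts "1200" "1187"
compare_counts "" "1187"
```

Reporting "unknown" separately keeps a failed curl call from being mistaken for a short reindex.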

Usage:


./reindex2.sh old_index new_index 192.168.1.155:9200 > ./_reindex.log 2>&1 &
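
When sending a script's output to a log file, the order of the redirections decides whether error messages reach the file. The sketch below illustrates this with a stand-in function; emit and logged_lines are hypothetical names used only for the demonstration.

```shell
#!/bin/bash
# Stand-in for the reindex script: one line on stdout, one on stderr.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

# Prints how many lines end up in the log file for a given redirection order.
logged_lines() {
  local log count
  log=$(mktemp)
  case "$1" in
    # stdout goes to the file first, then stderr follows it: both lines land.
    file-first) emit > "$log" 2>&1 ;;
    # stderr is duplicated onto the old stdout (here a pipe that is discarded)
    # before stdout moves to the file: only the stdout line lands.
    dup-first)  emit 2>&1 > "$log" | cat > /dev/null ;;
  esac
  count=$(grep -c '' "$log")
  rm -f "$log"
  echo "$count"
}

echo "file-first: $(logged_lines file-first) line(s)"
echo "dup-first:  $(logged_lines dup-first) line(s)"
```

So `> file 2>&1` is the form that captures curl and shell errors in the log along with the normal output.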

Cross-cluster version:


#!/bin/bash
# Reindex every index matching INDEX_PATTERN from a remote cluster into the local one.

if [ -z "$1" ] || [ -z "$2" ]; then
  echo "Usage: ./reindex.sh [REMOTE_HOST:REMOTE_PORT] [INDEX_PATTERN] [LOCAL_HOST:LOCAL_PORT]"
  exit 1
fi

REMOTE_HOST=$1
PATTERN=$2
# Fall back to a local node when no host is given.
LOCAL_HOST=${3:-localhost:9200}

echo "---------------------------- NOTICE ----------------------------------"
echo "You must ensure you have the following setting in your local ES host's"
echo "elasticsearch.yml config (the one re-indexing to):"
echo "    reindex.remote.whitelist: $REMOTE_HOST"
echo "Also, if an index template is necessary for this data, you must create it"
echo "locally before you start the re-indexing process"
echo "----------------------------------------------------------------------"
sleep 3

INDICES=$(curl --silent "http://$REMOTE_HOST/_cat/indices/$PATTERN?h=index")
TOTAL_INCOMPLETE_INDICES=0
TOTAL_INDICES=0
TOTAL_DURATION=0
INCOMPLETE_INDICES=()

for INDEX in $INDICES; do

  TOTAL_DOCS_REMOTE=$(curl --silent "http://$REMOTE_HOST/_cat/indices/$INDEX?h=docs.count")
  echo "Attempting to re-index $INDEX ($TOTAL_DOCS_REMOTE docs total) from remote ES server..."
  SECONDS=0
  # refresh=true makes the copied documents visible to the count check below.
  curl -H "Content-Type: application/json" -XPOST "http://$LOCAL_HOST/_reindex?wait_for_completion=true&refresh=true&pretty=true" -d "{
    \"conflicts\": \"proceed\",
    \"source\": {
      \"remote\": {
        \"host\": \"http://$REMOTE_HOST\"
      },
      \"index\": \"${INDEX}\"
    },
    \"dest\": {
      \"index\": \"${INDEX}\"
    }
  }"

  duration=$SECONDS

  # Only ask for a document count if the index now exists locally.
  LOCAL_INDEX_EXISTS=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "http://$LOCAL_HOST/$INDEX")
  if [ "$LOCAL_INDEX_EXISTS" == "200" ]; then
    TOTAL_DOCS_REINDEXED=$(curl --silent "http://$LOCAL_HOST/_cat/indices/$INDEX?h=docs.count")
  else
    TOTAL_DOCS_REINDEXED=0
  fi

  echo "    Re-indexing results:"
  echo "     -> Time taken: $((duration / 60)) minutes and $((duration % 60)) seconds"
  echo "     -> Docs indexed: $TOTAL_DOCS_REINDEXED out of $TOTAL_DOCS_REMOTE"
  echo ""

  TOTAL_DURATION=$((TOTAL_DURATION + duration))

  # Remember every index whose local count differs from the remote count.
  if [ "$TOTAL_DOCS_REMOTE" -ne "$TOTAL_DOCS_REINDEXED" ]; then
    TOTAL_INCOMPLETE_INDICES=$((TOTAL_INCOMPLETE_INDICES + 1))
    INCOMPLETE_INDICES+=("$INDEX")
  fi

  TOTAL_INDICES=$((TOTAL_INDICES + 1))

done

echo "---------------------- STATS --------------------------"
echo "Total Duration of Re-Indexing Process: $((TOTAL_DURATION / 60))m $((TOTAL_DURATION % 60))s"
echo "Total Indices: $TOTAL_INDICES"
echo "Total Incomplete Re-Indexed Indices: $TOTAL_INCOMPLETE_INDICES"
if [ "$TOTAL_INCOMPLETE_INDICES" -ne "0" ]; then
  printf '%s\n' "${INCOMPLETE_INDICES[@]}"
fi
echo "-------------------------------------------------------"
echo ""
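
Because the script loops over every index the pattern matches, a broad pattern such as * can also pick up internal indices whose names start with a dot (for example the ones Kibana creates). A minimal sketch of a filter that could sit between the _cat/indices call and the loop; skip_system_indices is a hypothetical helper, and whether such indices should be skipped is an assumption about the migration, not something the original script does.

```shell
#!/bin/bash
# Hypothetical filter: reads index names on stdin, one per line, trims
# trailing whitespace, and drops empty lines and names starting with ".".
skip_system_indices() {
  # "|| true" keeps the exit status at 0 when every name is filtered out.
  sed 's/[[:space:]]*$//' | grep -v -e '^\.' -e '^$' || true
}

# Sample input standing in for the output of _cat/indices/<pattern>?h=index.
printf '%s\n' ".kibana_1" "logs-2021.09" "old_index" | skip_system_indices
```

In the script this would be used as INDICES=$(curl ... | skip_system_indices).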

Usage (this also works for migrating data):


./reindex.sh 192.168.1.155:9200 old_index 192.168.1.144:9200 > ./_reindex.log 2>&1 &

After reindexing

Repoint the alias


POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "new_index",
        "alias": "old_index_latest"
      }
    },
    {
      "remove": {
        "index": "old_index",
        "alias": "old_index_latest"
      }
    }
  ]
}
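
Both actions travel in a single _aliases request, so the alias moves from the old index to the new one in one atomic step. When the swap is scripted, the request body can be generated from three names; the sketch below assumes a hypothetical helper alias_swap_body and the host localhost:9200.

```shell
#!/bin/bash
# Hypothetical helper: prints the _aliases body that moves an alias from the
# old index to the new one. Arguments: old index, new index, alias.
alias_swap_body() {
  printf '{"actions":[{"add":{"index":"%s","alias":"%s"}},{"remove":{"index":"%s","alias":"%s"}}]}\n' \
    "$2" "$3" "$1" "$3"
}

alias_swap_body "old_index" "new_index" "old_index_latest"

# The body could then be sent with curl (host is an assumption):
#   curl -H "Content-Type: application/json" -XPOST \
#     "http://localhost:9200/_aliases" \
#     -d "$(alias_swap_body old_index new_index old_index_latest)"
```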

Delete the old index


DELETE /old_index

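My own shortcut with the alias: my indexing service can be stopped, so as the last step I give the new index an alias that matches the old index's name. This only works once old_index has been deleted, because an alias cannot share a name with an existing index.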


POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "new_index", // the new index
        "alias": "old_index" // the name of the old index
      }
    }
  ]
}

Restore the settings


PUT /new_index/_settings
{
  "index": {
    "number_of_replicas": 2,
    "refresh_interval": null
  }
}
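
The index therefore goes through two settings phases: no replicas and no refresh while loading, then replicas restored and refresh_interval reset to its default (null) once reindexing is done. A minimal sketch that generates either body; phase_settings and its phase names bulk and serve are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: prints the _settings body for one of two phases.
#   bulk  - no replicas, refresh disabled (fast loading)
#   serve - replicas restored, refresh_interval reset to the default
# For "serve", the second argument is the replica count (2 in the article).
phase_settings() {
  case "$1" in
    bulk)  printf '{"index":{"number_of_replicas":0,"refresh_interval":"-1"}}\n' ;;
    serve) printf '{"index":{"number_of_replicas":%d,"refresh_interval":null}}\n' "${2:-1}" ;;
    *)     echo "unknown phase: $1" >&2; return 1 ;;
  esac
}

phase_settings bulk
phase_settings serve 2
```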
