* https://chatgpt.com/
* https://www.elastic.co/docs/manage-data/data-store/mapping/removal-of-mapping-types
* https://www.elastic.co/docs/manage-data/data-store/mapping
* https://www.elastic.co/blog/what-is-an-elasticsearch-index

## preface
* goals of this workshop
### mapping
* is the schema definition for the documents in an index
* document = collection of fields + data types
    * includes metadata fields
        * example
            * ` _index ` - index to which the document belongs
            * ` _id ` - document’s ID
            * ` _source ` - original JSON representing the body of the document
            * others
* example
    ```
    GET /your-index-name/_mapping

    {
      "people": {
        "mappings": {
          "_source": { // omitted by default
            "enabled": true
          },
          "_meta": { // custom metadata (ex.: for documentation or tooling)
            "version": "1.0",
            "description": "People index for sanction screening"
          },
          "properties": {
            "name": { "type": "text" },
            "birthdate": { "type": "date" },
            "country": { "type": "keyword" },
            "bio_vector": {
              "type": "dense_vector",
              "dims": 384
            }
          }
        }
      }
    }
    ```
* two types
    * dynamic mapping
        * automatically detects the data types of fields
            * might yield suboptimal results for specific use cases
        * default mappings: defined using dynamic templates
            * example: map `app_*.code` as keyword (`app_*.message` falls back to the default dynamic mapping for strings: text)
                ```
                PUT /logs
                {
                  "mappings": {
                    "dynamic_templates": [
                      {
                        "map_app_codes_as_keyword": {
                          "path_match": "app_*.code",
                          "mapping": {
                            "type": "keyword"
                          }
                        }
                      }
                    ]
                  }
                }
                ```
                will produce types
                ```
                "app_error.code": { "type": "keyword" }
                "app_warning.code": { "type": "keyword" }
                "app_error.message": { "type": "text" }
                ```
        * adds new fields automatically
            * use case: don’t know all the field names in advance
            * some data types cannot be automatically detected
                * example: `geo_point`, `geo_shape`
    * explicit mapping
        * used to have greater control over which fields are created
        * recommended for production use cases
        * can’t change mappings for fields that are already mapped
            * requires reindexing
            * sometimes adding multi-fields (index the same field in different ways) is an option, but old documents will not have them
                ```
                "city": {
                  "type": "text",
                  "fields": {
                    "raw": {
                      "type": "keyword"
                    }
                  }
                }
                ```
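        * a minimal sketch of creating an index with an explicit mapping up front (index name and fields are illustrative; `"dynamic": "strict"` additionally rejects documents that introduce unmapped fields)
            ```
            PUT /people
            {
              "mappings": {
                "dynamic": "strict",
                "properties": {
                  "name":      { "type": "text" },
                  "birthdate": { "type": "date" },
                  "country":   { "type": "keyword" }
                }
              }
            }
            ```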
    * mapping explosion
        * too many fields in an index can cause out-of-memory errors
            * can be caused by lack of control over dynamic mapping
                * example: every new document inserted introduces new fields
        * use the mapping limit settings to limit the number of field mappings
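            * a minimal sketch, assuming an index named `my-index` (`index.mapping.total_fields.limit` defaults to 1000)
                ```
                PUT /my-index/_settings
                {
                  "index.mapping.total_fields.limit": 2000
                }
                ```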

### indices
* logical namespace that holds a collection of documents
    * can be considered a table
    * logical abstraction over one or more Lucene indices (called shards)
* can be thought of as an optimized collection of documents
    * each indexed field has an optimized data structure
        * example
            * text fields -> inverted indices
            * numeric and geo fields -> BKD trees
* near-real-time search
    * searches do not run on the latest indexed data
        * indexing a doc ≠ instant search visibility
            * document can be retrieved by ID immediately
            * but a search query won’t return it until a refresh happens
    * point-in-time view of the index
        * multiple searches hit the same files and reuse the same caches
    * processes
        * indexing = storing
            * document is put in two places
                * in-memory buffer (Lucene memory buffer)
                * transaction log (called translog) on disk
                    * crash recovery log
                    * translog is not searchable
            * when
                * document is sent to Elasticsearch
            * after indexing
                * document is durable (even if the node crashes)
                * not yet searchable
        * refresh
            * makes newly indexed documents searchable
                * writes the in-memory buffer into a new Lucene segment
                    * segment files usually reside in the OS page cache (memory)
                    * aren’t guaranteed to be persisted until an fsync or flush
                        * in particular: files may never hit the actual disk
                        * Lucene will ignore them if there’s no updated `segments_N`
                            * => update is done during commit
                * opens a new searcher
                    * sees all committed segments
                    * sees any new segments created by a refresh
                    * does not see uncommitted in-memory data
                        * example: documents still in the in-memory buffer
            * every search request is handled by
                * grabbing the current active searcher
                * executing the query against that consistent view
                    * writes don’t interfere with ongoing searches
            * when
                * automatically every 1 second (default)
                * manually: `POST /my-index/_refresh`
            * after refresh
                * documents are searchable
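            * a minimal console sketch of the visibility gap, assuming an index named `my-index`
                ```
                PUT /my-index/_doc/1          // indexed: durable in the translog
                { "name": "john" }

                GET /my-index/_doc/1          // works immediately (get by ID is real-time)

                GET /my-index/_search         // may not return the doc yet

                POST /my-index/_refresh       // force a refresh

                GET /my-index/_search         // now returns the doc
                ```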
        * commit
            * it is not about search
                * does not affect search => searchers see segments based on refresh, not commit
            * uses fsync
                * the only way to guarantee that the operating system has actually written data to disk
            * pauses index writers briefly
                * to ensure that the commit reflects a consistent index state
            * clears the translog (since changes are now safely in the Lucene index)
            * each commit creates a new `segments_N` file with an incremented generation number `(N)`
                * represents the current state of the index
                    * lists all the active segments
                    * older `segments_N` files are effectively obsolete after a new one is committed
                * binary file
                    * textual example
                        ```
                        Segments:
                        ----------
                        Segment: _0
                          - Uses compound file: true
                          - Doc count: 1,000
                          - Deleted docs: 0
                          - Files:
                              _0.cfs
                              _0.cfe
                              _0.si
                          - Codec: Lucene90
                          - Segment created with Lucene 9.9.0
                        ```
                * Lucene reads this file on startup
                    * tells which `.cfs` segment files to load and use
                    * reads `segments.gen` to find the latest `segments_N` file
            * when
                * the memory buffer is full
                * time since the last flush
                * the transaction log hits a threshold
            * in particular: refreshing and committing are independent
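            * in Elasticsearch terms, a Lucene commit corresponds to a flush; it can also be triggered manually (sketch, assuming an index named `my-index`)
                ```
                POST /my-index/_flush
                ```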

### inverted indexing
* Lucene data structure where it keeps a list of where each word belongs
* example: like the index at the back of a book, listing words and the pages on which they appear
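* a tiny illustration (made-up documents): each term maps to a postings list of matching document IDs
    ```
    Doc 1: "quick brown fox"
    Doc 2: "brown dog"

    brown -> [1, 2]
    dog   -> [2]
    fox   -> [1]
    quick -> [1]
    ```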
* sent to Lucene to be indexed for the document
* make up the inverted index
* the query text undergoes the same analysis before the terms are looked up in the index

### node
* node is an instance of Elasticsearch
* multiple nodes can join the same cluster
* with a cluster of multiple nodes, the same data can be spread across multiple servers
    * helps performance: Elasticsearch has more resources to work with
    * helps reliability: if you have at least one replica per shard, any node can disappear and Elasticsearch will still serve you all the data
* for performance reasons, the nodes within a cluster need to be on the same network
    * balancing shards in a cluster across nodes in different data centers simply takes too long
    * cross-cluster replication (CCR)
        * allows you to replicate data from one cluster (leader cluster) to another (follower cluster)
        * example: across data centers, regions, or cloud availability zones
* roles
    * master
        * election: Raft-inspired
        * maintains the cluster state (node joins/leaves, index creation, shard allocation)
        * assigns shards to nodes
            * example: when a new index is created
            * based on node capabilities
    * data
        * stores the actual index data (primary and replica shards)
    * coordinating
        * maintains a local copy of the cluster state
            * only the master node updates the cluster state, but all nodes subscribe to it
        * routes client requests
            * hash of id % number_of_primary_shards => picks the target shard
        * returns the final result
            * example: merges responses to aggregate results
        * every node in Elasticsearch can act as a coordinating node
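* node roles can be inspected via the cat API (output shape is illustrative)
    ```
    GET /_cat/nodes?v

    ip        node.role   master name
    10.0.0.1  dim         *      node-1
    10.0.0.2  dim         -      node-2
    ```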

### shard
* is a Lucene index: a directory of files containing an inverted index
    * cannot be split or merged easily
* index is just a logical grouping of physical shards
    * each shard is actually a self-contained index
* example
    * the entire index will not fit on either of the nodes
        * we need some way of splitting the index
            * sharding comes to the rescue
* contains
    * segment files
    * metadata files (how to read, decode, and interpret the raw data files in a segment)
        * example: `.fnm` (field names and types)
    * commit files (which segments to load after a crash or restart)
        * `segments_N` (snapshot of all current segments)
        * `segments.gen` (tracks the latest `segments_N` file)
        * `write.lock` (prevents concurrent writers)
* stores documents plus additional information (term dictionary, term frequencies)
    * term dictionary: maps each term to identifiers of documents containing that term
    * term frequencies: number of appearances of a term in a document
* commonly used when indexing date-based information (like log files)
* number of replica shards can be changed at any time
* is customizable, for example: shard based on the customer’s country
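* a minimal sketch: the primary shard count is fixed at index creation, while replicas can be changed later (index name is illustrative)
    ```
    PUT /my-index
    {
      "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
      }
    }

    PUT /my-index/_settings
    {
      "number_of_replicas": 2
    }
    ```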

### segment
* contains
    * inverted index
        * term dictionary
            * maps a term to an offset in a postings list
            * contains document frequency
                * example
                    * term: "shoes" → df = 2, offset = 842
                    * Lucene knows that starting at offset 842 it should read exactly 2 document entries
        * postings lists
            * store all the information Lucene needs to retrieve and rank documents
                * example: documents where the term appears, term frequency
    * stored fields (original document fields)
    * doc values (columnar storage for sorting, aggregations, faceting)
        * example
            ```
            DocID    price (doc value)
            --------------------------
            0        59.99
            1        19.99
            2        129.99
            ```
    * norms (field-level stats for scoring)
        * example: length of the field
            * longer fields tend to be less precise
    * involves 10+ small files per segment
        * in particular: 100 segments => 1,000+ files
            * problem: file handle exhaustion
            * solution: `.cfs` (compound file format)
                * Lucene can read them as if they were separate files (using random-access lookups inside `.cfs`)
                * not compressed
                    * just a flat concatenation of multiple Lucene data files
* is immutable
    * new ones are created as you index new documents
    * deleting only marks documents as deleted
        * Lucene supports deletes via a live-docs bitmap, not by physically removing the data immediately
        * cleaned up during segment merges
    * updating documents implies re-indexing
        * updating a document can’t change the actual document; it can only index a new one
* are easily cached, making searches fast
* when querying a shard, Lucene queries all its segments, merges the results, and sends them back
* normal indexing operations create many such small segments
    * the more segments you have to go through, the slower the search
    * solution: merging
        * creating new and bigger segments with combined content
            * example: excluding the deleted documents
        * commit: writes a new `segments_N` listing the new merged segment, not the segments that were merged
        * tiered - default merge policy
            * segments divided into tiers by size
                * example
                    ```
                    Tier 1: segments ≤ 5 MB
                    Tier 2: segments ≤ 25 MB
                    Tier 3: segments ≤ 150 MB
                    ...
                    ```
            * each tier has a threshold number of segments
                * if the threshold is hit in a tier => a merge is triggered in that tier
            * prioritizes merging small segments first (cheap, fast)
            * avoids merging huge segments unless necessary
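        * segments can be inspected and a merge triggered manually (sketch, assuming an index named `my-index`)
            ```
            GET /my-index/_segments                        // list segments per shard

            POST /my-index/_forcemerge?max_num_segments=1  // merge down to a single segment
            ```
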
420
+
205
421
### scoring
* TF: how often a term occurs in the text
* IDF: the token's importance is inversely proportional to the number of occurrences across all of the documents
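* a sketch of the classic TF-IDF flavor (Lucene’s older ClassicSimilarity; the current default, BM25, follows the same intuition with term-frequency saturation and length normalization)
    ```
    tf(t, d) = sqrt(frequency of t in d)
    idf(t)   = 1 + ln(numDocs / (docFreq(t) + 1))
    ```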