|
99 | 99 | * dynamic mapping
|
100 | 100 | * automatically detects the data types of fields
|
101 | 101 | * might yield suboptimal results for specific use cases
|
102 |
| - * default mappings: defined using dynamic templates |
103 |
| - * example: map `app_*.code` as keywords and `app_*.message` as text |
| 102 | + * default mappings |
| 103 | + * defined using dynamic templates |
| 104 | + * example: map `app_*.code` as keywords |
104 | 105 | ```
|
105 | 106 | PUT /logs
|
106 | 107 | {
|
|
122 | 123 | ```
|
123 | 124 | "app_error.code": { "type": "keyword" }
|
124 | 125 | "app_warning.code": { "type": "keyword" }
|
125 |
| - "app_error.message": { "type": "text" } |
126 | 126 | ```
|
127 | 127 | * add new fields automatically
|
128 |
| - * use case: don’t know all the field names in advance |
129 |
| - * some data types that cannot be automatically detected |
130 |
| - * example: `geo_point`, `geo_shape` |
| 128 | + * use case: some fields cannot be known in advance |
| 129 | + * some data types cannot be automatically detected |
| 130 | + * example: `geo_point` |
| 131 | + * can be represented in multiple ways |
| 132 | + * string: `"41.12,-71.34"` |
| 133 | + * looks like text |
| 134 | + * which comes first: latitude or longitude? |
| 135 | + * array: `[ -71.34, 41.12 ]` |
| 136 | + * looks like a numeric array |
| 137 | + * object: `{ "lat": 41.12, "lon": -71.34 }` |
| 138 | + * looks like an ordinary JSON object |
| 139 | + * so Elasticsearch requires `geo_point` fields to be declared explicitly in the mapping |
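    * example: a minimal sketch of such a declaration; the index name `places` and field name `location` are made up for illustration
```
PUT /places
{
  "mappings": {
    "properties": {
      "location": { "type": "geo_point" }
    }
  }
}
```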
131 | 140 | * explicit mapping
|
132 |
| - * used to have greater control over which fields are created |
| 141 | + * used when you need greater control over fields |
133 | 142 | * recommended for production use cases
|
134 | 143 | * can’t change mappings for fields that are already mapped
|
135 | 144 | * requires reindexing
|
136 |
| - * sometimes adding multi-fields (index same field in different ways) is an option, but old documents will not have them |
| 145 | + * sometimes multi-fields are a solution (index the same field in different ways) |
| 146 | + * drawback: old documents will not have them |
| 147 | + * example |
137 | 148 | ```
|
138 | 149 | "city": {
|
139 | 150 | "type": "text",
|
|
145 | 156 | }
|
146 | 157 | ```
|
147 | 158 | * mapping explosion
|
148 |
| - * too many fields in an index an cause out of memory errors |
| 159 | + * too many fields in an index => risk of out of memory errors |
149 | 160 | * can be caused by lack of control over dynamic mapping
|
150 | 161 | * example: every new document inserted introduces new fields
|
151 |
| - * use the mapping limit settings to limit the number of field mappings |
| 162 | + * solution: use the mapping limit settings to cap the number of field mappings |
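  * example: a minimal sketch of raising the field limit; the index name `logs` and the value `2000` are illustrative (the default limit is 1000)
```
PUT /logs/_settings
{
  "index.mapping.total_fields.limit": 2000
}
```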
152 | 163 |
|
153 | 164 | ### indices
|
154 | 165 | * logical namespace that holds a collection of documents
|
155 | 166 | * can be considered a table
|
156 | 167 | * logical abstraction over one or more Lucene indices (called shards)
|
| 168 | + * by default: all shards are queried |
| 169 | + * solution: create logical groups of data in separate indices |
| 170 | + * example |
| 171 | + ``` |
| 172 | + customers-switzerland → 2 shards |
| 173 | + customers-germany → 2 shards |
| 174 | + customers-rest → 1 shard |
| 175 | + ``` |
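  * example: a sketch of how queries can then target only the relevant group (index names taken from the example above)
```
GET /customers-germany/_search    # queries only the 2 shards of that index
GET /customers-*/_search          # queries all customer indices
```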
157 | 176 | * can be thought of as an optimized collection of documents
|
157 | 177 | * each indexed field has an optimized data structure
|
159 | 178 | * example
|
160 | 179 | * text fields -> inverted indices
|
161 | 180 | * numeric and geo fields -> BKD trees
|
162 | 181 | * near-real time search
|
163 | 182 | * searches are not run on the latest indexed data
|
164 |
| - * indexing a doc ≠ instant search visibility |
165 |
| - * document can be retrieved by ID immediately |
| 183 | + * indexing ≠ search visibility |
| 184 | + * however, document can be retrieved by ID immediately |
166 | 185 | * but a search query won’t return it until a refresh happens
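    * example: a minimal sketch of the behaviour; the index name `logs` and the document are illustrative
```
PUT /logs/_doc/1
{ "message": "hello" }

GET /logs/_doc/1             # returns the document immediately
GET /logs/_search?q=hello    # may return no hits until a refresh happens

POST /logs/_refresh          # forces a refresh; the search now finds the document
```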
|
167 | 186 | * point-in-time view of the index
|
168 | 187 | * multiple searches hit the same files and reuse the same caches
|
|
182 | 201 | * makes newly indexed documents searchable
|
183 | 202 | * writes the in-memory buffer into a new Lucene segment
|
184 | 203 | * the new segment files usually reside in the OS page cache (memory)
|
185 |
| - * aren’t guaranteed to be persisted until an fsync or flush |
| 204 | + * aren’t guaranteed to be persisted until `fsync` or `flush` |
186 | 205 | * in particular: files may never hit the actual disk
|
187 | 206 | * Lucene will ignore them if there's no updated `segments_N`
|
188 | 207 | * => update is done during commit
|
|
193 | 212 | * example: buffer
|
194 | 213 | * every search request is handled by
|
195 | 214 | * grabbing the current active searcher
|
| 215 | + * each shard knows its current searcher |
196 | 216 | * executing the query against that consistent view
|
197 | 217 | * writes don’t interfere with ongoing searches
|
198 | 218 | * when
|
|
203 | 223 | * commit
|
204 | 224 | * it is not about search
|
205 | 225 | * does not affect search => searchers see segments based on refresh, not commit
|
206 |
| - * uses fsync |
| 226 | + * uses `fsync` |
207 | 227 | * the only way to guarantee that the operating system has actually written data to disk
|
208 | 228 | * pauses index writers briefly
|
209 | 229 | * to ensure that commit reflects a consistent index state
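    * example: a sketch of triggering and inspecting a commit manually; the index name `logs` is illustrative, and normally Elasticsearch decides on its own when to flush
```
POST /logs/_flush              # triggers a Lucene commit (fsync) for the index
GET /_cat/segments/logs?v      # lists segments and whether they are committed
```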
|
|
262 | 282 | * note that sometimes (very rarely) stopwords are important and can be helpful: "to be, or not to be"
|
263 | 283 | * adding synonyms
|
264 | 284 | * token indexing — stores those tokens into the index
|
265 |
| - * sent to Lucene to be indexed for the document |
266 |
| - * make up the inverted index |
267 | 285 | * the query text undergoes the same analysis before the terms are looked up in the index
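* example: a sketch using the `_analyze` API to see which tokens an analyzer produces; the `standard` analyzer keeps stopwords, while an `english` analyzer would drop them
```
POST /_analyze
{
  "analyzer": "standard",
  "text": "To be, or not to be"
}
```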
|
268 | 286 |
|
269 | 287 | ### node
|
270 | 288 | * node is an instance of Elasticsearch
|
271 | 289 | * multiple nodes can join the same cluster
|
272 |
| -* with a cluster of multiple nodes, the same data can be spread across multiple servers |
273 |
| - * helps performance: because Elasticsearch has more resources to work with |
274 |
| - * helps reliability: if you have at least one replica per shard, any node can disappear and Elasticsearch |
275 |
| - will still serve you all the data |
276 |
| - * for performance reasons, the nodes within a cluster need to be on the same network |
277 |
| - * balancing shards in a cluster across nodes in different data centers simply takes too long |
| 290 | +* cluster |
| 291 | + * same data can be spread across multiple servers (replication) |
| 292 | + * helps performance: adds resources to work with |
| 293 | + * helps reliability: data is replicated |
| 294 | + * all nodes need to be on the same network |
| 295 | + * balancing shards across data centers simply takes too long |
| 296 | + * example: master issues relocation commands if it detects unbalanced shard distribution |
278 | 297 | * cross-cluster replication (CCR)
|
279 | 298 | * allows you to replicate data from one cluster (leader cluster) to another (follower cluster)
|
280 | 299 | * example: across data centers, regions, or cloud availability zones
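  * example: a sketch of creating a follower index; cluster and index names are placeholders, and the remote (leader) cluster connection must already be configured
```
PUT /follower-logs/_ccr/follow
{
  "remote_cluster": "leader-cluster",
  "leader_index": "leader-logs"
}
```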
|
|
284 | 303 | * maintains the cluster state (node joins/leaves, index creation, shard allocation)
|
285 | 304 | * assigns shards to nodes
|
286 | 305 | * example: when a new index is created
|
287 |
| - * based on node capabilities |
| 306 | + * based on node capabilities and existing shard distribution |
288 | 307 | * data
|
289 | 308 | * stores actual index data (primary and replica shards)
|
290 | 309 | * coordinating
|
291 | 310 | * maintains a local copy of the cluster state
|
292 | 311 | * only the master node updates the cluster state, but all nodes subscribe to it
|
293 | 312 | * routes client requests
|
294 |
| - * hash of id % number_of_primary_shards => picks the target shard |
| 313 | + * formula: `shard = hash(id) % number_of_primary_shards` => picks the target shard |
| 314 | + * number of primary shards in an index is fixed at the time that an index is created |
| 315 | + * in particular: a given routing value always maps to the same single shard |
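    * example: a worked sketch with 5 primary shards and a made-up hash value
```
shard = hash(id) % number_of_primary_shards
      = 23 % 5
      = 3          → document goes to shard 3
```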
295 | 316 | * returns final result
|
296 | 317 | * example: merges responses to aggregate results
|
297 | 318 | * every node in Elasticsearch can act as a coordinating node
|
|
312 | 333 | * metadata files (how to read, decode, and interpret the raw data files in a segment)
|
313 | 334 | * example: `.fnm` (field names and types)
|
314 | 335 | * commit files (which segments to load after a crash or restart)
|
315 |
| - * segments_N (snapshot of all current segments) |
316 |
| - * segments.gen (tracks the latest segments_N file) |
317 |
| - * write.lock (prevent concurrent writers) |
318 |
| -* stores documents plus additional information (term dictionary, term frequencies) |
319 |
| - * term dictionary: maps each term to identifiers of documents containing that term |
320 |
| - * term frequencies: number of appearances of a term in a document |
321 |
| - * important for calculating the relevancy score of results |
| 336 | + * `segments_N` (snapshot of all current segments) |
| 337 | + * `segments.gen` (tracks the latest `segments_N` file) |
| 338 | + * `write.lock` (prevents concurrent writers) |
| 339 | +* can be hosted on any node within the cluster |
| 340 | + * not necessarily distributed across multiple physical or virtual machines |
| 341 | + * example |
| 342 | + * a 1 TB index split into four shards (256 GB each) |
| 343 | + * with a two-node cluster, the shards could be distributed across the two nodes (2 per node) |
| 344 | + * as you add more nodes to the cluster, existing shards get rebalanced across all nodes |
322 | 345 | * two types of shards: primaries and replicas
|
323 |
| - * all operations that affect the index — such as adding, updating, or removing documents — are sent to the |
324 |
| - primary shard |
325 |
| - * when the operation completes, the operation will be forwarded to each of the replica shards |
326 |
| - * when the operation has completed successfully on every replica and responded to the primary shard, |
327 |
| - the primary shard will respond to the client that the operation has completed successfully |
| 346 | + * primary shard: receives all operations that affect the index |
| 347 | + * example: adding, updating, or removing documents |
| 348 | + * flow |
| 349 | + 1. operation completes on primary shard => it is forwarded to each of the replica shards |
| 350 | + 1. operation completes on every replica => responds to the primary shard |
| 351 | + 1. primary shard responds to the client |
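    * example: a sketch of setting shard counts at index creation; the index name and values are illustrative, and only the replica count can be changed afterwards
```
PUT /customers
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}

PUT /customers/_settings
{
  "number_of_replicas": 2
}
```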
328 | 352 | * each document is stored in a single primary shard
|
329 |
| - * it is indexed first on the primary shard, then on all replicas of the primary shard |
330 | 353 | * replica shard is a copy of a primary shard
|
331 | 354 | * are never allocated to the same nodes as the primary shards
|
332 | 355 | * serves two purposes
|
|
335 | 358 | * documents are distributed evenly between shards
|
336 | 359 | * the shard is determined by hashing the document id
|
337 | 360 | * each shard has an equal hash range
|
338 |
| - * the current node forwards the document to the node holding that shard |
339 |
| - * indexing operation is replayed by all the replicas of that shard |
340 |
| -* can be hosted on any node within the cluster |
341 |
| - * not necessarily be distributed across multiple physical or virtual machines |
342 |
| - * example |
343 |
| - * 1 terabyte index into four shards (256 gb each) |
344 |
| - * shards could be distributed across the two nodes (2 per node) |
345 |
| - * as you add more nodes to the same cluster, existing shards get balanced between all nodes |
346 | 361 | * two main reasons why sharding is important
|
347 | 362 | * allows you to split and thereby scale volumes of data
|
348 | 363 | * operations can be distributed across multiple nodes and thereby parallelized
|
349 | 364 | * multiple machines can potentially work on the same query
|
350 |
| -* routing: determining which primary shard a given document should be stored in or has been stored in |
351 |
| - * standard: `shard = hash(routing) % total_primary_shards` |
352 |
| - * number of primary shards in an index is fixed at the time that an index is created |
353 |
| - * you could segment the data by date, creating an index for each year: 2014, 2015, 2016, and so on |
354 |
| - * possibility to adjust the number of primary shards based on load and performance of the |
355 |
| - previous indexes |
356 |
| - * commonly used when indexing date-based information (like log files) |
357 |
| - * number of replica shards can be changed at any time |
358 |
| - * is customizable, for example: shard based on the customer’s country |
359 | 365 |
|
360 | 366 | ### segment
|
361 | 367 | * contains
|
|
402 | 408 | * the more segments you have to go through, the slower the search
|
403 | 409 | * solution: merging
|
404 | 410 | * creating new and bigger segments with combined content
|
405 |
| - * commit: writes a new segments_N listing new merged segment and not segments that were merged |
| 411 | + * commit: writes a new `segments_N` listing new merged segment and not segments that were merged |
406 | 412 | * example: excluding the deleted documents
|
407 | 413 | * tiered - default merge policy
|
408 | 414 | * segments divided into tiers by size
|
|
419 | 425 | * avoids merging huge segments unless necessary
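  * example: a sketch of forcing a merge down to a single segment (usually only sensible for read-only indices); the index name `logs` is illustrative
```
POST /logs/_forcemerge?max_num_segments=1
```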
|
420 | 426 |
|
421 | 427 | ### scoring
|
422 |
| -* TF: how often a term occurs in the text |
| 428 | +* TF: how often a term occurs in the document |
423 | 429 | * IDF: the term's importance is inversely proportional to the number of documents that contain it
|
| 430 | + * `IDF = log(N / df)` |
| 431 | + * N = total number of documents |
| 432 | + * df = number of documents containing the term |
424 | 433 | * Lucene’s classic scoring formula, known as TF-IDF (newer Lucene versions default to BM25, which builds on TF and IDF)
|
425 |
| - * apart from normalization & other factors, in general, it is simply: `TF * 1/IDF` |
| 434 | + * apart from normalization & other factors, in general, it is simply: `TF * IDF` |
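* example: a worked sketch with made-up document counts, using log base 10
```
N = 1,000,000 documents
rare term:   df = 100      → IDF = log(1,000,000 / 100)     = 4
common term: df = 900,000  → IDF = log(1,000,000 / 900,000) ≈ 0.05
```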