Commit 4417037 ("doc update"), 1 parent 02c0f9d

README.md: 1 file changed, +66 -57
@@ -99,8 +99,9 @@
* dynamic mapping
  * automatically detects the data types of fields
  * might yield suboptimal results for specific use cases
  * default mappings
    * defined using dynamic templates
    * example: map `app_*.code` as keywords
```
PUT /logs
{
@@ -122,18 +123,28 @@
```
"app_error.code": { "type": "keyword" }
"app_warning.code": { "type": "keyword" }
```
* add new fields automatically
  * use case: some fields cannot be known in advance
* some data types cannot be automatically detected
  * example: `geo_point`
    * can be represented in multiple ways
      * string: `"41.12,-71.34"`
        * looks like text
        * which comes first: latitude or longitude?
      * array: `[ -71.34, 41.12 ]`
        * looks like a numeric array
      * object: `{ "lat": 41.12, "lon": -71.34 }`
        * looks like JSON
    * so Elasticsearch requires you to explicitly declare `geo_point` fields in the mapping
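The ambiguity can be made concrete with a small Python sketch (a toy helper for illustration, not Elasticsearch code). Note that Elasticsearch's string form puts latitude first, while the array form follows GeoJSON order, longitude first:

```python
# Toy helper (not Elasticsearch code): normalize the three geo_point
# representations shown above into a single (lat, lon) tuple.
def parse_geo_point(value):
    if isinstance(value, str):
        # string form: "lat,lon"
        lat, lon = (float(part) for part in value.split(","))
        return (lat, lon)
    if isinstance(value, (list, tuple)):
        # array form: [lon, lat] - GeoJSON order, longitude first!
        lon, lat = value
        return (lat, lon)
    if isinstance(value, dict):
        # object form: {"lat": ..., "lon": ...}
        return (value["lat"], value["lon"])
    raise TypeError("unsupported geo_point representation")

# All three forms describe the same point:
print(parse_geo_point("41.12,-71.34"))                 # (41.12, -71.34)
print(parse_geo_point([-71.34, 41.12]))                # (41.12, -71.34)
print(parse_geo_point({"lat": 41.12, "lon": -71.34}))  # (41.12, -71.34)
```

Since none of these shapes is self-describing, automatic detection would have to guess, which is why the mapping must state the type explicitly.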
* explicit mapping
  * used to have greater control over fields
  * recommended for production use cases
  * can’t change mappings for fields that are already mapped
    * requires reindexing
    * sometimes multi-fields are a solution (index the same field in different ways)
      * drawback: old documents will not have them
    * example
```
"city": {
  "type": "text",
@@ -145,24 +156,32 @@
}
```
* mapping explosion
  * too many fields in an index => risk of out of memory errors
  * can be caused by lack of control over dynamic mapping
    * example: every new document inserted introduces new fields
  * solution: use the mapping limit settings to limit the number of field mappings
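For example, the total-fields limit can be set per index (assuming the `logs` index from the earlier example; in Elasticsearch this setting is `index.mapping.total_fields.limit`, which defaults to 1000):

```
PUT /logs/_settings
{
  "index.mapping.total_fields.limit": 500
}
```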
### indices
* logical namespace that holds a collection of documents
  * can be considered as a table
* logical abstraction over one or more Lucene indices (called shards)
  * by default: all shards are queried
    * solution: create logical groups of data in separate indices
    * example
```
customers-switzerland → 2 shards
customers-germany → 2 shards
customers-rest → 1 shard
```
* can be thought of as an optimized collection of documents
  * each indexed field has an optimized data structure
  * example
    * text fields -> inverted indices
    * numeric and geo fields -> BKD trees
* near-real-time search
  * searches do not run on the latest indexed data
    * indexing a doc ≠ instant search visibility
    * however, a document can be retrieved by ID immediately
    * but a search query won’t return it until a refresh happens
  * point-in-time view of the index
    * multiple searches hit the same files and reuse the same caches
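The visibility rules above can be modeled with a toy Python class (purely illustrative, not how Elasticsearch is implemented): get-by-ID reads the live store, while search only sees the last refreshed point-in-time snapshot:

```python
# Toy model of near-real-time visibility: GET by ID sees the live
# store immediately, search sees only the last refreshed snapshot.
class ToyIndex:
    def __init__(self):
        self.docs = {}        # live store: GET by ID reads this
        self.searchable = {}  # point-in-time view: search reads this

    def index(self, doc_id, doc):
        self.docs[doc_id] = doc

    def get(self, doc_id):
        return self.docs.get(doc_id)

    def refresh(self):
        # make newly indexed documents visible to searches
        self.searchable = dict(self.docs)

    def search(self, word):
        return [i for i, d in self.searchable.items() if word in d]

idx = ToyIndex()
idx.index(1, "hello world")
print(idx.get(1))           # 'hello world' - retrievable by ID immediately
print(idx.search("hello"))  # [] - not searchable yet
idx.refresh()
print(idx.search("hello"))  # [1] - visible after refresh
```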
@@ -182,7 +201,7 @@
* makes newly indexed documents searchable
* writes the in-memory buffer into a new Lucene segment
  * segment files usually reside in the OS page cache (memory)
  * aren’t guaranteed to be persisted until an `fsync` or flush
    * in particular: files may never hit the actual disk
    * Lucene will ignore them if there's no updated `segments_N`
      * => the update is done during commit
@@ -193,6 +212,7 @@
* example: buffer
* every search request is handled by
  * grabbing the current active searcher
    * each shard knows its current searcher
  * executing the query against that consistent view
* writes don’t interfere with ongoing searches
* when
@@ -203,7 +223,7 @@
* commit
  * it is not about search
    * does not affect search => searchers see segments based on refresh, not commit
  * uses `fsync`
    * the only way to guarantee that the operating system has actually written data to disk
  * pauses index writers briefly
    * to ensure that the commit reflects a consistent index state
@@ -262,19 +282,18 @@
  * note that sometimes (very rarely) stopwords are important and can be helpful: "to be, or not to be"
* adding synonyms
* token indexing: stores those tokens into the index
* the query text undergoes the same analysis before the terms are looked up in the index
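The whole chain, from analysis through token indexing to query-time analysis, can be sketched in Python (a toy analyzer with made-up stopword and synonym lists, not Lucene's implementation):

```python
# Toy analysis chain: lowercase, tokenize, drop stopwords, expand
# synonyms - then build an inverted index from the resulting tokens.
# Queries go through the same chain before terms are looked up.
STOPWORDS = {"the", "a", "to", "or", "not"}
SYNONYMS = {"quick": ["fast"]}

def analyze(text):
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    for t in list(tokens):
        tokens.extend(SYNONYMS.get(t, []))  # add synonym tokens
    return tokens

def index_docs(docs):
    inverted = {}  # term -> set of ids of documents containing it
    for doc_id, text in docs.items():
        for token in analyze(text):
            inverted.setdefault(token, set()).add(doc_id)
    return inverted

inverted = index_docs({1: "The quick brown fox", 2: "a slow brown dog"})
print(sorted(inverted["brown"]))  # [1, 2]
print(sorted(inverted["fast"]))   # [1] - found via synonym expansion
print(analyze("the quick"))       # ['quick', 'fast'] - query analyzed the same way
```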
### node
* node is an instance of Elasticsearch
* multiple nodes can join the same cluster
* cluster
  * same data can be spread across multiple servers (replication)
    * helps performance: adds resources to work with
    * helps reliability: data is replicated
  * all nodes need to be on the same network
    * balancing shards across data centers simply takes too long
    * example: the master issues relocation commands if it detects an unbalanced shard distribution
* cross-cluster replication (CCR)
  * allows you to replicate data from one cluster (leader cluster) to another (follower cluster)
  * example: across data centers, regions, or cloud availability zones
@@ -284,14 +303,16 @@
  * maintains the cluster state (node joins/leaves, index creation, shard allocation)
  * assigns shards to nodes
    * example: when a new index is created
      * based on node capabilities and existing shard distribution
* data
  * stores actual index data (primary and replica shards)
* coordinating
  * maintains a local copy of the cluster state
    * only the master node updates the cluster state, but all nodes subscribe to it
  * routes client requests
    * formula: hash of id % number_of_primary_shards => picks the target shard
      * the number of primary shards in an index is fixed at the time the index is created
      * in particular: Elasticsearch always maps a routing value to a single shard
  * returns the final result
    * example: merges responses to aggregate results
* every node in Elasticsearch can act as a coordinating node
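The routing formula can be sketched in Python (conceptual only: Elasticsearch actually hashes the routing value with murmur3; `zlib.crc32` stands in here as a stable hash):

```python
# Conceptual sketch of shard routing: shard = hash(routing) % primaries.
# zlib.crc32 is a stand-in; Elasticsearch uses murmur3 internally.
import zlib

NUMBER_OF_PRIMARY_SHARDS = 5  # fixed when the index is created

def pick_shard(doc_id, num_primary_shards=NUMBER_OF_PRIMARY_SHARDS):
    routing = str(doc_id)  # by default the routing value is the document id
    return zlib.crc32(routing.encode()) % num_primary_shards

# The same routing value always maps to the same single shard:
print(pick_shard("user-42") == pick_shard("user-42"))  # True
# Every document lands in exactly one of the primary shards:
print(all(0 <= pick_shard(f"doc-{i}") < NUMBER_OF_PRIMARY_SHARDS
          for i in range(100)))  # True
```

Because the shard count is part of the formula, changing the number of primary shards would re-route existing documents, which is why it is fixed at index creation.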
@@ -312,21 +333,23 @@
* metadata files (how to read, decode, and interpret the raw data files in a segment)
  * example: `.fnm` (field names and types)
* commit files (which segments to load after a crash or restart)
  * `segments_N` (snapshot of all current segments)
  * `segments.gen` (tracks the latest `segments_N` file)
  * `write.lock` (prevents concurrent writers)
* can be hosted on any node within the cluster
  * not necessarily distributed across multiple physical or virtual machines
  * example
    * a 1-terabyte index split into four shards (256 GB each)
    * the shards could be distributed across two nodes (2 per node)
    * as you add more nodes to the same cluster, existing shards get balanced between all nodes
* two types of shards: primaries and replicas
  * primary shard: receives all operations that affect the index
    * example: adding, updating, or removing documents
  * flow
    1. operation completes on the primary shard => it is forwarded to each of the replica shards
    2. operation completes on every replica => the replica responds to the primary shard
    3. primary shard responds to the client
* each document is stored in a single primary shard
* replica shard is a copy of a primary shard
  * never allocated to the same node as its primary shard
  * serves two purposes
@@ -335,27 +358,10 @@
* documents are distributed evenly between shards
  * the shard is determined by hashing the document id
    * each shard has an equal hash range
* two main reasons why sharding is important
  * allows you to split and thereby scale volumes of data
  * operations can be distributed across multiple nodes and thereby parallelized
    * multiple machines can potentially work on the same query

### segment
* contains
@@ -402,7 +408,7 @@
* the more segments you have to go through, the slower the search
* solution: merging
  * creating new and bigger segments with combined content
  * commit: writes a new `segments_N` listing the new merged segment and not the segments that were merged
    * example: excluding the deleted documents
* tiered - the default merge policy
  * segments divided into tiers by size
@@ -419,7 +425,10 @@
  * avoids merging huge segments unless necessary
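What a merge does can be sketched with a toy example (illustrative only, not Lucene's merge code): combine the postings of smaller segments into one new segment, dropping deleted documents along the way:

```python
# Toy merge: combine per-segment postings (term -> doc ids) into one
# new segment, skipping deleted documents - mirroring how a merge
# drops deleted docs and the commit then lists only the merged segment.
def merge_segments(segments, deleted_ids):
    merged = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            live = {d for d in doc_ids if d not in deleted_ids}
            if live:
                merged.setdefault(term, set()).update(live)
    return merged

seg1 = {"brown": {1, 2}, "fox": {1}}
seg2 = {"brown": {3}, "dog": {4}}
merged = merge_segments([seg1, seg2], deleted_ids={2})
print(merged)  # doc 2 is gone; 'brown' now lists docs 1 and 3
```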
421427
### scoring
* TF: how often a term occurs in the document
* IDF: the token's importance is inversely proportional to the number of occurrences across all of the documents
  * `IDF = log(N / df)`
    * N = total number of documents
    * df = number of documents containing the term
* Lucene’s default scoring formula, known as TF-IDF
  * apart from normalization & other factors, in general, it is simply: `TF * IDF`
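A worked example of the formula (real Lucene scoring adds the normalization and other factors noted above):

```python
# Core TF-IDF: score ~ TF * IDF, with IDF = log(N / df).
import math

docs = {
    1: ["elasticsearch", "search", "engine"],
    2: ["search", "engine"],
    3: ["distributed", "search"],
}

def tf_idf(term, doc_id):
    tf = docs[doc_id].count(term)                     # term frequency in the document
    df = sum(term in toks for toks in docs.values())  # documents containing the term
    idf = math.log(len(docs) / df)                    # rarer terms => higher IDF
    return tf * idf

# "search" occurs in every document, so IDF = log(3/3) = 0:
print(tf_idf("search", 1))                   # 0.0
# "elasticsearch" occurs in only one document, so it scores higher:
print(round(tf_idf("elasticsearch", 1), 3))  # 1.099
```

This shows why common terms contribute little to relevance: a term present in every document carries no discriminating power, regardless of its TF.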
