* https://chatgpt.com/
* https://www.elastic.co/docs/manage-data/data-store/mapping/removal-of-mapping-types
* https://www.elastic.co/docs/manage-data/data-store/mapping
* https://www.elastic.co/blog/what-is-an-elasticsearch-index

## preface
* goals of this workshop

### mapping
* is the schema definition for the documents in an index
* document = collection of fields + data types
* includes metadata fields
  * example
    * `_index` - index to which the document belongs
    * `_id` - document’s ID
    * `_source` - original JSON representing the body of the document
    * others
* example
  ```
  GET /your-index-name/_mapping

  {
    "people": {
      "mappings": {
        "_source": { // omitted by default
          "enabled": true
        },
        "_meta": { // custom metadata (ex.: for documentation or tooling)
          "version": "1.0",
          "description": "People index for sanction screening"
        },
        "properties": {
          "name": { "type": "text" },
          "birthdate": { "type": "date" },
          "country": { "type": "keyword" },
          "bio_vector": {
            "type": "dense_vector",
            "dims": 384
          }
        }
      }
    }
  }
  ```
* two types
  * dynamic mapping
    * automatically detects the data types of fields
    * might yield suboptimal results for specific use cases
    * default mappings: defined using dynamic templates
      * example: map `app_*.code` as keywords and `app_*.message` as text
        ```
        PUT /logs
        {
          "mappings": {
            "dynamic_templates": [
              {
                "map_app_codes_as_keyword": {
                  "path_match": "app_*.code",
                  "mapping": { "type": "keyword" }
                }
              },
              {
                "map_app_messages_as_text": {
                  "path_match": "app_*.message",
                  "mapping": { "type": "text" }
                }
              }
            ]
          }
        }
        ```
        will produce types
        ```
        "app_error.code": { "type": "keyword" }
        "app_warning.code": { "type": "keyword" }
        "app_error.message": { "type": "text" }
        ```
    * adds new fields automatically
      * use case: don’t know all the field names in advance
    * some data types cannot be automatically detected
      * example: `geo_point`, `geo_shape`
  * explicit mapping
    * used to have greater control over which fields are created
    * recommended for production use cases
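    * a minimal sketch of creating an index with an explicit mapping upfront (index name and fields are illustrative):
      ```
      PUT /people
      {
        "mappings": {
          "properties": {
            "name": { "type": "text" },
            "country": { "type": "keyword" }
          }
        }
      }
      ```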
* can’t change mappings for fields that are already mapped
  * requires reindexing
  * sometimes adding multi-fields (index same field in different ways) is an option, but old documents will not have them
    ```
    "city": {
      "type": "text",
      "fields": {
        "raw": {
          "type": "keyword"
        }
      }
    }
    ```
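    a hypothetical usage sketch: full-text queries hit the analyzed `city` field, while exact-match aggregations use the `city.raw` keyword
    ```
    GET /people/_search
    {
      "query": { "match": { "city": "new york" } },
      "aggs": {
        "by_city": { "terms": { "field": "city.raw" } }
      }
    }
    ```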
* mapping explosion
  * too many fields in an index can cause out-of-memory errors
  * can be caused by lack of control over dynamic mapping
    * example: every new document inserted introduces new fields
  * use the mapping limit settings to limit the number of field mappings
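    * for example, `index.mapping.total_fields.limit` caps the total number of fields (a sketch; the value 2000 is illustrative):
      ```
      PUT /logs/_settings
      {
        "index.mapping.total_fields.limit": 2000
      }
      ```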

### indices
* logical namespace that holds a collection of documents
* can be considered as a table
* logical abstraction over one or more Lucene indices (called shards)
* can be thought of as an optimized collection of documents
* each indexed field has an optimized data structure
  * example
    * text fields -> inverted indices
    * numeric and geo fields -> BKD trees
* near-real time search
  * searches do not run on the latest indexed data
  * indexing a doc ≠ instant search visibility
    * document can be retrieved by ID immediately
    * but a search query won’t return it until a refresh happens
  * point-in-time view of the index
    * multiple searches hit the same files and reuse the same caches
* processes
  * indexing = storing
    * document is put in two places
      * in-memory buffer (Lucene memory buffer)
      * transaction log (called translog) on disk
        * crash recovery log
        * translog is not searchable
    * when
      * document is sent to Elasticsearch
    * after indexing
      * document is durable (even if node crashes)
      * not yet searchable
  * refresh
    * makes newly indexed documents searchable
    * writes the in-memory buffer into a new Lucene segment
      * new segments usually reside in the OS page cache (memory)
      * aren’t guaranteed to be persisted until an fsync or flush
        * in particular: files may never hit the actual disk
        * Lucene will ignore them if there's no updated `segments_N`
          * => update is done during commit
    * opens a new searcher
      * sees all committed segments
      * sees any new segments created by a refresh
      * does not see uncommitted in-memory data
        * example: the in-memory indexing buffer
      * every search request is handled by
        * grabbing the current active searcher
        * executing the query against that consistent view
          * writes don’t interfere with ongoing searches
    * when
      * automatically every 1 second (default)
      * manually: `POST /my-index/_refresh`
    * after refresh
      * documents are searchable
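    * a small sketch of the visibility gap (hypothetical index and document; `refresh=wait_for` makes the indexing call return only after the next refresh):
      ```
      PUT /my-index/_doc/1?refresh=wait_for
      { "title": "hello" }

      GET /my-index/_search
      { "query": { "match": { "title": "hello" } } }
      ```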
  * commit
    * it is not about search
      * does not affect search => searchers see segments based on refresh, not commit
    * uses fsync
      * the only way to guarantee that the operating system has actually written data to disk
    * pauses index writers briefly
      * to ensure that the commit reflects a consistent index state
    * clears the translog (since changes are now safely in the Lucene index)
    * each commit creates a new `segments_N` file with an incremented generation number `(N)`
      * represents the current state of the index
        * lists all the active segments
        * older `segments_N` files are effectively obsolete after a new one is committed
      * binary file
        * textual example
          ```
          Segments:
          ----------
          Segment: _0
            - Uses compound file: true
            - Doc count: 1,000
            - Deleted docs: 0
            - Files:
                _0.cfs
                _0.cfe
                _0.si
            - Codec: Lucene90
            - Segment created with Lucene 9.9.0
          ```
      * Lucene reads this file on startup
        * tells which `.cfs` segment files to load and use
        * reads `segments.gen` to find the latest `segments_N` file
    * when
      * the memory buffer is full
      * time since last flush
      * the transaction log hits a threshold
    * in particular: refreshing and committing are independent
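    * in Elasticsearch terms, the commit happens as part of a flush, which can also be requested manually (sketch; hypothetical index name):
      ```
      POST /my-index/_flush
      ```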

### inverted indexing
* Lucene data structure where it keeps a list of where each word belongs
![alt text](img/inverted-index.jpg)
* example: an index in a book, listing words and the pages they appear on
![alt text](img/book-index.jpg)
* terms are sent to Lucene to be indexed for the document
  * make up the inverted index
* the query text undergoes the same analysis before the terms are looked up in the index
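* the analyze API shows which terms an analyzer produces (a sketch using the built-in `standard` analyzer; text is illustrative):
  ```
  POST /_analyze
  {
    "analyzer": "standard",
    "text": "The QUICK brown foxes"
  }
  ```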

### node
* node is an instance of Elasticsearch
* multiple nodes can join the same cluster
* with a cluster of multiple nodes, the same data can be spread across multiple servers
  * helps performance: Elasticsearch has more resources to work with
  * helps reliability: if you have at least one replica per shard, any node can disappear and Elasticsearch will still serve you all the data
* for performance reasons, the nodes within a cluster need to be on the same network
  * balancing shards in a cluster across nodes in different data centers simply takes too long
  * cross-cluster replication (CCR)
    * allows you to replicate data from one cluster (leader cluster) to another (follower cluster)
    * example: across data centers, regions, or cloud availability zones
* roles
  * master
    * election: Raft-inspired
    * maintains the cluster state (node joins/leaves, index creation, shard allocation)
    * assigns shards to nodes
      * example: when a new index is created
      * based on node capabilities
  * data
    * stores actual index data (primary and replica shards)
  * coordinating
    * maintains a local copy of the cluster state
      * only the master node updates the cluster state, but all nodes subscribe to it
    * routes client requests
      * hash(id) % number_of_primary_shards => picks the target shard
    * returns the final result
      * example: merges per-shard responses to aggregate results
    * every node in Elasticsearch can act as a coordinating node
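  * node roles are configured in `elasticsearch.yml` (a sketch; a node with an empty `node.roles` list acts as a coordinating-only node):
    ```
    node.roles: [ master, data ]
    ```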

### shard
* is a Lucene index: a directory of files containing an inverted index
  * cannot be split or merged easily
* index is just a logical grouping of physical shards
  * each shard is actually a self-contained index
* example
  * the entire index will not fit on either of the nodes
  * we need some way of splitting the index
  * sharding comes to the rescue
* contains
  * segment files
  * metadata files (how to read, decode, and interpret the raw data files in a segment)
    * example: `.fnm` (field names and types)
  * commit files (which segments to load after a crash or restart)
    * `segments_N` (snapshot of all current segments)
    * `segments.gen` (tracks the latest `segments_N` file)
  * `write.lock` (prevents concurrent writers)
* stores documents plus additional information (term dictionary, term frequencies)
  * term dictionary: maps each term to identifiers of documents containing that term
  * term frequencies: number of appearances of a term in a document
* commonly used when indexing date-based information (like log files)
* number of replica shards can be changed at any time
* is customizable, for example: shard based on the customer’s country
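* sketches of both knobs above (hypothetical index names and values; `number_of_replicas` is a dynamic setting, and the `routing` parameter overrides the default `_id`-based routing):
  ```
  PUT /logs/_settings
  { "index": { "number_of_replicas": 2 } }

  PUT /people/_doc/1?routing=germany
  { "name": "Alice", "country": "germany" }
  ```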

### segment
* contains
  * inverted index
    * term dictionary
      * maps each term to an offset in a postings list
      * contains document frequency
      * example
        * term: "shoes" → df = 2, offset = 842
        * Lucene knows that from offset 842 it should read exactly 2 document entries
    * postings lists
      * store all the information Lucene needs to retrieve and rank documents
      * example: documents where the term appears, term frequency
  * stored fields (original document fields)
  * doc values (columnar storage for sorting, aggregations, faceting)
    * example
      ```
      DocID    price (doc value)
      --------------------------
      0        59.99
      1        19.99
      2        129.99
      ```
  * norms (field-level stats for scoring)
    * example: length of the field
      * matches in longer fields count for less in scoring
* involves 10+ small files per segment
  * in particular: 100 segments => 1000+ files
    * problem: file handle exhaustion
  * solution: `.cfs` (compound file format)
    * Lucene can read them as if they were separate files (using random-access lookups inside `.cfs`)
    * not compressed
      * just a flat concatenation of multiple Lucene data files
* is immutable
  * new ones are created as you index new documents
  * deleting only marks documents as deleted
    * Lucene supports deletes via a live-docs bitmap, not by physically removing the data immediately
    * cleaned up during segment merges
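* segments can be inspected per index (sketch; hypothetical index name; the `docs.deleted` column shows deletes awaiting a merge):
  ```
  GET /_cat/segments/my-index?v
  ```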
* updating documents implies re-indexing
  * updating a document can’t change the actual document; it can only index a new one
* are easily cached, making searches fast
* when querying a shard, Lucene queries all its segments, merges the results, and sends them back
* normal indexing operations create many such small segments
  * the more segments you have to go through, the slower the search
  * solution: merging
    * creates new and bigger segments with combined content
      * example: excluding the deleted documents
    * commit: writes a new `segments_N` listing the new merged segment, not the segments that were merged
    * tiered - default merge policy
      * segments divided into tiers by size
        * example
          ```
          Tier 1: segments ≤ 5 MB
          Tier 2: segments ≤ 25 MB
          Tier 3: segments ≤ 150 MB
          ...
          ```
      * each tier has a threshold number of segments
        * if threshold hit in a tier => merge in that tier
      * prioritizes merging small segments first (cheap, fast)
      * avoids merging huge segments unless necessary
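    * a merge can also be forced manually, typically on an index that no longer receives writes (sketch; hypothetical index name):
      ```
      POST /my-index/_forcemerge?max_num_segments=1
      ```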

### scoring
* TF: how often a term occurs in the text
* IDF: the token's importance is inversely proportional to the number of occurrences across all of the documents
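* a rough sketch of how the two combine in classic TF-IDF scoring (modern Elasticsearch defaults to BM25, which uses smoothed variants of both factors):
  ```
  \text{score}(t, d) \propto \text{tf}(t, d) \cdot \text{idf}(t),
  \qquad \text{idf}(t) = \log \frac{N}{\text{df}(t)}
  ```
  where N is the total number of documents and df(t) is the number of documents containing term t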
