## What is a data model?

- A **data model** refers to a conceptual framework for thinking about data and about
- operations on data.
- A data model defines the mental toolbox of the data scientist; it has less to do with
- the architecture of the data systems, although architectures are often intertwined with
- data models.
+ A **data model** is a conceptual framework that defines how data is organized,
+ represented, and transformed. It gives us the components for creating blueprints for the
+ structure and operations of data management systems, ensuring consistency and efficiency
+ in data handling.
+
+ Data management systems are built to accommodate these models, allowing us to manage
+ data according to the principles laid out by the model. If you’re studying data science
+ or engineering, you’ve likely encountered different data models, each providing a unique
+ approach to organizing and manipulating data.
+
+ A data model is defined by considering the following key aspects:
+
+ + What are the fundamental elements used to structure the data?
+ + What operations are available for defining, creating, and manipulating the data?
+ + What mechanisms exist to enforce the structure and rules governing valid data interactions?
+
+ ## Types of data models

Among the most familiar data models are those based on files and folders: data of any
kind are lumped together into binary strings called **files**, files are collected into
@@ -24,17 +36,16 @@ objects in memory with properties and methods for transformations of such data.
## Relational data model

The **relational model** is a way of thinking about data as sets and operations on sets.
- Formalized almost a half-century ago
- ([Codd, 1969](https://dl.acm.org/citation.cfm?doid=362384.362685)), the relational data
- model provides the most rigorous approach to structured data storage and the most
- precise approach to data querying.
- The model is defined by the principles of data representation, domain constraints,
- uniqueness constraints, referential constraints, and declarative queries as summarized
- below.
+ Formalized almost a half-century ago ([Codd,
+ 1969](https://dl.acm.org/citation.cfm?doid=362384.362685)), the relational data model is
+ one of the most powerful and precise ways to store and manage structured data. At its
+ core, this model organizes all data into tables (representing mathematical
+ relations), where each table consists of rows (representing mathematical tuples) and
+ columns (often called attributes).

### Core principles of the relational data model

- **Data representation**
+ **Data representation:**
Data are represented and manipulated in the form of relations.
A relation is a set (i.e. an unordered collection) of entities of values for each of
the respective named attributes of the relation.
@@ -43,27 +54,27 @@ below.
A collection of base relations with their attributes, domain constraints, uniqueness
constraints, and referential constraints is called a schema.

- **Domain constraints**
- Attribute values are drawn from corresponding attribute domains, i.e. predefined sets
- of values.
- Attribute domains may not include relations, which keeps the data model flat, i.e.
- free of nested structures.
+ **Domain constraints:**
+ Each attribute (column) in a table is associated with a specific attribute domain (or
+ datatype, a set of possible values), ensuring that the data entered is valid.
+ Attribute domains may not include relations, which keeps the data model
+ flat, i.e. free of nested structures.

- **Uniqueness constraints**
+ **Uniqueness constraints:**
Entities within relations are addressed by values of their attributes.
To identify and relate data elements, uniqueness constraints are imposed on subsets
of attributes.
Such subsets are then referred to as keys.
One key in a relation is designated as the primary key used for referencing its elements.

- **Referential constraints**
- Associations among data are established by means of referential constraints with the
+ **Referential constraints:**
+ Associations among data are established by means of referential constraints with the
help of foreign keys.
A referential constraint on relation A referencing relation B allows only those
entities in A whose foreign key attributes match the key attributes of an entity in B.

- **Declarative queries**
- Data queries are formulated through declarative, as opposed to imperative,
+ **Declarative queries:**
+ Data queries are formulated through declarative, as opposed to imperative,
specifications of sought results.
This means that query expressions convey the logic for the result rather than the
procedure for obtaining it.
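
The principles above can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `subject` and `session` tables, their columns, and the allowed species values are hypothetical names chosen for illustration; they do not come from the documentation being changed. The sketch shows a domain constraint, a primary key, a foreign key, and a declarative query, under the assumption that SQLite stands in for any conventional relational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: two base relations with their constraints.
conn.executescript("""
CREATE TABLE subject (
    subject_id INTEGER NOT NULL,
    species    TEXT NOT NULL CHECK (species IN ('mouse', 'rat')),  -- domain constraint
    PRIMARY KEY (subject_id)                                       -- uniqueness constraint
);
CREATE TABLE session (
    subject_id   INTEGER NOT NULL,
    session_date TEXT NOT NULL,
    PRIMARY KEY (subject_id, session_date),
    FOREIGN KEY (subject_id) REFERENCES subject (subject_id)       -- referential constraint
);
""")

conn.execute("INSERT INTO subject VALUES (1, 'mouse')")
conn.execute("INSERT INTO session VALUES (1, '2024-01-15')")


def rejected(sql):
    """Return True if the database refuses the statement as a constraint violation."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


domain_violation = rejected("INSERT INTO subject VALUES (2, 'dragon')")     # value outside the domain
duplicate_key = rejected("INSERT INTO subject VALUES (1, 'rat')")           # primary key already used
orphan_session = rejected("INSERT INTO session VALUES (99, '2024-01-16')")  # no matching subject

# Declarative query: states which rows are wanted, not how to find them.
rows = conn.execute("""
    SELECT subject_id, session_date
    FROM session NATURAL JOIN subject
    WHERE species = 'mouse'
""").fetchall()
```

All three invalid inserts are refused by the database itself, so the application code never has to re-check these rules, and the query names the desired result while leaving the access plan to the database engine.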
@@ -86,32 +97,76 @@ Similar to spreadsheets, relations are often visualized as tables with *attribut
corresponding to *columns* and *entities* corresponding to *rows*.
In particular, SQL uses the terms *table*, *column*, and *row*.

- ## DataJoint is a refinement of the relational data model
+ ## The DataJoint Model

DataJoint is a conceptual refinement of the relational data model offering a more
- expressive and rigorous framework for database programming
- ([Yatsenko et al., 2018](https://arxiv.org/abs/1807.11104)).
- The DataJoint model facilitates clear conceptual modeling, efficient schema design, and
- precise and flexible data queries.
- The model has emerged over a decade of continuous development of complex data pipelines
- for neuroscience experiments
- ([Yatsenko et al., 2015](https://www.biorxiv.org/content/early/2015/11/14/031658)).
- DataJoint has allowed researchers with no prior knowledge of databases to collaborate
- effectively on common data pipelines sustaining data integrity and supporting flexible
- access.
- DataJoint is currently implemented as client libraries in MATLAB and Python.
- These libraries work by transpiling DataJoint queries into SQL before passing them on
- to conventional relational database systems that serve as the backend, in combination
- with bulk storage systems for storing large contiguous data objects.
+ expressive and rigorous framework for database programming ([Yatsenko et al.,
+ 2018](https://arxiv.org/abs/1807.11104)). The DataJoint model facilitates conceptual
+ clarity, efficiency, workflow management, and precise and flexible data
+ queries. By enforcing entity normalization,
+ simplifying dependency declarations, offering a rich query algebra, and visualizing
+ relationships through schema diagrams, DataJoint makes relational database programming
+ more intuitive and robust for complex data pipelines.
+
+ The model has emerged over a decade of continuous development of complex data
+ pipelines for neuroscience experiments ([Yatsenko et al.,
+ 2015](https://www.biorxiv.org/content/early/2015/11/14/031658)). DataJoint has allowed
+ researchers with no prior knowledge of databases to collaborate effectively on common
+ data pipelines, sustaining data integrity and supporting flexible access. DataJoint is
+ currently implemented as client libraries in MATLAB and Python. These libraries work by
+ transpiling DataJoint queries into SQL before passing them on to conventional relational
+ database systems that serve as the backend, in combination with bulk storage systems for
+ storing large contiguous data objects.

DataJoint comprises:

- - a schema [definition](../design/tables/declare.md) language
- - a data [manipulation](../manipulation/index.md) language
- - a data [query](../query/principles.md) language
- - a [diagramming](../design/diagrams.md) notation for visualizing relationships between
+ + a schema [definition](../design/tables/declare.md) language
+ + a data [manipulation](../manipulation/index.md) language
+ + a data [query](../query/principles.md) language
+ + a [diagramming](../design/diagrams.md) notation for visualizing relationships between
modeled entities

The key refinement of DataJoint over other relational data models and their
implementations is DataJoint's support of
[entity normalization](../design/normalization.md).
+
+ ### Core principles of the DataJoint model
+
+ **Entity Normalization**
+ DataJoint enforces entity normalization, ensuring that every entity set (table) is
+ well-defined, with each element belonging to the same type, sharing the same
+ attributes, and distinguished by the same primary key. This principle reduces
+ redundancy and avoids data anomalies, similar to Boyce-Codd Normal Form, but with a
+ more intuitive structure than traditional SQL.
+
+ **Simplified Schema Definition and Dependency Management**
+ DataJoint introduces a schema definition language that is more expressive and less
+ error-prone than SQL. Dependencies are explicitly declared using arrow notation
+ (->), making referential constraints easier to understand and visualize. The
+ dependency structure is enforced as a directed acyclic graph, which simplifies
+ workflows by preventing circular dependencies.
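
To make the acyclic-dependency idea concrete, here is a minimal sketch in plain Python using the standard-library `graphlib` module (Python 3.9+). It is not DataJoint's implementation; the `creation_order` helper and the `Subject`, `Session`, and `Scan` table names are hypothetical. Each table is mapped to the set of tables it depends on, and the sketch either returns an order in which every table follows its dependencies or reports a circular dependency.

```python
from graphlib import CycleError, TopologicalSorter


def creation_order(dependencies):
    """Order tables so that each one comes after every table it references.

    `dependencies` maps a table name to the set of tables it depends on.
    Raises ValueError if the dependencies contain a cycle.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # err.args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"circular dependency: {err.args[1]}") from err


# Hypothetical pipeline: Scan -> Session -> Subject.
pipeline = {
    "Subject": set(),
    "Session": {"Subject"},
    "Scan": {"Session"},
}
order = creation_order(pipeline)  # ['Subject', 'Session', 'Scan']
```

Because the graph has no cycles, a valid processing order always exists, which is what lets a schema double as a workflow description.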
+
+ **Integrated Query Operators producing a Relational Algebra**
+ DataJoint introduces five query operators (restrict, join, project, aggregate, and
+ union) with algebraic closure, allowing them to be combined seamlessly. These
+ operators are designed to maintain operational entity normalization, ensuring query
+ outputs remain valid entity sets.
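
Algebraic closure means that every operator takes entity sets as input and returns an entity set, so results can be fed straight into further operators. The sketch below illustrates this with three of the five operators over plain lists of dictionaries. It is an assumption-laden toy, not DataJoint's API: the function names, the sample data, and the simple join on same-named attributes are all hypothetical, and the sketch does not track primary keys or enforce DataJoint's stricter rules for matching attributes.

```python
def restrict(relation, **conditions):
    """Keep only the rows whose attributes equal the given values."""
    return [row for row in relation
            if all(row[name] == value for name, value in conditions.items())]


def project(relation, *attributes):
    """Keep only the named attributes, dropping duplicate rows."""
    result = []
    for row in relation:
        projected = {name: row[name] for name in attributes}
        if projected not in result:
            result.append(projected)
    return result


def join(left, right):
    """Combine rows from both relations that agree on all shared attribute names."""
    result = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[name] == r_row[name] for name in shared):
                result.append({**l_row, **r_row})
    return result


# Hypothetical sample data.
subjects = [
    {"subject_id": 1, "species": "mouse"},
    {"subject_id": 2, "species": "rat"},
]
sessions = [
    {"subject_id": 1, "session_date": "2024-01-15"},
    {"subject_id": 2, "session_date": "2024-01-16"},
]

# Closure in action: the output of each operator is a valid input to the next.
result = project(
    restrict(join(subjects, sessions), species="mouse"),
    "subject_id", "session_date",
)
```

Here `result` contains the single mouse session, and it could itself be passed to `restrict`, `project`, or `join` again without any conversion.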
+
+ **Diagramming Notation for Conceptual Clarity**
+ DataJoint’s schema diagrams simplify the representation of relationships between
+ entity sets compared to ERM diagrams. Relationships are expressed as dependencies
+ between entity sets, which are visualized using solid or dashed lines for primary
+ and secondary dependencies, respectively.
+
+ **Unified Logic for Binary Operators**
+ DataJoint simplifies binary operations by requiring attributes involved in joins or
+ comparisons to be homologous (i.e., sharing the same origin). This avoids the
+ ambiguity and pitfalls of natural joins in SQL, ensuring more predictable query
+ results.
+
+ **Optimized Data Pipelines for Scientific Workflows**
+ DataJoint treats the database as a data pipeline where each entity set defines a
+ step in the workflow. This makes it ideal for scientific experiments and complex
+ data processing, such as in neuroscience. Its MATLAB and Python libraries transpile
+ DataJoint queries into SQL, bridging the gap between scientific programming and
+ relational databases.