
Commit 3a359b1

Merge pull request #1197 from kushalbakshi/master
Document `dj.Top()` and add missing pages
2 parents ea0e9c4 + aaa1dbe commit 3a359b1

20 files changed: +1473 −146 lines

docs/src/client/stores.md

Lines changed: 0 additions & 1 deletion
This file was deleted.

docs/src/concepts/data-model.md

Lines changed: 97 additions & 42 deletions
@@ -2,11 +2,23 @@

 ## What is a data model?

-A **data model** refers to a conceptual framework for thinking about data and about
-operations on data.
-A data model defines the mental toolbox of the data scientist; it has less to do with
-the architecture of the data systems, although architectures are often intertwined with
-data models.
+A **data model** is a conceptual framework that defines how data is organized,
+represented, and transformed. It gives us the components for creating blueprints for the
+structure and operations of data management systems, ensuring consistency and efficiency
+in data handling.
+
+Data management systems are built to accommodate these models, allowing us to manage
+data according to the principles laid out by the model. If you’re studying data science
+or engineering, you’ve likely encountered different data models, each providing a unique
+approach to organizing and manipulating data.
+
+A data model is defined by considering the following key aspects:
+
++ What are the fundamental elements used to structure the data?
++ What operations are available for defining, creating, and manipulating the data?
++ What mechanisms exist to enforce the structure and rules governing valid data interactions?

 ## Types of data models

 Among the most familiar data models are those based on files and folders: data of any
 kind are lumped together into binary strings called **files**, files are collected into
@@ -24,17 +36,16 @@ objects in memory with properties and methods for transformations of such data.

 ## Relational data model

 The **relational model** is a way of thinking about data as sets and operations on sets.
-Formalized almost a half-century ago
-([Codd, 1969](https://dl.acm.org/citation.cfm?doid=362384.362685)), the relational data
-model provides the most rigorous approach to structured data storage and the most
-precise approach to data querying.
-The model is defined by the principles of data representation, domain constraints,
-uniqueness constraints, referential constraints, and declarative queries as summarized
-below.
+Formalized almost a half-century ago ([Codd,
+1969](https://dl.acm.org/citation.cfm?doid=362384.362685)), the relational data model is
+one of the most powerful and precise ways to store and manage structured data. At its
+core, this model organizes all data into tables (representing mathematical relations),
+where each table consists of rows (representing mathematical tuples) and columns (often
+called attributes).

 ### Core principles of the relational data model

-**Data representation**
+**Data representation:**
 Data are represented and manipulated in the form of relations.
 A relation is a set (i.e. an unordered collection) of entities of values for each of
 the respective named attributes of the relation.
@@ -43,27 +54,27 @@ below.

 A collection of base relations with their attributes, domain constraints, uniqueness
 constraints, and referential constraints is called a schema.

-**Domain constraints**
-Attribute values are drawn from corresponding attribute domains, i.e. predefined sets
-of values.
-Attribute domains may not include relations, which keeps the data model flat, i.e.
-free of nested structures.
+**Domain constraints:**
+Each attribute (column) in a table is associated with a specific attribute domain (or
+datatype, a set of possible values), ensuring that the data entered is valid.
+Attribute domains may not include relations, which keeps the data model
+flat, i.e. free of nested structures.

-**Uniqueness constraints**
+**Uniqueness constraints:**
 Entities within relations are addressed by values of their attributes.
 To identify and relate data elements, uniqueness constraints are imposed on subsets
 of attributes.
 Such subsets are then referred to as keys.
 One key in a relation is designated as the primary key used for referencing its elements.

-**Referential constraints**
-Associations among data are established by means of referential constraints with the
+**Referential constraints:**
+Associations among data are established by means of referential constraints with the
 help of foreign keys.
 A referential constraint on relation A referencing relation B allows only those
 entities in A whose foreign key attributes match the key attributes of an entity in B.

-**Declarative queries**
-Data queries are formulated through declarative, as opposed to imperative,
+**Declarative queries:**
+Data queries are formulated through declarative, as opposed to imperative,
 specifications of sought results.
 This means that query expressions convey the logic for the result rather than the
 procedure for obtaining it.
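
These constraints are easiest to see in a concrete table declaration. The following is a hypothetical DataJoint sketch (the `Mouse` and `Session` tables and the schema name are illustrative, not part of this commit): the primary key carries the uniqueness constraint, the attribute types carry the domain constraints, and the `->` line carries the referential constraint.

```python
import datajoint as dj

schema = dj.Schema('tutorial')   # hypothetical schema name; assumes a configured connection

@schema
class Mouse(dj.Manual):
    definition = """
    mouse_id: int                # uniqueness constraint: primary key
    ---
    date_of_birth: date          # domain constraint: values must be valid dates
    """

@schema
class Session(dj.Manual):
    definition = """
    -> Mouse                     # referential constraint: must match an existing Mouse
    session_idx: int
    ---
    session_notes: varchar(255)
    """
```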
@@ -86,32 +97,76 @@ Similar to spreadsheets, relations are often visualized as tables with *attribut

 corresponding to *columns* and *entities* corresponding to *rows*.
 In particular, SQL uses the terms *table*, *column*, and *row*.

-## DataJoint is a refinement of the relational data model
+## The DataJoint Model

 DataJoint is a conceptual refinement of the relational data model offering a more
-expressive and rigorous framework for database programming
-([Yatsenko et al., 2018](https://arxiv.org/abs/1807.11104)).
-The DataJoint model facilitates clear conceptual modeling, efficient schema design, and
-precise and flexible data queries.
-The model has emerged over a decade of continuous development of complex data pipelines
-for neuroscience experiments
-([Yatsenko et al., 2015](https://www.biorxiv.org/content/early/2015/11/14/031658)).
-DataJoint has allowed researchers with no prior knowledge of databases to collaborate
-effectively on common data pipelines sustaining data integrity and supporting flexible
-access.
-DataJoint is currently implemented as client libraries in MATLAB and Python.
-These libraries work by transpiling DataJoint queries into SQL before passing them on
-to conventional relational database systems that serve as the backend, in combination
-with bulk storage systems for storing large contiguous data objects.
+expressive and rigorous framework for database programming ([Yatsenko et al.,
+2018](https://arxiv.org/abs/1807.11104)). The DataJoint model facilitates conceptual
+clarity, efficiency, workflow management, and precise and flexible data
+queries. By enforcing entity normalization,
+simplifying dependency declarations, offering a rich query algebra, and visualizing
+relationships through schema diagrams, DataJoint makes relational database programming
+more intuitive and robust for complex data pipelines.
+
+The model has emerged over a decade of continuous development of complex data
+pipelines for neuroscience experiments ([Yatsenko et al.,
+2015](https://www.biorxiv.org/content/early/2015/11/14/031658)). DataJoint has allowed
+researchers with no prior knowledge of databases to collaborate effectively on common
+data pipelines sustaining data integrity and supporting flexible access. DataJoint is
+currently implemented as client libraries in MATLAB and Python. These libraries work by
+transpiling DataJoint queries into SQL before passing them on to conventional relational
+database systems that serve as the backend, in combination with bulk storage systems for
+storing large contiguous data objects.

 DataJoint comprises:

-- a schema [definition](../design/tables/declare.md) language
-- a data [manipulation](../manipulation/index.md) language
-- a data [query](../query/principles.md) language
-- a [diagramming](../design/diagrams.md) notation for visualizing relationships between
++ a schema [definition](../design/tables/declare.md) language
++ a data [manipulation](../manipulation/index.md) language
++ a data [query](../query/principles.md) language
++ a [diagramming](../design/diagrams.md) notation for visualizing relationships between
 modeled entities

 The key refinement of DataJoint over other relational data models and their
 implementations is DataJoint's support of
 [entity normalization](../design/normalization.md).
+
+### Core principles of the DataJoint model
+
+**Entity Normalization**
+DataJoint enforces entity normalization, ensuring that every entity set (table) is
+well-defined, with each element belonging to the same type, sharing the same
+attributes, and distinguished by the same primary key. This principle reduces
+redundancy and avoids data anomalies, similar to Boyce-Codd Normal Form, but with a
+more intuitive structure than traditional SQL.
+
+**Simplified Schema Definition and Dependency Management**
+DataJoint introduces a schema definition language that is more expressive and less
+error-prone than SQL. Dependencies are explicitly declared using arrow notation
+(->), making referential constraints easier to understand and visualize. The
+dependency structure is enforced as an acyclic directed graph, which simplifies
+workflows by preventing circular dependencies.
+
+**Integrated Query Operators producing a Relational Algebra**
+DataJoint introduces five query operators (restrict, join, project, aggregate, and
+union) with algebraic closure, allowing them to be combined seamlessly. These
+operators are designed to maintain operational entity normalization, ensuring query
+outputs remain valid entity sets.
+
+**Diagramming Notation for Conceptual Clarity**
+DataJoint’s schema diagrams simplify the representation of relationships between
+entity sets compared to ERM diagrams. Relationships are expressed as dependencies
+between entity sets, which are visualized using solid or dashed lines for primary
+and secondary dependencies, respectively.
+
+**Unified Logic for Binary Operators**
+DataJoint simplifies binary operations by requiring attributes involved in joins or
+comparisons to be homologous (i.e., sharing the same origin). This avoids the
+ambiguity and pitfalls of natural joins in SQL, ensuring more predictable query
+results.
+
+**Optimized Data Pipelines for Scientific Workflows**
+DataJoint treats the database as a data pipeline where each entity set defines a
+step in the workflow. This makes it ideal for scientific experiments and complex
+data processing, such as in neuroscience. Its MATLAB and Python libraries transpile
+DataJoint queries into SQL, bridging the gap between scientific programming and
+relational databases.
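
To make the query algebra concrete, here is a hypothetical sketch reusing the `Mouse` and `Session` tables from the sketch above (not part of this commit). The point of algebraic closure is that every expression below is itself a valid entity set, so the operators compose freely:

```python
young = Mouse & 'date_of_birth > "2023-01-01"'       # restrict
mouse_sessions = Mouse * Session                     # join
notes = Session.proj('session_notes')                # project
session_counts = Mouse.aggr(Session, n='count(*)')   # aggregate

# Closure: operators chain into a single declarative query,
# transpiled to one SQL statement only when results are fetched
recent_notes = (Mouse * Session & 'date_of_birth > "2023-01-01"').proj('session_notes')
rows = recent_notes.fetch(as_dict=True)
```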

docs/src/concepts/data-pipelines.md

Lines changed: 7 additions & 7 deletions
@@ -157,10 +157,10 @@ with external groups.

 ## Summary of DataJoint features

 1. A free, open-source framework for scientific data pipelines and workflow management
-1. Data hosting in cloud or in-house
-1. MySQL, filesystems, S3, and Globus for data management
-1. Define, visualize, and query data pipelines from MATLAB or Python
-1. Enter and view data through GUIs
-1. Concurrent access by multiple users and computational agents
-1. Data integrity: identification, dependencies, groupings
-1. Automated distributed computation
+2. Data hosting in cloud or in-house
+3. MySQL, filesystems, S3, and Globus for data management
+4. Define, visualize, and query data pipelines from MATLAB or Python
+5. Enter and view data through GUIs
+6. Concurrent access by multiple users and computational agents
+7. Data integrity: identification, dependencies, groupings
+8. Automated distributed computation

docs/src/concepts/teamwork.md

Lines changed: 33 additions & 36 deletions
@@ -5,10 +5,9 @@

 Science labs organize their projects as a sequence of activities of experiment design,
 data acquisition, and processing and analysis.

-<figure markdown>
-![data science in a science lab](../images/data-science-before.png){: style="width:520px; align:center"}
-<figcaption>Workflow and dataflow in a common findings-centered approach to data science in a science lab.</figcaption>
-</figure>
+![data science in a science lab](../images/data-science-before.png){: style="width:510px; display:block; margin: 0 auto;"}
+
+<figcaption style="text-align: center;">Workflow and dataflow in a common findings-centered approach to data science in a science lab.</figcaption>

 Many labs lack a uniform data management strategy that would span longitudinally across
 the entire project lifecycle as well as laterally across different projects.

@@ -29,10 +28,9 @@ This approach requires formulating a general data science plan and upfront inves

 for setting up resources and processes and training the teams.
 The team uses DataJoint to build data pipelines to support multiple projects.

-<figure markdown>
-![data science in a science lab](../images/data-science-after.png){: style="width:510px; align:center"}
-<figcaption>Workflow and dataflow in a data pipeline-centered approach.</figcaption>
-</figure>
+![data science in a science lab](../images/data-science-after.png){: style="width:510px; display:block; margin: 0 auto;"}
+
+<figcaption style="text-align: center;">Workflow and dataflow in a data pipeline-centered approach.</figcaption>

 Data pipelines support project data across their entire lifecycle, including the
 following functions

@@ -55,42 +53,41 @@ data integrity.

 The adoption of a uniform data management framework allows separation of roles and
 division of labor among team members, leading to greater efficiency and better scaling.

-<figure markdown>
-![data science vs engineering](../images/data-engineering.png){: style="width:350px; align:center"}
-<figcaption>Distinct responsibilities of data science and data engineering.</figcaption>
-</figure>
+![data science vs engineering](../images/data-engineering.png){: style="width:510px; display:block; margin: 0 auto;"}
+
+<figcaption style="text-align: center;">Distinct responsibilities of data science and data engineering.</figcaption>

-Scientists
+### Scientists

-design and conduct experiments, collecting data.
-They interact with the data pipeline through graphical user interfaces designed by
-others.
-They understand what analysis is used to test their hypotheses.
+Design and conduct experiments, collecting data.
+They interact with the data pipeline through graphical user interfaces designed by
+others.
+They understand what analysis is used to test their hypotheses.

-Data scientists
+### Data scientists

-have the domain expertise and select and implement the processing and analysis
-methods for experimental data.
-Data scientists are in charge of defining and managing the data pipeline using
-DataJoint's data model, but they may not know the details of the underlying
-architecture.
-They interact with the pipeline using client programming interfaces directly from
-languages such as MATLAB and Python.
+Have the domain expertise and select and implement the processing and analysis
+methods for experimental data.
+Data scientists are in charge of defining and managing the data pipeline using
+DataJoint's data model, but they may not know the details of the underlying
+architecture.
+They interact with the pipeline using client programming interfaces directly from
+languages such as MATLAB and Python.

-The bulk of this manual is written for working data scientists, except for System
-Administration.
+The bulk of this manual is written for working data scientists, except for System
+Administration.

-Data engineers
+### Data engineers

-work with the data scientists to support the data pipeline.
-They rely on their understanding of the DataJoint data model to configure and
-administer the required IT resources such as database servers, data storage
-servers, networks, cloud instances, [Globus](https://globus.org) endpoints, etc.
-Data engineers can provide general solutions such as web hosting, data publishing,
-interfaces, exports and imports.
+Work with the data scientists to support the data pipeline.
+They rely on their understanding of the DataJoint data model to configure and
+administer the required IT resources such as database servers, data storage
+servers, networks, cloud instances, [Globus](https://globus.org) endpoints, etc.
+Data engineers can provide general solutions such as web hosting, data publishing,
+interfaces, exports and imports.

-The System Administration section of this tutorial contains materials helpful in
-accomplishing these tasks.
+The System Administration section of this tutorial contains materials helpful in
+accomplishing these tasks.

 DataJoint is designed to delineate a clean boundary between **data science** and **data
 engineering**.

docs/src/design/alter.md

Lines changed: 52 additions & 0 deletions
@@ -1 +1,53 @@

 # Altering Populated Pipelines
+
+Tables can be altered after they have been declared and populated. This is useful when
+you want to add new secondary attributes or change the data type of existing attributes.
+Users can use the `definition` property to update a table's attributes and then use
+`alter` to apply the changes in the database. Currently, `alter` does not support
+changes to primary key attributes.
+
+Let's say we have a table `Student` with the following attributes:
+
+```python
+@schema
+class Student(dj.Manual):
+    definition = """
+    student_id: int
+    ---
+    first_name: varchar(40)
+    last_name: varchar(40)
+    home_address: varchar(100)
+    """
+```
+
+We can modify the table to include a new attribute `email`:
+
+```python
+Student.definition = """
+student_id: int
+---
+first_name: varchar(40)
+last_name: varchar(40)
+home_address: varchar(100)
+email: varchar(100)
+"""
+Student.alter()
+```
+
+The `alter` method will update the table in the database to include the new attribute
+`email` added by the user in the table's `definition` property.
+
+Similarly, you can modify the data type or length of an existing attribute. For example,
+to alter the `home_address` attribute to have a length of 200 characters:
+
+```python
+Student.definition = """
+student_id: int
+---
+first_name: varchar(40)
+last_name: varchar(40)
+home_address: varchar(200)
+email: varchar(100)
+"""
+Student.alter()
+```
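
A quick way to confirm the change took effect is to read the table's definition back from the server. This is a usage sketch assuming an active connection and the `Student` table from the diff above:

```python
# The returned definition should now list home_address: varchar(200)
# and email: varchar(100).
print(Student.describe())
```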

docs/src/design/integrity.md

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@

 # Data Integrity

-The term **data integrity** describes guarantees made by the data management process
-that prevent errors and corruption in data due to technical failures and human errors
+The term **data integrity** describes guarantees made by the data management process
+that prevent errors and corruption in data due to technical failures and human errors
 arising in the course of continuous use by multiple agents.
 DataJoint pipelines respect the following forms of data integrity: **entity
 integrity**, **referential integrity**, and **group integrity** as described in more

docs/src/design/tables/blobs.md

Lines changed: 26 additions & 1 deletion
@@ -1 +1,26 @@

-# Work in progress
+# Blobs
+
+DataJoint provides functionality for serializing and deserializing complex data types
+into binary blobs for efficient storage and compatibility with MATLAB's mYm
+serialization. This includes support for:
+
++ Basic Python data types (e.g., integers, floats, strings, dictionaries).
++ NumPy arrays and scalars.
++ Specialized data types like UUIDs, decimals, and datetime objects.
+
+## Serialization and Deserialization Process
+
+Serialization converts Python objects into a binary representation for efficient storage
+within the database. Deserialization converts the binary representation back into the
+original Python object.
+
+Blobs over 1 KiB are compressed using the zlib library to reduce storage requirements.
+
+## Supported Data Types
+
+DataJoint supports the following data types for serialization:
+
++ Scalars: Integers, floats, booleans, strings.
++ Collections: Lists, tuples, sets, dictionaries.
++ NumPy: Arrays, structured arrays, and scalars.
++ Custom Types: UUIDs, decimals, datetime objects, MATLAB cell and struct arrays.
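
In practice the serialization is transparent: a blob-typed attribute accepts a supported Python object on insert and returns the reconstructed object on fetch. A minimal round-trip sketch (the `Recording` table and schema name are hypothetical, and a configured database connection is assumed):

```python
import datajoint as dj
import numpy as np

schema = dj.Schema('blob_demo')  # hypothetical schema name

@schema
class Recording(dj.Manual):
    definition = """
    recording_id: int
    ---
    waveform: longblob           # serialized automatically on insert
    """

# The array is serialized (and zlib-compressed when over 1 KiB) on insert
Recording.insert1({'recording_id': 1, 'waveform': np.random.randn(10_000)})

# Fetch deserializes the blob back into the original NumPy array
waveform = (Recording & {'recording_id': 1}).fetch1('waveform')
assert isinstance(waveform, np.ndarray)
```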
