
Commit 888f7d8

Add components overview and pipeline sequence for Microsoft.Spark <-> Apache Spark integration (#1189)
1 parent 27e8bb5 commit 888f7d8

11 files changed (+340, −0 lines)
@@ -0,0 +1,62 @@

@startuml Spark-dotnet-integration-component-diagram
!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Component.puml
HIDE_STEREOTYPE()
skinparam legend {
  FontColor #Black
}
skinparam dpi 300

title Microsoft Spark Component Diagram

AddComponentTag("ApacheSpark", $sprite="img:./spark-logo.png{scale=0.25}", $legendText="Apache Spark components")
AddComponentTag("dotnet", $sprite="img:./dotnet-logo.png{scale=0.25}")
AddComponentTag("package", $sprite="img:./nuget-logo.png{scale=0.1}", $bgColor="#c09fe0")
AddComponentTag("scala", $sprite="img:./scala-logo.png{scale=0.2}", $bgColor="#c09fe0")
AddComponentTag("inThisRepo", $bgColor="#c09fe0", $legendText="Components that are defined in this repository")

System(SparkDriver, "Spark Driver", "Entire system. Entry point, driver Spark process, cluster manager...", $tags="ApacheSpark")

Boundary(SparkWorkerContainer, "Spark Worker", "Single instance of worker"){
  Component(SparkWorker, "Spark Worker Process", "Java process", "Apache Spark worker process, responsible for handling requests from the driver.", $tags="ApacheSpark")
  Component(DotnetWorker, "Microsoft.Spark.Worker.exe", ".NET process", ".NET executable present on worker nodes. Started with the first request from the worker, and continuously processes tasks. .NET **UDFs** are executed here.", $tags="inThisRepo+dotnet")
}

Container(SparkMoreWorkers, "Additional Spark Workers", "Multiple instances of Spark Worker", $tags="ApacheSpark")

Rel_D(SparkDriver, SparkWorker, "Manages instance, sends tasks", "")
Rel_D(SparkDriver, SparkMoreWorkers, "Sends tasks to additional workers", "","","", "#blue")
BiRel_D(SparkWorker, DotnetWorker, "Creates instance and sends tasks", "Binary over socket")

note right on link
  From Spark's perspective, it communicates with a PySpark worker:
  instead of the path to the Python binary, the path to the .NET worker is provided.
  This allows the same API interaction as with PySpark;
  APIs that are still missing can be added by contributors.
end note

SparkWorker -[dotted,#blue]right- SparkMoreWorkers: Multiple worker instances
Lay_R(SparkWorker, SparkMoreWorkers)

Boundary(UserApp, "User Application"){
  Component(MainApp, "User .NET Application", ".NET executable dll", "Application intended to work with Spark. Contains all user-defined code for Spark: pipeline definition, UDFs, ML, streaming, etc.", $tags="dotnet")
  Component(DotnetSparkPackage, "Microsoft.Spark", "NuGet package", "Communicates with the Microsoft Spark bridge. Contains wrappers over Spark Java objects and API definitions.", $tags="package+inThisRepo")
  Rel(MainApp, DotnetSparkPackage, "Depends on")
}

Component(MicrosoftScalaBridge, "Spark <-> .NET Bridge", "microsoft-spark-xxx.jar", "Entry point for the user app. Started by Spark when spark-submit is invoked. Starts the .NET user app and bridges all API calls to Spark.", $tags="scala+inThisRepo")

Rel_L(MicrosoftScalaBridge, SparkDriver, "Creates Spark objects and controls their lifecycle", "jar loaded to Spark context")
BiRel_L(MicrosoftScalaBridge, UserApp, "Handles all Spark API calls and results retrieval", "Binary over sockets")

Person(user, "User")
Rel_R(user, SparkDriver, "Executes 'spark-submit microsoft-spark-xxx.jar'")

legend right
<#GhostWhite,#black>| |=__Legend__|
|<#c09fe0> | Components that are defined within this repository |
endlegend

@enduml
@@ -0,0 +1,89 @@
@startuml Spark-dotnet-sequence-diagram-simple

title "Sequence Diagram for Processing a Simple Pipeline with Spark .NET\nWithout UDFs or Data Retrieval"

skinparam dpi 300
skinparam BoxPadding 10

actor "User" as user

box "Master Node"
  participant "Spark: Master" as spark_master
  participant "JVM<->.NET Bridge" as bridge
  participant "MyProgram.exe:\nUser .NET App" as dotnet_master
  participant "Microsoft.Spark\n(NuGet Package)" as dotnet_nuget
end box

box "Worker Node\n(One of Many)"
  participant "Spark: Worker" as spark_worker
  participant "Microsoft.Spark.Worker" as dotnet_worker
end box

user -> spark_master: Executes\n**spark-submit** microsoft-spark-xx.jar\n--files MyUdfs.dll MyProgram.zip
activate spark_master

spark_master -> bridge: Load and start executing JAR
activate bridge
bridge -> dotnet_master: Start MyProgram
deactivate bridge

activate dotnet_master
dotnet_master -> dotnet_nuget: Build SparkSession
deactivate dotnet_master
activate dotnet_nuget

dotnet_nuget -> bridge: Connect to socket,\nRequest Spark Session creation
activate bridge
return Reference to JVM object SparkSession
note over dotnet_nuget
  Each .NET Spark-related object has a JvmObjectReference.
  Whenever a method/property call on these objects is requested,
  it is broadcast over the socket to the bridge,
  where the actual execution occurs.
end note
return Session
activate dotnet_master

note over dotnet_master
  Pipeline execution:
  ""_spark""
  ""  .Read()""
  ""  .Parquet(@"C:\data.parquet")""
  ""  .GroupBy(Col("Faculty"))""
  ""  .Agg(Avg(Col("Grade")))""
  ""  .Write()""
  ""  .Parquet(@"C:\averageGrades.parquet");""
end note

dotnet_master -> dotnet_nuget: Invocations on objects in Microsoft.Spark
deactivate dotnet_master
activate dotnet_nuget
dotnet_nuget -> bridge: In binary, something similar to: \n ""{ref:123, m: "GroupBy", args:[arg1, arg2]}"" \n \t\t\t\t\t•••••••• \n ""{ref:125, m: "Parquet", args:["path"]}""
deactivate dotnet_nuget
activate bridge

bridge -> spark_master: Invocations on actual\nSpark objects
deactivate bridge

spark_master -> spark_master: Load data,\nGenerate execution graph,\nCreate RDD

spark_master -> spark_worker: Create tasks for processing partitions of the RDD in a distributed manner
activate spark_worker
return Processed result
spark_master -> spark_master: Aggregate results from workers\nWrite to Parquet
spark_master --> bridge
activate bridge
bridge --> dotnet_nuget
deactivate bridge
activate dotnet_nuget
dotnet_nuget --> dotnet_master
deactivate dotnet_nuget

activate dotnet_master
dotnet_master --> bridge: Execution complete
deactivate dotnet_master
activate bridge
bridge --> spark_master: Execution complete
deactivate bridge
return Execution complete
@enduml
@@ -0,0 +1,156 @@
@startuml Spark-dotnet-sequence-diagram-udf-data
title "Sequence Diagram for Processing a Pipeline with Spark .NET: UDF & Data Retrieval"

skinparam dpi 200
skinparam BoxPadding 10

actor "User" as user

box "Master Node"
  participant "Spark: Master" as spark_master
  participant "JVM<->.NET Bridge" as bridge
  participant "MyProgram.exe:\nUser .NET App" as dotnet_master
  participant "Microsoft.Spark\n(NuGet Package)" as dotnet_nuget
end box

box "Worker Node\n(One of Many)"
  participant "Spark: Worker" as spark_worker
  participant "Microsoft.Spark.Worker" as dotnet_worker
end box

user -> spark_master: Executes \n**spark-submit** microsoft-spark-xx.jar\n--files MyUdfs.dll MyProgram.zip
activate spark_master

spark_master -> bridge: Load and start executing JAR
activate bridge
bridge -> dotnet_master: Start MyProgram
deactivate bridge

activate dotnet_master
dotnet_master -> dotnet_nuget: Build SparkSession
deactivate dotnet_master
activate dotnet_nuget

dotnet_nuget -> bridge: Connect to socket,\nRequest Spark Session creation
activate bridge
bridge -> spark_master: Request Spark Session creation
return Reference to JVM object SparkSession
return Session
activate dotnet_master

group "Register UDF"
  note over dotnet_master
    ""var df = LoadDataFromSomeWhere();"" // This part is omitted
    ""Func<Column, Column> udfArray =""
    ""  Udf<string, string[]>(str => [str, $"{str}-{str.Length}"]);""
  end note

  dotnet_master -> dotnet_nuget: Func<> object
  deactivate dotnet_master
  activate dotnet_nuget
  dotnet_nuget -> dotnet_nuget: Serialize Func using binary serializer
  dotnet_nuget -> bridge: Invoke UDF creation,\nPass serialized UDF as a parameter
  deactivate dotnet_nuget
  activate bridge
  bridge -> spark_master: Register UDF as a PythonFunction,\nSpecify Microsoft.Spark.Worker.exe instead of Python.exe\nDeclare serialized UDF as an argument
  deactivate bridge

  spark_master -> spark_master: Register a Python UDF

  spark_master --> bridge
  activate bridge
  bridge --> dotnet_nuget: UDF JVM reference
  deactivate bridge

  activate dotnet_nuget
  dotnet_nuget --> dotnet_master
  deactivate dotnet_nuget
  activate dotnet_master
end

group "Invoke UDF"
  note over dotnet_master
    // Cache() is needed for immediate invocation;
    // otherwise the DataFrame is evaluated lazily, when needed
    ""var arrayDF =""
    ""  df.Select(Explode(udfArray(df["value"])))""
    ""    .Cache();""
  end note

  dotnet_master -> dotnet_nuget
  deactivate dotnet_master
  activate dotnet_nuget

  dotnet_nuget -> bridge: Pass calls to bridge
  deactivate dotnet_nuget
  activate bridge
  bridge -> spark_master: Load data,\nGenerate execution graph,\nCreate RDD
  deactivate bridge

  spark_master -> spark_worker: Create tasks for processing partitions of RDD
  activate spark_worker
  spark_worker -> dotnet_worker: Start process,\nInitiate socket connection,\nPass task content and serialized UDF
  activate dotnet_worker

  dotnet_worker -> dotnet_worker: Deserialize Func and execute it\nPass arguments received from Spark worker
  return UDF execution result
  return

  spark_master -> spark_master: Aggregate results from workers
  spark_master --> bridge
  activate bridge
  bridge --> dotnet_nuget
  deactivate bridge
  activate dotnet_nuget
  dotnet_nuget --> dotnet_master
  deactivate dotnet_nuget
  activate dotnet_master
end

group "Fetch Dataset in .NET Memory"
  note over dotnet_master
    ""var result =""
    ""  arrayDF.Collect().ToList();""
  end note

  dotnet_master -> dotnet_nuget: Collect dataset
  deactivate dotnet_master
  activate dotnet_nuget
  dotnet_nuget -> bridge: Request dataset collection
  deactivate dotnet_nuget

  activate bridge
  bridge -> spark_master: .Collect() request
  deactivate bridge

  spark_master --> bridge: Collected data
  activate bridge

  bridge --> dotnet_nuget: Collected data
  deactivate bridge
  activate dotnet_nuget
  dotnet_nuget -> bridge: Initiate broadcast of all rows via socket
  deactivate dotnet_nuget
  activate bridge

  bridge -> dotnet_nuget: Entire dataset serialized in Python Pickle format\n**Expensive operation**
  deactivate bridge

  activate dotnet_nuget
  dotnet_nuget --> dotnet_master: Deserialized row collection
  deactivate dotnet_nuget
  activate dotnet_master
end

dotnet_master --> bridge: Execution complete
deactivate dotnet_master
activate bridge
bridge --> spark_master: Execution complete
deactivate bridge
return Execution complete
@enduml
(7 binary image files added: diagram and logo PNGs of 1.33 KB, 10.9 KB, 11.6 KB, 26.4 KB, plus three more.)

docs/understanding-microsoft.spark.md (+33)
@@ -0,0 +1,33 @@

# Components Overview

The following component diagram gives a high-level overview of the key components participating in the lifecycle of a .NET for Apache Spark application:

![Component diagram](img/Spark-dotnet-integration-component-diagram.png)

This diagram illustrates the interaction between the Apache Spark components and the .NET components. The `Microsoft.Spark` NuGet package contains wrappers over the Scala JVM objects (each holding a reference to an internal object mapped in the bridge) and allows calling methods on those objects. Unless `Collect()` is called on a DataFrame, no data is actually loaded into the .NET app; the solution acts as a proxy for the Scala Spark API. Key components include (a minimal code sketch follows the list):

- **Spark Driver**: The main entry point for Spark jobs, responsible for managing the job lifecycle and distributing tasks to worker nodes.
- **Spark Worker**: Executes tasks sent by the driver, processing data and performing computations.
- **Microsoft.Spark.Worker**: A .NET executable that runs on worker nodes, allowing .NET UDFs to be executed as part of the Spark job.
- **User .NET Application**: Contains the user-defined code for interacting with Spark, including data processing pipelines, UDFs, and more.
- **JVM<->.NET Bridge**: Facilitates communication between the .NET application and the JVM-based Spark components.
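
As a rough sketch of how these components meet in user code, the fragment below shows the minimal shape of a user .NET application built on the `Microsoft.Spark` package (the program name, app name, and file path are illustrative, not part of this commit):

```csharp
// Minimal sketch of the "User .NET Application" component (illustrative names).
// The process is started by the Spark <-> .NET bridge after the user runs
// `spark-submit microsoft-spark-xxx.jar`.
using Microsoft.Spark.Sql;

class MyProgram
{
    static void Main(string[] args)
    {
        // Microsoft.Spark connects to the bridge's socket and requests session
        // creation; the returned object only wraps a JvmObjectReference, and
        // no Spark code runs inside this process.
        SparkSession spark = SparkSession
            .Builder()
            .AppName("my-spark-dotnet-app")
            .GetOrCreate();

        // Every call below is proxied over the socket to the JVM.
        DataFrame df = spark.Read().Parquet("data.parquet");
        df.Show();

        spark.Stop();
    }
}
```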

# Pipeline Processing Sequence

## Basic Sequence Diagram

The basic sequence diagram for the application lifecycle is depicted below:

![Basic sequence diagram](img/Spark-dotnet-sequence-diagram-simple.png)

This diagram shows the flow of control and data during the execution of a simple Spark pipeline with .NET, without UDFs or data retrieval. Note that Microsoft.Spark.Worker is never instantiated, and no actual pipeline data is transferred to .NET.
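
Written out as C#, the pipeline from the diagram's note looks like this (a sketch using the public `Microsoft.Spark` API; paths and column names are taken from the diagram):

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

static class SimplePipeline
{
    // The pipeline from the diagram. No rows ever enter the .NET process:
    // each call is forwarded through the bridge and executed by Spark.
    public static void Run(SparkSession spark)
    {
        spark
            .Read()
            .Parquet(@"C:\data.parquet")           // load
            .GroupBy(Col("Faculty"))               // group on the cluster
            .Agg(Avg(Col("Grade")))                // average grade per faculty
            .Write()
            .Parquet(@"C:\averageGrades.parquet"); // write the result back
    }
}
```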

## Sequence Diagram with UDFs and Data Retrieval

The sequence diagram below illustrates a more complex scenario involving user-defined functions (UDFs) and data retrieval:

![Sequence diagram with UDFs and data retrieval](img/Spark-dotnet-sequence-diagram-udf-data.png)

This diagram includes the steps for registering and invoking UDFs, as well as fetching the dataset into .NET memory.
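
In C#, the three groups in the diagram correspond to code along these lines (a sketch assembled from the diagram's notes; `df` is assumed to be a DataFrame with a string `value` column):

```csharp
using System;
using System.Linq;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

static class UdfPipeline
{
    public static void Run(DataFrame df)
    {
        // Register UDF: the lambda is serialized with a binary serializer and
        // registered with Spark as a "Python" function whose worker path
        // points at Microsoft.Spark.Worker.exe instead of a Python interpreter.
        Func<Column, Column> udfArray =
            Udf<string, string[]>(str => new[] { str, $"{str}-{str.Length}" });

        // Invoke UDF: per the diagram, Cache() is used so the result is
        // materialized up front rather than lazily. The UDF body itself runs
        // inside Microsoft.Spark.Worker on the worker nodes.
        DataFrame arrayDF = df
            .Select(Explode(udfArray(df["value"])))
            .Cache();

        // Fetch the dataset into .NET memory: rows travel back through the
        // bridge serialized in the Python pickle format, an expensive operation.
        var result = arrayDF.Collect().ToList();
    }
}
```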
