bolt4jr
is an R package designed to efficiently query,
extract, and process large-scale network data from Neo4j databases using
the Bolt protocol, with built-in support for batch processing and data
frame conversion.
bolt4jr
is an R package that facilitates interaction
with Neo4j databases using the Bolt protocol. It allows users to
efficiently query nodes and edges in a Neo4j graph database, convert
results into data frames, and process large datasets in batches. The
package is especially useful for extracting large-scale network data for
bioinformatics, computational biology, and other applications.
This README provides a comprehensive guide to installing and using
the bolt4jr
package for extracting network data from
Neo4j.
To install the bolt4jr
package directly from its GitHub
repository, use the remotes
package:
# Install remotes if not already installed
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
# Install bolt4jr from GitHub
remotes::install_github("Broccolito/bolt4jr")
To securely store your Neo4j connection details (URI, username, and password), you can use environment variables. This ensures that sensitive information is not hard-coded in your scripts.
Open your .Renviron
file:
usethis::edit_r_environ()
Add the following lines to the file, replacing placeholders with your connection details:
NEO4J_URI=bolt://<YOUR_NEO4J_URI>
NEO4J_USER=<YOUR_USERNAME>
NEO4J_PASSWORD=<YOUR_PASSWORD>
Save the file and restart your R session to load the environment variables.
Access the stored variables in R:
uri = Sys.getenv("NEO4J_URI")
username = Sys.getenv("NEO4J_USER")
password = Sys.getenv("NEO4J_PASSWORD")
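Sys.getenv() returns an empty string for a variable that is not set, so a missing entry in .Renviron can go unnoticed until the connection fails. The following is a minimal sketch of a check to run before connecting. The helper name missing_neo4j_vars is hypothetical and is not part of bolt4jr.

```r
# Hypothetical helper (not part of bolt4jr): return the names of the
# connection variables that are unset or empty in the current R session.
missing_neo4j_vars <- function(vars = c("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")) {
  # unset = NA lets us tell "not set" apart from a real value
  values <- Sys.getenv(vars, unset = NA)
  vars[is.na(values) | !nzchar(values)]
}

missing_vars <- missing_neo4j_vars()
if (length(missing_vars) > 0) {
  message("Missing environment variables: ", paste(missing_vars, collapse = ", "))
}
```

If the message appears after editing .Renviron, the R session has most likely not been restarted yet.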
Set up conda environment
setup_bolt4jr()
This function initializes the Conda environment required for the
bolt4jr package. If no Conda binary is found, it installs
Miniconda. If the required Conda environment (bolt4jr) is
not found, it creates the environment and installs the necessary
dependencies.
To query nodes from a Neo4j database, use the run_query
function. Here’s an example:
library(bolt4jr)
# Query nodes
nodes = run_query(
  uri = uri,
  user = username,
  password = password,
  query = "
  MATCH (n)-[r]-(m)
  WHERE type(r) IN ['ISA_AiA', 'PARTOF_ApA']
  RETURN DISTINCT elementId(n) AS node_id, n
  LIMIT 1000"
)
# Examine the structure of the result
unlist(nodes[[1]])
$node_id
[1] "4:c77f6410-bc08-43ba-a172-0503ab1c93db:0"
$n.identifier
[1] "UBERON:0003233"
$n.name
[1] "epithelium of shoulder"
$n.mesh_id
[1] ""
$n.source
[1] "Uberon"
nodes = convert_df(
  nodes,
  field_names = c("node_id", "n.identifier", "n.name", "n.source")
)
# View the resulting data frame
head(nodes)
node_id | n.identifier | n.name | n.source |
---|---|---|---|
4:c77f6410-bc08-43ba-a172-0503ab1c93db:0 | UBERON:0003233 | epithelium of shoulder | Uberon |
4:c77f6410-bc08-43ba-a172-0503ab1c93db:1 | UBERON:2001901 | ceratobranchial 3 element | Uberon |
4:c77f6410-bc08-43ba-a172-0503ab1c93db:2 | UBERON:0004321 | middle phalanx of manual digit 3 | Uberon |
4:c77f6410-bc08-43ba-a172-0503ab1c93db:3 | UBERON:0002414 | lumbar vertebra | Uberon |
4:c77f6410-bc08-43ba-a172-0503ab1c93db:4 | UBERON:2005118 | middle lateral line primordium | Uberon |
4:c77f6410-bc08-43ba-a172-0503ab1c93db:5 | UBERON:0034769 | lymphomyeloid tissue | Uberon |
Because node_id is explicitly aliased in the query, and n.identifier,
n.name, and n.source are common attributes of the returned nodes, the
extraction works correctly. If mismatched field names are provided, the
function may fail.
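One way to catch mismatched field names early is to compare them against the flattened names of a single record before converting. The sketch below uses a hand-built record shaped like the example output above. The nested-list layout is an assumption made for illustration, not a documented bolt4jr format.

```r
# Hand-built record shaped like the example output (illustrative only).
record <- list(
  node_id = "4:c77f6410-bc08-43ba-a172-0503ab1c93db:0",
  n = list(
    identifier = "UBERON:0003233",
    name = "epithelium of shoulder",
    mesh_id = "",
    source = "Uberon"
  )
)

# unlist() joins nested names with a dot, e.g. n$identifier -> "n.identifier".
available <- names(unlist(record))

# Field names we intend to request; "n.label" is deliberately wrong.
wanted <- c("node_id", "n.identifier", "n.name", "n.label")

# Names that would not be found in the record.
unmatched <- setdiff(wanted, available)
print(unmatched)
```

With real query results, the same comparison can be made against names(unlist(nodes[[1]])).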
Similarly, you can query edges:
# Query edges
edges = run_query(
  uri = uri,
  user = username,
  password = password,
  query = "
  MATCH (n)-[r]-(m)
  WHERE type(r) IN ['ISA_AiA', 'PARTOF_ApA']
  RETURN DISTINCT
    elementId(r) AS edge_id,
    elementId(startNode(r)) AS start_node_id,
    elementId(endNode(r)) AS end_node_id,
    r
  LIMIT 1000"
)
# Examine the structure of the result
unlist(edges[[1]])
# Extract specific fields and convert to a data frame
edges = convert_df(
  edges,
  field_names = c("edge_id", "start_node_id", "end_node_id")
)
# View the resulting data frame
head(edges)
edge_id | start_node_id | end_node_id |
---|---|---|
4:c77f6410-bc08-43ba-a172-0503ab1c93db:10 | 4:c77f6410-bc08-43ba-a172-0503ab1c93db:0 | 4:c77f6410-bc08-43ba-a172-0503ab1c93db:1 |
4:c77f6410-bc08-43ba-a172-0503ab1c93db:11 | 4:c77f6410-bc08-43ba-a172-0503ab1c93db:2 | 4:c77f6410-bc08-43ba-a172-0503ab1c93db:3 |
Since all field names (edge_id, start_node_id, and end_node_id) are
explicitly specified in the query, the extraction will work correctly. If
mismatched field names are provided, the function may fail.
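Once both tables exist, it can be worth checking that every edge endpoint also appears in the node table, since the two queries are limited independently. The sketch below rebuilds small data frames by hand from the rows shown in the tables above. The names nodes_df and edges_df are illustrative only.

```r
# Rebuild the example rows by hand (IDs taken from the tables above).
prefix <- "4:c77f6410-bc08-43ba-a172-0503ab1c93db:"
nodes_df <- data.frame(
  node_id = paste0(prefix, 0:5),
  stringsAsFactors = FALSE
)
edges_df <- data.frame(
  edge_id = paste0(prefix, c(10, 11)),
  start_node_id = paste0(prefix, c(0, 2)),
  end_node_id = paste0(prefix, c(1, 3)),
  stringsAsFactors = FALSE
)

# Collect every endpoint and list those missing from the node table.
endpoints <- unique(c(edges_df$start_node_id, edges_df$end_node_id))
dangling <- setdiff(endpoints, nodes_df$node_id)
print(length(dangling))
```

A non-empty result would suggest that the node query needs a larger limit, or that the network should be extracted in batches.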
For large networks, you can use the run_batch_query
function to process data in chunks. This function appends results to a
file incrementally, minimizing memory usage.
run_batch_query(
uri = uri,
user = username,
password = password,
query = "
MATCH (n)-[r]-(m)
WHERE type(r) IN ['ISA_AiA', 'PARTOF_ApA']
RETURN DISTINCT
elementId(r) AS edge_id,
elementId(startNode(r)) AS start_node_id,
elementId(endNode(r)) AS end_node_id,
r",
field_names = c("edge_id", "start_node_id", "end_node_id"),
filename = "edges.tsv",
batch_size = 1000
)
run_batch_query(
uri = uri,
user = username,
password = password,
query = "
MATCH (n)-[r]-(m)
WHERE type(r) IN ['ISA_AiA', 'PARTOF_ApA']
RETURN DISTINCT elementId(n) AS node_id, n",
field_names = c("node_id", "n.identifier", "n.name", "n.source"),
filename = "nodes.tsv",
batch_size = 1000
)
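The idea of appending results to a file incrementally can be illustrated with base R alone. This is a generic sketch of chunked appending, not the implementation used by run_batch_query. The helper append_in_batches is hypothetical, and it slices an in-memory data frame of synthetic rows where a real batch query would fetch each chunk from the database.

```r
# Hypothetical helper: write a data frame to a TSV file in chunks,
# creating the file on the first chunk and appending afterwards.
append_in_batches <- function(df, filename, batch_size = 1000) {
  if (nrow(df) == 0) return(invisible(filename))
  starts <- seq(1, nrow(df), by = batch_size)
  for (i in seq_along(starts)) {
    rows <- starts[i]:min(starts[i] + batch_size - 1, nrow(df))
    is_first <- i == 1
    write.table(
      df[rows, , drop = FALSE],
      file = filename,
      sep = "\t",
      quote = FALSE,
      row.names = FALSE,
      col.names = is_first,  # header only once
      append = !is_first     # first chunk overwrites, later chunks append
    )
  }
  invisible(filename)
}

# Synthetic rows, for demonstration only.
demo <- data.frame(
  edge_id = seq_len(2500),
  start_node_id = seq_len(2500),
  end_node_id = seq_len(2500) + 1L
)
out_file <- tempfile(fileext = ".tsv")
append_in_batches(demo, out_file, batch_size = 1000)

# Read the file back to confirm that all chunks were written.
written <- read.delim(out_file, stringsAsFactors = FALSE)
print(nrow(written))
```

Only one chunk is passed to write.table at a time, which is what keeps memory use low when each chunk comes from a separate query.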
The convert_df
function simplifies converting Neo4j
query results into R data frames.
# Convert query results to a data frame
nodes = convert_df(
  nodes,
  field_names = c("node_id", "n.identifier", "n.name", "n.source")
)
# View the data frame
head(nodes)
As with non-batch queries, make sure that every field name either appears in the Neo4j query or is a common attribute. If mismatched field names are provided, the function may fail.
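To make the role of field names concrete, here is a rough base R sketch of this kind of conversion. It is a stand-in for illustration only. convert_df may work differently, the name records_to_df is hypothetical, and the nested-list record layout is assumed from the example output shown earlier.

```r
# Hypothetical stand-in for illustration; bolt4jr's convert_df may differ.
records_to_df <- function(records, field_names) {
  rows <- lapply(records, function(rec) {
    flat <- unlist(rec)  # nested names become "n.name", etc.
    if (is.null(flat)) flat <- character(0)
    values <- as.character(flat[field_names])  # unknown names yield NA
    names(values) <- field_names
    values
  })
  as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
}

# Two hand-built records using values from the node table above.
records <- list(
  list(
    node_id = "4:c77f6410-bc08-43ba-a172-0503ab1c93db:0",
    n = list(identifier = "UBERON:0003233", name = "epithelium of shoulder", source = "Uberon")
  ),
  list(
    node_id = "4:c77f6410-bc08-43ba-a172-0503ab1c93db:3",
    n = list(identifier = "UBERON:0002414", name = "lumbar vertebra", source = "Uberon")
  )
)

df <- records_to_df(records, c("node_id", "n.name", "n.missing"))
print(df)
```

In this sketch an unknown field name produces a column of NA values instead of an error, which makes mismatches easy to spot in the output.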
Make sure the .Renviron file is saved correctly and restart your R
session.
Use run_batch_query for datasets exceeding memory limits.
Contributions to bolt4jr are welcome! Submit issues or
pull requests on the GitHub repository.
Alternatively, please contact
Wanjun Gu
for questions and clarifications.