TransWikia.com

Large Graphs: NetworkX distributed alternative

Data Science Asked by 20roso on November 10, 2021

I have built some implementations using NetworkX (the Python graph module) and its native algorithms, in which I output some attributes that I then use for classification purposes.

I want to scale this to a distributed environment. I have seen many options, such as Neo4j, GraphX and GraphLab. However, I am quite new to this, so I want to ask: with which of them would it be easy to apply graph algorithms locally (e.g. node centrality measures), preferably from Python? To be more specific, which available option is closest to NetworkX (easy installation, ready-made functions/algorithms, ML-friendliness)?
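For context, the kind of per-node attribute extraction described above can be sketched in plain Python without any library at all (a minimal degree-centrality computation on a made-up toy edge list, using NetworkX's definition of degree centrality, degree divided by n − 1):

```python
# Degree centrality (NetworkX's definition: degree / (n - 1)) computed
# from a plain edge list -- a toy stand-in for the NetworkX pipeline.
from collections import defaultdict

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d")]  # hypothetical graph

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

n = len(adj)
centrality = {node: len(neigh) / (n - 1) for node, neigh in adj.items()}
print(centrality)  # per-node scores of the sort one might feed to a classifier
```

With NetworkX itself this is a one-liner (`nx.degree_centrality(G)`); the point of scaling out is computing such measures when the graph no longer fits one machine.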

3 Answers

GraphBLAS: graph algorithms in the language of linear algebra

There was a lot of innovation in distributed and parallel graph algorithms between 2017 and 2020.

GraphBLAS itself provides the building blocks for creating more advanced algorithms.

If you want more sophisticated algorithms, like PageRank and Connected Components, check out LAGraph: a library plus a test harness for collecting algorithms that use GraphBLAS.

Oh, and check out GraphBLAST by Gunrock, in case you are interested in running graph algorithms on your GPU.

Since you are using Python, you probably want to use the GraphBLAS bindings for Python.
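The core idea behind GraphBLAS is expressing graph traversal as sparse matrix operations over semirings. A minimal sketch of that idea in plain Python (BFS as repeated vector-matrix products over the boolean (OR, AND) semiring; this illustrates the concept, it is not the GraphBLAS API):

```python
# BFS expressed as vector-matrix products over the (OR, AND) semiring --
# the linear-algebra view of graphs that GraphBLAS builds on, sketched
# in plain Python lists rather than real sparse matrices.

# Adjacency matrix of a small directed graph: A[i][j] = 1 means edge i -> j.
A = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
n = len(A)

frontier = [1, 0, 0, 0]   # start BFS from node 0
visited = frontier[:]
level = {0: 0}            # BFS depth of each reached node
depth = 0
while any(frontier):
    depth += 1
    # one "vector-matrix multiply": OR over (frontier[i] AND A[i][j]),
    # masked by the set of already-visited nodes
    nxt = [0] * n
    for j in range(n):
        if not visited[j] and any(frontier[i] and A[i][j] for i in range(n)):
            nxt[j] = 1
            visited[j] = 1
            level[j] = depth
    frontier = nxt
print(level)
```

A real GraphBLAS implementation does exactly this step with sparse data structures and swappable semirings, which is what makes it fast and parallelizable.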

Answered by Tobias Bergkvist on November 10, 2021

These days, Apache Spark offers a powerful Python API called PySpark, and you can set up GraphFrames directly from the PySpark command line. Launch it from your shell:

pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

and then develop your code entirely in Python using the GraphFrames API. Try the following example:

# Create a Vertex DataFrame with unique ID column "id"
v = sqlContext.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
    ], ["id", "name", "age"])

# Create an Edge DataFrame with "src" and "dst" columns
e = sqlContext.createDataFrame([
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
    ], ["src", "dst", "relationship"])

# Create a GraphFrame
from graphframes import *
g = GraphFrame(v, e)

# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()

# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()

Above, we computed PageRank on the graph g. Several algorithms are already implemented in PySpark with GraphFrames. I hope this helps.
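To see what the `g.pageRank(...)` call is computing, here is a simplified power-iteration sketch in plain Python on the same three-node toy graph, with the same reset probability of 0.01 and 20 iterations (this is an illustration of the algorithm, not GraphFrames' implementation, and it normalizes ranks to sum to 1 rather than using Spark's scaling):

```python
# Simplified PageRank power iteration on the toy graph above
# (edges a->b, b->c, c->b), mirroring resetProbability=0.01, maxIter=20.
edges = [("a", "b"), ("b", "c"), ("c", "b")]
nodes = sorted({x for e in edges for x in e})
out = {u: [v for (s, v) in edges if s == u] for u in nodes}

reset, n = 0.01, len(nodes)
rank = {v: 1.0 / n for v in nodes}
for _ in range(20):
    new = {v: reset / n for v in nodes}
    for u in nodes:
        if out[u]:
            # distribute u's rank evenly across its out-edges
            share = (1 - reset) * rank[u] / len(out[u])
            for v in out[u]:
                new[v] += share
        else:
            # dangling node: spread its mass uniformly
            for v in nodes:
                new[v] += (1 - reset) * rank[u] / n
    rank = new
print(rank)
```

Node "a" has no in-links, so its rank quickly collapses to the reset floor, while "b" and "c" trade the bulk of the mass back and forth.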

Answered by Emanuel Fontelles on November 10, 2021

Good old, unsolved question! Distributed processing of large graphs, as far as I know (speaking as a graph guy), has two different approaches: with knowledge of Big Data frameworks, or without it.

The SNAP library from Jure Leskovec's group at Stanford is originally in C++ but also has a Python API (check whether the Python API covers what you need, or whether you have to fall back on C++). Using SNAP you can do many things on massive networks without any special knowledge of Big Data technologies, so I would say it is the easiest option.

Using Apache GraphX is wonderful only if you have experience in Scala, because there is no Python API for it. It comes with a large stack of built-in algorithms, including centrality measures. So it is the second easiest, in case you know Scala.

When I last looked at GraphLab, a long time ago, it was commercial. I see it has since gone open source, so maybe you know better than me, but from my outdated knowledge I remember that it does not support a wide range of algorithms, and if you need an algorithm that is not there, it might get complicated to implement. On the other hand, it uses Python, which is nice. Do check it again, though, as my knowledge is about three years old.

If you are familiar with Big Data frameworks, Giraph and Gradoop are two great options. Both do a fantastic job, but you need to know some Big Data architecture, e.g. working on a Hadoop platform.

PS

1) I have used plain NetworkX plus multiprocessing to process the DBLP network (about 400,000 nodes) in a distributed fashion, and it worked well. So first establish HOW BIG your graph really is.

2) All in all, I think the SNAP library is the handy choice.
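The approach in point 1 amounts to splitting the node set into chunks and computing a per-node measure for each chunk in parallel. A minimal sketch (a plain adjacency dict with made-up data stands in for the NetworkX graph, and `concurrent.futures` stands in for the worker pool):

```python
# Sketch of point 1: partition the node set and compute a per-node
# measure for each partition in parallel workers. A plain adjacency
# dict stands in for the NetworkX graph here.
from concurrent.futures import ThreadPoolExecutor

adj = {                      # toy undirected graph (hypothetical data)
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a", "d"},
    "d": {"c"},
}

def degrees(chunk):
    # per-node measure; swap in any NetworkX-style computation
    return {node: len(adj[node]) for node in chunk}

nodes = list(adj)
chunks = [nodes[i::2] for i in range(2)]   # two roughly equal chunks

with ThreadPoolExecutor(max_workers=2) as pool:
    results = {}
    for part in pool.map(degrees, chunks):
        results.update(part)
print(results)
```

For CPU-bound NetworkX computations you would use `multiprocessing.Pool` instead of threads (threads share the GIL); the partition-then-merge shape is the same either way.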

Answered by Kasra Manshaei on November 10, 2021
