
Advanced RAG: Building Knowledge Graphs with Neo4j Practical Deep Dive

Shweta Gargade

Category: RAG

What is a Knowledge Graph?

A knowledge graph is a database designed to store information in a structured and interconnected manner using nodes and relationships. Both nodes and relationships have properties, and we assign labels to nodes to group them together. Relationships have types and directions, indicating how nodes are connected.

  • Nodes: Represent entities or concepts (e.g., people, places, objects).
  • Relationships: Connect nodes and define how they are related (e.g., “works at”, “lives in”).
  • Properties: Attributes or details about nodes and relationships (e.g., name, age, job title).
  • Labels: Group similar nodes together (e.g., all nodes representing people might be labeled “Person”).
  • Types and Directions: Define the nature of relationships and the direction of connection (e.g., “Person” -[works at]-> “Company”).

Let’s explore a simple knowledge graph with three nodes to see how it works. Imagine we want to represent the relationship between a person, their job, and their employer. Here’s how it might look:

Nodes:

  • Shweta (Person)
  • Data Scientist (Job)
  • HSBC (Company)

Relationships:

  • Shweta -[has job]-> Data Scientist
  • Shweta -[works at]-> HSBC
  • Data Scientist -[is position at]-> HSBC
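
As a rough sketch, the three nodes and three relationships above can be written down in plain Python and rendered as a single Cypher `CREATE` statement. The class names, variable names, and the upper-case relationship types (`HAS_JOB`, `WORKS_AT`, `IS_POSITION_AT`) are my own illustrative choices, not something the movie dataset or Neo4j prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    key: str    # variable name used inside the Cypher statement (hypothetical)
    label: str  # groups similar nodes, e.g. "Person"
    name: str   # stored as the `name` property


@dataclass(frozen=True)
class Relationship:
    start: Node
    rel_type: str  # relationship type, e.g. "WORKS_AT"
    end: Node      # relationships are directed: start -> end


def to_cypher(nodes, relationships):
    """Render one CREATE statement for the given nodes and relationships."""
    parts = [f'({n.key}:{n.label} {{name: "{n.name}"}})' for n in nodes]
    parts += [
        f"({r.start.key})-[:{r.rel_type}]->({r.end.key})" for r in relationships
    ]
    return "CREATE\n  " + ",\n  ".join(parts)


shweta = Node("shweta", "Person", "Shweta")
job = Node("job", "Job", "Data Scientist")
hsbc = Node("hsbc", "Company", "HSBC")

statement = to_cypher(
    [shweta, job, hsbc],
    [
        Relationship(shweta, "HAS_JOB", job),
        Relationship(shweta, "WORKS_AT", hsbc),
        Relationship(job, "IS_POSITION_AT", hsbc),
    ],
)
print(statement)
```

The printed statement can be pasted into the Neo4j Browser or passed to a graph connection to create this tiny graph.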

Querying a Knowledge Graph with Cypher Queries:

In this section, we’ll explore the default movie dataset in Neo4j and experiment with Cypher queries. I’ve included a few examples below — feel free to try them out and create your own queries to dive deeper!

import os

from dotenv import load_dotenv
from langchain_community.graphs import Neo4jGraph

# Load the connection settings from a local .env file
load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
NEO4J_DATABASE = os.getenv('NEO4J_DATABASE')

# Connect to the Neo4j instance
kg = Neo4jGraph(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database=NEO4J_DATABASE,
)
# Query 1: - Count all nodes in the graph
cypher = """MATCH (n) RETURN count(n)"""
kg.query(cypher)
# Ans: [{'count(n)': 171}]
# Query 2: - Match a single person by specifying the value of the `name` property on the `Person` node
cypher = """MATCH (tom:Person {name:"Tom Hanks"}) RETURN tom"""
kg.query(cypher)
#Ans: [{'tom': {'born': 1956, 'name': 'Tom Hanks'}}]
# Query 3: - Cypher patterns with conditional matching
cypher = """
MATCH (nineties:Movie)
WHERE nineties.released = 1990
RETURN nineties.title
"""
kg.query(cypher)
# Ans: [{'nineties.title': 'Joe Versus the Volcano'}]
# Query 4: - Pattern matching with multiple nodes
cypher = """
MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)
RETURN actor.name, movie.title LIMIT 1
"""
kg.query(cypher)
# Ans: [{'actor.name': 'Emil Eifrem', 'movie.title': 'The Matrix'}]
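# Query 5: Delete data from the graph
# Let's keep the person named 'Emil Eifrem' in our database but
# delete their 'ACTED_IN' relationship.
# First, inspect the relationship that will be removed:
cypher = """
MATCH (emil:Person {name:"Emil Eifrem"})-[actedIn:ACTED_IN]->(movie:Movie)
RETURN emil.name, movie.title
"""
kg.query(cypher)
# Then delete only the relationship; both nodes stay in the graph:
cypher = """
MATCH (emil:Person {name:"Emil Eifrem"})-[actedIn:ACTED_IN]->(movie:Movie)
DELETE actedIn
"""
kg.query(cypher)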
#Query 6: Adding data to the graph
cypher = """
CREATE (andreas:Person {name:"Andreas"})
RETURN andreas
"""
kg.query(cypher)
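
The queries above inline their values directly in the Cypher text. A small sketch of the alternative is to keep the value in a `$name` parameter and pass it separately, the same `params=` mechanism the embedding queries later in this post rely on. The helper names here are hypothetical, and the `MERGE` variant is my suggestion for avoiding a duplicate node if the "add data" query is run twice; it is not part of the original examples.

```python
def person_by_name(name):
    """Return a Cypher query and its parameters for looking up one Person node."""
    cypher = "MATCH (p:Person {name: $name}) RETURN p"
    return cypher, {"name": name}


def merge_person(name):
    """Return a query that creates the Person only if it does not exist yet."""
    cypher = "MERGE (p:Person {name: $name}) RETURN p"
    return cypher, {"name": name}


cypher, params = person_by_name("Tom Hanks")
merge_cypher, merge_params = merge_person("Andreas")
print(cypher, params)
print(merge_cypher, merge_params)

# With the `kg` connection from above, these would run as:
# kg.query(cypher, params=params)
# kg.query(merge_cypher, params=merge_params)
```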

Preparing Text Data for KG

In a RAG system, vector representations of text are used to match your query with relevant chunks stored in a vector database. Similarly, to find relevant text in a knowledge graph, we need to create embeddings of the text fields within the graph.

#OPENAI Embeddings API setup
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
kg.query("""
CREATE VECTOR INDEX movie_tagline_embeddings IF NOT EXISTS
FOR (m:Movie) ON (m.taglineEmbedding)
OPTIONS { indexConfig: {
`vector.dimensions`: 1536,
`vector.similarity_function`: 'cosine'
}}"""
)
# show vector INDEXES
kg.query("""
SHOW VECTOR INDEXES
"""
)
# Ans:
'''
[{'id': 3,
'name': 'movie_tagline_embeddings',
'state': 'ONLINE',
'populationPercent': 100.0,
'type': 'VECTOR',
'entityType': 'NODE',
'labelsOrTypes': ['Movie'],
'properties': ['taglineEmbedding'],
'indexProvider': 'vector-1.0',
'owningConstraint': None,
'lastRead': None,
'readCount': None}]
'''

- Calculate vector representation for each movie tagline using OpenAI
- Add vector to the `Movie` node as `taglineEmbedding` property

kg.query("""
    MATCH (movie:Movie) WHERE movie.tagline IS NOT NULL
    WITH movie, genai.vector.encode(
        movie.tagline,
        "OpenAI",
        {token: $openAiApiKey}
    ) AS vector
    CALL db.create.setNodeVectorProperty(movie, "taglineEmbedding", vector)
    """,
    params={"openAiApiKey": OPENAI_API_KEY},
)
result = kg.query("""
MATCH (m:Movie)
WHERE m.tagline IS NOT NULL
RETURN m.tagline, m.taglineEmbedding
LIMIT 1
"""
)
# Show the tagline text of the first matching movie
result[0]['m.tagline']
# Show the first two dimensions of its embedding
result[0]['m.taglineEmbedding'][:2]
# Ans: [0.01745212823152542,-0.005519301164895296]

- Calculate embedding for question
- Identify matching movies based on similarity of question and `taglineEmbedding` vectors

question = "What movies are about love?"
kg.query("""
    WITH genai.vector.encode(
        $question,
        "OpenAI",
        {token: $openAiApiKey}
    ) AS question_embedding
    CALL db.index.vector.queryNodes(
        'movie_tagline_embeddings',
        $top_k,
        question_embedding
    ) YIELD node AS movie, score
    RETURN movie.title, movie.tagline, score
    """,
    params={
        "openAiApiKey": OPENAI_API_KEY,
        "question": question,
        "top_k": 2,
    },
)
# Ans:
'''
[{'movie.title': 'Joe Versus the Volcano',
'movie.tagline': 'A story of love, lava and burning desire.',
'score': 0.9062923789024353},
{'movie.title': 'As Good as It Gets',
'movie.tagline': 'A comedy from the heart that goes for the throat.',
'score': 0.9022473096847534}]
'''
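
To make the ranking step less of a black box, here is a minimal pure-Python sketch of what a cosine-similarity top-k search does conceptually. The three-dimensional vectors and movie names are invented toy data (real OpenAI embeddings here have 1536 dimensions), and the score Neo4j reports may be scaled differently from the raw cosine computed below, so read this as an illustration of the ordering only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(question_vec, items, k):
    """items is a list of (title, vector) pairs; return the k best (title, score)."""
    scored = [(title, cosine_similarity(question_vec, vec)) for title, vec in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, invented for illustration only
tagline_vectors = [
    ("movie_a", [0.9, 0.1, 0.0]),
    ("movie_b", [0.0, 1.0, 0.0]),
    ("movie_c", [0.7, 0.7, 0.0]),
]
question_vec = [1.0, 0.0, 0.0]

results = top_k(question_vec, tagline_vectors, k=2)
print(results)
```

A vector index exists so that the database does not have to score every node like this brute-force loop does.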

Construct Knowledge Graph from Financial Documents

SEC Filing Data →

  • Publicly traded companies must file financial reports with the SEC every year.
  • A key document is the 10-K, an annual report of company activities.
  • These forms are public records that you can search in the SEC’s EDGAR database.
  • In upcoming sections, you’ll work with a 10-K form from NetApp.

I’ve attached a Jupyter Notebook link here for a deeper dive into the details. Feel free to explore it further.
Summary of the notebook →

  • After performing data cleaning, the file is saved in .json format, which will be used to construct the knowledge graph.
  • The document is split into sections using a LangChain TextSplitter, with a chunk size of 2,000 and an overlap of 200.
  • A knowledge graph is created where each chunk is represented as a node, with chunk metadata added as properties.
  • A vector index is generated by calculating text embedding vectors for each chunk.
  • Similarity search is then used to find the most relevant chunks.
  • A LangChain RAG workflow is then set up to create a QA chat system for interacting with the form.
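
The chunking step can be sketched without any dependencies. The function below is a simplified stand-in for a LangChain text splitter, not its actual algorithm: it cuts fixed character windows of 2,000 with a 200-character overlap (assuming the sizes are counted in characters), whereas LangChain splitters also try to break on separators. The metadata keys mirror the Chunk properties used later in this post; the `chunkId` format and the `form-0001` identifier are hypothetical.

```python
def split_text(text, chunk_size=2000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def chunks_with_metadata(form_id, item, text):
    """Attach the metadata that will become properties on each Chunk node."""
    return [
        {
            "text": chunk,
            "f10kItem": item,
            "chunkSeqId": seq,
            "formId": form_id,
            "chunkId": f"{form_id}-{item}-chunk{seq:04d}",  # hypothetical format
        }
        for seq, chunk in enumerate(split_text(text))
    ]


# Placeholder text standing in for one section of the cleaned 10-K JSON
sample = "".join(str(i % 10) for i in range(4500))
records = chunks_with_metadata("form-0001", "item1", sample)
print([(r["chunkSeqId"], len(r["text"])) for r in records])
```

Each dictionary in `records` corresponds to one Chunk node, with its keys stored as node properties.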

We already have the Chunk nodes that we created earlier. Next, we will create a new node that represents the Form 10-K itself.

Summary of the Notebook →

1. Create a Form 10-K Node:

  • Establish a node representing the entire Form 10-K.
  • Populate it with metadata from a single chunk of the form.

2. Create a Linked List of Chunk Nodes:

  • Collect the Chunk nodes within a section, ordered by chunkSeqId.
  • Store the ordered list in a variable called section_chunk_list.

3. Add NEXT Relationships Between Chunks:

  • Use Neo4j’s apoc.nodes.link to connect ordered Chunk nodes with a NEXT relationship.
  • Start by linking chunks in the “Item 1” section.

4. Create Relationships for All Sections:

  • Loop through and link chunks for all sections of the Form 10-K.

5. Connect Chunks to Parent Form:

  • Establish PART_OF relationships between chunks and their parent Form 10-K.

6. Add SECTION Relationships:

  • Connect the Form 10-K to the first chunk of each section with a SECTION relationship.

7. Customize Similarity Search with Cypher:

  • Extend the vector store to accept Cypher queries for tailored results.

8. Set Up Vector Store and QA System:

  • Configure the vector store to use the query, and instantiate a retriever and QA chain in LangChain. Ask questions to interact with the data.
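
Steps 2, 3, and 6 are easier to picture with a small sketch. The Cypher string below is my approximation of the linking query (the notebook's exact query may differ, and the `$formId` and `$f10kItem` parameter names are assumptions); the pure-Python functions underneath reproduce the same logic on toy data so the resulting NEXT pairs and section heads can be inspected without a database.

```python
from collections import defaultdict

# Approximate Cypher for steps 2-3; intended for kg.query(..., params={...})
LINK_SECTION_CHUNKS = """
MATCH (from_same_section:Chunk)
WHERE from_same_section.formId = $formId
  AND from_same_section.f10kItem = $f10kItem
WITH from_same_section ORDER BY from_same_section.chunkSeqId ASC
WITH collect(from_same_section) AS section_chunk_list
CALL apoc.nodes.link(section_chunk_list, "NEXT", {avoidDuplicates: true})
RETURN size(section_chunk_list)
"""


def next_pairs(chunks):
    """Return (from_chunkId, to_chunkId) pairs for consecutive chunks per section."""
    by_section = defaultdict(list)
    for chunk in chunks:
        by_section[chunk["f10kItem"]].append(chunk)
    pairs = []
    for section_chunks in by_section.values():
        ordered = sorted(section_chunks, key=lambda c: c["chunkSeqId"])
        pairs += [(a["chunkId"], b["chunkId"]) for a, b in zip(ordered, ordered[1:])]
    return pairs


def section_heads(chunks):
    """Return {f10kItem: chunkId} for the first chunk of each section (step 6)."""
    heads = {}
    for chunk in sorted(chunks, key=lambda c: c["chunkSeqId"]):
        heads.setdefault(chunk["f10kItem"], chunk["chunkId"])
    return heads


# Toy chunk records, deliberately out of order
chunks = [
    {"chunkId": "i1-c1", "f10kItem": "item1", "chunkSeqId": 1},
    {"chunkId": "i1-c0", "f10kItem": "item1", "chunkSeqId": 0},
    {"chunkId": "i1-c2", "f10kItem": "item1", "chunkSeqId": 2},
    {"chunkId": "i1a-c0", "f10kItem": "item1a", "chunkSeqId": 0},
]
pairs = next_pairs(chunks)
heads = section_heads(chunks)
print(pairs)
print(heads)
```

Keeping the chunks in a linked list per section means a retrieval query can later walk NEXT relationships to pull in neighbouring text around a matching chunk.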

Expanding the Knowledge Graph

Read the collection of Form 13s →
- Investment management firms must report on their investments in companies to the SEC by filing a document called Form 13
- You’ll load a collection of Form 13 for managers that have invested in NetApp
- You can check out the CSV file by navigating to the data directory using the File menu at the top of the notebook.

Create new Manager and Company nodes and the relationships between them.

We’ve evolved our Knowledge Graph significantly since we first began. Initially, we started with just chunks of the Form 10-K and connected them to a Form node. Now, we’ve expanded it to include nodes for the company and the managers, all connected together. Let’s check the updated schema together.

Node properties are the following:
1. Chunk {textEmbedding: LIST, f10kItem: STRING, chunkSeqId: INTEGER,
text: STRING, cik: STRING, cusip6: STRING, names: LIST,
formId: STRING, source: STRING, chunkId: STRING},
2. Form {cusip6: STRING, names: LIST, formId: STRING, source: STRING},
3. Company {cusip6: STRING, names: LIST, companyName: STRING, cusip: STRING},
4. Manager {managerName: STRING, managerCik: STRING, managerAddress: STRING}
Relationship properties are the following:
SECTION {f10kItem: STRING},
OWNS_STOCK_IN {shares: INTEGER, reportCalendarOrQuarter: STRING, value: FLOAT}
The relationships are the following:
(:Chunk)-[:NEXT]->(:Chunk),
(:Chunk)-[:PART_OF]->(:Form),
(:Form)-[:SECTION]->(:Chunk),
(:Company)-[:FILED]->(:Form),
(:Manager)-[:OWNS_STOCK_IN]->(:Company)

I’ve attached a Jupyter Notebook link here for a deeper dive into the details. Feel free to explore it further.

Chat with Knowledge Graph

We start with a small graph (a Minimal Viable Graph, or MVG), then extract, enhance, expand, and repeat to grow the graph.

Extract: Identify interesting information and move it into separate nodes.

Enhance: Supercharge the data with vector embeddings.

Expand: Connect information to expand context.

Extract, Enhance and Expand with SEC Documents

In this Jupyter notebook, I’ve added a few Cypher queries to interact with the graph. Writing efficient Cypher queries with limited knowledge of the language can be challenging. However, we can use GPT-3.5 or any other LLM to help generate these queries. Few-shot learning in the prompt, where a few example queries act as a guide, improves the LLM’s output. Integrating this with LangChain via the GraphCypherQAChain keeps the process straightforward.
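
A minimal sketch of such a few-shot prompt, built with plain string formatting, might look like the following. The relationship patterns come from the schema listed earlier; the example questions, the example queries, and the prompt wording are all invented for illustration and are not the notebook's actual prompt.

```python
# Relationship patterns taken from the schema shown earlier in this post
SCHEMA = """\
(:Chunk)-[:PART_OF]->(:Form)
(:Company)-[:FILED]->(:Form)
(:Manager)-[:OWNS_STOCK_IN]->(:Company)"""

# Hypothetical few-shot examples; questions and queries are illustrative only
EXAMPLES = [
    {
        "question": "Which managers own stock in NetApp?",
        "cypher": (
            "MATCH (m:Manager)-[:OWNS_STOCK_IN]->(c:Company) "
            "WHERE toLower(c.companyName) CONTAINS 'netapp' "
            "RETURN m.managerName"
        ),
    },
    {
        "question": "How many chunks does each form have?",
        "cypher": "MATCH (c:Chunk)-[:PART_OF]->(f:Form) RETURN f.formId, count(c)",
    },
]


def build_cypher_prompt(question, schema=SCHEMA, examples=EXAMPLES):
    """Assemble a few-shot prompt asking an LLM to write a Cypher query."""
    lines = [
        "Task: Generate a Cypher statement to query a graph database.",
        "Use only the relationship types and properties in the schema.",
        "",
        "Schema:",
        schema,
        "",
        "Examples:",
    ]
    for example in examples:
        lines.append(f"# {example['question']}")
        lines.append(example["cypher"])
    lines += ["", f"Question: {question}", "Cypher:"]
    return "\n".join(lines)


prompt = build_cypher_prompt("Which manager owns the most shares?")
print(prompt)
```

The resulting string can be sent to whichever LLM you use, or adapted into the Cypher-generation prompt of a chain such as GraphCypherQAChain.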



Tags: LLM, GenAI, NLP, RAG, OpenSourceAI, openai, chatgpt, Knowledge Graph


© VisionNLP LLP 2020 - 2025