A knowledge graph is a database designed to store information in a structured and interconnected manner using nodes and relationships. Both nodes and relationships have properties, and we assign labels to nodes to group them together. Relationships have types and directions, indicating how nodes are connected.
Let’s explore a simple knowledge graph with three nodes to see how it works. Imagine we want to represent the relationship between a person, their job, and their employer. Here’s how it might look:
Nodes:
- Shweta (Person)
- Data Scientist (Job)
- HSBC (Company)

Relationships:
- Shweta -[has job]-> Data Scientist
- Shweta -[works at]-> HSBC
- Data Scientist -[is position at]-> HSBC
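To make the idea concrete, here is a minimal sketch of the same three-node graph in plain Python. The `Node` and `Relationship` classes and the `outgoing` helper are names I made up for illustration; a real graph database stores and indexes this very differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    label: str  # the label groups nodes, e.g. "Person"


@dataclass(frozen=True)
class Relationship:
    start: Node  # direction runs from start to end
    type: str
    end: Node


shweta = Node("Shweta", "Person")
job = Node("Data Scientist", "Job")
company = Node("HSBC", "Company")

relationships = [
    Relationship(shweta, "has job", job),
    Relationship(shweta, "works at", company),
    Relationship(job, "is position at", company),
]


def outgoing(node, rels):
    """Return (type, target name) pairs for relationships that start at `node`."""
    return [(r.type, r.end.name) for r in rels if r.start == node]


print(outgoing(shweta, relationships))
# [('has job', 'Data Scientist'), ('works at', 'HSBC')]
```

Because relationships are directed, asking for the outgoing relationships of HSBC returns an empty list here, even though two relationships point at it.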
In this section, we’ll explore the default movie dataset in Neo4j and experiment with Cypher queries. I’ve included a few examples below — feel free to try them out and create your own queries to dive deeper!
from dotenv import load_dotenv
import os
from langchain_community.graphs import Neo4jGraph
load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
NEO4J_DATABASE = os.getenv('NEO4J_DATABASE')
kg = Neo4jGraph(
url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE
)
# Query 1: - Match all nodes in the graph
cypher = """MATCH (n) RETURN count(n)"""
kg.query(cypher)
# Ans: [{'count(n)': 171}]
# Query 2: - Match a single person by specifying the value of the `name` property on the `Person` node
cypher = """MATCH (tom:Person {name:"Tom Hanks"}) RETURN tom"""
kg.query(cypher)
# Ans: [{'tom': {'born': 1956, 'name': 'Tom Hanks'}}]

# Query 3: - Cypher patterns with conditional matching
cypher = """
MATCH (nineties:Movie)
WHERE nineties.released = 1990
RETURN nineties.title
"""
kg.query(cypher)
# Ans: [{'nineties.title': 'Joe Versus the Volcano'}]

# Query 4: - Pattern matching with multiple nodes
cypher = """
MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)
RETURN actor.name, movie.title LIMIT 1
"""
kg.query(cypher)
# Ans: [{'actor.name': 'Emil Eifrem', 'movie.title': 'The Matrix'}]

# Query 5: Delete data from the graph
# Let's keep the person named 'Emil Eifrem' in our database but
# delete their 'ACTED_IN' relationship
cypher = """
MATCH (emil:Person {name:"Emil Eifrem"})-[actedIn:ACTED_IN]->(movie:Movie)
DELETE actedIn
"""
kg.query(cypher)

# Query 6: Adding data to the graph
cypher = """
CREATE (andreas:Person {name:"Andreas"})
RETURN andreas
"""
kg.query(cypher)
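The queries above hard-code values such as "Tom Hanks" inside the Cypher string. A small sketch of the alternative: keep the value in a `$name` parameter and pass it through the `params` argument of `kg.query`. The helper name `person_by_name` is hypothetical, and the snippet only builds the query, so it runs without a database.

```python
def person_by_name(name):
    """Build a parameterized Cypher query and its params dict (hypothetical helper)."""
    cypher = "MATCH (p:Person {name: $name}) RETURN p"
    return cypher, {"name": name}


cypher, params = person_by_name("Tom Hanks")
print(cypher)
print(params)
# With a live connection this would run as: kg.query(cypher, params=params)
```

Keeping values out of the query text means the same query string can be reused for any name, and quotes inside a name cannot break the query.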
In a RAG system, vector representations of text are used to match your query with relevant chunks stored in a vector database. Similarly, to find relevant text in a knowledge graph, we need to create embeddings of the text fields within the graph.
#OPENAI Embeddings API setup
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
kg.query("""
CREATE VECTOR INDEX movie_tagline_embeddings IF NOT EXISTS
FOR (m:Movie) ON (m.taglineEmbedding)
OPTIONS { indexConfig: {
`vector.dimensions`: 1536,
`vector.similarity_function`: 'cosine'
}}"""
)

# Show vector indexes
kg.query("""
SHOW VECTOR INDEXES
"""
)
# Ans:
'''
[{'id': 3,
'name': 'movie_tagline_embeddings',
'state': 'ONLINE',
'populationPercent': 100.0,
'type': 'VECTOR',
'entityType': 'NODE',
'labelsOrTypes': ['Movie'],
'properties': ['taglineEmbedding'],
'indexProvider': 'vector-1.0',
'owningConstraint': None,
'lastRead': None,
'readCount': None}]
'''
- Calculate vector representation for each movie tagline using OpenAI
- Add vector to the `Movie` node as `taglineEmbedding` property
kg.query("""
MATCH (movie:Movie) WHERE movie.tagline IS NOT NULL
WITH movie, genai.vector.encode(
movie.tagline,
"OpenAI",
{
token: $openAiApiKey
}) AS vector
CALL db.create.setNodeVectorProperty(movie, "taglineEmbedding", vector)
""",
params={"openAiApiKey":OPENAI_API_KEY} )
result = kg.query("""
MATCH (m:Movie)
WHERE m.tagline IS NOT NULL
RETURN m.tagline, m.taglineEmbedding
LIMIT 1
"""
)
# See the tagline
result[0]['m.tagline']

# See the first two values of the embedding
result[0]['m.taglineEmbedding'][:2]
# Ans: [0.01745212823152542,-0.005519301164895296]
- Calculate embedding for question
- Identify matching movies based on similarity of question and `taglineEmbedding` vectors
question = "What movies are about love?"
kg.query("""
WITH genai.vector.encode(
$question,
"OpenAI",
{token: $openAiApiKey}) AS question_embedding
CALL db.index.vector.queryNodes(
'movie_tagline_embeddings',
$top_k,
question_embedding
) YIELD node AS movie, score
RETURN movie.title, movie.tagline, score
""",
params={"openAiApiKey":OPENAI_API_KEY,
"question": question,
"top_k": 2
})
# Ans:
[{'movie.title': 'Joe Versus the Volcano',
'movie.tagline': 'A story of love, lava and burning desire.',
'score': 0.9062923789024353},
{'movie.title': 'As Good as It Gets',
'movie.tagline': 'A comedy from the heart that goes for the throat.',
'score': 0.9022473096847534}]
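To see what the similarity search above is doing conceptually, here is a brute-force sketch in plain Python: score every stored vector against the question vector with cosine similarity, then keep the best `top_k`. The function names, movie titles, and 3-dimensional vectors are invented for illustration (real OpenAI embeddings here have 1536 dimensions), and a vector index avoids comparing against every node, so treat this as a mental model only. The score Neo4j reports may also be scaled differently from the raw cosine value.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(question_vec, items, k):
    """Score every (title, vector) pair and return the k best, highest first."""
    scored = [(title, cosine_similarity(question_vec, vec)) for title, vec in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, invented for illustration
movies = [
    ("Movie A", [1.0, 0.0, 0.0]),
    ("Movie B", [0.0, 1.0, 0.0]),
    ("Movie C", [0.6, 0.8, 0.0]),
]
question_vec = [1.0, 0.0, 0.0]

results = top_k(question_vec, movies, k=2)
for title, score in results:
    print(title, round(score, 3))
```

"Movie A" points in exactly the same direction as the question vector, so it ranks first; "Movie C" overlaps partially and comes second; "Movie B" is orthogonal and is dropped.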
SEC Filing Data →
I’ve attached a link to the Jupyter Notebook here for a deeper dive into the details. Feel free to explore it further.
Summary of the notebook →
We already have the Chunk nodes that we created earlier. Next, we will create a new node that represents the Form 10-K itself.
2. Create a Linked List of Chunk Nodes: order the chunks by `chunkSeqId` and collect them into `section_chunk_list`.
3. Add NEXT Relationships Between Chunks: use `apoc.nodes.link` to connect ordered Chunk nodes with a NEXT relationship.
4. Create Relationships for All Sections
5. Connect Chunks to Parent Form: create PART_OF relationships between chunks and their parent Form 10-K.
6. Add SECTION Relationships: connect the Form to its chunks with a SECTION relationship.
7. Customize Similarity Search with Cypher
8. Set Up Vector Store and QA System
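The linked-list steps above can be sketched in plain Python: sort the chunks by `chunkSeqId`, then pair each chunk with its successor. This only mimics the idea behind linking ordered nodes with NEXT; the `link_chunks` helper and the sample chunk IDs are invented, and the notebook does the real work inside Neo4j.

```python
def link_chunks(chunks):
    """Order chunks by chunkSeqId and pair each one with its successor."""
    ordered = sorted(chunks, key=lambda c: c["chunkSeqId"])
    return [
        (current["chunkId"], "NEXT", following["chunkId"])
        for current, following in zip(ordered, ordered[1:])
    ]


# Invented sample chunks, deliberately out of order
chunks = [
    {"chunkId": "form-item1-chunk0002", "chunkSeqId": 2},
    {"chunkId": "form-item1-chunk0000", "chunkSeqId": 0},
    {"chunkId": "form-item1-chunk0001", "chunkSeqId": 1},
]

edges = link_chunks(chunks)
for edge in edges:
    print(edge)
# ('form-item1-chunk0000', 'NEXT', 'form-item1-chunk0001')
# ('form-item1-chunk0001', 'NEXT', 'form-item1-chunk0002')
```

Three chunks give two NEXT relationships, so a retrieved chunk can be expanded with the text just before and after it.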
Read the collection of Form 13s →
- Investment management firms must report on their investments in companies to the SEC by filing a document called **Form 13**
- You’ll load a collection of Form 13 filings from managers that have invested in NetApp
- You can check out the CSV file by navigating to the data directory using the File menu at the top of the notebook.
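As a rough sketch of the loading step, the snippet below reads Form 13 rows with `csv.DictReader`. I am assuming the CSV columns mirror the property names that show up in the graph schema (`managerName`, `managerCik`, `companyName`, `cusip6`, `shares`, `value`); the two sample rows are invented and held in memory so the snippet runs without the notebook's data directory.

```python
import csv
import io

# Invented sample rows; the real file lives in the notebook's data directory
sample_csv = io.StringIO(
    "managerName,managerCik,companyName,cusip6,shares,value\n"
    "Example Capital LLC,0000000001,NetApp,XXXXXX,1000,50000.0\n"
    "Sample Advisors Inc,0000000002,NetApp,XXXXXX,250,12500.0\n"
)

all_form13s = []
for row in csv.DictReader(sample_csv):
    # CSV values arrive as strings, so convert the numeric columns
    row["shares"] = int(row["shares"])
    row["value"] = float(row["value"])
    all_form13s.append(row)

print(len(all_form13s))
print(all_form13s[0]["managerName"], all_form13s[0]["shares"])
```

Each dictionary in `all_form13s` can then be passed as query parameters when creating the Manager and Company nodes.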
Create new Manager and Company nodes and the relationship between them.
We’ve evolved our knowledge graph significantly since we first began. Initially, we started with just chunks of the Form 10-K and connected them to a Form node. Now, we’ve expanded it to include nodes for the company and the manager, all connected together. Let’s check the updated schema.
Node properties are the following:
1. Chunk {textEmbedding: LIST, f10kItem: STRING, chunkSeqId: INTEGER,
text: STRING,cik: STRING, cusip6: STRING, names: LIST,
formId: STRING, source: STRING, chunkId: STRING},
2. Form {cusip6: STRING, names: LIST, formId: STRING, source: STRING},
3. Company {cusip6: STRING, names: LIST, companyName: STRING, cusip:STRING},
4. Manager {managerName: STRING, managerCik: STRING, managerAddress: STRING}
Relationship properties are the following:
SECTION {f10kItem: STRING},
OWNS_STOCK_IN {shares: INTEGER, reportCalendarOrQuarter: STRING, value: FLOAT}
The relationships are the following:
(:Chunk)-[:NEXT]->(:Chunk),
(:Chunk)-[:PART_OF]->(:Form),
(:Form)-[:SECTION]->(:Chunk),
(:Company)-[:FILED]->(:Form),
(:Manager)-[:OWNS_STOCK_IN]->(:Company)
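With this schema, one query can walk from a text chunk all the way to the managers who own stock in the company that filed it. The sketch below follows only the labels, relationship types, and properties listed above; the `top_investors` helper, the `EchoGraph` stand-in, and the chunk ID are invented so the snippet runs without a database. With a live connection you would pass the `kg` object instead.

```python
INVESTORS_QUERY = """
MATCH (:Chunk {chunkId: $chunkId})-[:PART_OF]->(f:Form),
      (com:Company)-[:FILED]->(f),
      (mgr:Manager)-[owns:OWNS_STOCK_IN]->(com)
RETURN mgr.managerName AS manager, owns.shares AS shares
ORDER BY shares DESC
LIMIT $limit
"""


def top_investors(graph, chunk_id, limit=10):
    """Run the query on any object exposing query(cypher, params=...)."""
    return graph.query(INVESTORS_QUERY, params={"chunkId": chunk_id, "limit": limit})


class EchoGraph:
    """Stand-in for a live connection: returns what it was asked to run."""

    def query(self, cypher, params=None):
        return [{"cypher": cypher, "params": params}]


result = top_investors(EchoGraph(), "example-chunk-id", limit=5)
print(result[0]["params"])
# {'chunkId': 'example-chunk-id', 'limit': 5}
```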
I’ve attached a link to the Jupyter Notebook here for a deeper dive into the details. Feel free to explore it further.
We start with a small graph (a Minimal Viable Graph, or MVG), then extract, enhance, expand, and repeat to grow the graph.
- Extract: identify interesting information and pull it into separate nodes.
- Enhance: supercharge the data with vector embeddings.
- Expand: connect information to expand context.
Extract, Enhance and Expand with SEC Documents
In this Jupyter notebook, I’ve added a few Cypher queries to interact with the graph. Writing efficient Cypher queries with limited Cypher knowledge can be challenging. However, we can use GPT-3.5 or another LLM to help generate these queries. Adding few-shot examples to the prompt (a handful of example questions paired with their queries) improves the LLM’s output. LangChain’s GraphCypherQAChain then wires this together: it generates the Cypher from the question, runs it against the graph, and answers from the results.
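Here is a minimal sketch of what a few-shot prompt for Cypher generation could look like, built with plain string handling. The prompt wording, the `build_prompt` helper, and the two example pairs are my own assumptions for illustration (they are not the prompt from the notebook), and the example queries use only labels and relationship types from the schema shown earlier.

```python
# Hypothetical few-shot examples: a question paired with the Cypher that answers it
EXAMPLES = [
    {
        "question": "How many managers own stock in a company?",
        "cypher": "MATCH (m:Manager)-[:OWNS_STOCK_IN]->(:Company) RETURN count(DISTINCT m)",
    },
    {
        "question": "Which forms have been filed?",
        "cypher": "MATCH (:Company)-[:FILED]->(f:Form) RETURN f.formId LIMIT 5",
    },
]


def build_prompt(schema, examples, question):
    """Assemble schema, worked examples, and the new question into one prompt."""
    lines = [
        "Task: Generate a Cypher statement to query a graph database.",
        "Use only the relationship types and properties in the schema.",
        "Schema:",
        schema,
        "",
        "Examples:",
    ]
    for example in examples:
        lines.append(f"# {example['question']}")
        lines.append(example["cypher"])
    lines += ["", "The question is:", question]
    return "\n".join(lines)


schema = "(:Manager)-[:OWNS_STOCK_IN]->(:Company), (:Company)-[:FILED]->(:Form)"
prompt = build_prompt(schema, EXAMPLES, "Which managers own the most shares?")
print(prompt)
```

The resulting string is what you would hand to the LLM as its Cypher-generation prompt; the closer the examples are to the questions you expect, the better the generated queries tend to be.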