Implementing GraphRAG — Entity Extraction, Community Detection, and Graph-Augmented Retrieval
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Traditional RAG treats documents as isolated chunks. GraphRAG adds structure: entities (people, organizations, concepts), relationships (worked-for, founded, influenced), and communities (tightly knit groups of entities). This richer representation supports reasoning across documents and answering multi-hop questions. A question such as "Which CEOs worked at Microsoft and later founded AI startups?" depends on following relationships between entities, which pure vector search handles poorly because the relevant facts are usually spread across chunks that are not individually similar to the query.
- Entity and Relationship Extraction With LLM
- Neo4j Graph Schema Design
- Inserting Entities and Relationships into Neo4j
- Community Detection (Leiden Algorithm)
- Local vs Global Search in GraphRAG
- Combining Vector and Graph Retrieval
- When GraphRAG Outperforms Standard RAG
- Cost and Complexity Trade-Offs
- Checklist
- Conclusion
Entity and Relationship Extraction With LLM
Start by extracting structured information from text.
interface Entity {
name: string;
type: 'Person' | 'Organization' | 'Location' | 'Concept';
description?: string;
}
interface Relationship {
source: Entity;
target: Entity;
type: string; // "founded", "worked-for", "located-in", etc.
description?: string;
}
interface GraphConstructionResult {
entities: Entity[];
relationships: Relationship[];
}
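async function extractEntitiesAndRelationships(
  text: string,
  llm: LanguageModel
): Promise<GraphConstructionResult> {
  const prompt = `Extract entities and relationships from this text:
Text: "${text}"
Respond with JSON only:
{
"entities": [
{ "name": "...", "type": "Person|Organization|Location|Concept", "description": "..." }
],
"relationships": [
{ "source": "entity_name", "target": "entity_name", "type": "relationship_type", "description": "..." }
]
}`;
  const response = await llm.generate({ prompt });
  // The prompt asks for relationship endpoints as entity *names*, so parse
  // into a raw shape first and resolve names to Entity objects afterwards.
  const raw = JSON.parse(response) as {
    entities: Entity[];
    relationships: { source: string; target: string; type: string; description?: string }[];
  };
  const byName = new Map(raw.entities.map((e): [string, Entity] => [e.name, e]));
  const relationships: Relationship[] = [];
  for (const rel of raw.relationships) {
    const source = byName.get(rel.source);
    const target = byName.get(rel.target);
    // Skip relationships whose endpoints were not extracted as entities
    if (source && target) {
      relationships.push({ source, target, type: rel.type, description: rel.description });
    }
  }
  return { entities: raw.entities, relationships };
}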
// Process a document into a knowledge graph
async function documentToGraph(
docId: string,
content: string,
llm: LanguageModel
): Promise<{ docId: string; entities: Entity[]; relationships: Relationship[] }> {
// For long documents, chunk and extract per chunk
const chunks = chunkDocument(content, { size: 2000 });
const allEntities = new Map<string, Entity>();
const allRelationships: Relationship[] = [];
for (const chunk of chunks) {
const { entities, relationships } = await extractEntitiesAndRelationships(chunk, llm);
// Deduplicate entities by name
for (const entity of entities) {
if (!allEntities.has(entity.name)) {
allEntities.set(entity.name, entity);
}
}
allRelationships.push(...relationships);
}
return {
docId,
entities: Array.from(allEntities.values()),
relationships: allRelationships,
};
}
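`documentToGraph` calls a `chunkDocument` helper that this post does not define. A minimal sketch of what it might look like, assuming `size` is a character count and adding a hypothetical `overlap` option so that an entity mentioned at a chunk boundary appears in both neighboring chunks:

```typescript
// Hypothetical helper: the post does not define chunkDocument, so the
// character-based splitting and the `overlap` option are assumptions.
interface ChunkOptions {
  size: number;     // maximum characters per chunk
  overlap?: number; // characters shared between consecutive chunks
}

function chunkDocument(content: string, options: ChunkOptions): string[] {
  const { size, overlap = 0 } = options;
  if (size <= 0) throw new Error('size must be positive');
  if (overlap < 0 || overlap >= size) throw new Error('overlap must be in [0, size)');

  const chunks: string[] = [];
  const step = size - overlap; // how far the window advances each time
  for (let start = 0; start < content.length; start += step) {
    chunks.push(content.slice(start, start + size));
    if (start + size >= content.length) break; // this window reached the end
  }
  return chunks;
}

// Usage: 5000 characters, 2000 per chunk, 200 shared between neighbors
const exampleChunks = chunkDocument('x'.repeat(5000), { size: 2000, overlap: 200 });
console.log(exampleChunks.map(c => c.length)); // → [ 2000, 2000, 1400 ]
```

Splitting on sentence or paragraph boundaries would likely give the extractor cleaner input; fixed character windows are used here only to keep the sketch short.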
LLM-based extraction is flexible and works across domains, but it is prone to errors such as relationships that point at entities the model never listed. For production, add validation:
async function validateExtraction(
entities: Entity[],
relationships: Relationship[]
): Promise<boolean> {
// Check: all relationship endpoints exist in entities
const entityNames = new Set(entities.map(e => e.name));
for (const rel of relationships) {
if (!entityNames.has(rel.source.name) || !entityNames.has(rel.target.name)) {
console.warn(
`Invalid relationship: ${rel.source.name} → ${rel.target.name}`
);
return false;
}
}
return true;
}
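Two failure modes tend to show up with LLM output: the model wraps its JSON in a Markdown code fence, and it emits relationships whose endpoints it never listed as entities. The sketch below handles both. `parseExtraction` and the raw types are illustrative names, and the code assumes the JSON shape requested in the extraction prompt (relationship endpoints given as entity names).

```typescript
type EntityType = 'Person' | 'Organization' | 'Location' | 'Concept';
interface RawEntity { name: string; type: EntityType; description?: string }
interface RawRelationship { source: string; target: string; type: string; description?: string }
interface ParsedExtraction {
  entities: RawEntity[];
  relationships: RawRelationship[];
  dropped: RawRelationship[]; // relationships with an unknown endpoint
}

const ENTITY_TYPES: EntityType[] = ['Person', 'Organization', 'Location', 'Concept'];

function parseExtraction(response: string): ParsedExtraction {
  // Strip an optional Markdown code-fence wrapper (three backticks, optional "json")
  const cleaned = response
    .trim()
    .replace(/^`{3}(?:json)?\s*/i, '')
    .replace(/\s*`{3}$/, '');

  const data = JSON.parse(cleaned) as { entities?: unknown; relationships?: unknown };
  if (!Array.isArray(data.entities) || !Array.isArray(data.relationships)) {
    throw new Error('Expected "entities" and "relationships" arrays');
  }

  // Keep only entities with a non-empty name and a known type
  const entities = (data.entities as RawEntity[]).filter(
    e =>
      e != null &&
      typeof e.name === 'string' &&
      e.name.trim() !== '' &&
      ENTITY_TYPES.indexOf(e.type) !== -1
  );

  // Separate relationships with two known endpoints from the rest
  const names = new Set(entities.map(e => e.name));
  const relationships: RawRelationship[] = [];
  const dropped: RawRelationship[] = [];
  (data.relationships as RawRelationship[]).forEach(rel => {
    const valid = names.has(rel.source) && names.has(rel.target);
    (valid ? relationships : dropped).push(rel);
  });
  return { entities, relationships, dropped };
}
```

Dropping individual relationships, rather than rejecting the whole extraction as `validateExtraction` does, keeps the usable part of a chunk's output. The `dropped` list can be logged to track extraction quality over time.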
Neo4j Graph Schema Design
Design a schema for storing entities and relationships.
interface GraphSchema {
nodeLabels: string[]; // Person, Organization, Location, Concept
relationshipTypes: string[]; // FOUNDED, WORKED_FOR, LOCATED_IN
properties: {
node: Record<string, string>; // e.g., Person: [name, bio, birth_date]
relationship: Record<string, string>; // e.g., WORKED_FOR: [startDate, endDate]
};
}
const schemaExample: GraphSchema = {
nodeLabels: ['Person', 'Organization', 'Location', 'Concept'],
relationshipTypes: [
'FOUNDED',
'WORKED_FOR',
'LOCATED_IN',
'INFLUENCED',
'PUBLISHED',
],
properties: {
node: {
Person: 'name, bio, birth_date, nationality',
Organization: 'name, industry, founded_year, headquarters',
Location: 'name, country, region',
Concept: 'name, definition, domain',
},
relationship: {
FOUNDED: 'year, role',
WORKED_FOR: 'start_year, end_year, role',
LOCATED_IN: 'since_year',
},
},
};
async function createGraphSchema(neo4j: Neo4jDriver): Promise<void> {
const session = neo4j.session();
try {
// Enforce unique Person names (the MERGE on name during insertion relies on this)
await session.run(`
CREATE CONSTRAINT person_name IF NOT EXISTS
FOR (p:Person) REQUIRE p.name IS UNIQUE
`);
// Enforce unique Organization names
await session.run(`
CREATE CONSTRAINT org_name IF NOT EXISTS
FOR (o:Organization) REQUIRE o.name IS UNIQUE
`);
// Create indexes for performance
await session.run(`CREATE INDEX person_bio IF NOT EXISTS FOR (p:Person) ON (p.bio)`);
} finally {
await session.close();
}
}
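`createGraphSchema` covers only `Person` and `Organization`, while the insertion code merges on `name` for every label. One way to keep the two in sync is to generate the constraint statements from the label list. This is a sketch: `constraintStatements` is a hypothetical helper, and the statements reuse the constraint syntax shown above.

```typescript
// Build one uniqueness constraint per node label, mirroring the
// CREATE CONSTRAINT statements used in createGraphSchema.
function constraintStatements(nodeLabels: string[]): string[] {
  return nodeLabels.map(label => {
    // Labels are interpolated into Cypher, so accept plain identifiers only
    if (!/^[A-Za-z][A-Za-z0-9_]*$/.test(label)) {
      throw new Error(`Unsafe node label: ${label}`);
    }
    const constraintName = `${label.toLowerCase()}_name`;
    return (
      `CREATE CONSTRAINT ${constraintName} IF NOT EXISTS ` +
      `FOR (n:${label}) REQUIRE n.name IS UNIQUE`
    );
  });
}

const schemaStatements = constraintStatements(['Person', 'Organization', 'Location', 'Concept']);
schemaStatements.forEach(s => console.log(s));
```

Each generated statement would then be passed to `session.run(...)` in turn, in the same way `createGraphSchema` runs its hand-written statements.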
Inserting Entities and Relationships into Neo4j
async function insertGraphData(
neo4j: Neo4jDriver,
entities: Entity[],
relationships: Relationship[]
): Promise<void> {
const session = neo4j.session();
try {
// Insert entities
for (const entity of entities) {
const label = entity.type; // Person, Organization, etc.
await session.run(
`
MERGE (e:${label} { name: $name })
SET e.description = $description
RETURN e
`,
{ name: entity.name, description: entity.description ?? '' }
);
}
// Insert relationships
for (const rel of relationships) {
const sourceType = rel.source.type;
const targetType = rel.target.type;
// Normalize to a valid Cypher identifier, e.g. "worked-for" becomes "WORKED_FOR"
const relType = rel.type.toUpperCase().replace(/[^A-Z0-9]+/g, '_');
await session.run(
`
MATCH (source:${sourceType} { name: $sourceName })
MATCH (target:${targetType} { name: $targetName })
MERGE (source)-[r:${relType}]->(target)
SET r.description = $description
RETURN r
`,
{
sourceName: rel.source.name,
targetName: rel.target.name,
description: rel.description ?? '',
}
);
}
} finally {
await session.close();
}
}
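The insertion loop issues one query per entity and interpolates LLM-produced strings into Cypher as labels and relationship types. The sketch below shows two possible mitigations, assuming only the four labels from the schema are allowed: an allow-list for labels together with a normalizer for relationship types, and grouping entities by label so each group can be written with a single `UNWIND` query. The helper names are illustrative.

```typescript
type NodeLabel = 'Person' | 'Organization' | 'Location' | 'Concept';
const ALLOWED_LABELS: NodeLabel[] = ['Person', 'Organization', 'Location', 'Concept'];

interface EntityRow { name: string; description: string }
interface EntityBatch { label: NodeLabel; cypher: string; rows: EntityRow[] }

// "worked-for" -> "WORKED_FOR"; any run of other characters becomes "_"
function toRelationshipType(raw: string): string {
  const normalized = raw
    .trim()
    .toUpperCase()
    .replace(/[^A-Z0-9]+/g, '_')
    .replace(/^_+|_+$/g, '');
  if (!/^[A-Z][A-Z0-9_]*$/.test(normalized)) {
    throw new Error(`Cannot derive a relationship type from "${raw}"`);
  }
  return normalized;
}

// Group entities by label; each batch carries a parameterized UNWIND query.
// Entities whose type is not in the allow-list are skipped.
function buildEntityBatches(
  entities: { name: string; type: string; description?: string }[]
): EntityBatch[] {
  const batches: EntityBatch[] = [];
  ALLOWED_LABELS.forEach(label => {
    const rows = entities
      .filter(e => e.type === label)
      .map(e => ({ name: e.name, description: e.description ?? '' }));
    if (rows.length === 0) return;
    batches.push({
      label,
      cypher:
        `UNWIND $rows AS row ` +
        `MERGE (e:${label} { name: row.name }) ` +
        `SET e.description = row.description`,
      rows,
    });
  });
  return batches;
}
```

Each batch would be executed as `session.run(batch.cypher, { rows: batch.rows })`, which replaces one round trip per entity with one per label.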
Community Detection (Leiden Algorithm)
Detect tightly-connected groups of entities for higher-level reasoning.
interface Community {
id: number;
entities: Entity[];
size: number;
density: number; // How interconnected
summary: string; // LLM-generated description
}
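async function detectCommunities(
  neo4j: Neo4jDriver,
  llm: LanguageModel
): Promise<Community[]> {
  const session = neo4j.session();
  try {
    // Run Leiden via the Neo4j GDS library. This assumes a projected graph
    // named 'graph' already exists, with undirected relationships and a
    // numeric 'weight' relationship property.
    // The UNWIND / OPTIONAL MATCH part counts relationships that start and
    // end inside the same community, once each.
    const result = await session.run(`
      CALL gds.leiden.stream('graph', {
        relationshipWeightProperty: 'weight',
        maxLevels: 5
      })
      YIELD nodeId, communityId
      WITH communityId, collect(gds.util.asNode(nodeId)) AS nodes
      UNWIND nodes AS a
      OPTIONAL MATCH (a)-[r]->(b) WHERE b IN nodes
      RETURN communityId, nodes, count(r) AS edgeCount
    `);
    const communities: Community[] = [];
    for (const record of result.records) {
      // Assumes the official driver's default, where Cypher integers
      // arrive as Integer objects and need toNumber()
      const communityId = record.get('communityId').toNumber();
      const edgeCount = record.get('edgeCount').toNumber();
      const nodes = record.get('nodes');
      // Density = relationships inside the community / possible node pairs.
      // Parallel relationships between the same pair can push this above 1.
      const nodeCount = nodes.length;
      const maxEdges = (nodeCount * (nodeCount - 1)) / 2;
      const density = maxEdges > 0 ? edgeCount / maxEdges : 0;
      // Generate summary
      const nodeNames = nodes.map((n: any) => n.properties.name).join(', ');
      const summary = await llm.generate({
        prompt: `Summarize this group in one sentence: ${nodeNames}`,
      });
      communities.push({
        id: communityId,
        entities: nodes.map((n: any) => ({
          name: n.properties.name,
          type: n.labels[0] as Entity['type'],
        })),
        size: nodeCount,
        density,
        summary,
      });
    }
    return communities;
  } finally {
    await session.close();
  }
}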
Communities capture higher-level structure. A "community" might represent a company, a research field, or a historical period.
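The density figure is the share of possible node pairs inside a community that are actually connected. A small in-memory sketch of that calculation, independent of Neo4j, may make it concrete. The `Edge` type, the function name, and the sample names are assumptions made for illustration.

```typescript
interface Edge { source: string; target: string }

// Density of the undirected simple graph induced by `members`:
// (distinct connected pairs inside the community) / (n * (n - 1) / 2)
function communityDensity(members: string[], edges: Edge[]): number {
  const memberSet = new Set(members);
  const n = memberSet.size;
  if (n < 2) return 0;

  const pairs = new Set<string>();
  edges.forEach(({ source, target }) => {
    if (source === target) return; // ignore self-loops
    if (!memberSet.has(source) || !memberSet.has(target)) return; // edge leaves the community
    // Order the endpoints so A-B and B-A count as the same pair
    pairs.add(source < target ? `${source}|${target}` : `${target}|${source}`);
  });
  return pairs.size / ((n * (n - 1)) / 2);
}

const sampleEdges: Edge[] = [
  { source: 'Ada', target: 'Acme' },
  { source: 'Bob', target: 'Acme' },
  { source: 'Acme', target: 'Bob' }, // same pair as above, opposite direction
  { source: 'Ada', target: 'Zed' },  // Zed is outside the community
];
// Two of the three possible pairs are connected
console.log(communityDensity(['Ada', 'Bob', 'Acme'], sampleEdges));
```

Counting distinct pairs keeps the value in [0, 1] even when the graph stores several relationships between the same two entities.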
Local vs Global Search in GraphRAG
Two retrieval strategies:
Local Search: For specific entities or relationships
async function localSearch(
query: string,
neo4j: Neo4jDriver,
llm: LanguageModel
): Promise<string> {
// Extract target entity from query using LLM
const entityName = await llm.generate({
prompt: `Extract the main entity from this query: "${query}"
Return just the entity name.`,
});
const session = neo4j.session();
try {
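// Find the entity and its immediate neighbors, keeping edge direction
const result = await session.run(
`
MATCH (e { name: $entityName })
OPTIONAL MATCH (e)-[r]-(neighbor)
RETURN e, r, neighbor, startNode(r) = e AS outgoing
LIMIT 50
`,
{ entityName: entityName.trim() }
);
// Format as "source TYPE target" lines for the LLM.
// OPTIONAL MATCH yields a null r for isolated entities, so skip those rows.
const context = result.records
.filter(record => record.get('r') !== null)
.map(record => {
const e = record.get('e').properties.name;
const n = record.get('neighbor').properties.name;
const type = record.get('r').type;
return record.get('outgoing') ? `${e} ${type} ${n}` : `${n} ${type} ${e}`;
})
.join('\n');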
return await llm.generate({
prompt: `${query}\n\nContext:\n${context}`,
});
} finally {
await session.close();
}
}
Global Search: For complex, multi-hop questions
async function globalSearch(
  query: string,
  neo4j: Neo4jDriver,
  llm: LanguageModel
): Promise<string> {
  // Run Leiden once and collect member names for the 10 largest communities.
  // A single query avoids re-running the algorithm per community (community
  // ids are not guaranteed to match between runs) and keeps the session to
  // one query at a time.
  const communityQuery = `
    CALL gds.leiden.stream('graph')
    YIELD nodeId, communityId
    WITH communityId, collect(gds.util.asNode(nodeId).name) AS names
    RETURN communityId, names[..10] AS topNames, size(names) AS memberCount
    ORDER BY memberCount DESC
    LIMIT 10
  `;
  const session = neo4j.session();
  try {
    const communities = await session.run(communityQuery);
    // One context entry per community: up to 10 member names
    const summaries = communities.records.map((record) =>
      (record.get('topNames') as string[]).join(', ')
    );
    // Generate answer from the community member lists
    const contextStr = summaries.join('\n\n');
    return await llm.generate({
      prompt: `${query}\n\nContext (community members):\n${contextStr}`,
    });
  } finally {
    await session.close();
  }
}
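Local search is fast and precise because it touches only one entity's neighborhood. Global search is slower, since it has to cover many communities, but it handles questions that span the whole corpus.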
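Global search works best when it looks at the communities relevant to the query rather than an arbitrary handful. One simple way to choose them is to score each community summary against the query and keep the top few. The sketch below uses plain keyword overlap as a stand-in for whatever relevance signal you prefer (embedding similarity would be a natural upgrade). `rankCommunities` and `CommunitySummary` are illustrative names.

```typescript
interface CommunitySummary { id: number; summary: string }

// Lowercase, split on non-alphanumerics, drop very short tokens
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(t => t.length > 2);
}

// Score = fraction of distinct query terms that appear in the summary.
// Communities with no overlap are dropped entirely.
function rankCommunities(
  query: string,
  communities: CommunitySummary[],
  topK: number
): CommunitySummary[] {
  const queryTerms = Array.from(new Set(tokenize(query)));
  if (queryTerms.length === 0) return [];
  return communities
    .map(c => {
      const terms = new Set(tokenize(c.summary));
      const hits = queryTerms.filter(t => terms.has(t)).length;
      return { community: c, score: hits / queryTerms.length };
    })
    .filter(s => s.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(s => s.community);
}
```

The selected summaries can then be joined into the context string in place of the raw member lists.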
Combining Vector and Graph Retrieval
Hybrid retrieval uses both vector similarity (semantic) and graph structure (relationship-based).
interface HybridRetrievalResult {
vectorChunks: DocumentChunk[];
graphEntities: Entity[];
combinedContext: string;
}
async function hybridRetrieval(
query: string,
vectorDb: VectorStore,
neo4j: Neo4jDriver,
llm: LanguageModel,
weights: { vectorWeight: number; graphWeight: number } = {
vectorWeight: 0.6,
graphWeight: 0.4,
}
): Promise<HybridRetrievalResult> {
// Vector retrieval
const vectorChunks = await vectorDb.search(query, { topK: 10 });
// Graph retrieval: extract entities from query
const entities = await llm.generate({
prompt: `Extract entities (people, organizations) from: "${query}"
Return as JSON array: ["entity1", "entity2", ...]`,
});
const entityList = JSON.parse(entities);
const session = neo4j.session();
let graphEntities: Entity[] = [];
try {
for (const entityName of entityList) {
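const result = await session.run(
`
MATCH (e { name: $name })-[r]-(related)
RETURN e, r, related, startNode(r) = e AS outgoing
LIMIT 10
`,
{ name: entityName }
);
// Keep each neighbor once and record how it connects to the query entity
// (pushing `e` here would repeat the query entity once per row)
for (const rec of result.records) {
const e = rec.get('e').properties.name;
const related = rec.get('related');
const relatedName = related.properties.name;
if (graphEntities.some(g => g.name === relatedName)) continue;
graphEntities.push({
name: relatedName,
type: related.labels[0],
description: rec.get('outgoing')
? `${e} ${rec.get('r').type} ${relatedName}`
: `${relatedName} ${rec.get('r').type} ${e}`,
});
}
}
} finally {
await session.close();
}
// Combine results
const combinedContext = `
Vector retrieval results:
${vectorChunks.map(c => c.content).join('\n\n')}
Graph entities and relationships:
${graphEntities.map(e => `${e.name} (${e.type}): ${e.description}`).join('\n')}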
`;
return {
vectorChunks,
graphEntities,
combinedContext,
};
}
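The `weights` parameter is accepted by `hybridRetrieval` but never applied. One way it could be used, shown here as a sketch rather than the author's method, is weighted score fusion: normalize each source's scores to [0, 1], multiply by the source weight, and sum per item. The `ScoredItem` shape and the assumption that both retrievers return numeric scores are hypothetical.

```typescript
interface ScoredItem { id: string; score: number }

// Min-max normalize scores to [0, 1]; when all scores are equal, map them to 1
function normalize(items: ScoredItem[]): ScoredItem[] {
  if (items.length === 0) return [];
  const scores = items.map(i => i.score);
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  return items.map(i => ({
    id: i.id,
    score: max === min ? 1 : (i.score - min) / (max - min),
  }));
}

// Weighted sum of normalized scores; items found by both sources accumulate
function fuseResults(
  vectorHits: ScoredItem[],
  graphHits: ScoredItem[],
  weights: { vectorWeight: number; graphWeight: number }
): ScoredItem[] {
  const fused = new Map<string, number>();
  const add = (hits: ScoredItem[], weight: number): void => {
    normalize(hits).forEach(h => {
      fused.set(h.id, (fused.get(h.id) ?? 0) + weight * h.score);
    });
  };
  add(vectorHits, weights.vectorWeight);
  add(graphHits, weights.graphWeight);

  const out: ScoredItem[] = [];
  fused.forEach((score, id) => out.push({ id, score }));
  return out.sort((a, b) => b.score - a.score);
}
```

With the default 0.6 / 0.4 split, an item that ranks mid-table in vector search but tops the graph results can overtake the best vector-only hit, which is the behavior the weights are presumably meant to control.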
When GraphRAG Outperforms Standard RAG
GraphRAG wins on:
- Multi-hop reasoning: "Find all people who worked at company X then founded startups"
- Relationship-heavy domains: Organizational hierarchies, social networks, academic genealogies
- Disambiguation: "Trump" (person) vs "Trump card" (concept)
- Temporal reasoning: "CEOs hired in 2020 who left by 2022"
Standard RAG is simpler and sufficient for:
- Fact-based retrieval: "What is photosynthesis?"
- Single-document QA
- When entities and relationships are sparse
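To make the multi-hop case concrete, here is a small in-memory sketch of the first example: people who worked at a given company and also founded something. It joins two relationship types on the shared person, a join that similarity search over chunks does not perform. The triples are invented sample data, the `Triple` type is an assumption, and the sketch ignores the "then" ordering, which would need dates on the relationships.

```typescript
interface Triple { source: string; type: string; target: string }

// People with a WORKED_FOR edge to `company` and a FOUNDED edge to anything
function foundersWhoWorkedAt(
  triples: Triple[],
  company: string
): { person: string; founded: string }[] {
  // Hop 1: everyone who worked at the company
  const alumni = new Set(
    triples
      .filter(t => t.type === 'WORKED_FOR' && t.target === company)
      .map(t => t.source)
  );
  // Hop 2: whatever those people founded
  return triples
    .filter(t => t.type === 'FOUNDED' && alumni.has(t.source))
    .map(t => ({ person: t.source, founded: t.target }));
}

// Invented sample data, for illustration only
const sampleTriples: Triple[] = [
  { source: 'Alice', type: 'WORKED_FOR', target: 'CompanyX' },
  { source: 'Bob', type: 'WORKED_FOR', target: 'CompanyX' },
  { source: 'Carol', type: 'WORKED_FOR', target: 'CompanyY' },
  { source: 'Alice', type: 'FOUNDED', target: 'StartupA' },
  { source: 'Carol', type: 'FOUNDED', target: 'StartupC' },
];
console.log(foundersWhoWorkedAt(sampleTriples, 'CompanyX'));
// → [ { person: 'Alice', founded: 'StartupA' } ]
```

In the graph database the same join is a single pattern match over `WORKED_FOR` and `FOUNDED` relationships that share a `Person` node.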
Cost and Complexity Trade-Offs
| Factor | Standard RAG | GraphRAG |
|---|---|---|
| Setup | Days | Weeks (extraction, graph construction) |
| Indexing cost | Lower (embed chunks only) | Higher (LLM extraction calls, plus embeddings for entities and relationships) |
| Query latency | ~50ms | 100-500ms (graph traversal) |
| Maintenance | Low | High (deduplication, schema evolution) |
| Accuracy on simple queries | 80% | 80% (no improvement) |
| Accuracy on complex queries | 40% | 75% (major improvement) |
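Much of GraphRAG's extra indexing work comes from the per-chunk extraction calls. A back-of-the-envelope estimator follows, assuming non-overlapping chunks of `chunkSize` characters as in `documentToGraph`, one extraction call per chunk, and one summary call per community. The function and field names are illustrative, and no prices are implied.

```typescript
interface IndexingEstimate {
  chunks: number;
  extractionCalls: number; // one per chunk
  summaryCalls: number;    // one per community
  totalLlmCalls: number;
}

function estimateIndexingCalls(
  docLengths: number[],        // document lengths in characters
  chunkSize: number,           // characters per chunk
  expectedCommunities: number  // a guess; only known after detection runs
): IndexingEstimate {
  if (chunkSize <= 0) throw new Error('chunkSize must be positive');
  // Each document contributes ceil(length / chunkSize) chunks
  const chunks = docLengths.reduce((sum, len) => sum + Math.ceil(len / chunkSize), 0);
  return {
    chunks,
    extractionCalls: chunks,
    summaryCalls: expectedCommunities,
    totalLlmCalls: chunks + expectedCommunities,
  };
}

// Three documents of 5000, 12000 and 800 characters, 2000-character chunks:
// 3 + 6 + 1 = 10 chunks, plus 4 community summaries = 14 LLM calls
console.log(estimateIndexingCalls([5000, 12000, 800], 2000, 4));
```

Standard RAG needs none of these LLM calls at indexing time, which is the main reason the setup and cost rows in the table differ.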
Checklist
- Use LLM for flexible entity and relationship extraction
- Validate extracted relationships (both endpoints exist as entities)
- Design Neo4j schema with clear node labels and relationship types
- Detect communities for higher-level reasoning
- Implement local search (specific entities) and global search (complex questions)
- Combine vector and graph retrieval for hybrid strength
- Monitor extraction quality; errors compound in graph
- Measure improvement on complex, multi-hop questions
- Start simple with vector RAG; add graph only if needed
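For the "monitor extraction quality" item, one lightweight approach is to hand-label a small sample of chunks and compare the extracted entity names against it. The sketch below assumes case-insensitive exact name matching, which is cruder than what a real pipeline usually needs; the function and type names are illustrative.

```typescript
interface ExtractionScore { precision: number; recall: number; f1: number }

function scoreExtraction(extracted: string[], expected: string[]): ExtractionScore {
  const norm = (s: string): string => s.trim().toLowerCase();
  // Sets remove duplicates such as "Microsoft" and "microsoft "
  const extractedSet = new Set(extracted.map(norm));
  const expectedSet = new Set(expected.map(norm));

  const truePositives = Array.from(extractedSet).filter(n => expectedSet.has(n)).length;
  // Precision: how much of what was extracted is correct
  const precision = extractedSet.size === 0 ? 0 : truePositives / extractedSet.size;
  // Recall: how much of what should have been found was found
  const recall = expectedSet.size === 0 ? 0 : truePositives / expectedSet.size;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Invented sample: 3 distinct extracted names, 2 of them in the labeled set of 4
const sampleScore = scoreExtraction(
  ['Microsoft', 'Satya Nadella', 'Redmond', 'microsoft '],
  ['Microsoft', 'Satya Nadella', 'OpenAI', 'Bill Gates']
);
console.log(sampleScore); // precision 2/3, recall 1/2
```

Tracking these numbers per document batch gives an early warning when a prompt or model change degrades the graph before the errors compound.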
Conclusion
GraphRAG adds structure to RAG: entities, relationships, and communities enable reasoning across documents. It fits best in domains with rich relationships among organizations, people, and concepts. The cost is complexity: entity extraction, deduplication, and graph maintenance. Start with standard vector RAG, and if multi-hop questions remain a struggle, GraphRAG is worth the investment.