Implementing GraphRAG — Entity Extraction, Community Detection, and Graph-Augmented Retrieval

Introduction

Traditional RAG treats documents as isolated chunks. GraphRAG adds structure: entities (people, organizations, concepts), relationships (worked-for, founded, influenced), and communities (tightly-knit groups of entities). This richer representation allows reasoning across documents and answering complex multi-hop questions. "Which CEOs worked at Microsoft and later founded AI startups?" requires following relationships between entities, which pure vector search over isolated chunks handles poorly.

Entity and Relationship Extraction With an LLM

Start by extracting structured information from text.

interface Entity {
  name: string;
  type: 'Person' | 'Organization' | 'Location' | 'Concept';
  description?: string;
}

interface Relationship {
  source: Entity;
  target: Entity;
  type: string; // "founded", "worked-for", "located-in", etc.
  description?: string;
}

interface GraphConstructionResult {
  entities: Entity[];
  relationships: Relationship[];
}

async function extractEntitiesAndRelationships(
  text: string,
  llm: LanguageModel
): Promise<GraphConstructionResult> {
  const prompt = `Extract entities and relationships from this text:

Text: "${text}"

Respond with JSON:
{
  "entities": [
    { "name": "...", "type": "Person|Organization|Location|Concept", "description": "..." }
  ],
  "relationships": [
    { "source": "entity_name", "target": "entity_name", "type": "relationship_type", "description": "..." }
  ]
}`;

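  const response = await llm.generate({ prompt });
  const raw = JSON.parse(response);

  // The model returns source/target as entity names; resolve them to Entity
  // objects so the result matches the Relationship interface.
  // Relationships whose endpoints were not extracted are dropped.
  const byName = new Map<string, Entity>(raw.entities.map((e: Entity) => [e.name, e]));
  const relationships: Relationship[] = raw.relationships.flatMap((r: any) => {
    const source = byName.get(r.source);
    const target = byName.get(r.target);
    return source && target ? [{ ...r, source, target }] : [];
  });

  return { entities: raw.entities, relationships };
}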

// Process a document into a knowledge graph
async function documentToGraph(
  docId: string,
  content: string,
  llm: LanguageModel
): Promise<{ docId: string; entities: Entity[]; relationships: Relationship[] }> {
  // For long documents, chunk and extract per chunk
  const chunks = chunkDocument(content, { size: 2000 });

  const allEntities = new Map<string, Entity>();
  const allRelationships: Relationship[] = [];

  for (const chunk of chunks) {
    const { entities, relationships } = await extractEntitiesAndRelationships(chunk, llm);

    // Deduplicate entities by name
    for (const entity of entities) {
      if (!allEntities.has(entity.name)) {
        allEntities.set(entity.name, entity);
      }
    }

    allRelationships.push(...relationships);
  }

  return {
    docId,
    entities: Array.from(allEntities.values()),
    relationships: allRelationships,
  };
}
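
`documentToGraph` calls `chunkDocument`, which the article does not define. A minimal sketch follows, assuming character-based chunks with a small overlap so that an entity mention split at a boundary still appears whole in one chunk. The `overlap` option and its default (10% of `size`) are assumptions for illustration, not part of the original call.

```typescript
interface ChunkOptions {
  size: number;     // maximum characters per chunk
  overlap?: number; // characters shared by consecutive chunks (assumed option)
}

function chunkDocument(
  content: string,
  { size, overlap = Math.floor(size / 10) }: ChunkOptions
): string[] {
  if (size <= 0) throw new Error('size must be positive');
  if (overlap < 0 || overlap >= size) throw new Error('overlap must be in [0, size)');

  const chunks: string[] = [];
  const step = size - overlap; // how far the window advances each time
  for (let start = 0; start < content.length; start += step) {
    chunks.push(content.slice(start, start + size));
    if (start + size >= content.length) break; // this chunk reached the end
  }
  return chunks;
}
```

Splitting on sentence or paragraph boundaries would usually give the extractor cleaner input; fixed-width windows are only the simplest option.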

LLM-based extraction is flexible and works across domains, but it is prone to errors, such as relationships that point at entities the model never listed. For production, add validation:

async function validateExtraction(
  entities: Entity[],
  relationships: Relationship[]
): Promise<boolean> {
  // Check: all relationship endpoints exist in entities
  const entityNames = new Set(entities.map(e => e.name));

  for (const rel of relationships) {
    if (!entityNames.has(rel.source.name) || !entityNames.has(rel.target.name)) {
      console.warn(
        `Invalid relationship: ${rel.source.name} -> ${rel.target.name}`
      );
      return false;
    }
  }

  return true;
}
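
Validation can also start one step earlier, at parse time. Models often wrap JSON in prose or code fences and may emit types outside the allowed set, so a bare `JSON.parse` is fragile. The sketch below is one possible defensive parser; `parseExtraction` is a hypothetical helper, and the types are repeated only to keep the example self-contained.

```typescript
type EntityType = 'Person' | 'Organization' | 'Location' | 'Concept';
interface Entity { name: string; type: EntityType; description?: string }
interface Relationship { source: Entity; target: Entity; type: string; description?: string }

const ENTITY_TYPES: readonly string[] = ['Person', 'Organization', 'Location', 'Concept'];

function parseExtraction(raw: string): { entities: Entity[]; relationships: Relationship[] } {
  // Keep only the outermost {...}; this drops code fences and surrounding chatter.
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) throw new Error('No JSON object found in LLM response');
  const data = JSON.parse(raw.slice(start, end + 1));

  // Keep entities that have a string name and a known type.
  const entities: Entity[] = (Array.isArray(data.entities) ? data.entities : []).filter(
    (e: any) => typeof e?.name === 'string' && ENTITY_TYPES.includes(e?.type)
  );
  const byName = new Map<string, Entity>();
  for (const e of entities) byName.set(e.name, e);

  // Resolve endpoint names to entities; skip relationships that cannot be resolved.
  const relationships: Relationship[] = [];
  for (const r of Array.isArray(data.relationships) ? data.relationships : []) {
    const source = byName.get(r?.source);
    const target = byName.get(r?.target);
    if (source && target && typeof r.type === 'string') {
      relationships.push({ source, target, type: r.type, description: r.description });
    }
  }
  return { entities, relationships };
}
```

Dropping bad items silently is a design choice; logging or counting them would make extraction quality easier to monitor.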

Neo4j Graph Schema Design

Design a schema for storing entities and relationships.

interface GraphSchema {
  nodeLabels: string[]; // Person, Organization, Location, Concept
  relationshipTypes: string[]; // FOUNDED, WORKED_FOR, LOCATED_IN
  properties: {
    node: Record<string, string>; // e.g., Person: 'name, bio, birth_date'
    relationship: Record<string, string>; // e.g., WORKED_FOR: 'start_year, end_year'
  };
}

const schemaExample: GraphSchema = {
  nodeLabels: ['Person', 'Organization', 'Location', 'Concept'],
  relationshipTypes: [
    'FOUNDED',
    'WORKED_FOR',
    'LOCATED_IN',
    'INFLUENCED',
    'PUBLISHED',
  ],
  properties: {
    node: {
      Person: 'name, bio, birth_date, nationality',
      Organization: 'name, industry, founded_year, headquarters',
      Location: 'name, country, region',
      Concept: 'name, definition, domain',
    },
    relationship: {
      FOUNDED: 'year, role',
      WORKED_FOR: 'start_year, end_year, role',
      LOCATED_IN: 'since_year',
    },
  },
};

async function createGraphSchema(neo4j: Neo4jDriver): Promise<void> {
  const session = neo4j.session();

  try {
    // Enforce unique Person names
    await session.run(`
      CREATE CONSTRAINT person_name IF NOT EXISTS
      FOR (p:Person) REQUIRE p.name IS UNIQUE
    `);

    // Enforce unique Organization names
    await session.run(`
      CREATE CONSTRAINT org_name IF NOT EXISTS
      FOR (o:Organization) REQUIRE o.name IS UNIQUE
    `);

    // Create indexes for performance
    await session.run(`CREATE INDEX person_bio IF NOT EXISTS FOR (p:Person) ON (p.bio)`);
  } finally {
    await session.close();
  }
}
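
`createGraphSchema` adds constraints for `Person` and `Organization` only, while `insertGraphData` merges all four labels by name. One way to cover every label is to generate the statements from the label list. This is a sketch: `buildConstraintStatements` and the `<label>_name` naming scheme are assumptions, and the statement text follows the same form as the constraints above.

```typescript
const NODE_LABELS = ['Person', 'Organization', 'Location', 'Concept'] as const;

// Build one uniqueness-constraint statement per node label.
function buildConstraintStatements(labels: readonly string[]): string[] {
  return labels.map(label => {
    // Labels are interpolated into Cypher, so reject anything that is not a plain identifier.
    if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(label)) {
      throw new Error(`Unsafe label: ${label}`);
    }
    const constraintName = `${label.toLowerCase()}_name`;
    return `CREATE CONSTRAINT ${constraintName} IF NOT EXISTS FOR (n:${label}) REQUIRE n.name IS UNIQUE`;
  });
}

const statements = buildConstraintStatements(NODE_LABELS);
// Each statement could then be passed to session.run(...) inside createGraphSchema.
```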

Inserting Entities and Relationships into Neo4j

async function insertGraphData(
  neo4j: Neo4jDriver,
  entities: Entity[],
  relationships: Relationship[]
): Promise<void> {
  const session = neo4j.session();

  try {
    // Insert entities
    for (const entity of entities) {
      const label = entity.type; // Person, Organization, etc.

      await session.run(
        `
        MERGE (e:${label} { name: $name })
        SET e.description = $description
        RETURN e
      `,
        { name: entity.name, description: entity.description ?? '' }
      );
    }

    // Insert relationships
    for (const rel of relationships) {
      const sourceType = rel.source.type;
      const targetType = rel.target.type;
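      // Relationship types are interpolated into Cypher, so reduce them to
      // [A-Z0-9_]: "worked-for" and "worked for" both become WORKED_FOR
      const relType = rel.type.toUpperCase().replace(/[^A-Z0-9]+/g, '_');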

      await session.run(
        `
        MATCH (source:${sourceType} { name: $sourceName })
        MATCH (target:${targetType} { name: $targetName })
        MERGE (source)-[r:${relType}]->(target)
        SET r.description = $description
        RETURN r
      `,
        {
          sourceName: rel.source.name,
          targetName: rel.target.name,
          description: rel.description ?? '',
        }
      );
    }
  } finally {
    await session.close();
  }
}
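
Labels and relationship types cannot be passed as query parameters, which is why the function above interpolates them into the Cypher string. Because those values come from LLM output, it is worth validating them first. A minimal sketch, assuming the four labels from the schema; `toNodeLabel` and `toRelationshipType` are hypothetical helper names.

```typescript
const ALLOWED_LABELS = new Set(['Person', 'Organization', 'Location', 'Concept']);

// Accept only labels the schema knows about.
function toNodeLabel(type: string): string {
  if (!ALLOWED_LABELS.has(type)) throw new Error(`Unknown entity type: ${type}`);
  return type;
}

// Turn free text such as "worked-for" into a safe type such as WORKED_FOR.
function toRelationshipType(raw: string): string {
  const cleaned = raw
    .trim()
    .toUpperCase()
    .replace(/[^A-Z0-9]+/g, '_') // spaces, hyphens, punctuation -> underscore
    .replace(/^_+|_+$/g, '');    // strip leading and trailing underscores
  if (!/^[A-Z][A-Z0-9_]*$/.test(cleaned)) {
    throw new Error(`Cannot derive a relationship type from: "${raw}"`);
  }
  return cleaned;
}

console.log(toRelationshipType('worked-for')); // WORKED_FOR
```

Normalizing types this way also keeps "worked for" and "worked-for" from becoming two different relationship types in the graph.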

Community Detection (Leiden Algorithm)

Detect tightly-connected groups of entities for higher-level reasoning.

interface Community {
  id: number;
  entities: Entity[];
  size: number;
  density: number; // How interconnected
  summary: string; // LLM-generated description
}

async function detectCommunities(
  neo4j: Neo4jDriver,
  llm: LanguageModel
): Promise<Community[]> {
  const session = neo4j.session();

  try {
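    // Run Leiden via the Neo4j GDS library. This assumes a projected in-memory
    // graph named 'graph' with UNDIRECTED relationships (required by Leiden)
    // and a 'weight' relationship property; drop relationshipWeightProperty
    // if the projection has no weights.
    const result = await session.run(`
      CALL gds.leiden.stream('graph', {
        relationshipWeightProperty: 'weight',
        maxLevels: 5
      })
      YIELD nodeId, communityId
      WITH communityId, collect(gds.util.asNode(nodeId)) AS nodes
      UNWIND nodes AS n
      OPTIONAL MATCH (n)-[r]-(m) WHERE m IN nodes
      RETURN communityId, nodes, count(DISTINCT r) AS edgeCount
    `);

    // The driver returns integers as Integer objects unless configured otherwise
    const toNum = (v: any): number => (typeof v === 'number' ? v : v.toNumber());

    const communities: Community[] = [];

    for (const record of result.records) {
      const communityId = toNum(record.get('communityId'));
      const nodes = record.get('nodes');

      // Density = relationships inside the community / possible node pairs.
      // Parallel relationships between the same pair can push this above 1.
      const nodeCount = nodes.length;
      const edgeCount = toNum(record.get('edgeCount'));
      const maxEdges = (nodeCount * (nodeCount - 1)) / 2;
      const density = maxEdges > 0 ? edgeCount / maxEdges : 0;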

      // Generate summary
      const nodeNames = nodes.map(n => n.properties.name).join(', ');
      const summary = await llm.generate({
        prompt: `Summarize this group in one sentence: ${nodeNames}`,
      });

      communities.push({
        id: communityId,
        entities: nodes.map(n => ({
          name: n.properties.name,
          type: n.labels[0],
        })),
        size: nodeCount,
        density,
        summary,
      });
    }

    return communities;
  } finally {
    await session.close();
  }
}

Communities capture higher-level structure. A "community" might represent a company, a research field, or a historical period.
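
The `density` field is the fraction of possible pairs inside a community that are actually connected. The sketch below computes it on plain in-memory data, without Neo4j, to make the formula concrete; `communityDensity` and the tuple edge format are assumptions for illustration.

```typescript
type Edge = [string, string]; // [source name, target name]

function communityDensity(members: string[], edges: Edge[]): number {
  const inCommunity = new Set(members);
  const n = inCommunity.size;
  if (n < 2) return 0; // no pairs possible

  // Count each undirected pair once, ignoring self-loops and edges that leave the community.
  const seenPairs = new Set<string>();
  for (const [a, b] of edges) {
    if (a === b || !inCommunity.has(a) || !inCommunity.has(b)) continue;
    seenPairs.add(a < b ? JSON.stringify([a, b]) : JSON.stringify([b, a]));
  }
  return seenPairs.size / ((n * (n - 1)) / 2);
}

// Three members, two connected pairs (A-B, B-C) out of three possible.
const density = communityDensity(
  ['A', 'B', 'C'],
  [['A', 'B'], ['B', 'A'], ['B', 'C'], ['C', 'D']]
);
```

A density near 1 means almost every member is directly linked to every other; a low value suggests a loose grouping whose one-sentence summary may be less reliable.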

Local vs Global Search in GraphRAG

Two retrieval strategies:

Local Search: For specific entities or relationships

async function localSearch(
  query: string,
  neo4j: Neo4jDriver,
  llm: LanguageModel
): Promise<string> {
  // Extract target entity from query using LLM
  const entityName = await llm.generate({
    prompt: `Extract the main entity from this query: "${query}"
    Return just the entity name.`,
  });

  const session = neo4j.session();

  try {
    // Find the entity and its immediate neighbors
    const result = await session.run(
      `
      MATCH (e { name: $entityName })
      OPTIONAL MATCH (e)-[r]-(neighbor)
      RETURN e, r, neighbor
      LIMIT 50
    `,
      { entityName: entityName.trim() }
    );

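    // Format as context for the LLM, skipping the empty row that
    // OPTIONAL MATCH returns when the entity has no neighbors
    const context = result.records
      .filter(record => record.get('r') !== null)
      .map(
        record =>
          `${record.get('e').properties.name} ${record.get('r').type} ${record.get('neighbor').properties.name}`
      )
      .join('\n');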

    return await llm.generate({
      prompt: `${query}\n\nContext:\n${context}`,
    });
  } finally {
    await session.close();
  }
}
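
The query in `localSearch` fetches a single hop around the entity. Questions that chain relationships need a wider neighborhood. The sketch below shows the idea on in-memory triples, assuming edges are treated as undirected the way `(e)-[r]-(neighbor)` does; `neighborhood` and the `Triple` shape are hypothetical, and the sample names are made up.

```typescript
interface Triple { source: string; type: string; target: string }

// Collect every triple reachable within maxHops of the start entity.
function neighborhood(triples: Triple[], start: string, maxHops: number): Triple[] {
  const visited = new Set<string>([start]);
  let frontier = new Set<string>([start]);
  const collected = new Set<Triple>();

  for (let hop = 0; hop < maxHops && frontier.size > 0; hop++) {
    const next = new Set<string>();
    for (const t of triples) {
      // A triple is in range if either endpoint is on the current frontier.
      if (!frontier.has(t.source) && !frontier.has(t.target)) continue;
      collected.add(t);
      for (const name of [t.source, t.target]) {
        if (!visited.has(name)) {
          visited.add(name);
          next.add(name); // expand from newly seen entities on the next hop
        }
      }
    }
    frontier = next;
  }
  return Array.from(collected);
}

const sampleTriples: Triple[] = [
  { source: 'Ada', type: 'WORKED_FOR', target: 'Acme' },
  { source: 'Acme', type: 'LOCATED_IN', target: 'Paris' },
  { source: 'Bob', type: 'FOUNDED', target: 'Zeta' },
];
const twoHops = neighborhood(sampleTriples, 'Ada', 2);
```

In Neo4j the same expansion would normally be written as a variable-length pattern in Cypher; the in-memory version only illustrates what is being retrieved. Context size grows quickly with each hop, so a small `maxHops` plus a result limit is the usual starting point.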

Global Search: For complex, multi-hop questions

async function globalSearch(
  query: string,
  neo4j: Neo4jDriver,
  llm: LanguageModel
): Promise<string> {
  // Run Leiden once and collect up to 10 member names per community.
  // Assumes a projected GDS graph named 'graph'. The first 10 communities are
  // taken as-is; nothing here ranks them by relevance to the query.
  const communityQuery = `
    CALL gds.leiden.stream('graph')
    YIELD nodeId, communityId
    WITH communityId, collect(gds.util.asNode(nodeId).name) AS names
    RETURN communityId, names[..10] AS memberNames
    LIMIT 10
  `;

  const session = neo4j.session();

  try {
    const communities = await session.run(communityQuery);

    // One line of member names per community stands in for a community summary
    const summaries = communities.records.map(record =>
      (record.get('memberNames') as string[]).join(', ')
    );

    // Generate answer from community summaries
    const contextStr = summaries.join('\n\n');
    return await llm.generate({
      prompt: `${query}\n\nContext (community summaries):\n${contextStr}`,
    });
  } finally {
    await session.close();
  }
}

Local search is fast and precise. Global search is slower but handles complex queries.

Combining Vector and Graph Retrieval

Hybrid retrieval uses both vector similarity (semantic) and graph structure (relationship-based).

interface HybridRetrievalResult {
  vectorChunks: DocumentChunk[];
  graphEntities: Entity[];
  combinedContext: string;
}

async function hybridRetrieval(
  query: string,
  vectorDb: VectorStore,
  neo4j: Neo4jDriver,
  llm: LanguageModel,
  weights: { vectorWeight: number; graphWeight: number } = {
    vectorWeight: 0.6,
    graphWeight: 0.4,
  }
): Promise<HybridRetrievalResult> {
  // Vector retrieval
  const vectorChunks = await vectorDb.search(query, { topK: 10 });

  // Graph retrieval: extract entities from query
  const entities = await llm.generate({
    prompt: `Extract entities (people, organizations) from: "${query}"
    Return as JSON array: ["entity1", "entity2", ...]`,
  });

  const entityList = JSON.parse(entities);
  const session = neo4j.session();

  let graphEntities: Entity[] = [];

  try {
    for (const entityName of entityList) {
      const result = await session.run(
        `
        MATCH (e { name: $name })-[r]-(related)
        RETURN e, r, related
        LIMIT 10
      `,
        { name: entityName }
      );

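      // Keep the matched entity and each neighbor once, so the context
      // covers related entities instead of repeating `e` per relationship
      for (const rec of result.records) {
        for (const node of [rec.get('e'), rec.get('related')]) {
          if (!graphEntities.some(g => g.name === node.properties.name)) {
            graphEntities.push({
              name: node.properties.name,
              type: node.labels[0],
            });
          }
        }
      }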
    }
  } finally {
    await session.close();
  }

  // Combine results
  const combinedContext = `
Vector retrieval results:
${vectorChunks.map(c => c.content).join('\n\n')}

Graph entities and relationships:
${graphEntities.map(e => `${e.name} (${e.type})`).join(', ')}
  `;

  return {
    vectorChunks,
    graphEntities,
    combinedContext,
  };
}
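
`hybridRetrieval` accepts a `weights` parameter but never applies it; the two result sets are simply concatenated. One way to use the weights is a weighted sum over items that both retrievers have scored. This is a sketch under assumptions: `fuseScores` and `ScoredItem` are hypothetical, scores are assumed to be normalized to [0, 1], and how a graph hit gets its score (for example, from hop distance) is left open because the article does not specify it.

```typescript
interface ScoredItem { id: string; score: number } // score assumed to be in [0, 1]

function fuseScores(
  vectorHits: ScoredItem[],
  graphHits: ScoredItem[],
  weights = { vectorWeight: 0.6, graphWeight: 0.4 }
): ScoredItem[] {
  const combined = new Map<string, number>();
  for (const { id, score } of vectorHits) {
    combined.set(id, (combined.get(id) ?? 0) + weights.vectorWeight * score);
  }
  for (const { id, score } of graphHits) {
    combined.set(id, (combined.get(id) ?? 0) + weights.graphWeight * score);
  }
  // Highest combined score first
  return Array.from(combined, ([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

const fused = fuseScores(
  [{ id: 'a', score: 0.9 }, { id: 'b', score: 0.5 }],
  [{ id: 'b', score: 1.0 }, { id: 'c', score: 0.8 }]
);
// 'b' ranks first because both retrievers found it.
```

Items found by both retrievers rise to the top, which is usually the behavior a hybrid setup is after.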

When GraphRAG Outperforms Standard RAG

GraphRAG wins on:

  1. Multi-hop reasoning: "Find all people who worked at company X then founded startups"
  2. Relationship-heavy domains: Organizational hierarchies, social networks, academic genealogies
  3. Disambiguation: "Trump" (person) vs "Trump card" (concept)
  4. Temporal reasoning: "CEOs hired in 2020 who left by 2022"
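
To make the first item concrete, here is a small in-memory sketch of a two-hop question: people who worked at a given company and also founded something. The `Fact` shape, the `alumniFounders` helper, and the sample names are made up for illustration. The sketch does not check that the founding happened after the employment, which would need dates on the relationships.

```typescript
interface Fact { source: string; type: string; target: string }

function alumniFounders(
  facts: Fact[],
  company: string
): Array<{ person: string; founded: string }> {
  // Hop 1: everyone with a WORKED_FOR edge into the company
  const alumni = new Set(
    facts
      .filter(f => f.type === 'WORKED_FOR' && f.target === company)
      .map(f => f.source)
  );
  // Hop 2: FOUNDED edges leaving those people
  return facts
    .filter(f => f.type === 'FOUNDED' && alumni.has(f.source))
    .map(f => ({ person: f.source, founded: f.target }));
}

const facts: Fact[] = [
  { source: 'Ada', type: 'WORKED_FOR', target: 'Acme' },
  { source: 'Ada', type: 'FOUNDED', target: 'Zeta' },
  { source: 'Bob', type: 'WORKED_FOR', target: 'Acme' },
  { source: 'Cy', type: 'FOUNDED', target: 'Yotta' },
];
const founders = alumniFounders(facts, 'Acme');
```

Chunk-based retrieval would need the employment and the founding to be mentioned in the same chunk, or both to rank highly for the same query; the graph joins them through the shared person node.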

Standard RAG is simpler and sufficient for:

  1. Fact-based retrieval: "What is photosynthesis?"
  2. Single-document QA
  3. When entities and relationships are sparse

Cost and Complexity Trade-Offs

| Factor | Standard RAG | GraphRAG |
| --- | --- | --- |
| Setup | Days | Weeks (extraction, graph construction) |
| Embedding cost | Lower (chunks only) | Higher (entities + relationships) |
| Query latency | ~50 ms | 100-500 ms (graph traversal) |
| Maintenance | Low | High (deduplication, schema evolution) |
| Accuracy on simple queries | 80% | 80% (no improvement) |
| Accuracy on complex queries | 40% | 75% (major improvement) |
Checklist

  • Use LLM for flexible entity and relationship extraction
  • Validate extracted entities (endpoints exist in graph)
  • Design Neo4j schema with clear node labels and relationship types
  • Detect communities for higher-level reasoning
  • Implement local search (specific entities) and global search (complex questions)
  • Combine vector and graph retrieval for hybrid strength
  • Monitor extraction quality; errors compound in graph
  • Measure improvement on complex, multi-hop questions
  • Start simple with vector RAG; add graph only if needed

Conclusion

GraphRAG adds structure to RAG: entities, relationships, and communities enable reasoning across documents. It fits best in domains with rich relationships (organizations, people, concepts). The cost is complexity: entity extraction, deduplication, and graph maintenance. Start with standard vector RAG. If you find yourself struggling with multi-hop questions, GraphRAG is worth the investment.