Published on

Building a Research Agent — Web Search, Summarization, and Report Generation

Authors

Introduction

Research agents gather information from multiple sources, synthesize findings, and generate comprehensive reports. Unlike simple web search, agents follow up on interesting findings, verify facts across sources, and detect contradictions. This post covers building research agents that produce trustworthy, well-cited reports.

Search Tool Integration

Integrate multiple search providers for redundancy and better coverage.

interface SearchResult {
  url: string;
  title: string;
  snippet: string;
  source: string;
  retrievedAt: number;
}

interface SearchProvider {
  name: string;
  search: (query: string, limit?: number) => Promise<SearchResult[]>;
}

class SearchOrchestrator {
  private providers: Map<string, SearchProvider> = new Map();

  registerProvider(provider: SearchProvider): void {
    this.providers.set(provider.name, provider);
  }

  async search(query: string, topK: number = 5): Promise<SearchResult[]> {
    // Search with multiple providers in parallel
    const allResults = await Promise.all(
      Array.from(this.providers.values()).map((provider) =>
        provider
          .search(query, topK)
          .catch((error) => {
            console.warn(`Provider ${provider.name} failed: ${error}`);
            return [];
          }),
      ),
    );

    // Merge and deduplicate
    const merged = allResults.flat();
    return this.deduplicateResults(merged).slice(0, topK);
  }

  private deduplicateResults(results: SearchResult[]): SearchResult[] {
    const seen = new Set<string>();
    const unique: SearchResult[] = [];

    for (const result of results) {
      const key = result.url.toLowerCase();

      if (!seen.has(key)) {
        seen.add(key);
        unique.push(result);
      }
    }

    return unique.sort((a, b) => b.retrievedAt - a.retrievedAt);
  }
}

// Example providers
const tavilyProvider: SearchProvider = {
  name: 'tavily',
  async search(query: string, limit: number = 10): Promise<SearchResult[]> {
    const response = await fetch('https://api.tavily.com/search', {
      method: 'POST',
      body: JSON.stringify({
        api_key: process.env.TAVILY_API_KEY,
        query,
        max_results: limit,
      }),
    });

    const data = (await response.json()) as any;

    return data.results.map((r: any) => ({
      url: r.url,
      title: r.title,
      snippet: r.snippet,
      source: 'tavily',
      retrievedAt: Date.now(),
    }));
  },
};

const googleProvider: SearchProvider = {
  name: 'google',
  async search(query: string, limit: number = 10): Promise<SearchResult[]> {
    const response = await fetch(
      `https://www.googleapis.com/customsearch/v1?q=${encodeURIComponent(query)}&key=${process.env.GOOGLE_API_KEY}&cx=${process.env.GOOGLE_CX}`,
    );

    const data = (await response.json()) as any;

    return (data.items || []).slice(0, limit).map((item: any) => ({
      url: item.link,
      title: item.title,
      snippet: item.snippet,
      source: 'google',
      retrievedAt: Date.now(),
    }));
  },
};

Multiple providers reduce bias and improve coverage.

Source Credibility Scoring

Not all sources are equally trustworthy.

interface SourceScore {
  url: string;
  credibility: number; // 0-1
  reasons: string[];
}

class SourceScorer {
  async scoreCredibility(url: string, snippet: string): Promise<SourceScore> {
    let score = 0.5; // Start neutral
    const reasons: string[] = [];

    // Check 1: Domain reputation
    const domain = new URL(url).hostname;

    if (this.isTrustedDomain(domain)) {
      score += 0.2;
      reasons.push('From trusted domain');
    } else if (this.isSuspiciousDomain(domain)) {
      score -= 0.3;
      reasons.push('From suspicious domain');
    }

    // Check 2: Academic/institutional sources
    if (domain.includes('.edu') || domain.includes('.gov') || domain.includes('.org')) {
      score += 0.15;
      reasons.push('Educational or institutional source');
    }

    // Check 3: Content analysis
    if (this.hasReferences(snippet)) {
      score += 0.1;
      reasons.push('Contains citations/references');
    }

    if (this.hasAuthorship(snippet)) {
      score += 0.05;
      reasons.push('Has author attribution');
    }

    // Check 4: Recency
    if (this.isRecent(url)) {
      score += 0.1;
      reasons.push('Recently updated');
    }

    // Check 5: Bias detection
    if (this.hasObviousBias(snippet)) {
      score -= 0.15;
      reasons.push('Possible bias detected');
    }

    return {
      url,
      credibility: Math.max(0, Math.min(1, score)),
      reasons,
    };
  }

  private isTrustedDomain(domain: string): boolean {
    const trusted = [
      'wikipedia.org',
      'bbc.com',
      'reuters.com',
      'apnews.com',
      'scholar.google.com',
      'github.com',
    ];

    return trusted.some((t) => domain.includes(t));
  }

  private isSuspiciousDomain(domain: string): boolean {
    return domain.includes('scam') || domain.includes('fake') || domain.length > 50;
  }

  private hasReferences(snippet: string): boolean {
    return /\[.*?\]|cite|reference|source/i.test(snippet);
  }

  private hasAuthorship(snippet: string): boolean {
    return /by\s+\w+|author|written by/i.test(snippet);
  }

  private isRecent(url: string): boolean {
    // Check if URL or domain suggests recent content
    const yearMatch = url.match(/202[0-9]/);
    return !!yearMatch;
  }

  private hasObviousBias(snippet: string): boolean {
    const biasIndicators = ['opinion', 'believes', 'obviously', 'clearly', 'naturally'];

    return biasIndicators.some((indicator) =>
      new RegExp(`\\b${indicator}\\b`, 'i').test(snippet),
    );
  }
}

Score sources to prioritize reliable information.

Deduplication and Clustering

Similar results from different sources should be clustered.

interface ResultCluster {
  topic: string;
  results: SearchResult[];
  consensus: string;
  contradictions: string[];
}

class ResultDeduplicate {
  async clusterResults(results: SearchResult[]): Promise<ResultCluster[]> {
    const clusters: Map<string, SearchResult[]> = new Map();

    for (const result of results) {
      // Extract topic from result
      const topic = await this.extractTopic(result.title, result.snippet);

      if (!clusters.has(topic)) {
        clusters.set(topic, []);
      }

      clusters.get(topic)!.push(result);
    }

    // Convert clusters to structured format
    const clustered: ResultCluster[] = [];

    for (const [topic, clusterResults] of clusters.entries()) {
      const consensus = await this.findConsensus(clusterResults);
      const contradictions = await this.findContradictions(clusterResults);

      clustered.push({
        topic,
        results: clusterResults,
        consensus,
        contradictions,
      });
    }

    return clustered;
  }

  private async extractTopic(title: string, snippet: string): Promise<string> {
    // Use LLM to extract main topic
    const prompt = `Extract the main topic (2-3 words) from this content:
Title: ${title}
Snippet: ${snippet}

Respond with just the topic, nothing else.`;

    return this.llmCall(prompt);
  }

  private async findConsensus(results: SearchResult[]): Promise<string> {
    // Find common theme across results
    const snippets = results.map((r) => r.snippet).join('\n\n');

    const prompt = `What's the consensus across these sources?

${snippets}

Summarize the common points in 1-2 sentences.`;

    return this.llmCall(prompt);
  }

  private async findContradictions(results: SearchResult[]): Promise<string[]> {
    if (results.length < 2) {
      return [];
    }

    const snippets = results.map((r) => `Source: ${r.url}\n${r.snippet}`).join('\n\n---\n\n');

    const prompt = `Do these sources contradict each other? List any contradictions found.

${snippets}

Return JSON: { "contradictions": ["contradiction 1", "contradiction 2"] }`;

    const response = await this.llmCall(prompt);

    try {
      const parsed = JSON.parse(response);
      return parsed.contradictions;
    } catch {
      return [];
    }
  }

  private async llmCall(prompt: string): Promise<string> {
    return '';
  }
}

Clustering reveals where sources agree and where they contradict.

Progressive Deepening

Follow up on interesting findings with deeper searches.

interface ResearchQuery {
  query: string;
  depth: number; // 1-5, how deep to go
  maxQueries: number; // Stop after this many queries
}

class ProgressiveResearcher {
  private queryCount: number = 0;
  private findings: Map<string, SearchResult[]> = new Map();

  async research(initialQuery: string, maxDepth: number = 3): Promise<Map<string, SearchResult[]>> {
    const queue: string[] = [initialQuery];
    this.queryCount = 0;

    while (queue.length > 0 && this.queryCount < 20) {
      const query = queue.shift()!;
      this.queryCount++;

      console.log(`Query ${this.queryCount}: ${query}`);

      // Search
      const results = await this.search(query);
      this.findings.set(query, results);

      // Extract follow-up questions
      const followUps = await this.generateFollowUpQueries(query, results);

      // Add to queue if we haven't hit max depth
      if (this.queryCount < 10) {
        queue.push(...followUps.slice(0, 2)); // Only top 2 follow-ups per query
      }
    }

    return this.findings;
  }

  private async search(query: string): Promise<SearchResult[]> {
    // Perform search
    return [];
  }

  private async generateFollowUpQueries(
    currentQuery: string,
    results: SearchResult[],
  ): Promise<string[]> {
    const prompt = `Based on these search results, what follow-up questions would deepen understanding?

Query: ${currentQuery}

Results:
${results.map((r) => `- ${r.title}: ${r.snippet}`).join('\n')}

Return JSON: { "followUpQueries": ["question 1", "question 2", "question 3"] }`;

    const response = await this.llmCall(prompt);

    try {
      const parsed = JSON.parse(response);
      return parsed.followUpQueries;
    } catch {
      return [];
    }
  }

  private async llmCall(prompt: string): Promise<string> {
    return '';
  }
}

Progressive deepening explores a topic systematically instead of stopping at surface results.

Citation Tracking

Keep track of sources for proper attribution.

interface Citation {
  url: string;
  title: string;
  source: string;
  accessedAt: number;
  quote?: string;
  pageNumber?: number;
}

class CitationManager {
  private citations: Citation[] = [];

  addCitation(result: SearchResult, quote?: string): void {
    this.citations.push({
      url: result.url,
      title: result.title,
      source: result.source,
      accessedAt: Date.now(),
      quote,
    });
  }

  generateBibliography(): string {
    // Sort by URL
    const sorted = this.citations.sort((a, b) => a.url.localeCompare(b.url));

    return sorted
      .map((c, i) => {
        const title = c.title || new URL(c.url).hostname;
        return `[${i + 1}] ${title}. Retrieved from ${c.url} on ${new Date(c.accessedAt).toLocaleDateString()}`;
      })
      .join('\n');
  }

  generateInlineReferences(text: string): string {
    // Replace fact mentions with citation markers
    let result = text;

    for (const citation of this.citations) {
      if (citation.quote) {
        const pattern = new RegExp(citation.quote, 'g');
        const index = this.citations.indexOf(citation) + 1;
        result = result.replace(pattern, `${citation.quote} [${index}]`);
      }
    }

    return result;
  }
}

Proper citations enable fact-checking and maintain transparency.

Fact Extraction

Extract factual claims from search results.

interface ExtractedFact {
  claim: string;
  source: string;
  credibility: number;
  evidence: string;
}

class FactExtractor {
  async extractFacts(results: SearchResult[], topic: string): Promise<ExtractedFact[]> {
    const facts: ExtractedFact[] = [];

    for (const result of results) {
      const prompt = `Extract factual claims from this source about "${topic}".

Source: ${result.title}
Content: ${result.snippet}

Return JSON: {
  "facts": [
    { "claim": "...", "evidence": "..." }
  ]
}`;

      const response = await this.llmCall(prompt);

      try {
        const parsed = JSON.parse(response);

        for (const fact of parsed.facts) {
          facts.push({
            claim: fact.claim,
            source: result.url,
            credibility: 0.7, // Would score based on source
            evidence: fact.evidence,
          });
        }
      } catch {
        // Skip malformed responses
      }
    }

    return facts;
  }

  async deduplicateFacts(facts: ExtractedFact[]): Promise<ExtractedFact[]> {
    // Group similar facts
    const groups: Map<string, ExtractedFact[]> = new Map();

    for (const fact of facts) {
      const normalized = this.normalizeClaim(fact.claim);

      if (!groups.has(normalized)) {
        groups.set(normalized, []);
      }

      groups.get(normalized)!.push(fact);
    }

    // Return deduplicated facts with best credibility
    const deduplicated: ExtractedFact[] = [];

    for (const group of groups.values()) {
      const best = group.reduce((a, b) => (a.credibility > b.credibility ? a : b));
      best.credibility = Math.min(1, best.credibility + (group.length - 1) * 0.05); // Boost for consensus

      deduplicated.push(best);
    }

    return deduplicated;
  }

  private normalizeClaim(claim: string): string {
    // Remove common variations to match similar claims
    return claim.toLowerCase().replace(/[^\w\s]/g, '').trim();
  }

  private async llmCall(prompt: string): Promise<string> {
    return '';
  }
}

Fact extraction and deduplication surfaces the most important findings.

Report Structuring

Organize findings into a coherent report.

interface ResearchReport {
  title: string;
  executive_summary: string;
  sections: Array<{
    heading: string;
    content: string;
    citations: Citation[];
  }>;
  contradictions: Array<{
    claim: string;
    sources: string[];
  }>;
  bibliography: string;
  methodology: string;
}

class ReportGenerator {
  async generate(
    topic: string,
    findings: Map<string, SearchResult[]>,
    facts: ExtractedFact[],
  ): Promise<ResearchReport> {
    const sections = await this.structureSections(topic, facts);
    const contradictions = await this.identifyContradictions(facts);
    const summary = await this.generateSummary(topic, facts);

    return {
      title: `Research Report: ${topic}`,
      executive_summary: summary,
      sections,
      contradictions,
      bibliography: 'Generated bibliography',
      methodology:
        'Data gathered from multiple web sources. Facts extracted and deduplicated. Sources scored for credibility.',
    };
  }

  private async structureSections(
    topic: string,
    facts: ExtractedFact[],
  ): Promise<ResearchReport['sections']> {
    // Group facts into logical sections
    const sections: ResearchReport['sections'] = [];

    const grouped = this.groupFactsByTheme(facts);

    for (const [theme, themeFacts] of grouped) {
      const content = themeFacts.map((f) => `- ${f.claim}`).join('\n');

      sections.push({
        heading: theme,
        content,
        citations: [], // Would map facts to citations
      });
    }

    return sections;
  }

  private groupFactsByTheme(
    facts: ExtractedFact[],
  ): Map<string, ExtractedFact[]> {
    // Use LLM to categorize facts into themes
    const groups: Map<string, ExtractedFact[]> = new Map();

    for (const fact of facts) {
      // Simple: group by first word
      const theme = fact.claim.split(' ')[0];

      if (!groups.has(theme)) {
        groups.set(theme, []);
      }

      groups.get(theme)!.push(fact);
    }

    return groups;
  }

  private async generateSummary(topic: string, facts: ExtractedFact[]): Promise<string> {
    const factSummary = facts.map((f) => f.claim).join(', ');

    const prompt = `Write a 2-3 sentence executive summary for this research topic.

Topic: ${topic}
Key findings: ${factSummary}`;

    return this.llmCall(prompt);
  }

  private async identifyContradictions(
    facts: ExtractedFact[],
  ): Promise<ResearchReport['contradictions']> {
    // Find facts that contradict each other
    return [];
  }

  private async llmCall(prompt: string): Promise<string> {
    return '';
  }
}

Well-structured reports are easy to read and fact-check.

Human Review Checkpoint

Before finalizing, have humans review key findings.

interface ReviewRequest {
  reportId: string;
  topic: string;
  contradictions: string[];
  unusualFindings: string[];
  facts: ExtractedFact[];
}

class HumanReviewCheckpoint {
  async requestReview(report: ResearchReport): Promise<void> {
    const request: ReviewRequest = {
      reportId: `report-${Date.now()}`,
      topic: report.title,
      contradictions: report.contradictions.map((c) => c.claim),
      unusualFindings: await this.identifyUnusualFindings(report),
      facts: [],
    };

    console.log(`Report ready for review: ${request.reportId}`);
    console.log(`Contradictions to verify: ${request.contradictions.join(', ')}`);
    console.log(`Unusual findings: ${request.unusualFindings.join(', ')}`);

    // In production: send to review queue
  }

  private async identifyUnusualFindings(report: ResearchReport): Promise<string[]> {
    // Findings that differ significantly from common knowledge
    return [];
  }
}

Human review catches errors and validates unusual claims.

Checklist

  • Search: integrate multiple providers for coverage
  • Scoring: rank sources by credibility
  • Deduplication: cluster similar results
  • Deepening: follow up on interesting findings
  • Citations: track all sources for attribution
  • Facts: extract and deduplicate key claims
  • Structure: organize findings into sections
  • Review: human verification of unusual claims

Conclusion

Research agents that systematically search, verify, and synthesize information produce better reports than single searches. Use multiple sources, score credibility, follow up on findings, and maintain proper citations. Humans should verify unusual claims and contradictions. The result is research you can trust.