Text Mining and Automated Taxonomy Management of Unstructured Data

Contributors
Upasna Doshi

Introduction

For venture capital firms and investors, timely access to accurate company data and market trends is critical for portfolio analysis and risk assessment. However, extracting actionable insights from unstructured news articles and maintaining up-to-date company taxonomies are labor-intensive, error-prone tasks. Manual processes often lead to delays, inconsistent categorization, and missed opportunities.

Analysts at the venture intelligence platform track private company data and distill it into insights that help subscribers spot promising business opportunities quickly. These analysts faced two key challenges:

  1. Analysts struggled to monitor a repository of 35 lakh (3.5 million) news articles to detect shifts in business offerings.
  2. Manual monitoring of taxonomy drift (e.g., shifts from "AI Services" to "AI Observability Platform") consumed 20+ hours/week.

The client partnered with Akaike to deploy an event-driven pipeline for automated taxonomy management.

The Problem

Inefficient Data Retrieval

  • Analysts manually searched media articles for insights like "List European fintech with recent leadership changes," taking hours to compile reports.
  • Legacy keyword searches missed contextual nuances (e.g., "Series B funding" vs "Series B products").

Static Taxonomies

  • Company categories (e.g., "HealthTech" vs "Telemedicine") became outdated as market narratives evolved.
  • Analysts manually tracked news articles to detect shifts, but consensus thresholds (e.g., 7/10 articles) were applied inconsistently.

The Solution

Akaike developed an event-driven pipeline that streamlines data retrieval and automates taxonomy updates.

Key Steps:

1. Information Retrieval:

  • Leveraged an internal database of 3.5 million curated articles.
  • Trained a binary classifier to filter irrelevant articles (e.g., missing company names or sectors), reducing the dataset to 30 lakh (3 million) relevant articles.
  • Created efficient index structures for rapid data access and query optimization for complex searches.
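
As a rough illustration of the filtering step, the sketch below trains a TF-IDF plus logistic-regression relevance classifier with scikit-learn. The model choice and the tiny labeled sample are assumptions; the case study does not name the classifier used.

```python
# Minimal relevance-filter sketch. Assumptions: scikit-learn, TF-IDF
# features, logistic regression; the production model is not specified.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples: 1 = relevant (names a company/sector), 0 = not.
train_texts = [
    "Acme Robotics raised a Series B round led by Example Ventures.",
    "Weather remained pleasant across the region this weekend.",
]
train_labels = [1, 0]

relevance_filter = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
relevance_filter.fit(train_texts, train_labels)

# Keep only articles the classifier flags as relevant.
articles = ["FinServe, a European fintech, appointed a new CEO."]
relevant = [a for a in articles if relevance_filter.predict([a])[0] == 1]
print(relevant)
```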

2. Data Preprocessing:

  • Cleaned and normalized text data to remove noise
  • Applied tokenization and lemmatization to standardize terms
  • Removed stop words and irrelevant content
  • Generated high-level article summaries to improve downstream processing
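
A minimal sketch of this preprocessing pass, assuming NLTK for stop words and lemmatization (the production stack is not named in the case study):

```python
# Illustrative cleaning/normalization pass. Assumption: NLTK; a simple
# regex-plus-split tokenizer stands in for the production tokenizer.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    """Lowercase, strip noise, tokenize, drop stop words, lemmatize."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [
        lemmatizer.lemmatize(tok)
        for tok in text.split()
        if tok not in STOP_WORDS and len(tok) > 1
    ]

print(preprocess("Acme's AI observability platform raised Series B funding."))
```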

3. Text Representation:

  • Implemented the Bag-of-Words (BoW) approach to represent document contents
  • Created term frequency matrices to capture document-term relationships
  • Applied TF-IDF weighting to highlight important terms
  • Developed domain-specific vocabulary for venture capital terminology
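
The representation step could look like the following scikit-learn sketch; the small domain vocabulary is a hypothetical stand-in for the curated VC terminology:

```python
# BoW term-frequency matrix plus TF-IDF weighting. Assumption: scikit-learn;
# the vocabulary below is an illustrative slice of a curated VC lexicon.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "acme raised series b funding for its ai observability platform",
    "fintech startup announced leadership change after seed funding",
]

vc_vocab = ["series b", "funding", "ai", "fintech", "seed", "leadership"]

bow = CountVectorizer(ngram_range=(1, 2), vocabulary=vc_vocab)
term_freq = bow.fit_transform(docs)                  # document-term counts
tfidf = TfidfTransformer().fit_transform(term_freq)  # down-weight common terms

print(bow.get_feature_names_out())
print(tfidf.toarray().round(2))
```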

4. Data Extraction:

  • Applied Association Rule Mining techniques to identify relationships
  • Used Apriori algorithm to discover frequent itemsets
  • Implemented FP-Growth for efficient pattern mining
  • Designed a custom architecture with three main modules:
    • Pre-processing module for text cleaning
    • Pattern discovery module for identifying relationships
    • Pattern analysis module for evaluating discovered patterns
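
A compact sketch of this mining step, assuming the mlxtend library (the case study does not name the implementation); each transaction is the set of tags extracted from one article:

```python
# Frequent-itemset mining with Apriori / FP-Growth. Assumptions: mlxtend;
# the transactions and thresholds below are illustrative.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

# One transaction per article: the terms/tags it mentions.
transactions = [
    {"AI", "high growth", "Series B"},
    {"AI", "high growth"},
    {"fintech", "Series B"},
    {"AI", "Series B", "high growth"},
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

frequent = fpgrowth(onehot, min_support=0.5, use_colnames=True)
# apriori(onehot, min_support=0.5, use_colnames=True) yields the same itemsets.
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```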

5. Frequent Pattern Mining:

  • Scanned data to count item occurrences and generated candidate patterns.
  • Deployed Flan-T5 for pattern analysis and benchmarked performance against GPT-3.5 and GPT-4 for accuracy and scalability.
  • Applied pruning techniques and confidence thresholds to retain significant associations.
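
The LLM pattern-analysis check might look like the following, assuming Hugging Face transformers with the public google/flan-t5-base checkpoint; the exact model size and prompt used in production are not specified:

```python
# Pattern-analysis sketch with Flan-T5. Assumptions: Hugging Face
# transformers and the google/flan-t5-base checkpoint; prompt wording
# is illustrative only.
from transformers import pipeline

analyzer = pipeline("text2text-generation", model="google/flan-t5-base")

pattern = "Companies mentioned with 'AI' also appear with 'high growth'"
prompt = (
    "Answer yes or no. Does this article support the pattern?\n"
    f"Pattern: {pattern}\n"
    "Article: Acme, an AI startup, reported rapid revenue growth this year."
)
print(analyzer(prompt, max_new_tokens=5)[0]["generated_text"])
```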

6. Data Analysis:

  • Presented results in a structured tabular format with columns including:
    • Company Name
    • Category Occurrence Count
    • Confidence Score
    • Support Metrics
    • Similarity Score
  • Generated CSV/JSON outputs for integration with existing tools
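
A small pandas sketch of the export step; the row values are illustrative placeholders matching the schema above, not real output:

```python
# Tabular results exported to CSV/JSON. Assumption: pandas.
import pandas as pd

results = pd.DataFrame([{
    "Company Name": "Acme Robotics",
    "Category Occurrence Count": 8,
    "Confidence Score": 0.92,
    "Support": 0.05,
    "Similarity Score": 0.87,
}])

results.to_csv("taxonomy_results.csv", index=False)
results.to_json("taxonomy_results.json", orient="records", indent=2)
```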

Impact & Results

Workflow Improvements:

  1. Data-Driven Portfolio Analysis: Investment analysts gained insights from an internal database through association rule mining, discovering patterns like "Companies mentioned with 'AI' also appear with 'high growth'" in seconds rather than hours.
  2. Automated Category Assignment: Reduced manual categorization by 90% by leveraging occurrence frequency and similarity scoring to assign companies to appropriate sectors (see the similarity-scoring sketch after this list).
  3. Pattern-Based Intelligence: Identified emerging market trends by analyzing co-occurrence patterns across multiple articles, enabling proactive investment decisions.
  4. Cost Efficiency: Cut per-article processing costs from 0.10 to 0.00029, reducing monthly data processing costs by 97% through optimized text mining and pattern discovery algorithms.
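
As referenced in item 2 above, category assignment can be sketched as cosine similarity between a company's article text and short sector descriptions; the sector blurbs and scoring setup here are assumptions, not the production method:

```python
# Similarity-scoring sketch for category assignment. Assumptions:
# scikit-learn TF-IDF + cosine similarity; sector descriptions are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sectors = {
    "HealthTech": "digital health medical software clinical devices",
    "Telemedicine": "remote consultations virtual care telehealth platform",
}
company_text = "startup offering virtual care and remote patient consultations"

vec = TfidfVectorizer().fit(list(sectors.values()) + [company_text])
scores = cosine_similarity(
    vec.transform([company_text]), vec.transform(list(sectors.values()))
)[0]

best_sector, best_score = max(zip(sectors, scores), key=lambda kv: kv[1])
print(best_sector, round(float(best_score), 2))
```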

Why It Worked

  • Binary Filtering: The classifier reduced noise by excluding 5 lakh (500,000) irrelevant articles.
  • LLM Benchmarking: Flan-T5 achieved 95% accuracy in pattern discovery, outperforming GPT-3.5 (88%) while being 10x cheaper.

  • Scalable Processing: The pipeline processed 35 lakh (3.5 million) articles in 800 hours (~1 second/article).

The Akaike Edge

Akaike's expertise lies in end-to-end management of the AI lifecycle, from problem identification and data collection through model deployment and ongoing monitoring, using industry-leading frameworks, tools, and libraries.