From Crawl Policies to Collections - Major Change in Sosse 1.14

The upcoming Sosse 1.14 release marks a significant evolution in how web crawling is configured and managed. We’re moving from the complex “Crawl Policy” system to a more intuitive Collections approach that puts user experience first.

⚠️ Important: We strongly recommend backing up your data before upgrading to Sosse 1.14.

Why the Change?

The old Crawl Policy system, while powerful, had several limitations that made it challenging for users:

  • Complex matching logic: URLs were matched against policy patterns in ways that were hard to predict
  • Single context per URL: Each URL could only belong to one policy, limiting flexibility
  • Unintuitive recursion logic: The depth and recursion rules were hard to understand and configure
  • Performance overhead: Complex pattern matching ran for every discovered URL

Introducing Collections

Collections represent a fundamental change in approach. Instead of asking “which policy matches this URL?”, we now ask “which collection should this content belong to?”.

Key Benefits

🎯 User-Friendly: Collections are conceptual groups that users understand intuitively. Instead of complex recursion logic, you define what content belongs together with clear regex categories.

🔄 Context Isolation: The same URL can now exist in multiple collections with different crawling behaviors. For example, a news article could be in both a “Daily News” collection (screenshots enabled, frequent recrawling) and an “Archive” collection (no screenshots, yearly recrawling).
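
To make the idea concrete, here is a minimal sketch of how one URL can carry two independent sets of crawl settings. The CrawlSettings class, its field names, and the recrawl intervals are hypothetical and chosen for illustration only; they are not Sosse's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlSettings:
    """Hypothetical per-collection settings, not Sosse's real model."""
    take_screenshots: bool
    recrawl_days: int


# Keying entries by (collection name, URL) instead of URL alone is what
# allows the same URL to exist once per collection.
index: dict[tuple[str, str], CrawlSettings] = {}

url = "https://news.example/article-1"
# Assumed values: "frequent" is taken as daily, "yearly" as 365 days.
index[("Daily News", url)] = CrawlSettings(take_screenshots=True, recrawl_days=1)
index[("Archive", url)] = CrawlSettings(take_screenshots=False, recrawl_days=365)
```

Each entry can then be scheduled and rendered according to its own collection, without the two affecting each other.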

⚡ Simplified Logic: No more complex recursion depth calculations. Each collection has clear unlimited and limited regex patterns that are easy to understand and maintain.

What’s New in Collections?

Enhanced Regex Control

Collections introduce three types of URL matching:

  • unlimited_regex: URLs that are always crawled, regardless of depth
  • limited_regex: URLs crawled only within the specified recursion depth
  • excluded_regex: URLs explicitly excluded from this collection
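
The way the three patterns could interact is sketched below. The three regex field names come from the list above; everything else is an assumption made for illustration: the Collection class, the recursion_depth field, the should_crawl helper, and in particular the choice to let excluded_regex take precedence over the other two.

```python
import re
from dataclasses import dataclass


@dataclass
class Collection:
    """Hypothetical stand-in for a Sosse collection, not the real model."""
    name: str
    unlimited_regex: str = ""
    limited_regex: str = ""
    excluded_regex: str = ""
    recursion_depth: int = 0  # assumed name for the depth limit


def _matches(pattern: str, url: str) -> bool:
    # In this sketch an empty pattern matches nothing.
    return bool(pattern) and re.search(pattern, url) is not None


def should_crawl(collection: Collection, url: str, depth: int) -> bool:
    """Return True if `url`, found `depth` links deep, should be crawled."""
    if _matches(collection.excluded_regex, url):
        return False  # assumption: exclusion wins over the other patterns
    if _matches(collection.unlimited_regex, url):
        return True  # always crawled, regardless of depth
    if _matches(collection.limited_regex, url):
        return depth <= collection.recursion_depth
    return False  # matches none of the collection's patterns


docs = Collection(
    name="Docs",
    unlimited_regex=r"^https://example\.org/docs/",
    limited_regex=r"^https://example\.org/",
    excluded_regex=r"\.pdf$",
    recursion_depth=2,
)
```

With this configuration, anything under /docs/ is crawled at any depth, the rest of the site is crawled only up to two links deep, and PDF files are skipped everywhere.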

Backward Compatibility

Collections now include flexible cross-collection queueing options for users who need advanced URL distribution:

Queue to any collection: When enabled, if a URL doesn’t match the current collection’s patterns, Sosse will check all other collections and queue it in the first matching one. A URL that matches any collection is therefore still queued, even when it was discovered while crawling a different one.

Queue to specific collections: For more granular control, you can select specific collections to check when a URL doesn’t match the current collection’s patterns. This is useful when you have related collections that should share certain types of content.

These per-collection options give you fine-grained control over how URLs are distributed across your indexing system.
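
The two queueing modes described above can be sketched as a small lookup. This is an illustration under stated assumptions, not Sosse's implementation: the Collection class, the queue_to_any and queue_to fields, and the pick_collection helper are hypothetical names, and a single pattern stands in for a collection's full set of regexes.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Collection:
    """Hypothetical, simplified collection: one regex instead of three."""
    name: str
    pattern: str
    queue_to_any: bool = False                          # "queue to any collection"
    queue_to: List[str] = field(default_factory=list)   # "queue to specific collections"

    def matches(self, url: str) -> bool:
        return re.search(self.pattern, url) is not None


def pick_collection(current: Collection, all_collections: List[Collection],
                    url: str) -> Optional[Collection]:
    """Return the collection that should queue `url`, or None to drop it."""
    if current.matches(url):
        return current
    if current.queue_to_any:
        candidates = [c for c in all_collections if c is not current]
    else:
        candidates = [c for c in all_collections if c.name in current.queue_to]
    for candidate in candidates:
        if candidate.matches(url):
            return candidate  # first matching collection wins
    return None


news = Collection("News", r"^https://news\.example/", queue_to=["Blogs"])
blogs = Collection("Blogs", r"^https://blog\.example/")
wiki = Collection("Wiki", r"^https://wiki\.example/")
everything = [news, blogs, wiki]
```

Here a blog link found while crawling "News" is handed to "Blogs", while a wiki link is dropped because "Wiki" is not among the collections "News" is allowed to queue to.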

Migration Process

The upgrade to Sosse 1.14 includes automatic data migration:

  1. Collection Creation: Policies are converted to collections with equivalent behavior
  2. Document Assignment: Existing documents are assigned to appropriate collections
  3. Configuration Update: New regex fields are populated based on existing policy rules

Important Notes

  • Behavior Changes: After migration, crawling behavior may differ slightly due to the new architecture
  • Testing Recommended: Test your crawling workflows in a development environment first
  • Configuration Review: Review your new collections and adjust patterns if needed; documents can be moved to different collections if necessary