TransWikia.com

Search for massive offline file collections

Software Recommendations Asked by Grismar on February 24, 2021

I’m looking for a solution that would allow indexing and search of large collections of files.

A brief description of our situation:

  • many (100+) engineers work on large projects, creating file collections of 5 GB to 50 TB of data per project, consisting of thousands to tens of millions of files;
  • we have a process where we archive projects that get delivered, as we have a legal obligation to keep files around, and sometimes a practical use for files from old projects;
  • to keep data manageable, projects that are no longer live get backed up to the cloud (AWS Glacier) as large tar archives.

I’m looking for a search solution that allows:

  • indexing of file collections before they are packaged and moved to the cloud;
  • adding further collections over time (and ideally removing old collections when they expire);
  • search across all collections by path, name and date modified; full-text search is not required, although it would be a ‘nice to have’ for some formats such as .docx and .pdf;
  • search ideally doesn’t need a local install, or if it does, the license cost isn’t prohibitive (i.e. licensing of the server / database / index, not the client if one is required); something like a web-based search interface would be ideal
  • search only has to be accessible on corporate LAN, so no strong requirements with regard to security or access restriction, although some group-based control would be a ‘nice to have’
  • support and maintenance by local, non-developer IT staff, i.e. an ‘off the shelf’ solution with sufficient support from a supplier
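To make the indexing and search requirements above concrete, here is a minimal sketch of the kind of metadata capture and lookup I have in mind. It is only an illustration of the data involved, not a proposed solution: the SQLite schema, the function names (open_index, index_collection, remove_collection, search) and the collection label are all hypothetical, and a single SQLite file is an assumption that may well not hold up at tens of millions of files.

```python
import os
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per file, keyed by collection and relative path.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    collection TEXT NOT NULL,
    path       TEXT NOT NULL,
    name       TEXT NOT NULL,
    size       INTEGER NOT NULL,
    modified   TEXT NOT NULL,
    PRIMARY KEY (collection, path)
);
CREATE INDEX IF NOT EXISTS idx_files_name ON files (name);
"""


def open_index(db_path):
    """Open (or create) the index database and make sure the schema exists."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def _walk_rows(collection, root):
    """Yield one metadata row per file under root, without reading file contents."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            try:
                st = os.stat(full)
            except OSError:
                continue  # skip files that vanish or cannot be read
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            modified = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat()
            yield (collection, rel, filename, st.st_size, modified)


def index_collection(conn, collection, root):
    """Index a collection before it is archived; re-indexing replaces old rows."""
    with conn:
        conn.execute("DELETE FROM files WHERE collection = ?", (collection,))
        conn.executemany(
            "INSERT INTO files VALUES (?, ?, ?, ?, ?)", _walk_rows(collection, root)
        )
    count = conn.execute(
        "SELECT COUNT(*) FROM files WHERE collection = ?", (collection,)
    ).fetchone()[0]
    return count


def remove_collection(conn, collection):
    """Drop an expired collection from the index; returns the number of rows removed."""
    with conn:
        cur = conn.execute("DELETE FROM files WHERE collection = ?", (collection,))
    return cur.rowcount


def search(conn, name_pattern):
    """Search all collections by file name using a SQL LIKE pattern such as '%.pdf'."""
    cur = conn.execute(
        "SELECT collection, path, modified FROM files "
        "WHERE name LIKE ? ORDER BY collection, path",
        (name_pattern,),
    )
    return cur.fetchall()
```

The point of the sketch is that the index rows survive independently of the files themselves: once a collection has been indexed, the originals can be packaged and moved to the cloud, and search(conn, "%.pdf") would still report which collection and path to restore.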

I’m aware of solutions like Everything, File Locator Pro (Agent Ransack), etc. None of these are bad, but they are often limited to individually (or per-seat) licensed desktop clients and tend to use proprietary or small-scale databases that don’t really scale to the level we need. I feel that everything I can find is aimed more at individuals, small teams or organisations that

Also, most of these solutions tend to focus on indexing live file collections and don’t deal well with keeping files in their indices after they’ve disappeared from their original location.

What I imagine the ideal solution looks like:

  • simple and user-friendly web-based front end for search in a robust search engine and storage back end, with tooling for indexing of file collections to be used by admins (i.e. no specific requirements on user-friendliness there).

Does anyone have suggestions, other than something like "roll your own based on Elastic"?
