The hard part is already solved: you don't even have to crawl the web to build the index. There is already a periodically refreshed index of the web that you can download: commoncrawl.org

Now someone just needs to configure Apache Lucene as a proper Docker image that can consume this index.
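
To make the idea a bit more concrete, here is a rough sketch of the "consume" step in plain Python. It assumes the input is one of Common Crawl's extracted-text (WET) files, which are WARC-style records with `WARC-Type: conversion`, and it uses a toy in-memory inverted index as a stand-in for Lucene. The function names (`iter_wet_records`, `build_index`, `search`, `make_record`) and the sample records are made up for illustration; this is not Lucene's API and not anything Common Crawl ships.

```python
import io
import re
from collections import defaultdict


def iter_wet_records(stream):
    """Yield (target_uri, text) for each 'conversion' record in an
    uncompressed WET byte stream (hypothetical minimal parser)."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The body is exactly Content-Length bytes long.
        body = stream.read(int(headers.get("content-length", 0)))
        if headers.get("warc-type") == "conversion":
            yield headers.get("warc-target-uri", ""), body.decode("utf-8", "replace")


def build_index(records):
    """Toy inverted index (token -> set of URLs); a stand-in for Lucene."""
    index = defaultdict(set)
    for uri, text in records:
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(uri)
    return index


def search(index, query):
    """Return the URLs that contain every token of the query (AND search)."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


def make_record(uri, text):
    """Build one fake WET-style record for the demo (invented sample data)."""
    body = text.encode("utf-8")
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: conversion\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


SAMPLE = (
    make_record("http://example.com/a", "Lucene builds inverted indexes")
    + make_record("http://example.com/b", "Common Crawl publishes web crawl data")
)
INDEX = build_index(iter_wet_records(io.BytesIO(SAMPLE)))
```

The real files are gzip-compressed, so in practice you would pass something like `gzip.open(path, "rb")` instead of the `io.BytesIO` sample. At real scale the in-memory dictionary obviously won't hold up, which is exactly where Lucene (or something built on it) would come in.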