The hard part is already solved: you don't even have to crawl the web to build the index. There is already a periodically refreshed index of the web that you can download:
commoncrawl.org
Now someone just needs to package Apache Lucene as a proper Docker image that can consume this index.
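For anyone who wants to try it, here is a rough sketch of the download side in Python. It assumes that each crawl ships a gzipped listing with one relative file path per line, and that the files are served over HTTPS from data.commoncrawl.org; check both against the site before relying on them. The function name `warc_urls` and the sample path are made up for illustration.

```python
import gzip
import io

# Assumption: Common Crawl files are served over HTTPS under this host.
BASE_URL = "https://data.commoncrawl.org/"


def warc_urls(paths_gz: bytes, limit: int = 5) -> list[str]:
    """Turn a gzipped path listing (one relative path per line) into full URLs."""
    with gzip.open(io.BytesIO(paths_gz), "rt", encoding="utf-8") as fh:
        # Skip blank lines so a trailing newline does not yield an empty path.
        paths = [line.strip() for line in fh if line.strip()]
    return [BASE_URL + path for path in paths[:limit]]


if __name__ == "__main__":
    # Hypothetical in-memory listing standing in for a downloaded one.
    sample = gzip.compress(b"crawl-data/EXAMPLE/segments/0/warc/a.warc.gz\n")
    for url in warc_urls(sample):
        print(url)
```

This only gets you the list of archive files. The files themselves are raw crawl data in WARC format, so the Lucene side would still have to parse the records and build its own inverted index from them.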