I have a page on GitLab pages, it’s built using the GitLab CI hosted on gitlab.com. I would like to run TypeSense docsearch scraper in the CI on the freshly built page, but I’m not sure if it can be done reliably.
An ideal solution would be to run the scraper on the fully deployed page, just tell it to scrape https://<user>.gitlab.io/<project>. This would be perfect, because the search index collects the absolute addresses, and it would work on the real ones. The problem is that the page would need to be fully published and available on the internet, which doesn’t seem to have a deterministic timing.
An alternative would be to host the newly built page in a CI service and link the containers so scraper accesses the service when sending requests to https://<user>.gitlab.io/<project>. This is tricky when HTTPS comes in, the scraper would need to be somehow told to ignore that the certificate is at best self-signed.