
One thing that is quite unfortunate with the state of SEO and the web in general today is that when I asked "what are the latest versions of common programming languages and when were they released?" a large number of the sources were "13 Tools You Should Learn Now" and the like. This might be a solvable problem within the search API they provide to the LLM, but for now I wouldn't trust current LLMs to be able to filter out these articles as less trustworthy than the official website of the programming language in question.


Given how many of those SEO spam sites are themselves generated by ChatGPT now, OpenAI can simply back-reference their own logs to find out which sites are probably SEO spam while everyone else is left guessing. That's vertical integration!


If they do that, that's a genius idea.


So it'll turn into yet another arms race - similar to captchas, cybersecurity, and nuclear weapons. SEO operators will use AI to fill fluff into AI-generated content (which is already being done).

It won't directly match ChatGPT logs, and OpenAI would just be pouring precious compute into a bottomless pit trying to partial-match.
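For what it's worth, partial matching at scale is usually done with shingling and set similarity rather than exact string comparison, so a light paraphrase still scores high against its source. A minimal sketch (all text and thresholds illustrative, not how OpenAI actually does it):

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Overlapping k-word windows, the usual unit for near-duplicate detection."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# A lightly paraphrased copy still shares most shingles with its source,
# while unrelated text shares almost none.
original = "the latest version of python was released in october with many new features"
paraphrase = "the latest version of python was released in october with several new features"
unrelated = "serve a different page to every crawler and hope nobody compares notes"

print(jaccard(shingles(original), shingles(paraphrase)))  # well above the unrelated pair
print(jaccard(shingles(original), shingles(unrelated)))   # 0.0
```

Real systems use MinHash/LSH on top of this so they never compare every pair, which is exactly the compute question being raised here.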


Serve Claude-generated version to OpenAI bots. Serve OpenAI-generated version to Claude bots. Problem solved.

Serve users a random version and A/B test along the way.
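Cloaking by crawler is trivial to sketch - and just as trivially detected by crawling from an undisclosed user agent. The bot tokens below are the published ones, but the variant filenames are hypothetical:

```python
import random

# Hypothetical per-crawler page variants; GPTBot and ClaudeBot are the
# user-agent tokens the crawlers publish.
VARIANTS = {
    "GPTBot": "claude_generated.html",   # serve the rival model's output
    "ClaudeBot": "openai_generated.html",
}

def pick_variant(user_agent: str) -> str:
    for token, page in VARIANTS.items():
        if token in user_agent:
            return page
    # Human visitors get a random version for A/B testing.
    return random.choice(["variant_a.html", "variant_b.html"])

print(pick_variant("Mozilla/5.0 (compatible; GPTBot/1.2)"))  # claude_generated.html
```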


Then you are still left with self-hosted models, which are pretty good at this task.


Or offer two search results when they suspect one is spam and see which one a user likes and train off of that, just the way they do now with ChatGPT.


I’m sure they will be more subtle than that otherwise it will get circumvented.

I’m sure they will/are tackling this at the model level: train the model to generate good completions while also embedding signals that make its text easy to separate from human-written text.
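There is published work in exactly this direction - e.g. "green list" watermarking (Kirchenbauer et al., 2023), where the generator biases each token toward a pseudorandom vocabulary subset seeded by the previous token, and a detector recomputes those subsets and counts hits. A toy sketch under those assumptions (tiny fake vocabulary, bias exaggerated to 100%):

```python
import random

VOCAB = [f"w{i}" for i in range(100)]  # toy vocabulary

def green_list(prev_token: str, fraction: float = 0.5) -> set[str]:
    """Pseudorandom 'green' subset of the vocab, seeded by the previous token."""
    rng = random.Random(prev_token)
    return set(rng.sample(VOCAB, int(len(VOCAB) * fraction)))

def generate(n: int, seed: str = "start") -> list[str]:
    """Watermarked generator: always picks green tokens (real models only bias toward them)."""
    out, prev = [], seed
    rng = random.Random(42)
    for _ in range(n):
        tok = rng.choice(sorted(green_list(prev)))
        out.append(tok)
        prev = tok
    return out

def green_fraction(tokens: list[str], seed: str = "start") -> float:
    """Detector: recompute each green list and count hits; ~0.5 expected for unmarked text."""
    prev, hits = seed, 0
    for tok in tokens:
        hits += tok in green_list(prev)
        prev = tok
    return hits / len(tokens)

marked = generate(200)
print(green_fraction(marked))       # 1.0 in this toy (generator is always green)

rng = random.Random(7)
random_text = [rng.choice(VOCAB) for _ in range(200)]
print(green_fraction(random_text))  # hovers around 0.5
```

The pessimist's point below survives this sketch: paraphrasing changes the previous-token seeds, which is precisely what degrades the detector.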


Personally, I'm a pessimist on this front. People assert that a model-in-training can effortlessly sift out the real data from mountains of LLM spam. But then people also assert that AI detectors do not work and can never work, since LLM output is simply too good, and any watermarking can be broken up by a light paraphrasing step. It doesn't make much sense to have it both ways.

I can only await companies' attempts to publish enough junk to create an 'alternative truth' for new LLMs to believe in. The worst part is, it might even work.


Would someone even want to circumvent it though? Most sites won't care very much about encouraging scrapers to include them in LLM training data, it's not like you get paid.


If your website is created to promote a product of course you are incentivized to be included.


> when I asked "what are the latest versions of common programming languages and when were they released?"

The issue is with the query itself. You're assuming that there's some oracle that will understand your question and surface the relevant information for you. Most likely, it will use the words themselves as part of the query, which SEO sites will exploit.

A more pragmatic search workflow would be to just search for "most common programming languages used" [0], then use the Wikipedia page to get the relevant information [1]. Much more legwork, but with sources. And still quite fast.

[0]: (Screenshot) https://ibb.co/ggBLy8G

[1]: (Screenshot) https://ibb.co/H4g5bDf


It would be very odd if I have to speak to an LLM the same way I “speak” to Google. The point of this whole feature seems to be that the LLM chooses when and how to search to best respond to the user.


>what are the latest versions of common programming languages and when were they released?

is this a real question you needed an answer to, or a hypothetical you posed to test the quality of search results?

of course you're going to get listicles for a query like that, because it sounds like a query specifically chosen to find low-quality listicles.


It is a proxy for questions that pop up as part of other sessions in my daily use. The LLM chooses its query itself, and as long as it is not fine-tuned to avoid the “listicles”, or its underlying search engine doesn't put more weight on factual sources, I don’t think the answer quality will be very high. It would be weird and redundant if I had to talk to the LLM as if it’s an old-school search engine, wouldn’t it?


I honestly doubt there exists an actual reputable resource that has it all on the same page. Each language tracks its own latest version(s). Wikipedia tracks latest versions for a variety of software, but it's spread across different pages.
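The closest thing to a single machine-readable page is probably Wikidata, which models "software version identifier" as property P348 (with P577, publication date, as a qualifier), queryable over SPARQL. A sketch that only builds the query string - the endpoint and the Python item ID Q28865 are my recollection, so verify before relying on them:

```python
def latest_version_query(qid: str) -> str:
    """SPARQL for the 'software version identifier' (P348) statements of one Wikidata item."""
    return f"""
SELECT ?version ?date WHERE {{
  wd:{qid} p:P348 ?stmt .
  ?stmt ps:P348 ?version .
  OPTIONAL {{ ?stmt pq:P577 ?date . }}
}}
ORDER BY DESC(?date)
"""

# Q28865 is (I believe) the item for the Python language; POST this query to
# https://query.wikidata.org/sparql with Accept: application/sparql-results+json.
print(latest_version_query("Q28865"))
```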


Technically that list exists inside of the two big version managers!


You are the best kind of correct.


this is why I pay for Kagi. granted those results still come up, but you can block specific domains from ever appearing in the results and configure how listicles are displayed.


How many can you block and filter manually? 10? 100? 10k? Who will test sites for the blocklist? The domain block feature is great, but unless it's a collaborative list it's not gonna be super effective.


It’s super effective for me because I just block stuff as things pop up that I don’t want. I’ve also added more weight to certain domains that I want more results from. I wouldn’t want anyone touching my config, it’s mine and it works great!


.... Test sites for the blocklist? What?

Also they do share the most blocked/raised/lowered etc sites: https://kagi.com/stats?stat=leaderboard

We've had this problem of "good defaults" before with ad-tracker domain blocklists. I'm sure it'll be sooner rather than later that some community lists become popular and begin being followed en masse.


I meant your average user can test a handful of sites to see if they are SEO spam or good sites, but a single search returns 10+ results, and even more when a user searches multiple things, multiple times a day. The average user doesn't have the time to vet that many websites.


1,000


Kagi is admittedly pretty great for this.


As an alternative, ublacklist is free and open-source.


The average SERP has 10 results. What if all 10 match your blacklist? Not to mention you can't do anything if the engine doesn't search deeper.


You probably have to browse to the next page or refine the search terms.
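That paging loop is roughly what a client-side tool like ublacklist has to do: drop blocked domains and keep fetching pages until enough results survive. A sketch where `fetch_page` stands in for a hypothetical search-API call and the blocklist domains are made up:

```python
from typing import Callable

BLOCKLIST = {"example-listicle.com", "seo-farm.net"}  # hypothetical domains

def domain(url: str) -> str:
    return url.split("/")[2]

def filtered_results(fetch_page: Callable[[int], list[str]],
                     wanted: int = 10, max_pages: int = 5) -> list[str]:
    """Fetch pages, drop blacklisted domains, stop once `wanted` results survive."""
    kept: list[str] = []
    for page in range(1, max_pages + 1):
        for url in fetch_page(page):
            if domain(url) not in BLOCKLIST:
                kept.append(url)
                if len(kept) == wanted:
                    return kept
    return kept  # the engine ran out of pages before we had enough

# Fake engine where every page is half spam:
def fake_fetch(page: int) -> list[str]:
    return [f"https://example-listicle.com/{page}-{i}" for i in range(5)] + \
           [f"https://real-site.org/{page}-{i}" for i in range(5)]

print(len(filtered_results(fake_fetch)))  # 10, pulled from two pages
```

Which illustrates the objection upthread: if the engine only exposes one page, the loop has nothing left to fetch.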


This was a nice counterexample to other queries that seemed quite decent.

I got:

  Programming Language Latest Version Release Date
  Python 3.12.0 October 2, 2024 ???!!!!!
  Java SE 22 March 2024
  JavaScript ECMAScript 2024 June 2024
  C++ C++23 December 2023
  C# 12.0 September 2024
  Ruby 3.3.0 December 25, 2023
  PHP 8.3.0 November 2023
  Swift 6.0 September 2024
  Go 1.22 August 2024  !!!
  Rust 1.75 October 2024 !!!!
  Kotlin 2.0.0 May 21, 2024 !!!
This is pretty bad. (???!!! added by me)

However, my follow up query "Provide primary web page for each language listed above" was quite decent:

Here are the primary websites for the programming languages mentioned:

  Python: https://www.python.org/
  Java: https://www.oracle.com/java/
  JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript
  C++: https://isocpp.org/
  C#: https://learn.microsoft.com/en-us/dotnet/csharp/
  Ruby: https://www.ruby-lang.org/en/
  PHP: https://www.php.net/
  Swift: https://www.swift.org/
  Go: https://go.dev/
  Rust: https://www.rust-lang.org/
  Kotlin: https://kotlinlang.org/
The problem was with the 3rd query "Provide latest version for each language as mentioned on the primary website for that language."

That brought back the first result basically unchanged.

So certainly this is a work in progress but very promising.
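One way to cross-check a table like the first one is against endoflife.date, which exposes per-product JSON at https://endoflife.date/api/<product>.json. The payload below is illustrative and the field names are from memory, so confirm them against a live response before trusting this:

```python
import json

def latest_cycle(payload: str) -> tuple[str, str]:
    """Pick the newest release cycle from an endoflife.date-style JSON payload."""
    cycles = json.loads(payload)
    newest = max(cycles, key=lambda c: c["latestReleaseDate"])
    return newest["latest"], newest["latestReleaseDate"]

# Illustrative payload only -- fetch the real one from
# https://endoflife.date/api/python.json and verify the field names.
SAMPLE = json.dumps([
    {"cycle": "3.11", "latest": "3.11.5", "latestReleaseDate": "2023-08-24"},
    {"cycle": "3.12", "latest": "3.12.0", "latestReleaseDate": "2023-10-02"},
])
print(latest_cycle(SAMPLE))  # ('3.12.0', '2023-10-02')
```

ISO dates compare correctly as strings, which is why the bare `max` works here.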


SEO spam is always going to focus on the biggest market, and by doing so they can be completely transparent and obvious to whoever they're not trying to fool.

I'd assume right now the SEO target is still mainly Google rather than ChatGPT, but that's only an "I recon" not a citation.

If and when ChatGPT does become the main target for SEO spam, then Googling may start giving good results again.


Wouldn't it be "I reckon"? :-)


D'oh, yes. :)


This is the next step for SEO: be able to game ChatGPT prompts trying to filter out SEO crap...


How do you think people will try to game AI-based search?


By buying openAI


For Java I got:

As of October 31, 2024, the latest version of Java is Java 23, released on September 17, 2024. The most recent Long-Term Support (LTS) version is Java 21, released on September 19, 2023.

Which all seems correct and accurate.


Googling for this query is fast, and instantly surfaces the download link. That seems pretty useful...


Yeah, I did also find it to be mostly accurate. However, seeing the sources, I felt like I kind of had to check all the languages just in case it picked up information from a random "X ways to do Y" article that might not have been prioritizing accuracy. And for this search query I did see several languages' actual websites, but I did another very similar query earlier where 9 out of 12 results were numbered-list articles clearly intended for SEO. 2 of them were actual official sites. And 1 was what appears to be a decent attempt at talking about programming languages (i.e. not SEO only).


This is because it is referencing and regurgitating Wikipedia articles.


Nope. It had found an Oracle page for the LTS date and an OpenJDK page for the latest version.



