
One thing that is quite unfortunate with the state of SEO and the web in general today is that when I asked "what are the latest versions of common programming languages and when were they released?" a large number of the sources were "13 Tools You Should Learn Now" and the like. This might be a solvable problem within the search API they provide to the LLM, but for now I wouldn't trust current LLMs to be able to filter out these articles as less trustworthy than the official website of the programming language in question.


Given how many of those SEO spam sites are themselves generated by ChatGPT now, OpenAI can simply back-reference their own logs to find out which sites are probably SEO spam while everyone else is left guessing. That's vertical integration!


If they do that, that's a genius idea.


So it'll turn into yet another arms race - similar to captchas, cybersecurity, and nuclear weapons. SEO operators will use AI to fill fluff into AI-generated content (which is already being done).

It won't directly match ChatGPT logs, and OpenAI would just be pouring precious compute into a bottomless pit trying to partial-match.
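For what it's worth, partial matching at scale is usually done with shingling and set similarity rather than exact string comparison, so a light paraphrase still scores high against its source. A minimal sketch (all text and thresholds illustrative, not how OpenAI actually does it):

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Overlapping k-word windows, the usual unit for near-duplicate detection."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# A lightly paraphrased copy still shares most shingles with its source,
# while unrelated text shares almost none.
original = "the latest version of python was released in october with many new features"
paraphrase = "the latest version of python was released in october with several new features"
unrelated = "serve a different page to every crawler and hope nobody compares notes"

print(jaccard(shingles(original), shingles(paraphrase)))  # well above the unrelated pair
print(jaccard(shingles(original), shingles(unrelated)))   # 0.0
```

Real systems use MinHash/LSH on top of this so they never compare every pair, which is exactly the compute question being raised here.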


Serve Claude-generated version to OpenAI bots. Serve OpenAI-generated version to Claude bots. Problem solved.

Serve users a random version and A/B test along the way.
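Cloaking by crawler is trivial to sketch - and just as trivially detected by crawling from an undisclosed user agent. The bot tokens below are the published ones, but the variant filenames are hypothetical:

```python
import random

# Hypothetical per-crawler page variants; GPTBot and ClaudeBot are the
# user-agent tokens the crawlers publish.
VARIANTS = {
    "GPTBot": "claude_generated.html",   # serve the rival model's output
    "ClaudeBot": "openai_generated.html",
}

def pick_variant(user_agent: str) -> str:
    for token, page in VARIANTS.items():
        if token in user_agent:
            return page
    # Human visitors get a random version for A/B testing.
    return random.choice(["variant_a.html", "variant_b.html"])

print(pick_variant("Mozilla/5.0 (compatible; GPTBot/1.2)"))  # claude_generated.html
```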


Then you are still left with self-hosted models, which are pretty good at this task.


Or offer two search results when they suspect one is spam and see which one a user likes and train off of that, just the way they do now with ChatGPT.


I’m sure they will be more subtle than that otherwise it will get circumvented.

I’m sure they will/are tackling this at the model level: train the model to generate good completions while also embedding signals that make its text easy to separate from human-written text.
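There is published work in exactly this direction - e.g. "green list" watermarking (Kirchenbauer et al., 2023), where the generator biases each token toward a pseudorandom vocabulary subset seeded by the previous token, and a detector recomputes those subsets and counts hits. A toy sketch under those assumptions (tiny fake vocabulary, bias exaggerated to 100%):

```python
import random

VOCAB = [f"w{i}" for i in range(100)]  # toy vocabulary

def green_list(prev_token: str, fraction: float = 0.5) -> set[str]:
    """Pseudorandom 'green' subset of the vocab, seeded by the previous token."""
    rng = random.Random(prev_token)
    return set(rng.sample(VOCAB, int(len(VOCAB) * fraction)))

def generate(n: int, seed: str = "start") -> list[str]:
    """Watermarked generator: always picks green tokens (real models only bias toward them)."""
    out, prev = [], seed
    rng = random.Random(42)
    for _ in range(n):
        tok = rng.choice(sorted(green_list(prev)))
        out.append(tok)
        prev = tok
    return out

def green_fraction(tokens: list[str], seed: str = "start") -> float:
    """Detector: recompute each green list and count hits; ~0.5 expected for unmarked text."""
    prev, hits = seed, 0
    for tok in tokens:
        hits += tok in green_list(prev)
        prev = tok
    return hits / len(tokens)

marked = generate(200)
print(green_fraction(marked))       # 1.0 in this toy (generator is always green)

rng = random.Random(7)
random_text = [rng.choice(VOCAB) for _ in range(200)]
print(green_fraction(random_text))  # hovers around 0.5
```

The pessimist's point below survives this sketch: paraphrasing changes the previous-token seeds, which is precisely what degrades the detector.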


Personally, I'm a pessimist on this front. People assert that a model-in-training can effortlessly sift out the real data from mountains of LLM spam. But then people also assert that AI detectors do not work and can never work, since LLM output is simply too good, and any watermarking can be broken up by a light paraphrasing step. It doesn't make much sense to have it both ways.

I can only await companies' attempts to publish enough junk to create an 'alternative truth' for new LLMs to believe in. The worst part is, it might even work.


Would someone even want to circumvent it though? Most sites won't care very much about encouraging scrapers to include them in LLM training data, it's not like you get paid.


If your website is created to promote a product of course you are incentivized to be included.


> when I asked "what are the latest versions of common programming languages and when were they released?"

The issue is with the query itself. You're assuming that there's some oracle that will understand your question and surface the relevant information for you. Most likely, it will use the words themselves as part of the query, which SEO sites will exploit.

A more pragmatic search workflow would be to just search for "most common programming languages used" [0], then use the Wikipedia page to get the relevant information [1]. Much more legwork, but with sources. And still quite fast.

[0]: (Screenshot) https://ibb.co/ggBLy8G

[1]: (Screenshot) https://ibb.co/H4g5bDf


It would be very odd if I have to speak to an LLM the same way I “speak” to Google. The point of this whole feature seems to be that the LLM chooses when and how to search to best respond to the user.


>what are the latest versions of common programming languages and when were they released?

is this a real question you needed an answer to, or a hypothetical you posed to test the quality of search results?

of course you're going to get listicles for a query like that, because it sounds like a query specifically chosen to find low-quality listicles.


It is a proxy for questions that pop up as part of other sessions in my daily use. The LLM chooses its query itself, and as long as it is not fine-tuned to avoid the “listicles”, or its underlying search engine doesn't put more weight on factual sources, I don’t think the answer quality will be very high. It would be weird and redundant if I had to talk to the LLM as if it’s an old-school search engine, wouldn’t it?


I honestly doubt there exists an actual reputable resource that has it all on the same page. Each language tracks its own latest version(s). Wikipedia tracks latest versions for a variety of software, but it's spread across different pages.
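The closest thing to a single machine-readable page is probably Wikidata, which models "software version identifier" as property P348 (with P577, publication date, as a qualifier), queryable over SPARQL. A sketch that only builds the query string - the endpoint and the Python item ID Q28865 are my recollection, so verify before relying on them:

```python
def latest_version_query(qid: str) -> str:
    """SPARQL for the 'software version identifier' (P348) statements of one Wikidata item."""
    return f"""
SELECT ?version ?date WHERE {{
  wd:{qid} p:P348 ?stmt .
  ?stmt ps:P348 ?version .
  OPTIONAL {{ ?stmt pq:P577 ?date . }}
}}
ORDER BY DESC(?date)
"""

# Q28865 is (I believe) the item for the Python language; POST this query to
# https://query.wikidata.org/sparql with Accept: application/sparql-results+json.
print(latest_version_query("Q28865"))
```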


Technically that list exists inside of the two big version managers!


You are the best kind of correct.


this is why I pay for Kagi. granted those results still come up, but you can block specific domains from ever appearing in the results and configure how listicles are displayed.


How many can you block and filter manually? 10? 100? 10k? Who will test sites for the blocklist? The domain block feature is great, but unless it's a collaborative list it's not gonna be super effective.


It’s super effective for me because I just block stuff as things pop up that I don’t want. I’ve also added more weight to certain domains that I want more results from. I wouldn’t want anyone touching my config, it’s mine and it works great!


.... Test sites for the blocklist? What?

Also they do share the most blocked/raised/lowered etc sites: https://kagi.com/stats?stat=leaderboard

We've had this problem of "good defaults" before with ad-tracker domain blocklists. I'm sure it'll be sooner rather than later that some community lists become popular and begin being followed en masse.


I meant your average user can test a handful of sites to see if they are SEO spam or good sites, but a single search returns 10+ results, and even more when a user searches multiple things, multiple times a day. The average user doesn't have the time to vet that many websites.


1,000


Kagi is admittedly pretty great for this.


As an alternative, ublacklist is free and open-source.


The average SERP has 10 results. What if all 10 match your blacklist? Not to mention you can't do anything if the engine doesn't search deeper.


You probably have to browse to the next page or refine the search terms.
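That paging loop is roughly what a client-side tool like ublacklist has to do: drop blocked domains and keep fetching pages until enough results survive. A sketch where `fetch_page` stands in for a hypothetical search-API call and the blocklist domains are made up:

```python
from typing import Callable

BLOCKLIST = {"example-listicle.com", "seo-farm.net"}  # hypothetical domains

def domain(url: str) -> str:
    return url.split("/")[2]

def filtered_results(fetch_page: Callable[[int], list[str]],
                     wanted: int = 10, max_pages: int = 5) -> list[str]:
    """Fetch pages, drop blacklisted domains, stop once `wanted` results survive."""
    kept: list[str] = []
    for page in range(1, max_pages + 1):
        for url in fetch_page(page):
            if domain(url) not in BLOCKLIST:
                kept.append(url)
                if len(kept) == wanted:
                    return kept
    return kept  # the engine ran out of pages before we had enough

# Fake engine where every page is half spam:
def fake_fetch(page: int) -> list[str]:
    return [f"https://example-listicle.com/{page}-{i}" for i in range(5)] + \
           [f"https://real-site.org/{page}-{i}" for i in range(5)]

print(len(filtered_results(fake_fetch)))  # 10, pulled from two pages
```

Which illustrates the objection upthread: if the engine only exposes one page, the loop has nothing left to fetch.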


This was a nice counterexample to other queries that seemed quite decent.

I got:

  Programming Language Latest Version Release Date
  Python 3.12.0 October 2, 2024 ???!!!!!
  Java SE 22 March 2024
  JavaScript ECMAScript 2024 June 2024
  C++ C++23 December 2023
  C# 12.0 September 2024
  Ruby 3.3.0 December 25, 2023
  PHP 8.3.0 November 2023
  Swift 6.0 September 2024
  Go 1.22 August 2024  !!!
  Rust 1.75 October 2024 !!!!
  Kotlin 2.0.0 May 21, 2024 !!!
This is pretty bad. (???!!! added by me)

However, my follow up query "Provide primary web page for each language listed above" was quite decent:

Here are the primary websites for the programming languages mentioned:

  Python: https://www.python.org/
  Java: https://www.oracle.com/java/
  JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript
  C++: https://isocpp.org/
  C#: https://learn.microsoft.com/en-us/dotnet/csharp/
  Ruby: https://www.ruby-lang.org/en/
  PHP: https://www.php.net/
  Swift: https://www.swift.org/
  Go: https://go.dev/
  Rust: https://www.rust-lang.org/
  Kotlin: https://kotlinlang.org/
The problem was with the 3rd query "Provide latest version for each language as mentioned on the primary website for that language."

That brought back the first result basically unchanged.

So certainly this is a work in progress but very promising.
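One way to cross-check a table like the first one is against endoflife.date, which exposes per-product JSON at https://endoflife.date/api/<product>.json. The payload below is illustrative and the field names are from memory, so confirm them against a live response before trusting this:

```python
import json

def latest_cycle(payload: str) -> tuple[str, str]:
    """Pick the newest release cycle from an endoflife.date-style JSON payload."""
    cycles = json.loads(payload)
    newest = max(cycles, key=lambda c: c["latestReleaseDate"])
    return newest["latest"], newest["latestReleaseDate"]

# Illustrative payload only -- fetch the real one from
# https://endoflife.date/api/python.json and verify the field names.
SAMPLE = json.dumps([
    {"cycle": "3.11", "latest": "3.11.5", "latestReleaseDate": "2023-08-24"},
    {"cycle": "3.12", "latest": "3.12.0", "latestReleaseDate": "2023-10-02"},
])
print(latest_cycle(SAMPLE))  # ('3.12.0', '2023-10-02')
```

ISO dates compare correctly as strings, which is why the bare `max` works here.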


SEO spam is always going to focus on the biggest market, and by doing so they can be completely transparent and obvious to whoever they're not trying to fool.

I'd assume right now the SEO target is still mainly Google rather than ChatGPT, but that's only an "I recon" not a citation.

If and when ChatGPT does become the main target for SEO spam, then Googling may start giving good results again.


Wouldn't it be "I reckon"? :-)


D'oh, yes. :)


This is the next step for SEO: be able to game ChatGPT prompts trying to filter out SEO crap...


How do you think people will try to game AI-based search?


By buying openAI


For Java I got:

As of October 31, 2024, the latest version of Java is Java 23, released on September 17, 2024. The most recent Long-Term Support (LTS) version is Java 21, released on September 19, 2023.

Which all seems correct and accurate.


Googling for this query is fast, and instantly surfaces the download link. That seems pretty useful...


Yeah, I did also find it to be mostly accurate. However, seeing the sources, I felt like I kind of had to check all the languages just in case it picked up information from a random "X ways to do Y" article that might not have been prioritizing accuracy. And for this search query I did see several languages' actual websites, but I did another very similar query earlier where 9 out of 12 results were numbered-list articles clearly intended for SEO. 2 of them were actual official sites. And 1 was what appears to be a decent attempt at talking about programming languages (i.e. not SEO only).


This is because it is referencing and regurgitating Wikipedia articles.


Nope. It had found an Oracle page for the LTS date and an OpenJDK page for the latest version.



