New from 404 Media: Bluesky may have said it won't use user data to train generative AI, but someone else just published a dataset of one million Bluesky posts for "machine learning research". The dataset is already very popular, and your data may have been scraped 404media.co/someone-made-a-dat…
Someone Made a Dataset of One Million Bluesky Posts for 'Machine Learning Research'
A Hugging Face employee made a huge dataset of Bluesky posts, and it’s already very popular.
Samantha Cole (404 Media)
reshared this
AlgoCompSynth by znmeb 🇺🇦
In reply to Joseph Cox • • •
Stu
In reply to Joseph Cox • • •
Tom Walker
In reply to Stu • • •
Evan Prodromou
In reply to Tom Walker • • •
amd
In reply to Evan Prodromou • • •
Evan Prodromou
In reply to amd • • •
Martijn Vos
In reply to Evan Prodromou • •
Is follower-only a thing on ActivityPub? I thought everything was unencrypted and basically accessible to any server that receives it.
Strypey
In reply to Martijn Vos • • •
@mcv
> Is follower-only a thing on ActivityPub?
Yes. They're called Followers posts in Mastodon now.
> I thought everything was unencrypted and basically accessible to any server that receives it.
Yes. There's been all sorts of research into things like Object Capabilities, that could force receiving servers to do what they're told. But for now, not displaying Specific People posts publicly, or showing Followers(-only) posts only to followers, is an unenforceable handshake agreement.
@evan
Hyolobrika (left) likes this.
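The "handshake agreement" described above can be pictured roughly as follows. This is a minimal Python sketch, not Mastodon's actual code: the function names, the audience labels, and the example.social URLs are assumptions made up for illustration. It relies only on ActivityPub's `to`/`cc` addressing fields and the special Public collection URI.

```python
# Illustrative only: how a receiving server might classify a post's audience
# from its ActivityPub addressing. Nothing here is enforced by the protocol.
PUBLIC = "https://www.w3.org/ns/activitystreams#Public"


def _as_set(value) -> set:
    """ActivityPub allows a single string or a list in `to`/`cc`."""
    if value is None:
        return set()
    if isinstance(value, str):
        return {value}
    return set(value)


def classify_audience(activity: dict, followers_url: str) -> str:
    """Return a rough audience label (hypothetical names) for an activity."""
    to = _as_set(activity.get("to"))
    cc = _as_set(activity.get("cc"))
    if PUBLIC in to:
        return "public"
    if PUBLIC in cc:
        return "unlisted"
    if followers_url in to | cc:
        return "followers-only"
    return "specific-people"


def may_display(activity: dict, followers_url: str, viewer_follows: bool) -> bool:
    """A well-behaved server hides followers-only posts from non-followers.

    A misbehaving server can simply skip this check: it already holds the
    plaintext, which is why the agreement is unenforceable.
    """
    audience = classify_audience(activity, followers_url)
    if audience in ("public", "unlisted"):
        return True
    if audience == "followers-only":
        return viewer_follows
    return False  # specific-people posts need a per-recipient check, omitted


followers = "https://example.social/users/alice/followers"
post = {
    "type": "Create",
    "actor": "https://example.social/users/alice",
    "to": [followers],
    "cc": [],
}
print(classify_audience(post, followers))
print(may_display(post, followers, viewer_follows=False))
```

The point of the sketch is that the visibility decision lives entirely in the receiving server's own code, so it only holds if that server chooses to run it.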
Martijn Vos
In reply to Strypey • •
@Strypey @Evan Prodromou
So it only works as intended if all your followers are on servers that obey the agreement. Accidentally following someone on a questionable server can create a leak.
Is it completely out of the question to add some encryption to the protocol for this sort of situation?
Strypey
In reply to Martijn Vos • • •
(1/?)
@mcv
> it only works as intended if all your followers are on servers
...whose admins and software obey the agreement ("protocol"), yes.
For now, if you want decentralised *and* E2EE, you need to switch to another network for that. Options include federated networks like XMPP+OMEMO or Matrix, or a P2P one like Jami or Tox.
If you're wanting something noob-friendly, I'd go for Element (Matrix) or Snikket (XMPP).
(Full disclosure: I've done paid contracting for Snikket)
@evanprodromou
Strypey
In reply to Strypey • • •
(2/?)
@mcv
> Is it completely out of the question to add some encryption to the protocol for this sort of situation?
Opinions vary. IMHO Mastodon's decision to clone Twitter DMs was a mistake. I'm inclined to think it's wiser to clearly separate social publishing from private communications, by having them in separate apps (with some kind of SSO so we can use the same account in both).
I've even suggested using AutoCrypt, so I can check my fediverse DMs in Delta Chat;
codeberg.org/fediverse/fediver…
Direct Messages as emails
Codeberg.org
Strypey
In reply to Strypey • • •
(3/?)
But I suspect I'm in the minority. Most people seem to want the fediverse to be a thneed (see The Lorax), and work of various kinds is underway on bringing E2EE to the verse. Apparently there's a taskforce working on it;
socialhub.activitypub.rocks/t/…
... which might be this one?
github.com/swicg/activitypub-e…
Then there's @soatok@furry.engineer's efforts;
soatok.blog/2024/09/13/e2ee-fo…
(this covers some policy issues as well as technical ones and is well worth a read)
End-to-End Encryption (E2EE) Task Force Meeting - Jul 19, 2024
SocialHub
Soatok
2024-09-13 10:25:12
Strypey
In reply to Strypey • • •
(4/4)
@helge posted a Fediverse Idea on an E2EE AP messenger last year;
codeberg.org/fediverse/fediver…
... and @dansup of PixelFed and loops.video announced Sup messenger about a year ago;
wedistribute.org/2023/08/sup-b…
The MIMI working group at the IETF included the possibility of using AP in their investigations;
bifurcation.github.io/mimi-aim…
So in summary, a lot is going on all over the place. Maybe we need to get all these folks in a room?
Sup is a New Messaging App by the Creator of Pixelfed
Sean Tilley (We Distribute)
Martijn Vos
In reply to Strypey • •
@Strypey
I still think the combination of usenet+email was the perfect integration of public and private communication. They weren't completely separate; usenet posts included the email address of the author.
I'd like to see something modular, like maybe XMPP+ActivityPub, or something like that.
But that still doesn't address semi-public communication, like to all your followers, or to a specific group/circle/aspect, that still guarantees (through encryption) that it's only to that group.
I imagine everybody would automatically publish their public key as part of their profile. A limited message would be encrypted with its own key, and for each authorized recipient it would carry an attachment containing that key encrypted with the recipient's public key. Of course that could get pretty heavy with posts to lots of users, but servers could throw away attached keys that aren't for any of their own users.
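The server-side pruning step in that idea can be sketched as below. This is only a rough Python illustration under stated assumptions: `Envelope`, `prune_for_server`, and the example actor URLs are hypothetical names, not part of ActivityPub or any existing implementation. The actual encryption is deliberately left out, so the ciphertext and the per-recipient wrapped keys are opaque placeholder bytes.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Envelope:
    """Hypothetical container for one limited-audience post."""
    ciphertext: bytes  # post body, encrypted once with a per-message key
    # recipient actor URL -> per-message key encrypted to that recipient
    wrapped_keys: dict[str, bytes] = field(default_factory=dict)


def prune_for_server(envelope: Envelope, local_domain: str) -> Envelope:
    """Keep only the wrapped keys addressed to users on `local_domain`.

    The ciphertext is untouched; a server without any matching key keeps
    nothing it could hand to its users for decryption.
    """
    kept = {
        actor: key
        for actor, key in envelope.wrapped_keys.items()
        if urlparse(actor).hostname == local_domain
    }
    return Envelope(envelope.ciphertext, kept)


# Placeholder bytes stand in for real ciphertext and wrapped keys.
env = Envelope(
    ciphertext=b"<encrypted post body>",
    wrapped_keys={
        "https://a.example/users/bob": b"<key for bob>",
        "https://b.example/users/carol": b"<key for carol>",
        "https://b.example/users/dave": b"<key for dave>",
    },
)
pruned = prune_for_server(env, "a.example")
print(sorted(pruned.wrapped_keys))
```

The sender's envelope grows by one wrapped key per recipient, which is the "pretty heavy" part; after pruning, each server stores only the entries for its own users.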
Strypey
In reply to Martijn Vos • • •
@mcv
> But that still doesn't address semi-public communication, like to all your followers, or to a specific group/circle/aspect
E2EE private groups are the core of Matrix, to the point that DMs are just groups with only 2 members. Delta Chat can encrypt groups with AutoCrypt, and I believe XMPP can encrypt private groups too, with MUC+OMEMO.
But from what I've read, the new MLS standard is key to doing E2EE groups efficiently. Devs from all 3 protocol networks are working on implementations.
Schicke Schicke Schweine
In reply to Strypey • • •
@strypey @mcv @evanprodromou
Sadly nobody can use it, and self host. The installer script is very out of date, and broken.
github.com/snikket-im/snikket-…
init.sh is out of date. docker-compose no longer exists. Users can't install snikket self hosted · Issue #14 · snikket-im/snikket-selfhosted
GitHub
Strypey
In reply to Schicke Schicke Schweine • • •
@SchickeSchickeSchweine
> Sadly nobody can use it, and self host. The installer script is very out of date, and broken
I know @snikket_im are keen to support a range of installation options. But I suspect progress is being slowed by funding challenges. The same thing I'm told is making it harder to deliver Matrix 2.0, MLS support, etc (funding challenges for Element).
@mcv @evanprodromou
Schicke Schicke Schweine
In reply to Strypey • • •
Strypey
In reply to Schicke Schicke Schweine • • •
@SchickeSchickeSchweine
> It turns out they updated the installer script almost immediately and I tried it and my server installed just like that
I love it when this happens! One advantage of being able to @mention projects when we complain about them ; )
@snikket_im @mcv @evanprodromou
Strypey
In reply to Strypey • • •
@mcv
> There's been all sorts of research into things like Object Capabilities
FYI @cwebber wrote a bit about this, with some links for further reading, towards the end of her insightful analysis of Bluesky/ATProto;
dustycloud.org/blog/how-decent…
@evan
How decentralized is Bluesky really? -- Dustycloud Brainstorms
dustycloud.org
mxk
In reply to Tom Walker • • •
mxk
Unknown parent post • • •
Set up a local instance with a few sock puppets running some type of spider algorithm following random people on huge instances
Tom Walker
In reply to mxk • • •
@mxk @slims Yeah, a bit more effort than "curl the firehose" but not that much more.
But it's probably a worse dataset overall, in a way that's helpful to us, because the median fedi post is... kind of weird? Not what you would want to train an LLM on surely
Joseph Cox
In reply to Joseph Cox • • •
Zuthal
In reply to Joseph Cox • • •
even if they don't actively train their own models, anything that is public-facing on the internet is bound to be gobbled up by the scrapers :/
you just can't reliably protect against those while allowing anonymous human visitors to see the content