🎄ThursdAI - LAION down, OpenChat beats GPT3.5, Apple is showing where it's going, Midjourney v6 is here & Suno can make music!
Description
Hey everyone, happy ThursdAI!
As always, here's a list of things we covered this week, including show notes and links, to prepare you for the holidays.
TL;DR of all topics covered:
* Open Source AI
  * OpenChat-3.5-1210 - a top performing open source 7B model from the OpenChat team, beating GPT3.5 and Grok (link, HF, Demo)
  * LAION 5B dataset taken down due to CSAM allegations from Stanford (link, full report pdf)
  * FLASK - New evaluation framework from KAIST - based on skillset (link)
    * Shows a larger difference between open/closed source
    * Open leaderboard reliability issues, vibes benchmarks and more
  * HF releases a bunch of MLX-ready models (Llama, Phi, Mistral, Mixtral) (link)
  * New transformer alternative architectures - Hyena & Mamba are heating up (link)
* Big CO LLMs + APIs
  * Apple - LLM in a flash paper is making the rounds (AK, Takeaways thread)
  * Anthropic adheres to the messages API format (X)
  * Microsoft Copilot finally has plugins (X)
* Voice & Audio
  * AI music generation: Suno is now part of Microsoft Copilot plugins and creates long, beautiful songs (link)
* AI Art & Diffusion
  * Midjourney v6 is out - better text, great at following instructions (link)
Open Source AI
We start today with a topic I didn't expect to be covering: the LAION 5B dataset was taken down after a report from the Stanford Internet Observatory found instances of CSAM (Child Sexual Abuse Material) in the vast dataset. The report identified hundreds to thousands of instances of images of this sort, and used Microsoft's PhotoDNA to identify the images by their hashes, working from a sample of images marked NSFW.
LAION 5B was used to train Stable Diffusion: versions 1.4 and 1.5 were trained on a lot of images from that dataset, whereas SD2, for example, was trained only on images not marked as NSFW. The report is very thorough and walks through the methodology used to find and check those types of images. It's worth noting that LAION 5B itself is not an image dataset, as it contains only links to images and their descriptions from alt tags.
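To make the "links, not images" point concrete, here is a minimal sketch of what a link-plus-alt-text record and an "only keep rows not marked NSFW" filter could look like. Everything in it is an assumption for illustration: the `LinkRecord` class, its field names, the `"UNLIKELY"` / `"NSFW"` label values and the `keep_unflagged` helper are hypothetical, and this is not the actual LAION schema or the filtering pipeline used for any Stable Diffusion release.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LinkRecord:
    """One row of a link-only dataset (hypothetical field names)."""
    url: str        # where the image lives on the web; no pixels are stored
    alt_text: str   # description taken from the page's alt tag
    nsfw_flag: str  # hypothetical label: "UNLIKELY", "UNSURE" or "NSFW"


def keep_unflagged(records: Iterable[LinkRecord]) -> List[LinkRecord]:
    """Keep only rows whose flag says NSFW content is unlikely."""
    return [r for r in records if r.nsfw_flag == "UNLIKELY"]


# Made-up sample rows, purely for illustration.
sample = [
    LinkRecord("https://example.com/cat.jpg", "a cat on a sofa", "UNLIKELY"),
    LinkRecord("https://example.com/a.jpg", "unclear caption", "UNSURE"),
    LinkRecord("https://example.com/b.jpg", "flagged caption", "NSFW"),
]

filtered = keep_unflagged(sample)
print(len(filtered), filtered[0].alt_text)
```

Note that a filter like this only trusts whatever label is already on each row, which is a different thing from the hash matching the report describes.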
Obviously this is a very touchy topic, given the way this dataset was scraped from the web and how many image models were trained on it. The report doesn't allege anything close to influence on the models trained on it, and outlines a few methods of preventing issues l