You (and Ed, who I very much respect) are correct: DeepSeek software is open source*. But from the jump, their app and official server instance were plagued with security holes - most likely accidental ones, since they were harmful to DeepSeek itself! - and naturally their app sends data to China because China is where the app is from.
I do find it pretty funny that they were also sending data to servers in the US though. This isn’t a China issue, it’s a privacy/design issue, and even after they resolve the security holes they still receive your data. Same as OpenAI, same as every other AI company.
* DeepSeek releases genuinely open source code for everything except the models themselves, which already exceeds industry standards. The models can be downloaded and used without restriction, and that counts as “open” according to the OSI, but most people would say the models themselves aren’t really open. I don’t think they’re open either. But again, DeepSeek has gone above and beyond industry standards, and that is why "Open"AI is angry at them.
I’m sure OpenAI will just “borrow” their code and a new release will magically be just as efficient
Heh. OpenAI already accused them of training off ChatGPT’s output. Hilarious if true (because fuck OpenAI), but impressive if false.
If you don’t want your locally hosted DeepSeek to send data to China, just set up a firewall and you’re done
It’s actually easier than that - the model itself is just a file of weights that gets loaded into open-source software, usually made by American companies. It doesn’t connect to anything… because it can’t. It’s like an MP4 or a JPEG file.
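(If anyone’s curious what that looks like in practice, here’s a minimal sketch using the llama-cpp-python library, with a made-up local file path for one of the distilled models. The point is that the “model” is just a big file on disk that the software reads.)

```python
# Minimal sketch: load a locally downloaded model file and run it.
# Assumes llama-cpp-python is installed and you already have a GGUF file
# on disk; the path below is made up.
from llama_cpp import Llama

llm = Llama(
    model_path="./some-distilled-model.Q4_K_M.gguf",  # just a file, like an MP4
    n_ctx=4096,  # context window size
)

# Everything happens locally; the library only reads the file and your prompt.
out = llm("Explain what a GGUF file is, in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```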
I know, I’m just saying that if the software (not the dataset) calls home, you can either modify the software, which is hard to do right, or just block it with a firewall, which is easy
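(And if you want to be extra sure without touching your router, you can even do the blocking inside the process itself. This is just an illustrative sketch - the environment variables are real Hugging Face offline switches, the socket trick is a blunt instrument - not anyone’s official method.)

```python
# Paranoid-mode sketch: make outbound connections impossible in this process
# before any AI library is imported. Not a substitute for a real firewall,
# just a demonstration that local inference needs no network at all.
import os
import socket

class _NoNetworkSocket(socket.socket):
    def connect(self, *args, **kwargs):
        raise RuntimeError(f"blocked outbound connection attempt: {args}")

socket.socket = _NoNetworkSocket  # anything trying to phone home now fails loudly

os.environ["HF_HUB_OFFLINE"] = "1"        # real switch: Hugging Face hub stays offline
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # real switch: transformers stays offline

# ...now import and run your local model as usual; if the software tries to
# call out, you get a RuntimeError instead of silent traffic.
```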
True. I don’t know much about their software, though. They’ve released so much stuff over a short amount of time, I’m having a hard time keeping track
Exactly.
But Ed said you could also use your own models, train them yourself.
From my own fractured understanding, this is indeed true, but the “DeepSeek” everybody is excited about, which performs as well as OpenAI’s best products but faster, is a prebuilt flagship model called R1. (Benchmarks here.)
The training data will never see the light of day. It would be an archive of every ebook under the sun, every scraped website, just copyright infringement as far as the eye can see. That would be the source they would have to release to be open source, and I doubt they would.
But DeepSeek does have the code for “distilling” other companies’ more complex models into something smaller and faster (and a bit worse) - though, of course, the input models themselves are not open source, because those models (like Facebook’s restrictively licensed Llama) were also trained on stolen data. (I’ve downloaded a couple of these distillations just to mess around with them. It feels like having a dumber, slower ChatGPT in a terminal.)
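(For anyone wondering what “distilling” means mechanically: a small “student” model is trained to imitate a big “teacher” model’s outputs. Below is a toy sketch of the textbook logit-matching version in PyTorch - not DeepSeek’s actual recipe, just the general idea.)

```python
# Toy knowledge-distillation loss (textbook version, not DeepSeek's pipeline).
# The student is trained to match the teacher's softened output distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: how the big model spreads probability across tokens.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Hard targets: ordinary cross-entropy against the real training labels.
    ce = F.cross_entropy(student_logits, labels)

    # Blend the two; alpha controls how much the student copies the teacher.
    return alpha * kd + (1 - alpha) * ce

# Usage sketch (teacher frozen, student trainable):
#   with torch.no_grad():
#       teacher_logits = teacher(input_ids)
#   student_logits = student(input_ids)
#   loss = distillation_loss(student_logits, teacher_logits, labels)
#   loss.backward(); optimizer.step()
```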
Theoretically, you could train a model using DeepSeek’s open source code and ethically sourced input data, but that would be quite the task. Most people just fine-tune an existing model on an extra layer of their own training data and call it a day. Here’s one such example (I hate it.) I can’t even imagine how much data you would have to create yourself in order to train one of these things from scratch. George R.R. Martin himself probably couldn’t train an AI to speak in a comprehensible manner by feeding it his life’s work.
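(That “extra layer of training data” step is just fine-tuning: take an already-trained open model and nudge it with your own examples. Here’s a rough sketch with the Hugging Face transformers and datasets libraries - placeholder model name and data file, not a real recipe.)

```python
# Rough fine-tuning sketch (placeholder names, not a real recipe).
# This is the "add your own layer of training data" step, not training from scratch.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "some-open-base-model"  # placeholder: any causal LM you trust
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Your own, hopefully ethically sourced, text.
data = load_dataset("text", data_files={"train": "my_own_writing.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = data["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("./finetuned")  # your "extra layer" checkpoint ends up here
```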