Korean researchers stole the headlines this week when they released a paper outlining how they trained a Large Language Model (LLM) solely on dark web data. The model is based on Google's BERT architecture and is called DarkBERT.
The dark web is renowned for being the sleazy underbelly of the internet. The creators say they want to use the model to help fight cybercrime, but should we be worried about the new AI created by crawling the dark web?
Large Language Models
According to Tech Target, large language models are AI algorithms that learn to understand, summarize, and generate content by training on vast amounts of data from the internet.
OpenAI’s ChatGPT is the most successful LLM to date and has revolutionized the tech industry since its release in November 2022.
Until now, LLMs such as ChatGPT and Google’s Bard have trained on the open web, the “standard” internet, and are multi-purpose tools. DarkBERT, by contrast, is task-specific and trained entirely on the dark web.
According to Investopedia, the dark web is the internet’s underworld which contains dark and often illegal content. It’s rife with criminal activity and is a playground for hackers and cybercriminals.
Users need specific software to access the dark web – sites are encrypted to provide anonymity and are not indexed by search engines. As a result, people use it as a marketplace for things like leaked data, weapons, drugs, and extreme porn.
As reported by Futurism, the Korean researchers connected their crawler to the Tor network, one of the most popular ways to access the dark web, to build the model’s training dataset.
The crawler collected data for 16 days, and the researchers sorted it into two sets, raw and preprocessed, filtering the raw data for ethical reasons.
In the yet-to-be-peer-reviewed paper, titled “DarkBERT: A Language Model for the Dark Side of the Internet”, the researchers say:
“Our automated web crawler takes the approach of removing any non-text media and only stores raw text data. By doing so, we do not expose ourselves to any sensitive media that is potentially illegal.”
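The text-only filtering the quote describes can be sketched with Python’s standard-library `html.parser`. This is an illustration of the general approach, not the authors’ actual crawler; the class name and sample page below are hypothetical:

```python
from html.parser import HTMLParser

class TextOnlyExtractor(HTMLParser):
    """Keeps only the raw text of an HTML page, dropping tags,
    scripts, styles, and embedded media like images or video."""
    SKIP = {"script", "style"}  # content inside these tags is discarded

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside skipped tags and non-empty
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

page = ("<html><body><h1>Leaked data</h1><script>alert(1)</script>"
        "<img src='x.png'><p>for sale</p></body></html>")
parser = TextOnlyExtractor()
parser.feed(page)
print(parser.text())  # only the visible text survives; media tags are ignored
```

Because `img` and `video` elements carry no text content, simply keeping text nodes is enough to avoid storing the media itself.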
According to Dexerto, DarkBERT is based on Google’s BERT framework, which Facebook built on to create RoBERTa in 2019. DarkBERT is the result of further training RoBERTa on the dark web data – and it outperforms other language models on dark-web-related tasks.
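BERT-style models such as RoBERTa are pretrained with a masked-language-model objective: a fraction of the input tokens is hidden and the model learns to predict them from the surrounding context. A toy Python sketch of just the masking step is shown below; the function and its parameters are illustrative, not taken from the DarkBERT paper:

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", prob=0.15, seed=0):
    """Randomly replaces a fraction of tokens with a mask token.
    Returns (masked_sequence, labels): labels holds the original token
    at each masked position and None elsewhere -- these are the targets
    a BERT-style model learns to predict during pretraining."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "credentials for sale on hidden forum".split()
masked, labels = mask_tokens(tokens, prob=0.3)
print(masked)
print(labels)
```

Training on dark-web text rather than open-web text changes only the data fed into this objective, which is why the resulting model picks up the vocabulary and phrasing peculiar to that domain.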
The Future of DarkBERT
DarkBERT sounds sinister, but the research team say they will use the model for the greater good. They hope it will deepen our understanding of the dark web and help fight cybercrime.
The team intends to put the model through further training and decide exactly how to use it.
There are no plans to release DarkBERT to the public, and the training data is not publicly accessible due to its sensitive nature, though academics can request access for educational purposes.
Many experts agree that DarkBERT may help make the internet safer, but there are many concerns about the overall safety of AI. AI can be unpredictable, and tech industry leaders are concerned about its rapid development.
While the full potential of AI is still unknown, should we really teach it about the darkest side of human nature?