Generative AI as a whole is powered by large and complex deep learning models pre-trained on vast amounts of data, commonly referred to as Foundation Models (FMs).
LLMs are a variant of these Foundation Models that have been trained specifically on massive amounts of text data, including books, articles, websites, and code.
LLMs use complex and sophisticated statistical models to analyze vast datasets, identify patterns and connections in the data between words and phrases, and leverage these to eventually generate completely new text.
LLMs can perform a wide range of tasks, such as generating text and other content, based on what they learn from massive datasets. These datasets usually consist of large troves of information taken from the internet, which runs the risk of personal data being included in the model's training set.
With the massive increase in accessibility and adoption of these large language models, several LLM data protection and privacy concerns arise.
The journey of LLMs began with early neural network experiments in the 1950s.
The pioneering step forward, however, was the creation of Eliza, a primitive chatbot, by MIT researcher Joseph Weizenbaum in 1966. Eliza highlighted the potential of natural language processing (NLP). Three decades later, more complex neural networks became feasible with the advent of Long Short-Term Memory (LSTM) networks in 1997.
Moving ahead, in 2010, Stanford introduced the CoreNLP suite which enabled sentiment analysis and named entity recognition, further pushing practical applications of NLP.
The early 2010s brought word embeddings, pioneered at Google, which added a new level of contextual understanding to NLP. This set the stage for the rise of transformer models in 2017, followed by the development of GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers) in 2018.
GPT-3 launched in 2020 as a massive 175-billion-parameter model, setting the standard for LLM capabilities.
The introduction of ChatGPT in late 2022 brought LLMs into the public spotlight. Attention has escalated ever since, generating far more practical use cases than anyone could have envisioned.
LLM evolution continues with GPT-4, reportedly a trillion-parameter-scale model that represents a quantum leap over its predecessors in size and capability.
To quantify an LLM in terms of its size, model complexity and training resource intensity, it is prudent to look at Parameters and Tokens.
These are two important metrics used to measure the size and complexity of an LLM.
Parameters are the variables in the LLM's neural network.
These variables represent the weights and biases that are used to learn the relationships between the input and output data.
The more parameters an LLM has, the more complex it is and the better it can learn to generate text that is similar to the text it was trained on.
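To make the idea of parameters concrete, here is a minimal sketch, assuming PyTorch is installed, that counts the weights and biases of a tiny feed-forward network; the layer sizes are arbitrary and purely illustrative, whereas production LLMs contain billions of such values.

```python
import torch.nn as nn

# A tiny feed-forward network; each Linear layer holds learnable weights and biases.
model = nn.Sequential(
    nn.Linear(512, 2048),  # weights: 512 * 2048, biases: 2048
    nn.ReLU(),
    nn.Linear(2048, 512),  # weights: 2048 * 512, biases: 512
)

# Every learnable weight and bias counts as one parameter.
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")  # 2,099,712 for this toy network
```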
Tokens are the basic units of text that the LLM uses to process and generate language. Tokens can be characters, words, or subwords, depending on the chosen tokenization method.
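As a rough illustration of how tokenization works, here is a minimal sketch assuming the tiktoken library (which implements the tokenizer used by several OpenAI models) is installed; the encoding name and sample sentence are purely illustrative.

```python
import tiktoken

# Load a tokenizer; "cl100k_base" is the encoding used by several recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

text = "LLMs split text into tokens before processing it."
token_ids = enc.encode(text)

print(token_ids)                              # a list of integer token IDs
print(len(token_ids), "tokens")               # other tokenization schemes may split differently
print([enc.decode([t]) for t in token_ids])   # the subword pieces the model actually sees
```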
The more tokens an LLM is trained on, the more expressive it can be in its output. Now that we know what constitutes an LLM, we can look at the various types of LLMs.
LLMs vary depending on where they are hosted and whom they serve.
The various types of LLMs are:
Publicly available LLMs are typically massive in scale, often running to hundreds of billions of parameters, and are accessible to developers, researchers, and organizations through API calls.
These models are hosted on cloud platforms and can be used for various NLP tasks such as text generation, translation, summarization, and more.
Some of the leading examples include OpenAI's GPT models, Google's PaLM (Pathways Language Model), Meta's LLaMA, and NVIDIA's NeMo™.
These models have gained popularity due to their versatility and ease of integration into various applications.
They have the advantage of being continuously updated and improved by the developers, which helps ensure they remain state-of-the-art in terms of performance.
The data generated through the use of these models can also feed their continuous training.
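As an illustration of the API-based access described above, here is a minimal sketch using the OpenAI Python SDK; the model name is a placeholder, and an API key is assumed to be available in the OPENAI_API_KEY environment variable.

```python
from openai import OpenAI

# The client reads the OPENAI_API_KEY environment variable by default.
client = OpenAI()

# Note: whatever goes into the prompt leaves your environment and is sent to the
# provider, which is exactly why the privacy concerns discussed later matter.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whichever hosted model you have access to
    messages=[{"role": "user", "content": "Summarize the key privacy risks of LLMs in two sentences."}],
)

print(response.choices[0].message.content)
```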
Private LLMs are models that are customized or tuned for specific tasks, industries, or organizations and are operated within a private cloud or in-house infrastructure.
These models could be either purpose-built from scratch or fine-tuned versions of publicly available models.
The primary aim of these LLMs is to address specific business needs, confidentiality concerns, or compliance requirements.
Organizations may opt to develop and manage their LLMs to have more control over costs and data security.
Examples of private LLM providers include NVIDIA, which offers smaller to medium-sized LLMs, and platforms like Hugging Face and MosaicML that facilitate fine-tuning and customization.
For example, Bloomberg has created BloombergGPT, a 50-billion parameter private large language model, purpose-built from scratch for finance to better serve their customers.
Open Source LLMs are models that are made available to the public with their source code and parameters, allowing developers to modify, enhance, and adapt them for various purposes.
These models are often part of a hybrid strategy for enterprises looking to combine the benefits of both public and private LLMs.
By using open source LLMs, organizations can avoid vendor lock-in and have more control over their AI initiatives.
Falcon 40B and its scaled-up version Falcon 180B, developed by the UAE's Technology Innovation Institute, are notable examples of open source LLMs that have recently been made royalty-free for commercial and research use.
Millions of websites are used to train LLMs.
For instance, the GPT-3 model was reportedly trained on 570 GB of data scraped from the internet, drawn from sources like books, articles, and other texts.
OpenAI has acknowledged that GPT-4's training set 'may' include publicly available personal information, and that "GPT-4 has the potential to be used to attempt to identify individuals when augmented with outside data."
There is also the possibility of sensitive personal data, such as political opinions or religious beliefs, being included in the training data sets of public-facing LLMs.
The training data sets for such natural-language processing models can also include data scraped from public dialog sites or forums such as Reddit, Facebook, Quora, etc.
For instance, OpenAI’s privacy policy states that personal information may be collected through inputs and file uploads which may then be used to improve existing services, to develop new services, and to train the models that power ChatGPT.
Tools like 'ProfileGPT' have emerged to analyze and summarize the information that large language models collect about individual users.
For instance, the creators of ProfileGPT claim that users’ interactions with ChatGPT may enable extraction of personal data related to the user’s life summary, their hobbies and interests, political/religious views, mental health, etc.
More generally, in big data analytics, it is well-settled that it is not just the data provided by individuals which can be used for analysis, but also observed data, derived data, and inferred data.
Concerns have also been raised about the possibility of hackers ‘poisoning’ the dataset to create security vulnerabilities, which can then be exploited to extract sensitive personal data.
Data privacy concerns with generative AI are on the rise.
The top three reasons why organizations need a data privacy tool for their LLMs are highlighted below:
Suggested Read: OWASP Top 10 Vulnerabilities for LLMs
Data fed into LLMs is used to continuously train them. Once sensitive data enters an LLM system, there is no going back.
LLMs are a double-edged sword that can backfire at any time.
ChatGPT alone boasts a user base of over 180 million. Deleting the data isn't an option, predicting its future use or misuse is a daunting task, and retraining the LLM to “roll it back” to its state before you shared those sensitive contract details can be prohibitively expensive.
For example, privacy regulations in Europe, Argentina, and the Philippines (just to name a few) all support an individual’s “right to be forgotten.”
This grants individuals the right to have their personal information removed or erased from a system.
Without an LLM delete button, there’s no way for a business to address such a request without retraining their LLM from scratch.
Data localization or data residency laws mean that data about people from a specific country must be collected, processed, or stored within that country before it can be sent abroad.
Usually, the data can only be sent after following local privacy or data protection laws, like telling users how their information is used and getting their permission.
Data localization can also be used to ensure that data is subject to specific privacy laws and regulations.
For instance:
Data localization requires storing a citizen's data within the borders of their country. This means a company with a global presence needs to comply with individual privacy laws in every country it has customers in.
In such scenarios, it is a tough task for LLMs to comply with all rules and regulations simultaneously across borders. The more decentralized the deployments become, the more fragmented management attention and resource allocation get.
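To make the residency requirement concrete, here is a minimal, hypothetical sketch that pins storage and processing to an approved in-country region before any LLM call; the region map and function names are illustrative and not tied to any real provider API.

```python
# Hypothetical mapping from a user's country to an approved residency region.
RESIDENCY_REGIONS = {
    "DE": "eu-central-1",  # Germany: keep data in an EU region
    "IN": "ap-south-1",    # India: keep data in-country
    "US": "us-east-1",
}

def storage_region_for(country_code: str) -> str:
    """Return the region where this user's data must be stored and processed."""
    try:
        return RESIDENCY_REGIONS[country_code]
    except KeyError:
        raise ValueError(f"No approved residency region configured for {country_code!r}")

def handle_prompt(country_code: str, prompt: str) -> None:
    region = storage_region_for(country_code)
    # In a real system this would write to a datastore pinned to `region` and call
    # only an LLM endpoint hosted in that same region.
    print(f"Storing and processing prompt in {region}")

handle_prompt("DE", "Draft a reply to this customer complaint...")
```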
Data localization can also introduce vulnerabilities, including data exfiltration and problems with infrastructure, encryption, and the integrity of source code.
Malicious actors can attack LLM models via various means, diluting privacy at every step.
Here are a few attack vectors that pose a challenge for LLMs in complying with privacy laws:
Hardware attacks typically involve physical access to devices. However, LLMs cannot directly access physical devices; they can only access information associated with the hardware. One attack that can be aided by LLMs is the side-channel attack.
Side-channel attacks typically entail the analysis of unintentional information leakage from a physical system or implementation, such as a cryptographic device or software, with the aim of inferring secrets (e.g., keys) and business-sensitive information.
LLMs are powerful tools for working with text and information, but they don't have the capabilities to directly launch operating system attacks. However, they could be misused to analyze information that could be helpful to someone planning such an attack.
LLMs can be used to target hardware and operating systems and to create malware. Researchers have demonstrated the use of LLMs to distribute malware, create ransomware, and explore different coding strategies for generating malware.
LLMs excel in constructing malware using building block descriptions and can generate multiple variants with varying detection rates.
LLMs not only have the potential to generate content from training data, but research also highlights the capability of well-trained LLMs to infer personal attributes from text, such as location, income, and gender.
Researchers also revealed how these models can extract personal information from seemingly benign queries.
A prevalent example of a network-level attack via LLMs is phishing. Research has demonstrated the scalability of spear phishing campaigns by generating realistic and cost-effective phishing messages for over 600 British Members of Parliament using ChatGPT.
Data privacy in the generative AI age has become complex with multiple regulations requiring different bases of compliance. 71% of countries have adopted some form of legislation for preserving privacy.
Preserving privacy in generative AI requires the following best practices:
The most popular approach to addressing AI data privacy, and the one that’s being promoted by cloud providers like Google, Microsoft, AWS, and Snowflake, is to run your LLM privately on their infrastructure.
Con: Outside of privacy concerns, if you’re choosing to run an LLM privately rather than take advantage of an existing managed service, then you’re stuck with managing the updates, and possibly the infrastructure, yourself. It’s also going to be much more expensive to run an LLM privately.
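As a sketch of what running an LLM privately can look like in practice, here is a minimal example using the open source Hugging Face transformers library; it assumes the library is installed, that your hardware can hold the chosen open model, and the model name is just one openly available option.

```python
from transformers import pipeline

# Load an openly available model onto your own infrastructure; prompts never leave it.
generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",  # swap in whatever open model your hardware supports
)

output = generator(
    "Summarize our internal data-retention policy in one paragraph:",
    max_new_tokens=120,
    do_sample=False,
)
print(output[0]["generated_text"])
```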
Have a state-of-the-art tool for cleaning and processing your data before feeding it into LLMs, which helps secure your model.
Model security is even more important than the data itself because LLMs are black boxes. For instance, not even the programmers at OpenAI know exactly how ChatGPT configures itself to produce its text. Model developers traditionally design their models before committing them to program code, but LLMs use data to configure themselves.
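As a minimal illustration of the kind of pre-processing such a cleaning step performs before data ever reaches an LLM, here is a simple regex-based redaction pass; the patterns are illustrative, and a production tool would detect far more categories far more accurately.

```python
import re

# Illustrative patterns only; real PII detection covers many more categories.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace detected PII with placeholder tags before the text reaches an LLM."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

prompt = "Contact Jane at jane.doe@example.com or +1 415 555 0100 about SSN 123-45-6789."
print(redact(prompt))
# Contact Jane at [EMAIL] or [PHONE] about SSN [SSN].
```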
Data shared with external sources and the data output from LLMs require a secure access control mechanism.
Poor access management invites human-based intrusions and errors. Verizon's 2022 Data Breach Investigations Report found that 82% of data breaches involved the human element, stemming from credential theft, phishing attacks, and employee misuse or mistakes.
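A hypothetical sketch of such an access control check is shown below; the roles, actions, and policy table are illustrative and not tied to any particular product.

```python
# Map each role to the LLM-related actions it may perform (illustrative policy).
ALLOWED_ACTIONS = {
    "analyst":     {"query_llm"},
    "ml_engineer": {"query_llm", "view_training_data"},
    "admin":       {"query_llm", "view_training_data", "export_outputs"},
}

def authorize(role: str, action: str) -> None:
    """Fail closed: raise unless the role is explicitly allowed to take the action."""
    if action not in ALLOWED_ACTIONS.get(role, set()):
        raise PermissionError(f"Role {role!r} is not allowed to {action!r}")

def export_llm_output(role: str, output: str) -> str:
    authorize(role, "export_outputs")  # check before any data leaves the system
    return output

print(export_llm_output("admin", "Quarterly summary..."))  # permitted

try:
    export_llm_output("analyst", "Quarterly summary...")   # blocked
except PermissionError as err:
    print(err)
```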
Learn more about sensitive data: What is sensitive data and how to protect it?
During training of LLMs, multiple ML engineers work together on similar or separate projects.
Data is accessible to them in raw form, without any data masking techniques. To preserve privacy in generative AI, ML engineers have to be governed by data policies that comply with regulatory requirements.
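One common masking technique is pseudonymization, sketched below with a salted hash so engineers can still join and debug records without seeing raw identifiers; the salt handling and field names are illustrative only.

```python
import hashlib

SECRET_SALT = b"store-me-in-a-secrets-manager"  # placeholder; never hard-code a real salt

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a salted hash that is stable across records."""
    digest = hashlib.sha256(SECRET_SALT + value.encode("utf-8")).hexdigest()
    return digest[:16]  # shortened for readability in notebooks and logs

record = {"email": "jane.doe@example.com", "country": "DE", "ticket_text": "Refund request"}
masked = {**record, "email": pseudonymize(record["email"])}
print(masked)  # the raw email never appears in the training workspace
```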
Learn more about regulatory compliance: What is Regulatory Compliance? A Comprehensive Guide
Generative AI privacy issues stem from the security of both the model and the data itself. Securing LLM models starts with cleaning data and stripping it of sensitive information that is crucial to customers and businesses.
With rising complexities in regulatory laws, the definition of sensitive data is constantly evolving. Sensitive data can no longer be defined merely by the monolithic standards of PII, PHI, and PCI; it has broadened to include categories such as business intelligence, sentiment-based topics, developer tokens, and linked PII.
LLMs are black boxes that learn and grow as we provide more information. But the basis of their growth lies with us: we can dictate what they learn.
To secure your LLM model as well as ensure privacy of your sensitive data, schedule a product demo, and get first-hand experience on how OptIQ is changing the space of data security and data privacy.
To ensure the safety of Large Language Models (LLM), it's important to implement robust data handling and processing protocols, regularly update and audit the models for biases or errors, and enforce strict access controls with data security tools like OptIQ.
LLM security involves protecting the models from unauthorized access, tampering, or misuse. This includes securing the training data, ensuring the integrity of the models, and safeguarding their outputs against exploitation. This is possible with data cleaning and processing, data access control, data access governance and privacy.
Data privacy in LLMs pertains to protecting the personal information that may be included in the training datasets or generated in the model's outputs. It involves ensuring that data is anonymized, consent is obtained where necessary, and that the model does not inadvertently reveal sensitive information.