Learn how you can generate any type of text with GPT-2 and GPT-J transformer models with the help of the Huggingface Transformers library in Python.
Text generation is the task of automatically producing text with machine learning such that it is hard to tell whether it was written by a human or a machine. It is also widely used for text suggestion and completion in various real-world applications.
In recent years, many transformer-based models have proven to be great at this task. One of the best known is GPT-2, which was trained on massive amounts of unsupervised text and generates quite impressive output.
Another major breakthrough came when OpenAI released the GPT-3 paper and demonstrated its capabilities. This model is so massive that it is more than 1,400 times larger than the base version of its predecessor, GPT-2 (124M parameters).
Unfortunately, we cannot use GPT-3, as OpenAI did not release the model weights; and even if it had, most of us wouldn't have access to a machine that could load weights that large into memory.
Luckily, EleutherAI did a great job trying to mimic the capabilities of GPT-3 by releasing the GPT-J model. GPT-J has 6 billion parameters, consisting of 28 layers with a hidden dimension of 4096, and it was pre-trained on the Pile, a large-scale dataset created by EleutherAI itself.
The Pile is a massive dataset of over 825GB, consisting of 22 sub-datasets including English Wikipedia (6.38GB), GitHub (95.16GB), Stack Exchange (32.2GB), ArXiv (56.21GB), and more. This breadth explains the impressive performance of GPT-J that you'll hopefully see in this tutorial.
In this guide, we’re going to perform text generation using GPT-2 as well as EleutherAI models using the Huggingface Transformers library in Python.
The table below shows some of the most useful models along with their number of parameters and size; I suggest you choose the largest one that fits in your environment's memory:
| Model | Number of Parameters | Size |
| --- | --- | --- |
| gpt2 | 124M | 523MB |
| EleutherAI/gpt-neo-125M | 125M | 502MB |
| EleutherAI/gpt-neo-1.3B | 1.3B | 4.95GB |
| EleutherAI/gpt-neo-2.7B | 2.7B | 9.94GB |
| EleutherAI/gpt-j-6B | 6B | 22.5GB |
The EleutherAI/gpt-j-6B model is 22.5GB in size, so make sure you have more than 22.5GB of memory available to perform inference with it. The good news is that Google Colab with the High-RAM option worked for me. If you can't load a model that big, you can try smaller versions such as EleutherAI/gpt-neo-2.7B or EleutherAI/gpt-neo-1.3B. In this tutorial, we'll use gpt2 and EleutherAI/gpt-j-6B from the above table.
Note that this is different from generating AI chatbot conversations using models such as DialoGPT. If you want that, we have a tutorial for it, make sure to check it out.
Let's get started by installing the transformers library:
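If you don't already have it, you can install it with pip:

```
pip install transformers
```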
In this tutorial, we will only use the pipeline API, as it’ll be more than enough for text generation.
Let's begin with the standard GPT-2 model:
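A minimal sketch using the pipeline API; the first call downloads the model weights:

```python
from transformers import pipeline

# download & load the GPT-2 model into a text-generation pipeline
gpt2_generator = pipeline("text-generation", model="gpt2")
```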
First, let’s use GPT-2 model to generate 3 different sentences by sampling from the top 50 candidates:
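Something along these lines; the prompt text and the max_length value are illustrative, while top_k, temperature, and num_return_sequences match the settings discussed below:

```python
# sample 3 sentences from the top 50 candidate tokens at each step
sentences = gpt2_generator(
    "To be honest, neural networks",  # illustrative prompt
    do_sample=True,
    top_k=50,
    temperature=0.6,
    max_length=128,  # illustrative value
    num_return_sequences=3,
)
for sentence in sentences:
    print(sentence["generated_text"])
    print("=" * 50)
```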
Output:
I have set top_k to 50, which means we keep only the 50 highest-probability vocabulary tokens when sampling. We also decrease the temperature to 0.6 (the default is 1) to increase the probability of picking high-probability tokens; setting it to 0 is equivalent to greedy search (i.e., always picking the most probable token). Notice the third sentence was cut off before completion; you can always increase max_length to generate more tokens.
When we pass input text to the TextGenerationPipeline (the pipeline object), the remaining arguments are forwarded to the model.generate() method. Therefore, I highly suggest you check the parameters of model.generate() in the reference for more customized generation. I also suggest you read this blog post explaining most of the decoding techniques the method offers.
Now that we have explored GPT-2, it's time to dive into the fascinating GPT-J:
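Loading it looks the same as before; only the model name changes:

```python
# download & load GPT-J (about 22.5GB, so this will take a while)
gpt_j_generator = pipeline("text-generation", model="EleutherAI/gpt-j-6B")
```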
The model is about 22.5GB in size, so make sure your environment can load it into memory. I'm using a High-RAM instance on Google Colab, and it's running quite well. However, it may take a while to generate sentences, especially when you pass a higher value of max_length.
Let’s pass the same parameters, but with a different prompt:
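A sketch of that call; the prompt itself is an illustrative example:

```python
sentences = gpt_j_generator(
    "To be honest, robots will",  # illustrative prompt
    do_sample=True,
    top_k=50,
    temperature=0.6,
    max_length=128,
    num_return_sequences=3,
)
for sentence in sentences:
    print(sentence["generated_text"])
    print("=" * 50)
```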
Output:
Honestly, I can’t distinguish whether this is generated by a neural network or written by a human being!
Since GPT-J and the other EleutherAI pre-trained models were trained on the Pile dataset, they can generate more than just English prose. Let's try to generate Python code:
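Based on the description that follows, the prompt looked roughly like this; the exact comment wording and max_length value are assumptions:

```python
# prompt GPT-J with the start of a Python script
print(gpt_j_generator(
"""
import os
# make a list of all african countries
""",
    do_sample=True,
    temperature=0.05,  # low temperature, see the note below
    max_length=256,    # illustrative value
)[0]["generated_text"])
```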
Output:
I prompted the model with an import os statement to indicate Python code, plus a comment about listing African countries. Surprisingly, it not only got the Python syntax right and generated African countries, but it also listed the countries in alphabetical order and chose a suitable variable name!
I definitely invite you to play around with the model and let me know in the comments if you find anything even more interesting.
Notice I lowered the temperature to 0.05, as this is not really open-ended generation: I want the African countries to be correct, as well as the Python syntax. When I tried increasing the temperature for this type of generation, it led to misleading output.
One more Python prompt:
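Judging from the output described below, the prompt was something like this OpenCV snippet; the exact wording is a guess:

```python
# another Python prompt, this time for OpenCV
print(gpt_j_generator(
"""
import cv2
# load the image and flip it
""",
    do_sample=True,
    temperature=0.05,
    max_length=256,
)[0]["generated_text"])
```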
Output:
The model successfully generated working OpenCV code: it loads the image, applies the cv2.flip() function to it, resizes it, and then continues by converting the image to black and white. Interesting!
Next, let’s try Bash scripting:
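The prompt below is a hypothetical stand-in showing the general pattern: a shell comment describing the command we want the model to produce:

```python
# prompt the model with a Bash comment describing what we want
print(gpt_j_generator(
"""
# list all files in the current directory sorted by size
""",
    do_sample=True,
    temperature=0.05,
    max_length=256,
)[0]["generated_text"])
```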
Check this out:
The first command worked like a charm on my machine!
Another shell script:
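Per the description below, the prompt contained an apt-get update plus a request to handle Nginx; the exact text is an assumption:

```python
# update the package index in the prompt, then let the model continue
print(gpt_j_generator(
"""
# update the package repository
sudo apt-get update
# install and start nginx
""",
    do_sample=True,
    temperature=0.05,
    max_length=256,
)[0]["generated_text"])
```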
I updated the package index using the apt-get command and prompted the model to generate the commands for installing and starting Nginx. Here is the output:
The model successfully generated the two commands responsible for installing Nginx and starting the web server! It then tries to create a user and add it to the sudoers. However, notice the repetition; we can get rid of that by setting the repetition_penalty parameter (the default is 1, i.e., no penalty). Check this paper for more information.
Now let's try Java, prompting the model with a Java main function wrapped in a Test class, plus a comment asking it to print the first 20 Fibonacci numbers:
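A sketch of that prompt; the exact formatting is assumed:

```python
# Java prompt: a main method inside a Test class plus a comment
print(gpt_j_generator(
"""
public class Test {
    public static void main(String[] args) {
        // print the first 20 Fibonacci numbers
""",
    do_sample=True,
    temperature=0.05,
    max_length=256,
)[0]["generated_text"])
```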
Extraordinarily, the model added the complete Java code for generating Fibonacci numbers:
I executed the code that comes before the weird "A:"; not only is it working code, but it also generated the correct sequence!
Finally, let's try generating LaTeX code:
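Roughly like this, as described below; the comment wording is assumed:

```python
# LaTeX prompt: a comment followed by the start of an ordered list
print(gpt_j_generator(
r"""
% list of Asian countries
\begin{enumerate}
""",
    do_sample=True,
    temperature=0.05,
    max_length=256,
)[0]["generated_text"])
```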
I began an ordered list in LaTeX, preceded by a comment indicating a list of Asian countries. Output:
A correct syntax with the right countries!