How I Created an AI Clone of Myself With ChatGPT

Tariq Massaoudi
Published in Analytics Vidhya
7 min read · Apr 21, 2023

Using LangChain & Pinecone to create a chatbot that simulates my personality

Generated by AI

Are you a fan of Black Mirror or other sci-fi shows that explore the potential dark side of technology? Well, get ready to enter the dystopian future, because the accelerating pace of AI innovation may give you the creeps. Imagine an AI bot that can analyze your conversation history and replicate your personality traits, mannerisms, and even your sense of humor. It’s like having a digital clone of yourself that can chat with your friends and family, answer emails, or even take over your social media accounts. Sounds like something straight out of a sci-fi film, right? But we might get something like that in the near future.

In this article, I experiment with an AI chatbot that replicates my personality using my chat history from Facebook. The bot gives surprisingly good results, and I’ll show you how to do the same with your own data in a few lines of code. The bot’s name is AI Tariq 🤖, and you can chat with him (or me) using this link. Note that you might need an OpenAI API key.

What we’re building:

Here’s an architecture diagram of what we’re building today, along with a summary of our approach:

  • The user enters a prompt.
  • We use an OpenAI embedding model and Pinecone to perform a semantic search for the most relevant conversations from chat logs.
  • We then feed these examples to the AI via a custom prompt.
  • The AI attempts to answer as you would do in real life.

Architecture of what we’re building today (photo by author)
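
The four steps above can be sketched end to end without any external service. The snippet below is only an illustration under loud assumptions: `toy_score` is a hypothetical word-overlap score standing in for the OpenAI embedding model plus Pinecone, the chat chunks are made up, and the final call to the language model is left out. Only the retrieve-then-prompt flow mirrors the approach.

```python
def toy_score(query, chunk):
    """Hypothetical relevance score: number of lowercase words in common.

    This stands in for semantic search; it is NOT how embeddings work.
    """
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query, chat_chunks, k=2):
    """Return the k chunks that score highest against the query."""
    return sorted(chat_chunks, key=lambda c: toy_score(query, c), reverse=True)[:k]

def build_prompt(query, examples):
    """Place the retrieved chunks in front of the user's input."""
    joined = "\n".join(examples)
    return f"Examples:\n{joined}\n\nHuman: {query}\nTariq:"

# Made-up chat chunks, already in "Sender: message" form.
chat_chunks = [
    "Person: what anime do you like\nTariq: one piece for sure",
    "Person: are you free tonight\nTariq: yes after work",
    "Person: best food ever\nTariq: couscous on friday",
]

top = retrieve("which anime do you like", chat_chunks, k=1)
prompt = build_prompt("which anime do you like", top)
print(prompt)  # this string is what would be sent to the language model
```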

The architecture of the provided demo is slightly more complex. It includes streaming responses, which means that the user does not have to wait for the entire response from the AI before seeing a result. In simpler terms, it’s like watching ChatGPT type live. The demo also includes hosting. Here is the full architecture for your reference:
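
As a rough illustration of what streaming means here, a generator can hand out a response piece by piece so the caller can display each piece as soon as it arrives. `fake_llm_stream` is a hypothetical stand-in that splits a fixed string into words; the real demo streams tokens from the model, which is not shown.

```python
def fake_llm_stream(full_response):
    """Hypothetical stand-in for a streaming model: yields one word at a time."""
    for word in full_response.split(" "):
        yield word + " "

def show_streaming(full_response):
    """Print each piece as it arrives and return the assembled text."""
    pieces = []
    for piece in fake_llm_stream(full_response):
        print(piece, end="", flush=True)  # partial output is visible immediately
        pieces.append(piece)
    print()
    return "".join(pieces).strip()

result = show_streaming("My favorite dishes are couscous and tajine.")
```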

Full Demo Architecture (photo by author)

In this article we’ll focus on the AI bot part.

Download your Facebook Data:

To download your Facebook profile information, go to Settings > Privacy > Your Facebook Information > Download Your Information. Ensure that you have selected “Messages” before downloading.

Make sure that the format is JSON and choose your preferred date range:

Once you have selected “Messages” for download, click the “Request Download” button located at the bottom of the page. Facebook will take some time to prepare your data, and you will receive an email notification when it’s ready for download. After downloading, you will receive a zip file named “facebook-yourname.zip”. Open it and extract the “inbox” folder located under “messages”.
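
If you would rather script the extraction than do it by hand, something along these lines could work. It is a sketch, not part of the original workflow: `extract_inbox` is a hypothetical helper, and the `messages/inbox/` prefix is an assumption about how the archive is laid out, so check it against your own download. The demo builds a tiny fake archive so the snippet runs on its own.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_inbox(zip_path, dest_dir, prefix="messages/inbox/"):
    """Extract only the archive members under `prefix` into dest_dir.

    The prefix is an assumed layout; adjust it if your download differs.
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.startswith(prefix) and not name.endswith("/"):
                archive.extract(name, dest_dir)
                extracted.append(name)
    return extracted

# Demo on a tiny archive built on the fly (file names are made up).
workdir = Path(tempfile.mkdtemp())
zip_path = workdir / "facebook-yourname.zip"
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("messages/inbox/friend_1/message_1.json", "{}")
    archive.writestr("posts/your_posts.json", "{}")

extracted = extract_inbox(zip_path, workdir / "extracted")
print(extracted)
```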

Once you have located the “inbox” folder in the downloaded file, you will see individual folders for each conversation you’ve ever had on Facebook. You can choose to use all of them or only select a subset of conversations. Create a new folder named “data” and copy the conversations you wish to use into it.
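
Copying a subset of conversations into “data” can also be scripted. The helper below is a hypothetical sketch (the name `copy_conversations` and the folder names in the demo are made up); it copies only the folders you list and skips names that do not exist.

```python
import shutil
import tempfile
from pathlib import Path

def copy_conversations(inbox_dir, data_dir, selected):
    """Copy the selected conversation folders from inbox_dir into data_dir."""
    inbox_dir, data_dir = Path(inbox_dir), Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in selected:
        source = inbox_dir / name
        if source.is_dir():  # skip names that are not conversation folders
            shutil.copytree(source, data_dir / name, dirs_exist_ok=True)
            copied.append(name)
    return copied

# Demo on a throwaway directory tree.
root = Path(tempfile.mkdtemp())
for folder in ("friend_a", "friend_b"):
    (root / "inbox" / folder).mkdir(parents=True)
    (root / "inbox" / folder / "message_1.json").write_text("{}")

copied = copy_conversations(root / "inbox", root / "data", ["friend_a", "missing"])
print(copied)
```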

Now that you have the raw data, the next step is to format it in a readable way for the AI bot.

Formatting and Cleaning the data:

To follow along with the process, first download this repository and open the notebook titled “clean_data”. The data is in JSON format by default, and the first step is to transform it into a more usable DataFrame format. This can be achieved using two functions, namely “load_all_messages_in_folder” and “load_all_folders”.

To ensure that the files are loaded in chronological order, the function “sort_key_files” is used.
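
To see why a numeric sort key matters, compare it with plain alphabetical sorting on a few made-up file names. The key function here is the same one used in the snippet that follows; the file names are illustrative only.

```python
def sort_key_files(file):
    # "message_10.json" -> 10
    return int(file.split('_')[1].split('.')[0])

files = ["message_10.json", "message_2.json", "message_1.json"]

alphabetical = sorted(files)
numeric = sorted(files, key=sort_key_files)

print(alphabetical)  # ['message_1.json', 'message_10.json', 'message_2.json']
print(numeric)       # ['message_1.json', 'message_2.json', 'message_10.json']
```

Alphabetical order places file 10 before file 2, which would scramble the order in which the message files are loaded.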


import os
import pandas as pd

# parse_fb_json and TempDFs are not defined in this snippet; they are assumed
# to come from the repository's "clean_data" notebook.

def sort_key_files(file):
    # extract the integer between '_' and '.' so files sort numerically
    return int(file.split('_')[1].split('.')[0])

def load_all_messages_in_folder(path):
    """
    This function expects to get a folder path with files named messages_x.json where x is an integer
    it will load them in order and returns a pandas dataframe
    """
    merged_messages = pd.DataFrame()
    arr = os.listdir(path)
    arr.sort(key=sort_key_files)
    print(arr)
    for file in arr:
        json_content = parse_fb_json(path + '/' + file)
        temp_dfs = TempDFs(json_content)
        messages = temp_dfs.df_list[2]
        merged_messages = pd.concat([merged_messages, messages])
    return merged_messages

def load_all_folders(folders):
    """
    Given a list of folders, load all messages in folders and return a single dataframe
    """
    df_all_messages = pd.DataFrame()
    for folder in folders:
        extracted_data = load_all_messages_in_folder(f'./data/{folder}')
        df_all_messages = pd.concat([extracted_data, df_all_messages])
    return df_all_messages

# replace with the names of the conversation folders you copied into ./data
folders = ['inwi', 'seller']
df = load_all_folders(folders)

Configuring the chatbot:

The next step involves improving the quality of the data by performing some cleaning. If your conversations involve multiple languages, you can optionally choose to keep only those in English. Since we are only interested in the message content and the sender name, we will keep only these two.

# Assumed to be pyenchant's SpellChecker, which matches the calls used below
from enchant.checker import SpellChecker

def is_in_english(quote, max_error_count=3):
    """
    detects if a text is in english using the number of spelling mistakes
    """
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return len(errors) <= max_error_count

def clean_messages(df, yourname):
    # remove rows with null content
    df = df.dropna(subset=['content'])

    # Facebook also keeps track of reactions in rows that start with "Reacted"; we delete those
    df['react'] = df['content'].apply(lambda s: str(s).split(' ')[0])
    df = df[df['react'] != 'Reacted']

    # uncomment this if you want only English messages
    # df['english'] = df.content.apply(lambda s: is_in_english(str(s)))
    # df = df[df['english'] == True]

    # keep regular messages and drop group housekeeping notices
    df = df[df['type'] == 'Generic']
    df = df[~df['content'].str.startswith('You changed the group')]
    df = df[~df['content'].str.startswith('You named the group')]
    # anonymize everyone except yourself
    df['sender_name'] = df['sender_name'].apply(lambda s: yourname if s == yourname else 'Person')
    df = df[['content', 'sender_name']]

    return df

clean_df = clean_messages(df, 'Tariq Massaoudi')
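
To see what these filters do without loading pandas, the same rules can be mirrored on plain dictionaries. This is only a sketch: `clean_rows` and the sample rows are hypothetical, and the notebook itself works on a DataFrame.

```python
def clean_rows(rows, yourname):
    """Pure-Python mirror of the filtering rules, for rows given as dicts."""
    cleaned = []
    for row in rows:
        content = row.get("content")
        if content is None:
            continue  # no text content
        if content.split(" ")[0] == "Reacted":
            continue  # reaction notices
        if row.get("type") != "Generic":
            continue  # anything that is not a regular message
        if content.startswith(("You changed the group", "You named the group")):
            continue  # group housekeeping notices
        sender = yourname if row["sender_name"] == yourname else "Person"
        cleaned.append({"content": content, "sender_name": sender})
    return cleaned

# Made-up rows covering each rule.
rows = [
    {"sender_name": "Tariq Massaoudi", "content": "hello!", "type": "Generic"},
    {"sender_name": "Some Friend", "content": "Reacted to your message", "type": "Generic"},
    {"sender_name": "Some Friend", "content": None, "type": "Generic"},
    {"sender_name": "Some Friend", "content": "hey, how are you?", "type": "Generic"},
    {"sender_name": "Some Friend", "content": "You named the group Friends.", "type": "Generic"},
    {"sender_name": "Some Friend", "content": "call started", "type": "Call"},
]

cleaned = clean_rows(rows, "Tariq Massaoudi")
for row in cleaned:
    print(row)
```

Only the two real messages survive, and the friend's name is replaced by “Person”.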

Finally we’re transforming the table to text which is readable for our LLM and exporting the result to a file.

The format used is:

Yourname: “message” Person: “message” …
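
Concretely, each message ends up on its own line, prefixed by the sender. The snippet below shows the idea on two made-up messages; `to_lines` is a hypothetical pure-Python helper, not the notebook's function.

```python
def to_lines(pairs):
    """Join (sender, message) pairs into one 'Sender:message' line each."""
    return "\n".join(f"{sender}:{message}" for sender, message in pairs)

sample = [
    ("Person", "are you coming tonight?"),
    ("Tariq Massaoudi", "yes, after work"),
]

document = to_lines(sample)
print(document)
# Person:are you coming tonight?
# Tariq Massaoudi:yes, after work
```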

def to_document(df):
    """Returns a single string in the format Sender : Message \n Sender: Message """

    df['content_with_names'] = df['sender_name'] + ":" + df['content']
    result = '\n'.join(df['content_with_names'].values)
    return result

with open("disscussion.txt", mode='w', encoding='utf-8') as file:
    file.write(to_document(clean_df))

Embedding our messages with OpenAI to query them with Pinecone:

To proceed with this step, open the notebook titled “langchain_bot”. However, before getting started, you will need to obtain both an OpenAI API key and a Pinecone API key. Fortunately, both services offer free trials!

Since LLMs have a limited context window, we cannot feed the bot our entire conversation history. Instead, we need to feed it only the relevant history based on user input. To achieve this, we first split our entire corpus into documents with a predefined length. Next, we use the OpenAI embedding model to embed these documents into vectors that represent the semantic meaning of the words. For instance, the words “happy” and “excited” will be closer in vector space than “happy” and “sad”.
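
The splitting step can be pictured as a sliding window over the text. The sketch below is a simplified stand-in, not LangChain's `RecursiveCharacterTextSplitter` (which also tries to break on separators such as newlines): `split_with_overlap` is a hypothetical helper that cuts fixed-size windows sharing a few characters, so a message cut at a boundary still appears whole in one of the neighbouring chunks more often.

```python
def split_with_overlap(text, chunk_size=200, chunk_overlap=20):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[s:s + chunk_size] for s in range(0, last_start, step)]

# Made-up corpus in the "Sender:message" format.
corpus = "".join(f"Person:message number {i}\n" for i in range(40))
chunks = split_with_overlap(corpus, chunk_size=200, chunk_overlap=20)

print(len(chunks), "chunks")
print(repr(chunks[0][-20:]))  # the tail of one chunk ...
print(repr(chunks[1][:20]))   # ... is the head of the next
```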

We can then use Pinecone to save these vectors and run a similarity search, which will return the top documents.
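
Under the hood, a similarity search compares the query vector with every stored vector and keeps the closest ones. The toy version below uses cosine similarity on invented 3-dimensional vectors; real embeddings have far more dimensions and Pinecone's internals are not shown, so treat `top_k` and the numbers as illustrative assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, index, k=2):
    """Return the k (id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Invented vectors: "happy" and "excited" point the same way, "sad" does not.
index = {
    "happy":   [0.9, 0.8, 0.1],
    "excited": [0.8, 0.9, 0.2],
    "sad":     [-0.7, -0.6, 0.3],
}

results = top_k(index["happy"], index, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```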

# Import paths as in the LangChain releases from around April 2023; newer versions may differ.
import pinecone
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone

# PINECONE_API_KEY, PINECONE_API_ENV and OPENAI_API_KEY must be defined beforehand

loader = TextLoader('disscussion.txt')
data = loader.load()
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your document')

# change this based on your document
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
texts = text_splitter.split_documents(data)

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_API_ENV
)
# replace with your index name
index_name = "aitariq"

embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
# embeds your docs with OpenAI and loads them into Pinecone
docsearch = Pinecone.from_texts([t.page_content for t in texts], embedding, index_name=index_name)

# load search from the Pinecone index; this will allow us to search for docs to feed the bot as examples
query = "What are Tariq's Favorite Animes?"
search_wrapper = Pinecone.from_existing_index(index_name=index_name, embedding=embedding)
# k determines how many docs we return
docs = search_wrapper.similarity_search(query=query, include_metadata=True,k=

Here’s an example of a document from the search results for the query above about anime:

Building our AI Bot with LangChain:

LangChain makes it easy to manipulate AI prompts.

In this step, we will create a prompt template for our bot. The documents that are similar to the user input, which we obtain from Pinecone, will serve as the examples that we feed to our AI. This approach is known as few-shot prompting, where the AI analyzes the examples and tries to mimic your personality. Additionally, we will feed the AI with the chat history using the built-in “ConversationBufferWindowMemory” utility in LangChain.
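
Two ideas from this paragraph can be sketched in plain Python: numbering the retrieved chunks as few-shot examples, and remembering only the last few exchanges. `WindowMemory` and `format_examples` below are hypothetical stand-ins written for illustration; they are not LangChain's implementation, and the speaker labels are an assumption.

```python
from collections import deque

class WindowMemory:
    """Keeps only the last k exchanges, in the spirit of a windowed buffer."""

    def __init__(self, k=4):
        self.turns = deque(maxlen=k)  # older exchanges fall off automatically

    def save(self, human_input, answer):
        self.turns.append((human_input, answer))

    def as_text(self):
        return "\n".join(f"Human: {h}\nTariq: {a}" for h, a in self.turns)

def format_examples(docs):
    """Number the retrieved chunks so the prompt can list them as examples."""
    return "\n".join(f"Example {i + 1}: \n{doc}" for i, doc in enumerate(docs))

memory = WindowMemory(k=2)
for n in range(1, 4):
    memory.save(f"question {n}", f"answer {n}")

history = memory.as_text()
examples = format_examples(["Person:hi\nTariq Massaoudi:hey", "Person:lunch?\nTariq Massaoudi:sure"])

print(history)   # only exchanges 2 and 3 remain
print(examples)
```

A small window keeps the prompt short, at the cost of the bot forgetting anything said more than k exchanges ago.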

from langchain import OpenAI, LLMChain, PromptTemplate
from langchain.memory import ConversationBufferWindowMemory
template="""
You are going to immerse yourself into the role of Tariq.
Tariq is an Engineer from Morocco, he is 24 years old.
Human will give you an input and examples of a conversation between Tariq and another person.
Use these examples as context to generate an answer to the Human's input in Tariq's style.
Your answer should be believable, in a casual tone and in Tariq's style.
Answer how Tariq would Answer.
Be creative.

Examples:

{examples}

Examples END

{history}

Human: {human_input}
Tariq:

"""


prompt = PromptTemplate(
    input_variables=["history", "human_input", "examples"],
    template=template
)

# change k to control how many previous exchanges the bot remembers
# set verbose=True for debugging
chatgpt_chain = LLMChain(
    llm=OpenAI(temperature=0.7, openai_api_key=OPENAI_API_KEY),
    prompt=prompt,
    verbose=False,
    memory=ConversationBufferWindowMemory(k=4, memory_key="history", input_key="human_input"),
)

def get_answer(human_input):
    """
    Takes a human input and returns the bot's response
    """
    # fetch the 10 chat chunks most similar to the input
    docs = search_wrapper.similarity_search(query=human_input, include_metadata=True, k=10)

    # number the chunks so they fill the {examples} slot of the prompt
    examples = '\n'.join(["Example " + str(i+1) + ": \n" + doc.page_content for i, doc in enumerate(docs)])

    output = chatgpt_chain.predict(human_input=human_input, examples=examples)
    return output


print(get_answer("What are your favorite dishes?"))

Here are AI Tariq’s favorite dishes; I tend to agree with him:

My favorite dishes are definitely couscous and tajine.
There's nothing better than a warm plate of couscous
or tajine with a side of veggies and a nice glass of Moroccan tea.
I also love a good tagine with some olives and lemon.
The combination of flavors is unbeatable.

Conclusion:

To sum up, we’ve experimented with creating an AI clone of ourselves. Have fun talking to yourself! I can’t wait to see more sophisticated approaches to this emerge. We’re living in exciting times!

If you’ve managed to get this far, congratulations! Thanks for reading; I hope you’ve enjoyed the article. Please feel free to reach out to me on LinkedIn for further discussion or personal contact.
