
Use AI to create a story from an image

Published Oct 27, 2023

Every image has a tale to tell, but what if we could unveil those hidden narratives with the power of artificial intelligence? We're here to explore the intersection of computer vision and natural language processing. This tutorial delves into the practical application of Hugging Face AI models to turn images into captivating stories.

As we navigate this tutorial, you'll discover how AI can discern the subtle nuances within an image, extract meaningful information, and weave it into eloquent narratives.

Here is a video of the use case:
https://www.youtube.com/shorts/PrQQ2sLm8FI

Understanding Hugging Face Models

Hugging Face, a prominent name in the realm of artificial intelligence, is your gateway to a wide array of pre-trained language models and transformer architectures. Before we delve into the mechanics of creating stories from images, it's crucial to understand why Hugging Face models are pivotal for this task.

Versatility and Popularity: Hugging Face stands out for its versatility. The platform offers access to various pre-trained models that can tackle diverse natural language processing tasks. These models are widely recognized and adopted across a multitude of applications.
 
Community Collaboration: Hugging Face adopts an open-source and community-driven approach, allowing AI enthusiasts, developers, and researchers to contribute to model development and share their innovations. This collaborative spirit ensures that the models continuously evolve and improve.
 
Ease of Integration: Hugging Face models are designed for user-friendliness. Their straightforward, intuitive APIs cater to both experienced developers and newcomers to AI; whether you're a Python expert or just getting started with coding, integrating a model usually takes only a few lines (see the sketch after this list).
 
Top-Notch Performance: Hugging Face models have set benchmarks in various natural language understanding and generation tasks. They are renowned for their outstanding performance, making them a preferred choice for achieving high-quality results.
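
To illustrate that ease of integration, here is a minimal sketch using the transformers pipeline API (the task and input string here are arbitrary placeholders; the first call downloads a default model):

from transformers import pipeline

# One line to load a model, one line to use it
classifier = pipeline("sentiment-analysis")
print(classifier("Hugging Face makes this easy!"))
# [{'label': 'POSITIVE', 'score': ...}]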

Components for Image-to-Story Conversion

In the journey to create stories from images using Hugging Face AI models, we'll be breaking down the process into three crucial components. These components form the backbone of our image-to-story conversion pipeline:

Image to Text: Deciphering the Visual Scenario
Our first step is to teach the machine to understand the scenario presented by an image. This process, known as image captioning, involves translating visual content into text descriptions. It's the foundation upon which we build our storytelling magic.
 
To achieve this, we'll leverage computer vision models that can analyze the image's content and generate descriptive text. These models have been trained on vast datasets, allowing them to recognize objects, scenes, and context within images. The result is a textual representation of the image, which serves as the starting point for our storytelling journey.
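
To make this concrete, here is an optional sketch of the same idea using BLIP's lower-level classes directly; the practical section below uses the simpler pipeline wrapper instead, and the image path is a placeholder:

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the captioning model and its matching preprocessor
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Preprocess the image, generate a caption, and decode it to text
image = Image.open("photo.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))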

Language Model Magic: Crafting Stories with LLMs
With the image content converted into text, we move on to the storytelling phase. Here, we introduce the star of the show - Large Language Models (LLMs). These models are the creative minds behind the stories we'll be generating.
 
LLMs are massive neural networks that have been trained on extensive text corpora, making them masters of language. They can take a textual prompt, like the description of an image, and continue the narrative with engaging and coherent text. In our case, they will transform the image caption into a short story.
 
The beauty of LLMs lies in their ability to generate contextually relevant and imaginative text. They provide a canvas upon which we can paint captivating narratives, all powered by the magic of artificial intelligence.
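
As a quick illustration of this continuation behavior, here is a minimal sketch; gpt2 is used only because it is small to download, while the practical section below uses the far more capable HuggingFaceH4/zephyr-7b-alpha:

from transformers import pipeline

# Give the model a caption-like prompt and let it continue the text
generator = pipeline("text-generation", model="gpt2")
result = generator("a castle with a tower", max_new_tokens=40, do_sample=True)
print(result[0]["generated_text"])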

Text to Speech: Giving Voice to Our Story
Our storytelling journey doesn't stop at written words. We want our stories to come to life, and that's where text-to-speech models enter the scene. These models can convert our written stories into spoken words, generating audio that adds an extra layer of immersion to the experience.
 
Text-to-speech models have made remarkable strides recently, offering human-like voice synthesis. They take the text output from our language model and turn it into an audio file that can be played and enjoyed.
 
With this component, we bridge the gap between written and spoken storytelling, bringing our narratives to the ears of our audience.

Practical Example

For the first part, image to text, we are going to use the Salesforce/blip-image-captioning-large model https://huggingface.co/Salesforce/blip-image-captioning-large (the example code below uses the lighter blip-image-captioning-base variant, which shares the same API).
The Salesforce/blip-image-captioning-large model is a powerful tool for image captioning. It's been pre-trained on extensive datasets, making it adept at recognizing objects, scenes, and context within images. This model serves as our bridge between the visual and textual worlds.

You'll need a Hugging Face account and a token for accessing the API.

First, create a .env file and put your Hugging Face token in it:
HUGGINGFACEHUB_API_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
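
You can optionally verify that the token is picked up before going further (a small sanity check, not required for the rest of the tutorial):

from dotenv import find_dotenv, load_dotenv
import os

load_dotenv(find_dotenv())
assert os.getenv("HUGGINGFACEHUB_API_TOKEN"), "Token not found; check your .env file"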

Create an app.py file where we will be working:

from dotenv import find_dotenv, load_dotenv
from transformers import pipeline
load_dotenv(find_dotenv())

# image to text
def image_2_text(url):
    image_to_text = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    text = image_to_text(url)
    print(text)
    return text

image_2_text('photo.jpg')

When you run this, you should see a short description of the photo:
[{'generated_text': 'a castle with a tower'}]
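
Note that the pipeline returns a list of dictionaries. If you only want the caption string, you can index into the result (an optional tweak, not part of the function above):

# Pull out just the caption string
caption = image_2_text('photo.jpg')[0]['generated_text']
print(caption)  # 'a castle with a tower'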

For the second part, we'll use the HuggingFaceH4/zephyr-7b-alpha model https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha to generate a short story based on the text.

Add this to the app.py file. Note the new import torch line, which the pipeline call needs:

import torch

def generate_story(scenario):
    pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-alpha", torch_dtype=torch.float16, device_map="auto")

    messages = [
        {
            "role": "system",
            "content": "You are a story teller; You can generate a short story based on a single narrative, the story should be no more than 30 words long",
        },
        {"role": "user", "content": str(scenario)},
    ]
    prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
    print(outputs[0]["generated_text"])
    return outputs[0]["generated_text"]

For me, it returns something along these lines. Note that generated_text also includes the full prompt; the story appears after the <|assistant|> marker, which the complete code below strips out:

In a far-off land, stood a castle grand,
With a tower so high, it touched the sky's command.

For the last part, text to speech, we are using espnet/kan-bayashi_ljspeech_vits https://huggingface.co/espnet/kan-bayashi_ljspeech_vits through the Hugging Face Inference API. Because the Zephyr output still contains the chat prompt, a small helper first extracts only the text after the <|assistant|> marker. This snippet also needs requests and os imported, with the token loaded from the environment:

import os
import requests

HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")

def substring_after(s, delim):
    return s.partition(delim)[2]

def text_to_speech(message):
    API_URL = "https://api-inference.huggingface.co/models/espnet/kan-bayashi_ljspeech_vits"
    headers = {"Authorization": f"Bearer {HUGGINGFACEHUB_API_TOKEN}"}
    payloads = {
        "inputs": substring_after(message, "<|assistant|>")
    }
    response = requests.post(API_URL, headers=headers, json=payloads)

    with open('audio.flac', 'wb') as file:
        file.write(response.content)
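
With all three functions defined, a minimal end-to-end run (reusing the photo.jpg placeholder from earlier) looks like this:

# Caption the image, expand the caption into a story, then voice the story
scenario = image_2_text('photo.jpg')
story = generate_story(scenario)
text_to_speech(story)  # writes audio.flac next to the script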

The resulting audio sounds like this: https://share.zight.com/8Lu6obd5

To run everything via Streamlit, here is the complete code, also available in our repository: https://github.com/uokesita/ImageToStory

from dotenv import find_dotenv, load_dotenv
from transformers import pipeline
import torch
import requests
import os
import streamlit as st

load_dotenv(find_dotenv())
HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
# image to text

def image_2_text(url):
    image_to_text = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    text = image_to_text(url)

    # print(text)
    return text

# generate a story from the caption
def generate_story(scenario):
    pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-alpha", torch_dtype=torch.float16, device_map="auto")

    messages = [
        {
            "role": "system",
            "content": "You are a story teller; You can generate a short story based on a single narrative, the story should be no more than 30 words long",
        },
        {"role": "user", "content": str(scenario)},
    ]
    prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
    story = substring_after(outputs[0]["generated_text"], "<|assistant|>")
    print(story)

    return story

def substring_after(s, delim):
    return s.partition(delim)[2]

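# text to speech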
def text_to_speech(message):
    API_URL = "https://api-inference.huggingface.co/models/espnet/kan-bayashi_ljspeech_vits"
    headers = {"Authorization": f"Bearer {HUGGINGFACEHUB_API_TOKEN}"}
    payloads = {
        "inputs": message
    }
    response = requests.post(API_URL, headers=headers, json=payloads)

    with open('audio.flac', 'wb') as file:
        file.write(response.content)

def main():
    st.set_page_config(page_title="Use AI to create a story from an image")
    st.header("Use AI to create a story from an image")

    uploaded_file = st.file_uploader("Choose an image", type="jpg")

    if uploaded_file is not None:
        bytes_data = uploaded_file.getvalue()
        with open(uploaded_file.name, "wb") as file:
            file.write(bytes_data)
        st.image(uploaded_file, caption="Uploaded Image", use_column_width=True)
        scenario = image_2_text(uploaded_file.name)
        story = generate_story(scenario)
        text_to_speech(story)

        with st.expander("scenario"):
            st.write(scenario)
        with st.expander("story"):
            st.write(story)

        st.audio("audio.flac")

if __name__ == '__main__':
    main()

You can run it with:
streamlit run app.py
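
If you are starting from a clean environment, you will probably need to install the dependencies first (package names inferred from the imports above; exact versions may vary):
pip install transformers torch streamlit python-dotenv requests pillow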

Examples and Use Cases

We'll explore real-world examples and use cases of image-to-story conversion using Hugging Face AI models like the ones above.

Enhancing Social Media Engagement: Imagine you're a social media manager for a travel company. You can use image-to-story conversion to automatically generate captivating captions and stories for your travel photos. This saves time and increases engagement with your audience.
 
Accessibility and Inclusivity: Image-to-text conversion has profound implications for accessibility. By providing textual descriptions of images, we make digital content more inclusive for visually impaired individuals who rely on screen readers to access information.
 
Content Creation and Marketing: Content creators and marketers can leverage image-to-story conversion to breathe life into their visual content. Whether it's generating product descriptions from images or crafting engaging narratives for advertising campaigns, the possibilities are vast.
 
Educational Resources: Educators can use this technology to create educational materials that bridge the gap between visual and textual content. Images in textbooks, for example, can be accompanied by automatically generated explanations.
 
News and Journalism: Journalists can use image captioning to briefly summarize the content of news images, making it easier to search and categorize visual content in news archives.
 
Artificial Intelligence in Art: In the realm of art, this technology can be used to generate artistic descriptions for paintings and visual artworks, adding a layer of interpretation to the viewer's experience.

These are just a few examples of how image-to-text conversion can be applied across various domains. The versatility of this technology makes it a valuable tool in the hands of creators, educators, businesses, and anyone looking to enhance their content with AI-generated narratives.

Conclusion

As we wrap up this journey into the fascinating world of image-to-story conversion, let's take a moment to recap what we've learned and the possibilities that lie ahead.

Empowering Creativity: With the help of Hugging Face AI models like BLIP and Zephyr, we've unlocked the power to transform images into captivating stories. This technology empowers us to unleash our creativity and bring stories to life in ways we couldn't have imagined before.
 
Real-World Impact: Image-to-text conversion has a significant impact on accessibility, content creation, marketing, education, journalism, and more. It's a versatile tool that opens up new possibilities in various fields, making it easier to bridge the gap between visuals and text.
 
Continuous Exploration: The world of AI and natural language processing is continuously evolving. As you embark on your own image-to-story adventures, remember that innovation knows no bounds. Explore, experiment, and push the boundaries of what's possible with AI.
