Power of a Prompt & a Human in the loop

About 15 to 20 years ago I thought about cataloguing our DVD collection, and the solution I was leaning towards then was to pay some cub scouts to do it. Well, it’s 2025, and my phone camera and Gemini 2.0 Flash Experimental solved the problem.

I knew AI was going to have a direct effect on jobs, but I didn’t think it would affect cub scouts earning a few pounds.

This post was supposed to be about using Cursor to build the rest of the DVD cataloguing app I discussed here.

However, while evaluating which model to use to analyse a photo and extract the DVD titles, it became evident that it wasn’t worth building that part of the app: it would be quicker to just use Gemini Flash with an appropriate prompt. That accomplished everything the media reader app I had planned would have done, in far less time. I probably would have had fun creating some monster Raspberry Pi scanning thing, but my priority was to categorise the DVDs.

The app I had planned needed to meet the following requirements:

  1. Be able to take a high res image
  2. Identify the DVDs in the image
  3. Identify the genre the DVDs in the image belonged to
  4. Output a CSV file for each photo containing the title of each DVD, its location and the genre it should be catalogued against

I had thoughts of using a Raspberry Pi + camera until I started evaluating a model to do this.

In just over an hour I had catalogued 198 DVDs (at the time of writing I am up to 380 DVD entries).

I was initially thinking of building an app to run on a Raspberry Pi + Pi camera, so I wanted to evaluate local models, which I’d narrowed down to:

  • Llama 3.2-vision
  • Llava

(mostly because they ran on my Mac)

I wasn’t able to get PaliGemma to run on my Mac (I was trying to do this without resorting to using my server, for reasons!), otherwise it would have been included. Yes, I know you can probably get it to run by compiling transformers from source, but that somehow needed to read my emails to do so and, despite my curiosity, that was never going to be a thing!

First I needed to figure out a prompt to use to evaluate the models. As all of this was meant to be as reliant on AI as possible, I asked Gemini for one, using the following prompt:

"If i provide an image that contains a number of dvds to a LLM vision model what prompt should i use to get it to output the name of each dvd in a csv file"

Its response was:

Prompt:

"Analyze the image provided and identify all visible DVDs.

Output the results in a CSV file with the following format:

DVD Name

Example:

The Matrix
Lord of the Rings: The Fellowship of the Ring
Inception"

Explanation:

  • Clear Instructions: The prompt explicitly states the task: analyze the image and identify DVDs.
  • CSV Output Specification: It clearly defines the desired output format (CSV file with a single column for “DVD Name”).
  • Example: Providing an example helps the model understand the expected output structure and style.
  • Conciseness: The prompt is concise and avoids unnecessary words.

Key Considerations:

  • Image Quality: The quality of the image significantly impacts the model’s ability to accurately identify the DVDs. Ensure the image is clear, well-lit, and has a good resolution.
  • DVD Case Variations: The model may struggle with variations in DVD case designs, such as different sizes, colors, and spine styles.
  • Text Recognition: The model will need to accurately read the text on the DVD spines to identify the titles.
  • Model Capabilities: The specific capabilities of the LLM vision model will influence the accuracy and reliability of the results. Some models may be better suited for this task than others.

Additional Tips:

  • Pre-process the Image: If possible, pre-process the image to enhance clarity and improve the model’s performance. This might involve cropping, resizing, or adjusting contrast.

  • Test with a Small Dataset: Before processing the full image, test the model with a small subset of DVDs to evaluate its accuracy and identify potential issues.

  • Iterate and Refine: Continuously refine the prompt and adjust parameters based on the model’s performance.

  • By following these guidelines and carefully considering the specific requirements of your task, you can effectively use an LLM vision model to extract DVD names from images and generate the desired CSV output.

The next thing I did was revisit Open WebUI on my laptop, firing it up using Docker:

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main

I uploaded the following photo and then used a shortened version of the prompt supplied by Gemini for the initial test, just to validate how well the models analysed the DVDs:

Analyze the image provided and identify all visible DVDs.
Example:
The Matrix
Lord of the Rings: The Fellowship of the Ring
Inception

[Photo: the DVDs used for the initial test]

Llama 3.2-vision and Llava gave responses that were uncannily similar 🤔. Llava was a lot faster than Llama 3.2-vision. Here is an example of one of their responses:

 In the image you've provided, there are five DVD cases stacked on top of each other. The visible titles from left to right are:

1. "Maitret" - A French television series about a group of women in Nazi-occupied France.
2. "Ellery Queen" - An American comedy television series created by Megan Levens and showrunner Danielle Stubbs, that is set in the fictional town of Elmer's Inlet.
3. "Babysitter's Guide to Monster Hunting" - A horror-comedy film directed by Emily Hickerson and produced by Kevin Kwan.
4. "Comedies" - It appears there are multiple DVD cases in this section, but the titles are not clearly visible.
5. "Rope" - A British crime drama television series created by Philip Shelley and Matthew Read, that is set in the fictional town of Rochdale, Lancashire.

The stacked DVDs seem to be part of a collection with various genres represented, including comedy, mystery, horror, and drama

This was pretty poor. Not only did they get many titles wrong (although I would like to watch “Babysitter’s Guide to Monster Hunting”), both LLMs also went beyond the specified instructions, which just compounded the inaccuracies. Him indoors suggested that photos of the edges of the DVDs were probably too hard for the models to work with. I was unconvinced, so I thought let’s see how Gemini 2.0 Flash does with this. Its response was:

Here are the DVDs visible in the image:
* Maigret
* Kitaro
* Ellery Queen Mysteries
* Babycart in Peril
* Comedies
* Shakespeare
* Rope
* Highlander
* Justice League
* Manga (partially visible)
If you'd like, you can provide another image and I'll try to identify the DVDs in it as well.

It just nailed it, albeit with a slight disconnect over Comedies / Shakespeare, and amazingly enough it even identified the Justice League (Dark) DVD, although I wondered what the partially visible Manga one was (it was actually Clint Eastwood’s Firefox).

I then decided to see how Gemini 2.0 Flash Experimental did with the prompt I actually wanted to use, on another image (yes, I know, for some reason I ditched the example in the prompt):

Analyze this image, identify the dvds and the genre each one is associated with . Output as a comma separated list with the dvd title identified, genre and a column called location with an entry of "Living room" for each of the DVDs identified

Response

DVD Title,Genre,Location
Mystery Men,Superhero/Comedy,Living room
Hancock,Superhero/Action/Comedy,Living room
Constantine,Supernatural/Action/Horror,Living room
Blade,Action/Horror/Superhero,Living room
Blade Trinity,Action/Horror/Superhero,Living room
Perfect Collection Box Set,Anime/Action/Sci-Fi (Likely a collection of related anime),Living room
Outlaw Star,Anime/Space Western/Action/Adventure,Living room
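A reply in this shape is easy to check mechanically before it goes anywhere near a catalogue. What follows is only a minimal sketch, assuming the reply text has been pasted into a string; the function name `parse_response` and the sample text are made up for illustration. It keeps rows with exactly three columns and sets aside anything else, such as a line where two records have run together, for a human to look at.

```python
import csv
import io

# Column names taken from the header row the model returned
EXPECTED_HEADER = ["DVD Title", "Genre", "Location"]


def parse_response(text):
    """Split a model reply into well-formed rows and lines that need a human look."""
    rows, rejects = [], []
    for fields in csv.reader(io.StringIO(text.strip())):
        fields = [f.strip() for f in fields]
        if not any(fields):
            continue  # skip blank lines
        if fields == EXPECTED_HEADER:
            continue  # skip the header row
        if len(fields) != len(EXPECTED_HEADER):
            rejects.append(fields)  # wrong column count: leave for manual review
            continue
        rows.append(dict(zip(EXPECTED_HEADER, fields)))
    return rows, rejects


# Illustrative sample: the last line has two records fused together
sample = """DVD Title,Genre,Location
Mystery Men,Superhero/Comedy,Living room
Hancock,Superhero/Action/Comedy,Living room
Constantine,Supernatural/Action/Horror,Living room Blade,Action/Horror/Superhero,Living room
"""
rows, rejects = parse_response(sample)
print(len(rows), "rows accepted,", len(rejects), "set aside for review")
```

Using the `csv` module rather than a plain `split(",")` means a title containing a comma still parses, provided the model quotes it.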

Despite Gemini’s admittedly impressive performance, it did have issues.

Occasional hallucinations. One example was where it couldn’t actually analyze the “South Park: Bigger, Longer & Uncut” DVD but filled in the gaps by deciding it was likely to be “Southpark 10”. Good guess, but wrong! This is an excellent reminder of how LLMs work. After a while wandering around the house taking photos of DVDs I got sloppy, and glare crept into some of the shots. However, for “Adventure Time”, which also had the glare issue, the analysis was okay.

[Photo: DVDs affected by glare]

I sometimes think it felt we should have more documentaries in our collection: when it couldn’t categorise a DVD and didn’t just state unknown, it would make things up, and quite often that inaccurate category would be Documentary.

Inconsistencies. The Avengers (the ’60s action series) box sets we have were randomly categorised as action or drama. It could of course have used both labels, and indeed it had used multiple categories in earlier scans.
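This kind of inconsistency can be spotted automatically once the rows are collected. The sketch below is one possible approach, not something used for this catalogue: it treats a genre label like "Action/Horror" as a set of lower-cased parts, groups titles by a crude series key, and reports any group that ended up with more than one distinct genre set. The row data, the volume names and the "first two words" key are all assumptions for illustration.

```python
from collections import defaultdict


def genre_set(label):
    """Turn 'Action/Drama' into a set so that order and case do not matter."""
    return frozenset(p.strip().lower() for p in label.split("/") if p.strip())


def first_two_words(title):
    """Crude series key: the first two words of the title (an assumption)."""
    return " ".join(title.split()[:2]).lower()


def find_inconsistent(rows, series_key=first_two_words):
    """Group rows by series key; return the groups with conflicting genre sets."""
    seen = defaultdict(set)
    for row in rows:
        seen[series_key(row["DVD Title"])].add(genre_set(row["Genre"]))
    return {key: labels for key, labels in seen.items() if len(labels) > 1}


# Hypothetical rows; the volume names are made up for the example
rows = [
    {"DVD Title": "The Avengers Series 4", "Genre": "Action"},
    {"DVD Title": "The Avengers Series 5", "Genre": "Drama"},
    {"DVD Title": "Blade", "Genre": "Action/Horror/Superhero"},
    {"DVD Title": "Blade Trinity", "Genre": "Horror/Action/Superhero"},
]
conflicts = find_inconsistent(rows)
for series, labels in conflicts.items():
    print(series, "->", sorted("/".join(sorted(s)) for s in labels))
```

Only the two Avengers rows are reported here. The Blade rows use the same labels in a different order, so they compare equal, and they fall under different keys anyway.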

Struggles with box sets and some titles, e.g. it decided Deadpool & Wolverine was actually two DVDs, one called Deadpool and the other Wolverine. The photo that included this DVD was fine imho though, so no blaming the one wielding the phone for that!

Saying you don’t know is fine, rather than providing inaccurate information; humans know this. I know we are currently living in a facts-don’t-matter world, but that does have consequences. I’m just cataloguing DVDs, but you can imagine the adverse outcomes in more sensitive scenarios.

So it’s human in the loop, or perhaps, for larger-scale apps, using LLMs as judges is one way to automate the validation. But then who judges the judges?
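Short of an LLM judge, the human's job can at least be narrowed down. Here is a rough sketch, assuming rows are dictionaries with "DVD Title" and "Genre" keys, that flags only the rows matching the failure modes seen above: a fallback genre such as Documentary or unknown, or hedging words such as "Likely" or "partially visible". The function name, the word list and the sample rows are illustrative assumptions, and a row that passes these checks can still be wrong.

```python
import re

# Genres the model tended to fall back on when it was guessing
SUSPECT_GENRES = {"documentary", "unknown"}
# Words suggesting the model was unsure of its own answer
HEDGE = re.compile(r"\b(likely|possibly|probably|partially visible)\b", re.IGNORECASE)


def needs_review(row):
    """Return the reasons a human should check this row (empty list if none)."""
    reasons = []
    genres = {g.strip().lower() for g in row["Genre"].split("/")}
    if genres & SUSPECT_GENRES:
        reasons.append("genre is a common fallback guess")
    if HEDGE.search(row["DVD Title"]) or HEDGE.search(row["Genre"]):
        reasons.append("model hedged its answer")
    if not row["DVD Title"].strip():
        reasons.append("missing title")
    return reasons


# Illustrative rows, loosely based on the kinds of output described in this post
rows = [
    {"DVD Title": "Hancock", "Genre": "Superhero/Action/Comedy"},
    {"DVD Title": "Perfect Collection Box Set",
     "Genre": "Anime/Action/Sci-Fi (Likely a collection of related anime)"},
    {"DVD Title": "Manga (partially visible)", "Genre": "Documentary"},
]
for row in rows:
    reasons = needs_review(row)
    if reasons:
        print(row["DVD Title"], "->", "; ".join(reasons))
```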

Yes, the validation step will slow things down, as I was definitely the bottleneck in the process, but at the moment, even with SOTA models like Gemini, you still need to validate the output. Now all I needed to do, to keep to the arbitrary rules I’d set myself about not actually writing a line of code myself (I’m just the guide), was get Cursor to generate the code to batch load the CSV files into the SQLite database.
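For a sense of what that batch load could look like, here is a minimal sketch using only Python's standard library. It is not the code Cursor generated. The table name `dvds`, its columns, the uniqueness rule and the function name `load_csv_folder` are all assumptions; the CSV header names are the ones from the Gemini output earlier in the post.

```python
import csv
import sqlite3
from pathlib import Path


def load_csv_folder(db_path, csv_dir):
    """Insert every row from every *.csv file in csv_dir; return rows inserted."""
    conn = sqlite3.connect(db_path)
    # Assumed schema: one row per title per location
    conn.execute(
        """CREATE TABLE IF NOT EXISTS dvds (
               title    TEXT NOT NULL,
               genre    TEXT,
               location TEXT,
               UNIQUE (title, location)
           )"""
    )
    inserted = 0
    with conn:  # commits on success, rolls back if an exception is raised
        for csv_file in sorted(Path(csv_dir).glob("*.csv")):
            with open(csv_file, newline="", encoding="utf-8") as fh:
                for row in csv.DictReader(fh):
                    cur = conn.execute(
                        "INSERT OR IGNORE INTO dvds (title, genre, location) "
                        "VALUES (?, ?, ?)",
                        (
                            row["DVD Title"].strip(),
                            row["Genre"].strip(),
                            row["Location"].strip(),
                        ),
                    )
                    inserted += cur.rowcount  # 0 when the row was a duplicate
    conn.close()
    return inserted
```

Calling `load_csv_folder("dvds.db", "scans")` (both names hypothetical) would load every CSV in a `scans` folder. Because of the `UNIQUE` constraint and `INSERT OR IGNORE`, running it again over the same files adds nothing, which is handy when a photo gets scanned twice.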