Remove background from images
Generate speech using a cloned voice
Run YOLO-World object detection on images