A pitfall in AI Studio's Get Code for a prompt with an image (the curl problem)

JLin
17 min read · 2 hours ago


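I wanted a plain HTML page that uses JavaScript to grab frames from a video stream at a fixed interval and send each one to the multimodal Gemini 1.5 Flash model, asking it to describe what is on screen. Once 10 frames have accumulated, the page asks the model to summarize the stream or respond to it as a whole.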

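The result: with nothing but HTML and JavaScript, the page captures frames from an HLS stream and keeps asking Gemini Flash to describe what is happening in the video. (On the free tier, watch the request-rate and usage limits.)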
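The describe-then-summarize loop can be sketched roughly as below. This is a minimal sketch, not the code from my page: `createDescriptionBatcher`, `describeFrame`, and `summarize` are hypothetical names, and the two callbacks stand in for whatever actually calls Gemini. A tick is skipped while the previous request is still running, which is one simple way to stay under a request-rate limit.

```javascript
// Hypothetical helper: collects one description per captured frame and,
// once `batchSize` descriptions have accumulated, asks for a summary.
function createDescriptionBatcher({ describeFrame, summarize, batchSize = 10 }) {
  let pending = [];      // descriptions waiting to be summarized
  let inFlight = false;  // true while a request is still outstanding

  // Call this from a timer. A tick is skipped while the previous one is
  // still running, so at most one request is active at any time.
  return async function tick(frame) {
    if (inFlight) return null;
    inFlight = true;
    try {
      pending.push(await describeFrame(frame));
      if (pending.length < batchSize) return null;
      const batch = pending;
      pending = [];
      return await summarize(batch);
    } finally {
      inFlight = false;
    }
  };
}
```

In a page, `tick` would be driven by something like `setInterval`, with a `.catch` on each call so that a failed request does not go unhandled.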

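Note: I only run this HTML page locally on my own phone or computer. If the page were publicly accessible, loading your API key in JavaScript and calling the API directly is not recommended; wrap the call in a backend service instead.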

After opening the official AI Studio, I happily started testing: I added a sample image and typed "what is this".

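I clicked Get Code, only to find that the JavaScript option is a Node.js version. I did not want to set up Node.js, because I wanted an HTML file that I could open from anywhere and run directly.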

So I switched to curl.

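I cheerfully handed the curl command to ChatGPT o1-mini and asked it to convert it into plain JavaScript. The result kept failing with an error saying that the file type should be binary, not JSON. Because I wanted the whole HTML page to be generated by an LLM, I did not look closely at what the error actually was; I just pasted the error code back and asked for a fix and a regeneration, over and over. I did not keep screenshots from that stage, so there are no images of it.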

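Along the way I switched between ChatGPT o1-preview, o1-mini, the Gemini Pro exp version, and so on, and none of them fixed the code. The only one I did not try was GitHub Copilot, which I could not be bothered to open, and I suspect it would have done no better. Then I noticed this passage in one of Gemini Pro's replies: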

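Only when I read it carefully did I see the problem. Even though Gemini had spotted it, it still did not fix it correctly: it said this piece of code was wrong and should be removed, yet it was still there in the output XD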

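In short, the error is that a single curl request should carry only one content type, but this one specifies two. The header declares application/json and the JSON is sent with -d "….", yet the same command also has a --data-binary. That is the mistake.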
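Put differently, the JSON metadata and the binary image belong in two separate requests, each with its own content type. Here is a minimal sketch of the two request shapes in JavaScript. The function names `buildStartRequest` and `buildUploadRequest` are hypothetical; the header names and values are the ones used in my working code further down.

```javascript
// Request 1: start the upload. The body is JSON metadata only.
function buildStartRequest(displayName, numBytes, mimeType) {
  return {
    method: 'POST',
    headers: {
      'X-Goog-Upload-Protocol': 'resumable',
      'X-Goog-Upload-Command': 'start',
      'X-Goog-Upload-Header-Content-Length': String(numBytes),
      'X-Goog-Upload-Header-Content-Type': mimeType,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ file: { display_name: displayName } }),
  };
}

// Request 2: send the bytes. The body is the raw binary, and the
// content type is the type of the image itself, not JSON.
function buildUploadRequest(bytes, mimeType) {
  return {
    method: 'POST',
    headers: {
      'X-Goog-Upload-Command': 'upload, finalize',
      'X-Goog-Upload-Offset': '0',
      'Content-Type': mimeType,
    },
    body: bytes,
  };
}
```

Each object can be passed as the second argument to `fetch`. The point is only that no single request mixes a JSON body with a binary one.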

— — — — —

After that I simply gave up on the AI and opened the API Docs link at the top myself.

The relevant part is the Prompt with images section:

https://ai.google.dev/gemini-api/docs/vision?lang=rest#prompting-images

The key is further down the page: this official sample is the correct one.

https://github.com/google-gemini/generative-ai-python/blob/4f42118d3d4daf9addc1c18f7d6895f7cd74087d/samples/rest/files.sh#L84-L133

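After some digging, it turns out there are several steps:

1. POST to upload/v1beta/files with a displayName of your choosing.

2. The response gives you an upload_url, a dedicated URL for POSTing the binary file.

3. POST your image or file to that upload_url and confirm the upload succeeded.

4. Once it has, take file.uri from the returned JSON and attach it as the file_uri in one of the content parts of the JSON request.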
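Step 4 is where the uploaded file meets the prompt. Below is a minimal sketch, assuming the upload response has the shape `{ file: { uri } }` that my working code further down reads. The helper names `extractFileUri` and `buildGenerateContentBody` are hypothetical.

```javascript
// Step 4a: pull the file URI out of the JSON returned by the binary upload.
function extractFileUri(uploadResult) {
  const uri = uploadResult?.file?.uri;
  if (!uri) throw new Error('file.uri missing from upload response');
  return uri;
}

// Step 4b: reference the uploaded file from a generateContent request body.
// The image goes in as a file_data part next to the text prompt.
function buildGenerateContentBody(fileUri, mimeType, prompt) {
  return {
    contents: [
      {
        role: 'user',
        parts: [
          { file_data: { file_uri: fileUri, mime_type: mimeType } },
          { text: prompt },
        ],
      },
    ],
  };
}
```

The returned object is what gets passed through `JSON.stringify` and POSTed to the generateContent endpoint with `Content-Type: application/json`.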

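That completes the flow, and Prompt with image works correctly. If you are using Python, Node.js, Android, and so on, the SDK already handles all of this for you, so you will basically never hit this problem and do not need to know the upload sequence.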

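For example, this is all you need to write in Android Kotlin. The generated comment says media input is not supported yet, but in my actual use there is an Image-like builder that can be placed directly into the content, so the generated snippet seems to have left it out.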

// NOTE: Your prompt contains media inputs, which are not currently supported by
// the Android SDK. The code snippet below may be incomplete.
//
// See here for more information and updates:
// https://ai.google.dev/gemini-api/docs/prompting_with_media

// In Android Studio, add the following dependency to your build.gradle.kts file:
// implementation("com.google.ai.client.generativeai:generativeai:0.7.0")

// Add the following code to your Kotlin source code
import com.google.ai.client.generativeai.GenerativeModel
import com.google.ai.client.generativeai.type.asTextOrNull
import com.google.ai.client.generativeai.type.content
import com.google.ai.client.generativeai.type.generationConfig

val model = GenerativeModel(
    "gemini-1.5-flash-exp-0827",
    BuildConfig.geminiApiKey,
    generationConfig = generationConfig {
        temperature = 1f
        topK = 64
        topP = 0.95f
        maxOutputTokens = 8192
        responseMimeType = "text/plain"
    },
)

val chatHistory = listOf(
    content("user") {
        // TODO Upload file "image_animal5.jpeg"
        text("what is this\n")
    },
    content("model") {
        text("The image shows **Atlantic Puffins** (Fratercula arctica) in their natural habitat, likely on a cliff or coastal area.\n\n\nHere's a breakdown of the elements in the image:\n\n\n* **Puffins:** The main subject is a puffin in sharp focus, standing on a grassy cliff. Another puffin is visible in the background, out of focus. Puffins are characterized by their black and white plumage, bright orange beak, and orange feet. \n* **Environment:** The puffins are situated on a grassy area that looks like a cliff edge, with the ocean blurred in the background. The vegetation is primarily green, suggesting it's in a coastal or temperate region.\n* **Ocean:** The background is a soft blue, suggesting the ocean, which is the natural habitat of puffins.\n\n\n**In essence, the image captures a moment in the life of Atlantic Puffins in their natural environment.** These birds are often found in colonies on islands and cliffs near the ocean, and are known for their vibrant colors and comical appearance.\n")
    },
)

val chat = model.startChat(chatHistory)
// sendMessage is a suspend function, so call it from a coroutine.
val response = chat.sendMessage("INSERT_INPUT_HERE")

println(response.text)
println(response.candidates.first().content.parts.first().asTextOrNull())

The final JavaScript conversion looks like this. It still contains my original UI code, so please mentally strip that out XD


// Relies on page-level variables defined in the UI code (not shown here):
// apiToken, videoElement, hiddenCanvas, descriptions, updateDescriptions.
async function captureAndSegment() {
  if (!apiToken || videoElement.videoWidth === 0 || videoElement.videoHeight === 0) {
    console.log('API Token not set or video not loaded.');
    return;
  }

  try {
    // Draw the current video frame onto the hidden canvas.
    hiddenCanvas.width = videoElement.videoWidth;
    hiddenCanvas.height = videoElement.videoHeight;
    const context = hiddenCanvas.getContext('2d');
    context.drawImage(videoElement, 0, 0, hiddenCanvas.width, hiddenCanvas.height);
    console.log('Video frame captured.');

    const blob = await new Promise(resolve => hiddenCanvas.toBlob(resolve, 'image/jpeg'));
    if (!blob) {
      console.log('Failed to convert video frame to Blob.');
      return;
    }
    console.log('Video frame converted to Blob.');

    const arrayBuffer = await blob.arrayBuffer();
    const numBytes = arrayBuffer.byteLength;
    const mimeType = blob.type;
    console.log(`Blob size: ${numBytes} bytes, MIME type: ${mimeType}`);

    // 1. Send the display_name to Google first to obtain the URL for uploading the image
    const uploadUrl = `https://generativelanguage.googleapis.com/upload/v1beta/files?key=${apiToken}`;
    const uploadHeaders = {
      'X-Goog-Upload-Protocol': 'resumable',
      'X-Goog-Upload-Command': 'start',
      'X-Goog-Upload-Header-Content-Length': numBytes,
      'X-Goog-Upload-Header-Content-Type': mimeType,
      'Content-Type': 'application/json'
    };
    const uploadBody = JSON.stringify({ file: { display_name: 'captured_frame.jpeg' } });
    console.log('Sending initial upload metadata request:', uploadBody);

    const uploadResponse = await fetch(uploadUrl, {
      method: 'POST',
      headers: uploadHeaders,
      body: uploadBody
    });

    if (!uploadResponse.ok) {
      console.log('Upload API Error:', uploadResponse.status, uploadResponse.statusText);
      return;
    }
    console.log('Initial upload metadata response received.');
    // 2. Read the upload URL from the response headers
    const uploadUrlHeader = uploadResponse.headers.get('x-goog-upload-url');
    if (!uploadUrlHeader) {
      console.log('Upload URL not found in response headers.');
      return;
    }
    console.log('Upload URL:', uploadUrlHeader);

    const uploadBinaryHeaders = {
      'X-Goog-Upload-Command': 'upload, finalize',
      'X-Goog-Upload-Offset': '0',
      'Content-Length': numBytes,
      'Content-Type': mimeType
    };
    console.log('Uploading binary data to upload URL.');

    // 3. POST the binary image to that upload URL
    const uploadBinaryResponse = await fetch(uploadUrlHeader, {
      method: 'POST',
      headers: uploadBinaryHeaders,
      body: blob
    });

    if (!uploadBinaryResponse.ok) {
      console.log('Upload Binary API Error:', uploadBinaryResponse.status, uploadBinaryResponse.statusText);
      return;
    }
    console.log('Binary upload completed.');

    const fileInfo = await uploadBinaryResponse.json();
    // 4. On success, the file URI is in the response
    const fileUri = fileInfo.file?.uri;
    if (!fileUri) {
      console.log('File URI not found in binary upload response.');
      return;
    }
    console.log('File URI:', fileUri);

    // 5. Finally, add the file URI to the conversation
    const generateContentUrl = `https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-exp-0827:generateContent?key=${apiToken}`;
    // Prompt: "Describe the objects, people's actions, or facial expressions in the image"
    const prompt = "描述圖片中的物品,或者人物動作,或是表情";
    const generateContentBody = {
      contents: [
        {
          role: "user",
          parts: [
            { file_data: { file_uri: fileUri, mime_type: mimeType } },
            { text: prompt }
          ]
        }
      ],
      generationConfig: {
        temperature: 1,
        topK: 64,
        topP: 0.95,
        maxOutputTokens: 8192,
        responseMimeType: "text/plain"
      }
    };
    console.log('Sending generateContent request:', JSON.stringify(generateContentBody));

    const generateContentResponse = await fetch(generateContentUrl, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(generateContentBody)
    });

    if (!generateContentResponse.ok) {
      console.log('GenerateContent API Error:', generateContentResponse.status, generateContentResponse.statusText);
      return;
    }
    console.log('GenerateContent response received.');

    const generateContentData = await generateContentResponse.json();
    console.log('GenerateContent response data:', generateContentData);

    // At this point it has basically succeeded.
    // The fallback string means "No description provided."
    const description =
      generateContentData.candidates?.[0]?.content?.parts?.[0]?.text ?? "未提供描述。";
    descriptions.push(description);
    updateDescriptions();
    console.log('Description added:', description);
  } catch (error) {
    console.log('Error:', error);
  }
}

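Even after I gave the LLMs the curl flow from GitHub, none of them produced correct code. In the end I read the code myself and fixed it by hand until it finally worked. That is the pitfall I ran into using prompt with image + curl from Google AI Studio.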


JLin

Java / Android / Kotlin / Kotlin Native. Mildly interested in Kotlin-related JVM technologies, Compose for Web / Desktop, Ktor Server, and all kinds of generative AI (Gemini / OpenAI) applications.