Breaking Through Vercel's 60-Second Wall: Running Heavy Analysis on Serverless

504 Gateway Timeout: It Started Out Simple

The core feature of BioAI Market is proteomics data analysis. When a user uploads a CSV, the pipeline runs QC (Quality Control) → DE (Differential Expression) → Pathway analysis in sequence. The problem was that the full pipeline took more than 3 minutes.

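On Vercel's free plan, the serverless function timeout is 60 seconds. Even on the Pro plan ($20/month), it is 300 seconds. At first I naively ran the whole pipeline in a single API call and, unsurprisingly, got this error: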

504 Gateway Timeout
FUNCTION_INVOCATION_TIMEOUT
Task timed out after 60.00 seconds

"I'll just split the analysis up," I thought. Reality turned out to be less simple.

Attempt 1: Splitting the Analysis into Stages (Partial Success)

I split the pipeline into three APIs:

POST /api/analysis/qc       → QC analysis (~20 s)
POST /api/analysis/de       → DE analysis (~40 s)
POST /api/analysis/pathway  → Pathway analysis (~30 s)

The frontend called them in sequence:

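// Step 1: QC
const qcRes = await fetch('/api/analysis/qc', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ data, groups })
})
const qcResult = await qcRes.json()

// Step 2: DE (needs the QC output)
const deRes = await fetch('/api/analysis/de', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ data, groups, qcResult })
})
const deResult = await deRes.json()

// Step 3: Pathway (needs the DE output)
const pathwayRes = await fetch('/api/analysis/pathway', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ deResult, organism: 'human' })
})
const pathwayResult = await pathwayRes.json()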

The two-group comparison ran fine. Each stage finished within 40 seconds, so nothing hit the timeout. But a 9-group ANOVA analysis produced another 504.

POST /api/analysis/de - 504 Gateway Timeout

With 9-group ANOVA plus Tukey HSD, the DE stage alone took more than 90 seconds. I had hit the 60-second wall again.
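To get a feel for why nine groups is so much heavier than two, it helps to count the pairwise comparisons that a Tukey HSD post-hoc test makes. The sketch below is plain arithmetic, not the pipeline's actual code. The 5,000-protein figure is the dataset size quoted later in this post, and the assumption that every protein gets every pairwise comparison is mine.

```python
from math import comb


def pairwise_comparisons(n_groups: int) -> int:
    """Number of unordered group pairs compared by a Tukey HSD post-hoc test."""
    return comb(n_groups, 2)


n_proteins = 5000  # dataset size mentioned in this post

for n_groups in (2, 9):
    pairs = pairwise_comparisons(n_groups)
    print(f"{n_groups} groups: {pairs} pair(s) per protein, "
          f"{pairs * n_proteins} comparisons in total")
```

Going from 2 to 9 groups takes each protein from 1 pair to 36, so under this assumption both the compute and the result set grow by roughly that factor.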

Attempt 2: The Real Culprit Was Payload Size

I lost two more days here. I was busy optimizing code to cut the DE analysis time, but a closer look at the logs showed something odd:

[DE] Analysis completed in 43.2 seconds
[DE] Serializing response...
504 Gateway Timeout

The analysis finished in 43 seconds, yet the request returned a 504? It finished well within 60 seconds, so why?

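The cause was the size of the response payload. Serializing the DE results for 9 groups × 5,000 proteins to JSON produced tens of MB. On Vercel serverless, the timeout covers not only compute time but also the time spent sending the response body to the client. Even though the computation finished at 43 seconds, transmitting the huge JSON pushed the request past 60 seconds.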

Getting to this realization was the hardest part. I could not find this case spelled out anywhere in the official documentation.
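In hindsight, measuring the serialized size before sending anything would have saved those two days. Here is a minimal sketch of that check, assuming the response is built with Python's standard `json` module. The record contents and the helper names `json_size_bytes` and `estimate_response_mb` are illustrative, not taken from the real pipeline.

```python
import json


def json_size_bytes(payload) -> int:
    """Size in bytes of the UTF-8 encoded JSON body for `payload`."""
    return len(json.dumps(payload).encode("utf-8"))


# Hypothetical full-detail record for one protein (values are made up).
sample_record = {
    "protein": "P12345",
    "log2fc": 1.2345678901,
    "pvalue": 0.000123456789,
    "adj_pvalue": 0.00456789012,
    "mean_control": 20.123456789,
    "mean_treatment": 21.357913579,
    "std_control": 0.4567890123,
    "std_treatment": 0.5678901234,
    "ci_lower": 0.9876543210,
    "ci_upper": 1.4814814815,
    "effect_size": 2.3456789012,
}


def estimate_response_mb(record: dict, n_records: int) -> float:
    """Extrapolate the size of a JSON list of `n_records` similar records."""
    # json.dumps separates list items with ", " (2 bytes); together with the
    # two enclosing brackets this works out to 2 extra bytes per record.
    return (json_size_bytes(record) + 2) * n_records / 1_000_000


print(f"one record: {json_size_bytes(sample_record)} bytes")
print(f"5000 records: {estimate_response_mb(sample_record, 5000):.2f} MB")
```

Logging a number like this next to the "Analysis completed" line would have shown right away that the response, not the computation, was the slow part.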

Fix 1: Response Trimming (90% Smaller)

The first fix was to shrink the response:

# Before: return full detail for every protein
results = []
for protein in proteins:
    results.append({
        "protein": protein.name,
        "log2fc": protein.log2fc,
        "pvalue": protein.pvalue,
        "adj_pvalue": protein.adj_pvalue,
        "mean_control": protein.mean_control,
        "mean_treatment": protein.mean_treatment,
        "std_control": protein.std_control,
        "std_treatment": protein.std_treatment,
        "ci_lower": protein.ci_lower,
        "ci_upper": protein.ci_upper,
        "effect_size": protein.effect_size,
        # ... 15 fields in total
    })

# After: full detail for the top 100 only, stripped records for the rest
results_sorted = sorted(results, key=lambda x: x['adj_pvalue'])
top_results = results_sorted[:100]  # full detail
stripped_results = [
    {"protein": r["protein"], "log2fc": r["log2fc"], "adj_pvalue": r["adj_pvalue"]}
    for r in results_sorted[100:]
]

response = {
    "top": top_results,
    "rest": stripped_results,
    "summary": {
        "total": len(results),
        "significant": sum(1 for r in results if r["adj_pvalue"] < 0.05)
    }
}

Result: the JSON shrank from 18 MB to 1.8 MB, a reduction of about 90%. In most cases this alone was enough for the response to complete within 60 seconds.
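The effect of trimming is easy to reproduce on synthetic data. The sketch below wraps the same top-N idea in a function and compares serialized sizes. The random records, the ten-field layout, and the names `make_synthetic_results`, `trim_results`, and `json_bytes` are assumptions for illustration, so the saving it prints will not necessarily match the 90% I saw on real data with more fields per record.

```python
import json
import random

DETAIL_FIELDS = [
    "log2fc", "pvalue", "adj_pvalue", "mean_control", "mean_treatment",
    "std_control", "std_treatment", "ci_lower", "ci_upper", "effect_size",
]
KEPT_FIELDS = ("protein", "log2fc", "adj_pvalue")


def make_synthetic_results(n: int, seed: int = 0) -> list:
    """Random stand-in records; real DE output would come from the pipeline."""
    rng = random.Random(seed)
    return [
        {"protein": f"PROT{i:05d}", **{field: rng.random() for field in DETAIL_FIELDS}}
        for i in range(n)
    ]


def trim_results(results: list, top_n: int = 100) -> dict:
    """Keep full detail for the top_n most significant records only."""
    ranked = sorted(results, key=lambda r: r["adj_pvalue"])
    return {
        "top": ranked[:top_n],
        "rest": [{key: r[key] for key in KEPT_FIELDS} for r in ranked[top_n:]],
        "summary": {
            "total": len(results),
            "significant": sum(1 for r in results if r["adj_pvalue"] < 0.05),
        },
    }


def json_bytes(payload) -> int:
    """Size in bytes of the UTF-8 encoded JSON body."""
    return len(json.dumps(payload).encode("utf-8"))


full = make_synthetic_results(5000)
trimmed = trim_results(full)
full_size, trimmed_size = json_bytes(full), json_bytes(trimmed)
print(f"full: {full_size} B, trimmed: {trimmed_size} B, "
      f"saved: {1 - trimmed_size / full_size:.0%}")
```

The more fields each full record carries, the larger the saving, since every stripped record keeps only three of them.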

But this was a stopgap. It was obvious that the problem would come back as the data grew.

Final Fix: Async Job Queue (Submit + Poll Pattern)

The fundamental fix was an asynchronous job queue: decoupling the analysis from the HTTP request-response cycle.

Architecture

[Frontend]                    [Vercel API]              [Python Backend]
    |                              |                          |
    |--- POST /jobs/submit ------->|                          |
    |<-- { jobId: "abc123" } ------|--- POST /analyze ------->|
    |                              |<-- 202 Accepted ---------|
    |--- GET /jobs/abc123 -------->|                          |
    |<-- { status: "running" } ----|  (analysis running...)   |
    |                              |                          |
    |--- GET /jobs/abc123 -------->|                          |
    |<-- { status: "running" } ----|                          |
    |                              |                          |
    |--- GET /jobs/abc123 -------->|--- GET /result/abc123 -->|
    |<-- { status: "completed" } --|<-- { result: ... } ------|

Frontend: Polling Every 3 Seconds

async function submitAndPoll(data: AnalysisInput): Promise<AnalysisResult> {
  // 1. Submit
  const submitRes = await fetch('/api/jobs/submit', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(data)
  })
  const { jobId } = await submitRes.json()

  // 2. Poll every 3 seconds
  while (true) {
    await new Promise(resolve => setTimeout(resolve, 3000))

    const pollRes = await fetch(`/api/jobs/${jobId}`)
    const job = await pollRes.json()

    if (job.status === 'completed') {
      return job.result
    }
    if (job.status === 'failed') {
      throw new Error(job.error || 'Analysis failed')
    }

    // Update progress UI
    setProgress(job.progress || 0)
    setStatusMessage(job.message || 'Analyzing...')
  }
}
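One thing the loop above lacks is an upper bound: if the backend loses the job, the client polls forever. A minimal sketch of the same submit-and-poll loop with a deadline follows, written in Python to match the backend. The function name `poll_job`, the 600-second default, and the injectable `sleep` and `clock` parameters are my own choices for illustration and are not part of the production frontend.

```python
import time
from typing import Callable


def poll_job(
    fetch_status: Callable[[], dict],
    interval_s: float = 3.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the job completes or fails, giving up after timeout_s seconds."""
    deadline = clock() + timeout_s
    while True:
        job = fetch_status()
        if job.get("status") == "completed":
            return job["result"]
        if job.get("status") == "failed":
            raise RuntimeError(job.get("error") or "Analysis failed")
        if clock() >= deadline:
            raise TimeoutError(f"Job still running after {timeout_s} seconds")
        sleep(interval_s)


# Demo with a fake status source standing in for GET /api/jobs/<jobId>.
responses = iter([
    {"status": "running", "progress": 10},
    {"status": "running", "progress": 40},
    {"status": "completed", "result": {"significant": 42}},
])
result = poll_job(lambda: next(responses), sleep=lambda s: None)
print(result)
```

Passing `sleep` and `clock` in as parameters keeps the loop testable without real waiting. The same deadline check carries over to the TypeScript version with `Date.now()`.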

Backend: Running the Analysis with threading

import threading
import uuid
from flask import Flask, request, jsonify

app = Flask(__name__)
jobs = {}  # In-memory store (Redis recommended in production)

@app.route('/analyze', methods=['POST'])
def submit_job():
    job_id = str(uuid.uuid4())
    data = request.json

    jobs[job_id] = {"status": "running", "progress": 0, "message": "Starting..."}

    # Run the analysis in a background thread
    thread = threading.Thread(target=run_analysis, args=(job_id, data))
    thread.start()

    return jsonify({"jobId": job_id}), 202

@app.route('/result/<job_id>', methods=['GET'])
def get_result(job_id):
    job = jobs.get(job_id)
    if not job:
        return jsonify({"error": "Job not found"}), 404
    return jsonify(job)

def run_analysis(job_id, data):
    try:
        # QC
        jobs[job_id]["message"] = "Running QC analysis..."
        jobs[job_id]["progress"] = 10
        qc_result = run_qc(data)

        # DE
        jobs[job_id]["message"] = "Running differential expression..."
        jobs[job_id]["progress"] = 40
        de_result = run_de(data, qc_result)

        # Pathway
        jobs[job_id]["message"] = "Running pathway analysis..."
        jobs[job_id]["progress"] = 70
        pathway_result = run_pathway(de_result)

        jobs[job_id] = {
            "status": "completed",
            "progress": 100,
            "result": {
                "qc": qc_result,
                "de": de_result,
                "pathway": pathway_result
            }
        }
    except Exception as e:
        jobs[job_id] = {"status": "failed", "error": str(e)}
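The plain `jobs` dict works for a demo, but it is written by worker threads while request handlers read it, and finished results are never removed. It also lives in a single process, so it is not shared across multiple worker processes, which is one reason Redis is the better choice in production. As a stopgap, a small wrapper with a lock and an expiry can tidy this up. The `JobStore` class below is a hypothetical sketch, not the code running in BioAI Market, and the one-hour default TTL is an arbitrary assumption.

```python
import threading
import time


class JobStore:
    """Thread-safe in-memory job store with expiry for finished jobs."""

    def __init__(self, ttl_s: float = 3600.0):
        self._lock = threading.Lock()
        self._jobs = {}
        self._ttl_s = ttl_s  # hypothetical retention period for finished jobs

    def create(self, job_id: str) -> None:
        with self._lock:
            self._jobs[job_id] = {
                "status": "running",
                "progress": 0,
                "message": "Starting...",
                "updated_at": time.time(),
            }

    def update(self, job_id: str, **fields) -> None:
        """Merge fields into the job instead of replacing the whole dict."""
        with self._lock:
            self._jobs[job_id].update(fields, updated_at=time.time())

    def get(self, job_id: str):
        """Return a snapshot copy, or None if the job is unknown."""
        with self._lock:
            job = self._jobs.get(job_id)
            return dict(job) if job is not None else None

    def purge_expired(self, now=None) -> int:
        """Drop finished jobs older than the TTL; return how many were removed."""
        now = time.time() if now is None else now
        with self._lock:
            expired = [
                job_id for job_id, job in self._jobs.items()
                if job["status"] in ("completed", "failed")
                and now - job["updated_at"] > self._ttl_s
            ]
            for job_id in expired:
                del self._jobs[job_id]
            return len(expired)


store = JobStore(ttl_s=60)
store.create("abc123")
store.update("abc123", progress=40, message="Running differential expression...")
print(store.get("abc123")["progress"])
```

In the Flask handlers, `store.update(job_id, ...)` would replace the direct `jobs[job_id][...] = ...` assignments, and `purge_expired()` could run on each submit or on a timer.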

User Experience: The Progress Bar Is the Answer

Switching to async actually improved the UX. Before, users stared at an empty loading spinner for 60 seconds. Now I could show real-time progress and status messages:

🔬 Running QC analysis... (10%)
📊 Running differential expression... (40%)
🧬 Running pathway analysis... (70%)
✅ Analysis complete! (100%)

From the user's point of view, even a 3-minute wait feels completely different once they can tell that things are moving.

Lessons Learned

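Pin down the real cause of a 504: it may be payload size, not a compute timeout.
For heavy work on serverless, async is the answer: don't try to tough it out with synchronous request-response.
Always trim responses: the frontend does not need all 5,000 rows.
Progress feedback is a must: if users have to wait on a long job, show them what is happening right now.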