convertToCaptions() v4.0.131
This API assumes a newer version of Whisper.cpp than the stable release in order to support tokenLevelTimestamps. As a downside, this version may crash unexpectedly.
Use an older version of Whisper.cpp (1.0.54 or earlier) if you prefer a stable version of Whisper.cpp and can forgo tokenLevelTimestamps support.
An opinionated function that converts the output of transcribe() into easily digestible captions.
It can also combine words with close timestamps.
This is useful for TikTok/Reel-style videos that animate captions word by word.
transcribe.mjs
```tsx
import path from "path";
import { transcribe, convertToCaptions } from "@remotion/install-whisper-cpp";

// Transcribe with token-level timestamps, which convertToCaptions() requires.
const { transcription } = await transcribe({
  inputPath: "/path/to/audio.wav",
  whisperPath: path.join(process.cwd(), "whisper.cpp"),
  model: "medium.en",
  tokenLevelTimestamps: true,
});

// Combine words that start within 200 ms of each other.
const { captions } = convertToCaptions({
  transcription,
  combineTokensWithinMilliseconds: 200,
});

for (const line of captions) {
  console.log(line.text, line.startInSeconds);
}
```
Options
transcription
The transcription object that you retrieved from transcribe().
The tokenLevelTimestamps option must have been set to true.
combineTokensWithinMilliseconds
Combine words whose timestamps are close to each other.
If words are not combined, they might be displayed for only a very short time when word-by-word captions are used.
Disable combining by setting this to 0.
Recommendation: 200.
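To illustrate what combining words within a time window can look like, here is a minimal sketch. It is not the library's implementation: the TimedWord shape, the combineCloseWords() name, and the choice to compare each word against the start of the current group are all assumptions made for this example. It also assumes that each word carries its own leading space, so merged text can simply be concatenated.

```typescript
// Hypothetical input shape: one entry per word with a start time in milliseconds.
type TimedWord = { text: string; startMs: number };
type CombinedCaption = { text: string; startInSeconds: number };

// Merge a word into the current caption when it starts within `thresholdMs`
// of that caption's start. A threshold of 0 disables merging.
function combineCloseWords(
  words: TimedWord[],
  thresholdMs: number,
): CombinedCaption[] {
  const result: CombinedCaption[] = [];
  let groupStartMs = -Infinity;

  for (const word of words) {
    const last = result[result.length - 1];
    if (last && thresholdMs > 0 && word.startMs - groupStartMs < thresholdMs) {
      // Close enough to the current group: append the text.
      last.text += word.text;
    } else {
      // Too far apart (or merging disabled): start a new caption.
      result.push({ text: word.text, startInSeconds: word.startMs / 1000 });
      groupStartMs = word.startMs;
    }
  }

  return result;
}

const sampleWords: TimedWord[] = [
  { text: " Hello", startMs: 0 },
  { text: " world", startMs: 120 },
  { text: " again", startMs: 900 },
];

// With a 200 ms window, " Hello" and " world" end up in one caption.
console.log(combineCloseWords(sampleWords, 200));
// With 0, every word stays its own caption.
console.log(combineCloseWords(sampleWords, 0));
```

A larger window produces fewer, longer captions. A smaller one keeps the word-by-word effect but risks captions that flash by too quickly.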
Return value
An object of the following shape:
```ts
type Caption = {
  text: string;
  startInSeconds: number;
};

type ReturnValue = {
  captions: Caption[];
};
```
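Because each caption only carries a start time, a consumer has to decide which caption is active at a given moment. The following sketch shows one way to do that. The findActiveCaption() helper is hypothetical and not part of the package, the Caption type is redeclared locally to keep the example self-contained, and the captions are assumed to be sorted by start time.

```typescript
type Caption = { text: string; startInSeconds: number };

// Return the caption that started most recently at or before `timeInSeconds`,
// or null if no caption has started yet.
function findActiveCaption(
  captions: Caption[],
  timeInSeconds: number,
): Caption | null {
  let active: Caption | null = null;
  for (const caption of captions) {
    if (caption.startInSeconds > timeInSeconds) {
      break; // Sorted input: every later caption starts even later.
    }
    active = caption;
  }
  return active;
}

const exampleCaptions: Caption[] = [
  { text: " Hello", startInSeconds: 0.5 },
  { text: " world", startInSeconds: 1.2 },
];

console.log(findActiveCaption(exampleCaptions, 1.0)?.text);
```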
Suggested usage
This shows how word-by-word captions can be rendered in a Remotion project, given a data structure produced by convertToCaptions().
See our TikTok template for a full reference implementation.
@remotion/install-whisper-cpp cannot be imported on the frontend because it is a Node.js API.
Only the TypeScript type is imported in this example.
```tsx
import React from "react";
import type { Caption } from "@remotion/install-whisper-cpp";
import { Sequence, useVideoConfig } from "remotion";

// <Subtitle> is your own component that renders a single caption; it is not defined here.
const Captions: React.FC<{
  subtitles: Caption[];
}> = ({ subtitles }) => {
  const { fps } = useVideoConfig();

  return (
    <>
      {subtitles.map((subtitle, index) => {
        const nextSubtitle = subtitles[index + 1] ?? null;
        const subtitleStartFrame = subtitle.startInSeconds * fps;
        // Show each caption until the next one starts, but for at most one second.
        const subtitleEndFrame = Math.min(
          nextSubtitle ? nextSubtitle.startInSeconds * fps : Infinity,
          subtitleStartFrame + fps,
        );

        return (
          <Sequence
            key={index}
            from={subtitleStartFrame}
            durationInFrames={subtitleEndFrame - subtitleStartFrame}
          >
            <Subtitle text={subtitle.text} />
          </Sequence>
        );
      })}
    </>
  );
};
```
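The timing logic in the component can also be pulled out into a plain function, which makes it easier to check without rendering anything. This is a sketch under assumptions: toFrameRanges() and FrameRange are names invented for this example, the Caption type is redeclared locally, and rounding to whole frames and dropping zero-length ranges are additions of this sketch rather than something the example above does.

```typescript
type Caption = { text: string; startInSeconds: number };
type FrameRange = { text: string; from: number; durationInFrames: number };

// Each caption is shown until the next one starts, but for at most one second.
function toFrameRanges(captions: Caption[], fps: number): FrameRange[] {
  return captions
    .map((caption, index) => {
      const next = captions[index + 1] ?? null;
      // Assumption: snap to whole frames.
      const from = Math.round(caption.startInSeconds * fps);
      const end = Math.min(
        next ? Math.round(next.startInSeconds * fps) : Infinity,
        from + fps,
      );
      return { text: caption.text, from, durationInFrames: end - from };
    })
    // Assumption: a caption that would last zero frames has nothing to display.
    .filter((range) => range.durationInFrames > 0);
}

const timedCaptions: Caption[] = [
  { text: " Hello", startInSeconds: 0.5 },
  { text: " world", startInSeconds: 1.2 },
  { text: " again", startInSeconds: 4.0 },
];

console.log(toFrameRanges(timedCaptions, 30));
```

Each resulting entry maps directly onto the from and durationInFrames props of a Sequence.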