import os

ffmpeg_path = r"C:\ffmpeg\ffmpeg-8.1-essentials_build\bin"
os.environ["PATH"] += os.pathsep + ffmpeg_path

import whisper
import re
import nltk
import string
from collections import Counter
from transformers import pipeline
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

model = whisper.load_model("base")
print("Whisper model loaded successfully.")

Whisper model loaded successfully.

audio_file = r"C:\Users\ADMIN\lecture_full.mp3"   # <-- update this path

result = model.transcribe(audio_file)
transcript = result["text"]

C:\Users\ADMIN\anaconda3\envs\trading_env\Lib\site-packages\whisper\transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")

def clean_transcript(text):
    text = re.sub(r"\b(um|uh|hmm|you know|like)\b", "", text, flags=re.IGNORECASE)
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"[^\w\s.,!?]", "", text)
    return text.strip()

cleaned_text = clean_transcript(transcript)

from transformers import pipeline

summarizer = pipeline(
    "text-generation",
    model="google/flan-t5-base"
)

def chunk_text(text, chunk_size=1000):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

chunks = chunk_text(cleaned_text)

print(f"\nTotal chunks created: {len(chunks)}")

all_summaries = []

for i, chunk in enumerate(chunks):
    print(f"Summarizing chunk {i+1}/{len(chunks)}...")
    
    prompt = f"Summarize this lecture transcript clearly and concisely:\n\n{chunk}"
    
    result = summarizer(
        prompt,
        max_length=200,
        do_sample=False
    )
    
    all_summaries.append(result[0]["generated_text"])

final_summary = " ".join(all_summaries)
stop_words = set(stopwords.words("english"))

words = word_tokenize(cleaned_text.lower())

words = [
    word for word in words
    if word not in stop_words
    and word not in string.punctuation
    and len(word) > 2
]

keywords = Counter(words).most_common(10)

print("\n===== TOP KEYWORDS =====\n")
for word, freq in keywords:
    print(f"{word}: {freq}")

===== TOP KEYWORDS =====

time: 23
series: 21
data: 19
analysis: 14
component: 10
use: 7
thats: 7
one: 6
components: 6
forecasting: 6

print(final_summary[:1000])

Summarize this lecture transcript clearly and concisely:

My smartwatch tracks how much sleep I get each night. If Im feeling curious, I can look on my phone and see my nightly slumber plotted on a graph. It might look something this. And on the graph, on the Y axis, we have the hours of sleep. And then on the X axis, we have days. And this is an example of a time series. And what a time series is is data of the same entity, my sleep hours, collected at regular intervals, over days. And when we have time series, we can perform a time series analysis. And this is where we analyse the timestamp data to extract meaningful insights and predictions about the future. And while its super useful to forecast that I am going to probably get seven hours shut eye tonight based on the data, time series analysis plays a significant role in helping organisations drive better business decisions. So for example, using time series analysis, a retailer can use this functionality to predict future sales a

Step	Tool	Output
Transcribe audio	`openai-whisper`	Raw transcript string
Clean transcript	`re` (regex)	Cleaned text
Summarize	`FLAN-T5` via `transformers`	Chunk-wise + final summary
Extract keywords	`NLTK`	Top 10 frequent terms

Audio Processing & Speech-to-Text Pipeline¶

Audio → Transcript → Summary → Keywords¶

Workflow¶

Import Libraries¶

Transcribe Audio with Whisper¶

Clean the Transcript¶

Summarize with FLAN-T5 and Extract Top Keywords¶

Summary¶