Interview - ML/Data Science Engineer

1

🟢 Easy

0 / 4 points

What is k-Fold Cross-Validation?

Definition: A resampling technique that divides the dataset into k roughly equal parts (folds), trains on k-1 folds, and validates on the remaining fold

Process: Repeats k times so that each fold serves as the validation set exactly once, then averages the k scores

Purpose: Helps detect overfitting and provides a more reliable estimate of model performance on unseen data than a single train/validation split

Common values: Typically k=5 or k=10, balancing computational cost against the bias and variance of the performance estimate
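The fold rotation described above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the helper names `k_fold_indices` and `cross_validate` are hypothetical, the split is not shuffled (in practice the data is usually shuffled first), and `fit_and_score` stands for any callable that trains on the training indices and returns a validation score.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def cross_validate(n_samples, k, fit_and_score):
    """Average the validation score returned by fit_and_score over all k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Each sample index appears in exactly one validation fold.
folds = list(k_fold_indices(10, 5))
```

Every sample is used for validation exactly once and for training k-1 times, which is what makes the averaged score less dependent on one lucky or unlucky split.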

2

🟢 Easy

0 / 4 points

What is the difference between supervised and unsupervised learning?

Supervised: Training data consists of labeled examples (input-output pairs); the model learns a mapping from inputs to known outputs

Unsupervised: Training data has no labels; the model finds patterns or structure in the data without a target to guide it

Supervised examples: Classification (spam detection), regression (price prediction)

Unsupervised examples: Clustering (customer segmentation), dimensionality reduction (PCA), anomaly detection

3

🟡 Medium

0 / 5 points

What is reinforcement learning? What is it used for?

Definition: Learning paradigm in which an agent learns to make decisions by interacting with an environment and receiving rewards or penalties

Key components: Agent, environment, state, action, reward signal - the agent learns a policy that maximizes cumulative reward

Game AI: Playing chess, Go, video games (AlphaGo, OpenAI Five for Dota 2) - learning winning strategies through self-play

Robotics & Control: Robot navigation, autonomous vehicles, optimizing complex sequential decisions

Business applications: Recommendation systems, trading strategies, resource allocation, dynamic pricing
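To make the agent/environment/reward loop concrete, here is a small tabular Q-learning sketch. The environment is a hypothetical 5-cell corridor invented for this example: the agent starts anywhere, can move left or right, and receives a reward of 1 only when it reaches the rightmost cell. As a simplifying assumption, every state-action pair is visited in a fixed sweep instead of using the more common epsilon-greedy exploration, which keeps the result deterministic.

```python
N_STATES = 5          # states 0..4; state 4 is the terminal goal
ACTIONS = (-1, +1)    # move left or move right
GAMMA = 0.9           # discount factor
ALPHA = 0.5           # learning rate


def step(state, action):
    """Deterministic environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(sweeps=200):
    """Learn Q-values by repeatedly applying the Q-learning update."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(sweeps):
        for state in range(N_STATES - 1):              # terminal state needs no update
            for a_idx, action in enumerate(ACTIONS):
                next_state, reward, done = step(state, action)
                target = reward if done else reward + GAMMA * max(q[next_state])
                q[state][a_idx] += ALPHA * (target - q[state][a_idx])
    return q


def greedy_policy(q):
    """Pick the highest-valued action in every non-terminal state."""
    return [ACTIONS[row.index(max(row))] for row in q[:-1]]


q_table = train()
policy = greedy_policy(q_table)   # the learned policy moves right in every state
```

The learned values shrink by the discount factor with each extra step from the goal, which is how the reward signal propagates back to earlier decisions.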

4

🟡 Medium

0 / 5 points

What is data/class imbalance? What consequences does it have, and how can it be handled?

Definition: Classes in the dataset are not equally represented (e.g., 95% negative, 5% positive cases)

Consequences: Model is biased toward the majority class, performs poorly on the minority class, and accuracy becomes misleading (always predicting the majority class already scores 95% in the example above)

Resampling: Oversampling the minority class (e.g., SMOTE), undersampling the majority class, or a combination of both

Algorithm techniques: Class weights, cost-sensitive learning, ensemble methods (e.g., balanced random forest)

Better metrics: Use precision, recall, F1-score, and ROC-AUC or PR-AUC instead of accuracy; under heavy imbalance, PR-AUC is often more informative than ROC-AUC
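The following sketch shows why accuracy misleads on a 95/5 split and how "balanced" class weights are commonly computed. The data and the always-negative "model" are made up for illustration; the weight formula `n_samples / (n_classes * class_count)` is the usual balanced heuristic, and the function names are hypothetical.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def balanced_class_weights(y):
    """Weight each class inversely to its frequency."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


y_true = [0] * 95 + [1] * 5      # 95% negative, 5% positive
y_pred = [0] * 100               # a "model" that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
weights = balanced_class_weights(y_true)
```

Here accuracy is 0.95 while recall on the positive class is 0, and the balanced weights make each positive example count 19 times as much as a negative one.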

5

🟢 Easy

0 / 4 points

What is feature engineering?

Definition: Process of creating, transforming, and selecting features from raw data to improve model performance

Techniques: Creating new features (interactions, polynomials), transforming existing ones (scaling, encoding), extracting features from text or dates

Importance: Often more impactful than the choice of algorithm; requires domain knowledge and can significantly improve model accuracy

Examples: One-hot encoding categorical variables, extracting day/month from dates, creating ratio features, binning continuous variables
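The examples above can be sketched on a single raw record. The record layout (`category`, `order_date`, `total_price`, `quantity`), the category list, and the bin thresholds are all hypothetical, chosen only to illustrate the four techniques.

```python
from datetime import date


def engineer_features(row, categories=("electronics", "grocery", "toys")):
    """Turn one raw record (a dict) into a flat dict of numeric features."""
    features = {}

    # One-hot encoding: one 0/1 column per known category.
    for cat in categories:
        features[f"category_{cat}"] = 1 if row["category"] == cat else 0

    # Date extraction: pull calendar parts out of an ISO date string.
    order_date = date.fromisoformat(row["order_date"])
    features["month"] = order_date.month
    features["day_of_week"] = order_date.weekday()   # Monday == 0

    # Ratio feature: combine two raw columns into a more informative one.
    features["price_per_unit"] = row["total_price"] / row["quantity"]

    # Binning: map a continuous value to a small number of ordered buckets.
    if row["total_price"] < 20:
        features["price_bin"] = 0
    elif row["total_price"] < 100:
        features["price_bin"] = 1
    else:
        features["price_bin"] = 2

    return features


raw = {"category": "toys", "order_date": "2024-03-15", "total_price": 45.0, "quantity": 3}
features = engineer_features(raw)
```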

6

🟡 Medium

0 / 5 points

What are transformer-based models?

Architecture: Neural network architecture built on the self-attention mechanism; processes sequences without recurrence (unlike RNN/LSTM)

Self-attention: Lets each token weigh the relevance of every other token in the sequence, capturing long-range dependencies directly (at a cost that grows quadratically with sequence length)

Examples: BERT, the GPT family (GPT-3, GPT-4), T5; transformers power most modern NLP systems

Advantages: Parallel processing of all positions (faster training), better long-range context, strong transfer learning via pre-training

Applications: Translation, text generation, question answering, summarization; increasingly used beyond NLP (e.g., vision transformers)
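The self-attention step can be sketched as scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, in plain Python. This is a deliberately stripped-down illustration: real transformers apply learned projection matrices to obtain queries, keys, and values and run several attention heads in parallel, whereas this sketch assumes the token vectors are used directly as Q, K, and V.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Return (outputs, attention_weights) for scaled dot-product attention."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors; tokens 0 and 2 are identical.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Token 0 puts equal weight on itself and on token 2, even though they are not adjacent, which is the sense in which attention handles long-range dependencies without recurrence.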

7

🔴 Hard

0 / 6 points

What is RAG (retrieval-augmented generation) and how does it relate to your expertise?

Definition: Technique combining a retrieval system with a generative model: it retrieves relevant documents, then generates a response grounded in the retrieved context

Components: Vector database or search engine for retrieval, embeddings for similarity search, LLM for generation

Advantages: Reduces (but does not eliminate) hallucinations, grounds responses in real data, can access up-to-date information beyond the training cutoff

Use cases: Enterprise Q&A systems, chatbots with knowledge bases, document analysis, customer support automation

Technical implementation: Chunk documents, create embeddings, store them in a vector DB (e.g., Pinecone, Weaviate), then combine semantic search with prompt engineering

Personal expertise: Experience building RAG systems, working with vector databases, optimizing retrieval quality, prompt engineering for better generation
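The retrieve-then-generate flow can be sketched without any external service. Everything here is a stand-in chosen for illustration: `embed` uses bag-of-words counts instead of a real embedding model, an in-memory list replaces the vector database, the chunk texts are invented, and the LLM call is left out, so the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, chunks, top_k=2):
    """Assemble the prompt that would be sent to the generator model."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, top_k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


CHUNKS = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is open Monday to Friday from 9 to 17.",
    "Shipping to EU countries takes 3 to 5 business days.",
]

question = "How long do refunds take?"
prompt = build_prompt(question, CHUNKS, top_k=1)
```

In a real system the quality of chunking and of the embedding model largely determines whether the right chunk reaches the prompt at all, which is why retrieval quality is usually tuned separately from generation.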

8

🟡 Medium

0 / 5 points

How is fine-tuning different from regular training?

Regular training: Model is trained from scratch, starting from randomly initialized weights, usually on a large dataset

Fine-tuning: Starts from a pre-trained model and adapts it to a specific task with a smaller dataset by adjusting the existing weights

Advantages: Faster training, requires less data, leverages knowledge learned from large datasets

Common approach: Freeze early layers (general features) and train later layers (task-specific features), and/or use a lower learning rate to avoid overwriting pre-trained knowledge

Examples: Fine-tuning BERT for sentiment analysis, fine-tuning ResNet for medical imaging

9

🔴 Hard

0 / 6 points

How do the fine-tuning options and limitations differ between closed-source models (e.g., OpenAI GPT-4) and open-source models that you can self-host?

Closed-source (OpenAI): Limited to API-based fine-tuning; no access to model weights, and customization is restricted to what the provider's interface exposes

Open-source: Full control over the model; any layer can be modified, all weights are accessible, and every fine-tuning method is available

Cost considerations: Closed-source has per-token costs; open-source requires infrastructure investment but has no per-token fees

Data privacy: Closed-source sends data to external servers; open-source allows fully on-premise deployment for sensitive data

Technical expertise: Closed-source is easier to use and requires less ML knowledge; open-source needs deeper understanding of training, hardware, and optimization

Examples: Open-source options include Llama 2, Mistral, Falcon - these support LoRA, QLoRA, or full fine-tuning with complete control

10

🔴 Hard

0 / 6 points

What parameter-efficient fine-tuning techniques exist for open models (e.g., LoRA, QLoRA), and how do they work conceptually?

LoRA (Low-Rank Adaptation): Freezes the pre-trained weights and adds small trainable low-rank decomposition matrices to selected layers, drastically reducing the number of trainable parameters

How LoRA works: Instead of updating the d×k weight matrix W directly, trains two smaller matrices A (d×r) and B (r×k) with rank r much smaller than d and k, so that W_new = W + AB; only A and B receive gradients

QLoRA: Combines LoRA with quantization: the frozen base model is loaded in 4-bit precision while the LoRA adapters are trained in higher precision, which enables fine-tuning of mid-sized models on consumer GPUs

Memory benefits: With QLoRA, a 65B-parameter model can be fine-tuned on a single 48 GB GPU, reducing memory needs from hundreds of GB to tens of GB

Other techniques: Prefix tuning (prepend trainable prefix vectors), adapter layers (insert small trainable modules between frozen layers), prompt tuning (optimize soft prompt embeddings)

Practical benefits: Faster training, lower costs, multiple task-specific adapters can share one base model, easier deployment and version control
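The W_new = W + AB update and the resulting parameter savings can be checked with a few lines of plain Python. The matrices below are tiny hand-picked examples, and the helper names and the optional `scale` factor are assumptions for illustration; in real LoRA setups one of the two factors is typically initialized to zero so that training starts exactly from W.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_effective_weight(w, a, b, scale=1.0):
    """Return W + scale * (A @ B); W stays frozen, only A and B would be trained."""
    delta = matmul(a, b)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))] for i in range(len(w))]


def trainable_params(d, k, r):
    """Trainable parameters for one d x k matrix: (full fine-tuning, LoRA with rank r)."""
    return d * k, r * (d + k)


# Rank-1 update of a 3x3 identity weight matrix.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0], [2.0], [3.0]]        # shape 3 x 1
B = [[0.5, 0.0, -0.5]]           # shape 1 x 3
W_new = lora_effective_weight(W, A, B)

full, lora = trainable_params(4096, 4096, 8)   # 16,777,216 vs 65,536 parameters
```

For a 4096×4096 layer with rank 8, LoRA trains 256 times fewer parameters than full fine-tuning of that layer, and because the adapter is just the pair (A, B), several task-specific adapters can be stored and swapped against the same frozen W.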

11

🟢 Easy

0 / 4 points

List vs Tuple in Python? When do you use what?

List: Mutable (can be modified after creation), written with square brackets [], slightly more memory and overhead, suited to collections that change

Tuple: Immutable (cannot be modified after creation), written with parentheses (), slightly faster to create and smaller in memory, suited to fixed collections

Use list when: Data will be modified (append, remove, sort), the collection is dynamic, or list methods are needed

Use tuple when: Data should not change (coordinates, RGB values), as dictionary keys (if all elements are hashable), when returning multiple values from a function, or when the small performance gain matters
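A short sketch of the differences listed above. The variable names are illustrative, and the size comparison at the end reflects CPython behavior rather than a language guarantee.

```python
import sys

# Lists can be changed in place.
point_list = [3, 4]
point_list.append(5)

# Tuples reject modification after creation.
point = (3, 4)
try:
    point[0] = 10
    tuple_mutated = True
except TypeError:
    tuple_mutated = False

# Tuples of hashable items can be dictionary keys; lists cannot.
distances = {(0, 0): 0.0, (3, 4): 5.0}
try:
    {[3, 4]: 5.0}
    list_key_ok = True
except TypeError:
    list_key_ok = False


def min_max(values):
    """Return two results at once; Python packs them into a tuple."""
    return min(values), max(values)


lowest, highest = min_max([7, 2, 9])

# In CPython, a tuple is smaller than a list holding the same items.
tuple_is_smaller = sys.getsizeof((1, 2, 3)) < sys.getsizeof([1, 2, 3])
```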

12

🟡 Medium

0 / 5 points

What are decorators in Python?

Definition: Functions that modify the behavior of other functions or methods, applied with the @decorator_name syntax above the function definition

How they work: Take a function as input, add functionality, and return a new (wrapped) function - an implementation of the wrapper pattern

Common uses: Logging, timing functions, authentication/authorization, caching (memoization), validation

Built-in examples: @property, @staticmethod, @classmethod (built-ins), and @functools.lru_cache from the standard library for memoization

Practical benefit: Clean separation of concerns, reusable cross-cutting functionality, less repeated code (DRY)
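A minimal sketch of a hand-written timing decorator next to the standard library's `functools.lru_cache`. The names `timed`, `last_elapsed`, `square`, and `fib` are illustrative; storing the elapsed time as an attribute on the wrapper is just one simple design choice among several.

```python
import functools
import time


def timed(func):
    """Decorator that records how long the most recent call took."""
    @functools.wraps(func)                 # keep the original name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result

    wrapper.last_elapsed = None
    return wrapper


@timed
def square(x):
    """Return x squared."""
    return x * x


@functools.lru_cache(maxsize=None)         # memoize results by argument
def fib(n):
    """Naive recursive Fibonacci, made fast by caching."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


result = square(4)
```

Writing `@timed` above `square` is equivalent to `square = timed(square)`; without `functools.wraps`, the decorated function would report the name `wrapper` instead of `square`.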

13

🟢 Easy

0 / 4 points

What is the purpose of the .groupby() method in the Pandas library?

Purpose: Groups the rows of a DataFrame by the values of one or more columns, enabling aggregation and analysis per group

Common operations: sum(), mean(), count(), min(), max() applied to each group separately

Example use: df.groupby('category').mean(numeric_only=True) - calculates the average of each numeric column per category, like SQL GROUP BY

Advanced features: Can group by multiple columns, apply custom or multiple aggregations with .agg(), and compute group-wise values aligned to the original rows with .transform()
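Conceptually, a grouped aggregation follows a split-apply-combine pattern. The sketch below reproduces that pattern in plain Python for a mean, roughly what `df.groupby("category")["price"].mean()` computes in pandas; the helper `group_mean` and the sample rows with their `category` and `price` columns are hypothetical.

```python
from collections import defaultdict


def group_mean(rows, key, value):
    """Split rows by `key`, then average `value` within each group."""
    groups = defaultdict(list)
    for row in rows:                                   # split
        groups[row[key]].append(row[value])
    # apply the aggregation per group and combine into one result
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}


rows = [
    {"category": "toys", "price": 10.0},
    {"category": "toys", "price": 20.0},
    {"category": "grocery", "price": 3.0},
]

means = group_mean(rows, "category", "price")
```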

14

🟡 Medium

0 / 5 points

Explain the difference between a generator and a normal function that returns a list. Why are generators (which use the yield keyword) particularly beneficial when processing very large, potentially memory-intensive datasets in data science?

Normal function: Builds and returns the entire list at once, so all values are held in memory simultaneously

Generator: Uses the yield keyword, produces one value at a time on demand (lazy evaluation), and keeps its local state between calls

Memory efficiency: A generator holds only the current item and its own state, not the entire dataset - critical for datasets that do not fit in RAM

Data science use: Processing large CSV files, streaming data, ETL pipelines, batch processing without loading everything into memory

Example: Reading a 100GB file line by line with a generator instead of loading every line into a list (which would likely exhaust memory) makes it possible to process datasets larger than available RAM
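A small sketch of the streaming pattern. An in-memory `io.StringIO` stands in for a large file so the example is self-contained; with a real file, the same generator would be fed an open file handle, which Python also iterates line by line. The function names are illustrative.

```python
import io


def read_values(lines):
    """Yield one parsed number at a time instead of building a full list."""
    for line in lines:
        line = line.strip()
        if line:                     # skip blank lines
            yield float(line)


def streaming_mean(values):
    """Consume any iterable of numbers while keeping only a running total."""
    total, count = 0.0, 0
    for value in values:
        total += value
        count += 1
    return total / count if count else 0.0


fake_file = io.StringIO("1.5\n2.5\n\n4.0\n")
mean = streaming_mean(read_values(fake_file))
```

At no point does the full list of values exist in memory: each line is parsed, added to the running total, and discarded before the next one is read.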
