2022/2023
High Performance Computing and Big Data Analytics
Code: 43917
ECTS Credits: 12
Degree | Type | Year | Semester
4313473 Bioinformatics | OT | 0 | 1
Use of Languages
- Principal working language: English (eng)
Teachers
- Santiago Marco Sola
External teachers
- Emanuele Raineri
- Oscar Lao
Prerequisites
To take this module, students must have previously passed both compulsory modules: Programming in Bioinformatics and Core Bioinformatics.
A B2 level of English or equivalent is recommended.
Objectives and Contextualisation
This module aims to provide students with the knowledge and skills needed (1) to apply performance engineering approaches on modern computing platforms and (2) to perform statistical analyses of Big Data.
Competences
- Communicate research results clearly and effectively in English.
- Design and apply scientific methodology in resolving problems.
- Possess and understand knowledge that provides a basis or opportunity for originality in the development and/or application of ideas, often in a research context.
- Propose biocomputing solutions for problems deriving from omic research.
- Propose innovative and creative solutions in the field of study.
- Use and manage bibliographical information and computer resources in the area of study.
- Use operating systems, programs and tools in common use in biocomputing and be able to manage high performance computing platforms, programming languages and biocomputing analysis.
Learning Outcomes
- Apply advanced statistical methods (machine learning, graph theory) to model and analyse bioinformatics problems involving massive biological data.
- Communicate research results clearly and effectively in English.
- Describe and apply clustering techniques and common classification algorithms.
- Describe the operation, characteristics and limitations of the techniques, tools and methodologies used to describe, analyse and interpret the large volumes of data produced by high-throughput technologies.
- Design and apply scientific methodology in resolving problems.
- Generate efficient parallel computing algorithms and applications for CID.
- Know and handle open-source tools for parallel, distributed and scalable analysis through machine learning.
- Know the principles of massive data storage and management.
- Know the principles of process parallelisation.
- Learn new ways to model, store, retrieve and analyse abstract data types (graphs).
- Learn to handle new computing platforms and paradigms, and design applications that require massive computing and data handling.
- Possess and understand knowledge that provides a basis or opportunity for originality in the development and/or application of ideas, often in a research context.
- Propose innovative and creative solutions in the field of study.
- Provide parallel solutions to specific bioinformatic problems.
- Train, evaluate and validate predictive models.
- Use and manage bibliographical information and computer resources in the area of study.
Content
Modern Computer Architecture
- General-purpose and specialized processor architectures
- Memory hierarchy
- Cluster systems
- Cloud infrastructures and system virtualization
- System Middleware and Programming Frameworks
Advanced Programming Models
- Shared-memory and distributed parallel programming
- Advanced shell scripting
- Using system tools for bioinformatics analysis
- Principles of performance engineering (tools and methods)
- High Performance Computing with Python
- Performance engineering applied to common bioinformatics algorithms and tools (genome indexing, read alignment…).
Big Data Analytics
- Theory and tools of advanced statistics in Big Data analytics (dimensionality reduction, variable selection and Spark)
- Machine learning theory and algorithms. Applications in Bioinformatics
- Predictive modelling: data mining, model evaluation and validation
- Data classification: naïve Bayes and decision tree learning
- Association rule learning
- Clustering analysis: k-means algorithm
- Graph Theory for Big Data
Methodology
Following a problem-oriented approach, students will gain insight into efficient computational algorithms, methods and platforms, and into the statistical methods applied to challenging bioinformatics problems involving Big Data.
Annotation: Within the schedule set by the centre or degree programme, 15 minutes of one class will be reserved for students to evaluate their lecturers and their courses or modules through questionnaires.
Assessment
The evaluation system is organized into two main activities. In addition, there will be a retake exam. The details of the activities are:
Main evaluation activities
- Student's portfolio (60%): work completed and presented by the student throughout the course. None of the individual assessment activities will account for more than 50% of the final mark.
- Individual theoretical and practical test (40%): a final exam will take place at the end of this module.
Retake exam
To be eligible for the retake process, the student must have been previously evaluated in a set of activities equaling at least two thirds of the final score of the module. The teacher will announce the procedure and deadlines for the retake process.
Not assessable
The student will be graded as "Not assessable" if the combined weight of the activities in which they have been evaluated is less than 67% of the final score.
Assessment Activities
Title | Weighting | Hours | ECTS | Learning Outcomes
Individual theoretical and practical tests | 40% | 4 | 0.16 | 1, 15, 11, 10, 2, 9, 8, 7, 4, 3, 6, 14, 13, 12
Works done and presented by the student (student's portfolio) | 60% | 0 | 0 | 1, 15, 11, 10, 2, 9, 8, 7, 4, 3, 5, 6, 14, 13, 12, 16
Bibliography
An updated bibliography will be recommended by the professor in each session of this module, and links will be made available in the Student's Area of the MSc Bioinformatics official website.
Software
Linux + SLURM and other tools from Linux environments
Python and other tools from its ecosystem
R and other tools from its ecosystem