Shroff Publishers & Distributors Pvt. Ltd.
Product Details
Advanced Analytics with Spark
Patterns for Learning from Data at Scale
By Sean Owen, Josh Wills, Sandy Ryza, Uri Laserson
ISBN: 9789352130900
Paperback
Pages: 300
Size: 7 X 9
Shroff/O'Reilly (2015)
Arrival Date: June 01, 2015
List Price: Rs 525.00
Net Price: Rs 358.00 (You save 31.82%)
Usually shipped in 1-2 days

Description
In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example.

You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques, including classification, collaborative filtering, and anomaly detection, to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find these patterns useful for working on your own data applications.

Patterns include:
  • Recommending music and the Audioscrobbler data set
  • Predicting forest cover with decision trees
  • Anomaly detection in network traffic with K-means clustering
  • Understanding Wikipedia with Latent Semantic Analysis
  • Analyzing co-occurrence networks with GraphX
  • Geospatial and temporal data analysis on the New York City Taxi Trip data
  • Estimating financial risk through Monte Carlo simulation
  • Analyzing genomics data and the BDG project
  • Analyzing neuroimaging data with PySpark and Thunder
About the Authors
Sandy Ryza is a data scientist at Cloudera and an active contributor to the Apache Spark project. He recently led Spark development at Cloudera and now spends his time helping customers with a variety of analytic use cases on Spark. He is also a member of the Hadoop Project Management Committee.

Uri Laserson is a data scientist at Cloudera, where he focuses on Python in the Hadoop ecosystem. He also helps customers deploy Hadoop on a wide range of problems, focusing on life sciences and health care. Previously, Uri cofounded Good Start Genetics, a next-generation diagnostics company, while working towards a PhD in biomedical engineering at MIT.

Sean Owen is Director of Data Science for EMEA at Cloudera. He has been a significant contributor to the Apache Mahout machine learning project since 2009, and authored its “Taste” recommender framework. He created the Oryx (formerly Myrrix) project for real-time, large-scale learning on Hadoop, built on lambda architecture principles, and has contributed to Spark and Spark’s MLlib project.

Josh Wills is Cloudera's Senior Director of Data Science, working with customers and engineers to develop Hadoop-based solutions across a wide range of industries. He is the founder and VP of the Apache Crunch project for creating optimized MapReduce and Spark pipelines in Java. Prior to joining Cloudera, Josh worked at Google, where he worked on the ad auction system and then led the development of the analytics infrastructure used in Google+.
Table of Contents
Chapter 1. Analyzing Big Data
The Challenges of Data Science
Introducing Apache Spark
About This Book

Chapter 2. Introduction to Data Analysis with Scala and Spark
Scala for Data Scientists
The Spark Programming Model
Record Linkage
Getting Started: The Spark Shell and SparkContext
Bringing Data from the Cluster to the Client
Shipping Code from the Client to the Cluster
Structuring Data with Tuples and Case Classes
Aggregations
Creating Histograms
Summary Statistics for Continuous Variables
Creating Reusable Code for Computing Summary Statistics
Simple Variable Selection and Scoring
Where to Go from Here

Chapter 3. Recommending Music and the Audioscrobbler Data Set
Data Set
The Alternating Least Squares Recommender Algorithm
Preparing the Data
Building a First Model
Spot Checking Recommendations
Evaluating Recommendation Quality
Computing AUC
Hyperparameter Selection
Making Recommendations
Where to Go from Here

Chapter 4. Predicting Forest Cover with Decision Trees
Fast Forward to Regression
Vectors and Features
Training Examples
Decision Trees and Forests
Covtype Data Set
Preparing the Data
A First Decision Tree
Decision Tree Hyperparameters
Tuning Decision Trees
Categorical Features Revisited
Random Decision Forests
Making Predictions
Where to Go from Here

Chapter 5. Anomaly Detection in Network Traffic with K-means Clustering
Anomaly Detection
K-means Clustering
Network Intrusion
KDD Cup 1999 Data Set
A First Take on Clustering
Choosing k
Visualization in R
Feature Normalization
Categorical Variables
Using Labels with Entropy
Clustering in Action
Where to Go from Here

Chapter 6. Understanding Wikipedia with Latent Semantic Analysis
The Term-Document Matrix
Getting the Data
Parsing and Preparing the Data
Lemmatization
Computing the TF-IDFs
Singular Value Decomposition
Finding Important Concepts
Querying and Scoring with the Low-Dimensional Representation
Term-Term Relevance
Document-Document Relevance
Term-Document Relevance
Multiple-Term Queries
Where to Go from Here

Chapter 7. Analyzing Co-occurrence Networks with GraphX
The MEDLINE Citation Index: A Network Analysis
Getting the Data
Parsing XML Documents with Scala’s XML Library
Analyzing the MeSH Major Topics and Their Co-occurrences
Constructing a Co-occurrence Network with GraphX
Understanding the Structure of Networks
Filtering Out Noisy Edges
Small-World Networks
Where to Go from Here

Chapter 8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
Getting the Data
Working with Temporal and Geospatial Data in Spark
Temporal Data with JodaTime and NScalaTime
Geospatial Data with the Esri Geometry API and Spray
Preparing the New York City Taxi Trip Data
Sessionization in Spark
Where to Go from Here

Chapter 9. Estimating Financial Risk through Monte Carlo Simulation
Terminology
Methods for Calculating VaR
Our Model
Getting the Data
Preprocessing
Determining the Factor Weights
Sampling
Running the Trials
Visualizing the Distribution of Returns
Evaluating Our Results
Where to Go from Here

Chapter 10. Analyzing Genomics Data and the BDG Project
Decoupling Storage from Modeling
Ingesting Genomics Data with the ADAM CLI
Predicting Transcription Factor Binding Sites from ENCODE Data
Querying Genotypes from the 1000 Genomes Project
Where to Go from Here

Chapter 11. Analyzing Neuroimaging Data with PySpark and Thunder
Overview of PySpark
Overview and Installation of the Thunder Library
Loading Data with Thunder
Categorizing Neuron Types with Thunder
Where to Go from Here

Appendix: Deeper into Spark
Serialization
Accumulators
Spark and the Data Scientist’s Workflow
File Formats
Spark Subprojects

Appendix: Upcoming MLlib Pipelines API
Beyond Mere Modeling
The Pipelines API
Text Classification Example Walkthrough