Solr vs Azure Search

Search-as-a-service from Microsoft Azure

Microsoft Azure, a cloud platform, is rapidly expanding its scope to include new enterprise-class services. Some of the significant new additions are:

  • Azure Search: Azure Search Service is a fully managed, cloud-based service that allows developers to build rich search applications using REST APIs. It includes full-text search scoped over your content, plus advanced search behaviors similar to those found in commercial web search engines, such as type-ahead, suggested queries based on near matches, and faceted navigation.
  • Azure Machine Learning: Azure Machine Learning makes it possible for people without deep data science backgrounds to start mining data for predictions. ML Studio, an integrated development environment, uses drag-and-drop gestures and simple data flow graphs to set up experiments. For many tasks, you don't have to write a single line of code.
  • Azure Stream Analytics: Azure Stream Analytics is a fully managed service providing low latency, highly available, scalable complex event processing over streaming data in the cloud.

These new services, together with a road map for more, position Azure as a leading platform for enterprise adoption of PaaS.

In the following notes, I compare the open source search platform Solr with the capabilities of the Azure Search service and note some advantages enterprises may derive from adopting a PaaS implementation of search.

Solr Features Compared with Azure Search
Solr is a fast open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search.

The following notes compare how Solr and Azure Search are used in enterprises. The open source vs. commercial software debate is not the point here; most enterprises define their own IT policies for choosing between open source and commercial products, and the same judgment applies in this case. These notes are meant to help readers understand the new Azure service in the light of an existing, proven search platform.

For each of the feature areas below (installation and setup, schema, document ingestion, and searching documents), usage in Solr is described first, followed by usage in Azure Search.

Installation & Setup

In Solr: While Solr can be installed as a self-contained engine using Jetty, most sites use Tomcat as the container for the Solr web application. As is typical of many open source products, a few additional dependencies, such as Apache Commons, SLF4J, and a JDK, need to be installed as part of setup.

In Azure Search: Being a PaaS offering, Azure Search is a fully managed and readily available service, and all internal dependencies are handled by the Azure platform.

Schema

In Solr: Solr works on a predefined schema; every Solr instance requires a schema.xml file, which defines the structure of the documents that will be stored in that instance.

As with a typical database schema, it consists of two major sections.

Types section - definitions for all field types.

Fields section - definitions of document structures using those types.

Solr also supports a schemaless mode, and Solr's dynamic field capability reduces up-front configuration requirements for fields with predictable naming patterns. For example, a dynamic field definition such as <dynamicField name="*_i" type="int" indexed="true" stored="true"/> maps any field name with the suffix "_i" to the "int" field type.

In Azure Search: Azure Search requires a JSON schema that defines the index. The schema specifies the field-attribute combinations supported in your search application. Fields contain searchable data, such as product names, descriptions, customer comments, brands, prices, promotional notifications, and so forth. Attributes determine the types of operations that can be performed on a field; commonly used attributes include whether a field supports full-text search (searchable=true), filters (filterable=true), or facets (facetable=true).

Azure Search uses the most typical enterprise data types, such as Edm.String, Collection(Edm.String), Edm.DateTimeOffset, and Edm.Int32.

At this time there is no clear-cut documentation on schemaless operation in Azure Search, but the feature can mostly be worked around with appropriate field naming conventions.
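To make the schema discussion concrete, here is a minimal sketch of creating an index through the Azure Search REST API from Python. The service URL, admin api-key, api-version value, index name, and field names are placeholders I have invented for illustration, and the requests library is assumed.

```python
import requests

# Sketch only: the service URL, admin api-key, api-version, index name, and
# fields below are hypothetical placeholders, not values from the article.
SERVICE = "https://my-search-service.search.windows.net"
ADMIN_KEY = "<admin-api-key>"
API_VERSION = "2015-02-28"  # use whatever api-version your service accepts

headers = {"Content-Type": "application/json", "api-key": ADMIN_KEY}

# Each field carries the attributes that control which operations it supports.
index_definition = {
    "name": "products",
    "fields": [
        {"name": "id",    "type": "Edm.String", "key": True},
        {"name": "name",  "type": "Edm.String", "searchable": True},
        {"name": "brand", "type": "Edm.String", "filterable": True, "facetable": True},
        {"name": "price", "type": "Edm.Double", "filterable": True, "facetable": True},
        {"name": "tags",  "type": "Collection(Edm.String)", "searchable": True, "facetable": True},
    ],
}

# PUT /indexes/{name} creates the index, or updates it if it already exists.
resp = requests.put(
    f"{SERVICE}/indexes/{index_definition['name']}",
    params={"api-version": API_VERSION},
    headers=headers,
    json=index_definition,
)
resp.raise_for_status()
print("Index created, HTTP status:", resp.status_code)
```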

Document Ingestion (Loading)

In Solr: Solr provides command-line utilities that help in loading documents. There is also a web service API that can be invoked for updating and deleting specific documents. The Solr schema defines a primary key for the document collection, which is used for update decisions.
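As a rough illustration of the web service route, the sketch below posts a couple of JSON documents to Solr's update handler and then deletes one by its primary key. The local URL, the core name "products", and the field names are assumptions for the example.

```python
import requests

# Sketch only: assumes a local Solr instance with a core named "products"
# whose schema defines "id" as the uniqueKey; field names are hypothetical.
SOLR_UPDATE_URL = "http://localhost:8983/solr/products/update"
JSON_HEADERS = {"Content-Type": "application/json"}

docs = [
    {"id": "1001", "name": "Road bike", "brand": "Contoso", "price": 499.0},
    {"id": "1002", "name": "Mountain bike", "brand": "Fabrikam", "price": 650.0},
]

# Documents whose "id" already exists are replaced; new ids are inserted.
requests.post(
    SOLR_UPDATE_URL,
    params={"commit": "true"},
    headers=JSON_HEADERS,
    json=docs,
).raise_for_status()

# Deleting a specific document by its primary key uses the same update handler.
requests.post(
    SOLR_UPDATE_URL,
    params={"commit": "true"},
    headers=JSON_HEADERS,
    json={"delete": {"id": "1002"}},
).raise_for_status()
```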

In Azure Search: We can upload, merge, or delete documents from a specified index using HTTP POST. For large numbers of updates, batching of documents (up to 1,000 documents per batch, or about 16 MB per batch) is recommended.

Much like Solr, the request payload contains a key field name to uniquely identify the document being updated.

Azure Search supports an upload action, which is similar to an "upsert": the document is inserted if it is new and updated/replaced if it already exists. Note that all fields are replaced in the update case.
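A minimal sketch of a batched request against the Azure Search indexing endpoint is shown below, mixing upload, merge, and delete actions in one POST. The service URL, api-key, api-version, index name, and document fields are placeholders of my own.

```python
import requests

# Sketch only: service URL, api-key, api-version, index name, and fields are
# hypothetical placeholders.
SERVICE = "https://my-search-service.search.windows.net"
ADMIN_KEY = "<admin-api-key>"
API_VERSION = "2015-02-28"

# Each document in the batch names its own action: "upload" behaves like an
# upsert, "merge" changes only the fields supplied, "delete" removes by key.
batch = {
    "value": [
        {"@search.action": "upload", "id": "1001", "name": "Road bike",
         "brand": "Contoso", "price": 499.0},
        {"@search.action": "merge", "id": "1002", "price": 599.0},
        {"@search.action": "delete", "id": "1003"},
    ]
}

resp = requests.post(
    f"{SERVICE}/indexes/products/docs/index",
    params={"api-version": API_VERSION},
    headers={"Content-Type": "application/json", "api-key": ADMIN_KEY},
    json=batch,
)
resp.raise_for_status()

# The response reports per-document results, useful when a batch partially fails.
for result in resp.json()["value"]:
    print(result["key"], result["status"])
```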

Searching Documents

In Solr: Solr is built for searching and hence has a rich set of features to support search; a query sketch follows the list below.

  • Faceted search based on unique field values, explicit queries, date ranges, numeric ranges, or pivots
  • Spelling suggestions for user queries
  • Auto-suggest functionality for completing user queries
  • Simple join capability between two document types
  • Numeric field statistics such as min, max, average, and standard deviation
  • Function queries - influence the score with user-specified complex functions of numeric fields or query relevancy scores
  • "More Like This" suggestions for a given document
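The sketch below illustrates a few of these features (full-text query, faceting on a field, and hit highlighting) against Solr's select handler. The local URL, core name, and field names are assumptions for the example.

```python
import requests

# Sketch only: assumes a local Solr core named "products" with "name" and
# "brand" fields; the URL and field names are hypothetical.
SOLR_SELECT_URL = "http://localhost:8983/solr/products/select"

params = {
    "q": "name:bike",      # full-text query on the name field
    "rows": 10,
    "wt": "json",          # ask for a JSON response
    "facet": "true",       # facet counts over unique brand values
    "facet.field": "brand",
    "hl": "true",          # hit highlighting
    "hl.fl": "name",
}

resp = requests.get(SOLR_SELECT_URL, params=params)
resp.raise_for_status()
body = resp.json()

print("Hits:", body["response"]["numFound"])
print("Brand facets:", body["facet_counts"]["facet_fields"]["brand"])
print("Highlighting:", body["highlighting"])
```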

In Azure Search: To query your search data, your application sends a request that includes the service URL and an api-key for authenticating the request, along with a search query formulated in either OData syntax or a simple query syntax that provides the same functionality. When a query is sent to the Search API, the search engine in Azure Search processes the query and returns the results as a JSON document, which can then be parsed and added to the presentation layer of your application.

Azure Search uses a simple query syntax for search text. This syntax is designed to be end-user friendly and is processed in a way that is tolerant of errors.

Azure Search supports a subset of the OData expression syntax for $filter.
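Putting these pieces together, here is a sketch of a query that combines the simple search syntax, an OData $filter, a facet, highlighting, and a hit count. The service URL, query key, api-version, index name, and field names are placeholders, not values from the article.

```python
import requests

# Sketch only: service URL, query key, api-version, index name, and fields are
# hypothetical placeholders.
SERVICE = "https://my-search-service.search.windows.net"
QUERY_KEY = "<query-api-key>"   # a read-only query key is enough for searches
API_VERSION = "2015-02-28"

params = {
    "api-version": API_VERSION,
    "search": "bike",            # simple, error-tolerant query syntax
    "$filter": "price lt 600",   # OData filter expression
    "facet": "brand",            # faceted navigation on the brand field
    "highlight": "name",         # highlighted hits in the name field
    "$count": "true",            # include the total hit count
    "$top": 10,
}

resp = requests.get(
    f"{SERVICE}/indexes/products/docs",
    params=params,
    headers={"api-key": QUERY_KEY},
)
resp.raise_for_status()
body = resp.json()

print("Total hits:", body.get("@odata.count"))
for doc in body["value"]:
    print(doc["id"], doc["name"])
```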

Some of the salient features of Solr are also fully supported in Azure Search:

  • Full-text search
  • Scoring profiles
  • Faceted navigation
  • Suggestions for type-ahead or autocomplete
  • Count of the search hits returned for a query
  • Highlighted hits

Value Proposition for Azure Search
As we see above, Azure Search tries to match the features of Solr in most respects. Solr, however, is a seasoned search engine while Azure Search is still in its preview stage, so some gaps may appear in the understanding and proper application of Azure Search. There is one area, though, where Azure Search may be a real winner for enterprises: scalability and availability.

A Solr installation requires a highly competent administrator to ensure that it scales to tens of thousands of documents while searches remain load balanced across multiple nodes and performance is not affected.

Solr adopts a number of features to support this level of massive scalability.

When your data is too large for one node, you can break it up and store it in sections by creating one or more shards. Each is a portion of the logical index, or core, and it's the set of all nodes containing that section of the index.

SolrCloud is the name of a set of new distributed capabilities in Solr. Passing parameters to enable these capabilities will enable you to set up a highly available, fault tolerant cluster of Solr servers. Use SolrCloud when you want high scale, fault tolerant, distributed indexing and search capabilities.

Implementing and maintaining SolrCloud requires solid knowledge on the part of administrators.

Azure Search, however, makes scalability much simpler. When a new Azure Search service is provisioned, the underlying building blocks are managed automatically: a Standard search service is allocated in user-defined bundles of partitions (storage) and replicas (service workloads), and you can scale partitions or replicas independently, adding more of whichever resource is needed.

Every search service starts with a minimum of one replica and one partition. If you signed up for dedicated resources using the Standard pricing tier, you can click the SCALE tile in the service dashboard to readjust the number of partitions and replicas used by your service. When you add either resource, the service uses them automatically. No further action is required on your part.

Increasing queries per second (QPS) or achieving high availability is done by adding replicas. Each replica has one copy of an index, so adding one more replica translates to one more index that can be used to service query requests. Currently, the rule of thumb is that you need at least 3 replicas for high availability.

Most service applications have a built-in need for more replicas rather than partitions, as most applications that utilize search can fit easily into a single partition that can support up to 15 million documents. For those cases where an increased document count is required, you can add partitions.

Summary
As always, using a commercial PaaS option comes at a price, but enterprises weigh the trade-off between the ease of maintenance and faster go-to-market of a managed platform and the control of self-maintained products. Azure Search is also currently in beta, so deploying mission-critical production applications may have to wait, but it is worth getting started with pilot projects, and it is in Microsoft's best interest to quickly bring the service up to mission-critical standards.

More Stories By Srinivasan Sundara Rajan

Highly passionate about utilizing Digital Technologies to enable next generation enterprise. Believes in enterprise transformation through the Natives (Cloud Native & Mobile Native).
