Day 12: Index your data with ElasticSearch

ElasticSearch is a flexible and powerful open source, distributed, real-time search and analytics engine. You can store structured JSON documents, and by default ElasticSearch will try to detect the data structure and index the data. ElasticSearch uses Lucene to provide full-text search capabilities with a powerful query language. Installation guides for various platforms are available in the ElasticSearch reference documentation. To install the corresponding Catmandu module, run:

$ cpanm Catmandu::Store::ElasticSearch

[For those of you running the Catmandu VirtualBox this installation is not required: ElasticSearch is installed by default.]
Now get some JSON data to work with:

$ wget -O banned_books.json http://www.berlin.de/rubrik/hauptstadt/verbannte_buecher/verbannte-buecher.json

First index the data with ElasticSearch. You have to specify an index (--index_name) and a type (--bag):

$ catmandu import -v JSON --multiline 1 to ElasticSearch --index_name books --bag banned < banned_books.json
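
The command line maps directly onto Catmandu's Perl API. A minimal sketch of the same import in Perl (index and bag names match the command above):

use Catmandu;

# Read the multiline JSON dump and stream it into the 'banned' bag
# of the 'books' index, mirroring the CLI command above.
my $importer = Catmandu->importer('JSON', file => 'banned_books.json', multiline => 1);
my $store    = Catmandu->store('ElasticSearch', index_name => 'books');
my $bag      = $store->bag('banned');

$bag->add_many($importer);
$bag->commit;    # flush buffered records to the index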

Now you can export all items from the index to different formats, such as YAML, XML and XLSX:

$ catmandu export ElasticSearch --index_name books --bag banned to YAML
$ catmandu export ElasticSearch --index_name books --bag banned to XML
$ catmandu export -v ElasticSearch --index_name books --bag banned to XLSX --file banned_books.xlsx
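
In Perl the same export is a matter of streaming the bag into an exporter. A minimal sketch for the YAML case:

use Catmandu;

# Stream every record of the 'banned' bag to YAML on stdout.
my $bag  = Catmandu->store('ElasticSearch', index_name => 'books')->bag('banned');
my $yaml = Catmandu->exporter('YAML');

$yaml->add_many($bag);
$yaml->commit;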

You can count all indexed items or those which match a query:

$ catmandu count ElasticSearch --index_name books --bag banned
$ catmandu count ElasticSearch --index_name books --bag banned --query 'firstEditionPublicationYear: "1937"'
$ catmandu count ElasticSearch --index_name books --bag banned --query 'firstEditionPublicationPlace: "Berlin"'
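
In Perl, counting is a method on the bag, and a query count is the total of a search result. A sketch using the queries above:

use Catmandu;

my $bag = Catmandu->store('ElasticSearch', index_name => 'books')->bag('banned');

print $bag->count, "\n";    # all indexed items
print $bag->search(query => 'firstEditionPublicationYear: "1937"')->total, "\n";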

You can search an index for a specific value and export all matching items:

$ catmandu export ElasticSearch --index_name books --bag banned --query 'firstEditionPublicationYear: "1937"' to JSON
$ catmandu export ElasticSearch --index_name books --bag banned --query 'firstEditionPublicationPlace: "Berlin"' to CSV --fields '_id,authorFirstname,authorLastname,title,firstEditionPublicationPlace'
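
The Perl counterpart uses searcher, which iterates over every hit of a query, combined with an exporter. A sketch mirroring the CSV command above:

use Catmandu;

my $bag  = Catmandu->store('ElasticSearch', index_name => 'books')->bag('banned');
my $hits = $bag->searcher(query => 'firstEditionPublicationPlace: "Berlin"');

my $csv = Catmandu->exporter('CSV',
    fields => '_id,authorFirstname,authorLastname,title,firstEditionPublicationPlace');

$csv->add_many($hits);
$csv->commit;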

Collections and items can be moved within ElasticSearch or even to other stores or search engines:

$ catmandu move ElasticSearch --index_name books --bag banned --query 'firstEditionPublicationPlace: "Berlin"' to ElasticSearch --index_name books --bag berlin
$ catmandu move ElasticSearch --index_name books --bag banned to MongoDB --database_name books --bag banned
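
In Perl a move is nothing more than adding the records of one bag (or of a query on it) to another. A sketch of the first command, assuming both bags live in the same index:

use Catmandu;

my $store  = Catmandu->store('ElasticSearch', index_name => 'books');
my $berlin = $store->bag('berlin');

# Copy all records matching the query from 'banned' into 'berlin'.
$berlin->add_many(
    $store->bag('banned')->searcher(query => 'firstEditionPublicationPlace: "Berlin"')
);
$berlin->commit;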

You can delete whole collections from a database or just items which match a query:

$ catmandu delete ElasticSearch --index_name books --bag banned --query 'firstEditionPublicationPlace: "Berlin"'
$ catmandu delete ElasticSearch --index_name books --bag banned
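
The matching Perl methods are delete_by_query and delete_all. A sketch of the two commands above:

use Catmandu;

my $bag = Catmandu->store('ElasticSearch', index_name => 'books')->bag('banned');

$bag->delete_by_query(query => 'firstEditionPublicationPlace: "Berlin"');    # items matching a query
$bag->delete_all;                                                            # the whole collection
$bag->commit;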

Catmandu::Store::ElasticSearch also supports CQL as a query language. For setup and usage, see the documentation.

Continue to Day 13: Harvest data with OAI-PMH >>


Comments

    • hochstenbach

      Yes, the ElasticSearch store implements all methods of Catmandu::Searchable so you can add these options on the command line:

      --query # Your search query
      --start # Start of the result set (e.g. use this to skip results)
      --limit # Number of results to return
      --sort # Sorting field
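
      For example, a sketch in Perl (the exact sort syntax may depend on your ElasticSearch version):

      use Catmandu;

      my $bag  = Catmandu->store('ElasticSearch', index_name => 'books')->bag('banned');
      my $hits = $bag->search(
          query => 'firstEditionPublicationPlace: "Berlin"',
          start => 0,        # skip this many results
          limit => 10,       # number of results to return
          sort  => 'title',  # sorting field
      );
      printf "showing %d of %d hits\n", scalar @{$hits->hits}, $hits->total;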


  1. j2b

    Does Catmandu::Store::ElasticSearch support v5.4? (According to the info on GitHub it does not explicitly state that it's not supported – meaning 2.x and newer…)

    Should an index mapping and/or template be created prior to the Catmandu import XML to ElasticSearch CLI command? As far as I can see, the index gets created, but no data gets populated into it. I am trying to troubleshoot this.

    Catmandu's -v verbose output does not give full details of its actions. Is there a way to get a full output stack for troubleshooting?


      • j2b

        Thanks a bunch for the follow-up and feedback. I at least moved a step further and am actually getting something into the ES index. Yet there is an unclear issue: why is all of the XML imported as one record, despite a pre-existing schema? I even tried conversion to JSON – the result is the same.

        The file is in MARC21 format, converted (via Catmandu) to a kind of more human-readable form to help troubleshoot.

        Originally, in the XML, records were split into <record> elements. Conversion to JSON resulted in a record[0,1,2,3,…] array. Both appear in ES as a single record, with sub-indexes of it. I cannot find a way to split this into fields. marc_map()/add() fixes are still not working for me.


  2. j2b

    Thank you for the valuable links and explanations. I actually discovered that where you apply --fix matters. Formerly I was using --fix on the ES side, without success. Then, after mangling with the convert functions, I moved the required --fix to the XML/YAML/MARC side of the import function, and everything started to work and became clear.

    Yet, to answer my own initial question: there is a need to map fields for ES (at least at this step of my evolution) so they are accessible with queries etc. But it should be done on the original import's side. I am currently testing with catmandu import MARC --fix 'move_field(_id,my_id)' to ElasticSearch --index_name test --bag data < test.mrc and everything happens as expected. I'm very glad to have found this invaluable project!

