Working with Crawly

2023-09-26 09:19:35

I recently had to work with a scraper named Crawly, and I had some fun with it. There isn't a lot of information out there on how to work with it well, so I wanted to write a small blog post with some small tips and recommendations that might help people who want to work with it in the future.

How to set cookies

Sometimes you need to set some cookies so that the website will work. For example, when a site has some kind of age verification, you always need that cookie set to true (or some other value), and it always needs to be sent if you want to see the content.

To set this up, I added the following to my config file (though it would be even better to add it to your spider's override_settings function instead):

config :crawly,
  ...
  middlewares: [
    ...,
    {Crawly.Middlewares.RequestOptions,
     [
       timeout: 10_000,
       recv_timeout: 5000,
       hackney: [
         cookie: [
           {"YOUR_COOKIE_NAME", "1",
            [{:path, "/"}, {:domain, ".domain.com"}, {:secure, true}, {:max_age, 7}]}
         ]
       ]
     ]}
  ],
...

Basically, this makes sure that each of your requests carries this cookie. It's a pretty nice feature that helped me tremendously, but it isn't really covered in the documentation, and I had to search a lot to find a good way to do it.
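
Since I mentioned override_settings: below is a rough sketch of how the same request options could live in the spider itself, so the cookie only applies to that one spider. The module name and cookie values are placeholders, not my actual code.

defmodule Project.Spiders.ExampleSpider do
  use Crawly.Spider

  # A minimal sketch: overriding the RequestOptions middleware per spider.
  # Note that this list replaces the global middleware list, so keep any
  # middlewares you still need. Names and cookie values are placeholders.
  @impl Crawly.Spider
  def override_settings() do
    [
      middlewares: [
        Crawly.Middlewares.DomainFilter,
        Crawly.Middlewares.UniqueRequest,
        {Crawly.Middlewares.RequestOptions,
         [
           timeout: 10_000,
           recv_timeout: 5000,
           hackney: [
             cookie: [
               {"YOUR_COOKIE_NAME", "1",
                [{:path, "/"}, {:domain, ".domain.com"}, {:secure, true}, {:max_age, 7}]}
             ]
           ]
         ]}
      ]
    ]
  end

  # base_url/0, init/1 and parse_item/1 go here as usual.
end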

How to run multiple crawlers through one project and do what you want

Also, on one of my work-related projects I had to run multiple crawlers for multiple sites with different designs (and run them dynamically, more on that later on), and I had to persist the scraped entities to a database. So I needed to use pipelines. But not all of the crawlers I made would have the same fields, so the built-in pipelines alone wouldn't cut it. So… I made my own pipes.

config :crawly,
  closespider_timeout: 1,
  concurrent_requests_per_domain: 20,
  middlewares: [
    ...
  ],
  pipelines: [
    {Project.Crawly.ValidateEntities},
    {Project.Crawly.DuplicateFilter},
    {Project.Crawly.PersistEntity}
  ],

And when an entity goes into the pipeline, my validation pipeline looks like this:

defmodule Project.Crawly.ValidateEntities do
  @behaviour Crawly.Pipeline

  require Logger

  @impl Crawly.Pipeline
  def run(item, state, opts \\ [])

  def run(%{type_1: true} = item, state, _opts) do
    Crawly.Pipelines.Validate.run(item, state, fields: [:video_views, :url, :title])
  end

  def run(%{type_2: true} = item, state, _opts) do
    Crawly.Pipelines.Validate.run(item, state, fields: [:title, :url, :text])
  end

  def run(%{type_3: true} = item, state, _opts) do
    Crawly.Pipelines.Validate.run(item, state, fields: [:text, :on_sale?])
  end

  def run(item, state, _opts) do
    Crawly.Pipelines.Validate.run(item, state, fields: [:type, :url, :name])
  end
end

As you can see, I use pattern matching to pick the validation flow for each item type. It might be a bit of a hack, but it was the easiest way to route items through the flow I wanted.
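
I won't paste my real persistence code, but here is a minimal sketch of what a pipeline like Project.Crawly.PersistEntity could look like, assuming an Ecto setup; Project.Repo and Project.Entity are placeholder names, not from my actual project.

defmodule Project.Crawly.PersistEntity do
  # A rough sketch, not my exact code: persist each item with Ecto and
  # drop it from the pipeline if the insert fails.
  @behaviour Crawly.Pipeline

  require Logger

  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    case Project.Repo.insert(Project.Entity.changeset(%Project.Entity{}, item)) do
      {:ok, _record} ->
        {item, state}

      {:error, changeset} ->
        Logger.warning("Could not persist item: #{inspect(changeset.errors)}")
        # Returning false as the first tuple element drops the item.
        {false, state}
    end
  end
end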

Running Crawly with Oban.

Oh boy, this is also a fun part. You can easily run Crawly through Oban, but there is a problem: Oban never knows when Crawly has finished crawling. So, if Oban doesn't know the crawl has finished, the job basically just has to wait for a message. How do you fix this? By taking your Oban job's pid, handing it to the crawler, and, when the crawl is done, sending that pid a message saying it has finished.

Crawly gives you a callback for this, on_spider_closed_callback, which is always called when a spider finishes or times out:

config :crawly,
  ...,
  middlewares: [
    ...
  ],
  pipelines: [
    ...
  ],
  on_spider_closed_callback: fn _spider_name, crawl_id, reason ->
    [{_, pid}] = :ets.lookup(:crawler_pid, crawl_id)

    send(pid, {:crawly_finished, reason})

    :ok
  end,

Note: perhaps it's better to add this function to override_settings. So, basically, you look up your pid (my way might not be the best: I persist the pids inside an :ets table and look them up by crawl_id, and it may not even be production safe, I'm not a pro at that) and then you send that pid a message saying that Crawly has finished its job.

And then, in the Oban job, you can put in something like this:

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"thingy_id" => thingy_id}}) do
    ...

    receive do
      {:crawly_finished, reason} ->
        IO.inspect("Crawl finished #{reason}")
        reason
    end

    :ok
  end

And this makes the job complete only once the crawl has actually finished, so you won't have to deal with jobs that are marked done while the crawler is still running.
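
One extra thing I would consider (my own suggestion, not something from the Crawly docs): an after clause on the receive, so the job errors out instead of hanging forever if the spider crashes and the callback never fires. The timeout value here is just an example.

    # A sketch: same receive as above, but with a fallback timeout
    # (30 minutes is an arbitrary example value).
    receive do
      {:crawly_finished, reason} ->
        IO.inspect("Crawl finished #{reason}")
        :ok
    after
      :timer.minutes(30) ->
        {:error, :crawl_timed_out}
    end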

Running multiple identical crawlers.

Oh boy, this one was also a doozy. The thing is, Crawly generates its crawl ids with UUID1 (look up UUID version 1 if you want more details). For whatever reason, UUID1 wasn't unique enough for me, and I had some collisions where crawlers ended up with the same id (when I was running Crawly through Oban in multiple instances). I read the docs a bit more and found out that you can set your own crawl id, so that's what I did: I set my crawl id not with UUID1 but with UUID4. They are the same length, and UUID4 is random, so in practice you won't get collisions.

So… I basically added this code when I started a crawler instance:

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"thingy_id" => thingy_id}}) do
    pid = self()
    entity = Project.get_the_thingy(thingy_id)
    id = UUID.uuid4()

    init_ets_if_required(:entity_spider)
    init_ets_if_required(:crawler_pid)
    :ets.insert(:crawler_pid, {id, pid})

    Crawly.Engine.start_spider(Project.Spiders.EntitySpider,
      url: entity.url,
      crawl_id: id
    )

    :ets.insert(:entity_spider, {id, entity})

    ...

    :ok
  end

  defp init_ets_if_required(name) do
    if :undefined == :ets.info(name) do
      # :public might not be the safest way, though.
      :ets.new(name, [:named_table, :set, :public])
    end
  end

You might be asking: why do I need a unique id? Well, because if I don't have a unique id and there's a collision, how the hell am I going to stop the right crawler, read the correct data, and save it to the proper thingy? That's why I need a unique id: with a unique crawl_id I know exactly which crawler it is, I know its pid, and I can do everything I need/want.

Crawly and relations

This will also sound like a hack, but it worked for me. The thing is, if you need an id for a relation (for example, you want to crawl some posts and save their comments, and the comments need to belong to the post), there's no obvious way to pass it along in the middle of the crawl. No, really. I had some problems getting the id of a "post" to its comments. The easiest way I found was again through :ets: using the crawl id, the spider asks the table for the data it needs. It was a quick win and might not be the best way to do it. I might look into other approaches later, but while my home project isn't going to prod, this will work for me for a while.
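
To make that a bit more concrete, here is a rough reconstruction of what that looks like inside the spider's parse_item. This is a sketch, not my verbatim code: the :entity_spider table is the one filled in the Oban job above, the Floki selectors and field names are made up, and it assumes your Crawly version has Crawly.Engine.get_crawl_id/1.

  # A sketch: look up the parent entity for this crawl in :ets and attach
  # its id to every scraped comment. Selectors and field names are placeholders.
  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, crawl_id} = Crawly.Engine.get_crawl_id(__MODULE__)
    [{_, entity}] = :ets.lookup(:entity_spider, crawl_id)

    comments =
      response.body
      |> Floki.parse_document!()
      |> Floki.find(".comment")
      |> Enum.map(fn node ->
        %{
          post_id: entity.id,
          text: Floki.text(node),
          url: response.request_url
        }
      end)

    %Crawly.ParsedItem{items: comments, requests: []}
  end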

Crawly and dynamic links.

This was also a small pain point. Sometimes you want your crawler to work with dynamic links (like I said earlier, you might want to crawl multiple posts).

So, for example, when I worked with this, I started Crawly like this in my Oban job:

    Crawly.Engine.start_spider(Project.EntitySpider,
      url: thingy.url,
      crawl_id: id
    )

Basically, I gave the spider an extra option (called url), and then I added this to the spider file:

defmodule Project.EntitySpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "example.com"

  @impl Crawly.Spider
  def init(options) do
    url = Keyword.get(options, :url)

    posts =
      Crawly.Request.new("#{url}/posts", %{},
        hackney: [
          cookie: [
            {"good_cookie", "1",
             [{:path, "/"}, {:domain, ".example.com"}, {:secure, true}, {:max_age, 7}]}
          ]
        ]
      )

    [start_requests: [posts]]
  end
  ...
end

So, basically, I created a new Crawly.Request from the url that came in through the keyword list, and then everything just works dynamically.

These are some of the problems I ran into while working with Crawly. Perhaps my mumbling about them will help somebody in the future. :)


