#archiveteam-ot 2020-02-06,Thu


Time Nickname Message
01:35 🔗 thuban2 has joined #archiveteam-ot
01:41 🔗 thuban1 has quit IRC (Read error: Operation timed out)
02:57 🔗 thuban3 has joined #archiveteam-ot
03:00 🔗 thuban2 has quit IRC (Read error: Operation timed out)
03:09 🔗 thuban4 has joined #archiveteam-ot
03:18 🔗 thuban3 has quit IRC (Ping timeout: 745 seconds)
04:06 🔗 godane has joined #archiveteam-ot
04:23 🔗 qw3rty_ has joined #archiveteam-ot
04:23 🔗 qw3rty_ has quit IRC (Connection closed)
04:23 🔗 qw3rty_ has joined #archiveteam-ot
04:27 🔗 qw3rty has quit IRC (Ping timeout: 276 seconds)
04:50 🔗 hata has quit IRC (Read error: Operation timed out)
04:51 🔗 jake_test has quit IRC (Read error: Operation timed out)
04:51 🔗 JAA has quit IRC (Read error: Operation timed out)
04:51 🔗 marked1 has quit IRC (Read error: Operation timed out)
04:51 🔗 simon816 has quit IRC (Read error: Operation timed out)
04:52 🔗 svchfoo1 has quit IRC (Read error: Operation timed out)
05:06 🔗 nicolas17 has quit IRC (Quit: zzz)
05:19 🔗 Tenebrae has quit IRC (Read error: Operation timed out)
05:20 🔗 Tenebrae has joined #archiveteam-ot
05:20 🔗 MrRadar2 has quit IRC (Write error: Broken pipe)
05:20 🔗 MrRadar2_ has joined #archiveteam-ot
05:32 🔗 jake_test has joined #archiveteam-ot
05:37 🔗 jake_test has quit IRC (Read error: Operation timed out)
05:55 🔗 thuban has joined #archiveteam-ot
06:02 🔗 thuban4 has quit IRC (Read error: Operation timed out)
06:06 🔗 fuzzy8021 has quit IRC (Read error: Connection reset by peer)
06:06 🔗 MrRadar has quit IRC (Read error: Operation timed out)
06:06 🔗 _niklas has quit IRC (Read error: Operation timed out)
06:07 🔗 MrRadar has joined #archiveteam-ot
06:09 🔗 _niklas has joined #archiveteam-ot
06:13 🔗 fuzzy8021 has joined #archiveteam-ot
06:33 🔗 hata has joined #archiveteam-ot
06:34 🔗 marked1 has joined #archiveteam-ot
06:34 🔗 svchfoo1 has joined #archiveteam-ot
06:34 🔗 simon816 has joined #archiveteam-ot
06:35 🔗 svchfoo3 sets mode: +o svchfoo1
06:35 🔗 jake_test has joined #archiveteam-ot
06:35 🔗 JAA has joined #archiveteam-ot
06:35 🔗 AlsoJAA sets mode: +o JAA
06:40 🔗 dhyan_nat has joined #archiveteam-ot
06:59 🔗 kiska has quit IRC (Remote host closed the connection)
06:59 🔗 Flashfire has quit IRC (Read error: Connection reset by peer)
06:59 🔗 kiska has joined #archiveteam-ot
07:00 🔗 Flashfire has joined #archiveteam-ot
07:00 🔗 svchfoo3 sets mode: +o kiska
07:00 🔗 Tenebrae has quit IRC (Read error: Operation timed out)
07:00 🔗 svchfoo1 sets mode: +o kiska
07:04 🔗 dhyan_nat has quit IRC (Remote host closed the connection)
07:07 🔗 Ctrl has quit IRC (Ping timeout: 864 seconds)
07:08 🔗 MrRadar2_ has quit IRC (Read error: Connection reset by peer)
07:09 🔗 Tenebrae has joined #archiveteam-ot
07:10 🔗 dhyan_nat has joined #archiveteam-ot
07:11 🔗 MrRadar2 has joined #archiveteam-ot
07:15 🔗 dashcloud has quit IRC (Read error: Operation timed out)
07:16 🔗 Raccoon` has quit IRC (Remote host closed the connection)
07:18 🔗 Datechnom has quit IRC (Quit: The Lounge - https://thelounge.chat)
07:25 🔗 dashcloud has joined #archiveteam-ot
07:46 🔗 Ctrl has joined #archiveteam-ot
09:24 🔗 Datechnom has joined #archiveteam-ot
09:55 🔗 godane has quit IRC (Ping timeout: 258 seconds)
10:07 🔗 ephemer0l has quit IRC (Ping timeout: 745 seconds)
10:22 🔗 ephemer0l has joined #archiveteam-ot
10:35 🔗 ephemer0l has quit IRC (Ping timeout: 745 seconds)
10:36 🔗 ephemer0l has joined #archiveteam-ot
10:49 🔗 ephemer0l has quit IRC (Ping timeout: 745 seconds)
10:58 🔗 ephemer0l has joined #archiveteam-ot
11:17 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
11:21 🔗 schbirid has joined #archiveteam-ot
12:01 🔗 Ctrl has quit IRC (Read error: Operation timed out)
12:01 🔗 dhyan_nat has quit IRC (Read error: Connection reset by peer)
12:03 🔗 Tenebrae has quit IRC (Read error: Operation timed out)
12:04 🔗 MrRadar2 has quit IRC (Remote host closed the connection)
12:10 🔗 Tenebrae has joined #archiveteam-ot
12:11 🔗 MrRadar2 has joined #archiveteam-ot
12:11 🔗 dhyan_nat has joined #archiveteam-ot
12:20 🔗 Tenebrae has quit IRC (Read error: Connection reset by peer)
12:27 🔗 MrRadar2 has quit IRC (Read error: Operation timed out)
12:27 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
12:27 🔗 zino has quit IRC (Quit: Leaving)
12:32 🔗 MrRadar2 has joined #archiveteam-ot
12:34 🔗 Tenebrae has joined #archiveteam-ot
12:36 🔗 dhyan_nat has joined #archiveteam-ot
12:53 🔗 Ctrl has joined #archiveteam-ot
12:57 🔗 thuban1 has joined #archiveteam-ot
12:59 🔗 thuban has quit IRC (Ping timeout: 255 seconds)
13:00 🔗 thuban1 is now known as thuban
13:17 🔗 Vito`_ has quit IRC (Read error: Connection reset by peer)
13:33 🔗 Vito`_ has joined #archiveteam-ot
14:11 🔗 atphoenix has quit IRC (Read error: Connection reset by peer)
14:25 🔗 VerifiedJ has joined #archiveteam-ot
14:34 🔗 godane has joined #archiveteam-ot
15:42 🔗 MaximeleG has joined #archiveteam-ot
15:50 🔗 MaximeleG has quit IRC (Quit: MaximeleG)
16:07 🔗 nicolas17 has joined #archiveteam-ot
17:11 🔗 nicolas17 has quit IRC (Read error: Operation timed out)
17:57 🔗 thuban brought to you by me digging through archivebox's github and deciding it's kind of a nightmare... what do you guys use for local archiving of web pages? god knows i'd like to be able to rely on wget, but heavy javascript dependence in the wild probably kneecaps it.
18:00 🔗 thuban wpull's phantomjs/youtube-dl integration would be ideal in theory, but phantomjs has been dead for years (and apparently had some weird packaging issues anyway?) and i know ytdl integration is broken in archivebot (which the docs don't reflect).
18:01 🔗 thuban is it usable anyway?
18:27 🔗 systwi Has anyone seen Fusl around lately?
18:30 🔗 antomatic has joined #archiveteam-ot
18:34 🔗 antomati_ has quit IRC (Ping timeout: 276 seconds)
18:41 🔗 Ctrl has quit IRC (Read error: Operation timed out)
18:43 🔗 Ctrl has joined #archiveteam-ot
18:45 🔗 atphoenix has joined #archiveteam-ot
18:46 🔗 Kaz systwi: on hackint, yes
18:46 🔗 Kaz on efnet, no
18:51 🔗 systwi Ah, thanks Kaz
18:51 🔗 JAA thuban: webrecorder and brozzler are supposed to be pretty decent, though I have no practical experience with either (though I've done local WARC playback with pywb before, which is part of webrecorder). webrecorder is more suitable for interactive usage, brozzler for automation. An alternative for interactive archival would be a MITM WARC-writing proxy (e.g. warcprox) with whatever browser you prefer.
19:10 🔗 hook54321 I wish there were a webrecorder instance that fed into IA automatically.
19:11 🔗 JAA Be the change etc.
19:11 🔗 JAA :-)
19:22 🔗 prq thuban: I've been investigating a similar usecase. the javascript concern is very real. I want to try brozzler, as it appears to solve this problem elegantly. I haven't fired it up yet and have only indirectly used it via the wayback machine "save page now" function.
19:23 🔗 prq I don't know if brozzler is capable of being a true crawler like the wget --mirror functionality is, but if it isn't, I may be able to fill the need by enumerating outlinks in brozzler and setting up a queueing/job history system.
19:27 🔗 atphoenix prq if you do develop something...keep URL prioritization in mind
19:27 🔗 prq yeah that makes sense.
19:28 🔗 prq I want different prioritization algorithms per project too.
19:28 🔗 prq if it's a blog with comments that stay open for a long time, I might want to revisit every several months (with exponential backoff), for example.
19:29 🔗 prq but I might change the coefficient if that particular post gets lots of attention from other parts of the project (detected by link weight)
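prq's revisit idea above (periodic recrawls with exponential backoff, sped back up when a page attracts link weight) can be sketched roughly as follows. This is a toy illustration, not anything from an existing crawler; the class name, the 0..1 `link_weight` scale, and the specific coefficients are all invented:

```python
from dataclasses import dataclass

@dataclass
class RevisitPolicy:
    """Schedule recrawls with exponential backoff, shortened by 'attention'."""
    base_days: float = 30.0   # first revisit interval
    backoff: float = 2.0      # interval multiplier per prior visit
    max_days: float = 365.0   # never wait longer than a year

    def next_interval(self, visits: int, link_weight: float = 0.0) -> float:
        # Backoff grows exponentially with the number of prior visits...
        interval = self.base_days * (self.backoff ** visits)
        # ...but incoming-link weight (hypothetical 0..1 score) shortens it.
        interval /= (1.0 + link_weight)
        return min(interval, self.max_days)

policy = RevisitPolicy()
# A quiet blog post: 30 -> 60 -> 120 day gaps.
quiet = [policy.next_interval(v) for v in range(3)]
# A heavily linked post gets revisited twice as often.
hot = policy.next_interval(1, link_weight=1.0)
```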
19:30 🔗 prq I have so many requirements flying around in my head to try to help a content-rewriting-watchdog persona
19:32 🔗 prq my next steps are going to be to try to stand up an archivebot infrastructure so I understand how the archiveteam archivebot system works, same for warrior infrastructure and job trackers
19:32 🔗 prq and I want to stand up brozzler
19:32 🔗 prq and see how it works
19:32 🔗 thuban JAA: thanks!
19:32 🔗 thuban (i considered the MITM route, but i don't really have enough disk space to do totally omnivorous archiving. i wrote a weechat plugin to extract urls from certain buffers, which is a pretty good compromise for my use case--if it's interesting enough to send to a friend, it's interesting enough to archive, and vice versa. plus it would put a stop to dead links in my logs.)
19:32 🔗 thuban i'm mostly thinking about automation and it really doesn't look like webrecorder has that in mind, although i suppose i could always script headless chromium (or whatever) to hit the appropriate url.
19:32 🔗 thuban i _was_ digging around in brozzler to see if it could do what i wanted (the code looks like it accepts cookie dbs but where's the configuration for that??) but the documentation given isn't enough to, like, start it successfully...
19:32 🔗 prq part of me hopes I can use pieces from all those existing projects to create a decent distributed job infrastructure.
19:33 🔗 astrid prq: a thing we've been wanting for a very long time is a way for various crawlers to share a queue, so that archivebot doesn't have to run pipelines that can take a month to drain
19:33 🔗 prq but it's honestly a bit above my head.
19:33 🔗 prq yeah, that's a design goal (design hope?) that I have too.
19:33 🔗 astrid nice.
19:34 🔗 thuban prq: lol, i was just lamenting that the internet archive's distributed system means that brozzler is heavily overengineered for what i want to do
19:34 🔗 astrid if you make meaningful progress on that, you'll probably be able to round up some help from archiveteam dev folks :)
19:34 🔗 prq I figure I would have a preprocessing queue that would just be a list of URLs-- maybe some metadata like their source too.
19:34 🔗 hook54321 astrid: iirc there was an experimental distributed version of wpull
19:34 🔗 prq then that preprocess would check existing DB to see if that URL has been handled recently, and perhaps adjust its importance score.
19:35 🔗 astrid archivebot's main albatross is (1) long lived local queues and (2) need to trust pipeline operators to keep their wpull's running for a very long time
19:35 🔗 prq and then if it does need to be revisited, it'd get dropped into a brozzler worker queue
19:35 🔗 prq then brozzler's job would be to load exactly that page, spit out warc data, and inject outlinks back into the preprocess queue with appropriate metadata
19:36 🔗 prq the queueing system could be something that works on a distributed consensus model, but I don't have a good sense of how big those queue and historical databases would get over time to choose a technology.
19:37 🔗 prq plus, my personal usecase is less "rescue data from pending shutdown" and more "detect corporate-ish tomfoolery", but the archive queue architecture serves both purposes well I think.
19:38 🔗 astrid most jobs on the archivebot dashboard at the moment have 1k-10k urls in queue but i've seen single jobs get 10 million urls queued
19:38 🔗 prq this wget that I've been running against this big entity since november is in the millions of urls visited. it's wget, so I don't know how big the queue is.
19:39 🔗 astrid actually i was wrong, there are at least several active jobs right now with 1MM+ urls in q
19:39 🔗 atphoenix prq, also consider that sites can change behavior based on detected browser or IP range, so locking a crawl to a single instance or a single country (for geolocation issues) may matter
19:39 🔗 astrid biggest active job has 30MM queued urls
19:40 🔗 atphoenix even if preference is for a distributed crawler
19:40 🔗 astrid ^ also
19:40 🔗 prq oh very much aware of that. The specific sites that I am targeting for my usecase do this in a pretty predictable way-- they just have a ?lang=whatever and my recursive wget seems to pick up all the versions just fine.
19:40 🔗 astrid nice
19:40 🔗 prq but saying "this job should run in this country" ought to be easy enough to deal with with a bit of metadata
19:41 🔗 astrid oh wait no biggest active archivebot job has 125MM urls queued, 80MM fetches so far
19:41 🔗 astrid :)
19:44 🔗 prq also, my early goals in my project are less about being distributed and more about being stateless so I can get started running stuff in a homelab environment. even with UPS backup, that just ensures a higher chance of a clean shutdown in a power outage lasting more than a minute or so. I live in hurricane alley.
19:45 🔗 astrid restarting statelessly is a big wishlist item!
19:45 🔗 astrid and v useful
19:45 🔗 prq stateless and distributed go very well together.
19:45 🔗 prq so with all that said-- is this basic idea a sound architecture?
19:46 🔗 astrid i believe so
19:46 🔗 VADemon_ has joined #archiveteam-ot
19:46 🔗 prq oh, the preprocessing queue could also do fun things like have special handlers or middleware. happen to find a tiny.url? add a hook to tell the urlteam project somehow.
19:47 🔗 prq come across a domain you've never seen before? sit on it but keep a note and then ask the admin if it's of interest when it hits a certain threshold.
19:47 🔗 hook54321 prq: some sites block access to requests from the EU
19:48 🔗 prq "10 domains in your 150 domain archival project are linking to this other domain? want to add it to the list?"
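The middleware hooks prq floats above (flag shortener URLs for urlteam, count sightings of unknown domains and suggest them once they cross a threshold) could be a small dispatch layer in the preprocessing queue. A hypothetical sketch; the shortener list, threshold, and handoff lists are placeholders:

```python
from collections import Counter
from urllib.parse import urlparse

SHORTENERS = {"tinyurl.com", "bit.ly"}  # assumed, incomplete list

class Middleware:
    """Illustrative preprocess hooks: shortener handoff, new-domain threshold."""

    def __init__(self, threshold: int = 10):
        self.for_urlteam = []        # shortener URLs to hand off somehow
        self.domain_links = Counter()
        self.suggestions = []        # "want to add it to the list?" candidates
        self.threshold = threshold

    def handle(self, url: str):
        host = urlparse(url).hostname or ""
        if host in SHORTENERS:
            self.for_urlteam.append(url)   # e.g. notify the urlteam project
            return
        self.domain_links[host] += 1
        if self.domain_links[host] == self.threshold:
            self.suggestions.append(host)  # surface it to the admin once

mw = Middleware(threshold=2)
mw.handle("https://tinyurl.com/abc123")
mw.handle("https://newsite.example/a")
mw.handle("https://newsite.example/b")
```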
19:48 🔗 prq yeah, that'd be covered in metadata about a job saying to only use workers in a certain region.
19:49 🔗 prq I signed up for the archive-it webinar to learn about how their process works.
19:49 🔗 prq I think it's a paid service
19:49 🔗 VADemon has quit IRC (Ping timeout: 255 seconds)
19:55 🔗 BlueMax has joined #archiveteam-ot
20:00 🔗 Raccoon has joined #archiveteam-ot
20:09 🔗 thuban lol, i got brozzler running but afaict it doesn't do headless. i don't think that's usable for me
20:13 🔗 DogsRNice has joined #archiveteam-ot
20:14 🔗 atphoenix Virtualbox can run headless
20:15 🔗 atphoenix and you can connect to them if needed. The VMs don't know they are headless.
20:15 🔗 thuban yay, even more bloat :/
20:16 🔗 atphoenix I make no claims of idealism. Isn't yet more abstraction layers the way of the software world? It's basically turtles all the way down.
20:17 🔗 thuban it's not the way of _my_ world
20:37 🔗 Raccoon Is there any such thing as a Chrome extension for page saving / page scraping content metadata automatically from pages visited of a specific site?
20:38 🔗 Raccoon I'm very handy with regex, just need a platform to create templates in to grabby the datas
20:39 🔗 Raccoon regex / page element references
21:08 🔗 schbirid has quit IRC (Quit: Leaving)
21:19 🔗 prq I thought the brozzler docs had instructions for running an Xvnc server (they said there was a problem with Xvfb and taking screenshots)
21:30 🔗 JAA astrid: The big issue with distributed crawling is IP-dependent cookies. It'll run into even more session ID loops than AB already does.
21:31 🔗 JAA But it can work well for sites that are not affected by this, of course.
21:32 🔗 prq that seems like a proxy system might be helpful for jobs like that. still a distributed and stateless crawl, but that specific job would use a particular egress path.
21:32 🔗 JAA I modified wpull a while ago to run distributed workers with a central DB. It kind of worked, too, but I never properly finished it.
21:32 🔗 JAA prq: Yeah, that could work.
21:32 🔗 prq It's kind of on my list to try your wpull branch
21:32 🔗 JAA :-)
21:33 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
21:33 🔗 prq but some of the sites I pay attention to would benefit from brozzler due to js-dependencies in their rendering of outlinks.
21:34 🔗 prq the other part of the process that I haven't investigated quite yet is warc dedup
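The WARC dedup prq mentions is commonly done by hashing each response payload and, when a digest repeats, writing a short "revisit" record that points at the original capture instead of storing the body again (warcprox works along these lines). A toy version of just the digest bookkeeping, with no real WARC writing:

```python
import hashlib

class DedupDB:
    """Map payload SHA-1 digests to the URL that first stored that payload."""

    def __init__(self):
        self.seen = {}  # digest -> first URL captured with this payload

    def record(self, url: str, payload: bytes):
        digest = "sha1:" + hashlib.sha1(payload).hexdigest()
        if digest in self.seen:
            # A real writer would emit a WARC 'revisit' record here,
            # referencing the original response record.
            return ("revisit", self.seen[digest])
        self.seen[digest] = url
        return ("response", url)

db = DedupDB()
first = db.record("https://example.com/logo.png", b"\x89PNG...")
# Same bytes served at a second URL get deduplicated:
second = db.record("https://example.com/en/logo.png", b"\x89PNG...")
```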
21:34 🔗 prq I wonder how flexible brozzler is with a shared session among instances.
21:36 🔗 JAA Browser-based crawling is definitely useful, but it also requires loads of resources compared to wget/wpull or qwarc.
21:37 🔗 prq this architecture I'm developing in my head would have multiple job types-- wget/wpull/brozzler/etc.
21:38 🔗 JAA We should move this to -dev.
22:17 🔗 VerifiedJ has quit IRC (Quit: Leaving)
22:46 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
23:18 🔗 arkhive1 has quit IRC (Read error: Operation timed out)
23:35 🔗 qw3rty has joined #archiveteam-ot
23:39 🔗 qw3rty_ has quit IRC (Read error: Operation timed out)
