01:35 -- thuban2 has joined #archiveteam-ot
01:41 -- thuban1 has quit IRC (Read error: Operation timed out)
02:57 -- thuban3 has joined #archiveteam-ot
03:00 -- thuban2 has quit IRC (Read error: Operation timed out)
03:09 -- thuban4 has joined #archiveteam-ot
03:18 -- thuban3 has quit IRC (Ping timeout: 745 seconds)
04:06 -- godane has joined #archiveteam-ot
04:23 -- qw3rty_ has joined #archiveteam-ot
04:23 -- qw3rty_ has quit IRC (Connection closed)
04:23 -- qw3rty_ has joined #archiveteam-ot
04:27 -- qw3rty has quit IRC (Ping timeout: 276 seconds)
04:50 -- hata has quit IRC (Read error: Operation timed out)
04:51 -- jake_test has quit IRC (Read error: Operation timed out)
04:51 -- JAA has quit IRC (Read error: Operation timed out)
04:51 -- marked1 has quit IRC (Read error: Operation timed out)
04:51 -- simon816 has quit IRC (Read error: Operation timed out)
04:52 -- svchfoo1 has quit IRC (Read error: Operation timed out)
05:06 -- nicolas17 has quit IRC (Quit: zzz)
05:19 -- Tenebrae has quit IRC (Read error: Operation timed out)
05:20 -- Tenebrae has joined #archiveteam-ot
05:20 -- MrRadar2 has quit IRC (Write error: Broken pipe)
05:20 -- MrRadar2_ has joined #archiveteam-ot
05:32 -- jake_test has joined #archiveteam-ot
05:37 -- jake_test has quit IRC (Read error: Operation timed out)
05:55 -- thuban has joined #archiveteam-ot
06:02 -- thuban4 has quit IRC (Read error: Operation timed out)
06:06 -- fuzzy8021 has quit IRC (Read error: Connection reset by peer)
06:06 -- MrRadar has quit IRC (Read error: Operation timed out)
06:06 -- _niklas has quit IRC (Read error: Operation timed out)
06:07 -- MrRadar has joined #archiveteam-ot
06:09 -- _niklas has joined #archiveteam-ot
06:13 -- fuzzy8021 has joined #archiveteam-ot
06:33 -- hata has joined #archiveteam-ot
06:34 -- marked1 has joined #archiveteam-ot
06:34 -- svchfoo1 has joined #archiveteam-ot
06:34 -- simon816 has joined #archiveteam-ot
06:35 -- svchfoo3 sets mode: +o svchfoo1
06:35 -- jake_test has joined #archiveteam-ot
06:35 -- JAA has joined #archiveteam-ot
06:35 -- AlsoJAA sets mode: +o JAA
06:40 -- dhyan_nat has joined #archiveteam-ot
06:59 -- kiska has quit IRC (Remote host closed the connection)
06:59 -- Flashfire has quit IRC (Read error: Connection reset by peer)
06:59 -- kiska has joined #archiveteam-ot
07:00 -- Flashfire has joined #archiveteam-ot
07:00 -- svchfoo3 sets mode: +o kiska
07:00 -- Tenebrae has quit IRC (Read error: Operation timed out)
07:00 -- svchfoo1 sets mode: +o kiska
07:04 -- dhyan_nat has quit IRC (Remote host closed the connection)
07:07 -- Ctrl has quit IRC (Ping timeout: 864 seconds)
07:08 -- MrRadar2_ has quit IRC (Read error: Connection reset by peer)
07:09 -- Tenebrae has joined #archiveteam-ot
07:10 -- dhyan_nat has joined #archiveteam-ot
07:11 -- MrRadar2 has joined #archiveteam-ot
07:15 -- dashcloud has quit IRC (Read error: Operation timed out)
07:16 -- Raccoon` has quit IRC (Remote host closed the connection)
07:18 -- Datechnom has quit IRC (Quit: The Lounge - https://thelounge.chat)
07:25 -- dashcloud has joined #archiveteam-ot
07:46 -- Ctrl has joined #archiveteam-ot
09:24 -- Datechnom has joined #archiveteam-ot
09:55 -- godane has quit IRC (Ping timeout: 258 seconds)
10:07 -- ephemer0l has quit IRC (Ping timeout: 745 seconds)
10:22 -- ephemer0l has joined #archiveteam-ot
10:35 -- ephemer0l has quit IRC (Ping timeout: 745 seconds)
10:36 -- ephemer0l has joined #archiveteam-ot
10:49 -- ephemer0l has quit IRC (Ping timeout: 745 seconds)
10:58 -- ephemer0l has joined #archiveteam-ot
11:17 -- BlueMax has quit IRC (Read error: Connection reset by peer)
11:21 -- schbirid has joined #archiveteam-ot
12:01 -- Ctrl has quit IRC (Read error: Operation timed out)
12:01 -- dhyan_nat has quit IRC (Read error: Connection reset by peer)
12:03 -- Tenebrae has quit IRC (Read error: Operation timed out)
12:04 -- MrRadar2 has quit IRC (Remote host closed the connection)
12:10 -- Tenebrae has joined #archiveteam-ot
12:11 -- MrRadar2 has joined #archiveteam-ot
12:11 -- dhyan_nat has joined #archiveteam-ot
12:20 -- Tenebrae has quit IRC (Read error: Connection reset by peer)
12:27 -- MrRadar2 has quit IRC (Read error: Operation timed out)
12:27 -- dhyan_nat has quit IRC (Read error: Operation timed out)
12:27 -- zino has quit IRC (Quit: Leaving)
12:32 -- MrRadar2 has joined #archiveteam-ot
12:34 -- Tenebrae has joined #archiveteam-ot
12:36 -- dhyan_nat has joined #archiveteam-ot
12:53 -- Ctrl has joined #archiveteam-ot
12:57 -- thuban1 has joined #archiveteam-ot
12:59 -- thuban has quit IRC (Ping timeout: 255 seconds)
13:00 -- thuban1 is now known as thuban
13:17 -- Vito`_ has quit IRC (Read error: Connection reset by peer)
13:33 -- Vito`_ has joined #archiveteam-ot
14:11 -- atphoenix has quit IRC (Read error: Connection reset by peer)
14:25 -- VerifiedJ has joined #archiveteam-ot
14:34 -- godane has joined #archiveteam-ot
15:42 -- MaximeleG has joined #archiveteam-ot
15:50 -- MaximeleG has quit IRC (Quit: MaximeleG)
16:07 -- nicolas17 has joined #archiveteam-ot
17:11 -- nicolas17 has quit IRC (Read error: Operation timed out)
17:57 <thuban> brought to you by me digging through archivebox's github and deciding it's kind of a nightmare... what do you guys use for local archiving of web pages? god knows i'd like to be able to rely on wget, but heavy javascript dependence in the wild probably kneecaps it.
18:00 <thuban> wpull's phantomjs/youtube-dl integration would be ideal in theory, but phantomjs has been dead for years (and apparently had some weird packaging issues anyway?) and i know ytdl integration is broken in archivebot (which the docs don't reflect).
18:01 <thuban> usable anyway?
18:27 <systwi> Has anyone seen Fusl around lately?
18:30 -- antomatic has joined #archiveteam-ot
18:34 -- antomati_ has quit IRC (Ping timeout: 276 seconds)
18:41 -- Ctrl has quit IRC (Read error: Operation timed out)
18:43 -- Ctrl has joined #archiveteam-ot
18:45 -- atphoenix has joined #archiveteam-ot
18:46 <Kaz> systwi: on hackint, yes
18:46 <Kaz> on efnet, no
18:51 <systwi> Ah, thanks Kaz
18:51 <JAA> thuban: webrecorder and brozzler are supposed to be pretty decent, though I have no practical experience with either (though I've done local WARC playback with pywb before, which is part of webrecorder). webrecorder is more suitable for interactive usage, brozzler for automation. An alternative for interactive archival would be a MITM WARC-writing proxy (e.g. warcprox) with whatever browser you prefer.
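The MITM-proxy approach JAA mentions records everything the browser fetches into WARC files. As a rough illustration of what one such record looks like, here is a minimal sketch that assembles a single WARC/1.0 response record by hand; real writers (warcprox, warcio, `wget --warc-file`) also add WARC-Record-ID, WARC-Date, digests, and per-record gzip, all omitted here for brevity:

```python
# Minimal sketch of a WARC/1.0 response record, for illustration only.
# Real WARC writers add WARC-Record-ID, WARC-Date, payload digests, and
# usually gzip-compress each record; this shows just the basic shape.

def build_warc_response(uri: str, http_bytes: bytes) -> bytes:
    """Wrap a raw HTTP response in a bare-bones WARC response record."""
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(http_bytes)}\r\n"
        "\r\n"
    )
    # A record is its header block, the captured HTTP bytes, then a
    # blank-line separator (two CRLFs) before the next record.
    return headers.encode("utf-8") + http_bytes + b"\r\n\r\n"

record = build_warc_response(
    "http://example.com/",
    b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello",
)
```

A proxy like warcprox produces a stream of records in this general shape, one per request/response it intercepts.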
19:10 <hook54321> I wish there were a webrecorder instance that fed into IA automatically.
19:11 <JAA> Be the change etc.
19:11 <JAA> :-)
19:22 <prq> thuban: I've been investigating a similar usecase. the javascript concern is very real. I want to try brozzler, as it appears to solve this problem elegantly. I haven't fired it up yet and have only indirectly used it via the wayback machine "save page now" function.
19:23 <prq> I don't know if brozzler is capable of being a true crawler like the wget --mirror functionality is, but if it isn't, I may be able to fill the need by enumerating outlinks in brozzler and setting up a queueing/job history system.
19:27 <atphoenix> prq if you do develop something...keep URL prioritization in mind
19:27 <prq> yeah that makes sense.
19:28 <prq> I want different prioritization algorithms per project too.
19:28 <prq> if it's a blog with comments that stay open for a long time, I might want to revisit every several months (with exponential backoff), for example.
19:29 <prq> but I might change the coefficient if that particular post gets lots of attention from other parts of the project (detected by link weight)
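The revisit policy prq sketches (exponential backoff on the recrawl interval, shortened again when a page draws attention) could look roughly like this; the base interval, doubling rule, and `attention` coefficient are all invented for illustration, not taken from any real crawler:

```python
def next_revisit_interval(visits: int, base_days: float = 30.0,
                          attention: float = 1.0) -> float:
    """Days until the next recrawl of a page.

    Each uneventful visit doubles the interval (exponential backoff);
    an `attention` coefficient > 1 (e.g. derived from inbound link
    weight) shortens it again. All constants are illustrative.
    """
    return base_days * (2 ** visits) / attention

# A quiet blog post backs off: 30 days, then 60, then 120...
quiet = next_revisit_interval(2)
# The same post with heavy inbound linking gets revisited sooner.
hot = next_revisit_interval(2, attention=4.0)
```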
19:30 <prq> I have so many requirements flying around in my head to try to help a content-rewriting-watchdog persona
19:32 <prq> my next steps are going to be to try to stand up an archivebot infrastructure so I understand how the archiveteam archivebot system works, same for warrior infrastructure and job trackers
19:32 <prq> and I want to stand up brozzler
19:32 <prq> and see how it works
19:32 <thuban> JAA: thanks!
19:32 <thuban> (i considered the MITM route, but i don't really have enough disk space to do totally omnivorous archiving. i wrote a weechat plugin to extract urls from certain buffers, which is a pretty good compromise for my use case--if it's interesting enough to send to a friend, it's interesting enough to archive, and vice versa. plus it would put a stop to dead links in my logs.)
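thuban's plugin isn't shown here, but the core of that compromise — pull anything URL-shaped out of a chat buffer and hand it to an archiver — is a few lines of regex. The pattern below is a deliberate simplification (real URL matching, including weechat's own detection, handles many more edge cases):

```python
import re

# Deliberately simple URL pattern for illustration; real-world matchers
# must handle trailing punctuation, IDN hosts, bare domains, etc.
URL_RE = re.compile(r"https?://[^\s<>\"')]+")

def extract_urls(buffer_lines):
    """Return unique URLs found in chat lines, in first-seen order."""
    seen, out = set(), []
    for line in buffer_lines:
        for url in URL_RE.findall(line):
            if url not in seen:
                seen.add(url)
                out.append(url)
    return out

urls = extract_urls([
    "<friend> check out https://example.com/post and http://example.org",
    "<friend> https://example.com/post again",
])
```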
19:32 <thuban> i'm mostly thinking about automation and it really doesn't look like webrecorder has that in mind, although i suppose i could always script headless chromium (or whatever) to hit the appropriate url.
19:32 <thuban> i _was_ digging around in brozzler to see if it could do what i wanted (the code looks like it accepts cookie dbs but where's the configuration for that??) but the documentation given isn't enough to, like, start it successfully...
19:33 <prq> part of me hopes I can use pieces from all those existing projects to create a decent distributed job infrastructure.
19:33 <astrid> prq: a thing we've been wanting for a very long time is a way for various crawlers to share a queue, so that archivebot doesn't have to run pipelines that can take a month to drain
19:33 <prq> but it's honestly a bit above my head.
19:33 <prq> yeah, that's a design goal (design hope?) that I have too.
19:34 <astrid> nice.
19:34 <thuban> prq: lol, i was just lamenting that the internet archive's distributed system means that brozzler is heavily overengineered for what i want to do
19:34 <astrid> if you make meaningful progress on that, you'll probably be able to round up some help from archiveteam dev folks :)
19:34 <prq> I figure I would have a preprocessing queue that would just be a list of URLs-- maybe some metadata like their source too.
19:34 <hook54321> astrid: iirc there was an experimental distributed version of wpull
19:35 <prq> then that preprocess would check existing DB to see if that URL has been handled recently, and perhaps adjust its importance score.
19:35 <astrid> archivebot's main albatross is (1) long lived local queues and (2) need to trust pipeline operators to keep their wpull's running for a very long time
19:35 <prq> and then if it does need to be revisited, it'd get dropped into a brozzler worker queue
19:36 <prq> then brozzler's job would be to load exactly that page, spit out warc data, and inject outlinks back into the preprocess queue with appropriate metadata
19:37 <prq> the queueing system could be something that works on a distributed consensus model, but I don't have a good sense of how big those queue and historical databases would get over time to choose a technology.
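The preprocess-then-worker loop prq describes (dedup against recent history, adjust an importance score, hand off to a fetch queue, feed outlinks back in) can be reduced to a single-process toy. In the real design the history DB and queue would be distributed services; here both are in-memory structures, and the revisit window and scoring are invented:

```python
import heapq
import time

class Frontier:
    """Toy crawl frontier: drop URLs handled recently, else queue by priority.

    Stands in for prq's preprocessing queue + history DB; a worker
    (e.g. a brozzler instance) would call take(), fetch the page, and
    offer() its outlinks back in.
    """

    def __init__(self, revisit_after=86400.0):
        self.revisit_after = revisit_after  # seconds before a URL is eligible again
        self.last_seen = {}                 # url -> last accepted time
        self.heap = []                      # (-priority, url): max-priority first

    def offer(self, url, priority=1.0, now=None):
        now = time.time() if now is None else now
        last = self.last_seen.get(url)
        if last is not None and now - last < self.revisit_after:
            return False                    # handled recently; drop it
        self.last_seen[url] = now
        heapq.heappush(self.heap, (-priority, url))
        return True

    def take(self):
        """Next URL for a worker to fetch, highest priority first."""
        return heapq.heappop(self.heap)[1]

f = Frontier()
f.offer("https://example.com/", priority=5.0, now=0.0)
f.offer("https://example.com/about", priority=1.0, now=0.0)
resubmitted = f.offer("https://example.com/", priority=5.0, now=10.0)
```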
19:37
🔗
|
prq |
plus, my personal usecase is less "rescue data from pending shutdown" and more "detect corporate-ish tomfoolery", but the archive queue architecture serves both purposes well I think. |
19:38
🔗
|
astrid |
most jobs on the archivebot dashboard at the moment have 1k-10k urls in queue but i've seen single jobs get 10 million urls queued |
19:38
🔗
|
prq |
this wget that I've been running against this big entity since november is in the millions of urls visited. it's wget, so I don't know how big the queue is. |
19:39
🔗
|
astrid |
actually i was wrong, there are at least several active jobs right now with 1MM+ urls in q |
19:39
🔗
|
atphoenix |
prq, also consider that sites can change behavior based on detected browser or IP range, so locking a crawl to a single instance or a single country (for geolocation issues) may matter |
19:39
🔗
|
astrid |
biggest active job has 30MM queued urls |
19:40
🔗
|
atphoenix |
even if preference is for a distributed crawler |
19:40
🔗
|
astrid |
^ also |
19:40
🔗
|
prq |
oh very much aware of that. The specific sites that I am targetting for my usecase do this in a pretty predictable way-- they just have a ?lang=whatever and my recursive wget seems to pick up all the versions just fine. |
19:40
🔗
|
astrid |
nice |
19:40
🔗
|
prq |
but saying "this job should run in this country" ought to be easy enough to deal with with a bit of metadata |
19:41
🔗
|
astrid |
oh wait no biggest active archivebot job has 125MM urls queued, 80MM fetches so far |
19:41
🔗
|
astrid |
:) |
19:44
🔗
|
prq |
also, my early goals in my project are less about being distributed and more about being stateless so I can get started running stuff in a homelab environment. even with UPS backup, that just ensures a higher chance of a clean shutdown in a power outage lasting more than a minute or so. I live in hurricane alley. |
19:45
🔗
|
astrid |
restarting statelessly is a big wishlist item! |
19:45
🔗
|
astrid |
and v useful |
19:45
🔗
|
prq |
stateless and distributed go very well together. |
19:45
🔗
|
prq |
so with all that said-- is this basic idea a sound architecture? |
19:46
🔗
|
astrid |
i believe so |
19:46
🔗
|
|
VADemon_ has joined #archiveteam-ot |
19:46
🔗
|
prq |
oh, the preprocessing queue could also do fun things like have special handlers or middleware. happen to find a tiny.url? add a hook to tell the urlteam project somehow. |
19:47
🔗
|
prq |
come across a domain you've never seen before? sit on it but keep a note and then ask the admin if it's of interest when it hits a certain threshold. |
19:47
🔗
|
hook54321 |
prq: some sites block access to requests from the EU |
19:48
🔗
|
prq |
"10 domains in your 150 domain archival project are linking to this other domain? want to add it to the list?" |
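The "new domain hits a threshold, ask the admin" middleware prq sketches is essentially a counter over distinct referring domains. A toy version, with the threshold, domain names, and prompt wording all invented:

```python
from collections import defaultdict
from urllib.parse import urlparse

class DomainWatcher:
    """Count how many distinct in-scope domains link to each outside
    domain; suggest adding the target once a threshold is crossed."""

    def __init__(self, in_scope, threshold=10):
        self.in_scope = set(in_scope)
        self.threshold = threshold
        self.referrers = defaultdict(set)  # target domain -> referring domains

    def note_link(self, from_url, to_url):
        src = urlparse(from_url).hostname
        dst = urlparse(to_url).hostname
        if src in self.in_scope and dst not in self.in_scope:
            self.referrers[dst].add(src)
            if len(self.referrers[dst]) == self.threshold:
                # In a real pipeline this would notify the admin instead.
                return f"{self.threshold} domains in your project link to {dst}: add it?"
        return None

w = DomainWatcher(["a.example", "b.example", "c.example"], threshold=2)
msgs = [w.note_link("https://a.example/p", "https://other.example/x"),
        w.note_link("https://b.example/q", "https://other.example/y")]
```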
19:48 <prq> yeah, that'd be covered in metadata about a job saying to only use workers in a certain region.
19:49 <prq> I signed up for the archive-it webinar to learn about how their process works.
19:49 <prq> I think it's a paid service
19:49 -- VADemon has quit IRC (Ping timeout: 255 seconds)
19:55 -- BlueMax has joined #archiveteam-ot
20:00 -- Raccoon has joined #archiveteam-ot
20:09 <thuban> lol, i got brozzler running but afaict it doesn't do headless. i don't think that's usable for me
20:13 -- DogsRNice has joined #archiveteam-ot
20:14 <atphoenix> Virtualbox can run headless
20:15 <atphoenix> and you can connect to them if needed. The VMs don't know they are headless.
20:15 <thuban> yay, even more bloat :/
20:16 <atphoenix> I make no claims of idealism. Isn't yet abstraction layers the way of the software world? It's basically turtles all the way down.
20:16 <atphoenix> Isn't yet more* abstraction layers the way of the software world?
20:17 <thuban> it's not the way of _my_ world
20:37 <Raccoon> Is there any such thing as a Chrome extension for page saving / page scraping content metadata automatically from pages visited of a specific site?
20:38 <Raccoon> I'm very handy with regex, just need a platform to create templates in to grabby the datas
20:39 <Raccoon> regex / page element references
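The regex-template scraping Raccoon describes can be prototyped outside the browser in a few lines: a per-site template mapping field names to patterns, applied to each page's HTML. The fields and patterns below are hypothetical, and in practice element-based extraction (CSS selectors/XPath) is sturdier than regex over HTML:

```python
import re

# Hypothetical per-site template: field name -> regex with one capture
# group. Real pages vary; CSS/XPath element references are more robust.
TEMPLATE = {
    "title":  re.compile(r"<title>(.*?)</title>", re.S),
    "author": re.compile(r'<meta name="author" content="([^"]*)"'),
}

def scrape(html, template=TEMPLATE):
    """Apply each field's regex to the page; missing fields come back None."""
    out = {}
    for field, pattern in template.items():
        m = pattern.search(html)
        out[field] = m.group(1) if m else None
    return out

page = ('<html><head><title>Post 42</title>'
        '<meta name="author" content="someone"></head></html>')
data = scrape(page)
```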
21:08 -- schbirid has quit IRC (Quit: Leaving)
21:19 <prq> I thought the brozzler docs had instructions for running an xvnc server (it said there was a problem on xvfb taking screenshots)
21:30 <JAA> astrid: The big issue with distributed crawling is IP-dependent cookies. It'll run into even more session ID loops than AB already does.
21:31 <JAA> But it can work well for sites that are not affected by this, of course.
21:32 <prq> that seems like a proxy system might be helpful for jobs like that. still a distributed and stateless crawl, but that specific job would use a particular egress path.
21:32 <JAA> I modified wpull a while ago to run distributed workers with a central DB. It kind of worked, too, but I never properly finished it.
21:32 <JAA> prq: Yeah, that could work.
21:32 <prq> It's kind of on my list to try your wpull branch
21:33 <JAA> :-)
21:33 -- BlueMax has quit IRC (Read error: Connection reset by peer)
21:33 <prq> but some of the sites I pay attention to would benefit from brozzler due to js-dependencies in their rendering of outlinks.
21:34 <prq> the other part of the process that I haven't investigated quite yet is warc dedup
21:34 <prq> I wonder how flexible brozzler is with a shared session among instances.
21:36 <JAA> Browser-based crawling is definitely useful, but it also requires loads of resources compared to wget/wpull or qwarc.
21:37 <prq> this architecture I'm developing in my head would have multiple job types-- wget/wpull/brozzler/etc.
21:38 <JAA> We should move this to -dev.
22:17 -- VerifiedJ has quit IRC (Quit: Leaving)
22:46 -- dhyan_nat has quit IRC (Read error: Operation timed out)
23:18 -- arkhive1 has quit IRC (Read error: Operation timed out)
23:35 -- qw3rty has joined #archiveteam-ot
23:39 -- qw3rty_ has quit IRC (Read error: Operation timed out)