[00:05] Hey Fusl
[00:05] kinda ot for everywhere else
[00:05] i'm just curious, your fleet
[00:05] what provider does it run on?
[00:06] and what kind of stats are we looking at?
[00:06] don't worry, i'm not going to steal your thunder, i'm just morbidly curious
[00:06] primarily hetzner but its a mix of nforce, hetzner, aws, digitalocean, linode, atlantic, azure
[00:06] ah
[00:06] it automatically moves the fleet depending on the type of items that are going through it
[00:07] that's really cool
[00:08] is it safe to assume you dont have dedicated resources in the cloud providers
[00:08] are the workers created on demand?
[00:08] a bunch of them are static resources that are just idling around
[00:09] oh and last question because its burning me up
[00:09] e.g. 10 ex42-nvme servers that are used for rsync targets
[00:09] how do you afford all of it
[00:09] with a job
[00:09] hard earned money
[00:09] good answer lol
[00:09] that just seems like quite a bit of money
[00:09] i dont make nearly enough to be on your level, but i love to help archiveteam when i can
[00:30] Fusl: did you change your profile picture on github (slightly)? It looks different but I'm not 100% sure :D
[00:31] i did
[00:32] looks nice
[00:38] *** BlueMax has joined #archiveteam-ot
[02:16] *** killsushi has quit IRC (Quit: Leaving)
[02:41] *** Raccoon has joined #archiveteam-ot
[02:47] *** Fusl has quit IRC (Quit: K-Lined)
[02:47] *** Fusl__ has quit IRC (Quit: K-Lined)
[02:48] *** Fusl has joined #archiveteam-ot
[02:48] *** svchfoo3 sets mode: +o Fusl
[02:48] *** svchfoo1 sets mode: +o Fusl
[02:49] *** Fusl is now known as Fusl__
[02:49] *** Fusl_ sets mode: +o Fusl__
[02:49] *** Fusl has joined #archiveteam-ot
[02:49] *** Fusl__ sets mode: +o Fusl
[02:49] *** Fusl_ sets mode: +o Fusl
[02:51] *** Fusl__ has quit IRC (Client Quit)
[02:52] *** Fusl__ has joined #archiveteam-ot
[02:52] *** Fusl_ sets mode: +o Fusl__
[02:52] *** Fusl sets mode: +o Fusl__
[03:41] *** m007a83_ has joined #archiveteam-ot
[03:44] *** qw3rty116 has joined #archiveteam-ot
[03:45] *** m007a83 has quit IRC (Ping timeout: 252 seconds)
[03:50] *** qw3rty115 has quit IRC (Ping timeout: 600 seconds)
[03:56] *** odemg has quit IRC (Read error: Operation timed out)
[04:10] *** odemg has joined #archiveteam-ot
[04:46] *** Flashfloo has quit IRC (The Lounge - https://thelounge.chat)
[04:46] *** Flashfire has quit IRC (Quit: The Lounge - https://thelounge.chat)
[04:46] *** kiska has quit IRC (Quit: The Lounge - https://thelounge.chat)
[04:46] *** Flashfloo has joined #archiveteam-ot
[04:46] *** kiska has joined #archiveteam-ot
[04:46] *** Fusl__ sets mode: +o kiska
[04:46] *** Flashfire has joined #archiveteam-ot
[04:46] *** Fusl sets mode: +o kiska
[04:46] *** Fusl_ sets mode: +o kiska
[04:49] lol my lounge client kept crashing because of a config issue... I let "maxHistory: -1"
[04:51] OOM killer
[05:12] rip
[05:13] Thats what that was
[05:37] *** dhyan_nat has joined #archiveteam-ot
[05:42] *** Ivy has quit IRC (Quit: Connection closed for inactivity)
[05:50] *** m007a83 has joined #archiveteam-ot
[05:53] *** m007a83_ has quit IRC (Ping timeout: 252 seconds)
[09:18] *** killsushi has joined #archiveteam-ot
[09:56] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[10:29] *** Dragnog has quit IRC (Ping timeout: 246 seconds)
[10:30] *** magus_bgf has joined #archiveteam-ot
[10:31] I had some experience with heritrix, I think it was version 2, or maybe even 1
[10:31] but it creates warcs
[10:32] I found no clear guide how to turn a collection of warcs into html, at least
[10:42] *** h3ndr1k has joined #archiveteam-ot
[10:51] *** magus_bgf has quit IRC (Read error: Connection reset by peer)
[10:52] *** magus_bgf has joined #archiveteam-ot
[10:54] generating static HTML output from WARCs is complicated by ?query params and potential overlaps like /path.html /path.html/thing
[10:55] writing your own good WARC -> HTML tool seems like a feasible project though
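A minimal sketch of such a tool, assuming the warcio Python library (pip install warcio): it walks a WARC and writes every text/html response out as a plain file. The output layout is purely illustrative, and the ?query-parameter and /path.html vs /path.html/thing collisions mentioned above still need real handling.

import os
import sys
from urllib.parse import urlsplit, quote

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

def dump_html(warc_path, out_dir):
    """Write every text/html response record in a WARC out as a file on disk."""
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response' or record.http_headers is None:
                continue
            if 'text/html' not in (record.http_headers.get_header('Content-Type') or ''):
                continue
            url = record.rec_headers.get_header('WARC-Target-URI')
            parts = urlsplit(url)
            path = parts.path or '/'
            if path.endswith('/'):
                path += 'index.html'
            if parts.query:
                # Keep the query string in the filename so /page?id=1 and
                # /page?id=2 don't overwrite each other; this is a naive scheme.
                path += '@' + quote(parts.query, safe='')
            target = os.path.join(out_dir, parts.netloc, path.lstrip('/'))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, 'wb') as f:
                f.write(record.content_stream().read())

if __name__ == '__main__':
    # usage: python warc2html.py crawl.warc.gz output_dir
    dump_html(sys.argv[1], sys.argv[2])

Rewriting the links between the dumped pages (roughly what wget's --convert-links does) would be the harder second half of such a tool.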
[10:55] *** magus_bgf has quit IRC (Remote host closed the connection)
[10:59] *** magus_bgf has joined #archiveteam-ot
[11:13] *** magus_bgf has quit IRC (Read error: Connection reset by peer)
[11:17] *** magus_bgf has joined #archiveteam-ot
[11:32] *** magus_bgf has quit IRC (Ping timeout: 252 seconds)
[11:45] *** VADemon has quit IRC (Quit: left4dead)
[12:10] *** godane has joined #archiveteam-ot
[12:13] *** magus_bgf has joined #archiveteam-ot
[12:31] *** BlueMax has quit IRC (Quit: Leaving)
[12:37] _ivan well, that's why I don't like it. The point of archiving is to be able to restore, not archiving for the sake of archiving.
[13:21] i like archiving for the sake of archiving 🤷
[13:23] Sorry, the conversation was continued from another channel. I'm looking for advice. I need a tool to crawl/archive a few dozen sites continuously.
[13:24] Incremental crawls, smart error handling, smart delays, smart url parameter handling.
[13:25] And the most important thing is being able to restore a site, at least in html form.
[14:13] *** magus_bgf has quit IRC (Read error: Connection reset by peer)
[14:13] *** magus_bgf has joined #archiveteam-ot
[14:21] What exactly do you mean by "incremental"? I assume you'll want to regrab certain pages and then follow new links from there? That could be done with wpull.
[14:23] Not downloading and not saving pages that haven't changed... to the extent possible.
[14:30] That's pretty much impossible without site-specific code. For example, on a forum, you'd have to check the dates of the most recent post in each thread and compare that to what you grabbed before.
[14:31] If you can't somehow check whether a page has changed since you last retrieved it, there's no way around redownloading and comparing.
[14:32] (Check as in that forum example, where one page indicates what happened on another. Sitemaps can help with that as well if they exist and are implemented properly.)
[14:33] I'd say that even the date is not a guarantee, as posts can be edited, deleted, etc. Checking the exact byte size would be an imperfect but possible heuristic, I think.
[14:34] But really, is that the best there is - download the entire site all the time?
[14:35] that doesn't scale well
[14:36] there must be something better
[14:37] Correct, even that wouldn't be perfect. The only way is indeed to redownload the whole thing every time, then possibly dedupe against the previous grabs (assuming the server returns the same markup every time.
[14:37] )
[14:38] If you run the site yourself, you can do it more efficiently of course.
[14:38] That's what I'm currently (trying to) do with wget, but it doesn't work very well.
[14:38] I'm not looking for perfect
[14:39] Or if the site somehow offers an "activity stream" kind of page which accurately reflects all changes.
[14:39] Define "doesn't work very well"?
[14:45] Last time it ran for a month, I think due to some circular ../.. links and phpbb session ids which were generated anew all the time. And then the server had to be rebooted for some reason, and the result is a filedump which is not a backup, not an archive, just a dump.
[14:46] And there's no telling what's there and what's not
[14:47] My goal is: today a site goes down, tomorrow I put up a mirror.
[14:47] and barring domain change, all links work
[14:48] or close to that, on a best effort basis
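To illustrate the "redownload everything, then dedupe against the previous grab" approach described above, here is a rough sketch, again assuming the warcio library, that hashes the response payloads of two crawls of the same site and reports which URLs changed. It presupposes that volatile noise such as phpbb session ids has already been kept out of the crawls.

import hashlib
import sys

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

def payload_digests(warc_path):
    """Map each captured URL to a SHA-1 digest of its response payload."""
    digests = {}
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            url = record.rec_headers.get_header('WARC-Target-URI')
            digests[url] = hashlib.sha1(record.content_stream().read()).hexdigest()
    return digests

if __name__ == '__main__':
    # usage: python warcdiff.py old.warc.gz new.warc.gz
    old = payload_digests(sys.argv[1])
    new = payload_digests(sys.argv[2])
    for url in sorted(new):
        if old.get(url) != new[url]:
            print('changed or new:', url)
    for url in sorted(set(old) - set(new)):
        print('disappeared:', url)

As noted above, this only works if the server returns identical markup for unchanged pages; embedded timestamps, CSRF tokens, or session ids defeat naive hashing.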
[14:59] *** killsushi has quit IRC (Read error: Connection reset by peer)
[15:00] *** killsushi has joined #archiveteam-ot
[15:19] *** jut has joined #archiveteam-ot
[15:46] *** killsushi has quit IRC (Quit: Leaving)
[16:05] *** dhyan_nat has joined #archiveteam-ot
[16:42] *** godane has quit IRC (Ping timeout: 600 seconds)
[16:46] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[18:16] digitalocean really doesnt make it easy to delete many droplets at once do they
[19:10] *** magus_bgf has quit IRC (Ping timeout: 252 seconds)
[19:18] *** magus_bgf has joined #archiveteam-ot
[19:32] nyany: have you tried terraform?
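terraform is the cleaner answer for droplets it manages; for a one-off bulk cleanup of existing droplets, something along these lines should work against DigitalOcean's v2 API. A hedged sketch using requests, where the name prefix and the DO_TOKEN environment variable are illustrative and only the first page of results is considered.

import os
import sys

import requests  # pip install requests

API = 'https://api.digitalocean.com/v2'
HEADERS = {'Authorization': 'Bearer ' + os.environ['DO_TOKEN']}

def delete_droplets_by_prefix(prefix):
    """Delete every droplet whose name starts with the given prefix."""
    resp = requests.get(API + '/droplets', headers=HEADERS, params={'per_page': 200})
    resp.raise_for_status()
    for droplet in resp.json()['droplets']:
        if droplet['name'].startswith(prefix):
            print('deleting', droplet['id'], droplet['name'])
            requests.delete('{}/droplets/{}'.format(API, droplet['id']),
                            headers=HEADERS).raise_for_status()

if __name__ == '__main__':
    # usage: DO_TOKEN=... python nuke_droplets.py worker-
    delete_droplets_by_prefix(sys.argv[1])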
[19:43] *** magus_bgf has quit IRC (Ping timeout: 252 seconds)
[19:51] *** magus_bgf has joined #archiveteam-ot
[20:14] *** magus_bgf has quit IRC (Ping timeout: 252 seconds)
[20:23] *** magus_bgf has joined #archiveteam-ot
[20:29] *** ola_norsk has joined #archiveteam-ot
[20:30] Anyone else currently having a similar problem with tubeup? https://i.imgur.com/STmMNyn.png (It's not filling in item titles anymore(?))
[20:33] It appears tubeup sets items' titles to simply "_" :/
[20:34] * ola_norsk tubeup --version = 0.0.17
[20:35] or might there be something fucked up on my end?
[20:35] youtube-dl broke the other day because Google changed some data format or something.
[20:35] grrrrrr
[20:35] Or well, youtube-dl's title extraction broke.
[20:36] can the items be fixed by 'tubeup'ing' them when it's working again? or?
[20:37] No idea. Ask the tubeup authors?
[20:37] (Or check if someone asked it already.)
[20:37] But probably not.
[20:38] https://github.com/bibanon/tubeup/issues/88
[20:39] i'm not sure if that's the same issue
[20:43] JAA: but as long as the videos' json files are in each item, it's possible to rework the item titles from those, i suppose
[20:43] That is a different issue.
[20:43] that, or i guess there are now a slew of videos in the 'Mirror Tube' collection named "_" :D
[20:44] And the JSON files are probably affected as well since they're also written by youtube-dl.
[20:44] Have fun fixing them.
[20:44] :/
[20:46] *** magus_bgf has quit IRC (Ping timeout: 252 seconds)
[20:47] JAA: It certainly does affect the json as well :/ https://i.imgur.com/GHoRhox.png
[20:48] If tubeup doesn't re-write item metadata, that is
[20:50] But if i'm not very much mistaken, tubeup doesn't update metadata. Just the files
[20:53] *** Ivy has joined #archiveteam-ot
[20:54] gonna try to re-tubeup archive.org/details/youtube-wWiZpenRGx8 now
[20:54] i noticed there was a youtube-dl upgrade, so i pip'ed that first
[20:55] *** magus_bgf has joined #archiveteam-ot
[21:01] thankfully, i've only uploaded ~4 affected tubeup items.. But if the original video titles in no way reside in the item, that is quite a major issue
[21:02] re-uploading using tubeup + the same url doesn't fix the title
[21:18] *** magus_bgf has quit IRC (Ping timeout: 252 seconds)
[21:22] JAA: left an issue ticket now https://github.com/bibanon/tubeup/issues/103
[21:27] *** magus_bgf has joined #archiveteam-ot
[21:34] JAA: would you happen to know the youtube-dl version where it broke? It does indeed seem to have been an issue with that, since updating youtube-dl fixed the issue, though not on previous items
[21:36] in any case, "youtube-dl 2019.7.16" doesn't work
[21:38] ola_norsk: It broke because *Google* changed something. All youtube-dl versions before it got fixed are affected.
[21:38] But no, I don't know which version fixed it.
[21:38] Must be one of the most recent ones.
[21:39] 2012-02-27 was the latest 'unbroken' one of mine,
[21:40] 2019-07-30 was the first tubeup i did that broke
[21:42] *** ola_norsk has quit IRC (f.cking Google dude.. grrrrr https://youtu.be/pXnIB6O8PTc)
[21:50] *** magus_bgf has quit IRC (Ping timeout: 252 seconds)
[21:59] *** magus_bgf has joined #archiveteam-ot
[22:22] *** magus_bgf has quit IRC (Ping timeout: 252 seconds)
[22:30] *** magus_bgf has joined #archiveteam-ot
[22:54] *** magus_bgf has quit IRC (Ping timeout: 252 seconds)
[23:02] *** magus_bgf has joined #archiveteam-ot
[23:26] *** magus_bgf has quit IRC (Ping timeout: 252 seconds)
[23:34] *** magus_bgf has joined #archiveteam-ot
[23:49] *** superkuh has joined #archiveteam-ot
[23:58] *** magus_bgf has quit IRC (Ping timeout: 252 seconds)
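Coming back to the broken tubeup titles discussed above: once youtube-dl's title extraction works again, already-uploaded items could in principle be repaired without re-uploading, roughly like this. A hedged, untested sketch assuming the internetarchive and youtube-dl Python packages, an account set up via 'ia configure' with write access to the items, and identifiers of the form youtube-<videoid> (as with youtube-wWiZpenRGx8).

import sys

import youtube_dl  # pip install youtube-dl (a version with working title extraction)
from internetarchive import modify_metadata  # pip install internetarchive

def fix_title(identifier):
    """Re-fetch the video title from YouTube and write it into the IA item's metadata."""
    video_id = identifier.split('youtube-', 1)[1]
    with youtube_dl.YoutubeDL({'quiet': True, 'skip_download': True}) as ydl:
        info = ydl.extract_info('https://www.youtube.com/watch?v=' + video_id,
                                download=False)
    print(identifier, '->', info['title'])
    modify_metadata(identifier, metadata={'title': info['title']})

if __name__ == '__main__':
    # usage: python fix_titles.py youtube-wWiZpenRGx8 [more identifiers...]
    for identifier in sys.argv[1:]:
        fix_title(identifier)

This only helps while the videos are still up on YouTube; for anything already deleted, the original title would have to come from somewhere else.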