#archiveteam-ot 2019-08-03,Sat

Time Nickname Message
00:05 🔗 nyany Hey Fusl
00:05 🔗 nyany kinda ot for everywhere else
00:05 🔗 nyany i'm just curious, your fleet
00:05 🔗 nyany what provider does it run on?
00:06 🔗 nyany and what kind of stats are we looking at?
00:06 🔗 nyany don't worry, i'm not going to steal your thunder, i'm just morbidly curious
00:06 🔗 Fusl primarily hetzner but its a mix of nforce, hetzner, aws, digitalocean, linode, atlantic, azure
00:06 🔗 nyany ah
00:06 🔗 Fusl it automatically moves the fleet depending on the type of items that are going through it
00:07 🔗 nyany that's really cool
00:08 🔗 nyany is it safe to assume you dont have dedicated resources in the cloud providers
00:08 🔗 nyany are the workers created on demand?
00:08 🔗 Fusl a bunch of them are static resources that are just idling around
00:09 🔗 nyany oh and last question because its burning me up
00:09 🔗 Fusl e.g. 10 ex42-nvme servers that are used for rsync targets
00:09 🔗 nyany how do you afford all of it
00:09 🔗 Fusl with a job
00:09 🔗 Fusl hard earned money
00:09 🔗 nyany good answer lol
00:09 🔗 nyany that just seems like quite a bit of money
00:09 🔗 nyany i dont make nearly enough to be on your level, but i love to help archiveteam when i can
00:30 🔗 kpcyrd Fusl: did you change your profile picture on github (slightly)? It looks different but I'm not 100% sure :D
00:31 🔗 Fusl i did
00:32 🔗 kpcyrd looks nice
00:38 🔗 BlueMax has joined #archiveteam-ot
02:16 🔗 killsushi has quit IRC (Quit: Leaving)
02:41 🔗 Raccoon has joined #archiveteam-ot
02:47 🔗 Fusl has quit IRC (Quit: K-Lined)
02:47 🔗 Fusl__ has quit IRC (Quit: K-Lined)
02:48 🔗 Fusl has joined #archiveteam-ot
02:48 🔗 svchfoo3 sets mode: +o Fusl
02:48 🔗 svchfoo1 sets mode: +o Fusl
02:49 🔗 Fusl is now known as Fusl__
02:49 🔗 Fusl_ sets mode: +o Fusl__
02:49 🔗 Fusl has joined #archiveteam-ot
02:49 🔗 Fusl__ sets mode: +o Fusl
02:49 🔗 Fusl_ sets mode: +o Fusl
02:51 🔗 Fusl__ has quit IRC (Client Quit)
02:52 🔗 Fusl__ has joined #archiveteam-ot
02:52 🔗 Fusl_ sets mode: +o Fusl__
02:52 🔗 Fusl sets mode: +o Fusl__
03:41 🔗 m007a83_ has joined #archiveteam-ot
03:44 🔗 qw3rty116 has joined #archiveteam-ot
03:45 🔗 m007a83 has quit IRC (Ping timeout: 252 seconds)
03:50 🔗 qw3rty115 has quit IRC (Ping timeout: 600 seconds)
03:56 🔗 odemg has quit IRC (Read error: Operation timed out)
04:10 🔗 odemg has joined #archiveteam-ot
04:46 🔗 Flashfloo has quit IRC (The Lounge - https://thelounge.chat)
04:46 🔗 Flashfire has quit IRC (Quit: The Lounge - https://thelounge.chat)
04:46 🔗 kiska has quit IRC (Quit: The Lounge - https://thelounge.chat)
04:46 🔗 Flashfloo has joined #archiveteam-ot
04:46 🔗 kiska has joined #archiveteam-ot
04:46 🔗 Fusl__ sets mode: +o kiska
04:46 🔗 Flashfire has joined #archiveteam-ot
04:46 🔗 Fusl sets mode: +o kiska
04:46 🔗 Fusl_ sets mode: +o kiska
04:49 🔗 kiska lol my lounge client kept crashing because of a config issue... I let "maxHistory: -1"
04:51 🔗 kiska OOM killer
05:12 🔗 Fusl rip
05:13 🔗 Flashfire Thats what that was
05:37 🔗 dhyan_nat has joined #archiveteam-ot
05:42 🔗 Ivy has quit IRC (Quit: Connection closed for inactivity)
05:50 🔗 m007a83 has joined #archiveteam-ot
05:53 🔗 m007a83_ has quit IRC (Ping timeout: 252 seconds)
09:18 🔗 killsushi has joined #archiveteam-ot
09:56 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
10:29 🔗 Dragnog has quit IRC (Ping timeout: 246 seconds)
10:30 🔗 magus_bgf has joined #archiveteam-ot
10:31 🔗 magus_bgf I had some experience with heritrix, I think it was version 2, or maybe even 1
10:31 🔗 magus_bgf but it creates warcs
10:32 🔗 magus_bgf I found no clear guide how to turn a collection of warcs into html, at least
10:42 🔗 h3ndr1k has joined #archiveteam-ot
10:51 🔗 magus_bgf has quit IRC (Read error: Connection reset by peer)
10:52 🔗 magus_bgf has joined #archiveteam-ot
10:54 🔗 ivan_ generating static HTML output from WARCs is complicated by ?query params and potential overlaps like /path.html /path.html/thing
10:55 🔗 ivan_ writing your own good WARC -> HTML tool seems like a feasible project though
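One way to sidestep the two complications ivan_ mentions (query parameters and overlaps such as /path.html versus /path.html/thing) is to treat every URL path as a directory and store the page body in a leaf file inside it. The sketch below is only an illustration of that idea, not an existing tool: the function name `url_to_relpath` and the `index@<query>.html` naming scheme are assumptions made up for this example. Reading the WARC records themselves is left out.

```python
from urllib.parse import urlsplit, quote


def url_to_relpath(url: str) -> str:
    """Map a URL to a relative file path for a static mirror (hypothetical scheme)."""
    parts = urlsplit(url)
    # Every path segment becomes a directory, so /path.html and
    # /path.html/thing can coexist on disk. Empty and dot segments are
    # dropped so the result cannot climb out of the mirror root.
    segments = [
        quote(s, safe="")
        for s in parts.path.split("/")
        if s not in ("", ".", "..")
    ]
    # The body always lives in a leaf file; the query string, if any,
    # is percent-encoded into the leaf name so variants do not collide.
    leaf = "index.html"
    if parts.query:
        leaf = "index@" + quote(parts.query, safe="") + ".html"
    return "/".join([parts.netloc] + segments + [leaf])


print(url_to_relpath("http://example.org/path.html"))
print(url_to_relpath("http://example.org/path.html/thing"))
print(url_to_relpath("http://example.org/view?id=1&p=2"))
```

Very long query strings could exceed file name length limits; hashing the query instead of encoding it would be one possible workaround. Links inside the saved HTML would still need rewriting to match this layout.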
10:55 🔗 magus_bgf has quit IRC (Remote host closed the connection)
10:59 🔗 magus_bgf has joined #archiveteam-ot
11:13 🔗 magus_bgf has quit IRC (Read error: Connection reset by peer)
11:17 🔗 magus_bgf has joined #archiveteam-ot
11:32 🔗 magus_bgf has quit IRC (Ping timeout: 252 seconds)
11:45 🔗 VADemon has quit IRC (Quit: left4dead)
12:10 🔗 godane has joined #archiveteam-ot
12:13 🔗 magus_bgf has joined #archiveteam-ot
12:31 🔗 BlueMax has quit IRC (Quit: Leaving)
12:37 🔗 magus_bgf ivan_: well, that's why I don't like it. The point of archiving is to be able to restore, not archiving for the sake of archiving.
13:21 🔗 yano i like archiving for the sake of archiving 🤷
13:23 🔗 magus_bgf Sorry, the conversation was continued from another channel. I'm looking for advice. I need a tool to crawl/archive a few dozen sites continuously.
13:24 🔗 magus_bgf Incremental crawls, smart error handling, smart delays, smart url parameter handling.
13:25 🔗 magus_bgf And the most important is being able to restore a site, at least in html form.
14:13 🔗 magus_bgf has quit IRC (Read error: Connection reset by peer)
14:13 🔗 magus_bgf has joined #archiveteam-ot
14:21 🔗 JAA What exactly do you mean by "incremental"? I assume you'll want to regrab certain pages and then follow new links from there? That could be done with wpull.
14:23 🔗 magus_bgf Not downloading and not saving pages that haven't changed... to possible extent.
14:30 🔗 JAA That's pretty much impossible without site-specific code. For example, on a forum, you'd have to check the dates of the most recent post in each thread and compare that to what you grabbed before.
14:31 🔗 JAA If you can't somehow check whether a page has changed since you last retrieved it, there's no way around redownloading and comparing.
14:32 🔗 JAA (Check as in that forum example, where one page indicates what happened on another. Sitemaps can help with that as well if they exist and are implemented properly.)
14:33 🔗 magus_bgf I'd say that even date is not a guarantee, as posts can be edited, deleted, etc. Checking exact bytesize would be an imperfect, but possible heuristic, I think.
14:34 🔗 magus_bgf But really, is it the best there is - download the entire site all the time?
14:35 🔗 magus_bgf that doesn't scale well
14:36 🔗 magus_bgf there must be something better
14:37 🔗 JAA Correct, even that wouldn't be perfect. The only way is indeed to redownload the whole thing every time, then possibly deduping against the previous grabs (assuming the server returns the same markup every time.
14:37 🔗 JAA )
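The "redownload, then dedupe against previous grabs" approach JAA describes can be sketched as a digest comparison per URL. This is a minimal illustration under the same caveat JAA gives: it only helps if the server returns byte-identical markup for an unchanged page. The names `payload_digest` and `is_unchanged`, and the in-memory `seen` dict, are assumptions for this example; a real crawler would persist the digests between runs.

```python
import hashlib


def payload_digest(body: bytes) -> str:
    """Return a hex SHA-256 digest of a response body."""
    return hashlib.sha256(body).hexdigest()


def is_unchanged(url: str, body: bytes, seen: dict) -> bool:
    """Compare a freshly downloaded body with the digest stored for its URL.

    `seen` maps URL -> digest from the previous grab and is updated in place.
    Returns True only when the body is byte-identical to the last one seen.
    """
    digest = payload_digest(body)
    unchanged = seen.get(url) == digest
    seen[url] = digest
    return unchanged


seen = {}
print(is_unchanged("http://example.org/", b"<html>v1</html>", seen))  # first grab
print(is_unchanged("http://example.org/", b"<html>v1</html>", seen))  # same bytes
print(is_unchanged("http://example.org/", b"<html>v2</html>", seen))  # changed
```

Note that this saves storage, not bandwidth: the page still has to be downloaded before it can be compared.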
14:38 🔗 JAA If you run the site yourself, you can do it more efficiently of course.
14:38 🔗 magus_bgf That's what I currently (trying to) do with wget, but it doesn't work very well.
14:38 🔗 magus_bgf I'm not looking for perfect
14:39 🔗 JAA Or if the site somehow offers an "activity stream" kind of page which accurately reflects all changes.
14:39 🔗 JAA Define "doesn't work very well"?
14:45 🔗 magus_bgf Last time it ran for a month. I think due to some circular ../.. links and phpbb session ids which were generated anew all the time. And then the server had to be rebooted for some reason, and the result is a filedump which is not a backup, not an archive, just a dump.
14:46 🔗 magus_bgf And there's no telling what's there and what's not
14:47 🔗 magus_bgf My goal is: today a site goes down, tomorrow I put up a mirror.
14:47 🔗 magus_bgf and barring domain change, all links work
14:48 🔗 magus_bgf or close to that, on a best effort basis
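The month-long wget run magus_bgf describes was blamed on circular ../.. links and freshly generated phpBB session ids, both of which make one page look like many distinct URLs. A possible mitigation is to normalize each URL before checking whether it was already fetched. The sketch below assumes the session id travels in a `sid` query parameter (as phpBB does); the function name `normalize_url` and the `SESSION_PARAMS` set are hypothetical and would need adjusting per site.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters assumed to carry only session state (site-specific guess).
SESSION_PARAMS = {"sid"}


def normalize_url(url: str) -> str:
    """Collapse dot segments and drop session parameters and fragments."""
    parts = urlsplit(url)
    # Resolve "." and ".." so /forum/../forum/x and /forum/x compare equal.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # normpath strips a trailing slash; restore it to keep directory URLs.
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(kept), ""))


print(normalize_url("http://example.org/forum/../forum/viewtopic.php?t=5&sid=abc123"))
```

A crawler would keep a set of normalized URLs and skip any link whose normalized form is already in it.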
14:59 🔗 killsushi has quit IRC (Read error: Connection reset by peer)
15:00 🔗 killsushi has joined #archiveteam-ot
15:19 🔗 jut has joined #archiveteam-ot
15:46 🔗 killsushi has quit IRC (Quit: Leaving)
16:05 🔗 dhyan_nat has joined #archiveteam-ot
16:42 🔗 godane has quit IRC (Ping timeout: 600 seconds)
16:46 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
18:16 🔗 nyany digitalocean really doesnt make it easy to delete many droplets at once do they
19:10 🔗 magus_bgf has quit IRC (Ping timeout: 252 seconds)
19:18 🔗 magus_bgf has joined #archiveteam-ot
19:32 🔗 kpcyrd nyany: have you tried terraform?
19:43 🔗 magus_bgf has quit IRC (Ping timeout: 252 seconds)
19:51 🔗 magus_bgf has joined #archiveteam-ot
20:14 🔗 magus_bgf has quit IRC (Ping timeout: 252 seconds)
20:23 🔗 magus_bgf has joined #archiveteam-ot
20:29 🔗 ola_norsk has joined #archiveteam-ot
20:30 🔗 ola_norsk Anyone else having similar problem with tubeup ? https://i.imgur.com/STmMNyn.png (It's not filling in item titles anymore(?))
20:31 🔗 ola_norsk currently having*
20:33 🔗 ola_norsk It appears tubeup sets item's titles to simply "_" :/
20:34 🔗 * ola_norsk tubeup --version = 0.0.17
20:35 🔗 ola_norsk or might there be something fucking on my end?
20:35 🔗 JAA youtube-dl broke the other day because Google changed some data format or something.
20:35 🔗 ola_norsk grrrrrr
20:35 🔗 JAA Or well, youtube-dl's title extraction broke.
20:36 🔗 ola_norsk can the items be fixed by 'tubeup'ing' them when it's working again? or?
20:37 🔗 JAA No idea. Ask the tubeup authors?
20:37 🔗 JAA (Or check if someone asked it already.)
20:37 🔗 JAA But probably not.
20:38 🔗 ola_norsk https://github.com/bibanon/tubeup/issues/88
20:39 🔗 ola_norsk i'm not sure if that's the same issue
20:43 🔗 ola_norsk JAA: but as long as the videos json files are in each item, it's possible to rework the item titles from that i suppose
20:43 🔗 JAA That is a different issue.
20:43 🔗 ola_norsk that, or i guess there are now a slew of videos in 'Mirror Tube' collection named "_" :D
20:44 🔗 JAA And the JSON files are probably affected as well since they're also written by youtube-dl.
20:44 🔗 JAA Have fun fixing them.
20:44 🔗 ola_norsk :/
20:46 🔗 magus_bgf has quit IRC (Ping timeout: 252 seconds)
20:47 🔗 ola_norsk JAA: I certainly does affect json as well :/ https://i.imgur.com/GHoRhox.png
20:47 🔗 ola_norsk It*
20:48 🔗 ola_norsk If tubeup doesn't re-write item metadata, that is
20:50 🔗 ola_norsk But if i'm not very much mistaken, tubeup doesn't update metadata. Just the files
20:53 🔗 Ivy has joined #archiveteam-ot
20:54 🔗 ola_norsk gonna try to re-tubeup archive.org/details/youtube-wWiZpenRGx8 now
20:54 🔗 ola_norsk i noticed there was a youtube-dl upgrade, so i pip'ed that first
20:55 🔗 magus_bgf has joined #archiveteam-ot
21:01 🔗 ola_norsk thankfully, i've only uploaded ~4 tubeup items affected.. But if the original video titles in no way resides in the item, that is quite a major issue
21:02 🔗 ola_norsk re-uploading using tubeup + same url, doesn't fix the title
21:18 🔗 magus_bgf has quit IRC (Ping timeout: 252 seconds)
21:22 🔗 ola_norsk JAA: left an issue ticket now https://github.com/bibanon/tubeup/issues/103
21:27 🔗 magus_bgf has joined #archiveteam-ot
21:34 🔗 ola_norsk JAA: would you happen to know the youtube-dl version where it broke? But, it does indeed seem to have been an issue with that, since updating youtube-dl fixed the issue, though not on previous items
21:36 🔗 ola_norsk in any case, "youtube-dl 2019.7.16" doesn't work
21:38 🔗 JAA ola_norsk: It broke because *Google* changed something. All youtube-dl versions before it got fixed are affected.
21:38 🔗 JAA But no, I don't know which version fixed it.
21:38 🔗 JAA Must be one of the most recent ones.
21:39 🔗 ola_norsk 2012-02-27 was the latest 'unbroken' one of mine,
21:40 🔗 ola_norsk 2019-07-30 was the first tubeup i did that broke
21:42 🔗 ola_norsk has quit IRC (f.cking Google dude.. grrrrr https://youtu.be/pXnIB6O8PTc)
21:50 🔗 magus_bgf has quit IRC (Ping timeout: 252 seconds)
21:59 🔗 magus_bgf has joined #archiveteam-ot
22:22 🔗 magus_bgf has quit IRC (Ping timeout: 252 seconds)
22:30 🔗 magus_bgf has joined #archiveteam-ot
22:54 🔗 magus_bgf has quit IRC (Ping timeout: 252 seconds)
23:02 🔗 magus_bgf has joined #archiveteam-ot
23:26 🔗 magus_bgf has quit IRC (Ping timeout: 252 seconds)
23:34 🔗 magus_bgf has joined #archiveteam-ot
23:49 🔗 superkuh has joined #archiveteam-ot
23:58 🔗 magus_bgf has quit IRC (Ping timeout: 252 seconds)
