#archiveteam-ot 2019-11-27,Wed

Time Nickname Message
00:19 🔗 godane has joined #archiveteam-ot
00:45 🔗 SynMonger has quit IRC (Read error: Operation timed out)
01:05 🔗 manjaro-u has quit IRC (Konversation terminated!)
01:10 🔗 kpcyrd boring: we're running out of ipv4 addresses. hot: we're running out of twitter handles
01:13 🔗 * kpcyrd suggesting #TwitterIsFull, although there's probably a channel already
01:21 🔗 Video has joined #archiveteam-ot
01:22 🔗 Video i just found a web server that has a lot of linux-related mirror files
01:44 🔗 markedL really, I don't want one. https://medium.com/@N/how-i-lost-my-50-000-twitter-username-24eb09e026dd
02:31 🔗 BlueMax has quit IRC (Ping timeout: 745 seconds)
02:38 🔗 BlueMax has joined #archiveteam-ot
03:30 🔗 programme is now known as prq
04:17 🔗 qw3rty2 has joined #archiveteam-ot
04:26 🔗 qw3rty has quit IRC (Ping timeout: 745 seconds)
04:27 🔗 odemg has quit IRC (Ping timeout: 745 seconds)
04:31 🔗 odemg has joined #archiveteam-ot
04:50 🔗 kiska18 has quit IRC (Remote host closed the connection)
04:50 🔗 Ryz has quit IRC (Remote host closed the connection)
04:50 🔗 Ryz has joined #archiveteam-ot
04:51 🔗 kiska18 has joined #archiveteam-ot
05:08 🔗 SynMonger has joined #archiveteam-ot
05:16 🔗 SynMonger has quit IRC (Wait, what?)
05:43 🔗 akierig has joined #archiveteam-ot
06:01 🔗 kiska has quit IRC (Remote host closed the connection)
06:01 🔗 Flashfire has quit IRC (Remote host closed the connection)
06:02 🔗 kiska has joined #archiveteam-ot
06:02 🔗 Fusl sets mode: +o kiska
06:02 🔗 Fusl__ sets mode: +o kiska
06:02 🔗 Fusl_ sets mode: +o kiska
06:02 🔗 Flashfire has joined #archiveteam-ot
06:25 🔗 m007a83 has quit IRC (Quit: Fuck you Comcast)
06:33 🔗 akierig has quit IRC (Quit: later_gator)
06:47 🔗 SoraUta has quit IRC (Read error: Connection reset by peer)
06:50 🔗 SoraUta has joined #archiveteam-ot
07:54 🔗 dhyan_nat has joined #archiveteam-ot
08:45 🔗 MrRadar has quit IRC (Read error: Operation timed out)
09:20 🔗 magus_bgf has joined #archiveteam-ot
09:36 🔗 magus_bgf has quit IRC (Ping timeout: 252 seconds)
09:57 🔗 VADemon has joined #archiveteam-ot
09:57 🔗 magus_bgf has joined #archiveteam-ot
10:21 🔗 eientei95 has quit IRC (Remote host closed the connection)
10:27 🔗 eientei95 has joined #archiveteam-ot
10:27 🔗 eientei95 has quit IRC (Handshake flooding)
10:29 🔗 eientei95 has joined #archiveteam-ot
11:02 🔗 X-Scale` has joined #archiveteam-ot
11:04 🔗 X-Scale has quit IRC (Ping timeout: 252 seconds)
11:04 🔗 X-Scale` is now known as X-Scale
11:22 🔗 magus_bgf Is anyone here working on actually restoring websites, rather than just archiving?
11:31 🔗 X-Scale has quit IRC (Quit: HydraIRC -> http://www.hydrairc.com <- Organize your IRC)
11:43 🔗 X-Scale has joined #archiveteam-ot
11:46 🔗 X-Scale has quit IRC (Client Quit)
11:46 🔗 magus_bgf has quit IRC (Ping timeout: 252 seconds)
11:59 🔗 X-Scale has joined #archiveteam-ot
12:10 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
12:17 🔗 JAA magus_bgf (if you read logs): We don't even really have enough time to deal with saving the stuff, so generally, nobody here restores things. Besides, our archives all go into the Wayback Machine, which usually provides at least a somewhat usable display.
12:49 🔗 markedL Internet Trash Heap did a little playback
12:52 🔗 prq I have restored one or two websites. it isn't a very scalable endeavor.
12:55 🔗 JAA Yeah, probably takes more effort than the archival.
12:56 🔗 JAA The best route is to improve the playback in pywb and similar tools. And of course the WBM, but the code isn't open, so that has to be done by IA staff.
12:57 🔗 prq I have thought about running my own pywb instance for stuff I archive.
12:57 🔗 prq this one job I started weeks ago is still going. :/
12:58 🔗 JAA Hehe, try months. Hell, we have an ArchiveBot job that has its first anniversary in a few days.
12:58 🔗 prq I feel like there's several optimizations I could make to this one job that would help immensely-- run a different wget command for different parts of the tree
12:58 🔗 SynMonger has joined #archiveteam-ot
12:58 🔗 prq does that mean that job has been running in the same exact process for a year?
12:58 🔗 JAA Yeah, another reason why wpull is better than wget: it has parallelism.
12:58 🔗 prq or is it a distributed job that's been running on multiple machines?
12:59 🔗 JAA Single process on one machine.
12:59 🔗 prq wow
12:59 🔗 prq yeah, I want to do OS maintenance on the machine this job is running on. :/
12:59 🔗 JAA Distributed recursive crawling is a tricky problem.
12:59 🔗 prq indeed it is.
12:59 🔗 JAA First you have coordination, but that's easily solvable with a central URL database.
12:59 🔗 prq I feel like it should be very doable with a queue
13:00 🔗 prq rabbitmq or even redis
13:00 🔗 JAA But then you run into sessions and cookies tied to IP addresses etc.
13:00 🔗 JAA And it gets messy very quickly.
13:00 🔗 prq right-- many sites don't tie cookies to an ip. the ones I care about for this sphere of concern I have don't.
13:01 🔗 JAA Well yeah, it can work very well for individual sites, but doing it for the general case is hard.
13:01 🔗 JAA It's why ArchiveBot isn't distributed in that sense.
13:01 🔗 prq so opting in to a strategy that favors one ip vs not would be ideal.
13:01 🔗 prq and I could still run a distributed crawl, but put them all behind the same NAT'd wan ip
13:01 🔗 prq or run 100 threads on the same machine
13:02 🔗 prq or whatever
13:02 🔗 JAA Well yes, if you have the right tool for it.
13:02 🔗 JAA wget doesn't have any way to share its URL queue. wpull is single-threaded and doesn't handle DB locking at all. Not sure about other tools.
13:03 🔗 JAA Also, going highly parallel from a single IP only works if there are no rate limit issues.
13:03 🔗 JAA That year-long AB job I just mentioned has to go slow because otherwise the server collapses.
13:06 🔗 kiskaWee Which one is it? The gamefaq one?
13:06 🔗 JAA mozdev.org
13:06 🔗 kiskaWee Oh xD
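A minimal sketch of the "central URL database" idea JAA describes above, assuming SQLite as the store. The class name UrlFrontier, the table layout, and the worker names are hypothetical, and nothing here reflects how ArchiveBot, wget, or wpull actually work. It covers only coordination (deduplicating URLs and handing each one to a single worker); the session, cookie, and rate-limit problems raised in the discussion are not addressed.

```python
import sqlite3


class UrlFrontier:
    """Hypothetical shared URL queue backed by a single SQLite database."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "url TEXT PRIMARY KEY, "
            "status TEXT NOT NULL DEFAULT 'todo', "
            "worker TEXT)"
        )

    def add(self, url):
        """Queue a URL; return True if it was new, False if already known."""
        with self.db:
            cur = self.db.execute(
                "INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,)
            )
        return cur.rowcount == 1

    def claim(self, worker):
        """Hand the oldest unclaimed URL to a worker, or None if none remain.

        Assumption: one connection at a time. Real multi-process use would
        need an explicit write lock (e.g. BEGIN IMMEDIATE) around this step.
        """
        with self.db:
            row = self.db.execute(
                "SELECT url FROM urls WHERE status = 'todo' "
                "ORDER BY rowid LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            self.db.execute(
                "UPDATE urls SET status = 'in_progress', worker = ? "
                "WHERE url = ?",
                (worker, row[0]),
            )
        return row[0]

    def done(self, url):
        """Mark a claimed URL as finished."""
        with self.db:
            self.db.execute(
                "UPDATE urls SET status = 'done' WHERE url = ?", (url,)
            )
```

Each worker would loop on claim(), fetch the page, add() any discovered links, then call done(). A message queue such as RabbitMQ or Redis, as prq suggests, could replace the table, but deduplication would then need its own store.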
13:08 🔗 markedL queue management is something I really want all tools to have different strategies for
13:09 🔗 markedL wget is probably the most entrenched example
13:09 🔗 prq yeah, it sounds more and more like the ability to pick a strategy would be ideal.
13:09 🔗 prq the job I'm doing seems to have *maybe* some mild ratelimiting, but it seems to be applied unevenly.
13:09 🔗 magus_bgf has joined #archiveteam-ot
13:11 🔗 prq some sections of the site I could generate every URL for easily without relying on the wget crawl mechanism.
13:12 🔗 prq looking at their robots.txt, they seem to disallow some stuff that ought to be captured. weird.
13:14 🔗 kiskaWee 23:17:35
13:14 🔗 kiskaWee <@JAA>magus_bgf (if you read logs): We don't even really have enough time to deal with saving the stuff, so generally, nobody here restores things. Besides, our archives all go into the Wayback Machine, which usually provides at least a somewhat usable display.
13:14 🔗 kiskaWee Oh xD
13:14 🔗 anonymiga has quit IRC (Quit: Lost terminal)
13:19 🔗 magus_bgf @kiskaWee thanks
13:19 🔗 magus_bgf shame to hear that, of course
13:20 🔗 JAA magus_bgf: There was a bit of a discussion after that message above, check the logs: https://archive.fart.website/bin/irclogger_log/archiveteam-ot?date=2019-11-27,Wed
13:20 🔗 bluefoo has quit IRC (Read error: Operation timed out)
13:20 🔗 JAA (Well, just a few sentences about that question.)
13:20 🔗 markedL WBM is not super performant but is decently stable. It would be hard to beat that combination
13:22 🔗 markedL if I wanted to "fix" something, I'd be curious whether tampermonkey could improve the playback on certain grabs while still loading them from WBM
13:23 🔗 JAA That could work except for the security issues.
13:25 🔗 kiska18 has quit IRC (Read error: Operation timed out)
13:26 🔗 JAA Well, shit: https://bugs.python.org/issue36338#msg355322
13:26 🔗 kiska18 has joined #archiveteam-ot
13:26 🔗 Fusl sets mode: +o kiska18
13:26 🔗 Fusl__ sets mode: +o kiska18
13:26 🔗 Fusl_ sets mode: +o kiska18
13:26 🔗 Ryz has quit IRC (Read error: Connection reset by peer)
13:27 🔗 Ryz7 has joined #archiveteam-ot
13:28 🔗 JAA Er, wrong channel.
13:30 🔗 magus_bgf Read the log, too, thank you. IA is great of course, but I'd like to do better than "somewhat usable".
13:32 🔗 JAA Yeah, and I'm sure many people would be willing to work on that if it were possible.
13:32 🔗 JAA But I doubt the WBM code is going to be released anytime soon, if ever.
13:33 🔗 JAA Based on my tests and from what I've heard, pywb works significantly better than the WBM for JavaScript-heavy playback.
13:33 🔗 JAA But for some things, you'll always need some sort of fix scripts that remove or mangle certain URL components.
13:33 🔗 JAA For example, if the current timestamp is included in a URL.
13:34 🔗 JAA *Maybe* it's possible to build an archival browser that records and plays back any JS API calls that aren't constant.
13:35 🔗 JAA But that would be a *lot* of work obviously.
13:36 🔗 JAA Plus you'd need to use that patched browser for playback because you can't override many of those APIs dynamically.
13:37 🔗 markedL you mean same URL that gives 2 different results?
13:38 🔗 JAA No, I mean "cache busting" parameters like https://example.org/something?time=1574861908
13:39 🔗 JAA Common for example for JSON API requests launched from jQuery.
13:39 🔗 JAA (I think the parameter is _ there.)
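One possible shape for the "fix scripts" JAA mentions: a small normalizer that drops cache-busting query parameters before a URL is looked up in an archive. This is a sketch under assumptions: the function name and the parameter list are made up for illustration, and treating only all-digit values as cache busters is a heuristic, not something pywb or the WBM is known to do.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of cache-busting parameter names. "time" matches the
# example URL above; "_" is the name jQuery uses when caching is disabled.
CACHE_BUSTERS = frozenset({"_", "time"})


def strip_cache_busters(url, names=CACHE_BUSTERS):
    """Return url without query parameters that look like cache busters.

    A parameter is dropped only if its name is in `names` AND its value is
    all digits (i.e. looks like a timestamp), so "?time=noon" is kept.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not (key in names and value.isdigit())
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_cache_busters("https://example.org/something?time=1574861908"))
# prints https://example.org/something
```

With a rule like this, two requests that differ only in the timestamp map to the same archived record, which is the "2 urls that look different that should be the same" case.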
13:51 🔗 Sanky is now known as Sanqui
14:00 🔗 SoraUta has quit IRC (Ping timeout: 252 seconds)
14:43 🔗 Craigle has quit IRC (Ping timeout: 496 seconds)
14:43 🔗 Craigle has joined #archiveteam-ot
14:44 🔗 magus_bgf has quit IRC (Quit: Leaving)
15:24 🔗 superkuh_ is now known as superkuh
15:56 🔗 Jamesatja has joined #archiveteam-ot
16:44 🔗 Ryz7 is now known as Ryz
16:56 🔗 * Raccoon mopes in relegation
17:18 🔗 bluefoo has joined #archiveteam-ot
17:29 🔗 Jamesatja has quit IRC (Read error: Connection reset by peer)
17:31 🔗 bluefoo has quit IRC (Remote host closed the connection)
17:36 🔗 martini has joined #archiveteam-ot
17:39 🔗 deevious has quit IRC (Ping timeout: 252 seconds)
17:40 🔗 Zerote_ has joined #archiveteam-ot
17:40 🔗 kiska has quit IRC (Ping timeout: 252 seconds)
17:42 🔗 Zerote has quit IRC (Ping timeout: 252 seconds)
17:43 🔗 britmob_ has joined #archiveteam-ot
17:44 🔗 kiska has joined #archiveteam-ot
17:44 🔗 Fusl__ sets mode: +o kiska
17:44 🔗 Fusl sets mode: +o kiska
17:44 🔗 Fusl_ sets mode: +o kiska
17:46 🔗 bluefoo has joined #archiveteam-ot
17:46 🔗 britmob has quit IRC (Ping timeout: 252 seconds)
17:46 🔗 anarcat has quit IRC (Ping timeout: 252 seconds)
17:48 🔗 kiska has quit IRC (Ping timeout: 252 seconds)
17:53 🔗 _niklas has joined #archiveteam-ot
17:55 🔗 Flashfire has quit IRC (Ping timeout: 252 seconds)
17:57 🔗 anarcat has joined #archiveteam-ot
17:57 🔗 anarcat has quit IRC (Handshake flooding)
18:00 🔗 Flashfire has joined #archiveteam-ot
18:01 🔗 markedL oh, the opposite. 2 urls that look different that should be the same.
18:02 🔗 MrRadar has joined #archiveteam-ot
18:02 🔗 anarcat has joined #archiveteam-ot
18:03 🔗 Fusl has quit IRC (Quit: Moving to hackint)
18:04 🔗 Fusl__ has quit IRC (Quit: Moving to hackint)
18:04 🔗 Fusl_ has quit IRC (Quit: Moving to hackint)
18:04 🔗 systwiAL_ has joined #archiveteam-ot
18:09 🔗 systwiALT has quit IRC (Read error: Operation timed out)
18:12 🔗 systwiAL_ is now known as systwiALT
18:18 🔗 deevious has joined #archiveteam-ot
18:24 🔗 kiska has joined #archiveteam-ot
18:25 🔗 svchfoo3 sets mode: +o kiska
18:25 🔗 svchfoo1 sets mode: +o kiska
18:38 🔗 _niklas has quit IRC (Ping timeout: 258 seconds)
18:38 🔗 _niklas has joined #archiveteam-ot
18:38 🔗 mls has quit IRC (se.hub efnet.portlane.se)
18:38 🔗 Jon has quit IRC (se.hub efnet.portlane.se)
18:38 🔗 Laverne_ has quit IRC (se.hub efnet.portlane.se)
18:38 🔗 VoynichCr has quit IRC (se.hub efnet.portlane.se)
18:39 🔗 bluefoo has quit IRC (Ping timeout: 745 seconds)
18:55 🔗 MrRadar has quit IRC (Read error: Operation timed out)
18:56 🔗 systwiAL_ has joined #archiveteam-ot
18:58 🔗 MrRadar has joined #archiveteam-ot
18:58 🔗 bluefoo has joined #archiveteam-ot
19:02 🔗 MrRadar has quit IRC (Read error: Operation timed out)
19:03 🔗 systwiALT has quit IRC (Read error: Operation timed out)
19:13 🔗 Jon has joined #archiveteam-ot
19:37 🔗 mls has joined #archiveteam-ot
19:42 🔗 VoynichCr has joined #archiveteam-ot
19:49 🔗 martini has quit IRC (Quit: No Reasson)
19:58 🔗 Laverne has joined #archiveteam-ot
20:03 🔗 bluefoo has quit IRC (Read error: Connection reset by peer)
20:37 🔗 systwiAL_ is now known as systwiALT
20:45 🔗 Raccoon` has joined #archiveteam-ot
20:51 🔗 bluefoo has joined #archiveteam-ot
20:53 🔗 Raccoon has quit IRC (Ping timeout: 622 seconds)
20:53 🔗 Raccoon` is now known as Raccoon
20:55 🔗 Raccoon` has joined #archiveteam-ot
20:58 🔗 Raccoon has quit IRC (Ping timeout: 258 seconds)
20:58 🔗 Raccoon` is now known as Raccoon
21:02 🔗 MrRadar has joined #archiveteam-ot
21:19 🔗 Raccoon has quit IRC (Remote host closed the connection)
21:29 🔗 manjaro-u has joined #archiveteam-ot
23:05 🔗 ryry has joined #archiveteam-ot
23:05 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
23:15 🔗 VerifiedJ has quit IRC (Read error: Connection reset by peer)
23:21 🔗 dewdropaw has quit IRC (Quit: I object! That was... objectionable!)
23:37 🔗 BlueMax has joined #archiveteam-ot
23:50 🔗 SoraUta has joined #archiveteam-ot
23:53 🔗 manjaro-u has quit IRC (Ping timeout: 252 seconds)
