Time |
Nickname |
Message |
00:19
🔗
|
|
godane has joined #archiveteam-ot |
00:45
🔗
|
|
SynMonger has quit IRC (Read error: Operation timed out) |
01:05
🔗
|
|
manjaro-u has quit IRC (Konversation terminated!) |
01:10
🔗
|
kpcyrd |
boring: we're running out of ipv4 addresses. hot: we're running out of twitter handles |
01:13
🔗
|
* |
kpcyrd suggesting #TwitterIsFull, although there's probably a channel already |
01:21
🔗
|
|
Video has joined #archiveteam-ot |
01:22
🔗
|
Video |
i just found a web server that has a lot of linux-related mirror files |
01:44
🔗
|
markedL |
really, I don't want one. https://medium.com/@N/how-i-lost-my-50-000-twitter-username-24eb09e026dd |
02:31
🔗
|
|
BlueMax has quit IRC (Ping timeout: 745 seconds) |
02:38
🔗
|
|
BlueMax has joined #archiveteam-ot |
03:30
🔗
|
|
programme is now known as prq |
04:17
🔗
|
|
qw3rty2 has joined #archiveteam-ot |
04:26
🔗
|
|
qw3rty has quit IRC (Ping timeout: 745 seconds) |
04:27
🔗
|
|
odemg has quit IRC (Ping timeout: 745 seconds) |
04:31
🔗
|
|
odemg has joined #archiveteam-ot |
04:50
🔗
|
|
kiska18 has quit IRC (Remote host closed the connection) |
04:50
🔗
|
|
Ryz has quit IRC (Remote host closed the connection) |
04:50
🔗
|
|
Ryz has joined #archiveteam-ot |
04:51
🔗
|
|
kiska18 has joined #archiveteam-ot |
05:08
🔗
|
|
SynMonger has joined #archiveteam-ot |
05:16
🔗
|
|
SynMonger has quit IRC (Wait, what?) |
05:43
🔗
|
|
akierig has joined #archiveteam-ot |
06:01
🔗
|
|
kiska has quit IRC (Remote host closed the connection) |
06:01
🔗
|
|
Flashfire has quit IRC (Remote host closed the connection) |
06:02
🔗
|
|
kiska has joined #archiveteam-ot |
06:02
🔗
|
|
Fusl sets mode: +o kiska |
06:02
🔗
|
|
Fusl__ sets mode: +o kiska |
06:02
🔗
|
|
Fusl_ sets mode: +o kiska |
06:02
🔗
|
|
Flashfire has joined #archiveteam-ot |
06:25
🔗
|
|
m007a83 has quit IRC (Quit: Fuck you Comcast) |
06:33
🔗
|
|
akierig has quit IRC (Quit: later_gator) |
06:47
🔗
|
|
SoraUta has quit IRC (Read error: Connection reset by peer) |
06:50
🔗
|
|
SoraUta has joined #archiveteam-ot |
07:54
🔗
|
|
dhyan_nat has joined #archiveteam-ot |
08:45
🔗
|
|
MrRadar has quit IRC (Read error: Operation timed out) |
09:20
🔗
|
|
magus_bgf has joined #archiveteam-ot |
09:36
🔗
|
|
magus_bgf has quit IRC (Ping timeout: 252 seconds) |
09:57
🔗
|
|
VADemon has joined #archiveteam-ot |
09:57
🔗
|
|
magus_bgf has joined #archiveteam-ot |
10:21
🔗
|
|
eientei95 has quit IRC (Remote host closed the connection) |
10:27
🔗
|
|
eientei95 has joined #archiveteam-ot |
10:27
🔗
|
|
eientei95 has quit IRC (Handshake flooding) |
10:29
🔗
|
|
eientei95 has joined #archiveteam-ot |
11:02
🔗
|
|
X-Scale` has joined #archiveteam-ot |
11:04
🔗
|
|
X-Scale has quit IRC (Ping timeout: 252 seconds) |
11:04
🔗
|
|
X-Scale` is now known as X-Scale |
11:22
🔗
|
magus_bgf |
Is anyone here working on actually restoring websites, rather than just archiving? |
11:31
🔗
|
|
X-Scale has quit IRC (Quit: HydraIRC -> http://www.hydrairc.com <- Organize your IRC) |
11:43
🔗
|
|
X-Scale has joined #archiveteam-ot |
11:46
🔗
|
|
X-Scale has quit IRC (Client Quit) |
11:46
🔗
|
|
magus_bgf has quit IRC (Ping timeout: 252 seconds) |
11:59
🔗
|
|
X-Scale has joined #archiveteam-ot |
12:10
🔗
|
|
BlueMax has quit IRC (Read error: Connection reset by peer) |
12:17
🔗
|
JAA |
magus_bgf (if you read logs): We don't even really have enough time to deal with saving the stuff, so generally, nobody here restores things. Besides, our archives all go into the Wayback Machine, which usually provides at least a somewhat usable display. |
12:49
🔗
|
markedL |
Internet Trash Heap did a little playback |
12:52
🔗
|
prq |
I have restored one or two websites. it isn't a very scalable endeavor. |
12:55
🔗
|
JAA |
Yeah, probably takes more effort than the archival. |
12:56
🔗
|
JAA |
The best route is to improve the playback in pywb and similar tools. And of course the WBM, but the code isn't open, so that has to be done by IA staff. |
12:57
🔗
|
prq |
I have thought about running my own pywb instance for stuff I archive. |
12:57
🔗
|
prq |
this one job I started weeks ago is still going. :/ |
12:58
🔗
|
JAA |
Hehe, try months. Hell, we have an ArchiveBot job that has its first anniversary in a few days. |
12:58
🔗
|
prq |
I feel like there's several optimizations I could make to this one job that would help immensely-- run a different wget command for different parts of the tree |
12:58
🔗
|
|
SynMonger has joined #archiveteam-ot |
12:58
🔗
|
prq |
does that mean that job has been running in the same exact process for a year? |
12:58
🔗
|
JAA |
Yeah, another reason why wpull is better than wget: it has parallelism. |
12:58
🔗
|
prq |
or is it a distributed job that's been running on multiple machines? |
12:59
🔗
|
JAA |
Single process on one machine. |
12:59
🔗
|
prq |
wow |
12:59
🔗
|
prq |
yeah, I want to do OS maintenance on the machine this job is running on. :/ |
12:59
🔗
|
JAA |
Distributed recursive crawling is a tricky problem. |
12:59
🔗
|
prq |
indeed it is. |
12:59
🔗
|
JAA |
First you have coordination, but that's easily solvable with a central URL database. |
12:59
🔗
|
prq |
I feel like it should be very doable with a queue |
13:00
🔗
|
prq |
rabbitmq or even redis |
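The central URL database and queue idea from the last few messages can be sketched as follows. This is a minimal in-memory stand-in, not how ArchiveBot, wget, or wpull work: `UrlFrontier`, `crawl`, and `fetch_links` are hypothetical names, and a real deployment would presumably keep the seen-set and the queue in Redis, RabbitMQ, or a database so that workers on several machines can share them.

```python
import threading
from collections import deque


class UrlFrontier:
    """In-memory stand-in for a central URL database shared by crawl workers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()       # every URL that was ever enqueued
        self._pending = deque()  # URLs still waiting for a worker

    def add(self, url):
        """Enqueue url unless it was seen before; return True if it was new."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            self._pending.append(url)
            return True

    def claim(self):
        """Hand the next pending URL to exactly one worker, or None if empty."""
        with self._lock:
            return self._pending.popleft() if self._pending else None


def crawl(frontier, fetch_links, workers=4):
    """Drain the frontier with several threads; return the URLs fetched.

    fetch_links(url) is a caller-supplied (hypothetical) function that
    downloads url and returns the links found in it.
    """
    fetched = []
    fetched_lock = threading.Lock()

    def worker():
        while True:
            url = frontier.claim()
            if url is None:
                # Queue is empty right now. A busy worker may still add
                # links, but it will then claim them itself, so nothing
                # is lost; only some parallelism is.
                return
            links = fetch_links(url)
            with fetched_lock:
                fetched.append(url)
            for link in links:
                frontier.add(link)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return fetched
```

The lock around `add` and `claim` is what keeps two workers from fetching the same URL; it does nothing about the session, cookie, and rate-limit problems raised in the discussion.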
13:00
🔗
|
JAA |
But then you run into sessions and cookies tied to IP addresses etc. |
13:00
🔗
|
JAA |
And it gets messy very quickly. |
13:00
🔗
|
prq |
right-- many sites don't tie cookies to an ip. the ones I care about in my sphere of concern don't. |
13:01
🔗
|
JAA |
Well yeah, it can work very well for individual sites, but doing it for the general case is hard. |
13:01
🔗
|
JAA |
It's why ArchiveBot isn't distributed in that sense. |
13:01
🔗
|
prq |
so opting in to a strategy that favors one ip vs not would be ideal. |
13:01
🔗
|
prq |
and I could still run a distributed crawl, but put them all behind the same NAT'd wan ip |
13:01
🔗
|
prq |
or run 100 threads on the same machine |
13:02
🔗
|
prq |
or whatever |
13:02
🔗
|
JAA |
Well yes, if you have the right tool for it. |
13:02
🔗
|
JAA |
wget doesn't have any way to share its URL queue. wpull is single-threaded and doesn't handle DB locking at all. Not sure about other tools. |
13:03
🔗
|
JAA |
Also, going highly parallel from a single IP only works if there are no rate limit issues. |
13:03
🔗
|
JAA |
That year-long AB job I just mentioned has to go slow because otherwise the server collapses. |
13:06
🔗
|
kiskaWee |
Which one is it? The gamefaq one? |
13:06
🔗
|
JAA |
mozdev.org |
13:06
🔗
|
kiskaWee |
Oh xD |
13:08
🔗
|
markedL |
queue management is something I really want all tools to have different strategies for |
13:09
🔗
|
markedL |
wget is probably the most entrenched example |
13:09
🔗
|
prq |
yeah, it sounds more and more like the ability to pick a strategy would be ideal. |
13:09
🔗
|
prq |
the job I'm doing seems to have *maybe* some mild ratelimiting, but it seems to be applied unevenly. |
13:09
🔗
|
|
magus_bgf has joined #archiveteam-ot |
13:11
🔗
|
prq |
some sections of the site I could generate every URL for easily without relying on the wget crawl mechanism. |
13:12
🔗
|
prq |
looking at their robots.txt, they seem to disallow some stuff that ought to be captured. weird. |
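One way to see which candidate URLs a robots.txt would block is Python's standard `urllib.robotparser`. This is only a sketch: the `disallowed` helper, the rules, and the example.org URLs are made up for illustration and are not taken from the site being discussed.

```python
from urllib.robotparser import RobotFileParser


def disallowed(robots_txt, urls, agent="*"):
    """Return the URLs from `urls` that the given robots.txt text forbids."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(agent, u)]


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    candidates = [
        "https://example.org/public/index.html",
        "https://example.org/private/report.html",
    ]
    for url in disallowed(rules, candidates):
        print(url)
```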
13:14
🔗
|
kiskaWee |
23:17:35 |
13:14
🔗
|
kiskaWee |
<@JAA>magus_bgf (if you read logs): We don't even really have enough time to deal with saving the stuff, so generally, nobody here restores things. Besides, our archives all go into the Wayback Machine, which usually provides at least a somewhat usable display. |
13:14
🔗
|
kiskaWee |
Oh xD |
13:14
🔗
|
|
anonymiga has quit IRC (Quit: Lost terminal) |
13:19
🔗
|
magus_bgf |
@kiskaWee thanks |
13:19
🔗
|
magus_bgf |
shame to hear that, of course |
13:20
🔗
|
JAA |
magus_bgf: There was a bit of a discussion after that message above, check the logs: https://archive.fart.website/bin/irclogger_log/archiveteam-ot?date=2019-11-27,Wed |
13:20
🔗
|
|
bluefoo has quit IRC (Read error: Operation timed out) |
13:20
🔗
|
JAA |
(Well, just a few sentences about that question.) |
13:20
🔗
|
markedL |
WBM is not super performant but is decently stable. It would be hard to beat that combination |
13:22
🔗
|
markedL |
if I wanted to "fix" something, I'd be curious whether tampermonkey could improve the playback on certain grabs while still serving them from WBM |
13:23
🔗
|
JAA |
That could work except for the security issues. |
13:25
🔗
|
|
kiska18 has quit IRC (Read error: Operation timed out) |
13:26
🔗
|
JAA |
Well, shit: https://bugs.python.org/issue36338#msg355322 |
13:26
🔗
|
|
kiska18 has joined #archiveteam-ot |
13:26
🔗
|
|
Fusl sets mode: +o kiska18 |
13:26
🔗
|
|
Fusl__ sets mode: +o kiska18 |
13:26
🔗
|
|
Fusl_ sets mode: +o kiska18 |
13:26
🔗
|
|
Ryz has quit IRC (Read error: Connection reset by peer) |
13:27
🔗
|
|
Ryz7 has joined #archiveteam-ot |
13:28
🔗
|
JAA |
Er, wrong channel. |
13:30
🔗
|
magus_bgf |
Read the log, too, thank you. IA is great of course, but I'd like to do better than "somewhat usable". |
13:32
🔗
|
JAA |
Yeah, and I'm sure many people would be willing to work on that if it were possible. |
13:32
🔗
|
JAA |
But I doubt the WBM code is going to be released anytime soon, if ever. |
13:33
🔗
|
JAA |
Based on my tests and from what I've heard, pywb works significantly better than the WBM for JavaScript-heavy playback. |
13:33
🔗
|
JAA |
But for some things, you'll always need some sort of fix scripts that remove or mangle certain URL components. |
13:33
🔗
|
JAA |
For example, if the current timestamp is included in a URL. |
13:34
🔗
|
JAA |
*Maybe* it's possible to build an archival browser that records and plays back any JS API calls that aren't constant. |
13:35
🔗
|
JAA |
But that would be a *lot* of work obviously. |
13:36
🔗
|
JAA |
Plus you'd need to use that patched browser for playback because you can't override many of those APIs dynamically. |
13:37
🔗
|
markedL |
you mean same URL that gives 2 different results? |
13:38
🔗
|
JAA |
No, I mean "cache busting" parameters like https://example.org/something?time=1574861908 |
13:39
🔗
|
JAA |
Common for example for JSON API requests launched from jQuery. |
13:39
🔗
|
JAA |
(I think the parameter is _ there.) |
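A fix script of the kind described above could normalize URLs by dropping cache-busting parameters before looking them up in the archive. This is a rough sketch: the `strip_cache_busters` helper and the parameter names `_` and `time` are assumptions based on the examples in this conversation, and which parameters are actually safe to drop depends on the site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed cache-busting parameter names; adjust per site.
CACHE_BUSTERS = frozenset({"_", "time"})


def strip_cache_busters(url, names=CACHE_BUSTERS):
    """Return url with the named query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    # urlencode re-encodes the remaining values, so the percent-encoding
    # of the result may differ slightly from the input.
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    print(strip_cache_busters("https://example.org/something?time=1574861908"))
    print(strip_cache_busters("https://example.org/api?q=a&_=1574861908"))
```

With this, two requests that differ only in the timestamp map to the same archived URL.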
13:51
🔗
|
|
Sanky is now known as Sanqui |
14:00
🔗
|
|
SoraUta has quit IRC (Ping timeout: 252 seconds) |
14:43
🔗
|
|
Craigle has quit IRC (Ping timeout: 496 seconds) |
14:43
🔗
|
|
Craigle has joined #archiveteam-ot |
14:44
🔗
|
|
magus_bgf has quit IRC (Quit: Leaving) |
15:24
🔗
|
|
superkuh_ is now known as superkuh |
15:56
🔗
|
|
Jamesatja has joined #archiveteam-ot |
16:44
🔗
|
|
Ryz7 is now known as Ryz |
16:56
🔗
|
* |
Raccoon mopes in relegation |
17:18
🔗
|
|
bluefoo has joined #archiveteam-ot |
17:29
🔗
|
|
Jamesatja has quit IRC (Read error: Connection reset by peer) |
17:31
🔗
|
|
bluefoo has quit IRC (Remote host closed the connection) |
17:36
🔗
|
|
martini has joined #archiveteam-ot |
17:39
🔗
|
|
deevious has quit IRC (Ping timeout: 252 seconds) |
17:40
🔗
|
|
Zerote_ has joined #archiveteam-ot |
17:40
🔗
|
|
kiska has quit IRC (Ping timeout: 252 seconds) |
17:42
🔗
|
|
Zerote has quit IRC (Ping timeout: 252 seconds) |
17:43
🔗
|
|
britmob_ has joined #archiveteam-ot |
17:44
🔗
|
|
kiska has joined #archiveteam-ot |
17:44
🔗
|
|
Fusl__ sets mode: +o kiska |
17:44
🔗
|
|
Fusl sets mode: +o kiska |
17:44
🔗
|
|
Fusl_ sets mode: +o kiska |
17:46
🔗
|
|
bluefoo has joined #archiveteam-ot |
17:46
🔗
|
|
britmob has quit IRC (Ping timeout: 252 seconds) |
17:46
🔗
|
|
anarcat has quit IRC (Ping timeout: 252 seconds) |
17:48
🔗
|
|
kiska has quit IRC (Ping timeout: 252 seconds) |
17:53
🔗
|
|
_niklas has joined #archiveteam-ot |
17:55
🔗
|
|
Flashfire has quit IRC (Ping timeout: 252 seconds) |
17:57
🔗
|
|
anarcat has joined #archiveteam-ot |
17:57
🔗
|
|
anarcat has quit IRC (Handshake flooding) |
18:00
🔗
|
|
Flashfire has joined #archiveteam-ot |
18:01
🔗
|
markedL |
oh, the opposite. 2 urls that look different that should be the same. |
18:02
🔗
|
|
MrRadar has joined #archiveteam-ot |
18:02
🔗
|
|
anarcat has joined #archiveteam-ot |
18:03
🔗
|
|
Fusl has quit IRC (Quit: Moving to hackint) |
18:04
🔗
|
|
Fusl__ has quit IRC (Quit: Moving to hackint) |
18:04
🔗
|
|
Fusl_ has quit IRC (Quit: Moving to hackint) |
18:04
🔗
|
|
systwiAL_ has joined #archiveteam-ot |
18:09
🔗
|
|
systwiALT has quit IRC (Read error: Operation timed out) |
18:12
🔗
|
|
systwiAL_ is now known as systwiALT |
18:18
🔗
|
|
deevious has joined #archiveteam-ot |
18:24
🔗
|
|
kiska has joined #archiveteam-ot |
18:25
🔗
|
|
svchfoo3 sets mode: +o kiska |
18:25
🔗
|
|
svchfoo1 sets mode: +o kiska |
18:38
🔗
|
|
_niklas has quit IRC (Ping timeout: 258 seconds) |
18:38
🔗
|
|
_niklas has joined #archiveteam-ot |
18:38
🔗
|
|
mls has quit IRC (se.hub efnet.portlane.se) |
18:38
🔗
|
|
Jon has quit IRC (se.hub efnet.portlane.se) |
18:38
🔗
|
|
Laverne_ has quit IRC (se.hub efnet.portlane.se) |
18:38
🔗
|
|
VoynichCr has quit IRC (se.hub efnet.portlane.se) |
18:39
🔗
|
|
bluefoo has quit IRC (Ping timeout: 745 seconds) |
18:55
🔗
|
|
MrRadar has quit IRC (Read error: Operation timed out) |
18:56
🔗
|
|
systwiAL_ has joined #archiveteam-ot |
18:58
🔗
|
|
MrRadar has joined #archiveteam-ot |
18:58
🔗
|
|
bluefoo has joined #archiveteam-ot |
19:02
🔗
|
|
MrRadar has quit IRC (Read error: Operation timed out) |
19:03
🔗
|
|
systwiALT has quit IRC (Read error: Operation timed out) |
19:13
🔗
|
|
Jon has joined #archiveteam-ot |
19:37
🔗
|
|
mls has joined #archiveteam-ot |
19:42
🔗
|
|
VoynichCr has joined #archiveteam-ot |
19:49
🔗
|
|
martini has quit IRC (Quit: No Reasson) |
19:58
🔗
|
|
Laverne has joined #archiveteam-ot |
20:03
🔗
|
|
bluefoo has quit IRC (Read error: Connection reset by peer) |
20:37
🔗
|
|
systwiAL_ is now known as systwiALT |
20:45
🔗
|
|
Raccoon` has joined #archiveteam-ot |
20:51
🔗
|
|
bluefoo has joined #archiveteam-ot |
20:53
🔗
|
|
Raccoon has quit IRC (Ping timeout: 622 seconds) |
20:53
🔗
|
|
Raccoon` is now known as Raccoon |
20:55
🔗
|
|
Raccoon` has joined #archiveteam-ot |
20:58
🔗
|
|
Raccoon has quit IRC (Ping timeout: 258 seconds) |
20:58
🔗
|
|
Raccoon` is now known as Raccoon |
21:02
🔗
|
|
MrRadar has joined #archiveteam-ot |
21:19
🔗
|
|
Raccoon has quit IRC (Remote host closed the connection) |
21:29
🔗
|
|
manjaro-u has joined #archiveteam-ot |
23:05
🔗
|
|
ryry has joined #archiveteam-ot |
23:05
🔗
|
|
dhyan_nat has quit IRC (Read error: Operation timed out) |
23:15
🔗
|
|
VerifiedJ has quit IRC (Read error: Connection reset by peer) |
23:21
🔗
|
|
dewdropaw has quit IRC (Quit: I object! That was... objectionable!) |
23:37
🔗
|
|
BlueMax has joined #archiveteam-ot |
23:50
🔗
|
|
SoraUta has joined #archiveteam-ot |
23:53
🔗
|
|
manjaro-u has quit IRC (Ping timeout: 252 seconds) |