Time |
Nickname |
Message |
00:42
🔗
|
|
Maylay has quit IRC (Quit: No Ping reply in 300 seconds.) |
00:44
🔗
|
|
manwith1n is now known as ranma |
00:53
🔗
|
|
Maylay has joined #archiveteam-bs |
01:12
🔗
|
|
Maylay has quit IRC (Remote host closed the connection) |
01:15
🔗
|
|
Maylay has joined #archiveteam-bs |
01:18
🔗
|
|
X-Scale has quit IRC (Quit: HydraIRC -> http://www.hydrairc.com <- Organize your IRC) |
01:19
🔗
|
|
zino_ has quit IRC (Read error: Operation timed out) |
01:21
🔗
|
|
zino has joined #archiveteam-bs |
01:21
🔗
|
|
Maylay has quit IRC (Quit: No Ping reply in 300 seconds.) |
01:24
🔗
|
|
Maylay has joined #archiveteam-bs |
01:32
🔗
|
|
X-Scale has joined #archiveteam-bs |
01:33
🔗
|
|
VerifiedJ has quit IRC (Quit: Leaving) |
01:40
🔗
|
|
Maylay has quit IRC (Quit: No Ping reply in 300 seconds.) |
01:45
🔗
|
|
sirvy_ has joined #archiveteam-bs |
01:50
🔗
|
|
sirvy has quit IRC (Ping timeout: 615 seconds) |
01:55
🔗
|
|
Maylay has joined #archiveteam-bs |
02:04
🔗
|
|
Maylay has quit IRC (Quit: No Ping reply in 300 seconds.) |
02:06
🔗
|
|
Maylay has joined #archiveteam-bs |
02:14
🔗
|
|
Maylay has quit IRC (Remote host closed the connection) |
02:24
🔗
|
|
Maylay has joined #archiveteam-bs |
02:27
🔗
|
markedL |
someone smarter than me, how was this thing escaped: data-uix-load-more-href=\"\/browse_ajax?action_continuation=1\u0026amp;continuation=4qmFsgIqEhpWTExMTXU1Z1BtS3A1YXYwUUNBYWpLVE1odxoMZWdkUVZEcERUV2RD\"\u003e\u003c |
02:30
🔗
|
ivan |
markedL: an HTML-safe JSON encoder that also escapes / to \/, followed by html entity escaping |
02:31
🔗
|
ivan |
the reason some JSON encoders do that is to prevent </script> from closing a script (and starting a new one) |
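A minimal sketch of undoing the two layers described above, using only the Python standard library. The sample string is the one markedL pasted; the helper names `decode_embedded` and `script_safe_dumps` are hypothetical, and treating the fragment as the body of a JSON string is an assumption about how it was embedded in the page.

```python
import html
import json

# Fragment exactly as pasted in the channel: the body of a JSON string
# (without the surrounding double quotes).
raw = (
    r'data-uix-load-more-href=\"\/browse_ajax?action_continuation=1'
    r'\u0026amp;continuation=4qmFsgIqEhpWTExMTXU1Z1BtS3A1YXYwUUNBYWpLVE1odxoMZWdkUVZEcERUV2RD'
    r'\"\u003e\u003c'
)


def decode_embedded(fragment: str) -> str:
    """Undo JSON string escaping first, then HTML entity escaping."""
    # \" , \/ and \uXXXX are all standard JSON escapes, so json.loads handles them.
    as_html = json.loads('"' + fragment + '"')
    # What remains is ordinary HTML, where & was written as &amp;.
    return html.unescape(as_html)


def script_safe_dumps(value) -> str:
    """Hypothetical encoder: escape "/" so "</script>" never appears literally."""
    return json.dumps(value).replace("/", "\\/")


decoded = decode_embedded(raw)
print(decoded)
print(script_safe_dumps("</script>"))
```

Because `\/` is a legal JSON escape for `/`, any JSON parser reads the escaped form back to the original text, which is why encoders can apply it without breaking consumers.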
02:36
🔗
|
|
kode54 has quit IRC (Quit: The Lounge - https://thelounge.chat) |
02:48
🔗
|
|
Maylay has quit IRC (Remote host closed the connection) |
02:51
🔗
|
|
Maylay has joined #archiveteam-bs |
03:06
🔗
|
|
Maylay has quit IRC (No Ping reply in 300 seconds.) |
03:08
🔗
|
|
Maylay has joined #archiveteam-bs |
03:12
🔗
|
|
kode54 has joined #archiveteam-bs |
03:15
🔗
|
|
Maylay has quit IRC (Quit: No Ping reply in 300 seconds.) |
03:17
🔗
|
|
Maylay has joined #archiveteam-bs |
03:29
🔗
|
|
Maylay has quit IRC (Read error: Operation timed out) |
03:37
🔗
|
|
kode54 has quit IRC (Remote host closed the connection) |
03:38
🔗
|
|
kode54 has joined #archiveteam-bs |
04:00
🔗
|
|
Maylay has joined #archiveteam-bs |
04:00
🔗
|
|
Maylay has quit IRC (Remote host closed the connection) |
04:02
🔗
|
|
DLoader has quit IRC (Quit: DLoader) |
04:07
🔗
|
|
cppchrisc has joined #archiveteam-bs |
04:07
🔗
|
|
cppchrisc has quit IRC (Connection closed) |
04:08
🔗
|
|
cppchrisc has joined #archiveteam-bs |
04:10
🔗
|
|
qw3rty has joined #archiveteam-bs |
04:19
🔗
|
|
qw3rty2 has quit IRC (Ping timeout: 745 seconds) |
04:27
🔗
|
|
odemgi_ has joined #archiveteam-bs |
04:28
🔗
|
|
Maylay has joined #archiveteam-bs |
04:32
🔗
|
|
odemgi has quit IRC (Read error: Operation timed out) |
04:34
🔗
|
|
sirvy has joined #archiveteam-bs |
04:38
🔗
|
|
sirvy_ has quit IRC (Ping timeout: 615 seconds) |
04:46
🔗
|
|
K4k has quit IRC (Read error: Connection reset by peer) |
04:49
🔗
|
|
killsushi has joined #archiveteam-bs |
05:18
🔗
|
|
tech234a has quit IRC (Quit: Connection closed for inactivity) |
05:59
🔗
|
SketchCow |
I know |
06:00
🔗
|
SketchCow |
They still need to qa the thing |
06:04
🔗
|
Kaz |
it's cool, looks like it's deployed |
06:10
🔗
|
markedL |
update on Youtube liked-lists, going to need a repo and target soon. 4.4TB of HTML in warc.gz |
06:12
🔗
|
Raccoon |
markedL: can you do a search of your scrapings for videos by channel name |
06:13
🔗
|
markedL |
can you join the project channel? |
06:13
🔗
|
Raccoon |
a channel that was recently removed has been backed up, but no metadata. |
06:13
🔗
|
Raccoon |
sort of a side question anyway |
06:14
🔗
|
Raccoon |
just wondering if you have any said metadata |
06:18
🔗
|
markedL |
I just process what I'm handed, you're welcome to ask the upstream data providers |
07:13
🔗
|
|
DLoader has joined #archiveteam-bs |
07:36
🔗
|
|
godane has quit IRC (Ping timeout: 252 seconds) |
07:49
🔗
|
|
klg has quit IRC (Ping timeout: 258 seconds) |
07:54
🔗
|
|
klg has joined #archiveteam-bs |
09:16
🔗
|
apache2 |
what's your favorite tool for crawling and producing WARC archives? |
09:16
🔗
|
apache2 |
(and is WARC the recommended format?) |
09:21
🔗
|
jodizzle |
apache2: Yes, WARC is the generally recommended format. There are a bunch of tools people use, but this one is pretty good for personal use: https://github.com/archiveteam/grab-site |
09:21
🔗
|
jodizzle |
There's also a tools section on the wiki: https://www.archiveteam.org/index.php?title=The_WARC_Ecosystem#Tools |
09:30
🔗
|
apache2 |
thank you jodizzle |
09:31
🔗
|
apache2 |
follow-up question: what do you personally use for searching/extracting from WARC dumps, and is there something like a browser-based application that can browse a WARC archive WITHOUT internet connections being made? |
09:32
🔗
|
JAA |
zgrep for searching WARCs because I'm insane. For playback, pywb is pretty good. |
09:33
🔗
|
JAA |
There are tools for extracting WARC contents into plain files, but I don't have much experience with them. |
09:33
🔗
|
JAA |
At most, I might dump the response contents to stdout (using my own tool, warc-tiny) and process them using grep, sed, awk, etc. to extract the information I'm interested in. |
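For readers without zgrep at hand, a rough Python stand-in for the same brute-force search might look like the sketch below. It is not warc-tiny and does not parse WARC records. It simply decompresses the file and matches lines. The function name `grep_warc_gz` is hypothetical.

```python
import gzip
import re
from typing import Iterator, Tuple


def grep_warc_gz(path: str, pattern: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (line_number, line) for each decompressed line matching pattern."""
    regex = re.compile(pattern)
    # gzip.open reads concatenated gzip members as one stream, which matters
    # because .warc.gz files are commonly written as one member per record.
    with gzip.open(path, "rb") as stream:
        for number, line in enumerate(stream, start=1):
            if regex.search(line):
                yield number, line.rstrip(b"\r\n")
```

Usage would be along the lines of `for n, line in grep_warc_gz("dump.warc.gz", rb"example\.com"): print(n, line)`, where the file name is a placeholder. Binary payloads are matched as raw bytes, so patterns should be byte strings.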
09:34
🔗
|
apache2 |
I'd primarily be looking for something that could extract article content (so some sort of CSS3-like selector syntax) and media linked from those articles |
09:34
🔗
|
apache2 |
fair enough, I'm happy to hack something together myself too. I'll look into pywb and grab-site |
09:35
🔗
|
apache2 |
how does grab-site deal with media content like streaming video? |
09:36
🔗
|
JAA |
It's complicated. |
09:36
🔗
|
JAA |
If it's a simple <video> tag with a URL in <source> or whatever, wpull (the tool under the hood of grab-site and also ArchiveBot) will pick that up. |
09:37
🔗
|
JAA |
If it's done with an M3U(8) playlist, e.g. DASH, then not. |
09:38
🔗
|
apache2 |
I'm half-expecting to bump into articles with embedded youtube links |
09:38
🔗
|
JAA |
If JavaScript is involved, it might. Basically, if the video file (not playlist) URL appears in JS as one string, it should normally find it. If there's escaping going on, that might still not work though. |
09:38
🔗
|
apache2 |
I believe youtube uses DASH? |
09:38
🔗
|
JAA |
Yeah |
09:38
🔗
|
JAA |
It definitely won't grab YouTube videos. |
09:39
🔗
|
apache2 |
is there a separate tool that can fix up an existing warc with embedded youtube tags using youtube-dl externally? |
09:39
🔗
|
JAA |
There is a youtube-dl integration in wpull (not sure if grab-site exposes it), but it's been broken for years. |
09:40
🔗
|
JAA |
It should probably still work with wpull 1.2.3 to a degree. (That version isn't supported by grab-site anymore though.) |
09:40
🔗
|
apache2 |
okay, I guess that's something to look into. thank you! |
09:43
🔗
|
apache2 |
https://github.com/ArchiveTeam/wpull/issues/392#issuecomment-428750850 okay, for a rainy day, yeah. that does not look like fun to debug. |
09:45
🔗
|
apache2 |
but it seems to me that it wouldn't be too troublesome to solve this in a second pass, looking for youtube links in the WARC, archiving those, and replacing at playback time (or rewriting the WARC) |
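The first step of such a second pass, finding the YouTube links, could be sketched as below. This only covers the link-discovery part on already-extracted HTML text. The regex, the name `find_youtube_ids`, and the assumption that video IDs are 11 characters from `[A-Za-z0-9_-]` are illustrative guesses, not something any of the tools mentioned here is known to do.

```python
import re
from typing import List

# Assumption: video IDs are 11 characters of letters, digits, "_" or "-".
# Only embed/, watch?v= (v as the first parameter) and youtu.be/ forms are covered.
YOUTUBE_ID = re.compile(
    r"(?:youtube(?:-nocookie)?\.com/(?:embed/|watch\?v=)|youtu\.be/)"
    r"([A-Za-z0-9_-]{11})"
)


def find_youtube_ids(text: str) -> List[str]:
    """Return the unique video IDs found in text, in order of first appearance."""
    seen = {}
    for match in YOUTUBE_ID.finditer(text):
        seen.setdefault(match.group(1), None)
    return list(seen)


page = (
    '<iframe src="https://www.youtube.com/embed/dQw4w9WgXcQ"></iframe>'
    '<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">same video</a>'
    '<a href="https://youtu.be/abcdefghijk">another</a>'
)
print(find_youtube_ids(page))
```

The resulting IDs could then be handed to an external downloader such as youtube-dl, with the videos stored alongside the WARC rather than written into it.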
09:46
🔗
|
JAA |
Rewriting WARC contents is punishable by death. |
09:47
🔗
|
JAA |
A second pass would require adding a WARC parser, which wpull doesn't need at the moment. That's a lot of extra work. And it wouldn't really solve the issue either. |
09:47
🔗
|
JAA |
The proxy just needs to be implemented differently. |
09:47
🔗
|
JAA |
But with people constantly setting their websites on fire, there hasn't been any development on these underlying tools in a while. |
09:48
🔗
|
apache2 |
a WARC parser seems like a pretty fundamental part of the ecosystem. To be clear I'm not suggesting that wpull or whatever should be changed, just hypothesizing how I could achieve my goal of playing back the websites correctly. |
09:49
🔗
|
apache2 |
but yeah, that thing fiddling around with the asyncio objects does seem kind of nasty |
09:49
🔗
|
JAA |
Oh yeah, there are several WARC parsers. warcio is a popular one. |
09:50
🔗
|
JAA |
Indeed. I'm not sure what issues it causes exactly (FalconK only mentioned it "breaks encapsulation or uses undefined behaviour" without going into details), but it's obviously not hard to imagine it will do just that. |
09:52
🔗
|
apache2 |
is this the reason that IA *sometimes* has worked with youtube, but usually doesn't? |
09:52
🔗
|
apache2 |
or is that a policy thing due to storage requirements? |
09:54
🔗
|
JAA |
No, IA/the WBM does its own thing with YouTube videos, and I'm not sure how it works. Doesn't have anything to do with this though, and I'm pretty sure even if wpull could use youtube-dl, the playback in the WBM wouldn't work based on that. |
10:03
🔗
|
|
icedice has quit IRC (Quit: Leaving) |
10:59
🔗
|
|
odemgi has joined #archiveteam-bs |
11:00
🔗
|
|
odemgi_ has quit IRC (Read error: Connection reset by peer) |
11:03
🔗
|
|
BlueMax has quit IRC (Read error: Connection reset by peer) |
11:11
🔗
|
|
godane has joined #archiveteam-bs |
11:25
🔗
|
|
lempamo has quit IRC (Read error: Operation timed out) |
11:26
🔗
|
|
lempamo has joined #archiveteam-bs |
11:42
🔗
|
|
Ryz has quit IRC (Quit: Ping timeout (120 seconds)) |
11:44
🔗
|
|
kiska18 has quit IRC (Ping timeout (120 seconds)) |
11:47
🔗
|
|
kiska18 has joined #archiveteam-bs |
11:47
🔗
|
|
svchfoo1 sets mode: +o kiska18 |
11:47
🔗
|
|
Ryz has joined #archiveteam-bs |
14:22
🔗
|
|
DLoader has quit IRC (Quit: DLoader) |
14:24
🔗
|
|
DLoader has joined #archiveteam-bs |
14:28
🔗
|
JAA |
LowLevelM: Lots of wikis have some forum-like thing, so in my opinion, that's not really an argument against handling it in #wikiteam. |
14:58
🔗
|
|
K4k has joined #archiveteam-bs |
15:16
🔗
|
LowLevelM |
kiska: Why not hackint? |
15:16
🔗
|
kiska |
Cause it's already here? |
15:16
🔗
|
LowLevelM |
OH |
15:16
🔗
|
LowLevelM |
Oops caps |
15:17
🔗
|
kiska |
Aka https://www.archiveteam.org/index.php?title=Soundcloud |
15:55
🔗
|
|
mls_ has quit IRC (Ping timeout: 258 seconds) |
16:00
🔗
|
|
mls_ has joined #archiveteam-bs |
16:36
🔗
|
|
jamiew has joined #archiveteam-bs |
16:36
🔗
|
|
jamiew has left Textual IRC Client: www.textualapp.com |
16:36
🔗
|
|
jamiew has joined #archiveteam-bs |
16:43
🔗
|
|
jamiew has quit IRC (Textual IRC Client: www.textualapp.com) |
16:43
🔗
|
|
jamiew has joined #archiveteam-bs |
16:45
🔗
|
|
HP_Archiv has joined #archiveteam-bs |
17:28
🔗
|
|
jamiew has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…) |
17:47
🔗
|
|
jamiew has joined #archiveteam-bs |
17:52
🔗
|
|
ripdog has quit IRC (Ping timeout: 246 seconds) |
18:00
🔗
|
|
jamiew has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…) |
18:01
🔗
|
|
mls_ has quit IRC (Ping timeout: 258 seconds) |
18:06
🔗
|
|
schbirid has joined #archiveteam-bs |
18:08
🔗
|
|
jamiew has joined #archiveteam-bs |
18:11
🔗
|
|
Jens has quit IRC (Remote host closed the connection) |
18:12
🔗
|
|
Jens has joined #archiveteam-bs |
18:15
🔗
|
|
mls_ has joined #archiveteam-bs |
18:27
🔗
|
|
killsushi has quit IRC (Quit: Leaving) |
18:57
🔗
|
|
antomati_ is now known as antomatic |
19:36
🔗
|
|
anarcat has quit IRC (Ping timeout: 252 seconds) |
19:43
🔗
|
|
anarcat has joined #archiveteam-bs |
19:47
🔗
|
|
DogsRNice has joined #archiveteam-bs |
20:57
🔗
|
|
godane1 has joined #archiveteam-bs |
20:57
🔗
|
|
chirlu has joined #archiveteam-bs |
20:59
🔗
|
|
godane has quit IRC (Read error: Connection reset by peer) |
21:37
🔗
|
|
mls_ has quit IRC (Ping timeout: 258 seconds) |
21:46
🔗
|
|
jamiew has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…) |
21:50
🔗
|
|
mls_ has joined #archiveteam-bs |
21:53
🔗
|
|
mls_ has quit IRC (Client Quit) |
22:10
🔗
|
|
geniibunt has joined #archiveteam-bs |
22:11
🔗
|
geniibunt |
Not sure if this is the correct channel, but... "The developerWorks Connections platform will be sunset on December 31, 2019. On January 1, 2020, this forum will no longer be available." https://www.ibm.com/developerworks/community/forums/html/forum?id=2eb0f36d-9534-471b-8b27-c21e6c5b9b2b I used the option in Wayback to store that specific page now but I do not know if the system there will now auto-crawl it |
22:12
🔗
|
geniibunt |
Should I do something like try to run a recursive wget and then later submit whatever I get to Wayback? |
22:13
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
22:16
🔗
|
geniibunt |
( or possibly if it's already known that this site would be taken down and steps are/have been taken this would be useful to know also ) |
22:27
🔗
|
|
mls_ has joined #archiveteam-bs |
22:38
🔗
|
JAA |
Yeah, this is the correct channel. I think I looked into it a week or two ago and saw a lot of interactive stuff which wouldn't be grabbed with wget/wpull/ArchiveBot, but maybe I'm confusing it with something else. Lots of shit shutting down at the moment. |
22:44
🔗
|
geniibunt |
Looks like the larger issue is the entire /community/ hierarchy |
23:01
🔗
|
|
geniibunt has quit IRC () |
23:07
🔗
|
|
Ravenloft has joined #archiveteam-bs |
23:17
🔗
|
|
deevious has quit IRC (Ping timeout: 252 seconds) |
23:19
🔗
|
|
deevious has joined #archiveteam-bs |
23:21
🔗
|
|
eientei95 has quit IRC (Read error: Connection reset by peer) |
23:26
🔗
|
|
eientei95 has joined #archiveteam-bs |
23:29
🔗
|
|
tech234a has joined #archiveteam-bs |
23:41
🔗
|
wp494 |
so is there any reason why we're starting to split stuff between here on EFnet and hackint all of a sudden? |
23:41
🔗
|
JAA |
EFnet sucks, hackint sucks less. |
23:41
🔗
|
JAA |
Services, more stable, etc. |
23:42
🔗
|
wp494 |
then what's stopping us from choosing some day as a sort of a "d-day" to switch all the things over |
23:42
🔗
|
wp494 |
I mean I understand the benefits of why we would want something like services/registration/etc |
23:43
🔗
|
JAA |
Well, moving hundreds of people over isn't going to happen in one day anyway. |
23:43
🔗
|
JAA |
And the other question of course is whether everyone/almost everyone is actually okay with that. |
23:43
🔗
|
JAA |
So far, hackint has been mostly an experiment. |
23:44
🔗
|
JAA |
A couple nerds went there end of August when the netsplits here annoyed us enough, and then we created a few channels there to get some real-world experience with using it instead of just artificial test channels. It's been working very well so far. |
23:45
🔗
|
JAA |
Which caused others to want to create new project channels there as well, etc. |
23:46
🔗
|
JAA |
Actually moving existing channels over hasn't really been discussed so far (although #archiveteam, -bs, and -ot all exist over there, but they're hardly used). |
23:49
🔗
|
wp494 |
I imagine moving #archivebot over would be the biggest PITA as far as moving anything existing over |
23:49
🔗
|
wp494 |
though I guess you could just warn in the topic that the channel's gonna get muted on <x> date and then on that date set +m on the channel |
23:51
🔗
|
|
wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES) |
23:52
🔗
|
JAA |
But nobody reads topics. :-P |
23:52
🔗
|
JAA |
pls |
23:53
🔗
|
JAA |
Moving ArchiveBot would actually be one of the easier channels I think. There's only a handful of people that are actually active, and moving the bot is trivial. |
23:56
🔗
|
|
wp494 has joined #archiveteam-bs |