#archiveteam-bs 2019-12-04,Wed

Time Nickname Message
00:42 🔗 Maylay has quit IRC (Quit: No Ping reply in 300 seconds.)
00:44 🔗 manwith1n is now known as ranma
00:53 🔗 Maylay has joined #archiveteam-bs
01:12 🔗 Maylay has quit IRC (Remote host closed the connection)
01:15 🔗 Maylay has joined #archiveteam-bs
01:18 🔗 X-Scale has quit IRC (Quit: HydraIRC -> http://www.hydrairc.com <- Organize your IRC)
01:19 🔗 zino_ has quit IRC (Read error: Operation timed out)
01:21 🔗 zino has joined #archiveteam-bs
01:21 🔗 Maylay has quit IRC (Quit: No Ping reply in 300 seconds.)
01:24 🔗 Maylay has joined #archiveteam-bs
01:32 🔗 X-Scale has joined #archiveteam-bs
01:33 🔗 VerifiedJ has quit IRC (Quit: Leaving)
01:40 🔗 Maylay has quit IRC (Quit: No Ping reply in 300 seconds.)
01:45 🔗 sirvy_ has joined #archiveteam-bs
01:50 🔗 sirvy has quit IRC (Ping timeout: 615 seconds)
01:55 🔗 Maylay has joined #archiveteam-bs
02:04 🔗 Maylay has quit IRC (Quit: No Ping reply in 300 seconds.)
02:06 🔗 Maylay has joined #archiveteam-bs
02:14 🔗 Maylay has quit IRC (Remote host closed the connection)
02:24 🔗 Maylay has joined #archiveteam-bs
02:27 🔗 markedL someone smarter than me, how was this thing escaped: data-uix-load-more-href=\"\/browse_ajax?action_continuation=1\u0026amp;continuation=4qmFsgIqEhpWTExMTXU1Z1BtS3A1YXYwUUNBYWpLVE1odxoMZWdkUVZEcERUV2RD\"\u003e\u003c
02:30 🔗 ivan markedL: an HTML-safe JSON encoder that also escapes / to \/, followed by html entity escaping
02:31 🔗 ivan the reason some JSON encoders do that is to prevent </script> from closing a script (and starting a new one)
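The two layers ivan describes can be peeled off in reverse order with the Python standard library. This is a minimal sketch: it assumes the pasted fragment is the inside of a JSON string literal, and the helper names `decode_fragment` and `attribute_value` are made up for illustration.

```python
import html
import json
import re

# The fragment exactly as pasted in the channel (inside of a JSON string literal).
raw = r'data-uix-load-more-href=\"\/browse_ajax?action_continuation=1\u0026amp;continuation=4qmFsgIqEhpWTExMTXU1Z1BtS3A1YXYwUUNBYWpLVE1odxoMZWdkUVZEcERUV2RD\"\u003e\u003c'


def decode_fragment(fragment: str) -> str:
    """Undo the JSON layer by parsing the fragment as a JSON string."""
    # Wrapping in quotes turns the fragment into a complete JSON string, so the
    # parser resolves the escaped quotes, escaped slashes and unicode escapes.
    return json.loads('"' + fragment + '"')


def attribute_value(markup: str, name: str) -> str:
    """Extract one double-quoted attribute and undo HTML entity escaping."""
    match = re.search(re.escape(name) + r'="([^"]*)"', markup)
    if match is None:
        raise ValueError(f"attribute {name!r} not found")
    return html.unescape(match.group(1))


markup = decode_fragment(raw)  # HTML markup; the attribute still contains &amp;
href = attribute_value(markup, "data-uix-load-more-href")  # usable relative URL

print(markup)
print(href)
```

The order matters: the JSON layer has to come off first, because the `&amp;` entity is itself partly hidden behind the `\u0026` escape.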
02:36 🔗 kode54 has quit IRC (Quit: The Lounge - https://thelounge.chat)
02:48 🔗 Maylay has quit IRC (Remote host closed the connection)
02:51 🔗 Maylay has joined #archiveteam-bs
03:06 🔗 Maylay has quit IRC (Quit: No Ping reply in 300 seconds.)
03:08 🔗 Maylay has joined #archiveteam-bs
03:12 🔗 kode54 has joined #archiveteam-bs
03:15 🔗 Maylay has quit IRC (Quit: No Ping reply in 300 seconds.)
03:17 🔗 Maylay has joined #archiveteam-bs
03:29 🔗 Maylay has quit IRC (Read error: Operation timed out)
03:37 🔗 kode54 has quit IRC (Remote host closed the connection)
03:38 🔗 kode54 has joined #archiveteam-bs
04:00 🔗 Maylay has joined #archiveteam-bs
04:00 🔗 Maylay has quit IRC (Remote host closed the connection)
04:02 🔗 DLoader has quit IRC (Quit: DLoader)
04:07 🔗 cppchrisc has joined #archiveteam-bs
04:07 🔗 cppchrisc has quit IRC (Connection closed)
04:08 🔗 cppchrisc has joined #archiveteam-bs
04:10 🔗 qw3rty has joined #archiveteam-bs
04:19 🔗 qw3rty2 has quit IRC (Ping timeout: 745 seconds)
04:27 🔗 odemgi_ has joined #archiveteam-bs
04:28 🔗 Maylay has joined #archiveteam-bs
04:32 🔗 odemgi has quit IRC (Read error: Operation timed out)
04:34 🔗 sirvy has joined #archiveteam-bs
04:38 🔗 sirvy_ has quit IRC (Ping timeout: 615 seconds)
04:46 🔗 K4k has quit IRC (Read error: Connection reset by peer)
04:49 🔗 killsushi has joined #archiveteam-bs
05:18 🔗 tech234a has quit IRC (Quit: Connection closed for inactivity)
05:59 🔗 SketchCow I know
06:00 🔗 SketchCow They still need to qa the thing
06:04 🔗 Kaz it's cool, looks like it's deployed
06:10 🔗 markedL update on Youtube liked-lists, going to need a repo and target soon. 4.4TB of HTML in warc.gz
06:12 🔗 Raccoon markedL: can you do a search of your scrapings for videos by channel name
06:13 🔗 markedL can you join the project channel?
06:13 🔗 Raccoon a channel that was recently removed has been backed up, but no metadata.
06:13 🔗 Raccoon sort of a side question anyway
06:14 🔗 Raccoon just wondering if you have any said metadata
06:18 🔗 markedL I just process what I'm handed, you're welcome to ask the upstream data providers
07:13 🔗 DLoader has joined #archiveteam-bs
07:36 🔗 godane has quit IRC (Ping timeout: 252 seconds)
07:49 🔗 klg has quit IRC (Ping timeout: 258 seconds)
07:54 🔗 klg has joined #archiveteam-bs
09:16 🔗 apache2 what's your favorite tool for crawling and producing WARC archives?
09:16 🔗 apache2 (and is WARC the recommended format?)
09:21 🔗 jodizzle apache2: Yes, WARC is the generally recommended format. There are a bunch of tools people use, but this one is pretty good for personal use: https://github.com/archiveteam/grab-site
09:21 🔗 jodizzle There's also a tools section on the wiki: https://www.archiveteam.org/index.php?title=The_WARC_Ecosystem#Tools
09:30 🔗 apache2 thank you jodizzle
09:31 🔗 apache2 follow-up question: what do you personally use for searching/extracting from WARC dumps, and is there something like a browser-based application that can browse a WARC archive WITHOUT internet connections being made?
09:32 🔗 JAA zgrep for searching WARCs because I'm insane. For playback, pywb is pretty good.
09:33 🔗 JAA There are tools for extracting WARC contents into plain files, but I don't have much experience with them.
09:33 🔗 JAA At most, I might dump the response contents to stdout (using my own tool, warc-tiny) and process them using grep, sed, awk, etc. to extract the information I'm interested in.
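For readers who want the zgrep approach without a shell, a rough standard-library equivalent might look like the sketch below. The function name `grep_warc_gz` is hypothetical, and this is not JAA's warc-tiny. Like zgrep, it only sees what is left after the outer gzip layer is removed, so HTTP bodies that were themselves compressed or chunked inside the WARC will not match.

```python
import gzip
import re
from typing import Iterator, Tuple


def grep_warc_gz(path: str, pattern: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (line_number, line) for each line of a .warc.gz matching pattern."""
    regex = re.compile(pattern)
    # gzip.open reads straight through concatenated gzip members, which is the
    # usual layout of .warc.gz files (one member per record).
    with gzip.open(path, "rb") as stream:
        for number, line in enumerate(stream, start=1):
            if regex.search(line):
                yield number, line.rstrip(b"\r\n")
```

A call such as `list(grep_warc_gz("crawl.warc.gz", rb"^WARC-Target-URI:"))` would list the target URI header of every record; the file name here is a placeholder.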
09:34 🔗 apache2 I'd primarily be looking for something that could extract article content (so some sort of CSS3-like selector syntax) and media linked from those articles
09:34 🔗 apache2 fair enough, I'm happy to hack something together myself too. I'll look into pywb and grab-site
09:35 🔗 apache2 how does grab-site deal with media content like streaming video?
09:36 🔗 JAA It's complicated.
09:36 🔗 JAA If it's a simple <video> tag with a URL in <source> or whatever, wpull (the tool under the hood of grab-site and also ArchiveBot) will pick that up.
09:37 🔗 JAA If it's done with an M3U(8) playlist, e.g. DASH, then not.
09:38 🔗 apache2 I'm half-expecting to bump into articles with embedded youtube links
09:38 🔗 JAA If JavaScript is involved, it might. Basically, if the video file (not playlist) URL appears in JS as one string, it should normally find it. If there's escaping going on, that might still not work though.
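The "one string" condition can be illustrated with a toy scanner. This is only a sketch of the general idea: the `MEDIA_URL` pattern and `find_media_urls` are invented for the example and are not wpull's actual link extraction logic, and the example.com URLs are placeholders.

```python
import re
from typing import List

# Hypothetical pattern: an absolute URL ending in a common video extension.
MEDIA_URL = re.compile(r"""https?://[^\s"'<>\\]+\.(?:mp4|webm|ogv)\b""")


def find_media_urls(js_source: str) -> List[str]:
    """Return every media URL that appears literally in the script text."""
    return MEDIA_URL.findall(js_source)


# URL present as a single literal string: found.
plain = 'player.load("https://example.com/media/clip.mp4");'
# Same URL with JSON-style escaped slashes: the literal "://" never appears.
escaped = r'player.load("https:\/\/example.com\/media\/clip.mp4");'
# URL assembled at runtime from pieces: no complete URL exists in the source.
split = 'player.load("https://example.com/media/" + id + ".mp4");'

print(find_media_urls(plain))    # ['https://example.com/media/clip.mp4']
print(find_media_urls(escaped))  # []
print(find_media_urls(split))    # []
```

A scanner could unescape the source before matching to recover the second case, but the third needs the script to actually run, which is why JavaScript-heavy players tend to be missed.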
09:38 🔗 apache2 I believe youtube uses DASH?
09:38 🔗 JAA Yeah
09:38 🔗 JAA It definitely won't grab YouTube videos.
09:39 🔗 apache2 is there a separate tool that can fix up an existing warc with embedded youtube tags using youtube-dl externally?
09:39 🔗 JAA There is a youtube-dl integration in wpull (not sure if grab-site exposes it), but it's been broken for years.
09:40 🔗 JAA It should probably still work with wpull 1.2.3 to a degree. (That version isn't supported by grab-site anymore though.)
09:40 🔗 apache2 okay, I guess that's something to look into. thank you!
09:43 🔗 apache2 https://github.com/ArchiveTeam/wpull/issues/392#issuecomment-428750850 okay, for a rainy day, yeah. that does not look like fun to debug.
09:45 🔗 apache2 but it seems to me that it wouldn't be too troublesome to solve this in a second pass, looking for youtube links in the WARC, archiving those, and replacing at playback time (or rewriting the WARC)
09:46 🔗 JAA Rewriting WARC contents is punishable by death.
09:47 🔗 JAA A second pass would require adding a WARC parser, which wpull doesn't need at the moment. That's a lot of extra work. And it wouldn't really solve the issue either.
09:47 🔗 JAA The proxy just needs to be implemented differently.
09:47 🔗 JAA But with people constantly setting their websites on fire, there hasn't been any development on these underlying tools in a while.
09:48 🔗 apache2 a WARC parser seems like a pretty fundamental part of the ecosystem. To be clear I'm not suggesting that wpull or whatever should be changed, just hypothesizing how I could achieve my goal of playing back the websites correctly.
09:49 🔗 apache2 but yeah, that thing fiddling around with the asyncio objects does seem kind of nasty
09:49 🔗 JAA Oh yeah, there are several WARC parsers. warcio is a popular one.
09:50 🔗 JAA Indeed. I'm not sure what issues it causes exactly (FalconK only mentioned it "breaks encapsulation or uses undefined behaviour" without going into details), but it's obviously not hard to imagine it will do just that.
09:52 🔗 apache2 is this the reason that IA *sometimes* has worked with youtube, but usually doesn't?
09:52 🔗 apache2 or is that a policy thing due to storage requirements?
09:54 🔗 JAA No, IA/the WBM does its own thing with YouTube videos, and I'm not sure how it works. Doesn't have anything to do with this though, and I'm pretty sure even if wpull could use youtube-dl, the playback in the WBM wouldn't work based on that.
10:03 🔗 icedice has quit IRC (Quit: Leaving)
10:59 🔗 odemgi has joined #archiveteam-bs
11:00 🔗 odemgi_ has quit IRC (Read error: Connection reset by peer)
11:03 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
11:11 🔗 godane has joined #archiveteam-bs
11:25 🔗 lempamo has quit IRC (Read error: Operation timed out)
11:26 🔗 lempamo has joined #archiveteam-bs
11:42 🔗 Ryz has quit IRC (Quit: Ping timeout (120 seconds))
11:44 🔗 kiska18 has quit IRC (Ping timeout (120 seconds))
11:47 🔗 kiska18 has joined #archiveteam-bs
11:47 🔗 svchfoo1 sets mode: +o kiska18
11:47 🔗 Ryz has joined #archiveteam-bs
14:22 🔗 DLoader has quit IRC (Quit: DLoader)
14:24 🔗 DLoader has joined #archiveteam-bs
14:28 🔗 JAA LowLevelM: Lots of wikis have some forum-like thing, so in my opinion, that's not really an argument against handling it in #wikiteam.
14:58 🔗 K4k has joined #archiveteam-bs
15:16 🔗 LowLevelM kiska: Why not hackint?
15:16 🔗 kiska Cause it's already here?
15:16 🔗 LowLevelM OH
15:16 🔗 LowLevelM Oops caps
15:17 🔗 kiska Aka https://www.archiveteam.org/index.php?title=Soundcloud
15:55 🔗 mls_ has quit IRC (Ping timeout: 258 seconds)
16:00 🔗 mls_ has joined #archiveteam-bs
16:36 🔗 jamiew has joined #archiveteam-bs
16:36 🔗 jamiew has left Textual IRC Client: www.textualapp.com
16:36 🔗 jamiew has joined #archiveteam-bs
16:43 🔗 jamiew has quit IRC (Textual IRC Client: www.textualapp.com)
16:43 🔗 jamiew has joined #archiveteam-bs
16:45 🔗 HP_Archiv has joined #archiveteam-bs
17:28 🔗 jamiew has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…)
17:47 🔗 jamiew has joined #archiveteam-bs
17:52 🔗 ripdog has quit IRC (Ping timeout: 246 seconds)
18:00 🔗 jamiew has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…)
18:01 🔗 mls_ has quit IRC (Ping timeout: 258 seconds)
18:06 🔗 schbirid has joined #archiveteam-bs
18:08 🔗 jamiew has joined #archiveteam-bs
18:11 🔗 Jens has quit IRC (Remote host closed the connection)
18:12 🔗 Jens has joined #archiveteam-bs
18:15 🔗 mls_ has joined #archiveteam-bs
18:27 🔗 killsushi has quit IRC (Quit: Leaving)
18:57 🔗 antomati_ is now known as antomatic
19:36 🔗 anarcat has quit IRC (Ping timeout: 252 seconds)
19:43 🔗 anarcat has joined #archiveteam-bs
19:47 🔗 DogsRNice has joined #archiveteam-bs
20:57 🔗 godane1 has joined #archiveteam-bs
20:57 🔗 chirlu has joined #archiveteam-bs
20:59 🔗 godane has quit IRC (Read error: Connection reset by peer)
21:37 🔗 mls_ has quit IRC (Ping timeout: 258 seconds)
21:46 🔗 jamiew has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…)
21:50 🔗 mls_ has joined #archiveteam-bs
21:53 🔗 mls_ has quit IRC (Client Quit)
22:10 🔗 geniibunt has joined #archiveteam-bs
22:11 🔗 geniibunt Not sure if this is the correct channel, but... "The developerWorks Connections platform will be sunset on December 31, 2019. On January 1, 2020, this forum will no longer be available." https://www.ibm.com/developerworks/community/forums/html/forum?id=2eb0f36d-9534-471b-8b27-c21e6c5b9b2b I used the option in Wayback to store that specific page now but I do not know if the system there will now auto-crawl it
22:12 🔗 geniibunt Should I do something like try to run a recursive wget and then later submit whatever I get to Wayback?
22:13 🔗 schbirid has quit IRC (Quit: Leaving)
22:16 🔗 geniibunt ( or possibly if it's already known that this site would be taken down and steps are/have been taken this would be useful to know also )
22:27 🔗 mls_ has joined #archiveteam-bs
22:38 🔗 JAA Yeah, this is the correct channel. I think I looked into it a week or two ago and saw a lot of interactive stuff which wouldn't be grabbed with wget/wpull/ArchiveBot, but maybe I'm confusing it with something else. Lots of shit shutting down at the moment.
22:44 🔗 geniibunt Looks like the larger issue is the entire /community/ hierarchy
23:01 🔗 geniibunt has quit IRC ()
23:07 🔗 Ravenloft has joined #archiveteam-bs
23:17 🔗 deevious has quit IRC (Ping timeout: 252 seconds)
23:19 🔗 deevious has joined #archiveteam-bs
23:21 🔗 eientei95 has quit IRC (Read error: Connection reset by peer)
23:26 🔗 eientei95 has joined #archiveteam-bs
23:29 🔗 tech234a has joined #archiveteam-bs
23:41 🔗 wp494 so is there any reason why we're starting to split stuff between here on EFnet and hackint all of a sudden?
23:41 🔗 JAA EFnet sucks, hackint sucks less.
23:41 🔗 JAA Services, more stable, etc.
23:42 🔗 wp494 then what's stopping us from choosing some day as a sort of a "d-day" to switch all the things over
23:42 🔗 wp494 I mean I understand the benefits of why we would want something like services/registration/etc
23:43 🔗 JAA Well, moving hundreds of people over isn't going to happen in one day anyway.
23:43 🔗 JAA And the other question of course is whether everyone/almost everyone is actually okay with that.
23:43 🔗 JAA So far, hackint has been mostly an experiment.
23:44 🔗 JAA A couple nerds went there end of August when the netsplits here annoyed us enough, and then we created a few channels there to get some real-world experience with using it instead of just artificial test channels. It's been working very well so far.
23:45 🔗 JAA Which caused others to want to create new project channels there as well, etc.
23:46 🔗 JAA Actually moving existing channels over hasn't really been discussed so far (although #archiveteam, -bs, and -ot all exist over there, but they're hardly used).
23:49 🔗 wp494 I imagine moving #archivebot over would be the biggest PITA as far as moving anything existing over
23:49 🔗 wp494 though I guess you could just warn in the topic that the channel's gonna get muted on <x> date and then on that date set +m on the channel
23:51 🔗 wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES)
23:52 🔗 JAA But nobody reads topics. :-P
23:52 🔗 JAA pls
23:53 🔗 JAA Moving ArchiveBot would actually be one of the easier channels I think. There's only a handful of people that are actually active, and moving the bot is trivial.
23:56 🔗 wp494 has joined #archiveteam-bs
