#archiveteam-bs 2019-11-18,Mon


Time Nickname Message
00:41 🔗 Sokar has quit IRC (Read error: Connection reset by peer)
00:45 🔗 Sokar has joined #archiveteam-bs
00:48 🔗 Raccoon What are some tools that make the creation of page-scraping templates easier?
00:58 🔗 raeyulca has joined #archiveteam-bs
03:04 🔗 manjaro-u has quit IRC (Read error: Operation timed out)
03:11 🔗 DogsRNice has quit IRC (Read error: Connection reset by peer)
03:30 🔗 LowLevelM We should have a BAT wallet, so brave rewards users can donate.
04:24 🔗 odemgi_ has joined #archiveteam-bs
04:27 🔗 qw3rty has joined #archiveteam-bs
04:28 🔗 odemgi has quit IRC (Read error: Operation timed out)
04:34 🔗 qw3rty2 has quit IRC (Ping timeout: 745 seconds)
04:54 🔗 icedice has quit IRC (Quit: Leaving)
05:38 🔗 benjins has quit IRC (Read error: Connection reset by peer)
06:31 🔗 m007a83 has quit IRC (Quit: Fuck you Comcast)
07:30 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
07:38 🔗 schbirid has joined #archiveteam-bs
07:52 🔗 OrIdow6 has quit IRC (Quit: Leaving.)
07:52 🔗 d5f4a3622 has quit IRC (Read error: Connection reset by peer)
07:53 🔗 d5f4a3622 has joined #archiveteam-bs
08:11 🔗 deevious has joined #archiveteam-bs
08:30 🔗 d5f4a3622 has quit IRC (Ping timeout: 246 seconds)
08:32 🔗 d5f4a3622 has joined #archiveteam-bs
09:00 🔗 d5f4a3622 has quit IRC (Read error: Connection reset by peer)
09:01 🔗 d5f4a3622 has joined #archiveteam-bs
09:13 🔗 OrIdow6 has joined #archiveteam-bs
09:21 🔗 d5f4a3622 has quit IRC (Read error: Connection reset by peer)
09:44 🔗 Dragnog2 has quit IRC (Quit: Connection closed for inactivity)
09:49 🔗 d5f4a3622 has joined #archiveteam-bs
11:43 🔗 odemg has joined #archiveteam-bs
12:58 🔗 erkinalp has joined #archiveteam-bs
13:05 🔗 benjins has joined #archiveteam-bs
14:17 🔗 Flashfire has quit IRC (Remote host closed the connection)
14:17 🔗 kiska has quit IRC (Remote host closed the connection)
14:18 🔗 Flashfire has joined #archiveteam-bs
14:18 🔗 kiska has joined #archiveteam-bs
14:18 🔗 Fusl__ sets mode: +o kiska
14:18 🔗 Fusl sets mode: +o kiska
14:18 🔗 Fusl_ sets mode: +o kiska
15:04 🔗 akierig has joined #archiveteam-bs
15:15 🔗 Dallas has quit IRC (Quit: The Lounge - https://thelounge.chat)
15:30 🔗 dashcloud has joined #archiveteam-bs
15:32 🔗 jrwr dashcloud: Ya, We have been getting it from all sides this morning
15:32 🔗 jrwr Even Jason/SketchCow has been getting a ton of emails
15:46 🔗 SketchCow Well, I always get a bunch
15:47 🔗 SketchCow Make sure we're ACTUALLY getting them, though - that collection is a fucking nightmare of CGI
15:48 🔗 jrwr Sounds about like what we are dealing with and the FCC
15:52 🔗 Raccoon Is there any precedent or rapport for contacting Intel and asking for an archive directly?
15:55 🔗 VerifiedJ has joined #archiveteam-bs
15:56 🔗 jrwr We had grabbed a copy a few months ago
15:56 🔗 jrwr We are currently in the process of verifying it
15:59 🔗 Raccoon Is that just BIOS firmware, or all of the GFX, Sound, Networking etc drivers too?
16:04 🔗 jrwr From what I understand, It was everything
16:04 🔗 jrwr about 300GB
16:04 🔗 Raccoon wowza.
16:04 🔗 jrwr We did it as a thing, but now that it's in danger, we will verify the archive and download anything missing
16:05 🔗 Raccoon Still, would be neat to have that sort of direct contact and build of rapport.
16:05 🔗 Raccoon "Dear Intel: Hi, ArchiveTeam here -- folks from the Internet Archive, you know, 'The Wayback Machine.' Anyway, it'd be really nifty if we could arrange some access to make a backup copy of the support software and drivers you guys intend to delete this month. We regularly work with businesses that, for whatever reason, need to delete large portions of The Internet. It's our specialty and we're here to help."
16:06 🔗 Raccoon (Help us help you help us all.)
16:08 🔗 JAA jrwr: We did it because of that deletion. It's old news that's just now brought up again as the deadline comes closer.
16:08 🔗 jrwr Ah
16:08 🔗 jrwr Figured as much
17:01 🔗 Ryz Hello, I would like to make a special request: with Google wanting to index anything Adobe Flash related ( https://webmasters.googleblog.com/2019/10/goodbye-flash.html ), I would like to have a scraping of Adobe Flash results from normal searches, image searches (maybe?), and searches by filetype;
17:02 🔗 Ryz I might want to say use Bing since in the past they snagged results from Google, but they might've changed that formula in response to being found out
17:02 🔗 Ryz *Google wanting to de-index anything Adobe Flash related
17:03 🔗 Ryz I'm also doing it on behalf of the people at Flashpoint, to filter websites for them to scrape from
17:04 🔗 Ryz I ask this because it's too tedious to filter the websites manually to submit stuff to their lists - unfortunately, Google has anti-scraping technologies
17:09 🔗 Ryz Huh, interesting, "Anyway, this is specific to textual content / full sites that are in Flash; videos & animations generally wouldn't be indexed in web-search anyway.": https://twitter.com/JohnMu/status/1188795137591259136
17:16 🔗 akierig has quit IRC (Quit: later_gator)
17:20 🔗 akierig has joined #archiveteam-bs
17:48 🔗 Raccoon Ryz: there are a few swf archives, ie swfchan.com, born out of the silly 2chan 4chan and newgrounds music video animated shorts scene / school projects. I spent several years archiving anything that appeared on 4chan/f/ which I hope to convert to mp4 some day.
17:48 🔗 Ryz To clarify, I meant scraping search results from Google regarding anything Adobe Flash related~ oo;
17:49 🔗 Raccoon Oh. I thought Google's response was basically that embedded flash videos don't index on google
17:52 🔗 Ryz It would be awesome if there was a dictionary-style scraping, like say '[dictionary word] flash', '[dictionary word] flash game', 'filetype:swf [dictionary word]' just for the coverage~
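The dictionary-style coverage Ryz describes could be sketched roughly as below. This only builds the query strings from the three patterns quoted above; it does not contact any search engine. The function name and the sample word list are hypothetical.

```python
# Query patterns taken from the message above; {word} is the dictionary word.
TEMPLATES = ["{word} flash", "{word} flash game", "filetype:swf {word}"]


def build_queries(words, templates=TEMPLATES):
    """Return one query string per (word, template) pair, grouped by word."""
    return [t.format(word=w) for w in words for t in templates]


if __name__ == "__main__":
    # Sample words are placeholders, not a real dictionary.
    for query in build_queries(["water", "puzzle"]):
        print(query)
```

Feeding the resulting list to a scraper is a separate problem; as noted later in the log, scraping Google is hard.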
17:52 🔗 Raccoon I wonder
17:54 🔗 Raccoon you know how Adobe (and other sites) would give you copypasta <embed><object> tags that included a link back to downloading and installing Adobe Flash Plugin / Adobe Shockwave Plugin etc
17:54 🔗 Raccoon Maybe find some of those standard "Please install..." templates, and search google for those quoted text.
18:01 🔗 manjaro-u has joined #archiveteam-bs
18:03 🔗 Ryz Well, that's a clever idea; unfortunately, I do know that not all of them do that procedure, or not at all from finding the more obscure websites~ ><;
18:04 🔗 Raccoon aye. and it'd mean finding all the various iterations of that template, between <embed> and <object> and Shockwave and Flash, version to version
18:05 🔗 Ryz Surely there's a specialized search engine in which you search by HTML or web code instead of just the text content ><;
18:06 🔗 Raccoon I wonder if IA takes bribes to keyword scan the entire WBM
18:08 🔗 markedL I'm not an expert in the common crawl, but it's worth asking someone who is
18:11 🔗 JAA Ryz: If you want me to scrape anything from Bing, let me know.
18:11 🔗 JAA (Or well, the script is in my little-things repo.)
18:12 🔗 Ryz Does Bing still have copied results from Google? oo;
18:12 🔗 JAA No idea.
18:13 🔗 Ryz I would ideally want Google, because it may have an ever-changing distinct take of what it does ><;
18:14 🔗 JAA Good luck. Scraping Google is hard.
18:15 🔗 Raccoon Apparently there are other file extensions besides .swf, such as .dcr (per https://wikiext.com/dcr) (per https://ia800709.us.archive.org/2/items/shockwave-games.net/shockwave-games.net.html)
18:23 🔗 erkinalp has quit IRC (Ping timeout: 260 seconds)
18:26 🔗 Ryz Hm, when doing the search query for 'filetype:swf water game' on both Bing and Google, they have different results, and the latter is far more far-reaching than the former: the former goes up to 6 pages for this search query while the latter goes up to 20 pages with default filtering of very similar entries xX;
18:35 🔗 Larsenv Hey, so I gave Ryz this site to archive: http://inchwormanimation.com/
18:35 🔗 Raccoon And 32 pages on Google with "omitted results included", not really seeing repeat domain names.
18:36 🔗 Larsenv they said to ask someone here about archiving it
18:36 🔗 Ryz Yeah, this seems a bit more complicated than on first glance~
18:37 🔗 Larsenv "might need some specialized help since this looks ancient; probably ask people for help"
18:38 🔗 Larsenv that's what Ryz said
18:38 🔗 Ryz The thing is, it looks like it's barricaded by a mix of JavaScript and .swf - though it seems it's not using .swf anymore and is playing videos in .mp4 instead
18:38 🔗 Raccoon also "for Nintendo DSi"?
18:39 🔗 Larsenv yeah, inchworm animation is a flipnote-like animation software for the DSi
18:39 🔗 Larsenv and the 3DS
18:40 🔗 Raccoon oh, an editor software
18:40 🔗 Larsenv this seems to be the successor
18:40 🔗 Larsenv http://www.butterflyanimation.com/
18:43 🔗 Raccoon the entire site seems rather tiny. you want it in a zip or on the waybackmachine
18:43 🔗 JAA Eww, zip.
18:44 🔗 Raccoon If zip is so eeew, then why doesn't IA support "look inside" on 7z files? :p
18:44 🔗 JAA I mean the idea of using zip for a website archive.
18:44 🔗 JAA WARC or bust
18:45 🔗 Raccoon I'm told that's a write-only container :p
18:46 🔗 JAA pywb does a good job of local playback, and the WBM is, well, the WBM.
18:47 🔗 JAA Still, it's the most suitable format for archiving web content.
18:49 🔗 Raccoon oh wee, wget does warc cuz it's badass
18:49 🔗 JAA Yeah, unfortunately it does it wrong, but oh well.
18:49 🔗 JAA Or well, "wrong".
18:49 🔗 Raccoon but the wiki says it does it right
18:49 🔗 JAA Then the wiki is outdated and hasn't been updated with the 1.19.4 debacle.
18:50 🔗 Raccoon ", with the notable exception of Wget, a popular WARC-producing program, which, since February 2016, has used the angle brackets [correctly as specified]"
18:51 🔗 JAA Well, here's the thing, the standard was broken. The grammar didn't match the examples. So it's both correct and wrong, in a sense.
18:51 🔗 Raccoon oh, i guess v1.1 eliminates angle brackets for backwards compatibility with badly written software
18:51 🔗 Raccoon instead of correcting the software, uncorrect the standard ;)
18:52 🔗 Raccoon what happened in 1.19.4 specifically?
18:52 🔗 JAA Considering that the software authors were involved in the creation of the WARC standard, it was clearly a bug in the standard since *everyone* agreed that there shouldn't be angle brackets around the WARC-Target-URI.
18:52 🔗 erkinalp has joined #archiveteam-bs
18:53 🔗 JAA The angle brackets were added in 1.19.4.
18:53 🔗 Raccoon oh, that recently? wow
18:53 🔗 Raccoon 28500:2007 is 2007 isn't it
18:54 🔗 JAA Yep, WARC/1.0 is over a decade old.
18:56 🔗 Raccoon so wget 1.19.4 was committed on 23-Jan-2018, after WARC 1.1 ISO 28500:2017
18:56 🔗 JAA I've been considering sending a minimal patch to wget that bumps the WARC version and removes the angle brackets. No other changes would be needed I believe.
18:56 🔗 JAA But I'm waiting for wumpus's strict conformity check tool to verify that those files are then indeed according to spec.
18:56 🔗 Raccoon does darnir even know about this drama?
18:56 🔗 JAA No clue.
18:57 🔗 Raccoon he seems like the kinda guy who would drop everything to fix it if just asked. freenode/#wget
18:57 🔗 Raccoon well, I mean, they all have been very responsive over the years. micahcowan and giuseppe included.
18:57 🔗 JAA Here's the pywb issue about this, which also links to an email on bug-wget: https://github.com/webrecorder/pywb/issues/294
18:59 🔗 JAA I believe the WARC code in wget was originally written by someone from ArchiveTeam and then mostly just stayed in the source without maintenance. Could be wrong though since that was well before my time here.
18:59 🔗 Raccoon still weird that the "problem" began in 2018, already after the 2017 ISO came out deprecating the 1.0 format
19:00 🔗 JAA Yup, problem was noticed and fixed, and then someone said "yup, we'll now write WARCs that break all existing tooling".
19:02 🔗 Raccoon so wget is still broken as of 1.20.3?
19:06 🔗 Raccoon i wonder if there's a way to patch existing anglebracketcontaining WARCs by binary search-and-replace
19:06 🔗 Raccoon ie, replacing the < > with spaces 0x20
19:07 🔗 JAA Yes, it's still producing angle-bracketed WARCs.
19:07 🔗 Raccoon saving time and money recompiling every tainted archive in the past 2.9 years
19:07 🔗 JAA No, firstly because WARCs are (almost always) compressed, and secondly because you'll only want to replace it in specific headers. So it'd need some different tooling.
19:08 🔗 Raccoon ah, gzip
19:08 🔗 JAA Yup
19:08 🔗 Raccoon wasn't the original ARC something not-gzip compression?
19:08 🔗 JAA And you can't do zcat file.warc.gz | replace | gzip >out.warc.gz either because it's per-record compression.
19:08 🔗 JAA No idea, I never looked into ARC.
19:09 🔗 Raccoon (i haven't used ARC in 25-30 years)
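A minimal sketch of the kind of tooling JAA describes above: because each record is its own gzip member, the file is walked member by member, the angle brackets are stripped only from the WARC-Target-URI header line, and each record is recompressed separately. The function names are hypothetical, the whole file is held in memory, and the WARC version line is left untouched. Byte offsets change, so any existing index of the file would presumably need regenerating.

```python
import gzip
import re
import zlib

# Matches only the WARC-Target-URI header line; WARC headers end in CRLF.
TARGET_URI = re.compile(
    rb"^(WARC-Target-URI:[ \t]*)<([^\r\n]*)>(\r?)$", re.MULTILINE
)


def iter_members(data):
    """Yield the decompressed bytes of each gzip member in data."""
    while data:
        decomp = zlib.decompressobj(wbits=31)  # 31 = gzip container
        record = decomp.decompress(data) + decomp.flush()
        yield record
        data = decomp.unused_data  # bytes after the end of this member


def strip_brackets(record):
    """Remove < > around WARC-Target-URI in the WARC header block only."""
    header, sep, body = record.partition(b"\r\n\r\n")
    return TARGET_URI.sub(rb"\1\2\3", header) + sep + body


def rewrite(data):
    """Rewrite every record and recompress each one as its own member."""
    return b"".join(
        gzip.compress(strip_brackets(rec)) for rec in iter_members(data)
    )
```

Only the first header block of each record is touched, so HTTP headers and payload bytes that happen to contain angle brackets stay as they were.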
19:14 🔗 Dallas has joined #archiveteam-bs
19:53 🔗 Raccoon JAA: darnir said he'll look into [fixing] it when he gets home.
19:54 🔗 Raccoon linked him to this channel, should he show up
19:54 🔗 JAA Raccoon: Nice, thank you!
19:55 🔗 Raccoon twuz nutin
19:58 🔗 Raccoon incidentally, 1.19.x is when the project was transitioning from giuseppe to darnir, so probably a simple case of institutional memory loss
19:59 🔗 JAA Ah, interesting, and makes sense.
20:09 🔗 JAA So about Intel: apparently they've already begun deleting stuff. Various entries that existed in early September are gone now.
20:11 🔗 Raccoon That's what https://www.vogons.org/viewtopic.php?f=46&t=69184 was saying. "... I also discovered Intel had already removed all their drivers just two days ago (!!) (on September 13th, 2019)."
20:12 🔗 Raccoon you got lucky
20:12 🔗 JAA Indeed, that grab finished on the 9th.
20:12 🔗 Raccoon JAA: do you want to talk more about WARC
20:13 🔗 JAA Raccoon: In general, absolutely, but not right now.
20:14 🔗 Raccoon rockdaboot has minutiae questions I don't have intelligent answers to. /join #wget on freenode
20:14 🔗 Raccoon also requesting a copy of the ISO PDF (behind paywall)
20:15 🔗 JAA Yeah, I don't have that.
20:16 🔗 JAA All I know is it's supposedly virtually identical to http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
20:17 🔗 JAA And http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf is the "latest draft" as suggested by the filename.
20:17 🔗 Raccoon ah, i think that'll be reasonable
20:18 🔗 Raccoon thanks
20:20 🔗 Raccoon there's really only that singular change in behavior, or a whole nest of things between 1.0 and 1.1?
20:26 🔗 Zerote__ has joined #archiveteam-bs
20:26 🔗 Nick-PC_ has joined #archiveteam-bs
20:29 🔗 Nick-PC has quit IRC (Ping timeout: 252 seconds)
20:29 🔗 Zerote_ has quit IRC (Ping timeout: 252 seconds)
20:30 🔗 JAA I believe that's the only critical change. Everything else was added features (like more flexible and precise timestamps and new header fields) which are optional. Not 100 % sure though.
20:43 🔗 prq just double checking-- a 'wget --mirror' that writes to a warc file can't be stopped and resumed, right?
20:44 🔗 prq so if there's a power outage or other issue, I can't resume exactly where I left off?
21:01 🔗 erkinalp has quit IRC (Ping timeout: 260 seconds)
21:11 🔗 odemgi has joined #archiveteam-bs
21:16 🔗 odemgi_ has quit IRC (Read error: Operation timed out)
21:26 🔗 Hani111 has joined #archiveteam-bs
21:34 🔗 Raccoon prq: <darnir[m]> I think, yes. I'd have to test it, but if not, it shouldn't be too difficult to implement it.
21:35 🔗 Raccoon prq: if you want a certain behavior that isn't, now is the best time to find the bugs you want fixed.
21:37 🔗 Hani has quit IRC (Ping timeout: 745 seconds)
21:37 🔗 Hani111 is now known as Hani
21:41 🔗 prq I'm happy discussing in either location, lol
21:42 🔗 manjaro-u has quit IRC (Read error: Connection reset by peer)
21:43 🔗 manjaro-u has joined #archiveteam-bs
21:43 🔗 Raccoon i was suggesting maybe starting a second test instance of wget --mirror --warc-file that you intentionally interrupt and then restart again :)
21:44 🔗 prq I've tried that, and it definitely seems to start over from the beginning
21:44 🔗 prq so if I've done 300,000 requests, it'll do them again. as long as the content hasn't changed, the warc file doesn't seem to get bigger, which is good
21:45 🔗 prq but with a --wait delay built in, it can take a very very long time to go through all those requests again.
21:45 🔗 Raccoon yeah, that's kinda just how wget works (right now), it doesn't store a fast_resume_cache to restart an interrupted mirror
21:45 🔗 prq as far as I can tell, wget must keep its progress state in memory
21:45 🔗 Raccoon right
21:46 🔗 Raccoon all --mirrors work that way. session stateless, utilize either timestamps or noclobber
21:46 🔗 Raccoon skip as needed in the present
21:50 🔗 prq I've been reading about the wpull project-- not sure if it can address my concern, but it is in python. I'm not great with code, but python is probably my strongest.
21:56 🔗 JAA One of the main reasons why wpull was written was to have resumable crawls. So yes. Use --database. *However*, cookies will not survive.
21:56 🔗 JAA --mirror doesn't exist on wpull though.
22:06 🔗 JAA Soo, my Intel downloads ID scan turned up over 2k files that are apparently missing from the previous retrieval.
22:07 🔗 JAA (To be precise: it found 2187 URLs on downloadmirror.intel.com which were not retrieved successfully on the previous attempt. Possibly some of them are 404s or have other issues.)
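The comparison JAA describes amounts to a set difference between the URLs the ID scan found and the URLs that were retrieved successfully the first time. A rough sketch, assuming both lists are already available as plain strings (how they are extracted from the scan and from the earlier grab is not covered here, and the function name is hypothetical):

```python
def missing_urls(discovered, retrieved):
    """Return discovered URLs absent from retrieved, deduplicated, in scan order."""
    done = set(retrieved)
    seen = set()
    missing = []
    for url in discovered:
        if url not in done and url not in seen:
            seen.add(url)
            missing.append(url)
    return missing
```

As the message above notes, a URL landing in this list does not mean the file is recoverable; some may be 404s or have other issues.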
22:12 🔗 akierig has quit IRC (Quit: later_gator)
22:47 🔗 X-Scale has quit IRC (Read error: Operation timed out)
23:25 🔗 Smiley has joined #archiveteam-bs
23:28 🔗 schbirid has quit IRC (Quit: Leaving)
23:29 🔗 SmileyG has quit IRC (Read error: Operation timed out)
23:58 🔗 BlueMax has joined #archiveteam-bs
