[00:41] *** Sokar has quit IRC (Read error: Connection reset by peer)
[00:45] *** Sokar has joined #archiveteam-bs
[00:48] What are some tools that make the creation of page-scraping templates easier?
[00:58] *** raeyulca has joined #archiveteam-bs
[03:04] *** manjaro-u has quit IRC (Read error: Operation timed out)
[03:11] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[03:30] We should have a BAT wallet, so Brave Rewards users can donate.
[04:24] *** odemgi_ has joined #archiveteam-bs
[04:27] *** qw3rty has joined #archiveteam-bs
[04:28] *** odemgi has quit IRC (Read error: Operation timed out)
[04:34] *** qw3rty2 has quit IRC (Ping timeout: 745 seconds)
[04:54] *** icedice has quit IRC (Quit: Leaving)
[05:38] *** benjins has quit IRC (Read error: Connection reset by peer)
[06:31] *** m007a83 has quit IRC (Quit: Fuck you Comcast)
[07:30] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[07:38] *** schbirid has joined #archiveteam-bs
[07:52] *** OrIdow6 has quit IRC (Quit: Leaving.)
[07:52] *** d5f4a3622 has quit IRC (Read error: Connection reset by peer)
[07:53] *** d5f4a3622 has joined #archiveteam-bs
[08:11] *** deevious has joined #archiveteam-bs
[08:30] *** d5f4a3622 has quit IRC (Ping timeout: 246 seconds)
[08:32] *** d5f4a3622 has joined #archiveteam-bs
[09:00] *** d5f4a3622 has quit IRC (Read error: Connection reset by peer)
[09:01] *** d5f4a3622 has joined #archiveteam-bs
[09:13] *** OrIdow6 has joined #archiveteam-bs
[09:21] *** d5f4a3622 has quit IRC (Read error: Connection reset by peer)
[09:44] *** Dragnog2 has quit IRC (Quit: Connection closed for inactivity)
[09:49] *** d5f4a3622 has joined #archiveteam-bs
[11:43] *** odemg has joined #archiveteam-bs
[12:58] *** erkinalp has joined #archiveteam-bs
[13:05] *** benjins has joined #archiveteam-bs
[14:17] *** Flashfire has quit IRC (Remote host closed the connection)
[14:17] *** kiska has quit IRC (Remote host closed the connection)
[14:18] *** Flashfire has joined #archiveteam-bs
[14:18] *** kiska has joined #archiveteam-bs
[14:18] *** Fusl__ sets mode: +o kiska
[14:18] *** Fusl sets mode: +o kiska
[14:18] *** Fusl_ sets mode: +o kiska
[15:04] *** akierig has joined #archiveteam-bs
[15:15] *** Dallas has quit IRC (Quit: The Lounge - https://thelounge.chat)
[15:30] *** dashcloud has joined #archiveteam-bs
[15:32] dashcloud: Ya, we have been getting it from all sides this morning
[15:32] Even Jason/SketchCow has been getting a ton of emails
[15:46] Well, I always get a bunch
[15:47] Make sure we're ACTUALLY getting them, though - that collection is a fucking nightmare of CGI
[15:48] Sounds about like what we are dealing with from the FCC
[15:52] Is there any precedent or rapport for contacting Intel and asking for an archive directly?
[15:55] *** VerifiedJ has joined #archiveteam-bs
[15:56] We had grabbed a copy a few months ago
[15:56] We are currently in the process of verifying it
[15:59] Is that just BIOS firmware, or all of the GFX, Sound, Networking etc drivers too?
[16:04] From what I understand, it was everything
[16:04] about 300GB
[16:04] wowza.
[16:04] We did it as a thing, but now that it's in danger, we will verify the archive and download anything missing
[16:05] Still, it would be neat to have that sort of direct contact and rapport.
[16:05] "Dear Intel: Hi, ArchiveTeam here -- folks from the Internet Archive, you know, "The Wayback Machine." Anyway, it'd be really nifty if we could arrange some access to make a backup copy of the support software and drivers you guys intend to delete this month. We regularly work with businesses that, for whatever reason, need to delete large portions of The Internet. It's our specialty and we're here to help."
[16:06] (Help us help you help us all.)
[16:08] jrwr: We did it because of that deletion. It's old news that's just now being brought up again as the deadline comes closer.
[16:08] Ah
[16:08] Figured as much
[17:01] Hello, I would like to make a special request: with Google wanting to de-index anything Adobe Flash related ( https://webmasters.googleblog.com/2019/10/goodbye-flash.html ), I would like to have a scraping of Adobe Flash results using normal searches, image searches (maybe?), and search by filetype;
[17:02] I might want to say use Bing, since in the past they snagged results from Google, but they might've changed that formula in response to being found out
[17:03] I'm also doing it on behalf of the people at Flashpoint, to filter websites for them to scrape
[17:04] I ask this because it's too tedious to filter the websites manually to submit stuff to their lists - unfortunately, Google has anti-scraping technologies
[17:09] Huh, interesting: "Anyway, this is specific to textual content / full sites that are in Flash; videos & animations generally wouldn't be indexed in web-search anyway.": https://twitter.com/JohnMu/status/1188795137591259136
[17:16] *** akierig has quit IRC (Quit: later_gator)
[17:20] *** akierig has joined #archiveteam-bs
[17:48] Ryz: there are a few swf archives, e.g. swfchan.com, born out of the silly 2chan/4chan and Newgrounds music video / animated shorts scene and school projects. I spent several years archiving anything that appeared on 4chan/f/, which I hope to convert to mp4 some day.
[17:48] To clarify, I meant scraping search results from Google regarding anything Adobe Flash related~ oo;
[17:49] Oh. I thought Google's response was basically that embedded flash videos don't index on google
[17:52] It would be awesome if there was a dictionary-style scraping, like say '[dictionary word] flash', '[dictionary word] flash game', 'filetype:swf [dictionary word]', just for the coverage~
[17:52] I wonder
[17:54] you know how Adobe (and other sites) would give you copypasta tags that included a link back to downloading and installing the Adobe Flash Plugin / Adobe Shockwave Plugin etc
[17:54] Maybe find some of those standard "Please install..." templates, and search Google for that quoted text.
[18:01] *** manjaro-u has joined #archiveteam-bs
[18:03] Well, that's a clever idea; unfortunately, I do know that not all of them follow that procedure, or not at all on the more obscure websites~ ><;
[18:04] aye. and it'd mean finding all the various iterations of that template, between Shockwave and Flash, version to version
[18:05] Surely there's a specialized search engine in which you search by HTML or web code instead of just the text content ><;
[18:06] I wonder if IA takes bribes to keyword scan the entire WBM
[18:08] I'm not an expert in the Common Crawl, but it's worth asking someone who is
[18:11] Ryz: If you want me to scrape anything from Bing, let me know.
[18:11] (Or well, the script is in my little-things repo.)
[18:12] Does Bing still have copied results from Google? oo;
[18:12] No idea.
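The dictionary-style scraping suggested at 17:52 could be sketched as a tiny query generator. The templates below mirror the examples given in that message; the word list is a placeholder for a real dictionary file:

```python
# Dictionary-style query generation for hunting Flash content via a
# search engine. TEMPLATES follows the examples from the chat; the
# word list would come from an actual dictionary file.
TEMPLATES = [
    "{word} flash",
    "{word} flash game",
    "filetype:swf {word}",
]

def flash_queries(words):
    """Yield one search query per (word, template) combination."""
    for word in words:
        for template in TEMPLATES:
            yield template.format(word=word)
```

The queries would then be fed to whatever scraper is available; as noted below, scraping Google directly is hard, so Bing may be the more practical target.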
[18:13] I would ideally want Google, because it may have an ever-changing, distinct take on what it does ><;
[18:14] Good luck. Scraping Google is hard.
[18:15] Apparently there are other file extensions besides .swf, such as .dcr (per https://wikiext.com/dcr) (per https://ia800709.us.archive.org/2/items/shockwave-games.net/shockwave-games.net.html)
[18:23] *** erkinalp has quit IRC (Ping timeout: 260 seconds)
[18:26] Hm, when doing the search query 'filetype:swf water game' on both Bing and Google, they have different results, but the latter is far more reaching than the former: Bing goes up to 6 pages for this query while Google goes up to 20 pages with default filtering of very similar entries xX;
[18:35] Hey, so I gave Ryz this site to archive: http://inchwormanimation.com/
[18:35] And 32 pages on Google with "omitted results included", not really seeing repeat domain names.
[18:36] they said to ask here for someone about archiving it
[18:36] Yeah, this seems a bit more complicated than at first glance~
[18:37] "might need some specialized help since this looks ancient; probably ask people for help"
[18:38] that's what Ryz said
[18:38] The thing is, it looks like it's barricaded by a mix of JavaScript and .swf - though it seems it's not using .swf anymore and is playing videos in .mp4 instead
[18:38] also "for Nintendo DSi"?
[18:39] yeah, Inchworm Animation is a Flipnote-like animation software for the DSi
[18:39] and the 3DS
[18:40] oh, an editor software
[18:40] this seems to be the successor
[18:40] http://www.butterflyanimation.com/
[18:43] the entire site seems rather tiny. you want it in a zip or on the Wayback Machine?
[18:43] Eww, zip.
[18:44] If zip is so eeew, then why doesn't IA support "look inside" on 7z files? :p
[18:44] I mean the idea of using zip for a website archive.
[18:44] WARC or bust
[18:45] I'm told that's a write-only container :p
[18:46] pywb does a good job of local playback, and the WBM is, well, the WBM.
[18:47] Still, it's the most suitable format for archiving web content.
[18:49] oh wee, wget does WARC cuz it's badass
[18:49] Yeah, unfortunately it does it wrong, but oh well.
[18:49] Or well, "wrong".
[18:49] but the wiki says it does it right
[18:49] Then the wiki is outdated and hasn't been updated with the 1.19.4 debacle.
[18:50] ", with the notable exception of Wget, a popular WARC-producing program, which, since February 2016, has used the angle brackets [correctly as specified]"
[18:51] Well, here's the thing: the standard was broken. The grammar didn't match the examples. So it's both correct and wrong, in a sense.
[18:51] oh, i guess v1.1 eliminates angle brackets for backwards compatibility with badly written software
[18:51] instead of correcting the software, uncorrect the standard ;)
[18:52] what happened in 1.19.4 specifically?
[18:52] Considering that the software authors were involved in the creation of the WARC standard, it was clearly a bug in the standard, since *everyone* agreed that there shouldn't be angle brackets around the WARC-Target-URI.
[18:52] *** erkinalp has joined #archiveteam-bs
[18:53] The angle brackets were added in 1.19.4.
[18:53] oh, that recently? wow
[18:53] 28500:2007 is 2007, isn't it
[18:54] Yep, WARC/1.0 is over a decade old.
[18:56] so wget 1.19.4 was committed on 23-Jan-2018, after WARC 1.1 ISO 28500:2017
[18:56] I've been considering sending a minimal patch to wget that bumps the WARC version and removes the angle brackets. No other changes would be needed, I believe.
[18:56] But I'm waiting for wumpus's strict conformity check tool to verify that those files are then indeed according to spec.
[18:56] does darnir even know about this drama?
[18:56] No clue.
[18:57] he seems like the kinda guy who would drop everything to fix it if just asked. freenode/#wget
[18:57] well, I mean, they all have been very responsive over the years. micahcowan and giuseppe included.
[18:57] Here's the pywb issue about this, which also links to an email on bug-wget: https://github.com/webrecorder/pywb/issues/294
[18:59] I believe the WARC code in wget was originally written by someone from ArchiveTeam and then mostly just stayed in the source without maintenance. Could be wrong, though, since that was well before my time here.
[18:59] still weird that the "problem" began in 2018, already after the 2017 ISO came out deprecating the 1.0 format
[19:00] Yup, the problem was noticed and fixed, and then someone said "yup, we'll now write WARCs that break all existing tooling".
[19:02] so wget is still broken as of 1.20.3?
[19:06] i wonder if there's a way to patch existing angle-bracket-containing WARCs by binary search-and-replace
[19:06] ie, replacing the < > with spaces 0x20
[19:07] Yes, it's still producing angle-bracketed WARCs.
[19:07] saving time and money recompiling every tainted archive in the past 2.9 years
[19:07] No, firstly because WARCs are (almost always) compressed, and secondly because you'll only want to replace it in specific headers. So it'd need some different tooling.
[19:08] ah, gzip
[19:08] Yup
[19:08] wasn't the original ARC something not-gzip compression?
[19:08] And you can't do zcat file.warc.gz | replace | gzip >out.warc.gz either, because it's per-record compression.
[19:08] No idea, I never looked into ARC.
[19:09] (i haven't used ARC in 25-30 years)
[19:14] *** Dallas has joined #archiveteam-bs
[19:53] JAA: darnir said he'll look into [fixing] it when he gets home.
[19:54] linked him to this channel, should he show up
[19:54] Raccoon: Nice, thank you!
[19:55] twuz nutin
[19:58] incidentally, 1.19.x is when the project was transitioning from giuseppe to darnir, so probably a simple case of institutional memory loss
[19:59] Ah, interesting, and makes sense.
[20:09] So about Intel: apparently they've already begun deleting stuff. Various entries that existed in early September are gone now.
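The "different tooling" mentioned at 19:07 could look roughly like this: a stdlib-only Python sketch that decompresses a .warc.gz, strips the angle brackets from WARC-Target-URI headers only (leaving any `<...>` in payloads untouched), and re-gzips each record as its own member to preserve per-record compression. This is a toy under simplifying assumptions (well-formed records, no error handling, function names made up here); real repairs should go through a proper WARC library.

```python
import gzip
import io
import re

def fix_target_uri(header_block: bytes) -> bytes:
    # Strip angle brackets from the WARC-Target-URI header only;
    # "<...>" anywhere else (e.g. inside an HTML payload) is untouched.
    return re.sub(
        rb"(?m)^(WARC-Target-URI:[ \t]*)<([^>\r\n]+)>(?=\r?$)",
        rb"\1\2",
        header_block,
    )

def rewrite_warc_gz(src: bytes) -> bytes:
    """Fix angle-bracketed WARC-Target-URI headers in a .warc.gz,
    re-gzipping each record as its own member (per-record compression)."""
    raw = gzip.decompress(src)  # handles concatenated gzip members
    out = io.BytesIO()
    pos = 0
    while pos < len(raw):
        # A record is: headers, CRLF CRLF, payload, CRLF CRLF.
        head_end = raw.index(b"\r\n\r\n", pos)
        head = raw[pos:head_end]
        length = int(re.search(rb"(?mi)^Content-Length:[ \t]*(\d+)",
                               head).group(1))
        rec_end = head_end + 4 + length + 4
        out.write(gzip.compress(fix_target_uri(head) + raw[head_end:rec_end]))
        pos = rec_end
    return out.getvalue()
```

Note the payload bytes are copied verbatim, so record digests over the payload stay valid; only the header block changes.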
[20:11] That's what https://www.vogons.org/viewtopic.php?f=46&t=69184 was saying. "... I also discovered Intel had already removed all their drivers just two days ago (!!) (on September 13th, 2019)."
[20:12] you got lucky
[20:12] Indeed, that grab finished on the 9th.
[20:12] JAA: do you want to talk more about WARC
[20:13] Raccoon: In general, absolutely, but not right now.
[20:14] rockdaboot has minutiae questions I don't have intelligent answers to. /join #wget on freenode
[20:14] also requesting a copy of the ISO PDF (behind paywall)
[20:15] Yeah, I don't have that.
[20:16] All I know is it's supposedly virtually identical to http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
[20:17] And http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf is the "latest draft", as suggested by the filename.
[20:17] ah, i think that'll be reasonable
[20:18] thanks
[20:20] is there really only that singular change in behavior, or a whole nest of things between 1.0 and 1.1?
[20:26] *** Zerote__ has joined #archiveteam-bs
[20:26] *** Nick-PC_ has joined #archiveteam-bs
[20:29] *** Nick-PC has quit IRC (Ping timeout: 252 seconds)
[20:29] *** Zerote_ has quit IRC (Ping timeout: 252 seconds)
[20:30] I believe that's the only critical change. Everything else was added features (like more flexible and precise timestamps and new header fields), which are optional. Not 100% sure though.
[20:43] just double-checking: a 'wget --mirror' that writes to a WARC file can't be stopped and resumed, right?
[20:44] so if there's a power outage or other issue, I can't resume exactly where I left off?
[21:01] *** erkinalp has quit IRC (Ping timeout: 260 seconds)
[21:11] *** odemgi has joined #archiveteam-bs
[21:16] *** odemgi_ has quit IRC (Read error: Operation timed out)
[21:26] *** Hani111 has joined #archiveteam-bs
[21:34] prq: I think, yes. I'd have to test it, but if not, it shouldn't be too difficult to implement.
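To make the 1.0-vs-1.1 difference discussed above concrete, the one playback-critical change is the form of the WARC-Target-URI header (the URI here is made up):

```
WARC-Target-URI: <http://example.com/>    as wget >= 1.19.4 writes it, per the WARC/1.0 grammar
WARC-Target-URI: http://example.com/      as WARC/1.1 specifies, and as existing tools expect
```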
[21:35] prq: if you want a certain behavior that isn't there, now is the best time to find the bugs you want fixed.
[21:37] *** Hani has quit IRC (Ping timeout: 745 seconds)
[21:37] *** Hani111 is now known as Hani
[21:41] I'm happy discussing in either location, lol
[21:42] *** manjaro-u has quit IRC (Read error: Connection reset by peer)
[21:43] *** manjaro-u has joined #archiveteam-bs
[21:43] i was suggesting maybe starting a second test instance of wget --mirror --warc-file that you intentionally interrupt and then restart :)
[21:44] I've tried that, and it definitely seems to start over from the beginning
[21:44] so if I've done 300,000 requests, it'll do them again. as long as the content hasn't changed, the warc file doesn't seem to get bigger, which is good
[21:45] but with a --wait delay built in, it can take a very very long time to go through all those requests again.
[21:45] yeah, that's kinda just how wget works (right now); it doesn't store a fast_resume_cache to restart an interrupted mirror
[21:45] as far as I can tell, wget must keep its progress state in memory
[21:45] right
[21:46] all --mirrors work that way: session-stateless, utilizing either timestamps or noclobber
[21:46] skip as needed in the present
[21:50] I've been reading about the wpull project - not sure if it can address my concern, but it is in Python. I'm not great with code, but Python is probably my strongest.
[21:56] One of the main reasons why wpull was written was to have resumable crawls. So yes. Use --database. *However*, cookies will not survive.
[21:56] --mirror doesn't exist on wpull, though.
[22:06] Soo, my Intel downloads ID scan turned up over 2k files that are apparently missing from the previous retrieval.
[22:07] (To be precise: it found 2187 URLs on downloadmirror.intel.com which were not retrieved successfully on the previous attempt. Possibly some of them are 404s or have other issues.)
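For prq's use case, a resumable wpull invocation along the lines suggested above might look like this; the URL, filenames, and delay are placeholders, and the flags should be double-checked against the installed wpull's --help:

```shell
# wpull keeps its crawl state (URL table) in crawl.db, so rerunning the
# same command after an interruption continues where it left off.
# Cookies are NOT persisted across restarts, per the discussion above.
wpull http://example.com/ \
    --recursive \
    --warc-file example \
    --database crawl.db \
    --wait 2
```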
[22:12] *** akierig has quit IRC (Quit: later_gator)
[22:47] *** X-Scale has quit IRC (Read error: Operation timed out)
[23:25] *** Smiley has joined #archiveteam-bs
[23:28] *** schbirid has quit IRC (Quit: Leaving)
[23:29] *** SmileyG has quit IRC (Read error: Operation timed out)
[23:58] *** BlueMax has joined #archiveteam-bs