#archiveteam-bs 2019-11-18,Mon

Time	Nickname	Message
00:41 ^🔗		Sokar has quit IRC (Read error: Connection reset by peer)
00:45 ^🔗		Sokar has joined #archiveteam-bs
00:48 ^🔗	Raccoon	What are some tools that make the creation of page-scraping templates easier?
00:58 ^🔗		raeyulca has joined #archiveteam-bs
03:04 ^🔗		manjaro-u has quit IRC (Read error: Operation timed out)
03:11 ^🔗		DogsRNice has quit IRC (Read error: Connection reset by peer)
03:30 ^🔗	LowLevelM	We should have a BAT wallet, so brave rewards users can donate.
04:24 ^🔗		odemgi_ has joined #archiveteam-bs
04:27 ^🔗		qw3rty has joined #archiveteam-bs
04:28 ^🔗		odemgi has quit IRC (Read error: Operation timed out)
04:34 ^🔗		qw3rty2 has quit IRC (Ping timeout: 745 seconds)
04:54 ^🔗		icedice has quit IRC (Quit: Leaving)
05:38 ^🔗		benjins has quit IRC (Read error: Connection reset by peer)
06:31 ^🔗		m007a83 has quit IRC (Quit: Fuck you Comcast)
07:30 ^🔗		BlueMax has quit IRC (Read error: Connection reset by peer)
07:38 ^🔗		schbirid has joined #archiveteam-bs
07:52 ^🔗		OrIdow6 has quit IRC (Quit: Leaving.)
07:52 ^🔗		d5f4a3622 has quit IRC (Read error: Connection reset by peer)
07:53 ^🔗		d5f4a3622 has joined #archiveteam-bs
08:11 ^🔗		deevious has joined #archiveteam-bs
08:30 ^🔗		d5f4a3622 has quit IRC (Ping timeout: 246 seconds)
08:32 ^🔗		d5f4a3622 has joined #archiveteam-bs
09:00 ^🔗		d5f4a3622 has quit IRC (Read error: Connection reset by peer)
09:01 ^🔗		d5f4a3622 has joined #archiveteam-bs
09:13 ^🔗		OrIdow6 has joined #archiveteam-bs
09:21 ^🔗		d5f4a3622 has quit IRC (Read error: Connection reset by peer)
09:44 ^🔗		Dragnog2 has quit IRC (Quit: Connection closed for inactivity)
09:49 ^🔗		d5f4a3622 has joined #archiveteam-bs
11:43 ^🔗		odemg has joined #archiveteam-bs
12:58 ^🔗		erkinalp has joined #archiveteam-bs
13:05 ^🔗		benjins has joined #archiveteam-bs
14:17 ^🔗		Flashfire has quit IRC (Remote host closed the connection)
14:17 ^🔗		kiska has quit IRC (Remote host closed the connection)
14:18 ^🔗		Flashfire has joined #archiveteam-bs
14:18 ^🔗		kiska has joined #archiveteam-bs
14:18 ^🔗		Fusl__ sets mode: +o kiska
14:18 ^🔗		Fusl sets mode: +o kiska
14:18 ^🔗		Fusl_ sets mode: +o kiska
15:04 ^🔗		akierig has joined #archiveteam-bs
15:15 ^🔗		Dallas has quit IRC (Quit: The Lounge - https://thelounge.chat)
15:30 ^🔗		dashcloud has joined #archiveteam-bs
15:32 ^🔗	jrwr	dashcloud: Ya, We have been getting it from all sides this morning
15:32 ^🔗	jrwr	Even Jason/SketchCow has been getting a ton of emails
15:46 ^🔗	SketchCow	Well, I always get a bunch
15:47 ^🔗	SketchCow	Make sure we're ACTUALLY getting them, though - that collection is a fucking nightmare of CGI
15:48 ^🔗	jrwr	Sounds about like what we are dealing with and the FCC
15:52 ^🔗	Raccoon	Is there any precedent or rapport to contact Intel and ask for an archive directly?
15:55 ^🔗		VerifiedJ has joined #archiveteam-bs
15:56 ^🔗	jrwr	We had grabbed a copy a few months ago
15:56 ^🔗	jrwr	We are currently in the process of verifying it
15:59 ^🔗	Raccoon	Is that just BIOS firmware, or all of the GFX, Sound, Networking etc drivers too?
16:04 ^🔗	jrwr	From what I understand, It was everything
16:04 ^🔗	jrwr	about 300GB
16:04 ^🔗	Raccoon	wowza.
16:04 ^🔗	jrwr	We did it as a thing, but now that it's in danger, we will verify the archive and download anything missing
16:05 ^🔗	Raccoon	Still, would be neat to have that sort of direct contact and build of rapport.
16:05 ^🔗	Raccoon	"Dear Intel: Hi, ArchiveTeam here -- folks from the Internet Archive, you know, "The Wayback Machine." Anyway, it'd be really nifty if we could arrange some access to make a backup copy of the support software and drivers you guys intend to delete this month. We regularly work with businesses that, for whatever reason, need to delete large portions of The Internet. It's our specialty and we're here to help."
16:06 ^🔗	Raccoon	(Help us help you help us all.)
16:08 ^🔗	JAA	jrwr: We did it because of that deletion. It's old news that's just now brought up again as the deadline comes closer.
16:08 ^🔗	jrwr	Ah
16:08 ^🔗	jrwr	Figures as much
17:01 ^🔗	Ryz	Hello, I would like to make a special request, with Google wanting to index anything Adobe Flash related ( https://webmasters.googleblog.com/2019/10/goodbye-flash.html ) , I would like to have a scraping of Adobe Flash results of using normal searches, image searches (maybe?), and search by filetype;
17:02 ^🔗	Ryz	I might want to say use Bing since in the past they snagged results from Google, but they might've changed that formula in response of being found out
17:02 ^🔗	Ryz	*Google wanting to de-index anything Adobe Flash related
17:03 ^🔗	Ryz	I'm also doing it on behalf of filtering websites for the people at Flashpoint to scrape from
17:04 ^🔗	Ryz	I ask this because it's too tedious to filter the websites manually to submit stuff to their lists - unfortunately, Google has anti-scraping technologies
17:09 ^🔗	Ryz	Huh, interesting, "Anyway, this is specific to textual content / full sites that are in Flash; videos & animations generally wouldn't be indexed in web-search anyway.": https://twitter.com/JohnMu/status/1188795137591259136
17:16 ^🔗		akierig has quit IRC (Quit: later_gator)
17:20 ^🔗		akierig has joined #archiveteam-bs
17:48 ^🔗	Raccoon	Ryz: there are a few swf archives, ie swfchan.com, born out of the silly 2chan 4chan and newgrounds music video animated shorts scene / school projects. I spent several years archiving anything that appeared on 4chan/f/ which I hope to convert to mp4 some day.
17:48 ^🔗	Ryz	To clarify, I meant scraping search results from Google regarding anything Adobe Flash related~ oo;
17:49 ^🔗	Raccoon	Oh. I thought Google's response was basically that embedded flash videos don't index on google
17:52 ^🔗	Ryz	It would be awesome if there was a dictionary-style scraping, like say '[dictionary word] flash', '[dictionary word] flash game', 'filetype:swf [dictionary word]' just for the coverage~
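The dictionary-style coverage described above can be sketched as a plain query generator. This is only an illustration: the three templates are taken from the message, while the function name `build_queries` and the sample word list are hypothetical, and nothing here actually contacts a search engine.

```python
# Templates quoted from the chat message; "{word}" marks the dictionary slot.
TEMPLATES = ("{word} flash", "{word} flash game", "filetype:swf {word}")


def build_queries(words, templates=TEMPLATES):
    """Yield one query string per (word, template) pair, grouped by word."""
    for word in words:
        for template in templates:
            yield template.format(word=word)


# Hypothetical two-word "dictionary" just to show the shape of the output.
queries = list(build_queries(["water", "puzzle"]))
for query in queries:
    print(query)
```

Feeding the resulting list to a scraper is a separate problem, and as noted later in the log, it is the hard part for Google.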
17:52 ^🔗	Raccoon	I wonder
17:54 ^🔗	Raccoon	you know how Adobe (and other sites) would give you copypasta <embed><object> tags that included a link back to downloading and installing Adobe Flash Plugin / Adobe Shockwave Plugin etc
17:54 ^🔗	Raccoon	Maybe find some of those standard "Please install..." templates, and search google for those quoted text.
18:01 ^🔗		manjaro-u has joined #archiveteam-bs
18:03 ^🔗	Ryz	Well, that's a clever idea; unfortunately, I do know that not all of them do that procedure, or not at all from finding the more obscure websites~ ><;
18:04 ^🔗	Raccoon	aye. and it'd mean finding all the various iterations of that template, between <embed> and <object> and Shockwave and Flash, version to version
18:05 ^🔗	Ryz	Surely there's a specialized search engine in which you search by HTML or web code instead of just the text content ><;
18:06 ^🔗	Raccoon	I wonder if IA takes bribes to keyword scan the entire WBM
18:08 ^🔗	markedL	I'm not an expert in the common crawl, but it's worth asking someone who is
18:11 ^🔗	JAA	Ryz: If you want me to scrape anything from Bing, let me know.
18:11 ^🔗	JAA	(Or well, the script is in my little-things repo.)
18:12 ^🔗	Ryz	Does Bing still have copied results from Google? oo;
18:12 ^🔗	JAA	No idea.
18:13 ^🔗	Ryz	I would ideally want Google, because it may have an ever-changing distinct take of what it does ><;
18:14 ^🔗	JAA	Good luck. Scraping Google is hard.
18:15 ^🔗	Raccoon	Apparently there are other file extensions besides .swf, such as .dcr (per https://wikiext.com/dcr) (per https://ia800709.us.archive.org/2/items/shockwave-games.net/shockwave-games.net.html)
18:23 ^🔗		erkinalp has quit IRC (Ping timeout: 260 seconds)
18:26 ^🔗	Ryz	Hm, when doing the search query for 'filetype:swf water game' on both Bing and Google, they both have different results, but the latter is far more reaching than the former, the former can go up to 6 pages for this search query while the latter is up to 20 pages with default filtering of very similar entries xX;
18:35 ^🔗	Larsenv	Hey, so I gave Ryz this site to archive: http://inchwormanimation.com/
18:35 ^🔗	Raccoon	And 32 pages on Google with "omitted results included", not really seeing repeat domain names.
18:36 ^🔗	Larsenv	they said to ask here about someone about archiving it
18:36 ^🔗	Ryz	Yeah, this seems a bit more complicated than on first glance~
18:37 ^🔗	Larsenv	"might need some specialized help since this looks ancient; probably ask people for help"
18:38 ^🔗	Larsenv	that's what Ryz said
18:38 ^🔗	Ryz	The thing is, it looks like it's barricaded by a mix of JavaScript and .swf - though it seems it's not using .swf anymore and is playing videos in .mp4 instead
18:38 ^🔗	Raccoon	also "for Nintendo DSi"?
18:39 ^🔗	Larsenv	yeah, inchworm animation is a flipnote-like animation software for the DSi
18:39 ^🔗	Larsenv	and the 3DS
18:40 ^🔗	Raccoon	oh, an editor software
18:40 ^🔗	Larsenv	this seems to be the successor
18:40 ^🔗	Larsenv	http://www.butterflyanimation.com/
18:43 ^🔗	Raccoon	the entire site seems rather tiny. you want it in a zip or on the waybackmachine
18:43 ^🔗	JAA	Eww, zip.
18:44 ^🔗	Raccoon	If zip is so eeew, then why doesn't IA support "look inside" on 7z files? :p
18:44 ^🔗	JAA	I mean the idea of using zip for a website archive.
18:44 ^🔗	JAA	WARC or bust
18:45 ^🔗	Raccoon	I'm told that's a write-only container :p
18:46 ^🔗	JAA	pywb does a good job of local playback, and the WBM is, well, the WBM.
18:47 ^🔗	JAA	Still, it's the most suitable format for archiving web content.
18:49 ^🔗	Raccoon	oh wee, wget does warc cuz it's badass
18:49 ^🔗	JAA	Yeah, unfortunately it does it wrong, but oh well.
18:49 ^🔗	JAA	Or well, "wrong".
18:49 ^🔗	Raccoon	but the wiki says it does it right
18:49 ^🔗	JAA	Then the wiki is outdated and hasn't been updated with the 1.19.4 debacle.
18:50 ^🔗	Raccoon	", with the notable exception of Wget, a popular WARC-producing program, which, since February 2016, has used the angle brackets [correctly as specified]"
18:51 ^🔗	JAA	Well, here's the thing, the standard was broken. The grammar didn't match the examples. So it's both correct and wrong, in a sense.
18:51 ^🔗	Raccoon	oh, i guess v1.1 eliminates angle brackets for backwards compatibility with badly written software
18:51 ^🔗	Raccoon	instead of correcting the software, uncorrect the standard ;)
18:52 ^🔗	Raccoon	what happened in 1.19.4 specifically?
18:52 ^🔗	JAA	Considering that the software authors were involved in the creation of the WARC standard, it was clearly a bug in the standard since everyone agreed that there shouldn't be angle brackets around the WARC-Target-URI.
18:52 ^🔗		erkinalp has joined #archiveteam-bs
18:53 ^🔗	JAA	The angle brackets were added in 1.19.4.
18:53 ^🔗	Raccoon	oh, that recently? wow
18:53 ^🔗	Raccoon	28500:2007 is 2007 isn't it
18:54 ^🔗	JAA	Yep, WARC/1.0 is over a decade old.
18:56 ^🔗	Raccoon	so wget 1.19.4 was committed on 23-Jan-2018, after WARC 1.1 ISO 28500:2017
18:56 ^🔗	JAA	I've been considering sending a minimal patch to wget that bumps the WARC version and removes the angle brackets. No other changes would be needed I believe.
18:56 ^🔗	JAA	But I'm waiting for wumpus's strict conformity check tool to verify that those files are then indeed according to spec.
18:56 ^🔗	Raccoon	does darnir even know about this drama?
18:56 ^🔗	JAA	No clue.
18:57 ^🔗	Raccoon	he seems like the kinda guy who would drop everything to fix it if just asked. freenode/#wget
18:57 ^🔗	Raccoon	well, I mean, they all have been very responsive over the years. micahcowan and giuseppe included.
18:57 ^🔗	JAA	Here's the pywb issue about this, which also links to an email on bug-wget: https://github.com/webrecorder/pywb/issues/294
18:59 ^🔗	JAA	I believe the WARC code in wget was originally written by someone from ArchiveTeam and then mostly just stayed in the source without maintenance. Could be wrong though since that was well before my time here.
18:59 ^🔗	Raccoon	still weird that the "problem" began in 2018, already after the 2017 ISO came out deprecating the 1.0 format
19:00 ^🔗	JAA	Yup, problem was noticed and fixed, and then someone said "yup, we'll now write WARCs that break all existing tooling".
19:02 ^🔗	Raccoon	so wget is still broken as of 1.20.3?
19:06 ^🔗	Raccoon	i wonder if there's a way to patch existing anglebracketcontaining WARCs by binary search-and-replace
19:06 ^🔗	Raccoon	ie, replacing the < > with spaces 0x20
19:07 ^🔗	JAA	Yes, it's still producing angle-bracketed WARCs.
19:07 ^🔗	Raccoon	saving time and money recompiling every tainted archive in the past 2.9 years
19:07 ^🔗	JAA	No, firstly because WARCs are (almost always) compressed, and secondly because you'll only want to replace it in specific headers. So it'd need some different tooling.
19:08 ^🔗	Raccoon	ah, gzip
19:08 ^🔗	JAA	Yup
19:08 ^🔗	Raccoon	wasn't the original ARC something not-gzip compression?
19:08 ^🔗	JAA	And you can't do zcat file.warc.gz | replace | gzip >out.warc.gz either because it's per-record compression.
19:08 ^🔗	JAA	No idea, I never looked into ARC.
19:09 ^🔗	Raccoon	(i haven't used ARC in 25-30 years)
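The "different tooling" mentioned above could look roughly like the following: walk the file one gzip member at a time, edit only the WARC-Target-URI header of each record, and recompress each record separately so the per-record layout survives. This is a minimal sketch, not an existing tool. It assumes the whole file fits in memory and that each gzip member holds exactly one record, and the function names are made up for illustration. It does not bump the WARC version line, and any offset-based index (such as a CDX) would need rebuilding afterwards, since record offsets change.

```python
import gzip
import re
import zlib

# Matches only the WARC-Target-URI header line when its value is <bracketed>.
TARGET_URI = re.compile(rb"^(WARC-Target-URI:[ \t]*)<(.*)>[ \t]*$")


def iter_members(data):
    """Yield the decompressed bytes of each gzip member in a multi-member blob."""
    while data:
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+ means gzip framing
        payload = decomp.decompress(data) + decomp.flush()
        yield payload
        data = decomp.unused_data  # bytes after this member, i.e. the next record


def strip_brackets(record):
    """Remove the angle brackets from WARC-Target-URI in the WARC header only."""
    head, sep, body = record.partition(b"\r\n\r\n")  # body is left untouched
    lines = [TARGET_URI.sub(rb"\1\2", line) for line in head.split(b"\r\n")]
    return b"\r\n".join(lines) + sep + body


def rewrite(data):
    """Return a new per-record-compressed blob with the header fixed in every record."""
    return b"".join(gzip.compress(strip_brackets(m)) for m in iter_members(data))
```

Content-Length covers only the record block, so editing the header line does not require recomputing it.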
19:14 ^🔗		Dallas has joined #archiveteam-bs
19:53 ^🔗	Raccoon	JAA: darnir said he'll look into [fixing] it when he gets home.
19:54 ^🔗	Raccoon	linked him to this channel, should he show up
19:54 ^🔗	JAA	Raccoon: Nice, thank you!
19:55 ^🔗	Raccoon	twuz nutin
19:58 ^🔗	Raccoon	incidentally, 1.19.x is when the project was transitioning from giuseppe to darnir, so probably a simple case of institutional memory loss
19:59 ^🔗	JAA	Ah, interesting, and makes sense.
20:09 ^🔗	JAA	So about Intel: apparently they've already begun deleting stuff. Various entries that existed in early September are gone now.
20:11 ^🔗	Raccoon	That's what https://www.vogons.org/viewtopic.php?f=46&t=69184 was saying. "... I also discovered Intel had already removed all their drivers just two days ago (!!) (on September 13th, 2019)."
20:12 ^🔗	Raccoon	you got lucky
20:12 ^🔗	JAA	Indeed, that grab finished on the 9th.
20:12 ^🔗	Raccoon	JAA: do you want to talk more about WARC
20:13 ^🔗	JAA	Raccoon: In general, absolutely, but not right now.
20:14 ^🔗	Raccoon	rockdaboot has minutiae questions I don't have intelligent answers to. /join #wget on freenode
20:14 ^🔗	Raccoon	also requesting a copy of the ISO PDF (behind paywall)
20:15 ^🔗	JAA	Yeah, I don't have that.
20:16 ^🔗	JAA	All I know is it's supposedly virtually identical to http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
20:17 ^🔗	JAA	And http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf is the "latest draft" as suggested by the filename.
20:17 ^🔗	Raccoon	ah, i think that'll be reasonable
20:18 ^🔗	Raccoon	thanks
20:20 ^🔗	Raccoon	there's really only that singular change in behavior, or a whole nest of things between 1.0 and 1.1?
20:26 ^🔗		Zerote__ has joined #archiveteam-bs
20:26 ^🔗		Nick-PC_ has joined #archiveteam-bs
20:29 ^🔗		Nick-PC has quit IRC (Ping timeout: 252 seconds)
20:29 ^🔗		Zerote_ has quit IRC (Ping timeout: 252 seconds)
20:30 ^🔗	JAA	I believe that's the only critical change. Everything else was added features (like more flexible and precise timestamps and new header fields) which are optional. Not 100 % sure though.
20:43 ^🔗	prq	just double checking-- a 'wget --mirror' that writes to a warc file can't be stopped and resumed, right?
20:44 ^🔗	prq	so if there's a power outage or other issue, I can't resume exactly where I left off?
21:01 ^🔗		erkinalp has quit IRC (Ping timeout: 260 seconds)
21:11 ^🔗		odemgi has joined #archiveteam-bs
21:16 ^🔗		odemgi_ has quit IRC (Read error: Operation timed out)
21:26 ^🔗		Hani111 has joined #archiveteam-bs
21:34 ^🔗	Raccoon	prq: <darnir[m]> I think, yes. I'd have to test it, but if not, it shouldn't be too difficult to implement it.
21:35 ^🔗	Raccoon	prq: if you want a certain behavior that isn't, now is the best time to find the bugs you want fixed.
21:37 ^🔗		Hani has quit IRC (Ping timeout: 745 seconds)
21:37 ^🔗		Hani111 is now known as Hani
21:41 ^🔗	prq	I'm happy discussing in either location, lol
21:42 ^🔗		manjaro-u has quit IRC (Read error: Connection reset by peer)
21:43 ^🔗		manjaro-u has joined #archiveteam-bs
21:43 ^🔗	Raccoon	i was suggesting maybe starting a second test instance of wget --mirror --warc-file that you intentionally interrupt and then restart again :)
21:44 ^🔗	prq	I've tried that, and it definitely seems to start over from the beginning
21:44 ^🔗	prq	so if I've done 300,000 requests, it'll do them again. as long as the content hasn't changed, the warc file doesn't seem to get bigger, which is good
21:45 ^🔗	prq	but with a --wait delay built in, it can take a very very long time to go through all those requests again.
21:45 ^🔗	Raccoon	yeah, that's kinda just how wget works (right now), it doesn't store a fast_resume_cache to restart an interrupted mirror
21:45 ^🔗	prq	as far as I can tell, wget must keep its progress state in memory
21:45 ^🔗	Raccoon	right
21:46 ^🔗	Raccoon	all --mirrors work that way. session stateless, utilize either timestamps or noclobber
21:46 ^🔗	Raccoon	skip as needed in the present
21:50 ^🔗	prq	I've been reading about the wpull project-- not sure if it can address my concern, but it is in python. I'm not great with code, but python is probably my strongest.
21:56 ^🔗	JAA	One of the main reasons why wpull was written was to have resumable crawls. So yes. Use --database. However, cookies will not survive.
21:56 ^🔗	JAA	--mirror doesn't exist on wpull though.
22:06 ^🔗	JAA	Soo, my Intel downloads ID scan turned up over 2k files that are apparently missing from the previous retrieval.
22:07 ^🔗	JAA	(To be precise: it found 2187 URLs on downloadmirror.intel.com which were not retrieved successfully on the previous attempt. Possibly some of them are 404s or have other issues.)
22:12 ^🔗		akierig has quit IRC (Quit: later_gator)
22:47 ^🔗		X-Scale has quit IRC (Read error: Operation timed out)
23:25 ^🔗		Smiley has joined #archiveteam-bs
23:28 ^🔗		schbirid has quit IRC (Quit: Leaving)
23:29 ^🔗		SmileyG has quit IRC (Read error: Operation timed out)
23:58 ^🔗		BlueMax has joined #archiveteam-bs