#archiveteam-bs 2017-05-29,Mon

Time	Nickname	Message
00:02 ^🔗		j08nY has quit IRC (Quit: Leaving)
00:10 ^🔗		GLaDOS has quit IRC (Ping timeout: 194 seconds)
00:19 ^🔗	i0npulse	So, is IA massaging HTML in the web archive? Code is being "corrected" on render, with code tags being changed from upper-case to lower-case, quotes being added, and end tags being added.
00:21 ^🔗	i0npulse	It has been really frustrating diffing code from another archive against content in web.archive.org due to the alterations in the code. I would think this goes against IA being so picky about everything having a warc file so it could preserve the current historical state of the file in its purest form.
00:22 ^🔗	i0npulse	I attempted to submit a feedback request on the matter, but I think internal standards with developers at IA and how they serve the data to users have warped the presentation of historically relevant content.
00:23 ^🔗	i0npulse	I wish they would just leave the code broken and malformed like it originally was.
00:26 ^🔗	i0npulse	Original Code: <IMG ALIGN=left border=0 hspace=5 SRC="epicsky2.jpg" width="106" height="66">
00:26 ^🔗	i0npulse	IA code: <img align="left" border="0" hspace="5" src="/web/19971010052947im_/http://www.epicgames.com:80/images/unreal/epicsky2.jpg">
00:34 ^🔗		Xibalba has quit IRC (ZNC 1.7.x-git-737-29d4f20-frankenznc - http://znc.in)
00:36 ^🔗		Xibalba has joined #archiveteam-bs
00:40 ^🔗		Xibalba has quit IRC (ZNC 1.7.x-git-737-29d4f20-frankenznc - http://znc.in)
00:45 ^🔗		Xibalba has joined #archiveteam-bs
00:45 ^🔗		BlueMaxim has joined #archiveteam-bs
00:48 ^🔗	timmc	i0npulse: Maybe this is a distinction between web.archive.org-the-browsable-version and the archive-as-you-can-download-it-via-API
00:48 ^🔗		GLaDOS has joined #archiveteam-bs
00:51 ^🔗	arkiver	yipdw: yeah, python warc library is horrible
00:51 ^🔗	arkiver	it needs to load the full record into memory
00:51 ^🔗	arkiver	which just sucks
00:52 ^🔗	arkiver	there is now https://github.com/webrecorder/warcio which does not do this afaik
00:52 ^🔗	arkiver	but there was a problem with writing records that are not request or response records
00:52 ^🔗	arkiver	I believe that was fixed a few weeks ago
00:52 ^🔗	arkiver	so I will give warcio a second try soon
00:53 ^🔗	arkiver	but until then I don't know if we can trust it enough to actually replace the deduplication warc stuff we are using now
00:56 ^🔗	arkiver	so I hope we can have that tested soon and after that totally abandon https://github.com/internetarchive/warc
00:57 ^🔗	arkiver	i0npulse: what URL did you use?
00:59 ^🔗	i0npulse	let me see
01:01 ^🔗	i0npulse	https://web.archive.org/web/19971010052947/http://www.epicgames.com/unreal.htm
01:01 ^🔗	arkiver	and adding id_ gives the original version
01:01 ^🔗	arkiver	https://web.archive.org/web/19971010052947id_/http://www.epicgames.com/unreal.htm
01:02 ^🔗	i0npulse	ok nice
01:03 ^🔗	arkiver	yeah I see the problem
01:03 ^🔗	arkiver	<A HREF="index.htm"><IMG ALIGN=left border=0 hspace=5 SRC="images/unreal/epicsky2.jpg">
01:03 ^🔗	arkiver	to
01:03 ^🔗	arkiver	<a href="index.htm"><img align="left" border="0" hspace="5" src="/web/19971010052947im_/http://www.epicgames.com/images/unreal/epicsky2.jpg">
01:03 ^🔗	arkiver	which is strange
01:04 ^🔗	arkiver	thanks
01:04 ^🔗	i0npulse	Yeah, I originally noticed it when I repaired an old SNK Official Homepage for the company from back in 2001, which someone had archived way back then with wget.
01:04 ^🔗	i0npulse	Some files were missing, so I was patching corrections in via IA... and that's when I noticed the syntax discrepancies.
01:05 ^🔗	arkiver	yeah
01:06 ^🔗	i0npulse	Eventually I will build code in node to reach IA's API, but at the moment my entire toolchain is in bash with gnu tools.
01:06 ^🔗	arkiver	nice find ;)
01:06 ^🔗	arkiver	:)*
01:08 ^🔗	arkiver	will keep you informed on this
01:08 ^🔗		ndiddy has joined #archiveteam-bs
01:09 ^🔗	joepie91	i0npulse: it's very likely that they're using a DOM-aware library for rewriting URLs (for increased reliability), ie. parsing the document, modifying some attributes in the DOM, and then stringifying it again
01:09 ^🔗	joepie91	i0npulse: for these kind of operations, it's not typical to see errors in the original document reflected in the output; generally a parser will patch things 'on the fly' and only present you with a DOM / AST / whatever, without any annotations that indicate things like spacing, parsing errors, etc.
01:10 ^🔗	joepie91	so there's no way to accurately recreate the original code's structure from that parsed data
01:10 ^🔗	arkiver	we're discovering a rewriting URLs using custom software
01:10 ^🔗	joepie91	(this is an issue I've run into a few times with code rewriting projects)
01:11 ^🔗	joepie91	arkiver: can you rephrase that? syntax error in line 1 of that sentence :P
01:11 ^🔗	i0npulse	lol
01:11 ^🔗	arkiver	uh
01:11 ^🔗	i0npulse	i think a = and
01:11 ^🔗	arkiver	yeah
01:11 ^🔗	joepie91	ah, right. but if you're using a library for parsing HTML, you still end up with the same problem :P
01:12 ^🔗	joepie91	lots of parsers flat-out don't support annotating things like this
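The parse-then-stringify effect joepie91 describes can be reproduced with a minimal sketch using only Python's standard-library `html.parser`. This is not the Wayback Machine's actual rewriter; the `Reserializer` class and `roundtrip` function are hypothetical names made up for illustration. The parser hands back lower-cased tag and attribute names with the quotes already stripped, so the serializer has no record of the original case or quoting style.

```python
from html import escape
from html.parser import HTMLParser


class Reserializer(HTMLParser):
    """Hypothetical helper: parse HTML and write it back out from parser events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # `tag` and attribute names arrive lower-cased; values arrive unquoted.
        # Nothing here records how the author originally wrote them.
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


def roundtrip(html_text):
    """Parse `html_text` and serialize it again from the parsed events."""
    parser = Reserializer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    original = '<IMG ALIGN=left border=0 hspace=5 SRC="epicsky2.jpg">'
    print(original)
    print(roundtrip(original))
```

Diffing the two printed lines shows the same kind of case and quote changes reported above, even though no URL was touched.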
01:13 ^🔗	jrwr	Wow, the MAP tag
01:15 ^🔗	alembic	beautifulsoup is usually what python folk use to parse & prettify html... might be responsible for some of the changes you're seeing
01:15 ^🔗		Sk1d has quit IRC (Ping timeout: 250 seconds)
01:16 ^🔗	i0npulse	I had rebuilt the old Unreal Technology website, and the original admins had case dyslexia, so files were linked as UnrealVersion.htm, unrealversion.htm, and unrealVersion.htm throughout the code. That sent the original HTTrack grab by Hyper.nl on a wild ride, appending -1, -2, -3 onto file names.
01:17 ^🔗	i0npulse	Then I was looking at the "corrected syntax" output from IA, and was like "ok WHICH is the real file name!?"
01:17 ^🔗	i0npulse	I already packed the site... so now the id_ tip from arkiver... wondering if I go back and fix anything or not lol
01:22 ^🔗	arkiver	I had a look at wayback
01:23 ^🔗	arkiver	images/unreal/epicsky2.jpg is currently rewritten to host relative
01:23 ^🔗		Sk1d has joined #archiveteam-bs
01:23 ^🔗	arkiver	since it is seen as an image and thus should have 'im_' appended to the timestamp
01:23 ^🔗	arkiver	/web/19971010052947im_/http://www.epicgames.com/images/unreal/epicsky2.jpg
01:24 ^🔗	i0npulse	so the image path discrepancy between Hyper.nl and IA is probably a difference in how Hyper.nl's copy was originally retrieved
01:24 ^🔗	arkiver	if we would keep it path relative we could not get the im_ in
01:24 ^🔗	arkiver	do you have an example
01:24 ^🔗	arkiver	post the example, will have a look tomorrow
01:25 ^🔗	*	arkiver is afk
01:25 ^🔗	i0npulse	I think the path in IA is accurate
01:25 ^🔗	arkiver	yeah it is, but we try to keep the path the way it is
01:25 ^🔗	arkiver	so path relative to path relative
01:25 ^🔗	arkiver	host relative to host relative
01:25 ^🔗	i0npulse	it's just the syntax massaging that was throwing me off: case, quotes, tags
01:25 ^🔗	arkiver	but in this case we can't
01:25 ^🔗	arkiver	yes
01:25 ^🔗	arkiver	still looking into that
01:26 ^🔗	arkiver	the old wayback machine would rewrite all URLs to host relative
01:26 ^🔗	arkiver	and that was sometimes a problem with URLs embedded in scripts
01:26 ^🔗	arkiver	they would not be recognized, etc.
01:26 ^🔗	i0npulse	what do you mean by host vs path relative
01:26 ^🔗	arkiver	(fixed in the new wayback machine)
01:26 ^🔗	arkiver	path relative = images/unreal/epicsky2.jpg
01:27 ^🔗	arkiver	so that URL is a location in the current path on the website
01:27 ^🔗	arkiver	host relative = /web/19971010052947im_/http://www.epicgames.com/images/unreal/epicsky2.jpg
01:27 ^🔗	arkiver	so URL is a location on host
01:27 ^🔗	i0npulse	OH yea, that is an issue that can be worked around on my end. At least the necessary info is there.
01:28 ^🔗	i0npulse	Nothing that sed can't solve for
01:28 ^🔗	arkiver	yeah
01:28 ^🔗	arkiver	anyway
01:28 ^🔗	*	arkiver is afk
01:28 ^🔗	i0npulse	cool! laters
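The URL shapes quoted in this exchange can be sketched in a few lines of Python. This is only an illustration built from the examples in the log, not the Wayback Machine's real rewriting code, and every function name below is hypothetical. It resolves a path-relative `src` against the page URL, wraps the result in a host-relative `/web/<timestamp>im_/` path, builds the `id_` URL for the unmodified capture, and strips the prefix again (the job i0npulse plans to hand to sed).

```python
import re
from urllib.parse import urljoin

WAYBACK_HOST = "https://web.archive.org"

# Matches an optional host, "/web/", the timestamp digits, and an optional
# two-letter modifier such as "im_" or "id_" (pattern inferred from the log).
_PREFIX = re.compile(r"^(?:https?://web\.archive\.org)?/web/\d+(?:[a-z]{2}_)?/")


def wayback_path(timestamp, url, modifier=""):
    """Host-relative replay path, e.g. /web/19971010052947im_/http://..."""
    return f"/web/{timestamp}{modifier}/{url}"


def rewrite_image_src(page_url, src, timestamp):
    """Turn a path-relative image reference into a host-relative im_ path."""
    absolute = urljoin(page_url, src)
    return wayback_path(timestamp, absolute, "im_")


def raw_capture_url(timestamp, url):
    """URL of the capture as originally stored (the id_ tip)."""
    return WAYBACK_HOST + wayback_path(timestamp, url, "id_")


def strip_wayback_prefix(rewritten):
    """Recover the absolute original URL from a rewritten reference."""
    return _PREFIX.sub("", rewritten, count=1)


if __name__ == "__main__":
    page = "http://www.epicgames.com/unreal.htm"
    print(rewrite_image_src(page, "images/unreal/epicsky2.jpg", "19971010052947"))
    print(raw_capture_url("19971010052947", page))
```

Note that stripping the prefix only recovers the absolute URL. Whether the author originally wrote it path-relative or host-relative is lost, which is why fetching the `id_` version is the safer route when diffing.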
01:35 ^🔗	godane	i'm finally capturing a home movie
02:08 ^🔗	SketchCow	wp494: S8+
02:24 ^🔗	Lord_Nigh	balrog: those magazines we have at the shop... what is the plan for those?
02:41 ^🔗	yipdw	arkiver: ah, ok. I may also give warcio a try soon
03:19 ^🔗	robogoat	archiveteam has a shop?
03:30 ^🔗	Lord_Nigh	no, balrog has space rented as a shop
03:30 ^🔗	Lord_Nigh	and has a pile of magazines boxed up
03:30 ^🔗	Lord_Nigh	as a machine shop for building stuff
03:34 ^🔗	Lord_Nigh	iirc that stuff is unscanned tech stuff from late 70s and 80s, which IA does not have
04:32 ^🔗	robogoat	Ah,
04:36 ^🔗		usr has joined #archiveteam-bs
04:49 ^🔗	usr	quiet here
04:50 ^🔗	voltagex	Anyone pushing https://loc.gov/cds/downloads/MDSConnect to archive.org?
04:51 ^🔗	xmc	Kaz: wp494: my munin is all broken and stuff, i know
04:51 ^🔗	usr	voltagex: is that just metadata or does LoC include content too?
04:52 ^🔗	voltagex	25 million bibliographical data sets apparently
04:52 ^🔗	voltagex	Someone on HN will pay for a VPS for me but I have no idea where to start
04:52 ^🔗	usr	shame it's not the contents as well, that'd be a nice haul
04:53 ^🔗	voltagex	100gb+ of metadata :p
04:55 ^🔗	usr	so question as a new person to the archive-team effort ... lots of work here to get data/data-sets, but not much about plugging the data back into a self-hosted source. if the source/website is open-source as well, would it not make sense to include that as well (not as a requirement to new projects, but as a stretch-goal) ?
04:56 ^🔗	usr	ex. i have the reddit comment data dump, but to me, it makes most sense with a local instance of reddit running
04:56 ^🔗	voltagex	The data is the irreplaceable bit
04:56 ^🔗	usr	agreed
04:56 ^🔗	usr	data is pri-0, gotta come first
04:56 ^🔗	voltagex	Most of the time it's a mad dash to save shit
04:57 ^🔗	voltagex	See the Imzy shut down - less than one month to grab everything
05:03 ^🔗	usr	i suppose my use case is a little bit off the main trail
05:04 ^🔗		bmcginty has quit IRC (Ping timeout: 246 seconds)
05:10 ^🔗		bmcginty has joined #archiveteam-bs
05:13 ^🔗	ranma	damn, IA doesn't have the audio for (nsfw) https://web.archive.org/web/20120615222011/http://soundcloud.com/htf-s
05:13 ^🔗	ranma	people recording their neighbors annoying having loud sex
05:13 ^🔗	usr	rofl
05:14 ^🔗	ranma	*annoyingly
05:50 ^🔗		usr has quit IRC (Quit: Leaving)
06:59 ^🔗		j08nY has joined #archiveteam-bs
07:01 ^🔗		SHODAN_UI has joined #archiveteam-bs
07:05 ^🔗		ndiddy has quit IRC ()
08:42 ^🔗		j08nY has quit IRC (Read error: Operation timed out)
08:53 ^🔗	JAA	jrwr: I don't know. We should though.
09:09 ^🔗		bwn has quit IRC (Ping timeout: 268 seconds)
09:21 ^🔗		bwn has joined #archiveteam-bs
09:47 ^🔗		bwn has quit IRC (Read error: Operation timed out)
10:04 ^🔗		SHODAN_UI has quit IRC (Remote host closed the connection)
10:13 ^🔗	voltagex	tfw you can't remember how to do https://launchpad.net/~voltagex/+archive/ubuntu/wget-lua
10:31 ^🔗		j08nY has joined #archiveteam-bs
11:12 ^🔗		BlueMaxim has quit IRC (Quit: Leaving)
11:42 ^🔗		SHODAN_UI has joined #archiveteam-bs
12:25 ^🔗		RichardG_ has joined #archiveteam-bs
12:25 ^🔗		RichardG has quit IRC (Read error: Connection reset by peer)
12:48 ^🔗		RichardG_ has quit IRC (Read error: Connection reset by peer)
12:50 ^🔗		RichardG has joined #archiveteam-bs
13:14 ^🔗		schbirid has joined #archiveteam-bs
15:33 ^🔗		brayden_ has joined #archiveteam-bs
15:33 ^🔗		swebb sets mode: +o brayden_
15:39 ^🔗		brayden has quit IRC (Read error: Operation timed out)
15:59 ^🔗		Odd0002_ has joined #archiveteam-bs
15:59 ^🔗		Odd0002_ has quit IRC (Client Quit)
16:35 ^🔗		dashcloud has quit IRC (Remote host closed the connection)
16:40 ^🔗		dashcloud has joined #archiveteam-bs
18:06 ^🔗		Yoshimura has quit IRC (Quit: WeeChat 0.4.2)
18:42 ^🔗		SHODAN_UI has quit IRC (Remote host closed the connection)
18:51 ^🔗		aschmitz has joined #archiveteam-bs
19:11 ^🔗	schbirid	is there something smarter than an awkward --reject-regex like '(.*?/){15}' to avoid wpull trying to get deeply nested broken forum urls with lots of non-existent subdirectories?
19:13 ^🔗	schbirid	ie shit like https://pastebin.com/raw/139U38TG
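One alternative to counting slashes with a regex is to count path segments explicitly. The sketch below is standalone Python and makes no claim about how wpull's own filters or hooks work; `path_depth`, `too_deep`, and the default limit of 15 (borrowed from the regex above) are assumptions for illustration. Unlike `(.*?/){15}`, which also counts the two slashes in `http://`, this looks only at the path component.

```python
from urllib.parse import urlsplit


def path_depth(url):
    """Number of non-empty path segments in `url`."""
    return len([segment for segment in urlsplit(url).path.split("/") if segment])


def too_deep(url, max_depth=15):
    """True when the URL nests more directories than we are willing to follow."""
    return path_depth(url) > max_depth


if __name__ == "__main__":
    for candidate in (
        "http://example.com/forum/index.php",
        "http://example.com/" + "forum/" * 20 + "index.php",
    ):
        print(path_depth(candidate), too_deep(candidate))
```

A predicate like this could be applied to a URL list before or after a crawl; wiring it into a crawler would depend on that crawler's plugin interface.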
19:30 ^🔗		BartoCH_ has joined #archiveteam-bs
19:31 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
19:51 ^🔗		BartoCH has joined #archiveteam-bs
19:51 ^🔗		BartoCH_ has quit IRC (Ping timeout: 260 seconds)
20:33 ^🔗		dashcloud has quit IRC (Remote host closed the connection)
20:36 ^🔗		SmileyG has joined #archiveteam-bs
20:42 ^🔗		Smiley has quit IRC (Read error: Operation timed out)
20:46 ^🔗		SHODAN_UI has joined #archiveteam-bs
20:57 ^🔗		bwn has joined #archiveteam-bs
21:55 ^🔗		bwn has quit IRC (Read error: Connection reset by peer)
22:10 ^🔗		SHODAN_UI has quit IRC (Quit: zzz)
22:10 ^🔗		bwn has joined #archiveteam-bs
22:37 ^🔗		Aranje has joined #archiveteam-bs
22:47 ^🔗		TheLovina has joined #archiveteam-bs
22:55 ^🔗		greenie has quit IRC (Read error: Operation timed out)
22:55 ^🔗		Smiley has joined #archiveteam-bs
22:56 ^🔗		SmileyG has quit IRC (west.us.hub irc.Prison.NET)
22:56 ^🔗		dashcloud has joined #archiveteam-bs
23:00 ^🔗		greenie has joined #archiveteam-bs
23:21 ^🔗		JensRex has quit IRC (Remote host closed the connection)
23:22 ^🔗		JensRex has joined #archiveteam-bs
23:35 ^🔗		TheLovina has quit IRC (Read error: Operation timed out)