[00:02] *** j08nY has quit IRC (Quit: Leaving)
[00:10] *** GLaDOS has quit IRC (Ping timeout: 194 seconds)
[00:19] So, is IA massaging HTML in the web archive? Code is being "corrected" on render, with code tags being changed from upper-case to lower-case, quotes being added, and end tags being added.
[00:21] It has been really frustrating diffing code from another archive against content in web.archive.org due to the alterations in the code. I would think this goes against IA being so picky about everything having a WARC file so it could preserve the current historical state of the file in its purest form.
[00:22] I attempted to submit a feedback request on the matter, but I think internal standards with developers at IA and how they serve the data to users have warped the presentation of historically relevant content.
[00:23] I wish they would just leave the code broken and malformed like it originally was.
[00:26] Original Code:
[00:26] IA code:
[00:34] *** Xibalba has quit IRC (ZNC 1.7.x-git-737-29d4f20-frankenznc - http://znc.in)
[00:36] *** Xibalba has joined #archiveteam-bs
[00:40] *** Xibalba has quit IRC (ZNC 1.7.x-git-737-29d4f20-frankenznc - http://znc.in)
[00:45] *** Xibalba has joined #archiveteam-bs
[00:45] *** BlueMaxim has joined #archiveteam-bs
[00:48] i0npulse: Maybe this is a distinction between web.archive.org-the-browsable-version and the archive-as-you-can-download-it-via-API
[00:48] *** GLaDOS has joined #archiveteam-bs
[00:51] yipdw: yeah, the python warc library is horrible
[00:51] it needs to load the full record into memory
[00:51] which just sucks
[00:52] there is now https://github.com/webrecorder/warcio which does not do this afaik
[00:52] but there was a problem with writing records that are not request or response records
[00:52] I believe that was fixed a few weeks ago
[00:52] so I will give warcio a second try soon
[00:53] but until then I don't know if we can trust it enough to actually replace the deduplication warc stuff we are using now
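The memory complaint above is about record iteration: the old internetarchive/warc library materializes a record's full payload before handing it over, while warcio streams. Below is a stdlib-only sketch of the streaming idea; the function name `iter_warc_records` and the simplified framing are illustrative and not warcio's actual API (real WARC handling also deals with gzip members, the HTTP header block inside payloads, etc.).

```python
import io

def iter_warc_records(stream):
    """Yield (headers, payload_length) per WARC record, consuming each
    payload in fixed-size chunks instead of buffering it whole in memory
    -- the streaming approach, as opposed to load-the-record-into-RAM."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # blank line terminates the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers["Content-Length"])
        remaining = length
        while remaining:  # skip the payload chunk by chunk
            chunk = stream.read(min(65536, remaining))
            if not chunk:
                raise ValueError("truncated record")
            remaining -= len(chunk)
        yield headers, length
```

With warcio itself, the equivalent is iterating `warcio.archiveiterator.ArchiveIterator` over an open file object.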
[00:56] so I hope we can have that tested soon and after that totally abandon https://github.com/internetarchive/warc
[00:57] i0npulse: what URL did you use?
[00:59] let me see
[01:01] https://web.archive.org/web/19971010052947/http://www.epicgames.com/unreal.htm
[01:01] and adding id_ gives the original version
[01:01] https://web.archive.org/web/19971010052947id_/http://www.epicgames.com/unreal.htm
[01:02] ok nice
[01:03] yeah I see the problem
[01:03]
[01:03] to
[01:03]
[01:03] which is strange
[01:04] thanks
[01:04] Yea, I originally noticed it when I repaired an old SNK Official Homepage for the company back in 2001, that someone had archived way back then with wget.
[01:04] Some files were missing, so I was patching corrections in via IA... and that's when I noticed the syntax discrepancies.
[01:05] yeah
[01:06] Eventually I will build code in node to reach IA's API, but at the moment my entire toolchain is in bash with GNU tools.
[01:06] nice find ;)
[01:06] :)*
[01:08] will keep you informed on this
[01:08] *** ndiddy has joined #archiveteam-bs
[01:09] i0npulse: it's very likely that they're using a DOM-aware library for rewriting URLs (for increased reliability), i.e. parsing the document, modifying some attributes in the DOM, and then stringifying it again
[01:09] i0npulse: for these kinds of operations, it's not typical to see errors in the original document reflected in the output; generally a parser will patch things 'on the fly' and only present you with a DOM / AST / whatever, without any annotations that indicate things like spacing, parsing errors, etc.
[01:10] so there's no way to accurately recreate the original code's structure from that parsed data
[01:10] we're discovering a rewriting URLs using custom software
[01:10] (this is an issue I've run into a few times with code rewriting projects)
[01:11] arkiver: can you rephrase that?
syntax error in line 1 of that sentence :P
[01:11] lol
[01:11] uh
[01:11] i think a = and
[01:11] yeah
[01:11] ah, right. but if you're using a library for parsing HTML, you still end up with the same problem :P
[01:12] lots of parsers flat-out don't support annotating things like this
[01:13] Wow, the MAP tag
[01:15] beautifulsoup is usually what python folk use to parse & prettify html... might be responsible for some of the changes you're seeing
[01:15] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[01:16] I had rebuilt the old Unreal Technology website, and the original admins had case dyslexia, so files were linked as UnrealVersion.htm, unrealversion.htm, and unrealVersion.htm all throughout the code. Which sent the original HTTrack grab by Hyper.nl on a wild ride, appending -1, -2, -3 onto file names.
[01:17] Then I was looking at the "corrected syntax" output from IA, and was like "ok WHICH is the real file name!?"
[01:17] I already packed the site... so now the id_ tip from arkiver...
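The beautifulsoup guess above is plausible in spirit: any event- or DOM-based HTML parser discards the original byte-level form before a serializer writes it back out. A minimal sketch using Python's stdlib `html.parser` shows how a round trip alone lowercases tag and attribute names and adds quotes; the `Normalizer` class is illustrative only and skips many constructs (comments, entity escaping in attribute values, void elements).

```python
from html.parser import HTMLParser

class Normalizer(HTMLParser):
    """Re-serialize HTML from parser events. The parser only hands us
    normalized tokens, so upper-case tags and unquoted attribute values
    in the input are unrecoverable -- exactly the 'massaging' effect."""
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute *names* arrive already lowercased;
        # values are re-emitted with quotes regardless of the input.
        rendered = "".join(' %s="%s"' % (k, v or "") for k, v in attrs)
        self.out.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

def normalize(html):
    parser = Normalizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, `normalize('<IMG SRC=images/unreal/epicsky2.jpg>')` comes back as `<img src="images/unreal/epicsky2.jpg">`; a full DOM library like BeautifulSoup will additionally insert missing end tags.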
wondering if I go back and fix anything or not lol
[01:22] I had a look at wayback
[01:23] images/unreal/epicsky2.jpg is currently rewritten to host relative
[01:23] *** Sk1d has joined #archiveteam-bs
[01:23] since it is seen as an image and thus should have 'im_' appended to the timestamp
[01:23] /web/19971010052947im_/http://www.epicgames.com/images/unreal/epicsky2.jpg
[01:24] so the image path discrepancy between Hyper.nl and IA is probably a difference in how Hyper.nl's copy was originally retrieved
[01:24] if we would keep it path relative we could not get the im_ in
[01:24] do you have an example
[01:24] post the example, will have a look tomorrow
[01:25] * arkiver is afk
[01:25] I think the path in IA is accurate
[01:25] yeah it is, but we try to keep the path the way it is
[01:25] so path relative to path relative
[01:25] host relative to host relative
[01:25] it's just the syntax massaging that was being thrown off: case, quotes, tags
[01:25] but in this case we can't
[01:25] yes
[01:25] still looking into that
[01:26] the old wayback machine would rewrite all URLs to host relative
[01:26] and that was sometimes a problem with URLs embedded in scripts
[01:26] they would not be recognized, etc.
[01:26] what do you mean by host vs path relative
[01:26] (fixed in the new wayback machine)
[01:26] path relative = images/unreal/epicsky2.jpg
[01:27] so that URL is a location in the current path on the website
[01:27] host relative = /web/19971010052947im_/http://www.epicgames.com/images/unreal/epicsky2.jpg
[01:27] so the URL is a location on the host
[01:27] OH yea, that is an issue that can be worked around on my end. At least the necessary info is there.
[01:28] Nothing that sed can't solve for
[01:28] yeah
[01:28] anyway
[01:28] * arkiver is afk
[01:28] cool! laters
[01:35] i'm finally capturing a home movie
[02:08] wp494: S8+
[02:24] balrog: those magazines we have at the shop... what is the plan for those?
[02:41] arkiver: ah, ok.
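The id_ and im_ strings discussed above are flags appended to the timestamp in Wayback playback URLs: id_ returns the unmodified original capture, im_ selects image playback, and no flag gives the default rewritten page. A trivial helper for building them (the function name `wayback_url` is my own, not an IA API):

```python
def wayback_url(timestamp, original_url, flag=""):
    """Build a Wayback Machine playback URL.

    flag "id_" -> the unmodified original bytes of the capture
    flag "im_" -> image playback
    flag ""    -> the default (rewritten) page
    """
    return "https://web.archive.org/web/%s%s/%s" % (timestamp, flag, original_url)
```

For diffing against another archive, the id_ form is the one to fetch, since it bypasses the HTML re-serialization entirely.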
I may also give warcio a try soon
[03:19] archiveteam has a shop?
[03:30] no, balrog has space rented as a shop
[03:30] and has a pile of magazines boxed up
[03:30] as a machine shop for building stuff
[03:34] iirc that stuff is unscanned tech stuff from the late 70s and 80s, which IA does not have
[04:32] Ah,
[04:36] *** usr has joined #archiveteam-bs
[04:49] quiet here
[04:50] Anyone pushing https://loc.gov/cds/downloads/MDSConnect to archive.org?
[04:51] Kaz: wp494: my munin is all broken and stuff, i know
[04:51] voltagex: is that just metadata or does LoC include content too?
[04:52] 25 million bibliographical data sets apparently
[04:52] Someone on HN will pay for a VPS for me but I have no idea where to start
[04:52] shame it's not the contents as well, that'd be a nice haul
[04:53] 100gb+ of metadata :p
[04:55] so, question as a new person to the archiveteam effort... lots of work here to get data/data-sets, but not much about plugging the data back into a self-hosted source. if the source/website is open-source as well, would it not make sense to include that too (not as a requirement for new projects, but as a stretch goal)?
[04:56] ex.
i have the reddit comment data dump, but to me, it makes most sense with a local instance of reddit running
[04:56] The data is the irreplaceable bit
[04:56] agreed
[04:56] data is pri-0, gotta come first
[04:56] Most of the time it's a mad dash to save shit
[04:57] See the Imzy shutdown - less than one month to grab everything
[05:03] i suppose my use case is a little bit off the main trail
[05:04] *** bmcginty has quit IRC (Ping timeout: 246 seconds)
[05:10] *** bmcginty has joined #archiveteam-bs
[05:13] damn, IA doesn't have the audio for (nsfw) https://web.archive.org/web/20120615222011/http://soundcloud.com/htf-s
[05:13] people recording their neighbors annoying having loud sex
[05:13] rofl
[05:14] *annoyingly
[05:50] *** usr has quit IRC (Quit: Leaving)
[06:59] *** j08nY has joined #archiveteam-bs
[07:01] *** SHODAN_UI has joined #archiveteam-bs
[07:05] *** ndiddy has quit IRC ()
[08:42] *** j08nY has quit IRC (Read error: Operation timed out)
[08:53] jrwr: I don't know. We should though.
[09:09] *** bwn has quit IRC (Ping timeout: 268 seconds)
[09:21] *** bwn has joined #archiveteam-bs
[09:47] *** bwn has quit IRC (Read error: Operation timed out)
[10:04] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[10:13] tfw you can't remember how to do https://launchpad.net/~voltagex/+archive/ubuntu/wget-lua
[10:31] *** j08nY has joined #archiveteam-bs
[11:12] *** BlueMaxim has quit IRC (Quit: Leaving)
[11:42] *** SHODAN_UI has joined #archiveteam-bs
[12:25] *** RichardG_ has joined #archiveteam-bs
[12:25] *** RichardG has quit IRC (Read error: Connection reset by peer)
[12:48] *** RichardG_ has quit IRC (Read error: Connection reset by peer)
[12:50] *** RichardG has joined #archiveteam-bs
[13:14] *** schbirid has joined #archiveteam-bs
[15:33] *** brayden_ has joined #archiveteam-bs
[15:33] *** swebb sets mode: +o brayden_
[15:39] *** brayden has quit IRC (Read error: Operation timed out)
[15:59] *** Odd0002_ has joined #archiveteam-bs
[15:59] *** Odd0002_ has quit IRC (Client Quit)
[16:35] *** dashcloud has quit IRC (Remote host closed the connection)
[16:40] *** dashcloud has joined #archiveteam-bs
[18:06] *** Yoshimura has quit IRC (Quit: WeeChat 0.4.2)
[18:42] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[18:51] *** aschmitz has joined #archiveteam-bs
[19:11] is there something smarter than an awkward --reject-regex like '(.*?/){15}' to avoid wpull trying to get deeply nested broken forum urls with lots of non-existent subdirectories?
[19:13] ie shit like https://pastebin.com/raw/139U38TG
[19:30] *** BartoCH_ has joined #archiveteam-bs
[19:31] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[19:51] *** BartoCH has joined #archiveteam-bs
[19:51] *** BartoCH_ has quit IRC (Ping timeout: 260 seconds)
[20:33] *** dashcloud has quit IRC (Remote host closed the connection)
[20:36] *** SmileyG has joined #archiveteam-bs
[20:42] *** Smiley has quit IRC (Read error: Operation timed out)
[20:46] *** SHODAN_UI has joined #archiveteam-bs
[20:57] *** bwn has joined #archiveteam-bs
[21:55] *** bwn has quit IRC (Read error: Connection reset by peer)
[22:10] *** SHODAN_UI has quit IRC (Quit: zzz)
[22:10] *** bwn has joined #archiveteam-bs
[22:37] *** Aranje has joined #archiveteam-bs
[22:47] *** TheLovina has joined #archiveteam-bs
[22:55] *** greenie has quit IRC (Read error: Operation timed out)
[22:55] *** Smiley has joined #archiveteam-bs
[22:56] *** SmileyG has quit IRC (west.us.hub irc.Prison.NET)
[22:56] *** dashcloud has joined #archiveteam-bs
[23:00] *** greenie has joined #archiveteam-bs
[23:21] *** JensRex has quit IRC (Remote host closed the connection)
[23:22] *** JensRex has joined #archiveteam-bs
[23:35] *** TheLovina has quit IRC (Read error: Operation timed out)
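On the [19:11] question: the `--reject-regex '(.*?/){15}'` trick works because the pattern matches any URL containing fifteen or more slashes (including the two after the scheme), which deeply nested broken forum paths always have. A sketch of that pattern plus an arguably clearer segment-counting alternative; the `too_deep` helper and the example URLs are illustrative only (wpull itself only accepts the regex form):

```python
import re
from urllib.parse import urlsplit

# The reject pattern from the log: fifteen non-greedy "anything then a
# slash" groups, i.e. it fires on any URL with 15+ slashes total.
REJECT = re.compile(r'(.*?/){15}')

def too_deep(url, max_segments=15):
    """Equivalent intent, stated directly: reject URLs whose path has
    max_segments or more non-empty segments. The scheme's // does not
    count here, unlike in the raw regex applied to the full URL."""
    path = urlsplit(url).path
    return len([seg for seg in path.split("/") if seg]) >= max_segments
```

A custom URL filter like this is the kind of thing wpull's plugin hooks can express more precisely than a single reject regex.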