#archiveteam-bs 2017-05-29,Mon

Time Nickname Message
00:02 🔗 j08nY has quit IRC (Quit: Leaving)
00:10 🔗 GLaDOS has quit IRC (Ping timeout: 194 seconds)
00:19 🔗 i0npulse So, is IA massaging HTML in the web archive? Code is being "corrected" on render, with code tags being changed from upper-case to lower-case, quotes being added, and end tags being added.
00:21 🔗 i0npulse It has been really frustrating diffing code from another archive against content in web.archive.org due to the alterations in the code. I would think this goes against IA being so picky about everything having a warc file so it could preserve the current historical state of the file in its purest form.
00:22 🔗 i0npulse I attempted to submit a feedback request on the matter, but I think internal standards with developers at IA and how they serve the data to users have warped the presentation of historically relevant content.
00:23 🔗 i0npulse I wish they would just leave the code broken and malformed like it originally was.
00:26 🔗 i0npulse Original Code: <IMG ALIGN=left border=0 hspace=5 SRC="epicsky2.jpg" width="106" height="66">
00:26 🔗 i0npulse IA code: <img align="left" border="0" hspace="5" src="/web/19971010052947im_/http://www.epicgames.com:80/images/unreal/epicsky2.jpg">
00:34 🔗 Xibalba has quit IRC (ZNC 1.7.x-git-737-29d4f20-frankenznc - http://znc.in)
00:36 🔗 Xibalba has joined #archiveteam-bs
00:40 🔗 Xibalba has quit IRC (ZNC 1.7.x-git-737-29d4f20-frankenznc - http://znc.in)
00:45 🔗 Xibalba has joined #archiveteam-bs
00:45 🔗 BlueMaxim has joined #archiveteam-bs
00:48 🔗 timmc i0npulse: Maybe this is a distinction between web.archive.org-the-browsable-version and the archive-as-you-can-download-it-via-API
00:48 🔗 GLaDOS has joined #archiveteam-bs
00:51 🔗 arkiver yipdw: yeah, python warc library is horrible
00:51 🔗 arkiver it needs to load the full record into memory
00:51 🔗 arkiver which just sucks
00:52 🔗 arkiver there is now https://github.com/webrecorder/warcio which does not do this afaik
00:52 🔗 arkiver but there was a problem with writing records that are not request or response records
00:52 🔗 arkiver I believe that was fixed a few weeks ago
00:52 🔗 arkiver so I will give warcio a second try soon
00:53 🔗 arkiver but until then I don't know if we can trust it enough to actually replace the deduplication warc stuff we are using now
00:56 🔗 arkiver so I hope we can have that tested soon and after that totally abandon https://github.com/internetarchive/warc
00:57 🔗 arkiver i0npulse: what URL did you use?
00:59 🔗 i0npulse let me see
01:01 🔗 i0npulse https://web.archive.org/web/19971010052947/http://www.epicgames.com/unreal.htm
01:01 🔗 arkiver and adding id_ gives the original version
01:01 🔗 arkiver https://web.archive.org/web/19971010052947id_/http://www.epicgames.com/unreal.htm
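arkiver's tip above (put `id_` after the timestamp to get the unmodified capture) can be sketched in Python. This is a minimal illustration based only on the two URLs quoted in the log; the function name `with_modifier` and the regular expression are assumptions made for this sketch, not part of any Wayback Machine client library.

```python
import re

# Replay URLs in the log have the shape
#   https://web.archive.org/web/<timestamp><optional modifier>/<original URL>
# The pattern below is an assumption derived from those examples.
_REPLAY = re.compile(
    r"^(https?://web\.archive\.org/web/)(\d{1,14})([a-z]{2}_)?/(.*)$"
)


def with_modifier(replay_url, modifier="id_"):
    """Return replay_url with the given modifier placed after the timestamp.

    Hypothetical helper: 'id_' is the modifier arkiver mentions for the
    original bytes, 'im_' is the one the log shows on rewritten image links.
    """
    match = _REPLAY.match(replay_url)
    if match is None:
        raise ValueError(f"not a recognised replay URL: {replay_url!r}")
    prefix, timestamp, _old_modifier, original = match.groups()
    return f"{prefix}{timestamp}{modifier}/{original}"


if __name__ == "__main__":
    print(with_modifier(
        "https://web.archive.org/web/19971010052947/"
        "http://www.epicgames.com/unreal.htm"
    ))
```

Fetching the resulting URL (for example with curl or wget in a bash toolchain) should then return the page without the rewritten markup, as arkiver describes.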
01:02 🔗 i0npulse ok nice
01:03 🔗 arkiver yeah I see the problem
01:03 🔗 arkiver <A HREF="index.htm"><IMG ALIGN=left border=0 hspace=5 SRC="images/unreal/epicsky2.jpg">
01:03 🔗 arkiver to
01:03 🔗 arkiver <a href="index.htm"><img align="left" border="0" hspace="5" src="/web/19971010052947im_/http://www.epicgames.com/images/unreal/epicsky2.jpg">
01:03 🔗 arkiver which is strange
01:04 🔗 arkiver thanks
01:04 🔗 i0npulse Yea I originally noticed it when I repaired an old SNK Official Homepage for the company back in 2001, that someone had archived way back then with wget.
01:04 🔗 i0npulse Some files were missing, so I was patching corrections in via IA... and that's when I noticed the syntax discrepancies.
01:05 🔗 arkiver yeah
01:06 🔗 i0npulse Eventually I will build code in node to reach IA's API, but at the moment my entire toolchain is in bash with gnu tools.
01:06 🔗 arkiver nice find ;)
01:06 🔗 arkiver :)*
01:08 🔗 arkiver will keep you informed on this
01:08 🔗 ndiddy has joined #archiveteam-bs
01:09 🔗 joepie91 i0npulse: it's very likely that they're using a DOM-aware library for rewriting URLs (for increased reliability), ie. parsing the document, modifying some attributes in the DOM, and then stringifying it again
01:09 🔗 joepie91 i0npulse: for these kinds of operations, it's not typical to see errors in the original document reflected in the output; generally a parser will patch things 'on the fly' and only present you with a DOM / AST / whatever, without any annotations that indicate things like spacing, parsing errors, etc.
01:10 🔗 joepie91 so there's no way to accurately recreate the original code's structure from that parsed data
01:10 🔗 arkiver we're discovering a rewriting URLs using custom software
01:10 🔗 joepie91 (this is an issue I've run into a few times with code rewriting projects)
01:11 🔗 joepie91 arkiver: can you rephrase that? syntax error in line 1 of that sentence :P
01:11 🔗 i0npulse lol
01:11 🔗 arkiver uh
01:11 🔗 i0npulse i think a = and
01:11 🔗 arkiver yeah
01:11 🔗 joepie91 ah, right. but if you're using a library for parsing HTML, you still end up with the same problem :P
01:12 🔗 joepie91 lots of parsers flat-out don't support annotating things like this
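joepie91's point about parse-then-stringify can be illustrated with Python's standard-library `html.parser`. This is only a sketch of the general effect, not what the Wayback Machine runs (the log does not say which library it uses); the names `Reserializer` and `reserialize` are made up for the example.

```python
from html import escape
from html.parser import HTMLParser


class Reserializer(HTMLParser):
    """Re-emit markup from parser events (hypothetical helper class).

    html.parser reports tag and attribute names in lower case and hands over
    attribute values with their quotes removed, so the original casing and
    quoting style are already gone by the time we write the tag back out.
    """

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            f" {name}" if value is None
            else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


def reserialize(markup):
    """Parse markup and stringify it again."""
    parser = Reserializer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(reserialize('<IMG ALIGN=left border=0 hspace=5 SRC="epicsky2.jpg">'))
```

Because the parser never records whether a value was quoted or how a name was cased, the output cannot be turned back into the original bytes. This stdlib parser is only a tokenizer, so unlike a full DOM library it does not add missing end tags.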
01:13 🔗 jrwr Wow, the MAP tag
01:15 🔗 alembic beautifulsoup is usually what python folk use to parse & prettify html... might be responsible for some of the changes you're seeing
01:15 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
01:16 🔗 i0npulse I had rebuilt the old Unreal Technology website, and the original admins had case dyslexia, so files were linked as UnrealVersion.htm, unrealversion.htm, and unrealVersion.htm throughout the code. That sent the original HTTrack grab by Hyper.nl on a wild ride appending -1, -2, -3 onto file names.
01:17 🔗 i0npulse Then I was looking at the "corrected syntax" output from IA, and was like "ok WHICH is the real file name!?"
01:17 🔗 i0npulse I already packed the site... so now the id_ tip from arkiver... wondering if I go back and fix anything or not lol
01:22 🔗 arkiver I had a look at wayback
01:23 🔗 arkiver images/unreal/epicsky2.jpg is currently rewritten to host relative
01:23 🔗 Sk1d has joined #archiveteam-bs
01:23 🔗 arkiver since it is seen as an image and thus should have 'im_' appended to the timestamp
01:23 🔗 arkiver /web/19971010052947im_/http://www.epicgames.com/images/unreal/epicsky2.jpg
01:24 🔗 i0npulse so the image path discrepancy between Hyper.nl and IA is probably a difference in how Hyper.nl's copy was originally retrieved
01:24 🔗 arkiver if we would keep it path relative we could not get the im_ in
01:24 🔗 arkiver do you have an example
01:24 🔗 arkiver post the example, will have a look tomorrow
01:25 🔗 * arkiver is afk
01:25 🔗 i0npulse I think the path in IA is accurate
01:25 🔗 arkiver yeah it is, but we try to keep the path the way it is
01:25 🔗 arkiver so path relative to path relative
01:25 🔗 arkiver host relative to host relative
01:25 🔗 i0npulse it's just the syntax massaging that was being thrown off, case, quotes, tags
01:25 🔗 arkiver but in this case we can't
01:25 🔗 arkiver yes
01:25 🔗 arkiver still looking into that
01:26 🔗 arkiver the old wayback machine would rewrite all URLs to host relative
01:26 🔗 arkiver and that was sometimes a problem with URLs embedded in scripts
01:26 🔗 arkiver they would not be recognized, etc.
01:26 🔗 i0npulse what do you mean by host vs path relative
01:26 🔗 arkiver (fixed in the new wayback machine)
01:26 🔗 arkiver path relative = images/unreal/epicsky2.jpg
01:27 🔗 arkiver so that URL is a location in the current path on the website
01:27 🔗 arkiver host relative = /web/19971010052947im_/http://www.epicgames.com/images/unreal/epicsky2.jpg
01:27 🔗 arkiver so URL is a location on host
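The path-relative to host-relative rewrite arkiver describes can be sketched as follows. This is an assumption-laden illustration built from the example URLs in the log, not the Wayback Machine's actual rewriting code; `rewrite_image_src` is a hypothetical name.

```python
from urllib.parse import urljoin


def rewrite_image_src(page_url, src, timestamp):
    """Turn an image reference into a host-relative replay path.

    Hypothetical helper. The reference is first resolved against the page it
    appeared on, then prefixed with /web/<timestamp>im_/ so the 'im_' marker
    can be carried in the URL, which a path-relative link could not do.
    """
    absolute = urljoin(page_url, src)
    return f"/web/{timestamp}im_/{absolute}"


if __name__ == "__main__":
    print(rewrite_image_src(
        "http://www.epicgames.com/unreal.htm",
        "images/unreal/epicsky2.jpg",
        "19971010052947",
    ))
```

Going the other way, as i0npulse suggests doing with sed, amounts to stripping the `/web/<timestamp>im_/` prefix and making the remainder relative to the page again.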
01:27 🔗 i0npulse OH yea, that is an issue that can be worked around on my end. At least the necessary info is there.
01:28 🔗 i0npulse Nothing that sed can't solve for
01:28 🔗 arkiver yeah
01:28 🔗 arkiver anyway
01:28 🔗 * arkiver is afk
01:28 🔗 i0npulse cool! laters
01:35 🔗 godane i'm finally capturing a home movie
02:08 🔗 SketchCow wp494: S8+
02:24 🔗 Lord_Nigh balrog: those magazines we have at the shop... what is the plan for those?
02:41 🔗 yipdw arkiver: ah, ok. I may also give warcio a try soon
03:19 🔗 robogoat archiveteam has a shop?
03:30 🔗 Lord_Nigh no, balrog has space rented as a shop
03:30 🔗 Lord_Nigh and has a pile of magazines boxed up
03:30 🔗 Lord_Nigh as a machine shop for building stuff
03:34 🔗 Lord_Nigh iirc that stuff is unscanned tech stuff from late 70s and 80s, which IA does not have
04:32 🔗 robogoat Ah,
04:36 🔗 usr has joined #archiveteam-bs
04:49 🔗 usr quiet here
04:50 🔗 voltagex Anyone pushing https://loc.gov/cds/downloads/MDSConnect to archive.org?
04:51 🔗 xmc Kaz: wp494: my munin is all broken and stuff, i know
04:51 🔗 usr voltagex: is that just metadata or does LoC include content too?
04:52 🔗 voltagex 25 million bibliographical data sets apparently
04:52 🔗 voltagex Someone on HN will pay for a VPS for me but I have no idea where to start
04:52 🔗 usr shame it's not the contents as well, that'd be a nice haul
04:53 🔗 voltagex 100gb+ of metadata :p
04:55 🔗 usr so question as a new person to the archive-team effort ... lots of work here to get data/data-sets, but not much about plugging the data back into a self-hosted source. if the source/website is open-source as well, would it not make sense to include that as well (not as a requirement to new projects, but as a stretch-goal) ?
04:56 🔗 usr ex. i have the reddit comment data dump, but to me, it makes most sense with a local instance of reddit running
04:56 🔗 voltagex The data is the irreplaceable bit
04:56 🔗 usr agreed
04:56 🔗 usr data is pri-0, gotta come first
04:56 🔗 voltagex Most of the time it's a mad dash to save shit
04:57 🔗 voltagex See the Imzy shut down - less than one month to grab everything
05:03 🔗 usr i suppose my use case is a little bit off the main trail
05:04 🔗 bmcginty has quit IRC (Ping timeout: 246 seconds)
05:10 🔗 bmcginty has joined #archiveteam-bs
05:13 🔗 ranma damn, IA doesn't have the audio for (nsfw) https://web.archive.org/web/20120615222011/http://soundcloud.com/htf-s
05:13 🔗 ranma people recording their neighbors annoying having loud sex
05:13 🔗 usr rofl
05:14 🔗 ranma *annoyingly
05:50 🔗 usr has quit IRC (Quit: Leaving)
06:59 🔗 j08nY has joined #archiveteam-bs
07:01 🔗 SHODAN_UI has joined #archiveteam-bs
07:05 🔗 ndiddy has quit IRC ()
08:42 🔗 j08nY has quit IRC (Read error: Operation timed out)
08:53 🔗 JAA jrwr: I don't know. We should though.
09:09 🔗 bwn has quit IRC (Ping timeout: 268 seconds)
09:21 🔗 bwn has joined #archiveteam-bs
09:47 🔗 bwn has quit IRC (Read error: Operation timed out)
10:04 🔗 SHODAN_UI has quit IRC (Remote host closed the connection)
10:13 🔗 voltagex tfw you can't remember how to do https://launchpad.net/~voltagex/+archive/ubuntu/wget-lua
10:31 🔗 j08nY has joined #archiveteam-bs
11:12 🔗 BlueMaxim has quit IRC (Quit: Leaving)
11:42 🔗 SHODAN_UI has joined #archiveteam-bs
12:25 🔗 RichardG_ has joined #archiveteam-bs
12:25 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
12:48 🔗 RichardG_ has quit IRC (Read error: Connection reset by peer)
12:50 🔗 RichardG has joined #archiveteam-bs
13:14 🔗 schbirid has joined #archiveteam-bs
15:33 🔗 brayden_ has joined #archiveteam-bs
15:33 🔗 swebb sets mode: +o brayden_
15:39 🔗 brayden has quit IRC (Read error: Operation timed out)
15:59 🔗 Odd0002_ has joined #archiveteam-bs
15:59 🔗 Odd0002_ has quit IRC (Client Quit)
16:35 🔗 dashcloud has quit IRC (Remote host closed the connection)
16:40 🔗 dashcloud has joined #archiveteam-bs
18:06 🔗 Yoshimura has quit IRC (Quit: WeeChat 0.4.2)
18:42 🔗 SHODAN_UI has quit IRC (Remote host closed the connection)
18:51 🔗 aschmitz has joined #archiveteam-bs
19:11 🔗 schbirid is there something smarter than an awkward --reject-regex like '(.*?/){15}' to avoid wpull trying to get deeply nested broken forum urls with lots of non-existent subdirectories?
19:13 🔗 schbirid ie shit like https://pastebin.com/raw/139U38TG
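The log records no answer to schbirid's question. As a rough sketch of what the quoted pattern does, and of one alternative way to express the same idea, the Python below compares the regex with an explicit path-segment count. The function names and the `max_depth` default are assumptions for illustration; this is not a wpull option or plugin.

```python
import re
from urllib.parse import urlsplit

# The pattern quoted in the log: fifteen "anything up to a slash" groups.
DEEP = re.compile(r"(.*?/){15}")


def too_deep_by_regex(url):
    """True if the pattern from the log is found anywhere in url."""
    return DEEP.search(url) is not None


def path_depth(url):
    """Number of non-empty segments in the path part of url only."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def too_deep(url, max_depth=12):
    """Hypothetical alternative: reject by path depth, ignoring the scheme."""
    return path_depth(url) > max_depth


if __name__ == "__main__":
    shallow = "http://example.com/forum/a/b"
    deep = "http://example.com/" + "x/" * 20
    for candidate in (shallow, deep):
        print(too_deep_by_regex(candidate), path_depth(candidate))
```

If the pattern is matched against the full URL, it also counts the two slashes in `http://` and the one after the host name, so it triggers at fewer than fifteen path segments. Counting segments of the path alone makes the threshold explicit.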
19:30 🔗 BartoCH_ has joined #archiveteam-bs
19:31 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
19:51 🔗 BartoCH has joined #archiveteam-bs
19:51 🔗 BartoCH_ has quit IRC (Ping timeout: 260 seconds)
20:33 🔗 dashcloud has quit IRC (Remote host closed the connection)
20:36 🔗 SmileyG has joined #archiveteam-bs
20:42 🔗 Smiley has quit IRC (Read error: Operation timed out)
20:46 🔗 SHODAN_UI has joined #archiveteam-bs
20:57 🔗 bwn has joined #archiveteam-bs
21:55 🔗 bwn has quit IRC (Read error: Connection reset by peer)
22:10 🔗 SHODAN_UI has quit IRC (Quit: zzz)
22:10 🔗 bwn has joined #archiveteam-bs
22:37 🔗 Aranje has joined #archiveteam-bs
22:47 🔗 TheLovina has joined #archiveteam-bs
22:55 🔗 greenie has quit IRC (Read error: Operation timed out)
22:55 🔗 Smiley has joined #archiveteam-bs
22:56 🔗 SmileyG has quit IRC (west.us.hub irc.Prison.NET)
22:56 🔗 dashcloud has joined #archiveteam-bs
23:00 🔗 greenie has joined #archiveteam-bs
23:21 🔗 JensRex has quit IRC (Remote host closed the connection)
23:22 🔗 JensRex has joined #archiveteam-bs
23:35 🔗 TheLovina has quit IRC (Read error: Operation timed out)
