#archiveteam-bs 2017-05-29,Mon

Who What When
***j08nY has quit IRC (Quit: Leaving) [00:02]
GLaDOS has quit IRC (Ping timeout: 194 seconds) [00:10]
i0npulseSo, is IA massaging HTML in the web archive? Code is being "corrected" on render, with code tags being changed from upper-case to lower-case, quotes being added, and end tags being added.
It has been really frustrating diffing code from another archive against content in web.archive.org due to the alterations in the code. I would think this goes against IA being so picky about everything having a warc file so it could preserve the current historical state of the file in its purest form.
I attempted to submit a feedback request on the matter, but I think internal standards with developers at IA and how they serve the data to users have warped the presentation of historically relevant content.
I wish they would just leave the code broken and malformed like it originally was.
Original Code: <IMG ALIGN=left border=0 hspace=5 SRC="epicsky2.jpg" width="106" height="66">
IA code: <img align="left" border="0" hspace="5" src="/web/19971010052947im_/http://www.epicgames.com:80/images/unreal/epicsky2.jpg">
[00:19]
***Xibalba has quit IRC (ZNC 1.7.x-git-737-29d4f20-frankenznc - http://znc.in)
Xibalba has joined #archiveteam-bs
Xibalba has quit IRC (ZNC 1.7.x-git-737-29d4f20-frankenznc - http://znc.in)
[00:34]
Xibalba has joined #archiveteam-bs
BlueMaxim has joined #archiveteam-bs
[00:45]
timmci0npulse: Maybe this is a distinction between web.archive.org-the-browsable-version and the archive-as-you-can-download-it-via-API [00:48]
***GLaDOS has joined #archiveteam-bs [00:48]
arkiveryipdw: yeah, python warc library is horrible
it needs to load the full record into memory
which just sucks
there is now https://github.com/webrecorder/warcio which does not do this afaik
but there was a problem with writing records that are not request or response records
I believe that was fixed a few weeks ago
so I will give warcio a second try soon
but until then I don't know if we can trust it enough to actually replace the deduplication warc stuff we are using now
so I hope we can have that tested soon and after that totally abandon https://github.com/internetarchive/warc
i0npulse: what URL did you use?
[00:51]
i0npulselet me see
https://web.archive.org/web/19971010052947/http://www.epicgames.com/unreal.htm
[00:59]
arkiverand adding id_ gives the original version
https://web.archive.org/web/19971010052947id_/http://www.epicgames.com/unreal.htm
[01:01]
i0npulseok nice [01:02]
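The id_ tip above can be scripted. The following is a minimal sketch, assuming the only thing needed is to put id_ right after the timestamp of a web.archive.org URL, as in the two URLs arkiver and i0npulse posted. The function name raw_wayback_url and the regular expression are illustrative, not part of any IA tool.

```python
import re

# Matches the start of a Wayback URL up to and including the timestamp,
# plus an optional two-letter modifier such as im_ or id_ (assumed shape).
_WAYBACK_PREFIX = re.compile(
    r'^(https?://web\.archive\.org/web/\d{1,14})(?:[a-z]{2}_)?/'
)

def raw_wayback_url(url):
    """Return the same capture URL with the id_ modifier after the timestamp."""
    return _WAYBACK_PREFIX.sub(r'\1id_/', url, count=1)

if __name__ == "__main__":
    print(raw_wayback_url(
        "https://web.archive.org/web/19971010052947/"
        "http://www.epicgames.com/unreal.htm"
    ))
```

URLs that do not look like a Wayback capture are returned unchanged.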
arkiveryeah I see the problem
<A HREF="index.htm"><IMG ALIGN=left border=0 hspace=5 SRC="images/unreal/epicsky2.jpg">
to
<a href="index.htm"><img align="left" border="0" hspace="5" src="/web/19971010052947im_/http://www.epicgames.com/images/unreal/epicsky2.jpg">
which is strange
thanks
[01:03]
i0npulseYea I originally noticed it when I repaired an old SNK Official Homepage for the company back in 2001, that someone had archived way back then with wget.
Some files were missing, so I was patching corrections in via IA... and that's when I noticed the syntax discrepancies.
[01:04]
arkiveryeah [01:05]
i0npulseEventually I will build code in node to reach IA's API, but at the moment my entire toolchain is in bash with gnu tools. [01:06]
arkivernice find ;)
:)*
will keep you informed on this
[01:06]
***ndiddy has joined #archiveteam-bs [01:08]
joepie91i0npulse: it's very likely that they're using a DOM-aware library for rewriting URLs (for increased reliability), ie. parsing the document, modifying some attributes in the DOM, and then stringifying it again
i0npulse: for these kinds of operations, it's not typical to see errors in the original document reflected in the output; generally a parser will patch things 'on the fly' and only present you with a DOM / AST / whatever, without any annotations that indicate things like spacing, parsing errors, etc.
so there's no way to accurately recreate the original code's structure from that parsed data
[01:09]
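The effect joepie91 describes (parse, then stringify) can be reproduced with Python's standard html.parser. This is only an illustration of the hypothesis, not the Wayback Machine's actual rewriter. The class Reserializer and the function reserialize are hypothetical names, and the sketch drops comments and doctypes for brevity.

```python
from html import escape
from html.parser import HTMLParser

class Reserializer(HTMLParser):
    """Parse HTML and write it back out from the parsed events only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over lower-cased names and unquoted values;
        # the original case and quoting style are already gone here.
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

def reserialize(html_text):
    parser = Reserializer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)

if __name__ == "__main__":
    print(reserialize('<IMG ALIGN=left border=0 hspace=5 SRC="epicsky2.jpg">'))
```

The output has lower-case names and quoted values, and nothing in the parsed events says how the source was originally written, which is why the original markup cannot be recreated from them.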
arkiverwe're discovering a rewriting URLs using custom software [01:10]
joepie91(this is an issue I've run into a few times with code rewriting projects)
arkiver: can you rephrase that? syntax error in line 1 of that sentence :P
[01:10]
i0npulselol [01:11]
arkiveruh [01:11]
i0npulsei think a = and [01:11]
arkiveryeah [01:11]
joepie91ah, right. but if you're using a library for parsing HTML, you still end up with the same problem :P
lots of parsers flat-out don't support annotating things like this
[01:11]
jrwrWow, the MAP tag [01:13]
alembicbeautifulsoup is usually what python folk use to parse & prettify html... might be responsible for some of the changes you're seeing [01:15]
***Sk1d has quit IRC (Ping timeout: 250 seconds) [01:15]
i0npulseI had rebuilt the old Unreal Technology website, and the original admins had case dyslexia, so files were linked as UnrealVersion.htm, unrealversion.htm, and unrealVersion.htm all throughout the code. Which sent the original HTTrack grab by Hyper.nl on a wild ride appending -1, -2, -3 onto file names.
Then I was looking at the "corrected syntax" output from IA, and was like "ok WHICH is the real file name!?"
I already packed the site... so now the id_ tip from arkiver... wondering if I go back and fix anything or not lol
[01:16]
arkiverI had a look at wayback
images/unreal/epicsky2.jpg is currently rewritten to host relative
[01:22]
***Sk1d has joined #archiveteam-bs [01:23]
arkiversince it is seen as an image and thus should have 'im_' appended to the timestamp
/web/19971010052947im_/http://www.epicgames.com/images/unreal/epicsky2.jpg
[01:23]
i0npulseso the image path discrepancy between Hyper.nl and IA is probably a difference in how Hyper.nl's copy was originally retrieved [01:24]
arkiverif we would keep it path relative we could not get the im_ in
do you have an example
post the example, will have a look tomorrow
arkiver is afk
[01:24]
i0npulseI think the path in IA is accurate [01:25]
arkiveryeah it is, but we try to keep the path the way it is
so path relative to path relative
host relative to host relative
[01:25]
i0npulseit's just the syntax massaging that was being thrown off, case, quotes, tags [01:25]
arkiverbut in this case we can't
yes
still looking into that
the old wayback machine would rewrite all URLs to host relative
and that was sometimes a problem with URLs embedded in scripts
they would not be recognized, etc.
[01:25]
i0npulsewhat do you mean by host vs path relative [01:26]
arkiver(fixed in the new wayback machine)
path relative = images/unreal/epicsky2.jpg
so that URL is a location in the current path on the website
host relative = /web/19971010052947im_/http://www.epicgames.com/images/unreal/epicsky2.jpg
so URL is a location on host
[01:26]
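The two kinds of URL arkiver defines can be told apart and converted with urllib.parse. This is a rough sketch under the assumption that the host-relative form is simply /web/ plus timestamp plus modifier plus the resolved absolute URL, as in the example above. The names classify and to_wayback_path are made up for illustration.

```python
from urllib.parse import urljoin, urlsplit

def classify(ref):
    """Label a reference the way the discussion above uses the terms."""
    if ref.startswith("//"):
        return "absolute"
    if ref.startswith("/"):
        return "host relative"
    if urlsplit(ref).scheme:
        return "absolute"
    return "path relative"

def to_wayback_path(page_url, ref, timestamp, modifier="im_"):
    """Resolve ref against the page it appeared on, then prefix the capture path."""
    absolute = urljoin(page_url, ref)
    return "/web/%s%s/%s" % (timestamp, modifier, absolute)

if __name__ == "__main__":
    page = "http://www.epicgames.com/unreal.htm"
    ref = "images/unreal/epicsky2.jpg"
    print(classify(ref))
    print(to_wayback_path(page, ref, "19971010052947"))
```

A path-relative reference has no room for the im_ marker, which is the reason given above for rewriting it to host relative.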
i0npulseOH yea, that is an issue that can be worked around on my end. At least the necessary info is there.
Nothing that sed can't solve for
[01:27]
arkiveryeah
anyway
arkiver is afk
[01:28]
i0npulsecool! laters [01:28]
godanei'm finally capturing a home movie [01:35]
....... (idle for 33mn)
SketchCowwp494: S8+ [02:08]
.... (idle for 16mn)
Lord_Nighbalrog: those magazines we have at the shop... what is the plan for those? [02:24]
.... (idle for 17mn)
yipdwarkiver: ah, ok. I may also give warcio a try soon [02:41]
........ (idle for 38mn)
robogoatarchiveteam has a shop? [03:19]
Lord_Nighno, balrog has space rented as a shop
and has a pile of magazines boxed up
as a machine shop for building stuff
iirc that stuff is unscanned tech stuff from late 70s and 80s, which IA does not have
[03:30]
............ (idle for 58mn)
robogoatAh, [04:32]
***usr has joined #archiveteam-bs [04:36]
usrquiet here [04:49]
voltagexAnyone pushing https://loc.gov/cds/downloads/MDSConnect to archive.org? [04:50]
xmcKaz: wp494: my munin is all broken and stuff, i know [04:51]
usrvoltagex: is that just metadata or does LoC include content too? [04:51]
voltagex25 million bibliographical data sets apparently
Someone on HN will pay for a VPS for me but I have no idea where to start
[04:52]
usrshame it's not the contents as well, that'd be a nice haul [04:52]
voltagex100gb+ of metadata :p [04:53]
usrso question as a new person to the archive-team effort ... lots of work here to get data/data-sets, but not much about plugging the data back into a self-hosted source. if the source/website is open-source as well, would it not make sense to include that as well (not as a requirement to new projects, but as a stretch-goal) ?
ex. i have the reddit comment data dump, but to me, it makes most sense with a local instance of reddit running
[04:55]
voltagexThe data is the irreplaceable bit [04:56]
usragreed
data is pri-0, gotta come first
[04:56]
voltagexMost of the time it's a mad dash to save shit
See the Imzy shut down - less than one month to grab everything
[04:56]
usri suppose my use case is a little bit off the main trail [05:03]
***bmcginty has quit IRC (Ping timeout: 246 seconds) [05:04]
bmcginty has joined #archiveteam-bs [05:10]
ranmadamn, IA doesn't have the audio for (nsfw) https://web.archive.org/web/20120615222011/http://soundcloud.com/htf-s
people recording their neighbors annoying having loud sex
[05:13]
usrrofl [05:13]
ranma*annoyingly [05:14]
........ (idle for 36mn)
***usr has quit IRC (Quit: Leaving) [05:50]
.............. (idle for 1h9mn)
j08nY has joined #archiveteam-bs
SHODAN_UI has joined #archiveteam-bs
ndiddy has quit IRC ()
[06:59]
.................... (idle for 1h37mn)
j08nY has quit IRC (Read error: Operation timed out) [08:42]
JAAjrwr: I don't know. We should though. [08:53]
.... (idle for 16mn)
***bwn has quit IRC (Ping timeout: 268 seconds) [09:09]
bwn has joined #archiveteam-bs [09:21]
...... (idle for 26mn)
bwn has quit IRC (Read error: Operation timed out) [09:47]
.... (idle for 17mn)
SHODAN_UI has quit IRC (Remote host closed the connection) [10:04]
voltagextfw you can't remember how to do https://launchpad.net/~voltagex/+archive/ubuntu/wget-lua [10:13]
.... (idle for 18mn)
***j08nY has joined #archiveteam-bs [10:31]
......... (idle for 41mn)
BlueMaxim has quit IRC (Quit: Leaving) [11:12]
....... (idle for 30mn)
SHODAN_UI has joined #archiveteam-bs [11:42]
......... (idle for 43mn)
RichardG_ has joined #archiveteam-bs
RichardG has quit IRC (Read error: Connection reset by peer)
[12:25]
..... (idle for 23mn)
RichardG_ has quit IRC (Read error: Connection reset by peer)
RichardG has joined #archiveteam-bs
[12:48]
..... (idle for 24mn)
schbirid has joined #archiveteam-bs [13:14]
............................ (idle for 2h19mn)
brayden_ has joined #archiveteam-bs
swebb sets mode: +o brayden_
[15:33]
brayden has quit IRC (Read error: Operation timed out) [15:39]
..... (idle for 20mn)
Odd0002_ has joined #archiveteam-bs
Odd0002_ has quit IRC (Client Quit)
[15:59]
........ (idle for 36mn)
dashcloud has quit IRC (Remote host closed the connection) [16:35]
dashcloud has joined #archiveteam-bs [16:40]
.................. (idle for 1h26mn)
Yoshimura has quit IRC (Quit: WeeChat 0.4.2) [18:06]
........ (idle for 36mn)
SHODAN_UI has quit IRC (Remote host closed the connection) [18:42]
aschmitz has joined #archiveteam-bs [18:51]
..... (idle for 20mn)
schbiridis there something smarter than an awkward --reject-regex like '(.*?/){15}' to avoid wpull trying to get deeply nested broken forum urls with lots of non-existent subdirectories?
ie shit like https://pastebin.com/raw/139U38TG
[19:11]
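For reference, here is what the regex in the question does when tried with Python's re.search, next to one possible alternative that counts path segments instead of slashes. This is a sketch only: how wpull applies --reject-regex is not checked here, and path_depth, too_deep, and the max_depth value are hypothetical.

```python
import re
from urllib.parse import urlsplit

# The pattern from the question: it needs 15 slashes anywhere in the URL,
# which includes the two slashes after the scheme.
REJECT = re.compile(r'(.*?/){15}')

def rejected_by_regex(url):
    return REJECT.search(url) is not None

def path_depth(url):
    """Count non-empty path segments, ignoring scheme and host."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])

def too_deep(url, max_depth=12):
    # max_depth is an arbitrary illustrative threshold.
    return path_depth(url) > max_depth

if __name__ == "__main__":
    shallow = "https://example.com/forum/a/b"
    deep = "https://example.com/" + "/".join(["x"] * 13) + "/"
    for url in (shallow, deep):
        print(rejected_by_regex(url), path_depth(url), too_deep(url))
```

Counting segments keeps the threshold independent of the scheme and of slashes in the query string, which the regex cannot distinguish.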
.... (idle for 17mn)
***BartoCH_ has joined #archiveteam-bs
BartoCH has quit IRC (Ping timeout: 260 seconds)
[19:30]
..... (idle for 20mn)
BartoCH has joined #archiveteam-bs
BartoCH_ has quit IRC (Ping timeout: 260 seconds)
[19:51]
......... (idle for 42mn)
dashcloud has quit IRC (Remote host closed the connection)
SmileyG has joined #archiveteam-bs
[20:33]
Smiley has quit IRC (Read error: Operation timed out)
SHODAN_UI has joined #archiveteam-bs
[20:42]
bwn has joined #archiveteam-bs [20:57]
............ (idle for 58mn)
bwn has quit IRC (Read error: Connection reset by peer) [21:55]
.... (idle for 15mn)
SHODAN_UI has quit IRC (Quit: zzz)
bwn has joined #archiveteam-bs
[22:10]
...... (idle for 27mn)
Aranje has joined #archiveteam-bs [22:37]
TheLovina has joined #archiveteam-bs [22:47]
greenie has quit IRC (Read error: Operation timed out)
Smiley has joined #archiveteam-bs
SmileyG has quit IRC (west.us.hub irc.Prison.NET)
dashcloud has joined #archiveteam-bs
greenie has joined #archiveteam-bs
[22:55]
..... (idle for 21mn)
JensRex has quit IRC (Remote host closed the connection)
JensRex has joined #archiveteam-bs
[23:21]
TheLovina has quit IRC (Read error: Operation timed out) [23:35]